API

This package has the following modules:

user_input_checking

This module contains functions for checking user input.

tracker_schema

This module contains the schema for validating user input.

athr_srch_modularized

This module contains functions to complete the author_search command modularized into pieces.

athr_srch_webio

This module contains functions for author_search to interface with the internet.

athr_srch_emails_and_reports

This module contains functions to create emails and reports for author_search.

ref_srch_modularized

This module contains functions to complete the reference_search command modularized into pieces.

ref_srch_webio

This module contains functions for reference_search to interface with the internet.

ref_srch_emails_and_reports

This module contains functions to create emails and reports for reference_search.

citation_parsing

This module contains functions for parsing references and citations for reference_search.

fileio

This module contains functions for reading and writing files.

helper_functions

This module contains functions that help the other modules function. The functions do things such as fuzzy matching, regex searching, and printing.

webio

This module contains general functions for interfacing with the internet.

User Input Checking

Functions that check the user input for errors.

academic_tracker.user_input_checking.cli_inputs_check(args)[source]

Run input checking on the CLI inputs.

Uses jsonschema to validate the inputs.

Parameters

args (dict) – dict from docopt.

academic_tracker.user_input_checking.config_file_check(config_json, no_ORCID, no_GoogleScholar, no_Crossref)[source]

Check that the configuration JSON file is as expected.

The jsonschema used for validation is in the tracker_schema module.

Parameters
  • config_json (dict) – dict with the same structure as the configuration JSON file.

  • no_ORCID (bool) – if True delete the part of the schema that checks ORCID attributes.

  • no_GoogleScholar (bool) – if True and no_Crossref is True delete the part of the schema that checks Crossref attributes.

  • no_Crossref (bool) – if True and no_GoogleScholar is True delete the part of the schema that checks Crossref attributes.

academic_tracker.user_input_checking.config_report_check(config_json)[source]

Check that the report attributes don’t have conflicts.

Make sure that the values in sort and column_order are in columns, and that every column is in column_order.

Parameters

config_json (dict) – dict with the same structure as the configuration JSON file.
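
The consistency rule described above can be illustrated with a minimal sketch. It assumes a report's attributes are a plain dict with columns (keyed by column name), sort, and column_order (lists of column names). The helper name find_report_conflicts and these data shapes are hypothetical and are not taken from the package; the real check may be implemented differently.

```python
def find_report_conflicts(report_attributes):
    """Return a list of human-readable conflicts (hypothetical helper)."""
    columns = set(report_attributes.get("columns", {}))
    sort = report_attributes.get("sort", [])
    column_order = report_attributes.get("column_order", [])

    conflicts = []
    # Every name in sort and column_order must be a defined column.
    for attribute, names in (("sort", sort), ("column_order", column_order)):
        for name in names:
            if name not in columns:
                conflicts.append(f"{name!r} in {attribute} is not in columns")
    # Assumption: when column_order is given, it must mention every column.
    if column_order:
        for name in sorted(columns - set(column_order)):
            conflicts.append(f"column {name!r} is missing from column_order")
    return conflicts


report = {
    "columns": {"Title": "<title>", "DOI": "<DOI>"},
    "sort": ["Title"],
    "column_order": ["DOI", "Journal"],
}
for conflict in find_report_conflicts(report):
    print(conflict)
```

Collecting every conflict before reporting, rather than stopping at the first, lets a user fix the whole configuration in one pass.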

academic_tracker.user_input_checking.prev_pubs_file_check(prev_pubs)[source]

Run input checking on prev_pubs dict.

The jsonschema used for validation is in the tracker_schema module.

Parameters

prev_pubs (dict) – dict with the same structure as the previous publications JSON file.

academic_tracker.user_input_checking.ref_config_file_check(config_json, no_Crossref)[source]

Check that the configuration JSON file is as expected.

The jsonschema used for validation is in the tracker_schema module.

Parameters
  • config_json (dict) – dict with a truncated structure of the configuration JSON file.

  • no_Crossref (bool) – if True delete the part of the schema that checks Crossref attributes.

academic_tracker.user_input_checking.tok_reference_check(tok_ref)[source]

Run input checking on tok_ref dict.

The jsonschema used for validation is in the tracker_schema module.

Parameters

tok_ref (dict) – dict with the same structure as the tokenized reference JSON file.

academic_tracker.user_input_checking.tracker_validate(instance, schema, pattern_messages={}, cls=None, *args, **kwargs)[source]

Wrapper around jsonschema.validate to give better error messages.

Parameters
  • instance (dict) – JSON as a dict to validate

  • schema (dict) – JSON schema as a dict to validate instance against

  • pattern_messages (dict) – if the instance has a ValidationError of the pattern type then look up the attribute that failed the pattern in this dict and see if there is a custom message

Raises

jsonschema.ValidationError – If an unexpected jsonschema error happens this is raised rather than a system exit.

Author Search Modularized

Modularized pieces of author_search.

academic_tracker.athr_srch_modularized.build_publication_dict(config_dict, prev_pubs, no_ORCID, no_GoogleScholar, no_Crossref)[source]

Query PubMed, ORCID, Google Scholar, and Crossref for publications for the authors.

Parameters
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • prev_pubs (dict) – Matches the publication JSON schema. Used to ignore publications when querying.

  • no_ORCID (bool) – If True, do not search ORCID.

  • no_GoogleScholar (bool) – If True, do not search Google Scholar.

  • no_Crossref (bool) – If True, do not search Crossref.

Returns

The dictionary matching the publication JSON schema. prev_pubs (dict): Same as input, but updated with the new publications found.

Return type

publication_dict (dict)

academic_tracker.athr_srch_modularized.generate_internal_data_and_check_authors(config_dict)[source]

Create authors_by_project_dict and look for authors without projects.

Parameters

config_dict (dict) – Matches the Configuration file JSON schema.

Returns

Keys are project names and values are a dictionary of authors and their attributes. config_dict (dict): same as input but with author information updated based on project information.

Return type

authors_by_project_dict (dict)

academic_tracker.athr_srch_modularized.input_reading_and_checking(config_json_filepath, no_ORCID, no_GoogleScholar, no_Crossref)[source]

Read in inputs from user and do error checking.

Parameters
  • config_json_filepath (str) – filepath to the configuration JSON.

  • no_ORCID (bool) – If True, do not search ORCID. Also reduces checking on the config JSON.

  • no_GoogleScholar (bool) – If True, do not search Google Scholar. Also reduces checking on the config JSON.

  • no_Crossref (bool) – If True, do not search Crossref. Also reduces checking on the config JSON.

Returns

Matches the Configuration file JSON schema.

Return type

config_dict (dict)

academic_tracker.athr_srch_modularized.save_and_send_reports_and_emails(authors_by_project_dict, publication_dict, config_dict, test)[source]

Build the summary report and project reports and email them.

Parameters
  • authors_by_project_dict (dict) – Keys are project names and values are a dictionary of authors and their attributes.

  • publication_dict (dict) – The dictionary matching the publication JSON schema.

  • config_dict (dict) – Matches the Configuration file JSON schema.

  • test (bool) – If True save_dir_name is tracker-test instead of tracker- and emails are not sent.

Returns

Name of the directory where the emails and reports were saved.

Return type

save_dir_name (str)

Author Search Webio

Internet interfacing for author_search.

academic_tracker.athr_srch_webio.search_Crossref_for_pubs(prev_pubs, authors_json, mailto_email)[source]

Searches Crossref for publications by each author.

For each author in authors_json, Crossref is queried for publications. The list of publications is then filtered by prev_pubs, affiliations, and cutoff_year: a publication is skipped if it is already in prev_pubs, if the author doesn’t have at least one matching affiliation, or if it was published before the cutoff_year. Each remaining publication is then checked for citations of any of the author’s grants.

Parameters
  • prev_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • mailto_email (str) – used in the query to Crossref

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)
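
The filtering described for the author search functions can be sketched as follows. This is only an illustration under assumptions: the helper name filter_publications, the "affiliation" and "year" keys, and the case-insensitive substring match on affiliations are all hypothetical and are not taken from the package.

```python
def filter_publications(found_pubs, prev_pubs, author_affiliations, cutoff_year):
    """Apply the three skip rules in order (hypothetical helper and data shapes)."""
    kept = {}
    for pub_id, pub in found_pubs.items():
        # Rule 1: skip publications that were already recorded in prev_pubs.
        if pub_id in prev_pubs:
            continue
        # Rule 2: skip if none of the author's affiliations appear on the publication.
        pub_affiliation = pub.get("affiliation", "").lower()
        if not any(aff.lower() in pub_affiliation for aff in author_affiliations):
            continue
        # Rule 3: skip publications from before the cutoff year.
        if pub.get("year", 0) < cutoff_year:
            continue
        kept[pub_id] = pub
    return kept


found = {
    "doi:1": {"affiliation": "Example University", "year": 2021},
    "doi:2": {"affiliation": "Some Other Institute", "year": 2022},
    "doi:3": {"affiliation": "Example University", "year": 2015},
    "doi:4": {"affiliation": "Example University", "year": 2023},
}
kept = filter_publications(found, {"doi:4": {}}, ["example university"], 2019)
print(list(kept))
```

Here doi:2 fails the affiliation rule, doi:3 fails the cutoff year, and doi:4 is already in prev_pubs, so only doi:1 is kept.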

academic_tracker.athr_srch_webio.search_Google_Scholar_for_pubs(prev_pubs, authors_json, mailto_email)[source]

Searches Google Scholar for publications by each author.

For each author in authors_json, Google Scholar is queried for publications. The list of publications is then filtered by prev_pubs, affiliations, and cutoff_year: a publication is skipped if it is already in prev_pubs, if the author doesn’t have at least one matching affiliation, or if it was published before the cutoff_year. Each remaining publication is then checked for citations of any of the author’s grants.

Parameters
  • prev_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • mailto_email (str) – used in the query to Crossref when trying to find DOIs for the articles

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)

academic_tracker.athr_srch_webio.search_ORCID_for_pubs(prev_pubs, ORCID_key, ORCID_secret, authors_json)[source]

Searches ORCID for publications by each author.

For each author in authors_json, ORCID is queried for publications. The list of publications is then filtered by prev_pubs, affiliations, and cutoff_year: a publication is skipped if it is already in prev_pubs, if the author doesn’t have at least one matching affiliation, or if it was published before the cutoff_year. Each remaining publication is then checked for citations of any of the author’s grants.

Parameters
  • prev_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • ORCID_key (str) – string of the app key ORCID gives when you register the app with them

  • ORCID_secret (str) – string of the secret ORCID gives when you register the app with them

  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)

academic_tracker.athr_srch_webio.search_PubMed_for_pubs(prev_pubs, authors_json, from_email)[source]

Searches PubMed for publications by each author.

For each author in authors_json, PubMed is queried for publications. The list of publications is then filtered by prev_pubs, affiliations, and cutoff_year: a publication is skipped if it is already in prev_pubs, if the author doesn’t have at least one matching affiliation, or if it was published before the cutoff_year. Each remaining publication is then checked for citations of any of the author’s grants.

Parameters
  • prev_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • authors_json (dict) – keys are authors and values are author attributes. Matches Authors section of configuration JSON schema.

  • from_email (str) – used in the query to PubMed

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)

Author Search Emails and Reports

Functions to create emails and reports for author_search.

academic_tracker.athr_srch_emails_and_reports.build_author_loop(publication_dict, config_dict, authors_by_project_dict, project_name, template_string)[source]

Replace tags in template_string with the appropriate information.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • project_name (str) – The name of the project.

  • template_string (str) – Template used to create the project report.

Returns

The string built by looping over the authors in authors_by_project_dict and using the template_string to build a report.

Return type

project_authors (str)

academic_tracker.athr_srch_emails_and_reports.create_collaborator_report(publication_dict, template, author, pubs, filename, save_dir_name)[source]

Create a collaborator report from a formatted string.

Loop over all of the author’s publications and create a

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • author (str) – The key to the author in config_dict[“Authors”].

  • pubs (dict) – Keys are publications for the author and values are the grants associated with that pub.

  • filename (str) – filename to save the publication under.

  • save_dir_name (str) – directory to save the report in.

Returns

The text of the report or an empty string.

Return type

report (str)

academic_tracker.athr_srch_emails_and_reports.create_collaborators_reports_and_emails(publication_dict, config_dict, save_dir_name)[source]

Create a report of collaborators for authors in publication_dict.

For each author in publication_dict with an author_id, create a csv file with the other authors on their publications.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • save_dir_name (str) – directory to save the reports in.

Returns

keys and values match the email JSON file.

Return type

email_messages (dict)

academic_tracker.athr_srch_emails_and_reports.create_project_report(publication_dict, config_dict, authors_by_project_dict, project_name, template_string='<author_loop><author_first> <author_last>:<pub_loop>\n\tTitle: <title> \n\tAuthors: <authors> \n\tJournal: <journal> \n\tDOI: <DOI> \n\tPMID: <PMID> \n\tPMCID: <PMCID> \n\tGrants: <grants>\n</pub_loop>\n</author_loop>', author_first='', author_last='')[source]

Create the project report for the project.

The details of creating project reports are outlined in the documentation. Use the information in the config_dict, publication_dict, and authors_by_project_dict to fill in the information in the template_string. If author_first is given then it is assumed the report is actually for a single author and not a whole project.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • project_name (str) – The name of the project.

  • template_string (str) – Template used to create the project report.

  • author_first (str) – First name of the author. If not “” the report is assumed to be for 1 author.

  • author_last (str) – Last name of the author.

Returns

The template_string with the appropriate tags replaced with relevant information.

Return type

template_string (str)
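
The tag replacement described above can be illustrated with a minimal sketch covering only the <pub_loop> tag and simple per-publication tags such as <title> and <DOI> from the default template. The helper name fill_pub_loop and the flat publication dicts are hypothetical; the package's own handling of <author_loop>, <project_loop>, and the full tag set is not reproduced here.

```python
import re


def fill_pub_loop(template_string, publications):
    """Expand <pub_loop>...</pub_loop> once per publication (hypothetical helper)."""

    def expand(match):
        body = match.group(1)
        pieces = []
        for pub in publications:
            text = body
            # Replace each simple tag such as <title> with the publication's value.
            for tag, value in pub.items():
                text = text.replace(f"<{tag}>", str(value))
            pieces.append(text)
        return "".join(pieces)

    # DOTALL lets the loop body span multiple lines.
    return re.sub(r"<pub_loop>(.*?)</pub_loop>", expand, template_string, flags=re.DOTALL)


template = "Publications:\n<pub_loop>\tTitle: <title>\n\tDOI: <DOI>\n</pub_loop>"
pubs = [
    {"title": "First Paper", "DOI": "10.1000/example.1"},
    {"title": "Second Paper", "DOI": "10.1000/example.2"},
]
filled = fill_pub_loop(template, pubs)
print(filled)
```

Passing a function to re.sub keeps the replacement text literal, so backslashes in titles or DOIs are not interpreted as regex escapes.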

academic_tracker.athr_srch_emails_and_reports.create_project_reports_and_emails(authors_by_project_dict, publication_dict, config_dict, save_dir_name)[source]

Create project reports and emails for each project.

For each project in config_dict create a report and optional email. Reports are saved in save_dir_name as they are created.

Parameters
  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • save_dir_name (str) – directory to save the reports in.

Returns

keys and values match the email JSON file.

Return type

email_messages (dict)

academic_tracker.athr_srch_emails_and_reports.create_pubs_by_author_dict(publication_dict)[source]

Create a dictionary with authors as the keys and their pub_ids and grants as the values.

Organizes the publication information in an author focused way so other operations are easier.

Parameters

publication_dict (dict) – keys and values match the publications JSON file.

Returns

dictionary where the keys are authors and the values are a dictionary of pub_ids with their associated grants.

Return type

pubs_by_author_dict (dict)
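
The author-focused regrouping can be sketched as below. The shape assumed for publication_dict (an "authors" list whose entries may carry an "author_id", plus a "grants" list) and the helper name group_pubs_by_author are assumptions made for illustration, not the package's confirmed schema.

```python
def group_pubs_by_author(publication_dict):
    """Invert publications into {author_id: {pub_id: grants}} (hypothetical helper)."""
    pubs_by_author = {}
    for pub_id, pub in publication_dict.items():
        for author in pub.get("authors", []):
            author_id = author.get("author_id")
            # Assumption: authors without an author_id are not tracked, so skip them.
            if not author_id:
                continue
            pubs_by_author.setdefault(author_id, {})[pub_id] = pub.get("grants", [])
    return pubs_by_author


publication_dict = {
    "pub1": {"authors": [{"author_id": "Jane Doe"}, {"lastname": "Roe"}], "grants": ["GRANT-1"]},
    "pub2": {"authors": [{"author_id": "Jane Doe"}, {"author_id": "John Smith"}], "grants": []},
}
print(group_pubs_by_author(publication_dict))
```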

academic_tracker.athr_srch_emails_and_reports.create_summary_report(publication_dict, config_dict, authors_by_project_dict, template_string='<project_loop><project_name>\n<author_loop>\t<author_first> <author_last>:<pub_loop>\n\t\tTitle: <title> \n\t\tAuthors: <authors> \n\t\tJournal: <journal> \n\t\tDOI: <DOI> \n\t\tPMID: <PMID> \n\t\tPMCID: <PMCID> \n\t\tGrants: <grants>\n</pub_loop>\n</author_loop></project_loop>')[source]

Create the summary report for the run.

The details of creating summary reports are outlined in the documentation. Use the information in the config_dict, publication_dict, and authors_by_project_dict to fill in the information in the template_string.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • template_string (str) – Template used to create the summary report.

Returns

The report built by replacing the appropriate tags in template_string with relevant information.

Return type

report_string (str)

academic_tracker.athr_srch_emails_and_reports.create_tabular_collaborator_report(publication_dict, config_dict, author, pubs, filename, file_format, save_dir_name)[source]

Create a table for a collaborator report and save as either csv or xlsx.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • author (str) – The key to the author in config_dict[“Authors”].

  • pubs (dict) – Keys are publications for the author and values are the grants associated with that pub.

  • filename (str) – filename to save the publication under.

  • file_format (str) – csv or xlsx, determines what format to save in.

  • save_dir_name (str) – directory to save the report in.

Returns

The text of the report, empty string, or path to the saved xlsx file. filename (str): Filename of the report. May have had an .xlsx added to the end.

Return type

report (str)

academic_tracker.athr_srch_emails_and_reports.create_tabular_project_report(publication_dict, config_dict, authors_by_project_dict, pubs_by_author_dict, project_name, report_attributes, save_dir_name, filename)[source]

Create a pandas DataFrame and save it as Excel or CSV.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • pubs_by_author_dict (dict) – dictionary where the keys are authors and the values are a dictionary of pub_ids with their associated grants.

  • project_name (str) – Name of the project.

  • report_attributes (dict) – Dictionary of the report attributes. Could come from project_descriptions or an author.

  • save_dir_name (str) – directory to save the report in.

  • filename (str) – Filename of the report.

Returns

Either the text of the report if csv or a relative filepath to where the Excel file is saved. filename (str): Filename of the report. May have had an .xlsx added to the end.

Return type

report (str)

academic_tracker.athr_srch_emails_and_reports.create_tabular_summary_report(publication_dict, config_dict, authors_by_project_dict, save_dir_name)[source]

Create a pandas DataFrame and save it as Excel or CSV.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • save_dir_name (str) – directory to save the report in.

Returns

Either the text of the report if csv or a relative filepath to where the Excel file is saved. filename (str): Filename of the report. May have had an .xlsx added to the end.

Return type

report (str)

academic_tracker.athr_srch_emails_and_reports.replace_keywords(template, publication_dict, config_dict, project_name='', author='', pub='', pub_author={})[source]

Replace keywords in the values of the template dictionary.

Parameters
  • template (dict) – keys are column names and values are what the elements of the column should be.

  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • project_name (str) – the name of the project to replace.

  • author (str) – the key to the author in config_dict[“Authors”].

  • pub (str) – the key to the pub in publication_dict.

  • pub_author (dict) – The author in pub.

Returns

template with the keywords replaced in its values.

Return type

template_copy (dict)
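
As a rough illustration of keyword replacement in a column template, the sketch below substitutes keywords into each value of a template dict and returns a copy. The helper name replace_keywords_sketch and the flat values dict are hypothetical simplifications; the real function draws its values from publication_dict and config_dict, and its keyword set is not reproduced here.

```python
def replace_keywords_sketch(template, values):
    """Return a copy of template with <keyword> markers filled in (hypothetical helper)."""
    template_copy = {}
    for column, cell in template.items():
        for keyword, value in values.items():
            cell = cell.replace(f"<{keyword}>", str(value))
        # The input template is left untouched; only the copy is filled in.
        template_copy[column] = cell
    return template_copy


template = {"Title": "<title>", "Link": "https://doi.org/<DOI>", "Project": "<project_name>"}
values = {"title": "First Paper", "DOI": "10.1000/example.1", "project_name": "Project A"}
row = replace_keywords_sketch(template, values)
print(row)
```

Returning a copy means the same template can be reused to build one row per publication.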

Reference Search Modularized

Modularized pieces of reference_search.

academic_tracker.ref_srch_modularized.build_publication_dict(config_dict, tokenized_citations, no_Crossref)[source]

Query PubMed and Crossref for publications matching the citations in tokenized_citations.

Parameters
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.

  • no_Crossref (bool) – If True, do not search Crossref. Also reduces checking on the config JSON.

Returns

The dictionary matching the publication JSON schema. tokenized_citations (list): Same list as the input but with the pub_dict_key updated to match the publication found.

Return type

publication_dict (dict)

academic_tracker.ref_srch_modularized.input_reading_and_checking(config_json_filepath, ref_path_or_URL, MEDLINE_reference, no_Crossref, prev_pub_filepath)[source]

Read in inputs from user and do error checking.

Parameters
  • config_json_filepath (str) – filepath to the configuration JSON.

  • ref_path_or_URL (str) – either a filepath to file to tokenize or a URL to tokenize.

  • MEDLINE_reference (bool) – If True ref_path_or_URL is a filepath to a MEDLINE formatted file.

  • no_Crossref (bool) – If True, do not search Crossref. Also reduces checking on the config JSON.

  • prev_pub_filepath (str or None) – filepath to the publication JSON to read in.

Returns

Matches the Configuration file JSON schema. tokenized_citations (list): list of dicts. Matches the tokenized citations JSON schema. has_previous_pubs (bool): True if a prev_pub file was input, False otherwise. prev_pubs (dict): The contents of the prev_pub file input by the user if provided.

Return type

config_dict (dict)

academic_tracker.ref_srch_modularized.save_and_send_reports_and_emails(config_dict, tokenized_citations, publication_dict, prev_pubs, has_previous_pubs, test)[source]

Build the summary report and email it.

Parameters
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.

  • publication_dict (dict) – The dictionary matching the publication JSON schema.

  • prev_pubs (dict) – The contents of the prev_pub file input by the user if provided.

  • has_previous_pubs (bool) – True if a prev_pub file was input, False otherwise.

  • test (bool) – If True save_dir_name is tracker-test instead of tracker- and emails are not sent.

Returns

Name of the directory where the emails and report were saved.

Return type

save_dir_name (str)

Reference Search Webio

Internet interfacing for reference_search.

academic_tracker.ref_srch_webio.build_pub_dict_from_PMID(PMID_list, from_email)[source]

Query PubMed for each PMID and build a dictionary of the returned data.

Parameters
  • PMID_list (list) – A list of PMIDs as strings.

  • from_email (str) – An email address to use when querying PubMed.

Returns

keys are publication ids and values are a dictionary with publication attributes.

Return type

publication_dict (dict)

academic_tracker.ref_srch_webio.parse_myncbi_citations(url)[source]

Tokenize the citations on a MyNCBI URL.

Note that authors and title can be missing or empty from the webpage. This function assumes the url is the first page of the MyNCBI citations. The first page is tokenized and then each subsequent page is visited and tokenized.

Parameters

url (str) – the url of the MyNCBI page.

Returns

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type

parsed_pubs (dict)

academic_tracker.ref_srch_webio.search_references_on_Crossref(tokenized_citations, mailto_email)[source]

Searches Crossref for publications matching the citations.

Parameters
  • tokenized_citations (list) – list of citations parsed from a source. Each citation is a dict {“authors”, “title”, “DOI”, “PMID”, “reference_line”, “pub_dict_key”}.

  • mailto_email (str) – used in the query to Crossref

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)

academic_tracker.ref_srch_webio.search_references_on_Google_Scholar(tokenized_citations, mailto_email)[source]

Searches Google Scholar for publications that match the citations.

Parameters
  • tokenized_citations (list) – list of citations parsed from a source. Each citation is a dict {“authors”, “title”, “DOI”, “PMID”, “reference_line”, “pub_dict_key”}.

  • mailto_email (str) – used in the query to Crossref when trying to find DOIs for the articles

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)

academic_tracker.ref_srch_webio.search_references_on_PubMed(tokenized_citations, from_email)[source]

Searches PubMed for publications matching the citations.

For each citation in tokenized_citations PubMed is queried for the publication.

Parameters
  • tokenized_citations (list) – list of citations parsed from a source. Each citation is a dict {“authors”, “title”, “DOI”, “PMID”, “reference_line”, “pub_dict_key”}.

  • from_email (str) – used in the query to PubMed

Returns

keys are publication ids and values are a dictionary with publication attributes

Return type

publication_dict (dict)

academic_tracker.ref_srch_webio.tokenize_reference_input(reference_input, MEDLINE_reference)[source]

Tokenize the citations in reference_input.

reference_input can be a URL or filepath. MyNCBI URLs are handled specially, but all other URLs are read and parsed line by line as if they were a text document. If the format of the reference is MEDLINE then set MEDLINE_reference to True and it will be parsed as such instead of line by line. Otherwise, citations are expected to be 1 per line.

Parameters
  • reference_input (str) – URL or filepath

  • MEDLINE_reference (bool) – True if reference_input is in MEDLINE format

Returns

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type

tokenized_citations (dict)

Reference Search Emails and Reports

Functions to create emails and reports for reference_search.

academic_tracker.ref_srch_emails_and_reports.convert_tokenized_authors_to_str(authors)[source]

Combine authors into a comma separated string.

Each author is written as first_name last_name when a first name is available, otherwise as last_name initials. Example: first_name1 last_name1, last_name2 initials2

Parameters

authors (list) – a list of dictionaries [{“last”:last_name, “initials”:initials}, {“last”:last_name, “first”:first_name}]

Returns

comma separated list of authors.

Return type

authors_string (str)
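
The naming rule above can be sketched as follows, using the "first", "last", and "initials" keys shown in the parameter description. The helper name authors_to_string is hypothetical, and the handling of missing keys is an assumption.

```python
def authors_to_string(authors):
    """Join tokenized authors into a comma separated string (hypothetical helper)."""
    names = []
    for author in authors:
        last = author.get("last", "")
        if author.get("first"):
            # Prefer "first_name last_name" when a first name is present.
            names.append(f"{author['first']} {last}")
        else:
            # Otherwise fall back to "last_name initials".
            names.append(f"{last} {author.get('initials', '')}".strip())
    return ", ".join(names)


authors = [{"last": "Doe", "first": "Jane"}, {"last": "Smith", "initials": "JA"}]
print(authors_to_string(authors))  # Jane Doe, Smith JA
```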

academic_tracker.ref_srch_emails_and_reports.create_report_from_template(publication_dict, is_citation_in_prev_pubs_list, tokenized_citations, template_string='<pub_loop>Reference Line:\n\t<ref_line>\nTokenized Reference:\n\tAuthors: <tok_authors>\n\tTitle: <tok_title>\n\tPMID: <tok_PMID>\n\tDOI: <tok_DOI>\nQueried Information:\n\tDOI: <DOI>\n\tPMID: <PMID>\n\tPMCID: <PMCID>\n\tGrants: <grants>\n\n</pub_loop>')[source]

Create project report based on template_string.

Loop over each publication in publication_dict and build a report based on the tags in the template_string. Details about reports are in the documentation.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • is_citation_in_prev_pubs_list (list) – list of bools that indicate whether or not the citation at the same index in tokenized_citations is in the prev_pubs

  • tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.

  • template_string (str) – string with tags indicating what information to put in the report.

Returns

text of the created report.

Return type

report (str)

academic_tracker.ref_srch_emails_and_reports.create_tabular_report(publication_dict, config_dict, is_citation_in_prev_pubs_list, tokenized_citations, save_dir_name)[source]

Create a pandas DataFrame and save it as Excel or CSV.

Parameters
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • is_citation_in_prev_pubs_list (list) – list of bools that indicate whether or not the citation at the same index in tokenized_citations is in the prev_pubs

  • tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.

  • save_dir_name (str) – directory to save the report in.

Returns

Either the text of the report if csv or a relative filepath to where the Excel file is saved. filename (str): Filename of the report. May have had an .xlsx added to the end.

Return type

report (str)

academic_tracker.ref_srch_emails_and_reports.create_tokenization_report(tokenized_citations)[source]

Create a report that details all the information about how a reference was tokenized.

Intended as a troubleshooting report.

Parameters

tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.

Returns

report text built from tokenized_citations.

Return type

report_string (str)

academic_tracker.ref_srch_emails_and_reports.replace_keywords(template, publication_dict, pub, tokenized_citation, is_citation_in_prev_pubs, pub_author={})[source]

Replace keywords in the values of the template dictionary.

Parameters
  • template (dict) – keys are column names and values are what the elements of the column should be.

  • publication_dict (dict) – keys and values match the publications JSON file.

  • pub (str) – the key to the pub in publication_dict.

  • tokenized_citation (dict) – The tokenized citation from the reference for the publication.

  • is_citation_in_prev_pubs (bool or None) – Whether this publication is in the previous publications or not. If None then it isn’t applicable.

  • pub_author (dict) – The author in pub.

Returns

template with the keywords replaced in its values.

Return type

template_copy (dict)

Citation Parsing

Functions for parsing citations.

academic_tracker.citation_parsing.parse_MEDLINE_format(text_string)[source]

Tokenize text_string based on it being of the MEDLINE format.

Parameters

text_string (str) – The string to tokenize.

Returns

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type

parsed_pubs (dict)

academic_tracker.citation_parsing.parse_text_for_citations(text)[source]

Parse text line by line and tokenize it.

The function is aware of MLA, APA, Chicago, Harvard, and Vancouver style citations. Although each of these styles has a standard format, in practice the standards are not strictly followed, so the function uses a more heuristic approach.

Parameters

text (str) – The text to parse.

Returns

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type

parsed_pubs (dict)

academic_tracker.citation_parsing.tokenize_APA_or_Harvard_authors(authors_string)[source]

Tokenize authors based on APA or Harvard citation style.

Parameters

authors_string (str) – string with the authors to tokenize.

Returns

list of dictionaries with the authors’ last names and initials. [{“last”:lastname, “initials”:initials}, …]

Return type

(list)

academic_tracker.citation_parsing.tokenize_MLA_or_Chicago_authors(authors_string)[source]

Tokenize authors based on MLA or Chicago citation style.

Parameters

authors_string (str) – string with the authors to tokenize.

Returns

list of dictionaries with the authors’ first, middle, and last names. [{“first”:firstname, “middle”:middlename, “last”:lastname}, …]

Return type

(list)

academic_tracker.citation_parsing.tokenize_Vancouver_authors(authors_string)[source]

Tokenize authors based on Vancouver citation style.

Parameters

authors_string (str) – string with the authors to tokenize.

Returns

list of dictionaries with the authors’ last names and initials. [{“last”:lastname, “initials”:initials}, …]

Return type

(list)
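
As a rough illustration of the output shape, here is a minimal sketch of Vancouver-style author tokenization. It assumes authors are comma-separated and each is written as "Lastname Initials"; the helper name sketch_tokenize_vancouver is hypothetical and this is not the package's implementation, which likely handles more edge cases.

```python
def sketch_tokenize_vancouver(authors_string):
    """Hypothetical helper: split 'Last AB, Last C' into last names and initials."""
    authors = []
    for chunk in authors_string.rstrip(".").split(","):
        parts = chunk.split()
        if len(parts) < 2:
            # Not enough pieces to separate a last name from initials; skip it.
            continue
        # Assumption: the final whitespace-separated token holds the initials.
        authors.append({"last": " ".join(parts[:-1]), "initials": parts[-1]})
    return authors


tokens = sketch_tokenize_vancouver("Smith JA, van der Berg T, Doe B.")
```

Multi-word last names stay intact because only the final token of each author is treated as the initials.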

academic_tracker.citation_parsing.tokenize_myncbi_citations(html)[source]

Tokenize the citations on a MyNCBI HTML page.

Note that authors and title can be missing or empty from the webpage.

Parameters

html (str) – the html of the MyNCBI page.

Returns

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type

parsed_pubs (dict)

Fileio

This module contains the functions that read and write files.

academic_tracker.fileio.load_json(filepath)[source]

Adds error checking around loading a json file.

Parameters

filepath (str) – filepath to the json file

Returns

json read from file in a dictionary

Return type

internal_data (dict)

Raises

Exception – Raised if there is a problem opening the file.
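
A minimal sketch of what "error checking around loading a json file" can look like. The helper name sketch_load_json and the wording of the error message are assumptions; the real function may report failures differently.

```python
import json
import os
import tempfile


def sketch_load_json(filepath):
    """Hypothetical helper: load JSON and re-raise failures with the filepath attached."""
    try:
        with open(filepath, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except Exception as error:
        # Assumption: the real function may print a message or exit instead.
        raise Exception(f"Could not load JSON from {filepath}") from error


# Round trip through a temporary file, then try a path that does not exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "publications.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"pub_1": {"title": "Example"}}, handle)
    loaded = sketch_load_json(path)
    try:
        sketch_load_json(os.path.join(tmp_dir, "missing.json"))
        missing_raised = False
    except Exception:
        missing_raised = True
```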

academic_tracker.fileio.read_csv(doc_path)[source]

Read csv into a pandas dataframe.

Parameters

doc_path (str) – path to the csv file to read in.

Returns

Pandas dataframe of the csv contents.

Return type

df (DataFrame)

Raises

Exception – Raised if there is a problem opening the file.

academic_tracker.fileio.read_previous_publications(filepath)[source]

Read in the previous publication json file.

If the prev_pub option was given by the user then that filepath is used to read in the file and it is checked to make sure the json is a list and each value is a string. If the prev_pub option was not given then look for a “tracker-timestamp” directory in the current working directory and if it has a publications.json file then read in that file. If no previous publications are found then an empty dict is returned for prev_pubs.

Parameters

filepath (str or None) – path to the publications JSON to read in.

Returns

True means that a previous publications file was found. prev_pubs (dict): dict where keys are publication ids and values are a dict of publication attributes.

Return type

has_previous_pubs (bool)

academic_tracker.fileio.read_text_from_docx(doc_path)[source]

Open docx file at doc_path and read contents into a string.

Parameters

doc_path (str) – path to docx file.

Returns

A string of the contents of the docx file. Each line concatenated with a newline character.

Return type

(str)

Raises

Exception – Raised if there is a problem opening the file.

academic_tracker.fileio.read_text_from_txt(doc_path)[source]

Open txt or csv file at doc_path and read contents into a string.

Parameters

doc_path (str) – path to txt or csv file.

Returns

A string of the contents of the txt or csv file. Each line concatenated with a newline character.

Return type

(str)

Raises

Exception – Raised if there is a problem opening the file.

academic_tracker.fileio.save_emails_to_file(email_messages, save_dir_name)[source]

Save email_messages to “emails.json” in save_dir_name in the current working directory.

Parameters
  • email_messages (dict) – keys are author names and values are the message of the email

  • save_dir_name (str) – directory name to append to the current working directory to save the emails.json file in

academic_tracker.fileio.save_json_to_file(save_dir_name, file_name, json_dict)[source]

Saves the json_dict to file_name in save_dir_name in the current working directory.

Parameters
  • save_dir_name (str) – directory name to append to the current working directory to save the json_dict in.

  • file_name (str) – name of the file to save the json_dict to.

  • json_dict (dict or list) – data to save to file.
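
The path handling described above can be sketched as follows. The helper name sketch_save_json and its base_dir parameter are hypothetical (base_dir only exists so the example can write somewhere disposable), and creating the directory when it is missing is an assumption about the behavior.

```python
import json
import os
import tempfile


def sketch_save_json(save_dir_name, file_name, json_dict, base_dir=None):
    """Hypothetical helper; base_dir is not a parameter of the documented function."""
    base_dir = os.getcwd() if base_dir is None else base_dir
    target_dir = os.path.join(base_dir, save_dir_name)
    # Assumption: the directory is created if it does not exist yet.
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, file_name)
    with open(target_path, "w", encoding="utf-8") as handle:
        json.dump(json_dict, handle, indent=2, sort_keys=True)
    return target_path


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = sketch_save_json(
        "tracker-test", "publications.json", {"pub_1": {"title": "Example"}}, base_dir=tmp_dir
    )
    with open(saved_path, encoding="utf-8") as handle:
        round_trip = json.load(handle)
```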

academic_tracker.fileio.save_publications_to_file(save_dir_name, publication_dict, prev_pubs)[source]

Saves the publication_dict to “publications.json” in save_dir_name in the current working directory.

prev_pubs and publication_dict will be combined before saving.

Parameters
  • save_dir_name (str) – directory name to append to the current working directory to save the publications.json file in

  • publication_dict (dict) – dictionary with publication ids as the keys to the dict

  • prev_pubs (list) – List of publication ids that are publications previously found.

academic_tracker.fileio.save_string_to_file(save_dir_name, file_name, text_to_save)[source]

Save a string to file.

Parameters
  • save_dir_name (str) – directory in the current working directory to save the string to.

  • file_name (str) – string to name the file.

  • text_to_save (str) – the string to put in the file contents.

Helper Functions

This module contains helper functions for tasks such as printing and regex searching.

academic_tracker.helper_functions.adjust_author_attributes(authors_by_project_dict, config_dict)[source]

Modifies config_dict with values from authors_by_project_dict.

Go through the authors in authors_by_project_dict and find the lowest cutoff_year. Also find affiliations and grants and create a union of them across projects. Update the authors in config_dict[“Authors”].

Parameters
  • authors_by_project_dict (dict) – keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].

  • config_dict (dict) – schema matches the JSON Project Tracking Configuration file.

Returns

schema matches the JSON Project Tracking Configuration file.

Return type

config_dict (dict)
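
A minimal sketch of the merging rule described above: keep the lowest cutoff_year and take the union of grants and affiliations across projects. The helper name and the "grants" and "affiliations" key names are assumptions for illustration, and the sketch returns the merged records instead of updating config_dict[“Authors”].

```python
def sketch_merge_author_attributes(authors_by_project_dict):
    """Hypothetical helper: merge per-project author attributes into one record per author."""
    merged = {}
    for project_authors in authors_by_project_dict.values():
        for author, attributes in project_authors.items():
            record = merged.setdefault(
                author, {"cutoff_year": None, "grants": set(), "affiliations": set()}
            )
            year = attributes.get("cutoff_year")
            if year is not None and (record["cutoff_year"] is None or year < record["cutoff_year"]):
                record["cutoff_year"] = year  # keep the lowest cutoff_year seen
            record["grants"].update(attributes.get("grants", []))
            record["affiliations"].update(attributes.get("affiliations", []))
    # Sort the unions so the output is deterministic.
    return {
        author: {
            "cutoff_year": record["cutoff_year"],
            "grants": sorted(record["grants"]),
            "affiliations": sorted(record["affiliations"]),
        }
        for author, record in merged.items()
    }


example = {
    "Project 1": {"Jane Doe": {"cutoff_year": 2019, "grants": ["G1"], "affiliations": ["kentucky"]}},
    "Project 2": {"Jane Doe": {"cutoff_year": 2015, "grants": ["G2", "G1"], "affiliations": ["kentucky"]}},
}
merged_authors = sketch_merge_author_attributes(example)
```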

academic_tracker.helper_functions.are_citations_in_pub_dict(tokenized_citations, pub_dict)[source]

Determine which citations in tokenized_citations are in pub_dict.

For each citation in tokenized_citations see if it is in pub_dict. Will be True for a citation if the PMID matches, DOI matches, or the title is similar enough.

Parameters
  • tokenized_citations (list) – list of dictionaries where each dictionary is a citation. Matches the tokenized_reference.json schema.

  • pub_dict (dict) – schema matches the publication.json schema.

Returns

list of bools, True if the citation at that index is in pub_dict, False otherwise.

Return type

(list)

academic_tracker.helper_functions.create_authors_by_project_dict(config_dict)[source]

Create the authors_by_project_dict dict from the config_dict.

Creates a dict where the keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].

Parameters

config_dict (dict) – schema matches the JSON Project Tracking Configuration file.

Returns

keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].

Return type

authors_by_project_dict (dict)

academic_tracker.helper_functions.find_duplicate_citations(tokenized_citations)[source]

Find citations that are duplicates of each other in tokenized_citations.

Citations can be duplicates in 3 ways: same PMID, same DOI, or similar enough titles. The function goes through each citation and looks for matches on these criteria. Then the matches are compared to create unique sets. For instance, if citation 1 matches the PMID in citation 2, and citation 2 matches the DOI in citation 3, but citations 1 and 3 don’t match each other, a duplicate set containing all 3 is still created. The unique duplicate sets are returned as a list of sorted lists.

Parameters

tokenized_citations (list) – list of dictionaries where each dictionary is a citation. Matches the tokenized_reference.json schema.

Returns

list of lists where each element is a list of indexes in tokenized_citations that match each other. The list of indexes is sorted in ascending order.

Return type

unique_duplicate_sets (list)
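
The transitive grouping described above can be sketched with a small union-find. This is an illustration under stated assumptions, not the package's code: the helper name, the "PMID", "DOI", and "title" key names, and the use of difflib for title similarity are all assumptions.

```python
from difflib import SequenceMatcher


def sketch_find_duplicate_sets(tokenized_citations, title_cutoff=0.9):
    """Hypothetical helper: group citation indexes that match directly or transitively."""
    parent = list(range(len(tokenized_citations)))

    def find(index):
        while parent[index] != index:
            parent[index] = parent[parent[index]]  # path halving
            index = parent[index]
        return index

    def is_match(first, second):
        # Assumed keys: "PMID", "DOI", and "title"; empty values never match.
        for key in ("PMID", "DOI"):
            if first.get(key) and first.get(key) == second.get(key):
                return True
        first_title, second_title = first.get("title"), second.get("title")
        if first_title and second_title:
            ratio = SequenceMatcher(None, first_title.lower(), second_title.lower()).ratio()
            return ratio >= title_cutoff
        return False

    for i in range(len(tokenized_citations)):
        for j in range(i + 1, len(tokenized_citations)):
            if is_match(tokenized_citations[i], tokenized_citations[j]):
                parent[find(j)] = find(i)

    groups = {}
    for index in range(len(tokenized_citations)):
        groups.setdefault(find(index), []).append(index)
    return sorted(group for group in groups.values() if len(group) > 1)


citations = [
    {"PMID": "111", "DOI": "", "title": "Alpha study of widgets"},
    {"PMID": "111", "DOI": "10.1/abc", "title": ""},
    {"PMID": "", "DOI": "10.1/abc", "title": "Completely different wording"},
    {"PMID": "999", "DOI": "10.9/zzz", "title": "Unrelated paper"},
]
duplicate_sets = sketch_find_duplicate_sets(citations)
```

Citations 0 and 2 share no identifier and their titles differ, yet they end up in the same set because each matches citation 1.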

academic_tracker.helper_functions.fuzzy_matches_to_list(str_to_match, list_to_match)[source]

Return strings and indexes for strings with match ratio that is 90 or higher.

Parameters
  • str_to_match (str) – string to compare with list

  • list_to_match (list) – list of strings to compare with str_to_match

Returns

list of matches (tuples), each holding the index in list_to_match and the matched string, e.g. [(9, “title 1”), …]

Return type

(list)
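
A minimal sketch of threshold-based fuzzy matching. difflib from the standard library stands in for whatever fuzzy-matching library the package uses, so exact scores may differ; the helper name and the lowercasing step are assumptions, and the (index, string) tuple order follows the example in the Returns entry.

```python
from difflib import SequenceMatcher


def sketch_fuzzy_matches(str_to_match, list_to_match, cutoff=90):
    """Hypothetical helper: return (index, string) pairs scoring at or above cutoff."""
    matches = []
    for index, candidate in enumerate(list_to_match):
        # SequenceMatcher.ratio() is in [0, 1]; scale it to a 0-100 score.
        score = 100 * SequenceMatcher(None, str_to_match.lower(), candidate.lower()).ratio()
        if score >= cutoff:
            matches.append((index, candidate))
    return matches


titles = ["A study of protein folding", "Unrelated title", "A study of protein folding."]
found = sketch_fuzzy_matches("A study of protein folding", titles)
# A boolean check in the spirit of is_fuzzy_match_to_list.
any_match = bool(found)
```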

academic_tracker.helper_functions.is_fuzzy_match_to_list(str_to_match, list_to_match)[source]

True if string is a 90 or higher ratio match to any string in list, False otherwise.

Parameters
  • str_to_match (str) – string to compare with list

  • list_to_match (list) – list of strings to compare with str_to_match

Returns

True if str_to_match is a match to any string in list_to_match, False otherwise.

Return type

(bool)

academic_tracker.helper_functions.is_pub_in_publication_dict(pub_id, title, publication_dict, titles=[])[source]

True if pub_id is in publication_dict or title is a fuzzy match to a title in titles.

Check whether the pub_id is in publication_dict. If it isn’t then see if there is a fuzzy match in titles. If titles is not provided then get a list of titles from publication_dict.

Parameters
  • pub_id (str) – pub_id to check against in publication_dict to see if it already exists.

  • title (str) – title corresponding to pub_id to check against titles in publication_dict.

  • publication_dict (dict) – keys are pub_ids and values are pub attributes.

  • titles (list) – list of strings that should be titles to fuzzy match to title.

Returns

True if the pub_id is in publication_dict or title is fuzzy matched in titles, False otherwise

Return type

(bool)
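
The two-step check described above (ID lookup first, fuzzy title comparison as a fallback) might look like the sketch below. The helper name, the "title" key, the cutoff of 90, and the difflib scoring are assumptions for illustration.

```python
from difflib import SequenceMatcher


def sketch_is_pub_known(pub_id, title, publication_dict, titles=None):
    """Hypothetical helper: ID lookup first, then a fuzzy title comparison."""
    if pub_id in publication_dict:
        return True
    if titles is None:
        # Assumption: each publication stores its title under a "title" key.
        titles = [pub.get("title", "") for pub in publication_dict.values()]
    return any(
        100 * SequenceMatcher(None, title.lower(), known.lower()).ratio() >= 90
        for known in titles
        if known
    )


publications = {"10.1/abc": {"title": "Metabolic flux analysis in yeast"}}
by_id = sketch_is_pub_known("10.1/abc", "Anything", publications)
by_title = sketch_is_pub_known("PMID:1", "Metabolic flux analysis in yeast.", publications)
unknown = sketch_is_pub_known("PMID:2", "Deep learning for images", publications)
```

The sketch defaults titles to None and builds the list on demand, which avoids sharing one mutable default list between calls.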

academic_tracker.helper_functions.match_authors_in_pub_Crossref(authors_json, author_list)[source]

Look for matching authors in Crossref pub data.

Goes through the author list from Crossref and tries matching to an author in authors_json using firstname, lastname, and affiliations, or ORCID if ORCID is present.

Parameters
  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • author_list (list) – list of dicts where each dict is attributes of an author.

Returns

either the author list with matched authors containing an additional author_id attribute, or an empty list if no authors were matched.

Return type

author_list (list)

academic_tracker.helper_functions.match_authors_in_pub_PubMed(authors_json, author_list)[source]

Look for matching authors in PubMed pub data.

Goes through the author list from PubMed and tries matching to an author in authors_json using firstname, lastname, and affiliations.

Parameters
  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • author_list (list) – list of dicts where each dict is attributes of an author.

Returns

either the author list with matched authors containing an additional author_id attribute, or an empty list if no authors were matched.

Return type

author_list (list)

academic_tracker.helper_functions.modify_pub_dict_for_saving(pub)[source]

Convert pymed.PubMedArticle to a dictionary and modify it for saving.

Converts a pymed.PubMedArticle to a dictionary, deletes the “xml” key, and converts the “publication_date” key to a string.

Parameters

pub (pymed.PubMedArticle) – publication to convert to a dictionary.

Returns

pub converted to a dictionary. Keys are “pubmed_id”, “title”, “abstract”, “keywords”, “journal”, “publication_date”, “authors”, “methods”, “conclusions”, “results”, “copyrights”, and “doi”

Return type

pub_dict (dict)
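
A minimal sketch of the conversion described above. FakeArticle is a stand-in so the example runs without pymed installed; it assumes the real article object exposes a toDict() method, and the helper name is hypothetical.

```python
import datetime


class FakeArticle:
    """Stand-in for pymed.PubMedArticle so the sketch runs without pymed installed."""

    def toDict(self):
        return {
            "pubmed_id": "12345",
            "title": "Example",
            "publication_date": datetime.date(2021, 7, 4),
            "xml": "<PubmedArticle/>",
        }


def sketch_modify_pub_dict(pub):
    """Hypothetical helper: make the article dictionary safe to write as JSON."""
    pub_dict = dict(pub.toDict())
    pub_dict.pop("xml", None)  # drop the raw XML entry
    pub_dict["publication_date"] = str(pub_dict["publication_date"])
    return pub_dict


saved = sketch_modify_pub_dict(FakeArticle())
```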

academic_tracker.helper_functions.regex_group_return(regex_groups, group_index)[source]

Return the group in the regex_groups indicated by group_index if it exists, else return empty string.

If group_index is out of range of the regex_groups an empty string is returned.

Parameters
  • regex_groups (tuple) – A tuple returned from a matched regex.groups() call.

  • group_index (int) – The index of the regex_groups to return.

Returns

Either empty string or the group string matched by the regex.

Return type

(str)
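
A minimal sketch of the out-of-range behavior, assuming an IndexError is what gets turned into an empty string. The helper name is hypothetical.

```python
import re


def sketch_regex_group_return(regex_groups, group_index):
    """Hypothetical helper: index into a groups() tuple without raising IndexError."""
    try:
        return regex_groups[group_index]
    except IndexError:
        return ""


groups = re.match(r"(\w+), (\w+)", "Doe, Jane").groups()
last_name = sketch_regex_group_return(groups, 0)
out_of_range = sketch_regex_group_return(groups, 2)
```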

academic_tracker.helper_functions.regex_match_return(regex, string_to_match)[source]

Return the groups matched in the regex if the regex matches.

regex is delivered to re.match() with string_to_match, and if there is a match the match.groups() is returned, otherwise an empty tuple is returned.

Parameters
  • regex (str) – A string with a regular expression to be delivered to re.match().

  • string_to_match (str) – The string to match with the regex.

Returns

either the tuple of the matched groups in the regex or an empty tuple if a match wasn’t found.

Return type

(tuple)

academic_tracker.helper_functions.regex_search_return(regex, string_to_search)[source]

Return the groups matched in the regex if the regex matches.

regex is delivered to re.search() with string_to_search, and if there is a match the match.groups() is returned, otherwise an empty tuple is returned.

Parameters
  • regex (str) – A string with a regular expression to be delivered to re.search().

  • string_to_search (str) – The string to match with the regex.

Returns

either the tuple of the matched groups in the regex or an empty tuple if a match wasn’t found.

Return type

(tuple)
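
The difference between the re.match() and re.search() variants can be seen in a short sketch. The helper name sketch_groups_or_empty is hypothetical; it only reproduces the "groups or empty tuple" convention described above.

```python
import re


def sketch_groups_or_empty(match):
    """Hypothetical helper: groups() of a match object, or () when there is no match."""
    return match.groups() if match else ()


pattern = r"(\d{4})-(\d{2})"
text = "Published 2021-07"

# re.match only succeeds if the pattern matches at the start of the string.
match_groups = sketch_groups_or_empty(re.match(pattern, text))
# re.search scans the whole string for the first location that matches.
search_groups = sketch_groups_or_empty(re.search(pattern, text))
```

Returning an empty tuple instead of None lets callers index or unpack the result without a separate None check.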

academic_tracker.helper_functions.vprint(*args, verbosity=0)[source]

Print depending on the state of VERBOSE, SILENT, and verbosity.

If the global SILENT is True don’t print anything. If verbosity is 0 then print. If verbosity is 1 then VERBOSE must be True to print.

Parameters

verbosity (int) – Either 0 or 1 for different levels of verbosity.
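
The printing rules above can be sketched as follows, assuming VERBOSE and SILENT are module-level flags. The helper name sketch_vprint is hypothetical.

```python
import io
from contextlib import redirect_stdout

# Assumption: module-level flags, as the description's "global SILENT" suggests.
VERBOSE = False
SILENT = False


def sketch_vprint(*args, verbosity=0):
    """Hypothetical helper mirroring the documented rules."""
    if SILENT:
        return
    if verbosity == 0 or (verbosity == 1 and VERBOSE):
        print(*args)


# Capture stdout to show which calls actually print with VERBOSE set to False.
buffer = io.StringIO()
with redirect_stdout(buffer):
    sketch_vprint("always shown")
    sketch_vprint("only when verbose", verbosity=1)
captured = buffer.getvalue()
```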

Webio

General functions that interface with the internet.

academic_tracker.webio.check_doi_for_grants(doi, grants)[source]

Searches DOI webpage for grants.

Concatenates “https://doi.org/” with the doi, visits the page and looks for the given grants on that page.

Parameters
  • doi (str) – DOI for the publication.

  • grants (list) – list of str for each grant to look for.

Returns

list of str with each grant that was found on the page.

Return type

found_grants (list)
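
A minimal sketch of the two steps described above, with the network request left out. Both helper names are hypothetical, the grant strings are placeholders, and plain substring matching is an assumption about how grants are looked for on the page.

```python
def sketch_doi_url(doi):
    """Hypothetical helper: build the resolver URL from a bare DOI."""
    return "https://doi.org/" + doi


def sketch_grants_in_text(page_text, grants):
    """Hypothetical helper: keep the grants that appear verbatim in page_text."""
    return [grant for grant in grants if grant in page_text]


url = sketch_doi_url("10.1000/xyz123")
page_text = "This work was supported by GRANT-001."
found_grants = sketch_grants_in_text(page_text, ["GRANT-001", "GRANT-002"])
```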

academic_tracker.webio.clean_tags_from_url(url)[source]

Remove tags from webpage.

Remove tags from a webpage so it looks more like what a user would see in a browser.

Parameters

url (str) – the URL to query.

Returns

webpage contents cleaned of tags.

Return type

clean_url (str)

academic_tracker.webio.download_pdf(pdf_url)[source]

academic_tracker.webio.get_DOI_from_Crossref(title, mailto_email)[source]

Search title on Crossref and try to find a DOI for it.

Parameters
  • title (str) – string of the title of the journal article to search for.

  • mailto_email (str) – an email address needed to search Crossref more effectively.

Returns

Either None or the DOI of the article title. The DOI will not be a URL.

Return type

doi (str)

academic_tracker.webio.get_grants_from_Crossref(title, mailto_email, grants)[source]

Search title on Crossref and try to find the grants associated with it.

Only the grants in the grants parameter are searched for because trying to find all grants associated with the article is too difficult.

Parameters
  • title (str) – string of the title of the journal article to search for.

  • mailto_email (str) – an email address needed to search Crossref more effectively.

  • grants (list) – a list of the grants to try and find for the article.

Returns

Either None or a list of grants found for the article.

Return type

found_grants (list)

academic_tracker.webio.get_url_contents_as_str(url)[source]

Query the url and return its contents as a string.

Parameters

url (str) – the URL to query.

Returns

Either the website as a string or None if an error occurred.

Return type

(str)

academic_tracker.webio.scrape_url_for_DOI(url)[source]

Searches url for DOI.

Uses the regex “(?i).*doi:\s*([^\s]+\w).*” to look for a DOI on the provided url. The DOI is visited to confirm it is a proper DOI.

Parameters

url (str) – url to search.

Returns

string of the DOI found on the webpage. Is empty string if DOI is not found.

Return type

DOI (str)
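
A minimal sketch of the pattern applied to text that has already been downloaded. It assumes the documented regex is meant to contain the \s and \w escapes, uses re.search (the real call may differ), and skips the step of visiting the DOI to confirm it. The helper name is hypothetical.

```python
import re

# Assumption: the documented pattern with its \s and \w escapes in place.
DOI_PATTERN = r"(?i).*doi:\s*([^\s]+\w).*"


def sketch_find_doi(page_text):
    """Hypothetical helper: return the first DOI-like string in page_text, or ""."""
    match = re.search(DOI_PATTERN, page_text)
    return match.group(1) if match else ""


doi = sketch_find_doi("Journal of Examples. DOI: 10.1000/xyz123. All rights reserved.")
no_doi = sketch_find_doi("No identifier on this page.")
```

The trailing \w in the capture group keeps sentence punctuation, such as the period after the DOI, out of the result.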

academic_tracker.webio.search_Google_Scholar_for_ids(authors_json)[source]

Query Google Scholar with author names and get Scholar IDs.

If an author already has a scholar_id, or doesn’t have affiliations they are skipped.

Parameters

authors_json (dict) – JSON matching the Authors section of the Configuration file.

Returns

the authors_json modified with any Scholar IDs found.

Return type

authors_json (dict)

academic_tracker.webio.search_ORCID_for_ids(ORCID_key, ORCID_secret, authors_json)[source]

Query ORCID with author names and get ORCID IDs.

If an author already has an ORCID, or doesn’t have affiliations they are skipped.

Parameters
  • ORCID_key (str) – key assigned to your registered application from ORCID.

  • ORCID_secret (str) – secret given to you by ORCID.

  • authors_json (dict) – JSON matching the Authors section of the Configuration file.

Returns

the authors_json modified with any ORCID IDs found.

Return type

authors_json (dict)

academic_tracker.webio.send_emails(email_messages)[source]

Uses sendmail to send email_messages to authors.

Only works on systems with sendmail installed.

Parameters

email_messages (dict) – keys are author names and values are the message