API

This package has the following modules:

user_input_checking

This module contains functions for checking user input.

tracker_schema

This module contains the schema for validating user input.

athr_srch_modularized

This module contains functions to complete the author_search command modularized into pieces.

athr_srch_webio

This module contains functions for author_search to interface with the internet.

athr_srch_emails_and_reports

This module contains functions to create emails and reports for author_search.

ref_srch_modularized

This module contains functions to complete the reference_search command modularized into pieces.

ref_srch_webio

This module contains functions for reference_search to interface with the internet.

ref_srch_emails_and_reports

This module contains functions to create emails and reports for reference_search.

citation_parsing

This module contains functions for parsing references and citations for reference_search.

fileio

This module contains functions for reading and writing files.

helper_functions

This module contains functions that help the other modules function. The functions do things such as fuzzy matching, regex searching, and printing.

emails_and_reports_helpers

This module contains functions that help create emails and reports.

webio

This module contains general functions for interfacing with the internet.

User Input Checking

Functions that check the user input for errors.

academic_tracker.user_input_checking.cli_inputs_check(args)[source]

Run input checking on the CLI inputs.

Uses jsonschema to validate the inputs.

Parameters:

args (dict) – dict from docopt.

academic_tracker.user_input_checking.config_file_check(config_json, no_ORCID, no_GoogleScholar, no_Crossref, no_PubMed)[source]

Check that the configuration JSON file is as expected.

The JSON schema used for validation is in the tracker_schema module.

Parameters:
  • config_json (dict) – dict with the same structure as the configuration JSON file.

  • no_ORCID (bool) – if True delete the part of the schema that checks ORCID attributes.

  • no_GoogleScholar (bool) – if True and no_Crossref is True delete the part of the schema that checks Crossref attributes.

  • no_Crossref (bool) – if True and no_GoogleScholar is True delete the part of the schema that checks Crossref attributes.

  • no_PubMed (bool) – if True delete the part of the schema that checks PubMed attributes.

academic_tracker.user_input_checking.config_report_check(config_json)[source]

Check that the report attributes don’t have conflicts.

Make sure that the values in sort and column_order are in columns, and that every column is in column_order.

Parameters:

config_json (dict) – dict with the same structure as the configuration JSON file.
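The consistency rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the helper name find_report_conflicts is hypothetical, and it assumes the report attributes are available as one dict with "columns", "sort", and "column_order" keys.

```python
def find_report_conflicts(report_attributes):
    """Return a list of human-readable conflicts (hypothetical helper)."""
    # Works whether "columns" is a dict keyed by column name or a list of names.
    columns = set(report_attributes.get("columns", {}))
    sort = report_attributes.get("sort", [])
    column_order = report_attributes.get("column_order", [])

    problems = []
    # Every value in sort must name an existing column.
    for name in sort:
        if name not in columns:
            problems.append(f"sort value '{name}' is not in columns")
    # Every value in column_order must name an existing column.
    for name in column_order:
        if name not in columns:
            problems.append(f"column_order value '{name}' is not in columns")
    # Every column must appear in column_order.
    for name in sorted(columns - set(column_order)):
        problems.append(f"column '{name}' is missing from column_order")
    return problems


# Placeholder report attributes, made up for this example.
example = {
    "columns": {"Title": "<title>", "DOI": "<DOI>"},
    "sort": ["Title", "Year"],
    "column_order": ["Title"],
}
conflicts = find_report_conflicts(example)
```

Collecting all conflicts in a list, rather than stopping at the first one, lets a caller report every problem in a single pass.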

academic_tracker.user_input_checking.prev_pubs_file_check(prev_pubs)[source]

Run input checking on prev_pubs dict.

The JSON schema used for validation is in the tracker_schema module.

Parameters:

prev_pubs (dict) – dict with the same structure as the previous publications JSON file.

academic_tracker.user_input_checking.ref_config_file_check(config_json, no_Crossref, no_PubMed)[source]

Check that the configuration JSON file is as expected.

The JSON schema used for validation is in the tracker_schema module.

Parameters:
  • config_json (dict) – dict with a truncated structure of the configuration JSON file.

  • no_Crossref (bool) – if True delete the part of the schema that checks Crossref attributes.

  • no_PubMed (bool) – if True delete the part of the schema that checks PubMed attributes.

academic_tracker.user_input_checking.tok_reference_check(tok_ref)[source]

Run input checking on tok_ref dict.

The JSON schema used for validation is in the tracker_schema module.

Parameters:

tok_ref (dict) – dict with the same structure as the tokenized reference JSON file.

academic_tracker.user_input_checking.tracker_validate(instance, schema, pattern_messages={}, cls=None, *args, **kwargs)[source]

Wrapper around jsonschema.validate to give better error messages.

Parameters:
  • instance (dict) – JSON as a dict to validate

  • schema (dict) – JSON schema as a dict to validate instance against

  • pattern_messages (dict) – if the instance has a ValidationError of the pattern type then look up the attribute that failed the pattern in this dict and see if there is a custom message

Raises:

jsonschema.ValidationError – If an unexpected jsonschema error happens this is raised rather than a system exit.

Author Search Modularized

Modularized pieces of author_search.

academic_tracker.athr_srch_modularized.build_publication_dict(config_dict, prev_pubs, no_ORCID, no_GoogleScholar, no_Crossref, no_PubMed)[source]

Query PubMed, ORCID, Google Scholar, and Crossref for publications for the authors.

Parameters:
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • prev_pubs (dict) – Matches the publication JSON schema. Used to ignore publications when querying.

  • no_ORCID (bool) – If True, do not search ORCID.

  • no_GoogleScholar (bool) – If True, do not search Google Scholar.

  • no_Crossref (bool) – If True, do not search Crossref.

  • no_PubMed (bool) – If True, do not search PubMed.

Returns:
  • running_pubs (dict) – The dictionary matching the publication JSON schema.

  • all_queries (dict) – The pubs searched for each source and each author. {“PubMed”:{“author1”:[pub1, …], …}, “ORCID”:{“author1”:[pub1, …], …}, “Google Scholar”:{“author1”:[pub1, …], …}, “Crossref”:{“author1”:[pub1, …], …}}
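As a small illustration of the all_queries shape described above, the following sketch builds a dictionary with that nesting and walks it. The publication records and their "title" key are placeholders invented for the example; real entries hold whatever each source returned.

```python
# Placeholder publication records; the "title" key is an assumption for illustration.
all_queries = {
    "PubMed": {"author1": [{"title": "Example A"}]},
    "ORCID": {"author1": []},
    "Google Scholar": {"author1": [{"title": "Example A"}, {"title": "Example B"}]},
    "Crossref": {"author1": []},
}

# Count how many publications each source returned for each author.
counts = {
    source: {author: len(pubs) for author, pubs in by_author.items()}
    for source, by_author in all_queries.items()
}
```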

academic_tracker.athr_srch_modularized.generate_internal_data_and_check_authors(config_dict)[source]

Create authors_by_project_dict and look for authors without projects.

Parameters:

config_dict (dict) – Matches the Configuration file JSON schema.

Returns:
  • authors_by_project_dict (dict) – Keys are project names and values are a dictionary of authors and their attributes.

  • config_dict (dict) – Same as the input, but with author information updated based on project information.

academic_tracker.athr_srch_modularized.input_reading_and_checking(config_json_filepath, no_ORCID, no_GoogleScholar, no_Crossref, no_PubMed)[source]

Read in inputs from user and do error checking.

Parameters:
  • config_json_filepath (str) – filepath to the configuration JSON.

  • no_ORCID (bool) – If True, do not search ORCID. Reduces checking on config JSON if True.

  • no_GoogleScholar (bool) – If True, do not search Google Scholar. Reduces checking on config JSON if True.

  • no_Crossref (bool) – If True, do not search Crossref. Reduces checking on config JSON if True.

  • no_PubMed (bool) – If True, do not search PubMed. Reduces checking on config JSON if True.

Returns:

Matches the Configuration file JSON schema.

Return type:

config_dict (dict)

academic_tracker.athr_srch_modularized.save_and_send_reports_and_emails(authors_by_project_dict, publication_dict, config_dict, test)[source]

Build the summary report and project reports and email them.

Parameters:
  • authors_by_project_dict (dict) – Keys are project names and values are a dictionary of authors and their attributes.

  • publication_dict (dict) – The dictionary matching the publication JSON schema.

  • config_dict (dict) – Matches the Configuration file JSON schema.

  • test (bool) – If True save_dir_name is tracker-test instead of tracker- and emails are not sent.

Returns:

Name of the directory where the emails and reports were saved.

Return type:

save_dir_name (str)

Author Search Webio

Internet interfacing for author_search.

academic_tracker.athr_srch_webio.search_Crossref_for_pubs(running_pubs, authors_json, mailto_email, prev_query=None)[source]

Searches Crossref for publications by each author.

For each author in authors_json, Crossref is queried for publications. The list of publications is then filtered by affiliations and cutoff_year. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. Each remaining publication is then checked for citations of any of the author’s grants. If prev_query is given, then publications will be taken from it instead of querying Crossref again.

Parameters:
  • running_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • mailto_email (str) – used in the query to Crossref.

  • prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}

Returns:
  • running_pubs (dict) – keys are publication ids and values are a dictionary with publication attributes.

  • all_pubs (dict) – a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.
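The affiliation and cutoff_year filtering described for the search functions can be sketched as follows. This is a minimal sketch, not the package's code: the helper name keep_publication, the "publication_year" and "affiliations" keys on the publication record, and the case-insensitive substring match are all assumptions made for the example.

```python
def keep_publication(pub, author_attributes):
    """Apply the two filters to one publication (hypothetical helper)."""
    # Assumption: the publication carries an integer year.
    if pub["publication_year"] < author_attributes["cutoff_year"]:
        return False  # published before the cutoff year
    # Assumption: affiliations are matched as case-insensitive substrings.
    wanted = [affiliation.lower() for affiliation in author_attributes["affiliations"]]
    found = " ".join(pub["affiliations"]).lower()
    # Keep the publication only if at least one affiliation matches.
    return any(affiliation in found for affiliation in wanted)


# Placeholder author attributes and publication records.
author = {"cutoff_year": 2019, "affiliations": ["kentucky"]}
pubs = [
    {"title": "A", "publication_year": 2018, "affiliations": ["University of Kentucky"]},
    {"title": "B", "publication_year": 2020, "affiliations": ["University of Kentucky"]},
    {"title": "C", "publication_year": 2021, "affiliations": ["Other Institute"]},
]
kept = [pub["title"] for pub in pubs if keep_publication(pub, author)]
```

Here only "B" passes both filters: "A" predates the cutoff year and "C" has no matching affiliation.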

academic_tracker.athr_srch_webio.search_Google_Scholar_for_pubs(running_pubs, authors_json, mailto_email, prev_query=None)[source]

Searches Google Scholar for publications by each author.

For each author in authors_json Google Scholar is queried for the publications. The list of publications is then filtered by affiliations and cutoff_year. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. If prev_query is given, then publications will be taken from it instead of querying Google Scholar again.

Parameters:
  • running_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • mailto_email (str) – used in the query to Crossref when trying to find DOIs for the articles.

  • prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}

Returns:
  • running_pubs (dict) – keys are publication ids and values are a dictionary with publication attributes.

  • all_pubs (dict) – a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.

academic_tracker.athr_srch_webio.search_ORCID_for_pubs(running_pubs, ORCID_key, ORCID_secret, authors_json, prev_query=None)[source]

Searches ORCID for publications by each author.

For each author in authors_json ORCID is queried for the publications. The list of publications is then filtered by affiliations and cutoff_year. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. If prev_query is given, then publications will be taken from it instead of querying ORCID again.

Parameters:
  • running_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • ORCID_key (str) – string of the app key ORCID gives when you register the app with them

  • ORCID_secret (str) – string of the secret ORCID gives when you register the app with them

  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}

Returns:
  • running_pubs (dict) – keys are publication ids and values are a dictionary with publication attributes.

  • all_pubs (dict) – a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.

academic_tracker.athr_srch_webio.search_PubMed_for_pubs(running_pubs, authors_json, from_email, prev_query=None)[source]

Searches PubMed for publications by each author.

For each author in authors_json, PubMed is queried for publications. The list of publications is then filtered by affiliations and cutoff_year. If the publication is already in running_pubs, then it tries to fill in missing information from this source. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. If prev_query is given, then publications will be taken from it instead of querying PubMed again.

Parameters:
  • running_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • authors_json (dict) – keys are authors and values are author attributes. Matches Authors section of configuration JSON schema.

  • from_email (str) – used in the query to PubMed.

  • prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}

Returns:
  • running_pubs (dict) – keys are publication ids and values are a dictionary with publication attributes.

  • all_pubs (dict) – a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.
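The prev_query behavior shared by the search functions, where stored results replace a fresh query, can be sketched like this. The helper get_author_pubs and the stand-in fake_query are hypothetical. Falling back to a live query when an author is missing from prev_query is an assumption, not something the descriptions above state.

```python
def get_author_pubs(author, query_function, prev_query=None):
    """Return an author's publications, reusing prev_query when possible."""
    if prev_query is not None and author in prev_query:
        return prev_query[author]  # reuse stored results, no network call
    # Assumption: authors missing from prev_query are queried normally.
    return query_function(author)


calls = []


def fake_query(author):
    # Stand-in for a real network query; records that it was called.
    calls.append(author)
    return [{"title": f"Fresh result for {author}"}]


# Placeholder results from an earlier run, in the {author1: [pub1, ...]} shape.
prev_query = {"author1": [{"title": "Cached result"}]}
cached = get_author_pubs("author1", fake_query, prev_query)
fresh = get_author_pubs("author2", fake_query, prev_query)
```

Passing the all_pubs value returned by one run as prev_query on the next is what makes it possible to rerun the filtering without querying the source again.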

Author Search Emails and Reports

Functions to create emails and reports for author_search.

academic_tracker.athr_srch_emails_and_reports.build_author_loop(publication_dict, config_dict, authors_by_project_dict, project_name, template_string)[source]

Replace tags in template_string with the appropriate information.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • project_name (str) – The name of the project.

  • template_string (str) – Template used to create the project report.

Returns:

The string built by looping over the authors in authors_by_project_dict and using the template_string to build a report.

Return type:

project_authors (str)

academic_tracker.athr_srch_emails_and_reports.create_collaborator_report(publication_dict, template, author, pubs, filename, save_dir_name)[source]

Create a collaborator report from a formatted string.

Loop over all of the author’s publications and create a

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • author (str) – The key to the author in config_dict[“Authors”].

  • pubs (dict) – Keys are publications for the author and values are the grants associated with that pub.

  • filename (str) – filename to save the publication under.

  • save_dir_name (str) – directory to save the report in.

Returns:

The text of the report or an empty string.

Return type:

report (str)

academic_tracker.athr_srch_emails_and_reports.create_collaborators_reports_and_emails(publication_dict, config_dict, save_dir_name)[source]

Create a report of collaborators for authors in publication_dict.

For each author in publication_dict with an author_id create a csv file with the other authors on their publications.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • save_dir_name (str) – directory to save the reports in.

Returns:

keys and values match the email JSON file.

Return type:

email_messages (dict)

academic_tracker.athr_srch_emails_and_reports.create_project_report(publication_dict, config_dict, authors_by_project_dict, project_name, template_string='<author_loop><author_first> <author_last>:<pub_loop>\n\tTitle: <title> \n\tAuthors: <authors> \n\tJournal: <journal> \n\tDOI: <DOI> \n\tPMID: <PMID> \n\tPMCID: <PMCID> \n\tGrants: <grants>\n</pub_loop>\n</author_loop>', author_first='', author_last='')[source]

Create the project report for the project.

The details of creating project reports are outlined in the documentation. Use the information in the config_dict, publication_dict, and authors_by_project_dict to fill in the information in the template_string. If author_first is given then it is assumed the report is actually for a single author and not a whole project.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • project_name (str) – The name of the project.

  • template_string (str) – Template used to create the project report.

  • author_first (str) – First name of the author. If not “” the report is assumed to be for 1 author.

  • author_last (str) – Last name of the author.

Returns:

The template_string with the appropriate tags replaced with relevant information.

Return type:

template_string (str)
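The tag replacement described above can be illustrated with a much simplified sketch that handles only a single <pub_loop> section. It is not the package's implementation: the helper fill_pub_loop is hypothetical, it assumes each publication is a flat dict whose keys match the tag names, and the titles and DOIs are placeholders.

```python
import re


def fill_pub_loop(template_string, pubs):
    """Expand <pub_loop>...</pub_loop> once per publication (simplified sketch)."""
    match = re.search(r"<pub_loop>(.*?)</pub_loop>", template_string, flags=re.DOTALL)
    if match is None:
        return template_string  # nothing to expand
    body = match.group(1)
    rendered = []
    for pub in pubs:
        text = body
        # Replace each <tag> with the matching publication attribute.
        for tag, value in pub.items():
            text = text.replace(f"<{tag}>", str(value))
        rendered.append(text)
    # Splice the repeated body back in place of the loop section.
    return template_string[:match.start()] + "".join(rendered) + template_string[match.end():]


template = "Publications:<pub_loop>\n\tTitle: <title>\n\tDOI: <DOI></pub_loop>\n"
# Placeholder publications; the DOIs are not real.
pubs = [{"title": "First", "DOI": "10.1/a"}, {"title": "Second", "DOI": "10.1/b"}]
report = fill_pub_loop(template, pubs)
```

The nested <author_loop> and <project_loop> tags in the default templates would need the same expansion applied one level out for each loop.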

academic_tracker.athr_srch_emails_and_reports.create_project_reports_and_emails(authors_by_project_dict, publication_dict, config_dict, save_dir_name)[source]

Create project reports and emails for each project.

For each project in config_dict create a report and optional email. Reports are saved in save_dir_name as they are created.

Parameters:
  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • save_dir_name (str) – directory to save the reports in.

Returns:

keys and values match the email JSON file.

Return type:

email_messages (dict)

academic_tracker.athr_srch_emails_and_reports.create_pubs_by_author_dict(publication_dict)[source]

Create a dictionary with authors as the keys and the pub_ids and grants as the values.

Organizes the publication information in an author focused way so other operations are easier.

Parameters:

publication_dict (dict) – keys and values match the publications JSON file.

Returns:

dictionary where the keys are authors and the values are a dictionary of pub_ids with their associated grants.

Return type:

pubs_by_author_dict (dict)
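The author-focused reorganization can be sketched as below. The structure of the publication records is assumed for illustration: each publication has an "authors" list of dicts that may carry an "author_id", plus a "grants" list. The function name pubs_by_author_sketch is hypothetical.

```python
def pubs_by_author_sketch(publication_dict):
    """Group pub_ids and their grants under each identified author."""
    pubs_by_author = {}
    for pub_id, pub in publication_dict.items():
        for author in pub.get("authors", []):
            author_id = author.get("author_id")
            if not author_id:
                continue  # assumption: authors without an id are skipped
            pubs_by_author.setdefault(author_id, {})[pub_id] = pub.get("grants", [])
    return pubs_by_author


# Placeholder publications, made up for the example.
publication_dict = {
    "pub1": {"authors": [{"author_id": "Author 1"}, {"lastname": "Other"}], "grants": ["G1"]},
    "pub2": {"authors": [{"author_id": "Author 1"}, {"author_id": "Author 2"}], "grants": []},
}
result = pubs_by_author_sketch(publication_dict)
```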

academic_tracker.athr_srch_emails_and_reports.create_summary_report(publication_dict, config_dict, authors_by_project_dict, template_string='<project_loop><project_name>\n<author_loop>\t<author_first> <author_last>:<pub_loop>\n\t\tTitle: <title> \n\t\tAuthors: <authors> \n\t\tJournal: <journal> \n\t\tDOI: <DOI> \n\t\tPMID: <PMID> \n\t\tPMCID: <PMCID> \n\t\tGrants: <grants>\n</pub_loop>\n</author_loop></project_loop>')[source]

Create the summary report for the run.

The details of creating summary reports are outlined in the documentation. Use the information in the config_dict, publication_dict, and authors_by_project_dict to fill in the information in the template_string.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • template_string (str) – Template used to create the summary report.

Returns:

The report built by replacing the appropriate tags in template_string with relevant information.

Return type:

report_string (str)

academic_tracker.athr_srch_emails_and_reports.create_tabular_collaborator_report(publication_dict, config_dict, author, pubs, filename, file_format, save_dir_name)[source]

Create a table for a collaborator report and save as either csv or xlsx.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • author (str) – The key to the author in config_dict[“Authors”].

  • pubs (dict) – Keys are publications for the author and values are the grants associated with that pub.

  • filename (str) – filename to save the publication under.

  • file_format (str) – csv or xlsx, determines what format to save in.

  • save_dir_name (str) – directory to save the report in.

Returns:
  • report (str) – The text of the report, an empty string, or the path to the saved xlsx file.

  • filename (str) – Filename of the report. May have had an .xlsx added to the end.

academic_tracker.athr_srch_emails_and_reports.create_tabular_project_report(publication_dict, config_dict, authors_by_project_dict, pubs_by_author_dict, project_name, report_attributes, save_dir_name, filename)[source]

Create a pandas DataFrame and save it as Excel or CSV.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • pubs_by_author_dict (dict) – dictionary where the keys are authors and the values are a dictionary of pub_ids with their associated grants.

  • project_name (str) – Name of the project.

  • report_attributes (dict) – Dictionary of the report attributes. Could come from project_descriptions or an author.

  • save_dir_name (str) – directory to save the report in.

  • filename (str) – Filename of the report.

Returns:
  • report (str) – Either the text of the report if csv or a relative filepath to where the Excel file is saved.

  • filename (str) – Filename of the report. May have had an .xlsx added to the end.

academic_tracker.athr_srch_emails_and_reports.create_tabular_summary_report(publication_dict, config_dict, authors_by_project_dict, save_dir_name)[source]

Create a pandas DataFrame and save it as Excel or CSV.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].

  • save_dir_name (str) – directory to save the report in.

Returns:
  • report (str) – Either the text of the report if csv or a relative filepath to where the Excel file is saved.

  • filename (str) – Filename of the report. May have had an .xlsx added to the end.

Reference Search Modularized

Modularized pieces of reference_search.

academic_tracker.ref_srch_modularized.build_publication_dict(config_dict, tokenized_citations, no_Crossref, no_PubMed)[source]

Query PubMed and Crossref for publications matching the citations in tokenized_citations.

Parameters:
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.

  • no_Crossref (bool) – If True, do not search Crossref. Reduces checking on config JSON if True.

  • no_PubMed (bool) – If True, do not search PubMed. Reduces checking on config JSON if True.

Returns:
  • running_pubs (dict) – The dictionary matching the publication JSON schema.

  • tokenized_citations (list) – Same list as the input but with the pub_dict_key updated to match the publication found.

  • all_queries (dict) – For each source searched, a list of lists, where each index is the pubs searched through after querying until the citation was matched. {“PubMed”:[[pub1, …], …], “Crossref”:[[pub1, …], …]}

academic_tracker.ref_srch_modularized.input_reading_and_checking(config_json_filepath, ref_path_or_URL, MEDLINE_reference, no_Crossref, no_PubMed, prev_pub_filepath, remove_duplicates)[source]

Read in inputs from user and do error checking.

Parameters:
  • config_json_filepath (str) – filepath to the configuration JSON.

  • ref_path_or_URL (str) – either a filepath to file to tokenize or a URL to tokenize.

  • MEDLINE_reference (bool) – If True ref_path_or_URL is a filepath to a MEDLINE formatted file.

  • no_Crossref (bool) – If True, do not search Crossref. Reduces checking on config JSON if True.

  • no_PubMed (bool) – If True, do not search PubMed. Reduces checking on config JSON if True.

  • prev_pub_filepath (str or None) – filepath to the publication JSON to read in.

  • remove_duplicates (bool) – if True, remove duplicate entries in tokenized citations.

Returns:
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.

  • has_previous_pubs (bool) – True if a prev_pub file was input, False otherwise.

  • prev_pubs (dict) – The contents of the prev_pub file input by the user if provided.

academic_tracker.ref_srch_modularized.save_and_send_reports_and_emails(config_dict, tokenized_citations, publication_dict, prev_pubs, has_previous_pubs, test)[source]

Build the summary report and email it.

Parameters:
  • config_dict (dict) – Matches the Configuration file JSON schema.

  • tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.

  • publication_dict (dict) – The dictionary matching the publication JSON schema.

  • prev_pubs (dict) – The contents of the prev_pub file input by the user if provided.

  • has_previous_pubs (bool) – True if a prev_pub file was input, False otherwise.

  • test (bool) – If True save_dir_name is tracker-test instead of tracker- and emails are not sent.

Returns:

Name of the directory where the emails and report were saved.

Return type:

save_dir_name (str)

Reference Search Webio

Internet interfacing for reference_search.

academic_tracker.ref_srch_webio.build_pub_dict_from_PMID(PMID_list, from_email)[source]

Query PubMed for each PMID and build a dictionary of the returned data.

Parameters:
  • PMID_list (list) – A list of PMIDs as strings.

  • from_email (str) – An email address to use when querying PubMed.

Returns:

keys are publication ids and values are a dictionary with publication attributes.

Return type:

publication_dict (dict)

academic_tracker.ref_srch_webio.parse_myncbi_citations(url)[source]

Tokenize the citations on a MyNCBI URL.

Note that authors and title can be missing or empty from the webpage. This function assumes the url is the first page of the MyNCBI citations. The first page is tokenized and then each subsequent page is visited and tokenized.

Parameters:

url (str) – the url of the MyNCBI page.

Returns:

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type:

parsed_pubs (dict)

academic_tracker.ref_srch_webio.search_references_on_source(source, running_pubs, tokenized_citations, mailto_email, prev_query=None)[source]

Searches source for publications matching the citations.

For each citation in tokenized_citations the source is queried for the publication. If the publication is already in running_pubs then missing information will be filled in if possible.

Possible sources are “Crossref” or “PubMed”.

Parameters:
  • source (str) – must be one of “Crossref” or “PubMed”.

  • running_pubs (dict) – dictionary of publications matching the JSON schema for publications.

  • tokenized_citations (list) – list of citations parsed from a source. Each citation is a dict {“authors”, “title”, “DOI”, “PMID”, “reference_line”, “pub_dict_key”}.

  • mailto_email (str) – email provided to the source when querying.

  • prev_query (list|None) – a list of lists containing publications from a previous call to this function. [[pub1, …], [pub1, …], …]

Returns:
  • running_pubs (dict) – keys are publication ids and values are a dictionary with publication attributes.

  • matching_key_for_citation (list) – list of keys to the publication matching the citation at the same index.

  • all_pubs (list) – list of lists, each index is the pubs searched through after querying until the citation was matched.

academic_tracker.ref_srch_webio.tokenize_reference_input(reference_input, MEDLINE_reference, remove_duplicates=True)[source]

Tokenize the citations in reference_input.

reference_input can be a URL or filepath. MyNCBI URLs are handled specially, but all other URLs are read as a text document and parsed line by line. If the format of the reference is MEDLINE, then set MEDLINE_reference to True and it will be parsed as such instead of line by line. Otherwise, citations are expected to be 1 per line.

Parameters:
  • reference_input (str) – URL or filepath.

  • MEDLINE_reference (bool) – True if reference_input is in MEDLINE format.

  • remove_duplicates (bool) – if True, remove duplicate entries in tokenized citations.

Returns:

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type:

tokenized_citations (dict)

Reference Search Emails and Reports

Functions to create emails and reports for reference_search.

academic_tracker.ref_srch_emails_and_reports.convert_tokenized_authors_to_str(authors)[source]

Combine authors into a comma separated string.

Uses first_name last_name for each author, but falls back to last_name initials when the first name is missing. Example: first_name1 last_name1, last_name2 initials2

Parameters:

authors (list) – a list of dictionaries [{“last”:last_name, “initials”:initials}, {“last”:last_name, “first”:first_name}]

Returns:

comma separated list of authors.

Return type:

authors_string (str)
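The fallback rule described above can be sketched as follows. This is an illustrative reimplementation under the documented input shape, not the package's code. The function name authors_to_str and the sample names are placeholders.

```python
def authors_to_str(authors):
    """Join tokenized authors into one comma separated string (sketch)."""
    parts = []
    for author in authors:
        if author.get("first"):
            # Preferred form: first_name last_name.
            parts.append(f"{author['first']} {author['last']}")
        else:
            # Fallback form: last_name initials.
            parts.append(f"{author['last']} {author.get('initials', '')}".strip())
    return ", ".join(parts)


authors = [{"last": "Smith", "first": "Jane"}, {"last": "Doe", "initials": "JA"}]
authors_string = authors_to_str(authors)
```

For the sample input, authors_string is "Jane Smith, Doe JA".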

academic_tracker.ref_srch_emails_and_reports.create_report_from_template(publication_dict, is_citation_in_prev_pubs_list, tokenized_citations, template_string='<pub_loop>Reference Line:\n\t<ref_line>\nTokenized Reference:\n\tAuthors: <tok_authors>\n\tTitle: <tok_title>\n\tPMID: <tok_PMID>\n\tDOI: <tok_DOI>\nQueried Information:\n\tDOI: <DOI>\n\tPMID: <PMID>\n\tPMCID: <PMCID>\n\tGrants: <grants>\n\n</pub_loop>')[source]

Create project report based on template_string.

Loop over each publication in publication_dict and build a report based on the tags in the template_string. Details about reports are in the documentation.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • is_citation_in_prev_pubs_list (list) – list of bools that indicate whether or not the citation at the same index in tokenized_citations is in the prev_pubs

  • tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.

  • template_string (str) – string with tags indicating what information to put in the report.

Returns:

text of the created report.

Return type:

report (str)

academic_tracker.ref_srch_emails_and_reports.create_tabular_report(publication_dict, config_dict, is_citation_in_prev_pubs_list, tokenized_citations, save_dir_name)[source]

Create a pandas DataFrame and save it as Excel or CSV.

Parameters:
  • publication_dict (dict) – keys and values match the publications JSON file.

  • config_dict (dict) – keys and values match the project tracking configuration JSON file.

  • is_citation_in_prev_pubs_list (list) – list of bools that indicate whether or not the citation at the same index in tokenized_citations is in the prev_pubs

  • tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.

  • save_dir_name (str) – directory to save the report in.

Returns:
  • report (str) – Either the text of the report if csv or a relative filepath to where the Excel file is saved.

  • filename (str) – Filename of the report. May have had an .xlsx added to the end.

academic_tracker.ref_srch_emails_and_reports.create_tokenization_report(tokenized_citations)[source]

Create a report that details all the information about how a reference was tokenized.

Intended as a troubleshooting report.

Parameters:

tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.

Returns:

report text built from tokenized_citations.

Return type:

report_string (str)

Citation Parsing

Functions for parsing citations.

academic_tracker.citation_parsing.parse_MEDLINE_format(text_string)[source]

Tokenize text_string based on it being of the MEDLINE format.

Parameters:

text_string (str) – The string to tokenize.

Returns:

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type:

parsed_pubs (dict)

academic_tracker.citation_parsing.parse_text_for_citations(text)[source]

Parse text line by line and tokenize it.

The function is aware of MLA, APA, Chicago, Harvard, and Vancouver style citations. Although these styles define standards for citations, in practice the standards are not strictly followed, so the function takes a heuristic approach rather than parsing each style strictly.

Parameters:

text (str) – The text to parse.

Returns:

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type:

parsed_pubs (dict)

academic_tracker.citation_parsing.tokenize_APA_or_Harvard_authors(authors_string)[source]

Tokenize authors based on APA or Harvard citation style.

Parameters:

authors_string (str) – string with the authors to tokenize.

Returns:

list of dictionaries with the authors last names and initials. [{“last”:lastname, “initials”:initials}, …]

Return type:

(list)

academic_tracker.citation_parsing.tokenize_MLA_or_Chicago_authors(authors_string)[source]

Tokenize authors based on MLA or Chicago citation style.

Parameters:

authors_string (str) – string with the authors to tokenize.

Returns:

list of dictionaries with the authors first, middle, and last names. [{“first”:firstname, “middle”:middlename, “last”:lastname}, …]

Return type:

(list)

academic_tracker.citation_parsing.tokenize_Vancouver_authors(authors_string)[source]

Tokenize authors based on Vancouver citation style.

Parameters:

authors_string (str) – string with the authors to tokenize.

Returns:

list of dictionaries with the authors last names and initials. [{“last”:lastname, “initials”:initials}, …]

Return type:

(list)
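
To illustrate the output shape described above, here is a minimal sketch of a Vancouver-style author tokenizer. It is not the package's implementation: the function name tokenize_vancouver_authors_sketch, the comma splitting, and the handling of “et al” are all assumptions made for illustration, and real citations will need more heuristics.

```python
def tokenize_vancouver_authors_sketch(authors_string):
    """Hypothetical sketch; not academic_tracker's tokenizer."""
    authors = []
    # Vancouver style lists authors as "Last Initials", separated by commas.
    for chunk in authors_string.rstrip(". ").split(","):
        chunk = chunk.strip()
        # Assumption: a trailing "et al" is not an author and is dropped.
        if chunk.lower() in ("et al", "et al."):
            continue
        parts = chunk.split()
        # Assumption: fragments without both a last name and initials are skipped.
        if len(parts) < 2:
            continue
        # The final token is taken as the initials; everything before it is the last name.
        authors.append({"last": " ".join(parts[:-1]), "initials": parts[-1]})
    return authors


print(tokenize_vancouver_authors_sketch("Smith AB, de la Cruz JM, et al."))
```

Taking the final token as the initials lets multi-word last names such as “de la Cruz” survive intact.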

academic_tracker.citation_parsing.tokenize_myncbi_citations(html)[source]

Tokenize the citations on a MyNCBI HTML page.

Note that authors and title can be missing or empty from the webpage.

Parameters:

html (str) – the html of the MyNCBI page.

Returns:

the citations tokenized in a dictionary matching the tokenized citations JSON schema.

Return type:

parsed_pubs (dict)

Fileio

This module contains the functions that read and write files.

academic_tracker.fileio.load_json(filepath)[source]

Adds error checking around loading a json file.

Parameters:

filepath (str) – filepath to the json file

Returns:

json read from file in a dictionary

Return type:

internal_data (dict)

Raises:

Exception – If there is a problem opening the file, an exception is raised.

academic_tracker.fileio.read_csv(doc_path)[source]

Read csv into a pandas dataframe.

Parameters:

doc_path (str) – path to the csv file to read in.

Returns:

Pandas dataframe of the csv contents.

Return type:

df (DataFrame)

Raises:

Exception – If there is a problem opening the file, an exception is raised.

academic_tracker.fileio.read_previous_publications(filepath)[source]

Read in the previous publication json file.

If the prev_pub option was given by the user, then that filepath is used to read in the file, and it is checked to make sure the JSON is a list and each value is a string. If the prev_pub option was not given, then look for a “tracker-timestamp” directory in the current working directory, and if it has a publications.json file, read in that file. If no previous publications are found, then an empty dict is returned for prev_pubs.

Parameters:

filepath (str or None) – path to the publications JSON to read in.

Returns:

True if a previous publications file was found. prev_pubs (dict): dict where keys are publication IDs and values are dicts of publication attributes.

Return type:

has_previous_pubs (bool)

academic_tracker.fileio.read_text_from_docx(doc_path)[source]

Open docx file at doc_path and read contents into a string.

Parameters:

doc_path (str) – path to docx file.

Returns:

A string of the contents of the docx file. Each line concatenated with a newline character.

Return type:

(str)

Raises:

Exception – If there is a problem opening the file, an exception is raised.

academic_tracker.fileio.read_text_from_txt(doc_path)[source]

Open txt or csv file at doc_path and read contents into a string.

Parameters:

doc_path (str) – path to txt or csv file.

Returns:

A string of the contents of the txt or csv file. Each line concatenated with a newline character.

Return type:

(str)

Raises:

Exception – If there is a problem opening the file, an exception is raised.

academic_tracker.fileio.save_emails_to_file(email_messages, save_dir_name)[source]

Save email_messages to “emails.json” in save_dir_name in the current working directory.

Parameters:
  • email_messages (dict) – keys are author names and values are the email messages.

  • save_dir_name (str) – directory name to append to the current working directory to save the emails.json file in

academic_tracker.fileio.save_json_to_file(save_dir_name, file_name, json_dict, sort_keys=True)[source]

Saves the json_dict to file_name in save_dir_name in the current working directory.

Parameters:
  • save_dir_name (str) – directory name to append to the current working directory to save the json_dict in.

  • file_name (str) – the name to give the file, should have ‘.json’ as the extension.

  • json_dict (dict or list) – data to save to file.

  • sort_keys (bool) – passed to json.dumps, if True sort the dictionary keys before saving.
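
A minimal sketch of what saving with these parameters could look like, using only the standard library. The name save_json_sketch, the directory creation, the indent setting, and the returned path are assumptions added for illustration; the documentation above only states that sort_keys is passed to json.dumps.

```python
import json
import os


def save_json_sketch(save_dir_name, file_name, json_dict, sort_keys=True):
    """Hypothetical sketch; not academic_tracker's save_json_to_file."""
    # The save directory is resolved relative to the current working directory.
    save_path = os.path.join(os.getcwd(), save_dir_name)
    # Assumption: create the directory if it does not exist yet.
    os.makedirs(save_path, exist_ok=True)
    file_path = os.path.join(save_path, file_name)
    with open(file_path, "w") as outfile:
        # sort_keys is forwarded to json.dumps, as described above.
        outfile.write(json.dumps(json_dict, indent=2, sort_keys=sort_keys))
    # Returning the path is a convenience of this sketch only.
    return file_path
```

Sorting keys makes the saved file stable between runs, which keeps diffs between two publications.json files readable.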

academic_tracker.fileio.save_publications_to_file(save_dir_name, publication_dict, prev_pubs)[source]

Saves the publication_dict to “publications.json” in save_dir_name in the current working directory.

prev_pubs and publication_dict will be combined before saving.

Parameters:
  • save_dir_name (str) – directory name to append to the current working directory to save the publications.json file in

  • publication_dict (dict) – dictionary with publication ids as the keys to the dict

  • prev_pubs (list) – List of publication ids that are publications previously found.

academic_tracker.fileio.save_string_to_file(save_dir_name, file_name, text_to_save)[source]

Save a string to file.

Parameters:
  • save_dir_name (str) – directory in the current working directory to save the string to.

  • file_name (str) – string to name the file.

  • text_to_save (str) – the string to put in the file contents.

Helper Functions

This module contains helper functions for tasks such as fuzzy matching, regex searching, and printing.

academic_tracker.helper_functions.adjust_author_attributes(authors_by_project_dict, config_dict)[source]

Modifies config_dict with values from authors_by_project_dict.

Go through the authors in authors_by_project_dict and find the lowest cutoff_year. Also find affiliations and grants and create a union of them across projects. Update the authors in config_dict[“Authors”].

Parameters:
  • authors_by_project_dict (dict) – keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].

  • config_dict (dict) – schema matches the JSON Project Tracking Configuration file.

Returns:

schema matches the JSON Project Tracking Configuration file.

Return type:

config_dict (dict)

academic_tracker.helper_functions.are_citations_in_pub_dict(tokenized_citations, pub_dict)[source]

Determine which citations in tokenized_citations are in pub_dict.

For each citation in tokenized_citations see if it is in pub_dict. Will be True for a citation if the PMID matches, DOI matches, or the title is similar enough.

Parameters:
  • tokenized_citations (list) – list of dictionaries where each dictionary is a citation. Matches the tokenized_reference.json schema.

  • pub_dict (dict) – schema matches the publication.json schema.

Returns:

list of bools, True if the citation at that index is in pub_dict, False otherwise.

Return type:

(list)

academic_tracker.helper_functions.create_authors_by_project_dict(config_dict)[source]

Create the authors_by_project_dict dict from the config_dict.

Creates a dict where the keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].

Parameters:

config_dict (dict) – schema matches the JSON Project Tracking Configuration file.

Returns:

keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].

Return type:

authors_by_project_dict (dict)

academic_tracker.helper_functions.create_pub_dict_for_saving_Crossref(work, prev_query)[source]

Create the standard pub_dict from a Crossref query work dict.

Parameters:
  • work (dict) – the dictionary for a publication returned in a Crossref query.

  • prev_query (dict|None) – a dictionary containing publications from a previous query, used for message printing.

Returns:

the ID of the publication (DOI, PMID, or URL). If None, an ID couldn’t be determined. pub_dict (dict|None): the standard pub_dict with values filled in from the Crossref publication. If None, an ID couldn’t be determined.

Return type:

pub_id (str|None)

academic_tracker.helper_functions.create_pub_dict_for_saving_PubMed(pub, include_xml=False)[source]

Convert pymed.PubMedArticle to a dictionary and modify it for saving.

Converts a pymed.PubMedArticle to a dictionary, deletes the “xml” key if include_xml is False, and converts the “publication_date” key to a string.

Parameters:
  • pub (pymed.PubMedArticle) – publication to convert to a dictionary.

  • include_xml (bool) – if True, include the raw XML query in the key “xml”.

Returns:

the ID of the publication (DOI or PMID). pub_dict (dict): pub converted to a dictionary. Keys are “pubmed_id”, “title”, “abstract”, “keywords”, “journal”, “publication_date”, “authors”, “methods”, “conclusions”, “results”, “copyrights”, and “doi”

Return type:

pub_id (str)

academic_tracker.helper_functions.do_strings_fuzzy_match(string1, string2, match_ratio=90)[source]

Fuzzy match the 2 strings and if the ratio is greater than or equal to match_ratio, return True.

Parameters:
  • string1 (str|None) – a string to fuzzy match.

  • string2 (str|None) – a string to fuzzy match.

  • match_ratio (int) – the ratio (0-100) that the match must be greater than or equal to in order to return True.

Returns:

True if strings match, False otherwise.

Return type:

(bool)
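
The threshold logic above can be sketched as follows. This is not the package's code: difflib.SequenceMatcher from the standard library is used as a stand-in scorer, so the ratios may differ from those of whatever fuzzy-matching library academic_tracker relies on. The name fuzzy_match_sketch, the case-insensitive comparison, and the treatment of None as a non-match are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_match_sketch(string1, string2, match_ratio=90):
    """Hypothetical stand-in for do_strings_fuzzy_match."""
    # Assumption: a missing string never matches anything.
    if string1 is None or string2 is None:
        return False
    # Assumption: compare case-insensitively.
    # SequenceMatcher.ratio() is in [0, 1]; scale it to 0-100 to compare with match_ratio.
    ratio = SequenceMatcher(None, string1.lower(), string2.lower()).ratio() * 100
    # "Greater than or equal to" the threshold counts as a match.
    return ratio >= match_ratio


print(fuzzy_match_sketch("Deep learning for citations", "Deep learning for citations"))
print(fuzzy_match_sketch("Deep learning for citations", None))
```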

academic_tracker.helper_functions.extract_ORCID_from_string(string)[source]

Extract an ORCID ID from a string.

Parameters:

string (str) – the string to extract the ID from.

Returns:

either the extracted ID as a string or None.

Return type:

(str|None)
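
A minimal sketch of this kind of extraction, assuming the standard ORCID iD layout of four hyphen-separated groups of four characters whose final character is a digit or “X”. The pattern and the name extract_orcid_sketch are illustrative assumptions, not the regular expression the package uses, and the checksum character is not validated.

```python
import re

# Four hyphen-separated groups of four characters; the last character is a
# checksum that may be a digit or "X". The checksum itself is not verified here.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[0-9X]")


def extract_orcid_sketch(string):
    """Hypothetical stand-in for extract_ORCID_from_string."""
    match = ORCID_PATTERN.search(string)
    # Return the matched ID, or None when the string contains no ID.
    return match.group(0) if match else None


print(extract_orcid_sketch("https://orcid.org/0000-0002-1825-0097"))
```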

academic_tracker.helper_functions.find_common_subphrases(str1, str2, min_len=2)[source]

Find all common subphrases between str1 and str2 longer than min_len.

Modified from https://stackoverflow.com/a/63337541/19957088. Find all common subphrases between the 2 strings, but filter out common subphrases between subphrases. For example, if “sand” is common between the 2 strings the function will not return “and” unless there is another instance of “and” between the 2 strings. A phrase is a string that must end in a space or be at the end of the string. So “sand asdf” and “sand awer” will only match “sand “ and not “sand a”. Spaces are expected to be meaningful. It is recommended to remove punctuation from the strings.

Parameters:
  • str1 (str) – one of the 2 strings to look for common substrings in.

  • str2 (str) – one of the 2 strings to look for common substrings in.

  • min_len (int) – if the length of a substring is less than this, then ignore it.

Returns:

a list of the common substrings.

Return type:

cs_array (list)

academic_tracker.helper_functions.find_duplicate_citations(tokenized_citations)[source]

Find citations that are duplicates of each other in tokenized_citations.

Citations can be duplicates in 3 ways: same PMID, same DOI, or similar enough titles. The function goes through each citation and looks for matches on these criteria. Then the matches are compared to create unique sets. For instance, if citation 1 matches the PMID in citation 2, and citation 2 matches the DOI in citation 3, but citations 1 and 3 don’t match each other, a duplicate set containing all 3 is still created. The unique duplicate sets are returned as a list of sorted lists.

Parameters:

tokenized_citations (list) – list of dictionaries where each dictionary is a citation. Matches the tokenized_reference.json schema.

Returns:

list of lists where each element is a list of indexes in tokenized_citations that match each other. The list of indexes is sorted in ascending order.

Return type:

unique_duplicate_sets (list)
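
The transitive grouping described above can be sketched with a small union-find structure. This illustrates the grouping idea only and is not the package's implementation: the name merge_duplicate_pairs is hypothetical, and the input is assumed to be a list of index pairs already found to match on PMID, DOI, or title.

```python
def merge_duplicate_pairs(pairs):
    """Merge pairwise matches (i, j) into transitive duplicate sets (hypothetical helper)."""
    parent = {}

    def find(x):
        # Register unseen indexes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Join the sets of every matched pair.
    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Collect the indexes under each root and sort for a stable result.
    groups = {}
    for index in parent:
        groups.setdefault(find(index), []).append(index)
    return sorted(sorted(group) for group in groups.values())


# Citation 1 matches 2 and 2 matches 3, so all three land in one set.
print(merge_duplicate_pairs([(1, 2), (2, 3)]))
```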

academic_tracker.helper_functions.fuzzy_matches_to_list(str_to_match, list_to_match)[source]

Return strings and indexes for strings with match ratio that is 90 or higher.

Parameters:
  • str_to_match (str) – string to compare with list

  • list_to_match (list) – list of strings to compare with str_to_match

Returns:

list of matches (tuples), each holding the index in list_to_match and the matching string. [(9, “title 1”), …]

Return type:

(list)

academic_tracker.helper_functions.get_pub_id_in_publication_dict(pub_id, title, publication_dict)[source]

Get the pub_id in publication_dict for the publication that matches the given pub_id or fuzzy matches a title.

Check whether the pub_id is in publication_dict. If it isn’t then see if there is a fuzzy match in titles. It is assumed every dictionary in publication_dict will have a “title” key with a string value.

Parameters:
  • pub_id (str) – pub_id to check against in publication_dict to see if it already exists.

  • title (str) – title corresponding to pub_id to check against titles in publication_dict.

  • publication_dict (dict) – keys are pub_ids and values are pub attributes.

Returns:

the pub_id matched in publication_dict or None if nothing was found.

Return type:

(str|None)
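
The two-step lookup above (exact ID first, fuzzy title second) can be sketched as follows. The name get_pub_id_sketch is hypothetical, and difflib.SequenceMatcher with a threshold of 90 is a stand-in for the package's own fuzzy matching, so borderline titles may be scored differently.

```python
from difflib import SequenceMatcher


def get_pub_id_sketch(pub_id, title, publication_dict, match_ratio=90):
    """Hypothetical stand-in for get_pub_id_in_publication_dict."""
    # Step 1: an exact ID hit needs no title comparison.
    if pub_id in publication_dict:
        return pub_id
    # Step 2: fall back to comparing titles; every entry is assumed to have a "title" string.
    for existing_id, attributes in publication_dict.items():
        ratio = SequenceMatcher(None, title.lower(), attributes["title"].lower()).ratio() * 100
        if ratio >= match_ratio:
            return existing_id
    # Nothing matched on ID or title.
    return None


pubs = {"10.1000/example": {"title": "A study of things"}}
print(get_pub_id_sketch("12345", "A Study of Things", pubs))
```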

academic_tracker.helper_functions.is_fuzzy_match_to_list(str_to_match, list_to_match)[source]

True if string is a 90 or higher ratio match to any string in list, False otherwise.

Parameters:
  • str_to_match (str) – string to compare with list

  • list_to_match (list) – list of strings to compare with str_to_match

Returns:

True if str_to_match is a match to any string in list_to_match, False otherwise.

Return type:

(bool)

academic_tracker.helper_functions.is_pub_in_publication_dict(pub_id, title, publication_dict, titles=None)[source]

True if pub_id is in publication_dict or title is a fuzzy match to any title in titles.

Check whether the pub_id is in publication_dict. If it isn’t then see if there is a fuzzy match in titles. If titles is not provided then get a list of titles from publication_dict.

Parameters:
  • pub_id (str) – pub_id to check against in publication_dict to see if it already exists.

  • title (str) – title corresponding to pub_id to check against titles in publication_dict.

  • publication_dict (dict) – keys are pub_ids and values are pub attributes.

  • titles (list|None) – list of strings that should be titles to fuzzy match to title.

Returns:

True if the pub_id is in publication_dict or title is fuzzy matched in titles, False otherwise

Return type:

(bool)

academic_tracker.helper_functions.match_authors_in_prev_pub(prev_author_list, new_author_list)[source]

Look for matching authors in previous pub data.

Goes through the new_author_list and tries to find a match for each author in the prev_author_list. Any authors that aren’t matched are added to a combined list. Both lists are expected to be lists of dicts, each in one of the following two forms:

{“firstname”: author’s first name, “lastname”: author’s last name, “author_id”: author’s ID, “ORCID”: ORCID ID}

{“collectivename”: collective name, “author_id”: author’s ID, “ORCID”: ORCID ID}

If author_id is missing or None in the prev_author_list, it will be updated in the combined_author_list if the matched author in new_author_list has it. The same can be said for the ORCID attribute.

Parameters:
  • prev_author_list (list) – list of dicts where each dict is the attributes of an author.

  • new_author_list (list) – list of dicts where each dict is the attributes of an author.

Returns:

the prev_author_list updated with “author_id” for dictionaries in the list that matched the given author.

Return type:

combined_author_list (list)

academic_tracker.helper_functions.match_pub_authors_to_citation_authors(citation_authors, author_list)[source]

Try to match authors in pub data to authors in citation data.

Goes through the author_list from a publication and tries matching to an author in citation_authors using last name or ORCID if ORCID is present.

Parameters:
  • citation_authors (list) – list of dicts where each dict can have last name and ORCID or collective_name and ORCID.

  • author_list (list) – list of dicts where each dict is the attributes of an author.

Returns:

True if an author was matched, False otherwise.

Return type:

(bool)

academic_tracker.helper_functions.match_pub_authors_to_config_authors(authors_json, author_list)[source]

Try to match authors in pub data to authors in config data.

Goes through the author_list from a publication and tries matching to an author in authors_json using firstname, lastname, and affiliations, or ORCID if ORCID is present.

Parameters:
  • authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.

  • author_list (list) – list of dicts where each dict is the attributes of an author.

Returns:

either the author list with matched authors containing an additional author_id and/or ORCID attribute, or an empty list if no authors were matched.

Return type:

author_list (list)

academic_tracker.helper_functions.normalize_DOI(doi_string)[source]

academic_tracker.helper_functions.regex_group_return(regex_groups, group_index)[source]

Return the group in the regex_groups indicated by group_index if it exists, else return empty string.

If group_index is out of range of the regex_groups, an empty string is returned.

Parameters:
  • regex_groups (tuple) – A tuple returned from a matched regex.groups() call.

  • group_index (int) – The index of the regex_groups to return.

Returns:

Either an empty string or the group string matched by the regex.

Return type:

(str)

academic_tracker.helper_functions.regex_match_return(regex, string_to_match)[source]

Return the groups matched in the regex if the regex matches.

regex is delivered to re.match() with string_to_match, and if there is a match the match.groups() is returned, otherwise an empty tuple is returned.

Parameters:
  • regex (str) – A string with a regular expression to be delivered to re.match().

  • string_to_match (str) – The string to match with the regex.

Returns:

either the tuple of the matched groups in the regex or an empty tuple if a match wasn’t found.

Return type:

(tuple)

academic_tracker.helper_functions.regex_search_return(regex, string_to_search)[source]

Return the groups matched in the regex if the regex matches.

regex is delivered to re.search() with string_to_search, and if there is a match the match.groups() is returned, otherwise an empty tuple is returned.

Parameters:
  • regex (str) – A string with a regular expression to be delivered to re.search().

  • string_to_search (str) – The string to match with the regex.

Returns:

either the tuple of the matched groups in the regex or an empty tuple if a match wasn’t found.

Return type:

(tuple)
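
The three regex helpers documented above can be sketched with the standard re module. The names ending in _sketch are hypothetical stand-ins, and rejecting negative indexes in regex_group_sketch is an assumption. The example mainly shows the difference between re.match(), which only matches at the start of the string, and re.search(), which scans the whole string.

```python
import re


def regex_match_sketch(regex, string_to_match):
    """Groups from re.match(), or an empty tuple when there is no match."""
    match = re.match(regex, string_to_match)
    return match.groups() if match else ()


def regex_search_sketch(regex, string_to_search):
    """Groups from re.search(), or an empty tuple when there is no match."""
    match = re.search(regex, string_to_search)
    return match.groups() if match else ()


def regex_group_sketch(regex_groups, group_index):
    """The group at group_index, or an empty string when the index is out of range."""
    # Assumption: negative indexes are treated as out of range.
    if 0 <= group_index < len(regex_groups):
        return regex_groups[group_index]
    return ""


# The year is not at the start of the string, so only search finds it.
print(regex_match_sketch(r"(\d{4})", "Published in 2021"))
print(regex_search_sketch(r"(\d{4})", "Published in 2021"))
```

Returning an empty tuple or empty string instead of None lets callers chain these helpers without checking for a failed match at every step.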

academic_tracker.helper_functions.vprint(*args, verbosity=0)[source]

Print depending on the state of VERBOSE, SILENT, and verbosity.

If the global SILENT is True don’t print anything. If verbosity is 0 then print. If verbosity is 1 then VERBOSE must be True to print.

Parameters:

verbosity (int) – Either 0 or 1 for different levels of verbosity.
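
The printing rule above can be sketched as follows. The globals VERBOSE and SILENT are named in the description, but how the package sets them is not shown here, so the module-level defaults and the name vprint_sketch are assumptions.

```python
# Assumed module-level flags; in the real package these would be set elsewhere.
VERBOSE = False
SILENT = False


def vprint_sketch(*args, verbosity=0):
    """Hypothetical stand-in for vprint."""
    # SILENT suppresses all output regardless of verbosity.
    if SILENT:
        return
    # Level 0 always prints; level 1 prints only when VERBOSE is True.
    if verbosity == 0 or (verbosity == 1 and VERBOSE):
        print(*args)


vprint_sketch("shown by default")
vprint_sketch("shown only when VERBOSE is True", verbosity=1)
```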

Webio

General functions that interface with the internet.

academic_tracker.webio.clean_tags_from_url(url)[source]

Remove tags from webpage.

Remove tags from a webpage so it looks more like what a user would see in a browser.

Parameters:

url (str) – the URL to query.

Returns:

webpage contents cleaned of tags.

Return type:

clean_url (str)

academic_tracker.webio.get_DOI_from_Crossref(title, mailto_email)[source]

Search title on Crossref and try to find a DOI for it.

Parameters:
  • title (str) – string of the title of the journal article to search for.

  • mailto_email (str) – an email address needed to search Crossref more effectively.

Returns:

Either None or the DOI of the article title. The DOI will not be a URL.

Return type:

doi (str|None)

academic_tracker.webio.get_url_contents_as_str(url)[source]

Query the url and return its contents as a string.

Parameters:

url (str) – the URL to query.

Returns:

Either the website as a string or None if an error occurred.

Return type:

(str|None)

academic_tracker.webio.search_Google_Scholar_for_ids(authors_json)[source]

Query Google Scholar with author names and get Scholar IDs.

Authors who already have a scholar_id or who have no affiliations are skipped.

Parameters:

authors_json (dict) – JSON matching the Authors section of the Configuration file.

Returns:

the authors_json modified with any Google Scholar IDs found.

Return type:

authors_json (dict)

academic_tracker.webio.search_ORCID_for_ids(ORCID_key, ORCID_secret, authors_json)[source]

Query ORCID with author names and get ORCID IDs.

Authors who already have an ORCID or who have no affiliations are skipped.

Parameters:
  • ORCID_key (str) – key assigned to your registered application from ORCID.

  • ORCID_secret (str) – secret given to you by ORCID.

  • authors_json (dict) – JSON matching the Authors section of the Configuration file.

Returns:

the authors_json modified with any ORCID IDs found.

Return type:

authors_json (dict)

academic_tracker.webio.send_emails(email_messages)[source]

Uses sendmail to send email_messages to authors.

Only works on systems with sendmail installed.

Parameters:

email_messages (dict) – keys are author names and values are the message

Emails and Reports Helpers

Functions for creating emails and reports that are common to both author_search and reference_search.