API
This package has the following modules:
user_input_checking
This module contains functions for checking user input.
tracker_schema
This module contains the schema for validating user input.
athr_srch_modularized
This module contains functions to complete the author_search command modularized into pieces.
athr_srch_webio
This module contains functions for author_search to interface with the internet.
athr_srch_emails_and_reports
This module contains functions to create emails and reports for author_search.
ref_srch_modularized
This module contains functions to complete the reference_search command modularized into pieces.
ref_srch_webio
This module contains functions for reference_search to interface with the internet.
ref_srch_emails_and_reports
This module contains functions to create emails and reports for reference_search.
citation_parsing
This module contains functions for parsing references and citations for reference_search.
fileio
This module contains functions for reading and writing files.
helper_functions
This module contains functions that help the other modules function. The functions do things such as fuzzy matching, regex searching, and printing.
emails_and_reports_helpers
This module contains functions that help create emails and reports.
webio
This module contains general functions for interfacing with the internet.
User Input Checking
Functions that check the user input for errors.
- academic_tracker.user_input_checking.cli_inputs_check(args)[source]
Run input checking on the CLI inputs.
Uses jsonschema to validate the inputs.
- Parameters:
args (dict) – dict from docopt.
- academic_tracker.user_input_checking.config_file_check(config_json, no_ORCID, no_GoogleScholar, no_Crossref, no_PubMed)[source]
Check that the configuration JSON file is as expected.
The JSON Schema used for validation is in the tracker_schema module.
- Parameters:
config_json (dict) – dict with the same structure as the configuration JSON file.
no_ORCID (bool) – if True delete the part of the schema that checks ORCID attributes.
no_GoogleScholar (bool) – if True and no_Crossref is True delete the part of the schema that checks Crossref attributes.
no_Crossref (bool) – if True and no_GoogleScholar is True delete the part of the schema that checks Crossref attributes.
no_PubMed (bool) – if True delete the part of the schema that checks PubMed attributes.
- academic_tracker.user_input_checking.config_report_check(config_json)[source]
Check that the report attributes don’t have conflicts.
Make sure that the values in sort and column_order are in columns, and that every column is in column_order.
- Parameters:
config_json (dict) – dict with the same structure as the configuration JSON file.
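The consistency rule described for config_report_check can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name find_report_conflicts and the example report dict are hypothetical, and only the attribute names columns, sort, and column_order come from the description above.

```python
def find_report_conflicts(report):
    """Return a list of conflict messages for one report dict (hypothetical helper)."""
    # Assumption: "columns" maps column names to templates; a list of names also works.
    columns = set(report.get("columns", {}))
    sort = report.get("sort", [])
    column_order = report.get("column_order", [])

    problems = []
    # Every value in sort must name a defined column.
    for name in sort:
        if name not in columns:
            problems.append(f"sort value '{name}' is not in columns")
    # Every value in column_order must name a defined column.
    for name in column_order:
        if name not in columns:
            problems.append(f"column_order value '{name}' is not in columns")
    # Every defined column must appear in column_order.
    for name in sorted(columns - set(column_order)):
        problems.append(f"column '{name}' is missing from column_order")
    return problems
```

An empty list means the report attributes are consistent. How the real function reports conflicts is not shown here.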
- academic_tracker.user_input_checking.prev_pubs_file_check(prev_pubs)[source]
Run input checking on prev_pubs dict.
The JSON Schema used for validation is in the tracker_schema module.
- Parameters:
prev_pubs (dict) – dict with the same structure as the previous publications JSON file.
- academic_tracker.user_input_checking.ref_config_file_check(config_json, no_Crossref, no_PubMed)[source]
Check that the configuration JSON file is as expected.
The JSON Schema used for validation is in the tracker_schema module.
- academic_tracker.user_input_checking.tok_reference_check(tok_ref)[source]
Run input checking on tok_ref dict.
The JSON Schema used for validation is in the tracker_schema module.
- Parameters:
tok_ref (dict) – dict with the same structure as the tokenized reference JSON file.
- academic_tracker.user_input_checking.tracker_validate(instance, schema, pattern_messages={}, cls=None, *args, **kwargs)[source]
Wrapper around jsonschema.validate to give better error messages.
- Parameters:
- Raises:
jsonschema.ValidationError – If an unexpected jsonschema error happens this is raised rather than a system exit.
Author Search Modularized
Modularized pieces of author_search.
- academic_tracker.athr_srch_modularized.build_publication_dict(config_dict, prev_pubs, no_ORCID, no_GoogleScholar, no_Crossref, no_PubMed)[source]
Query PubMed, ORCID, Google Scholar, and Crossref for publications for the authors.
- Parameters:
config_dict (dict) – Matches the Configuration file JSON schema.
prev_pubs (dict) – Matches the publication JSON schema. Used to ignore publications when querying.
no_ORCID (bool) – If True, do not search ORCID.
no_GoogleScholar (bool) – If True, do not search Google Scholar.
no_Crossref (bool) – If True, do not search Crossref.
no_PubMed (bool) – If True, do not search PubMed.
- Returns:
The dictionary matching the publication JSON schema. all_queries (dict): The pubs searched for each source and each author. {“PubMed”:{“author1”:[pub1, …], …}, “ORCID”:{“author1”:[pub1, …], …}, “Google Scholar”:{“author1”:[pub1, …], …}, “Crossref”:{“author1”:[pub1, …], …}}
- Return type:
running_pubs (dict)
- academic_tracker.athr_srch_modularized.generate_internal_data_and_check_authors(config_dict)[source]
Create authors_by_project_dict and look for authors without projects.
- Parameters:
config_dict (dict) – Matches the Configuration file JSON schema.
- Returns:
Keys are project names and values are a dictionary of authors and their attributes. config_dict (dict): same as input but with author information updated based on project information.
- Return type:
authors_by_project_dict (dict)
- academic_tracker.athr_srch_modularized.input_reading_and_checking(config_json_filepath, no_ORCID, no_GoogleScholar, no_Crossref, no_PubMed)[source]
Read in inputs from user and do error checking.
- Parameters:
config_json_filepath (str) – filepath to the configuration JSON.
no_ORCID (bool) – If True, do not search ORCID. Reduces checking on config JSON if True.
no_GoogleScholar (bool) – If True, do not search Google Scholar. Reduces checking on config JSON if True.
no_Crossref (bool) – If True, do not search Crossref. Reduces checking on config JSON if True.
no_PubMed (bool) – If True, do not search PubMed. Reduces checking on config JSON if True.
- Returns:
Matches the Configuration file JSON schema.
- Return type:
config_dict (dict)
- academic_tracker.athr_srch_modularized.save_and_send_reports_and_emails(authors_by_project_dict, publication_dict, config_dict, test)[source]
Build the summary report and project reports and email them.
- Parameters:
authors_by_project_dict (dict) – Keys are project names and values are a dictionary of authors and their attributes.
publication_dict (dict) – The dictionary matching the publication JSON schema.
config_dict (dict) – Matches the Configuration file JSON schema.
test (bool) – If True save_dir_name is tracker-test instead of tracker- and emails are not sent.
- Returns:
Name of the directory where the emails and reports were saved.
- Return type:
save_dir_name (str)
Author Search Webio
Internet interfacing for author_search.
- academic_tracker.athr_srch_webio.search_Crossref_for_pubs(running_pubs, authors_json, mailto_email, prev_query=None)[source]
Searches Crossref for publications by each author.
For each author in authors_json Crossref is queried for the publications. The list of publications is then filtered by affiliations and cutoff_year. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. Each remaining publication is then checked for citations of any of the author’s grants. If prev_query is given, then publications will be taken from it instead of querying Crossref again.
- Parameters:
running_pubs (dict) – dictionary of publications matching the JSON schema for publications.
authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.
mailto_email (str) – used in the query to Crossref.
prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}
- Returns:
keys are publication ids and values are a dictionary with publication attributes all_pubs (dict): a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.
- Return type:
running_pubs (dict)
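The filtering and prev_query behavior shared by the search functions in this module can be sketched as follows. This is only an illustration under stated assumptions: the function names, the fetch callable, the publication keys publication_year and affiliations, and the case-insensitive substring match are all hypothetical. Only cutoff_year, affiliations, and prev_query come from the descriptions.

```python
def publications_for(author, fetch, prev_query=None):
    """Return the raw publications for an author (hypothetical helper)."""
    # When a previous query is given, reuse it instead of querying the source again.
    if prev_query is not None:
        return prev_query.get(author, [])
    return fetch(author)


def filter_publications(pubs, author_attributes):
    """Keep publications that pass the cutoff_year and affiliation checks."""
    cutoff_year = author_attributes["cutoff_year"]
    wanted = [a.lower() for a in author_attributes["affiliations"]]
    kept = []
    for pub in pubs:
        # Skip publications published before the cutoff year.
        if pub["publication_year"] < cutoff_year:
            continue
        # Skip publications with no matching affiliation.
        # Assumption: a match is a case-insensitive substring match.
        pub_affiliations = [a.lower() for a in pub["affiliations"]]
        if not any(w in p for w in wanted for p in pub_affiliations):
            continue
        kept.append(pub)
    return kept
```

Passing the querying step in as a callable keeps the sketch independent of any particular web API.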
- academic_tracker.athr_srch_webio.search_Google_Scholar_for_pubs(running_pubs, authors_json, mailto_email, prev_query=None)[source]
Searches Google Scholar for publications by each author.
For each author in authors_json Google Scholar is queried for the publications. The list of publications is then filtered by affiliations and cutoff_year. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. If prev_query is given, then publications will be taken from it instead of querying Google Scholar again.
- Parameters:
running_pubs (dict) – dictionary of publications matching the JSON schema for publications.
authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.
mailto_email (str) – used in the query to Crossref when trying to find DOIs for the articles.
prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}
- Returns:
keys are publication ids and values are a dictionary with publication attributes all_pubs (dict): a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.
- Return type:
running_pubs (dict)
- academic_tracker.athr_srch_webio.search_ORCID_for_pubs(running_pubs, ORCID_key, ORCID_secret, authors_json, prev_query=None)[source]
Searches ORCID for publications by each author.
For each author in authors_json ORCID is queried for the publications. The list of publications is then filtered by affiliations and cutoff_year. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. If prev_query is given, then publications will be taken from it instead of querying ORCID again.
- Parameters:
running_pubs (dict) – dictionary of publications matching the JSON schema for publications.
ORCID_key (str) – string of the app key ORCID gives when you register the app with them
ORCID_secret (str) – string of the secret ORCID gives when you register the app with them
authors_json (dict) – keys are authors and values are author attributes. Matches authors JSON schema.
prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}
- Returns:
keys are publication ids and values are a dictionary with publication attributes all_pubs (dict): a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.
- Return type:
running_pubs (dict)
- academic_tracker.athr_srch_webio.search_PubMed_for_pubs(running_pubs, authors_json, from_email, prev_query=None)[source]
Searches PubMed for publications by each author.
For each author in authors_json PubMed is queried for the publications. The list of publications is then filtered by affiliations and cutoff_year. If the publication is already in running_pubs, then it tries to fill in missing information from this source. If the author doesn’t have at least one matching affiliation, then the publication is skipped. If the publication was published before the cutoff_year, then it is skipped. If prev_query is given, then publications will be taken from it instead of querying PubMed again.
- Parameters:
running_pubs (dict) – dictionary of publications matching the JSON schema for publications.
authors_json (dict) – keys are authors and values are author attributes. Matches Authors section of configuration JSON schema.
from_email (str) – used in the query to PubMed.
prev_query (dict|None) – a dictionary containing publications from a previous call to this function. {author1: [pub1, …], …}
- Returns:
keys are publication ids and values are a dictionary with publication attributes all_pubs (dict): a dictionary where the keys are the authors in authors_json and the values are a list of the publications queried for them.
- Return type:
running_pubs (dict)
Author Search Emails and Reports
Functions to create emails and reports for author_search.
- academic_tracker.athr_srch_emails_and_reports.build_author_loop(publication_dict, config_dict, authors_by_project_dict, project_name, template_string)[source]
Replace tags in template_string with the appropriate information.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].
project_name (str) – The name of the project.
template_string (str) – Template used to create the project report.
- Returns:
The string built by looping over the authors in authors_by_project_dict and using the template_string to build a report.
- Return type:
project_authors (str)
- academic_tracker.athr_srch_emails_and_reports.create_collaborator_report(publication_dict, template, author, pubs, filename, save_dir_name)[source]
Create a collaborator report from a formatted string.
Loop over all of the author’s publications and create a
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
author (str) – The key to the author in config_dict[“Authors”].
pubs (dict) – Keys are publications for the author and values are the grants associated with that pub.
filename (str) – filename to save the publication under.
save_dir_name (str) – directory to save the report in.
- Returns:
The text of the report or an empty string.
- Return type:
report (str)
- academic_tracker.athr_srch_emails_and_reports.create_collaborators_reports_and_emails(publication_dict, config_dict, save_dir_name)[source]
Create a report of collaborators for authors in publication_dict.
For each author in publication_dict with an author_id create a csv file with the other authors on their publications.
- Parameters:
- Returns:
keys and values match the email JSON file.
- Return type:
email_messages (dict)
- academic_tracker.athr_srch_emails_and_reports.create_project_report(publication_dict, config_dict, authors_by_project_dict, project_name, template_string='<author_loop><author_first> <author_last>:<pub_loop>\n\tTitle: <title> \n\tAuthors: <authors> \n\tJournal: <journal> \n\tDOI: <DOI> \n\tPMID: <PMID> \n\tPMCID: <PMCID> \n\tGrants: <grants>\n</pub_loop>\n</author_loop>', author_first='', author_last='')[source]
Create the project report for the project.
The details of creating project reports are outlined in the documentation. Use the information in the config_dict, publication_dict, and authors_by_project_dict to fill in the information in the template_string. If author_first is given then it is assumed the report is actually for a single author and not a whole project.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].
project_name (str) – The name of the project.
template_string (str) – Template used to create the project report.
author_first (str) – First name of the author. If not “” the report is assumed to be for 1 author.
author_last (str) – Last name of the author.
- Returns:
The template_string with the appropriate tags replaced with relevant information.
- Return type:
template_string (str)
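The tag replacement idea behind the report templates can be sketched as follows. This is a simplified illustration, not the package's implementation: the function name fill_pub_loop is hypothetical, it handles only a single <pub_loop> section, and it assumes each publication is a flat dict whose keys match the tag names. The tags <pub_loop>, <title>, and <DOI> are taken from the default template_string shown in the signature.

```python
import re


def fill_pub_loop(template_string, pubs):
    """Expand the <pub_loop>...</pub_loop> section once per publication."""
    match = re.search(r"<pub_loop>(.*?)</pub_loop>", template_string, flags=re.DOTALL)
    if match is None:
        # No loop section, so there is nothing to expand.
        return template_string

    body = match.group(1)
    rendered = []
    for pub in pubs:
        text = body
        # Replace each <tag> with the publication's value for that key.
        for tag, value in pub.items():
            text = text.replace(f"<{tag}>", str(value))
        rendered.append(text)

    # Splice the expanded sections back in place of the loop markers.
    return template_string[:match.start()] + "".join(rendered) + template_string[match.end():]
```

The author and project loops in the default templates would nest in the same way, with each outer loop expanding its body before the inner loops are filled in.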
- academic_tracker.athr_srch_emails_and_reports.create_project_reports_and_emails(authors_by_project_dict, publication_dict, config_dict, save_dir_name)[source]
Create project reports and emails for each project.
For each project in config_dict create a report and optional email. Reports are saved in save_dir_name as they are created.
- Parameters:
authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
save_dir_name (str) – directory to save the reports in.
- Returns:
keys and values match the email JSON file.
- Return type:
email_messages (dict)
- academic_tracker.athr_srch_emails_and_reports.create_pubs_by_author_dict(publication_dict)[source]
Create a dictionary with authors as the keys and their pub_ids and grants as the values.
Organizes the publication information in an author focused way so other operations are easier.
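A minimal sketch of this author-focused reorganization is shown below. The output shape (authors as keys, values a dictionary of pub_ids with their grants) follows the descriptions in this module. The input keys authors, author_id, and grants inside each publication, and the function name, are assumptions made for illustration.

```python
def pubs_by_author(publication_dict):
    """Invert {pub_id: publication} into {author_id: {pub_id: grants}}."""
    result = {}
    for pub_id, pub in publication_dict.items():
        for author in pub.get("authors", []):
            author_id = author.get("author_id")
            if author_id is None:
                # Assumption: authors without an author_id are not tracked.
                continue
            result.setdefault(author_id, {})[pub_id] = pub.get("grants", [])
    return result
```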
- academic_tracker.athr_srch_emails_and_reports.create_summary_report(publication_dict, config_dict, authors_by_project_dict, template_string='<project_loop><project_name>\n<author_loop>\t<author_first> <author_last>:<pub_loop>\n\t\tTitle: <title> \n\t\tAuthors: <authors> \n\t\tJournal: <journal> \n\t\tDOI: <DOI> \n\t\tPMID: <PMID> \n\t\tPMCID: <PMCID> \n\t\tGrants: <grants>\n</pub_loop>\n</author_loop></project_loop>')[source]
Create the summary report for the run.
The details of creating summary reports are outlined in the documentation. Use the information in the config_dict, publication_dict, and authors_by_project_dict to fill in the information in the template_string.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].
template_string (str) – Template used to create the project report.
- Returns:
The report built by replacing the appropriate tags in template_string with relevant information.
- Return type:
report_string (str)
- academic_tracker.athr_srch_emails_and_reports.create_tabular_collaborator_report(publication_dict, config_dict, author, pubs, filename, file_format, save_dir_name)[source]
Create a table for a collaborator report and save as either csv or xlsx.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
author (str) – The key to the author in config_dict[“Authors”].
pubs (dict) – Keys are publications for the author and values are the grants associated with that pub.
filename (str) – filename to save the publication under.
file_format (str) – csv or xlsx, determines what format to save in.
save_dir_name (str) – directory to save the report in.
- Returns:
The text of the report, empty string, or path to the saved xlsx file. filename (str): Filename of the report. May have had an .xlsx added to the end.
- Return type:
report (str)
- academic_tracker.athr_srch_emails_and_reports.create_tabular_project_report(publication_dict, config_dict, authors_by_project_dict, pubs_by_author_dict, project_name, report_attributes, save_dir_name, filename)[source]
Create a pandas DataFrame and save it as Excel or CSV.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].
pubs_by_author_dict (dict) – dictionary where the keys are authors and the values are a dictionary of pub_ids with their associated grants.
project_name (str) – Name of the project.
report_attributes (dict) – Dictionary of the report attributes. Could come from project_descriptions or an author.
save_dir_name (str) – directory to save the report in.
filename (str) – Filename of the report.
- Returns:
Either the text of the report if csv or a relative filepath to where the Excel file is saved. filename (str): Filename of the report. May have had an .xlsx added to the end.
- Return type:
report (str)
- academic_tracker.athr_srch_emails_and_reports.create_tabular_summary_report(publication_dict, config_dict, authors_by_project_dict, save_dir_name)[source]
Create a pandas DataFrame and save it as Excel or CSV.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
authors_by_project_dict (dict) – keys are project names from the config file and values are pulled from config_dict[“Authors”].
save_dir_name (str) – directory to save the report in.
- Returns:
Either the text of the report if csv or a relative filepath to where the Excel file is saved. filename (str): Filename of the report. May have had an .xlsx added to the end.
- Return type:
report (str)
Reference Search Modularized
Modularized pieces of reference_search.
- academic_tracker.ref_srch_modularized.build_publication_dict(config_dict, tokenized_citations, no_Crossref, no_PubMed)[source]
Query PubMed and Crossref for publications matching the citations in tokenized_citations.
- Parameters:
config_dict (dict) – Matches the Configuration file JSON schema.
tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.
no_Crossref (bool) – If True, do not search Crossref. Reduces checking on config JSON if True.
no_PubMed (bool) – If True, do not search PubMed. Reduces checking on config JSON if True.
- Returns:
The dictionary matching the publication JSON schema. tokenized_citations (list): Same list as the input but with the pub_dict_key updated to match the publication found. all_queries (dict): for each source searched a list of lists, each index is the pubs searched through after querying until the citation was matched, {“PubMed”:[[pub1, …], …], “Crossref”:[[pub1, …], …]}
- Return type:
running_pubs (dict)
- academic_tracker.ref_srch_modularized.input_reading_and_checking(config_json_filepath, ref_path_or_URL, MEDLINE_reference, no_Crossref, no_PubMed, prev_pub_filepath, remove_duplicates)[source]
Read in inputs from user and do error checking.
- Parameters:
config_json_filepath (str) – filepath to the configuration JSON.
ref_path_or_URL (str) – either a filepath to file to tokenize or a URL to tokenize.
MEDLINE_reference (bool) – If True ref_path_or_URL is a filepath to a MEDLINE formatted file.
no_Crossref (bool) – If True, do not search Crossref. Reduces checking on config JSON if True.
no_PubMed (bool) – If True, do not search PubMed. Reduces checking on config JSON if True.
prev_pub_filepath (str or None) – filepath to the publication JSON to read in.
remove_duplicates (bool) – if True, remove duplicate entries in tokenized citations.
- Returns:
Matches the Configuration file JSON schema. tokenized_citations (list): list of dicts. Matches the tokenized citations JSON schema. has_previous_pubs (bool): True if a prev_pub file was input, False otherwise. prev_pubs (dict): The contents of the prev_pub file input by the user if provided.
- Return type:
config_dict (dict)
- academic_tracker.ref_srch_modularized.save_and_send_reports_and_emails(config_dict, tokenized_citations, publication_dict, prev_pubs, has_previous_pubs, test)[source]
Build the summary report and email it.
- Parameters:
config_dict (dict) – Matches the Configuration file JSON schema.
tokenized_citations (list) – list of dicts. Matches the tokenized citations JSON schema.
publication_dict (dict) – The dictionary matching the publication JSON schema.
prev_pubs (dict) – The contents of the prev_pub file input by the user if provided.
has_previous_pubs (bool) – True if a prev_pub file was input, False otherwise.
test (bool) – If True save_dir_name is tracker-test instead of tracker- and emails are not sent.
- Returns:
Name of the directory where the emails and report were saved.
- Return type:
save_dir_name (str)
Reference Search Webio
Internet interfacing for reference_search.
- academic_tracker.ref_srch_webio.build_pub_dict_from_PMID(PMID_list, from_email)[source]
Query PubMed for each PMID and build a dictionary of the returned data.
- academic_tracker.ref_srch_webio.parse_myncbi_citations(url)[source]
Tokenize the citations on a MyNCBI URL.
Note that authors and title can be missing or empty from the webpage. This function assumes the url is the first page of the MyNCBI citations. The first page is tokenized and then each subsequent page is visited and tokenized.
- academic_tracker.ref_srch_webio.search_references_on_source(source, running_pubs, tokenized_citations, mailto_email, prev_query=None)[source]
Searches source for publications matching the citations.
For each citation in tokenized_citations the source is queried for the publication. If the publication is already in running_pubs then missing information will be filled in if possible.
Possible sources are “Crossref” or “PubMed”.
- Parameters:
source (str) – must be one of “Crossref” or “PubMed”.
running_pubs (dict) – dictionary of publications matching the JSON schema for publications.
tokenized_citations (list) – list of citations parsed from a source. Each citation is a dict {“authors”, “title”, “DOI”, “PMID”, “reference_line”, “pub_dict_key”}.
mailto_email (str) – email provided to the source when querying.
prev_query (list|None) – a list of lists containing publications from a previous call to this function. [[pub1, …], [pub1, …], …]
- Returns:
keys are publication ids and values are a dictionary with publication attributes matching_key_for_citation (list): list of keys to the publication matching the citation at the same index all_pubs (list): list of lists, each index is the pubs searched through after querying until the citation was matched
- Return type:
running_pubs (dict)
- academic_tracker.ref_srch_webio.tokenize_reference_input(reference_input, MEDLINE_reference, remove_duplicates=True)[source]
Tokenize the citations in reference_input.
reference_input can be a URL or filepath. MyNCBI URLs are handled specially, but all other URLs are read as a text document and parsed line by line. If the format of the reference is MEDLINE, then set MEDLINE_reference to True and it will be parsed as such instead of line by line. Otherwise, citations are expected to be 1 per line.
- Parameters:
- Returns:
the citations tokenized in a dictionary matching the tokenized citations JSON schema.
- Return type:
tokenized_citations (dict)
Reference Search Emails and Reports
Functions to create emails and reports for reference_search.
- academic_tracker.ref_srch_emails_and_reports.convert_tokenized_authors_to_str(authors)[source]
Combine authors into a comma separated string.
Each author is written as first_name last_name; if the first name is missing, last_name initials is used instead. Example: first_name1 last_name1, last_name2 initials2
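The formatting rule for authors can be sketched as follows. This is an illustration only: the function name and the token keys first, last, and initials are assumptions, since the structure of a tokenized author is not spelled out here.

```python
def authors_to_str(authors):
    """Join tokenized authors into one comma separated string (hypothetical helper)."""
    parts = []
    for author in authors:
        first = author.get("first", "")
        last = author.get("last", "")
        initials = author.get("initials", "")
        if first:
            # Preferred form: first name followed by last name.
            parts.append(f"{first} {last}".strip())
        else:
            # Fallback form when the first name is missing: last name then initials.
            parts.append(f"{last} {initials}".strip())
    return ", ".join(parts)
```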
- academic_tracker.ref_srch_emails_and_reports.create_report_from_template(publication_dict, is_citation_in_prev_pubs_list, tokenized_citations, template_string='<pub_loop>Reference Line:\n\t<ref_line>\nTokenized Reference:\n\tAuthors: <tok_authors>\n\tTitle: <tok_title>\n\tPMID: <tok_PMID>\n\tDOI: <tok_DOI>\nQueried Information:\n\tDOI: <DOI>\n\tPMID: <PMID>\n\tPMCID: <PMCID>\n\tGrants: <grants>\n\n</pub_loop>')[source]
Create project report based on template_string.
Loop over each publication in publication_dict and build a report based on the tags in the template_string. Details about reports are in the documentation.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
is_citation_in_prev_pubs_list (list) – list of bools that indicate whether or not the citation at the same index in tokenized_citations is in the prev_pubs
tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.
template_string (str) – string with tags indicating what information to put in the report.
- Returns:
text of the created report.
- Return type:
report (str)
- academic_tracker.ref_srch_emails_and_reports.create_tabular_report(publication_dict, config_dict, is_citation_in_prev_pubs_list, tokenized_citations, save_dir_name)[source]
Create a pandas DataFrame and save it as Excel or CSV.
- Parameters:
publication_dict (dict) – keys and values match the publications JSON file.
config_dict (dict) – keys and values match the project tracking configuration JSON file.
is_citation_in_prev_pubs_list (list) – list of bools that indicate whether or not the citation at the same index in tokenized_citations is in the prev_pubs
tokenized_citations (list) – list of dicts. Matches the JSON schema for tokenized citations.
save_dir_name (str) – directory to save the report in.
- Returns:
Either the text of the report if csv or a relative filepath to where the Excel file is saved. filename (str): Filename of the report. May have had an .xlsx added to the end.
- Return type:
report (str)
- academic_tracker.ref_srch_emails_and_reports.create_tokenization_report(tokenized_citations)[source]
Create a report that details all the information about how a reference was tokenized.
Intended as a troubleshooting report.
Citation Parsing
Functions for parsing citations.
- academic_tracker.citation_parsing.parse_MEDLINE_format(text_string)[source]
Tokenize text_string based on it being of the MEDLINE format.
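For orientation, MEDLINE formatted text stores one field per line as a short tag, a hyphen, and a value (for example PMID, TI for title, AU for author). Long values continue on following lines indented by six spaces, and records are separated by blank lines. The sketch below groups such text into records. It is a simplified illustration with a hypothetical function name, and it does not reproduce how parse_MEDLINE_format builds tokenized citations.

```python
CONTINUATION_PREFIX = " " * 6  # continuation lines are indented by six spaces


def parse_medline_records(text_string):
    """Group MEDLINE text into a list of {tag: [values]} dicts (hypothetical helper)."""
    records = []
    current = None
    last_tag = None
    for line in text_string.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            current, last_tag = None, None
            continue
        if line.startswith(CONTINUATION_PREFIX) and current is not None and last_tag is not None:
            # Continuation of the previous field's value.
            current[last_tag][-1] += " " + line.strip()
            continue
        tag, separator, value = line.partition("- ")
        if not separator:
            # Not a "TAG - value" line, so ignore it.
            continue
        tag = tag.strip()
        if current is None:
            current = {}
            records.append(current)
        # Tags such as AU can repeat, so values are collected in lists.
        current.setdefault(tag, []).append(value.strip())
        last_tag = tag
    return records
```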
- academic_tracker.citation_parsing.parse_text_for_citations(text)[source]
Parse text line by line and tokenize it.
The function is aware of MLA, APA, Chicago, Harvard, and Vancouver style citations. Although these styles define standards for citations, in practice the standards are not strictly followed, so the function takes a heuristic approach.
- academic_tracker.citation_parsing.tokenize_APA_or_Harvard_authors(authors_string)[source]
Tokenize authors based on APA or Harvard citation style.
- academic_tracker.citation_parsing.tokenize_MLA_or_Chicago_authors(authors_string)[source]
Tokenize authors based on MLA or Chicago citation style.
- academic_tracker.citation_parsing.tokenize_Vancouver_authors(authors_string)[source]
Tokenize authors based on Vancouver citation style.
- academic_tracker.citation_parsing.tokenize_myncbi_citations(html)[source]
Tokenize the citations on a MyNCBI HTML page.
Note that authors and title can be missing or empty from the webpage.
Fileio
This module contains the functions that read and write files.
- academic_tracker.fileio.load_json(filepath)[source]
Adds error checking around loading a json file.
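One way to add error checking around loading a JSON file is sketched below, using only the standard library. The function name load_json_checked and the choice to raise exceptions are assumptions for illustration; how load_json itself reports errors is not described here.

```python
import json
import os


def load_json_checked(filepath):
    """Load a JSON file, raising a descriptive error if that fails."""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"No such file: {filepath}")
    with open(filepath, "r", encoding="utf-8") as handle:
        try:
            return json.load(handle)
        except json.JSONDecodeError as error:
            # Re-raise with the filepath so the user knows which file is malformed.
            raise ValueError(f"{filepath} is not valid JSON: {error}") from error
```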
- academic_tracker.fileio.read_previous_publications(filepath)[source]
Read in the previous publication json file.
If the prev_pub option was given by the user then that filepath is used to read in the file and it is checked to make sure the json is a list and each value is a string. If the prev_pub option was not given then look for a “tracker-timestamp” directory in the current working directory and if it has a publications.json file then read in that file. If no previous publications are found then an empty dict is returned for prev_pubs.
- academic_tracker.fileio.read_text_from_docx(doc_path)[source]
Open docx file at doc_path and read contents into a string.
- academic_tracker.fileio.read_text_from_txt(doc_path)[source]
Open txt or csv file at doc_path and read contents into a string.
- academic_tracker.fileio.save_emails_to_file(email_messages, save_dir_name)[source]
Save email_messages to “emails.json” in save_dir_name in the current working directory.
- academic_tracker.fileio.save_json_to_file(save_dir_name, file_name, json_dict, sort_keys=True)[source]
Saves the json_dict to file_name in save_dir_name in the current working directory.
- academic_tracker.fileio.save_publications_to_file(save_dir_name, publication_dict, prev_pubs)[source]
Saves the publication_dict to “publications.json” in save_dir_name in the current working directory.
prev_pubs and publication_dict will be combined before saving.
- academic_tracker.fileio.save_string_to_file(save_dir_name, file_name, text_to_save)[source]
Save a string to file.
Helper Functions
This module contains helper functions for tasks such as printing and regex searching.
- academic_tracker.helper_functions.adjust_author_attributes(authors_by_project_dict, config_dict)[source]
Modifies config_dict with values from authors_by_project_dict.
Go through the authors in authors_by_project_dict and find the lowest cutoff_year. Also find affiliations and grants and create a union of them across projects. Update the authors in config_dict[“Authors”].
- Parameters:
- Returns:
schema matches the JSON Project Tracking Configuration file.
- Return type:
config_dict (dict)
- academic_tracker.helper_functions.are_citations_in_pub_dict(tokenized_citations, pub_dict)[source]
Determine which citations in tokenized_citations are in pub_dict.
For each citation in tokenized_citations see if it is in pub_dict. Will be True for a citation if the PMID matches, DOI matches, or the title is similar enough.
- Parameters:
- Returns:
list of bools, True if the citation at that index is in pub_dict, False otherwise.
- Return type:
(list)
- academic_tracker.helper_functions.create_authors_by_project_dict(config_dict)[source]
Create the authors_by_project_dict dict from the config_dict.
Creates a dict where the keys are the projects in the config_dict and the values are the authors associated with that project from config_dict[“Authors”].
- academic_tracker.helper_functions.create_pub_dict_for_saving_Crossref(work, prev_query)[source]
Create the standard pub_dict from a Crossref query work dict.
- Parameters:
- Returns:
pub_id (str|None) – the ID of the publication (DOI, PMID, or URL). If None, an ID couldn’t be determined.
pub_dict (dict|None) – the standard pub_dict with values filled in from the Crossref publication. If None, an ID couldn’t be determined.
- academic_tracker.helper_functions.create_pub_dict_for_saving_PubMed(pub, include_xml=False)[source]
Convert pymed.PubMedArticle to a dictionary and modify it for saving.
Converts a pymed.PubMedArticle to a dictionary, deletes the “xml” key if include_xml is False, and converts the “publication_date” key to a string.
- Parameters:
pub (pymed.PubMedArticle) – publication to convert to a dictionary.
include_xml (bool) – if True, include the raw XML query in the key “xml”.
- Returns:
pub_id (str) – the ID of the publication (DOI or PMID).
pub_dict (dict) – pub converted to a dictionary. Keys are “pubmed_id”, “title”, “abstract”, “keywords”, “journal”, “publication_date”, “authors”, “methods”, “conclusions”, “results”, “copyrights”, and “doi”.
- academic_tracker.helper_functions.do_strings_fuzzy_match(string1, string2, match_ratio=90)[source]
Fuzzy match the 2 strings and if the ratio is greater than or equal to match_ratio, return True.
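A minimal sketch of the thresholding idea, assuming a 0 to 100 similarity ratio. The source does not say which fuzzy matching library the package uses, so this example swaps in difflib.SequenceMatcher from the standard library; the name fuzzy_match_sketch is hypothetical and the ratios it produces may differ from the package's.

```python
from difflib import SequenceMatcher


def fuzzy_match_sketch(string1, string2, match_ratio=90):
    """Return True if the similarity ratio (0-100) is at least match_ratio."""
    # SequenceMatcher.ratio() returns a float in [0, 1]; scale it to 0-100.
    ratio = SequenceMatcher(None, string1, string2).ratio() * 100
    return ratio >= match_ratio


print(fuzzy_match_sketch("Deep learning for genomics", "Deep learning for genomic"))
print(fuzzy_match_sketch("abc", "xyz"))
```

The first call differs by a single trailing character and clears the default threshold; the second shares no characters and does not.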
- academic_tracker.helper_functions.extract_ORCID_from_string(string)[source]
Extract an ORCID ID from a string.
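For illustration, an ORCID iD is written as four hyphen-separated groups of four characters, where the last character may be the check character X. The sketch below pulls the first such pattern out of a string with a regular expression. The function name, the exact pattern, and returning None when nothing is found are assumptions, not the package's documented behavior.

```python
import re

# Four groups of four digits; the final character may be the check character "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def extract_orcid_sketch(string):
    """Return the first ORCID-shaped substring, or None if there is none (assumed)."""
    match = ORCID_PATTERN.search(string)
    return match.group(0) if match else None


print(extract_orcid_sketch("https://orcid.org/0000-0002-1825-0097"))
```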
- academic_tracker.helper_functions.find_common_subphrases(str1, str2, min_len=2)[source]
Find all common subphrases between str1 and str2 longer than min_len.
Modified from https://stackoverflow.com/a/63337541/19957088. Find all common subphrases between the 2 strings, but filter out common subphrases between subphrases. For example, if “sand” is common between the 2 strings the function will not return “and” unless there is another instance of “and” between the 2 strings. A phrase is a string that must end in a space or be at the end of the string. So “sand asdf” and “sand awer” will only match “sand “ and not “sand a”. Spaces are expected to be meaningful. It is recommended to remove punctuation from the strings.
- academic_tracker.helper_functions.find_duplicate_citations(tokenized_citations)[source]
Find citations that are duplicates of each other in tokenized_citations.
Citations can be duplicates in 3 ways: same PMID, same DOI, or similar enough titles. The function goes through each citation and looks for matches on these criteria. Then the matches are compared to create unique sets. For instance, if citation 1 matches the PMID in citation 2, and citation 2 matches the DOI in citation 3, but citations 1 and 3 don’t match, a duplicate set containing all 3 is still created. The unique duplicate sets are returned as a list of sorted lists.
- Parameters:
tokenized_citations (list) – list of dictionaries where each dictionary is a citation. Matches the tokenized_reference.json schema.
- Returns:
list of lists where each element is a list of indexes in tokenized_citations that match each other. The list of indexes is sorted in ascending order.
- Return type:
unique_duplicate_sets (list)
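The transitive grouping described above (1 matches 2, 2 matches 3, so all three form one set) can be sketched with a small union-find structure. This is an illustration only: the keys “PMID”, “DOI”, and “title”, the helper names, and the use of difflib for title similarity are assumptions and may not match the tokenized_reference.json schema or the package's actual code.

```python
from difflib import SequenceMatcher


def titles_similar(title_a, title_b, match_ratio=90):
    """Stand-in title comparison; empty or missing titles never match."""
    if not title_a or not title_b:
        return False
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio() * 100
    return ratio >= match_ratio


def find_duplicate_sets_sketch(citations):
    """Group citation indexes that match, directly or through a chain of matches."""
    parent = list(range(len(citations)))

    def find(i):
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(citations)):
        for j in range(i + 1, len(citations)):
            a, b = citations[i], citations[j]
            same_pmid = bool(a.get("PMID")) and a.get("PMID") == b.get("PMID")
            same_doi = bool(a.get("DOI")) and a.get("DOI") == b.get("DOI")
            if same_pmid or same_doi or titles_similar(a.get("title"), b.get("title")):
                parent[find(j)] = find(i)  # merge the two sets

    groups = {}
    for i in range(len(citations)):
        groups.setdefault(find(i), []).append(i)  # indexes arrive in ascending order
    # Only sets with more than one member are duplicates.
    return sorted(group for group in groups.values() if len(group) > 1)
```

Merging sets instead of comparing every pair directly is what lets citations 1 and 3 land in the same set even though they share no identifier.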
- academic_tracker.helper_functions.fuzzy_matches_to_list(str_to_match, list_to_match)[source]
Return strings and indexes for strings with match ratio that is 90 or higher.
- academic_tracker.helper_functions.get_pub_id_in_publication_dict(pub_id, title, publication_dict)[source]
Get the pub_id in publication_dict for the publication that matches the given pub_id or fuzzy matches a title.
Check whether the pub_id is in publication_dict. If it isn’t then see if there is a fuzzy match in titles. It is assumed every dictionary in publication_dict will have a “title” key with a string value.
- Parameters:
- Returns:
the pub_id matched in publication_dict or None if nothing was found.
- Return type:
(str|None)
- academic_tracker.helper_functions.is_fuzzy_match_to_list(str_to_match, list_to_match)[source]
True if string is a 90 or higher ratio match to any string in list, False otherwise.
- academic_tracker.helper_functions.is_pub_in_publication_dict(pub_id, title, publication_dict, titles=None)[source]
True if pub_id is in publication_dict or title is a fuzzy match to titles in titles.
Check whether the pub_id is in publication_dict. If it isn’t then see if there is a fuzzy match in titles. If titles is not provided then get a list of titles from publication_dict.
- Parameters:
pub_id (str) – pub_id to check against in publication_dict to see if it already exists.
title (str) – title corresponding to pub_id to check against titles in publication_dict.
publication_dict (dict) – keys are pub_ids and values are pub attributes.
titles (list|None) – list of strings that should be titles to fuzzy match to title.
- Returns:
True if the pub_id is in publication_dict or title is fuzzy matched in titles, False otherwise
- Return type:
(bool)
- academic_tracker.helper_functions.match_authors_in_prev_pub(prev_author_list, new_author_list)[source]
Look for matching authors in previous pub data.
Goes through the new_author_list and tries to find a match for each author in the prev_author_list. Any authors that aren’t matched are added to a combined list. Both lists are expected to be lists of dicts in one of two forms:
- {“firstname”: author’s first name, “lastname”: author’s last name, “author_id”: author’s ID, “ORCID”: ORCID ID}
- {“collectivename”: collective name, “author_id”: author’s ID, “ORCID”: ORCID ID}
If author_id is missing or None in the prev_author_list, it will be updated in the combined_author_list if the matched author in new_author_list has it. The same can be said for the ORCID attribute.
- Parameters:
- Returns:
the prev_author_list updated with “author_id” for dictionaries in the list that matched the given author.
- Return type:
combined_author_list (list)
- academic_tracker.helper_functions.match_pub_authors_to_citation_authors(citation_authors, author_list)[source]
Try to match authors in pub data to authors in citation data.
Goes through the author_list from a publication and tries matching to an author in citation_authors using last name or ORCID if ORCID is present.
- academic_tracker.helper_functions.match_pub_authors_to_config_authors(authors_json, author_list)[source]
Try to match authors in pub data to authors in config data.
Goes through the author_list from a publication and tries matching to an author in authors_json using firstname, lastname, and affiliations, or ORCID if ORCID is present.
- Parameters:
- Returns:
either the author list with matched authors containing an additional author_id and/or ORCID attribute, or an empty list if no authors were matched.
- Return type:
author_list (list)
- academic_tracker.helper_functions.regex_group_return(regex_groups, group_index)[source]
Return the group in the regex_groups indicated by group_index if it exists, else return empty string.
If group_index is out of range of the regex_groups an empty string is returned.
- academic_tracker.helper_functions.regex_match_return(regex, string_to_match)[source]
Return the groups matched in the regex if the regex matches.
regex is delivered to re.match() with string_to_match, and if there is a match the match.groups() is returned, otherwise an empty tuple is returned.
- academic_tracker.helper_functions.regex_search_return(regex, string_to_search)[source]
Return the groups matched in the regex if the regex matches.
regex is delivered to re.search() with string_to_search, and if there is a match the match.groups() is returned, otherwise an empty tuple is returned.
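The three regex helpers can be approximated from their descriptions as below. The _sketch names are hypothetical and the bodies are a reading of the documented behavior, not the package's source. The example also shows the practical difference between the two: re.match() only matches at the start of the string, while re.search() scans the whole string.

```python
import re


def regex_match_sketch(regex, string_to_match):
    """Return match.groups() from re.match(), or an empty tuple if no match."""
    match = re.match(regex, string_to_match)
    return match.groups() if match else ()


def regex_search_sketch(regex, string_to_search):
    """Return match.groups() from re.search(), or an empty tuple if no match."""
    match = re.search(regex, string_to_search)
    return match.groups() if match else ()


def regex_group_sketch(regex_groups, group_index):
    """Return the group at group_index, or an empty string if out of range."""
    try:
        return regex_groups[group_index]
    except IndexError:
        return ""


groups = regex_search_sketch(r"(\d{4});(\d+)", "J Chem. 2021;12 ...")
print(regex_group_sketch(groups, 0), regex_group_sketch(groups, 5))
```

Returning an empty tuple and an empty string instead of None lets callers chain the helpers without checking for a failed match at each step.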
- academic_tracker.helper_functions.vprint(*args, verbosity=0)[source]
Print depending on the state of VERBOSE, SILENT, and verbosity.
If the global SILENT is True don’t print anything. If verbosity is 0 then print. If verbosity is 1 then VERBOSE must be True to print.
- Parameters:
verbosity (int) – Either 0 or 1 for different levels of verbosity.
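The printing rule above can be sketched as follows. The module-level VERBOSE and SILENT flags are named in the description; their default values here and the name vprint_sketch are assumptions for illustration.

```python
# Assumed defaults for the module-level flags described above.
VERBOSE = False
SILENT = False


def vprint_sketch(*args, verbosity=0):
    """Print args unless silenced; verbosity 1 messages also require VERBOSE."""
    if SILENT:
        return
    if verbosity == 0 or (verbosity == 1 and VERBOSE):
        print(*args)


vprint_sketch("always shown unless SILENT is True")
vprint_sketch("shown only when VERBOSE is True", verbosity=1)
```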
Webio
General functions that interface with the internet.
- academic_tracker.webio.clean_tags_from_url(url)[source]
Remove tags from webpage.
Remove tags from a webpage so it looks more like what a user would see in a browser.
- academic_tracker.webio.get_DOI_from_Crossref(title, mailto_email)[source]
Search title on Crossref and try to find a DOI for it.
- academic_tracker.webio.get_url_contents_as_str(url)[source]
Query the url and return its contents as a string.
- academic_tracker.webio.search_Google_Scholar_for_ids(authors_json)[source]
Query Google Scholar with author names and get Scholar IDs.
If an author already has a scholar_id or doesn’t have affiliations, they are skipped.
- academic_tracker.webio.search_ORCID_for_ids(ORCID_key, ORCID_secret, authors_json)[source]
Query ORCID with author names and get ORCID IDs.
If an author already has an ORCID or doesn’t have affiliations, they are skipped.
- Parameters:
- Returns:
the authors_json modified with any ORCID IDs found.
- Return type:
authors_json (dict)
- academic_tracker.webio.send_emails(email_messages)[source]
Uses sendmail to send email_messages to authors.
Only works on systems with sendmail installed.
- Parameters:
email_messages (dict) – keys are author names and values are the message
Emails and Reports Helpers
Functions to create emails and reports that are common to both author_search and reference_search.