JSON Schema
===========
Academic Tracker uses and produces several JSON files during its execution. This
document describes their structure and gives examples. `JSON Schema <jsonschema_>`_
is used here to describe each structure and in the program to validate inputs.
Configuration JSON
~~~~~~~~~~~~~~~~~~
The configuration JSON is required for every Academic Tracker command. It
contains all the information necessary to run a command. Not every command uses
every section of the configuration JSON, and a section is required only by the
commands that use it.
Sections
--------
project_descriptions
++++++++++++++++++++
This section contains information about the funding projects your authors are
a part of. Some of the information in this section is used during search and some
is only used when constructing reports.
The grants, cutoff_year, and affiliations attributes are all used during search.
grants lists the grant strings to look for when determining whether a queried
publication is associated with the project. cutoff_year filters out
publications that were published before the given year. affiliations is used
when matching the author under search with the queried author. Each of these can
be set for authors individually, but they are presented here for convenience.
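As a rough illustration of how cutoff_year and grants could be applied, the
sketch below uses two hypothetical helpers, ``filter_by_cutoff_year`` and
``find_grants``. The function names, the choice to keep publications with an
unknown year, and the plain substring match are all assumptions made for this
example, not Academic Tracker's actual implementation.

```python
def filter_by_cutoff_year(publications, cutoff_year):
    """Keep publications whose year is not before cutoff_year."""
    kept = {}
    for key, pub in publications.items():
        year = pub["publication_date"]["year"]
        # Assumption: publications with an unknown year are kept.
        if year is None or year >= cutoff_year:
            kept[key] = pub
    return kept


def find_grants(text, grants):
    """Return the configured grant strings that occur in the given text.

    Assumption: a simple substring match stands in for the real search.
    """
    return [grant for grant in grants if grant in text]
```

Listing a grant in several spellings, as the configuration example does with
"P42ES007380" and "P42 ES007380", lets a literal match like this one catch
either form.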
The authors attribute specifies which authors belong to the project. If the
attribute is not present, all authors are assumed to be associated with the
project. Each entry for this attribute must match an entry in the Authors section
of the configuration JSON.
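The default described above can be sketched as follows. ``project_author_names``
is a hypothetical helper written for this document, and raising ``ValueError``
for an unknown author is an assumption about how a mismatch might be reported.

```python
def project_author_names(project, all_authors):
    """Return the keys of the authors associated with a project."""
    if "authors" not in project:
        # A missing attribute means every author belongs to the project.
        return list(all_authors)
    unknown = [name for name in project["authors"] if name not in all_authors]
    if unknown:
        # Entries must match an entry in the Authors section.
        raise ValueError(f"Authors not found in the Authors section: {unknown}")
    return list(project["authors"])
```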
project_report is used to specify that creating a project report is desired and
how to construct and email it. If project_report is missing then no report is
created for that project. Additional attributes within the project_report attribute
specify how to construct the report and whether to email it.
Details about reporting are in the :doc:`reporting` section.
The from_email attribute within project_report specifies the address that the
email with the report attached is sent from. If this attribute is missing, no
email is sent. The to_email attribute specifies which addresses to send the
report to. If it is missing but from_email is present, a report is made for each
author associated with the project rather than one report for the whole project.
Similarly, cc_email specifies which addresses to cc on the email. email_body
specifies the body of the email, and email_subject specifies the subject of
the email.
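The rules above can be summarized in a small decision function. This is only a
sketch: ``project_report_plan`` and its return values are hypothetical, and it
assumes that an attribute counts as "missing" when its key is absent.

```python
def project_report_plan(project_report):
    """Summarize what a project_report entry asks for.

    Returns a (report_scope, send_email) tuple, where report_scope is
    "none", "whole_project", or "per_author".
    """
    if project_report is None:
        # No project_report attribute: no report is created.
        return ("none", False)
    has_from = "from_email" in project_report
    has_to = "to_email" in project_report
    if has_from and not has_to:
        # One report per author instead of one for the whole project.
        return ("per_author", True)
    # Without from_email the report is still built, but nothing is emailed.
    return ("whole_project", has_from)
```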
collaborator_report is used to specify that creating a collaborator report is
desired and how to construct and email it. If collaborator_report is missing
then no report is created for the authors of that project. Additional attributes
within the collaborator_report attribute specify how to construct the report
and whether to email it.
Details about reporting are in the :doc:`reporting` section.
The from_email attribute within collaborator_report specifies the address that
the email with the report attached is sent from. If this attribute is missing,
no email is sent. The to_email attribute specifies which addresses to send the
report to. If it is missing but from_email is present, the author's email
attribute is used instead. Similarly, cc_email specifies which addresses to cc
on the email. email_body specifies the body of the email, and email_subject
specifies the subject of the email.
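A minimal sketch of the recipient fallback described above, assuming a
hypothetical helper named ``collaborator_report_recipients`` and treating an
attribute as missing when its key is absent:

```python
def collaborator_report_recipients(collaborator_report, author):
    """Return the addresses a collaborator report would be sent to."""
    if "from_email" not in collaborator_report:
        # Without from_email no email is sent at all.
        return []
    if "to_email" in collaborator_report:
        return list(collaborator_report["to_email"])
    # Fall back to the author's own email attribute.
    email = author.get("email")
    return [email] if email else []
```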
Commands Requiring This Section:
author_search
gen_reports_and_emails_auth
ORCID_search
++++++++++++
This section contains the information necessary to use ORCID's API. The ORCID_key and
ORCID_secret correspond to a key and secret you can get from ORCID after registering
for their public API.
Commands Requiring This Section:
author_search # Unless the --no_ORCID option is used
find_ORCID
PubMed_search
+++++++++++++
This section contains the email address that is required when using PubMed's
API. The address lets PubMed contact you if you are using their API in a way
they object to, so that you can change the behavior before they block you.
Commands Requiring This Section:
author_search
reference_search
Crossref_search
+++++++++++++++
Like PubMed_search, this section contains an email address to use with
Crossref's API. It serves a similar function to PubMed's: Crossref can contact
you about unwanted behavior, and supplying it also places your requests in a
better request pool that has faster response times.
Commands Requiring This Section:
author_search # Unless the --no_Crossref and --no_GoogleScholar options are used
reference_search # Unless the --no_Crossref option is used
summary_report
++++++++++++++
summary_report is used to specify that creating a summary report is desired and
how to construct and email it. If summary_report is missing then no report is
created for that run. Additional attributes within the summary_report attribute
specify how to construct the report and whether to email it.
Details about reporting are in the :doc:`reporting` section.
The from_email attribute within summary_report is used to specify what email
address the email with the report attached should be sent from. If this attribute
is missing then no email will be sent. The to_email attribute is used to specify
which emails to send the report to. Similarly, cc_email is used to specify which emails
to cc the report to. email_body is used to specify what should be in the body of
the email, and email_subject is used to specify what should be in the subject of
the email.
Authors
+++++++
The Authors section is where all of the information about the authors you want
to search for goes. Every author in this section will be queried during author_search.
The first_name and last_name attributes are the author's first and last names
respectively, and are used to validate that the author under search is the same
as the queried author. Collective authors are a special type of author: they are
not individuals but groups that publish under a collective name.
Use the collective_name attribute to indicate that an author is a collective. This
attribute takes priority, so if it is present the author is treated as a collective
author even if first_name and last_name attributes are also present.
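The priority rule can be illustrated with a short sketch. ``author_display_name``
is a hypothetical helper, not a function from Academic Tracker.

```python
def author_display_name(author):
    """Return the name to use for an author entry from the Authors section."""
    if "collective_name" in author:
        # collective_name takes priority over first_name and last_name.
        return author["collective_name"]
    return f'{author["first_name"]} {author["last_name"]}'
```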
pubmed_name_search is used as the query string when querying sources. This is so
the user can specify exactly what to query rather than simply querying the first
and last name.
email is used to send individual project reports and collaborator reports to
authors about their publications if the user chooses to do so.
ORCID is the ORCID ID of the author and is required to search an author's publications
in ORCID's database. If this is not present then the author will be skipped when
searching ORCID.
The grants, cutoff_year, affiliations, project_report, and collaborator_report
attributes from the project_descriptions section can also be included individually
for an author. They are in the project_descriptions section so it is easier to
specify these fields en masse, but it can be done on an individual level as well.
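One way to picture the relationship between project-level and author-level
values is the merge below. It is a sketch only: ``effective_author_settings`` is
a hypothetical helper, and it assumes that a value set on the author takes
precedence over the project's value.

```python
OVERRIDABLE = ("grants", "cutoff_year", "affiliations",
               "project_report", "collaborator_report")


def effective_author_settings(project, author):
    """Combine project-level attributes with an author's individual ones."""
    settings = {}
    for name in OVERRIDABLE:
        if name in author:
            # Assumption: the author's own value wins over the project's.
            settings[name] = author[name]
        elif name in project:
            settings[name] = project[name]
    return settings
```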
Commands Requiring This Section:
author_search
find_ORCID
find_GoogleScholar
add_authors
gen_reports_and_emails_auth
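The "Commands Requiring This Section" lists above can be collected into one
lookup table. The sketch below does that with a hypothetical helper,
``missing_sections``; it deliberately ignores the option-dependent exceptions
noted above (for example --no_ORCID and --no_Crossref), which is a
simplification made for this example.

```python
# Required sections per command, taken from the section descriptions above.
# Assumption: option-dependent exceptions (such as --no_ORCID) are ignored.
REQUIRED_SECTIONS = {
    "author_search": ["project_descriptions", "ORCID_search", "PubMed_search",
                      "Crossref_search", "Authors"],
    "reference_search": ["PubMed_search", "Crossref_search"],
    "find_ORCID": ["ORCID_search", "Authors"],
    "find_GoogleScholar": ["Authors"],
    "add_authors": ["Authors"],
    "gen_reports_and_emails_auth": ["project_descriptions", "Authors"],
}


def missing_sections(command, config):
    """Return the required sections that are absent from a configuration dict."""
    return [name for name in REQUIRED_SECTIONS[command] if name not in config]
```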
Validating Schema
-----------------
.. literalinclude:: ../src/academic_tracker/tracker_schema.py
:start-at: config_schema
:end-before: ## config_end
:language: none
Example
-------
.. code-block:: console
{
"project_descriptions" : {
"" : {
"grants" : [ "P42ES007380", "P42 ES007380" ],
"cutoff_year" : 2019, # optional
"affiliations" : [ "kentucky" ],
"project_report" : { # optional
"template": "", # optional
"to_email": [], # optional
"cc_email": [], # optional
"from_email": "", # optional
"email_body": "", # optional
"email_subject": "" # optional
},
"authors" : [] # optional
},...
},
"ORCID_search" : {
"ORCID_key": "",
"ORCID_secret": ""
},
"PubMed_search": {
"PubMed_email": ""
},
"Crossref_search": {
"mailto_email": ""
},
"summary_report" : { # optional
"template": "", # optional
"to_email": [], # optional
"cc_email": [], # optional
"from_email": "", # optional
"email_body": "", # optional
"email_subject": "" # optional
},
"Authors" : {
"Author 1": {
"first_name" : "",
"last_name" : "",
"pubmed_name_search" : "",
"email": "email@uky.edu", # optional
"ORCID": "", # optional
"affiliations" : ["", ""] # optional
},
"Author 2": {
"first_name" : "",
"last_name" : "",
"pubmed_name_search" : "",
"email": "email@uky.edu", # optional
"ORCID": "", # optional
"affiliations" : ["", ""] # optional
}
}
}
Publications JSON
~~~~~~~~~~~~~~~~~
The publications JSON is one of the outputs of the program. It is based on the
default JSON created by the pymed package from the PubMed XML. PubMed is the most
data-rich source that is queried, so publications from other sources have their
information conformed to this structure. As a result, publications from other
sources will have mostly empty fields.
The key for each publication is either a DOI web address, a PMID, or an
external URL to the publication. The order of preference when choosing a key is
DOI, then PMID, then URL: if the DOI is unavailable the PMID is used, and if
both the DOI and PMID are unavailable the URL is used.
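The preference order can be written as a short function. ``publication_key`` is
a hypothetical helper; the "https://doi.org/" prefix for the DOI web address is
taken from the pub_dict_key values shown in the tokenized references example
later in this document.

```python
def publication_key(doi=None, pmid=None, url=None):
    """Pick a publication key, preferring DOI, then PMID, then URL."""
    if doi:
        # The DOI is stored as a web address.
        return "https://doi.org/" + doi
    if pmid:
        return pmid
    if url:
        return url
    # None of the three identifiers is available.
    return None
```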
Validating Schema
-----------------
.. code-block:: console
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Publications JSON",
"description": "Input file that contains information about publications previously found by Academic Tracker.",
"type": "object",
"additionalProperties": {
"type": "object",
"properties": {
"abstract": {"type":["string", "null"]},
"authors": {"type":"array",
"minItems":1,
"items": {"type": "object",
"properties": {
"affiliation": {"type": ["string", "null"]},
"firstname": {"type": ["string", "null"]},
"initials": {"type": ["string", "null"]},
"lastname": {"type": ["string", "null"]},
"author_id" : {"type": "string"} # optional, only put in if author detected and validated
},
"required": ["affiliation", "firstname", "lastname", "initials"]
}
},
"conclusions": {"type": ["string", "null"]},
"copyrights": {"type": ["string", "null"]},
"doi": {"type": ["string", "null"]},
"journal": {"type": ["string", "null"]},
"keywords": {"type": ["array", "null"], "items":{"type": ["string", "null"]}},
"methods": {"type": ["string", "null"]},
"publication_date": {"type": "object",
"properties":{"year": {"type": ["integer", "null"]},
"month": {"type": ["integer", "null"]},
"day": {"type": ["integer", "null"]}},
"required":["year", "month", "day"]},
"pubmed_id": {"type": ["string", "null"]},
"results": {"type": ["string", "null"]},
"title": {"type": ["string", "null"]},
"grants": {"type": ["array", "null"], "items":{"type": ["string", "null"]}},
"PMCID": {"type": ["string", "null"]},
},
"required" : ["abstract", "authors", "conclusions", "copyrights", "doi", "journal", "keywords", "methods", "publication_date", "pubmed_id", "results", "title"]
}
}
Example
-------
.. code-block:: console
{
"": {
"abstract": "",
"authors": [
{
"affiliation": "",
"firstname": "",
"initials": "",
"lastname": "",
"author_id" : "" # optional, only put in if author detected and validated
}
],
"conclusions": "",
"copyrights": "",
"doi": "DOI string",
"journal": "",
"keywords": ["keyword 1", "keyword 2"],
"methods": "",
"publication_date": {"year":yyyy, "month":mm, "day":dd},
"pubmed_id": "",
"results": "",
"title": "",
"grants": ["grant1", "grant2"],
"PMCID": ""
}
}
Email JSON
~~~~~~~~~~
The email JSON is an output of the program. It is provided purely as a record
and is not used as input for any commands. Since it is not an input, there is
no associated JSON schema to validate it. The top level has two keys, "creation_date"
and "emails". creation_date is a timestamp for when the JSON was created.
emails is a list of emails broken into their parts. Each part is a string.
Example
-------
.. code-block:: console
{
"creation_date" : "",
"emails" : [
{
"body" : "",
"cc" : "",
"from" : "",
"subject": "",
"to": "",
"author" : "" #only present if email is for a specific author from author_search
},
{
"body" : "",
"cc" : "",
"from" : "",
"subject": "",
"to": "",
"author" : "" #only present if email is for a specific author from author_search
}
]
}
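Because the email JSON is only a record, the most likely use for it is review
after a run. The sketch below groups email subjects by author using the
standard library; ``subjects_by_author`` is a hypothetical helper, and emails
without an "author" key are grouped under ``None``.

```python
import json


def subjects_by_author(email_json_text):
    """Group the subjects of recorded emails by their "author" value."""
    data = json.loads(email_json_text)
    grouped = {}
    for email in data["emails"]:
        # "author" is only present for emails about a specific author.
        grouped.setdefault(email.get("author"), []).append(email["subject"])
    return grouped
```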
Tokenized References JSON
~~~~~~~~~~~~~~~~~~~~~~~~~
The tokenized references JSON is an output of the program when working with references.
It can also be an input, so a schema is needed for validation. It is a list
of references where each reference is an object with attributes for its tokens
and other properties. It is largely an output for the purpose of troubleshooting.
Keep in mind that the contents of this JSON are Academic Tracker's best attempt
at parsing and tokenizing the references, so some information may be incorrect.
The "authors" property is a list of authors where each author is an object that
has attributes for their first, middle, and last names as well as initials. Only
the last name is required, since common citation styles vary in how they name
authors.
The "title" property is what was tokenized as the title of the publication in the
reference line.
The "PMID" property is what was tokenized as the PMID of the publication in the
reference line. To pull out a PMID, Academic Tracker looks for "pmid: " in
the tail end of the reference line. The match is not case sensitive.
The "DOI" property is what was tokenized as the DOI of the publication in the
reference line. To pull out a DOI, Academic Tracker looks for "doi: " in
the tail end of the reference line. The match is not case sensitive.
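The idea behind pulling out a PMID or DOI can be sketched with a regular
expression. This is not Academic Tracker's tokenizer: ``extract_identifier`` is
a hypothetical helper, and taking the last match and stripping a trailing
period are assumptions made to approximate "the tail end of the reference line".

```python
import re


def extract_identifier(reference_line, label):
    """Return the value following "<label>:" in a reference line, or None.

    label is "pmid" or "doi"; matching ignores case.
    """
    pattern = rf"{re.escape(label)}:\s*(\S+)"
    matches = re.findall(pattern, reference_line, flags=re.IGNORECASE)
    if not matches:
        return None
    # Assumption: the last occurrence is the one at the tail end of the line,
    # and a trailing period belongs to the sentence, not the identifier.
    return matches[-1].rstrip(".")
```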
The "reference_line" property is the raw string that was tokenized into the other
properties.
The "pub_dict_key" property is the key to the matching publication in the publications
JSON that was found during reference_search queries. This can be empty if
no matching publication was found or if the tokenized references JSON was generated
on its own.
Validating Schema
-----------------
.. code-block:: console
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Tokenized Citations JSON",
"description": "Input file that contains the tokenized data of a reference file.",
"type": "array",
"items": {"type": "object",
"minItems":1,
"properties": {"authors": {"type": "array",
"items": {"type": "object",
"properties": {"last": {"type":["string", "null"]},
"initials": {"type":["string", "null"]},
"first": {"type":["string", "null"]},
"middle": {"type":["string", "null"]}},
"required": ["last"]}},
"title": {"type":["string", "null"]},
"PMID": {"type":["string", "null"]},
"DOI": {"type":["string", "null"]},
"reference_line": {"type":["string", "null"]},
"pub_dict_key": {"type":["string", "null"]}},
"required": ["authors", "title", "PMID", "DOI", "reference_line", "pub_dict_key"]}
}
Example
-------
.. code-block:: console
[
{
"DOI": "10.3390/metabo11030163",
"PMID": "",
"authors": [
{
"initials": "C",
"last": "Powell"
},
{
"initials": "H",
"last": "Moseley"
}
],
"pub_dict_key": "https://doi.org/10.3390/metabo11030163",
"reference_line": "Powell C, Moseley H. The mwtab Python Library for RESTful Access and Enhanced Quality Control, Deposition, and Curation of the Metabolomics Workbench Data Repository. Metabolites. 2021 March; 11(3):163-. doi: 10.3390/metabo11030163.",
"title": "The mwtab Python Library for RESTful Access and Enhanced Quality Control, Deposition, and Curation of the Metabolomics Workbench Data Repository."
},
{
"DOI": "10.3390/metabo10090368",
"PMID": "",
"authors": [
{
"initials": "H",
"last": "Jin"
},
{
"initials": "J",
"last": "Mitchell"
},
{
"initials": "H",
"last": "Moseley"
}
],
"pub_dict_key": "https://doi.org/10.3390/metabo10090368",
"reference_line": "Jin H, Mitchell J, Moseley H. Atom Identifiers Generated by a Neighborhood-Specific Graph Coloring Method Enable Compound Harmonization across Metabolic Databases. Metabolites. 2020 September; 10(9):368-. doi: 10.3390/metabo10090368.",
"title": "Atom Identifiers Generated by a Neighborhood-Specific Graph Coloring Method Enable Compound Harmonization across Metabolic Databases."
}
]
.. _jsonschema: https://json-schema.org/