API
This package contains the following modules:
extract
validate
convert
extract
Extract data from Excel workbooks, csv files, and JSON files.
- Usage:
messes extract <metadata_source>... [--delete <metadata_section>...] [options]
messes extract --help
- <metadata_source> - tagged input metadata source as a csv/json filename, xlsx_filename[:worksheet_name|regular_expression], or google_sheets_url[:worksheet_name|regular_expression]. The "#export" worksheet name is the default.
- Options:
- -h, --help
show this help documentation.
- -v, --version
show the version.
- --silent
print no warning messages.
- --output <filename_json>
output json filename.
- --compare <filename_json>
compare extracted metadata to given JSONized metadata.
- --modify <source>
modification directives worksheet name, regular expression, csv/json filename, or
xlsx_filename[:worksheet_name|regular_expression] or google_sheets_url[:worksheet_name|regular_expression] [default: #modify].
- --end-modify <source>
apply modification directives after all metadata merging. Requires csv/json filename or
xlsx_filename[:worksheet_name|regular_expression] or google_sheets_url[:worksheet_name|regular_expression].
- --automate <source>
automation directives worksheet name, regular expression, csv/json filename, or
xlsx_filename[:worksheet_name|regular_expression] or google_sheets_url[:worksheet_name|regular_expression] [default: #automate].
- --save-directives <filename_json>
output filename with modification and automation directives in JSON format.
- --save-export <filetype>
output export worksheet with suffix “_export” and with the indicated xlsx/csv format extension.
- --show <show_option>
show a part of the metadata. See options below.
- --delete <metadata_section>... - delete a section of the JSONized metadata. Section format is tableKey or tableKey,IDKey or tableKey,IDKey,fieldName. These can be regular expressions.
- --keep <metadata_tables> - only keep the selected tables and delete the rest. Table format is tableKey,tableKey,... The tableKey can be a regular expression.
- --file-cleaning <remove_regex> - a string or regular expression to remove characters in input files; removes unicode escape and \r characters by default. Enter "None" to disable [default: _x([0-9a-fA-F]{4})_|\r].
- Show Options:
tables - show tables in the extracted metadata.
lineage - show parent-child lineages per table.
all - show every option.
- Regular Expression Format:
Regular expressions have the form "r'...'" on the command line. The re.match function is used, which matches from the beginning of a string, meaning that a regular expression matches as if it starts with "^".
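To illustrate the re.match anchoring described above (this snippet uses only the standard re module, not messes itself):

```python
import re

# re.match anchors at the beginning of the string, so a pattern
# behaves as if it were prefixed with "^".
print(bool(re.match(r"sheet", "sheet_1")))   # matches at the start
print(bool(re.match(r"_1", "sheet_1")))      # no match: "_1" is not at the start
print(bool(re.search(r"_1", "sheet_1")))     # re.search, by contrast, matches anywhere
```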
- Directives JSON Format:
{
  "modification" : { table : { field : { "(exact|regex|levenshtein)-(first|first-nowarn|unique|all)" :
                    { field_value : { "assign" : { field : value, ... }, "append" : { field : value, ... },
                                      "prepend" : { field : value, ... }, "regex" : { field : regex_pair, ... },
                                      "delete" : [ field, ... ], "rename" : { old_field : new_field } } } } } },
  "automation" : [ { "header_tag_descriptions" : [ { "header" : column_description, "tag" : tag_description,
                       "required" : true|false } ], "exclusion_test" : exclusion_value, "insert" : [ [ cell_content, ... ] ] } ]
}
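As a concrete illustration of the directives format, here is a hypothetical directives structure built as a Python dict. The table name ("sample"), field names, and tag strings are invented for illustration and are not taken from messes itself:

```python
import json

# Hypothetical directives following the format above; all names are illustrative.
directives = {
    "modification": {
        "sample": {                      # table
            "unit": {                    # field
                "exact-unique": {        # match-type directive
                    "g": {"assign": {"unit": "gram"}},  # when unit == "g", assign "gram"
                }
            }
        }
    },
    "automation": [
        {
            "header_tag_descriptions": [
                {"header": "Sample ID", "tag": "#sample.id", "required": True}
            ],
            "exclusion_test": "exclude",
            "insert": [["#tags", "#sample.id"]],
        }
    ],
}

print(json.dumps(directives, indent=2))
```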
- class messes.extract.extract.ColumnOperand(value: str | int)[source]
Represents specific worksheet cells in a given column as operands.
Initializer
- class messes.extract.extract.Evaluator(evalString: str, useFieldTests: bool = True, listAsString: bool = False)[source]
Creates object that calls eval with a given record.
Initializer
- Parameters
- class messes.extract.extract.FieldMaker(field: str)[source]
Creates objects that convert specific information from a worksheet row into a field via concatenation of a list of operands.
Initializer
- Parameters
field (str) – name of a field in a record from TagParser.extraction.
- create(record: dict, row: Series) str [source]
Creates field-value and adds to record using row and record.
- shallowClone() FieldMaker [source]
Returns clone with shallow copy of operands.
- Returns
A copy of self, but with a shallow copy of operands.
- Return type
- class messes.extract.extract.ListFieldMaker(field: str)[source]
Creates objects that convert specific information from a worksheet row into a list field via appending of a list of operands.
Initializer
- Parameters
field (str) – name of a field in a record from TagParser.extraction.
- create(record: dict, row: Series) list [source]
Creates field-value and adds to record using row and record.
- shallowClone() ListFieldMaker [source]
Returns clone with shallow copy of operands.
- Returns
A copy of self, but with a shallow copy of operands.
- Return type
- class messes.extract.extract.LiteralOperand(value: str | int)[source]
Represents string literal operands.
Initializer
- class messes.extract.extract.Operand(value: str | int)[source]
Class of objects that create string operands for concatenation operations.
Initializer
- class messes.extract.extract.RecordMaker[source]
Creates objects that convert worksheet rows into records for specific tables.
Initializer
- addColumnOperand(columnIndex: int)[source]
Add columnIndex as a column variable operand to the last FieldMaker.
- Parameters
columnIndex (int) – column number to add.
- addField(table: str, field: str, fieldMakerClass: messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker = <class 'messes.extract.extract.FieldMaker'>)[source]
Creates and adds new FieldMaker object.
- Parameters
table (str) – table name to add.
field (str) – field name to add.
fieldMakerClass (messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker) – which type of FieldMaker to add to self.fieldMakers.
- addGlobalField(table: str, field: str, literal: str, fieldMakerClass: messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker = <class 'messes.extract.extract.FieldMaker'>)[source]
Creates and adds new FieldMaker with literal operand that will be used as global fields for all records created from a row.
- Parameters
table (str) – table name to add.
field (str) – field name to add.
literal (str) – value of the field to be added.
fieldMakerClass (messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker) – which type of FieldMaker to add to self.fieldMakers.
- addLiteralOperand(literal: str)[source]
Add literal as an operand to the last FieldMaker.
- Parameters
literal (str) – value to append.
- addVariableOperand(table: str, field: str)[source]
Add field as a variable operand to the last FieldMaker.
- static child(example: RecordMaker, table: str, parentIDIndex: int) RecordMaker [source]
Returns child object derived from an example object.
- Parameters
example (RecordMaker) – RecordMaker with global literal fields.
table (str) – table where the child record will go.
parentIDIndex (int) – column index for parentID of the child record.
- Returns
RecordMaker to make a new child record.
- Return type
- field(table: str, field: str) messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None [source]
Returns FieldMaker for table.field.
- Parameters
- Returns
The FieldMaker for the table.field.
- Return type
messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None
- hasField(table: str, field: str, offset: int = 0) bool [source]
Returns whether a given table.field exists in the current RecordMaker.
- hasShortField(field: str, offset: int = 0) bool [source]
Returns whether a given field exists in the current RecordMaker.
- hasValidID() bool [source]
Returns whether there is a valid id field.
- Returns
True if there is a valid id field, False otherwise.
- Return type
- isInvalidDuplicateField(table: str, field: str, fieldMakerClass: messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker) bool [source]
Returns whether a given table.field is an invalid duplicate in the current RecordMaker.
- Parameters
table (str) – table name to look for.
field (str) – field name to look for.
fieldMakerClass (messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker) – uses this type to do the correct checks.
- Returns
True if table.field is an invalid duplicate, False otherwise.
- Return type
- isLastField(table: str, field: str) bool [source]
Returns whether the last FieldMaker is for table.field.
- properField(table: str, field: str) str [source]
Returns proper field name based on given table and field and internal self.table.
- shortField(field: str) messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None [source]
Returns FieldMaker for field.
- Parameters
field (str) – field name to look for FieldMaker.
- Returns
The FieldMaker for the field.
- Return type
messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None
- class messes.extract.extract.TagParser[source]
Creates parser objects that convert tagged .xlsx worksheets into nested dictionary structures for metadata capture.
- compare(otherMetadata: dict, groupSize: int = 5, file: TextIO | None = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) bool [source]
Compare current metadata to other metadata.
- Parameters
- Returns
True if otherMetadata and self.extraction are different, False otherwise.
- Return type
- deleteMetadata(sections: list[list[str]])[source]
Delete sections of metadata based on given section descriptions.
- findParent(parentID: str) tuple[str, dict] | None [source]
Returns parent record for given parentID.
- generateLineages() dict [source]
Generates and returns parent-child record lineages.
- Returns
lineages by tableKey.
- Return type
- static isComparable(value1: str, value2: str) bool [source]
Compares the two values first as strings and then as floats if convertible.
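The described comparison can be sketched as follows (a minimal re-implementation for illustration, not the messes source):

```python
def is_comparable(value1: str, value2: str) -> bool:
    """Compare two values as strings first, then as floats if both convert."""
    if value1 == value2:
        return True
    try:
        return float(value1) == float(value2)
    except ValueError:
        return False

print(is_comparable("1.0", "1"))    # True: equal as floats
print(is_comparable("abc", "abc"))  # True: equal as strings
print(is_comparable("abc", "abd"))  # False
```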
- static isGoogleSheetsFile(string: str) bool [source]
Tests whether the string is a Google Sheets URL.
- static loadSheet(fileName: str | TextIO, sheetName: str, removeRegex: str | None = None, isDefaultSearch: bool = False) tuple[str, str, pandas.core.frame.DataFrame] | None [source]
Load and return worksheet as a pandas data frame.
- Parameters
fileName (str | TextIO) – filename or sys.stdin to read a csv from stdin.
sheetName (str) – sheet name for an Excel file, ignored if not an Excel file. Can be a regular expression to search for a sheet.
removeRegex (str | None) – a string to pass to DataFrame.replace() to replace characters with an empty string in the dataframe that is read in. Can be a regex. Set to None to not replace anything.
isDefaultSearch (bool) – whether or not the sheetName is using default values, determines whether to print some messages.
- Returns
None if the worksheet is empty, else (fileName, sheetName, dataFrame)
- Raises
Exception – If fileName is invalid.
- Return type
- merge(newMetadata: dict)[source]
Merges new metadata with current metadata.
- Parameters
newMetadata (dict) – dict to merge with self.extraction dict.
- modify(modificationDirectives: dict)[source]
Applies modificationDirectives to the extracted metadata.
- Parameters
modificationDirectives (dict) – contains the modifications to apply.
- parseSheet(fileName: str, sheetName: str, worksheet: DataFrame)[source]
Extracts useful metadata from the worksheet and puts it in the extraction dictionary.
- static printLineages(lineages: ~collections.defaultdict, indentation: int, groupSize: int = 5, file: ~typing.TextIO = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]
Prints the given lineages.
- Parameters
lineages (defaultdict) – dictionary where the keys are table names and values are a dictionary of parentID and children.
indentation (int) – number of spaces of indentation to print.
groupSize (int) – number of childIDs to print per line.
file (TextIO) –
- readDirectives(source: str, sheetName: str, directiveType: str, removeRegex: str | None, isDefaultSearch: bool = False) dict [source]
Read directives source of a given directive type.
- Parameters
source (str) – file path.
sheetName (str) – sheet name for an Excel file, ignored if not an Excel file.
directiveType (str) – either “modification” or “automation” to call the correct parsing function.
removeRegex (str | None) – a string to pass to DataFrame.replace() to replace characters with an empty string in the dataframe that is read in. Can be a regex. Set to None to not replace anything. Passed to loadSheet.
isDefaultSearch (bool) – whether or not the source is using default values, passed to loadSheet for message printing.
- Returns
The directives that were read in.
- Return type
- readMetadata(metadataSource: str, automationSource: str, automateDefaulted: bool, modificationSource: str, modifyDefaulted: bool, removeRegex: str | None, saveExtension: str = None)[source]
Reads metadata from source.
- Parameters
metadataSource (str) – file path to metadata file with possibly a sheetname if appropriate.
automationSource (str) – file path to automation file or a sheetname.
automateDefaulted (bool) – whether the automation source is the default value or not, passed to readDirectives for message printing.
modificationSource (str) – file path to modification file or a sheetname.
modifyDefaulted (bool) – whether the modification source is the default value or not, passed to readDirectives for message printing.
removeRegex (str | None) – a string to pass to DataFrame.replace() to replace characters with an empty string in the dataframe that is read in. Can be a regex. Set to None to not replace anything. Passed to loadSheet and readDirectives.
saveExtension (str) – if "csv" saves the export as a csv file, else saves it as an Excel file.
- saveSheet(fileName: str, sheetName: str, worksheet: DataFrame, saveExtension: str)[source]
Save given worksheet in the given format.
- exception messes.extract.extract.TagParserError(message: str, fileName: str, sheetName: str, rowIndex: int, columnIndex: int, endMessage: str = '')[source]
Exception class for errors thrown by TagParser.
- Parameters
message (str) – start of the message for the exception.
fileName (str) – the file name where the exception happened.
sheetName (str) – the sheet name in the Excel file where the exception happened.
rowIndex (int) – the row index in the tabular file where the exception happened.
columnIndex (int) – the column index in the tabular file where the exception happened.
endMessage (str) – the optional end of the message for the exception.
- static columnName(columnIndex: int) str [source]
Returns Excel-style column name for columnIndex (integer).
- Parameters
columnIndex (int) – index of the column in the spreadsheet.
- Returns
If columnIndex is less than 0, returns ""; else returns the capital letter(s) of the Excel column. Ex: columnIndex = 3 returns "D".
- Return type
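The Excel-style column naming described above can be sketched as a standalone function (a re-implementation of the documented behavior, not the messes source):

```python
def column_name(column_index: int) -> str:
    """Excel-style column name for a 0-based index; "" for negative indexes.

    Excel columns use "bijective base-26": A..Z, then AA, AB, ...
    """
    if column_index < 0:
        return ""
    name = ""
    column_index += 1  # shift to 1-based for the bijective-base conversion
    while column_index > 0:
        column_index, remainder = divmod(column_index - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name

print(column_name(3))    # "D"
print(column_name(25))   # "Z"
print(column_name(26))   # "AA"
print(column_name(-1))   # ""
```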
validate
Validate JSON files.
- Usage:
- messes validate json <input_JSON> [--pds=<pds> [--csv | --xlsx | --json | --gs] | --no_base_schema]
[--no_extra_checks] [--additional=<add_schema>...] [--format=<format>] [--silent=<level>]
- messes validate save-schema <output_schema> [--input=<input_JSON>]
[--pds=<pds> [--csv | --xlsx | --json | --gs]] [--format=<format>] [--silent=<level>]
- messes validate schema <input_schema>
- messes validate pds <pds> [--csv | --xlsx | --json | --gs] [--silent=<level>] [--save=<output_name>]
- messes validate pds-to-table <pds_json> <output_name> [<output_filetype>]
- messes validate pds-to-json <pds_tabular> [--csv | --xlsx | --gs] <output_name>
- messes validate cd-to-json-schema <conversion_directives> [--csv | --xlsx | --json | --gs] <output_schema>
- messes validate --help
<input_JSON> - if '-' read from standard input.
<pds> - can be a JSON, csv, xlsx, or Google Sheets file. If xlsx or Google Sheets, the default sheet name to read in is #validate; to specify a different sheet name, separate it from the file name with a colon, ex: file_name.xlsx:sheet_name. If '-' read from standard input.
<input_schema> - must be a valid JSON Schema file. If '-' read from standard input.
<output_schema> - if '-' save to standard output.
<output_name> - path to save tabular pds to, if '-' save to standard output as CSV.
<output_filetype> - "xlsx" or "csv", defaults to "csv".
<conversion_directives> - can be a JSON, csv, xlsx, or Google Sheets file. If xlsx or Google Sheets, the default sheet name to read in is #convert; to specify a different sheet name, separate it from the file name with a colon, ex: file_name.xlsx:sheet_name. If '-' read from standard input.
- Options:
- -h, --help
show this screen.
- -v, --version
show version.
- --silent <level>
if "full" silence all warnings, if "nuisance" silence warnings that are more likely to be a nuisance, if "none" do not silence warnings [default: none].
- --pds <pds>
a protocol-dependent schema file, can be a JSON, csv, or xlsx file.
If xlsx the default sheet name to read in is #validate, to specify a different sheet name separate it from the file name with a colon ex: file_name.xlsx:sheet_name.
- --csv
indicates that the protocol-dependent schema file is a csv (comma delimited) file.
- --xlsx
indicates that the protocol-dependent schema file is an xlsx (Excel) file.
- --json
indicates that the protocol-dependent schema file is a JSON file.
- --gs
indicates that the protocol-dependent schema file is a Google Sheets file.
If a file type is not given then it will be guessed from the file extension.
- --additional <add_schema>
an additional JSON Schema file that will be used to validate <input_JSON>.
- --format <format>
additional validation done for the desired supported format.
- Current supported formats:
mwtab
- --no_base_schema
don’t validate with the base JSON schema.
- --no_extra_checks
only do JSON Schema validation and nothing else.
- --input <input_JSON>
optionally give an input JSON file to save-schema to reproduce the
schema used to validate in the json command.
- --save <output_name>
save the JSON Schema created from the protocol-dependent schema.
The "json" command will validate the <input_JSON> against the internal base_schema, and optional schema provided by the --pds and --additional options. To validate only against a provided schema, use the --additional and --no_base_schema options.
The "save-schema" command will save the internal base_schema to the <output_schema> location. If --pds is given then it will be parsed and placed into the base_schema. If --input is given, the protocols table will be added in with the PDS to reproduce what happens in the json command. If --format is used, then that format schema is saved instead of the base_schema.
The "schema" command will validate the <input_schema> against the JSON Schema meta schema.
The "pds" command will validate that the <pds> file is a valid protocol-dependent schema file. If the --save option is given, then save the built JSON Schema.
The “pds-to-table” command will read in a protocol-dependent schema in JSON form and save it out in a tabular form.
The “pds-to-json” command will read in a protocol-dependent schema in tabular form and save it out in a JSON form.
The “cd-to-json-schema” command will read in conversion directives and create a JSON Schema template file that can be filled in and used to validate files that will be converted using those directives.
- messes.validate.validate.SS_protocol_check(input_json: dict[str, Any] | list | str | int | float | None) None [source]
Validates the subjects and samples protocols.
Loops over the entity table in input_json and makes sure that each sample/subject has protocols of the correct type depending on its inheritance. Samples that have a sample parent must have a sample_prep type protocol. Samples that have a subject parent must have a collection type protocol. Subjects must have a treatment type protocol.
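The inheritance-based protocol requirements above can be sketched with toy data. This is a hypothetical re-implementation for illustration only; the actual table layout and function internals in messes may differ:

```python
# Determine the required protocol type for an entity, per the rules above:
# subjects need "treatment"; samples with a sample parent need "sample_prep";
# samples with a subject parent need "collection".
# The dict-of-dicts table layout here is an assumption for illustration.
def required_protocol_type(entity: dict, entities: dict) -> str:
    if entity["type"] == "subject":
        return "treatment"
    parent = entities.get(entity.get("parent_id"), {})
    if parent.get("type") == "sample":
        return "sample_prep"
    return "collection"

entities = {
    "subject1": {"type": "subject", "parent_id": None},
    "sample1": {"type": "sample", "parent_id": "subject1"},
    "sample2": {"type": "sample", "parent_id": "sample1"},
}
for name, entity in entities.items():
    print(name, required_protocol_type(entity, entities))
```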
- messes.validate.validate.add_protocols_to_PDS(protocol_table: dict, pds: dict[str, Any] | list | str | int | float | None, silent: str) dict[str, Any] | list | str | int | float | None [source]
Add the protocols from the table to the protocol-dependent schema.
- Parameters
- Returns
The updated protocol-dependent schema.
- Return type
- messes.validate.validate.build_PD_schema(pds: dict[str, Any] | list | str | int | float | None) dict[str, Any] | list | str | int | float | None [source]
Build a JSON schema from the protocol-dependent schema.
- messes.validate.validate.check(self, instance: object, format: str) None [source]
Check whether the instance conforms to the given format.
Modified from jsonschema.FormatChecker.check. Used to raise an error on the custom “integer”, “str_integer”, “numeric”, and “str_numeric” formats so their values can be cast to int and float appropriately.
- Parameters
- Raises
FormatError – if the instance does not conform to the given format; if the instance does conform to the "integer", "str_integer", "numeric", or "str_numeric" formats; or if the instance is not a string and the format is "str_integer" or "str_numeric".
- Return type
None
- messes.validate.validate.convert_formats(validator: Validator, instance: dict | str | list) dict | str | list [source]
Convert “integer” and “numeric” formats to int and float.
Special function to iterate over JSON schema errors and if the custom “integer”, “str_integer”, “numeric”, and “str_numeric” formats are found, converts that value in the instance to the appropriate type. If the value is not a string and the format is “str_integer” or “str_numeric”, prints an error.
- messes.validate.validate.create_validator(schema: dict[str, Any] | list | str | int | float | None) Validator [source]
Create a validator for the given schema.
- Parameters
schema (dict[str, Any] | list | str | int | float | None) – the JSON schema to create a validator for.
- Returns
A jsonschema.protocols.Validator to validate the schema with an added format checker that is aware of the custom formats “integer”, “str_integer”, “numeric”, and “str_numeric”.
- Return type
Validator
- messes.validate.validate.factors_checks(input_json: dict[str, Any] | list | str | int | float | None, silent: str) None [source]
Validates some logic about the factors.
Checks that every factor in the factor table is used at least once by an entity, whether values in the factor field are allowed values, whether there is more than 1 allowed value in the factor field, and that factor fields are str or list types.
- messes.validate.validate.id_check(JSON_file: dict[str, Any] | list | str | int | float | None) None [source]
Validate id fields for records in JSON_file.
Loops over JSON_file and makes sure each field with a period in the name is an id, that each id points to an existing id in another table that exists in JSON_file, that each “parent_id” field points to another record that exists in the same table, and that each “id” field has a value that is the same as the name of the record.
There is a special check for the “entity” table that checks that subject types have a sample type parent.
- messes.validate.validate.indexes_of_duplicates_in_list(list_of_interest: list, value_to_find: Any) list[int] [source]
Returns a list of all of the indexes in list_of_interest where the value equals value_to_find.
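The documented behavior amounts to an index scan, which can be sketched in one line (a re-implementation for illustration, not the messes source):

```python
from typing import Any

def indexes_of_duplicates_in_list(list_of_interest: list, value_to_find: Any) -> list:
    """Return every index in list_of_interest whose value equals value_to_find."""
    return [i for i, value in enumerate(list_of_interest) if value == value_to_find]

print(indexes_of_duplicates_in_list(["a", "b", "a", "c", "a"], "a"))  # [0, 2, 4]
```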
- messes.validate.validate.iterate_string_or_list(str_or_list: str | list) list [source]
If str_or_list is a string then make it into a list and return the items for looping.
If str_or_list is a list then return it as is.
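A sketch of the described normalization (a re-implementation for illustration, not the messes source):

```python
def iterate_string_or_list(str_or_list):
    """Wrap a bare string in a list so callers can always loop; lists pass through."""
    return [str_or_list] if isinstance(str_or_list, str) else str_or_list

print(iterate_string_or_list("treatment"))   # ["treatment"]
print(iterate_string_or_list(["a", "b"]))    # ["a", "b"]
```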
- messes.validate.validate.measurement_protocol_check(input_json: dict[str, Any] | list | str | int | float | None) None [source]
Loops over the measurement table in input_json and makes sure that each measurement has at least one measurement type protocol.
- messes.validate.validate.mwtab_checks(input_json: dict) None [source]
Check that the input_json is ready for mwtab conversion.
- Run checks that cannot be done by JSON Schema. They are the following:
Check that at least 1 protocol has the "machine_type" field.
Check that the first collection protocol has a "sample_type" field.
Check that there is at least 1 protocol each of collection type, treatment type, and sample_prep type.
Check that the first subject has the "species", "species_type", and "taxonomy_id" fields.
- Parameters
input_json (dict) – the JSON to perform the checks on.
- Return type
None
- messes.validate.validate.print_better_error_messages(errors_generator: Iterable[ValidationError]) bool [source]
Print better error messages for jsonschema validation errors.
- messes.validate.validate.protocol_all_used_check(input_json: dict[str, Any] | list | str | int | float | None, tables_with_protocols: list[str]) None [source]
Validates that all protocols in the protocol table are used at least once.
Compiles a list of all of the protocols used by the records in tables_with_protocols and checks that every protocol in the protocol table is in that list. For any protocols that appear in the protocol table but are not used by any records, a warning is printed.
- messes.validate.validate.protocol_description_check(input_json: dict[str, Any] | list | str | int | float | None) None [source]
Checks that every description field for the protocols in the protocol table of the metadata are unique.
- messes.validate.validate.read_and_validate_PDS(filepath: str, is_csv: bool, is_xlsx: bool, is_json: bool, is_gs: bool, no_last_message: bool, silent: str) dict[str, Any] | list | str | int | float | None [source]
Read in the protocol-dependent schema from filepath and validate it.
- Parameters
filepath (str) – the path to the protocol-dependent schema or “-” meaning to read from stdin.
is_csv (bool) – whether the protocol-dependent schema is a csv file, used for reading from stdin.
is_xlsx (bool) – whether the protocol-dependent schema is a xlsx file.
is_json (bool) – whether the protocol-dependent schema is a json file, used for reading from stdin.
is_gs (bool) – whether the protocol-dependent schema is a Google Sheets file.
no_last_message (bool) – if True do not print a message about the protocol-dependent schema being invalid and execution stopping.
silent (str) – if "full" do not print any warnings, if "nuisance" do not print nuisance warnings.
- Returns
The protocol-dependent schema.
- Raises
SystemExit – Will raise errors if filepath does not exist or there is a read in error.
- Return type
- messes.validate.validate.read_in_JSON_file(filepath: str, description: str) dict[str, Any] | list | str | int | float | None [source]
Read in a JSON file from filepath.
- Parameters
- Returns
The JSON file.
- Raises
SystemExit – Will raise errors if filepath does not exist or there is a read in error.
- Return type
- messes.validate.validate.read_json_or_tabular_file(filepath: str, is_csv: bool, is_xlsx: bool, is_json: bool, is_gs: bool, file_title: str, default_sheet_name: str, silent: str) dict[str, Any] | list | str | int | float | None [source]
Read in a file from filepath.
- Parameters
filepath (str) – the path to the file or “-” meaning to read from stdin.
is_csv (bool) – whether the file is a csv file, used for reading from stdin.
is_xlsx (bool) – whether the file is a xlsx file.
is_json (bool) – whether the file is a json file, used for reading from stdin.
is_gs (bool) – whether the file is a Google Sheets file.
file_title (str) – a string to use for printing error messages about the file.
default_sheet_name (str) – sheet name to default to for Excel and Google Sheets files.
silent (str) – if "full" do not print any warnings, if "nuisance" do not print nuisance warnings.
- Returns
The file contents.
- Raises
SystemExit – Will raise errors if filepath does not exist or there is a read in error.
- Return type
- messes.validate.validate.run_conversion_directives_to_json_schema_command(conversion_directives_source: str, is_csv: bool, is_xlsx: bool, is_json: bool, is_gs: bool, output_schema_path: str, silent: str) None [source]
Run the cd-to-json-schema command.
- Parameters
conversion_directives_source (str) – either a filepath or “-” to read from stdin.
is_csv (bool) – if True the conversion_directives_source is a csv file.
is_xlsx (bool) – if True the conversion_directives_source is an xlsx file.
is_json (bool) – if True the conversion_directives_source is a JSON file.
is_gs (bool) – if True the conversion_directives_source is a Google Sheets file.
silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.
output_schema_path (str) –
- Return type
None
- messes.validate.validate.run_json_command(input_json_source: str, pds_source: str | None, additional_schema_sources: list[str], no_base_schema: bool = False, no_extra_checks: bool = False, is_csv: bool = False, is_xlsx: bool = False, is_json: bool = False, is_gs: bool = False, silent: str = 'none', format_check: str | None = None) None [source]
Run the json command.
- Parameters
input_json_source (str) – either a filepath or “-” to read from stdin.
pds_source (str | None) – either a filepath or “-” to read from stdin, if not None.
additional_schema_sources (list[str]) – either a filepath or “-” to read from stdin, if not None.
no_base_schema (bool) – if True do not validate with the base_schema, ignored if pds_source is given.
no_extra_checks (bool) – if True only do JSON Schema validations.
is_csv (bool) – if True the pds_source is a csv file.
is_xlsx (bool) – if True the pds_source is an xlsx file.
is_json (bool) – if True the pds_source is a JSON file.
is_gs (bool) – if True the pds_source is a Google Sheets file.
silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.
format_check (str | None) –
- Return type
None
- messes.validate.validate.run_pds_command(pds_source: str, output_path: str | None = None, is_csv: bool = False, is_xlsx: bool = False, is_json: bool = False, is_gs: bool = False, silent: str = 'none') None [source]
Run the pds command.
- Parameters
pds_source (str) – either a filepath or “-” to read from stdin.
output_path (str | None) – if given then save the JSON Schema from the PDS.
is_csv (bool) – if True the pds_source is a csv file.
is_xlsx (bool) – if True the pds_source is an xlsx file.
is_json (bool) – if True the pds_source is a JSON file.
is_gs (bool) – if True the pds_source is a Google Sheets file.
silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.
- Return type
None
- messes.validate.validate.run_pds_to_json_command(pds_source: str, is_csv: bool, is_xlsx: bool, is_gs: bool, output_path: str, silent: str = 'none') None [source]
Run the pds-to-json command.
- Parameters
pds_source (str) – either a filepath or “-” to read from stdin.
is_csv (bool) – if True the pds_source is a csv file.
is_xlsx (bool) – if True the pds_source is an xlsx file.
is_gs (bool) – if True the pds_source is a Google Sheets file.
output_path (str) – either a filepath or “-” to write to stdout.
silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.
- Return type
None
- messes.validate.validate.run_pds_to_table_command(pds_source: str, output_path: str, output_filetype: str, silent: str = 'none') None [source]
Run the pds-to-table command.
- Parameters
- Return type
None
- messes.validate.validate.run_save_schema_command(pds_source: str | None, output_schema_path: str, input_json_path: str | None, is_csv: bool = False, is_xlsx: bool = False, is_json: bool = False, is_gs: bool = False, silent: str = 'none', format_check: str | None = None) None [source]
Run the save-schema command.
- Parameters
pds_source (str | None) – either a filepath or “-” to read from stdin, if not None.
output_schema_path (str) – the path to save the output JSON to.
input_json_path (str | None) – either a filepath or “-” to read from stdin, if not None.
is_csv (bool) – if True the pds_source is a csv file.
is_xlsx (bool) – if True the pds_source is an xlsx file.
is_json (bool) – if True the pds_source is a JSON file.
is_gs (bool) – if True the pds_source is a Google Sheets file.
silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.
format_check (str | None) –
- Return type
None
- messes.validate.validate.run_schema_command(input_schema_source: str) None [source]
Run the schema command.
- Parameters
input_schema_source (str) – the path to the JSON Schema file to read and validate.
- Return type
None
- messes.validate.validate.save_out_JSON_file(filepath: str, json_to_save: dict) None [source]
Handle renaming and directing JSON to the correct output.
- messes.validate.validate.validate_JSON_schema(user_json_schema: dict[str, Any] | list | str | int | float | None) bool [source]
Validate an arbitrary JSON schema.
- messes.validate.validate.validate_PDS_parent_protocols(pds: dict[str, Any] | list | str | int | float | None, silent: str) bool [source]
Validate the parent_protocols table of the protocol-dependent schema.
- Parameters
pds (dict[str, Any] | list | str | int | float | None) – the protocol-dependent schema to validate.
silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.
- Returns
True if there were errors (warnings don’t count), False otherwise.
- Return type
bool
- messes.validate.validate.validate_parent_id(table: dict, table_name: str, entity_name: str, check_type: bool, type_keyword: str = 'type') bool [source]
Validate the “parent_id” fields for the table.
- Parameters
table (dict) – the table to validate {record_name:{attribute1:value1, …}, …}.
table_name (str) – the name of the table, used for printing better error messages.
entity_name (str) – name of the entities of the table, used for printing better error messages.
check_type (bool) – if True check that the type of the parent is the same as the child.
type_keyword (str) – the keyword to use to check the types of the parent and child.
- Returns
True if there were errors, False otherwise.
- Return type
bool
convert
Convert JSON data to another JSON format.
- Usage:
messes convert mwtab (ms | nmr | nmr_binned) <input_JSON> <output_name> [--update <conversion_directives> | --override <conversion_directives>] [--silent]
messes convert save-directives mwtab (ms | nmr | nmr_binned) <output_filetype> [<output_name>]
messes convert generic <input_JSON> <output_name> <conversion_directives> [--silent]
messes convert --help
- <conversion_directives> - can be a JSON, csv, xlsx, or Google Sheets file. If xlsx or Google Sheets, the default sheet name to read in is #convert;
to specify a different sheet name, separate it from the file name with a colon, e.g. file_name.xlsx:sheet_name.
<output_filetype> - “json”, “xlsx”, or “csv”
- Options:
- -h, --help
show this screen.
- -v, --version
show version.
- --silent
silence all warnings.
- --update <conversion_directives>
conversion directives that will be used to update the built-in directives for the format.
This is intended for simple changes such as updating the value of the analysis ID. You only have to specify what needs to change; any values that are left out of the update directives won’t be changed. If you need to remove directives, use the override option.
- --override <conversion_directives>
conversion directives that will be used to override the built-in directives for the format.
The built-in directives will not be used and these will be used instead.
The general command structure for convert is convert <format>, which converts an input JSON file to the indicated supported format. These commands save both the JSON conversion and the final format file.
The generic command works the same as the supported formats, except that the user must supply conversion directives specifying how to convert the input JSON to the desired output JSON. Only an output JSON is saved.
The save-directives command saves the default conversion directives used by convert for any of the supported formats. <output_filetype> can be one of “json”, “xlsx”, or “csv”. The file is saved as “format_conversion_directives.ext”, where “.ext” is replaced with “.json”, “.xlsx”, or “.csv” depending on the value of <output_filetype>, unless <output_name> is given.
- messes.convert.convert.compute_matrix_value(input_json: dict, conversion_table: str, conversion_record_name: str, conversion_attributes: dict, required: bool, silent: bool = False) list[dict] | None [source]
Determine the matrix value for the conversion directive.
- Parameters
input_json (dict) – the data to build the matrix from.
conversion_table (str) – the name of the table the conversion record came from, used for good error messaging.
conversion_record_name (str) – the name of the conversion record, used for good error messaging.
conversion_attributes (dict) – the fields and values of the conversion record.
required (bool) – if True then any problems during execution are errors and the program should exit, else it’s just a warning.
silent (bool) – if True don’t print warning messages.
- Returns
the list of dicts for the directive or None if there was a problem and the directive is not required.
- Return type
list[dict] | None
- messes.convert.convert.compute_string_value(input_json: dict, conversion_table: str, conversion_record_name: str, conversion_attributes: dict, required: bool, silent: bool = False) str | None [source]
Determine the string value for the conversion directive.
- Parameters
input_json (dict) – the data to build the value from.
conversion_table (str) – the name of the table the conversion record came from, used for good error messaging.
conversion_record_name (str) – the name of the conversion record, used for good error messaging.
conversion_attributes (dict) – the fields and values of the conversion record.
required (bool) – if True then any problems during execution are errors and the program should exit, else it’s just a warning.
silent (bool) – if True don’t print warning messages.
- Returns
the str value for the directive or None if there was a problem and the directive is not required.
- Return type
str | None
- messes.convert.convert.directives_to_table(conversion_directives: dict) DataFrame [source]
Convert conversion directives to a tagged table form.
- Parameters
conversion_directives (dict) – the conversion directives to transform.
- Returns
a pandas DataFrame that can be saved to csv or xlsx.
- Return type
DataFrame
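As a rough illustration of what directives_to_table does, the nested directive structure {table: {record: {attribute: value}}} can be flattened into one row per conversion record before being handed to pandas. This is a hypothetical sketch, not the messes implementation; the "table" and "record" column names are assumptions.

```python
# Hypothetical flattening of conversion directives into row dicts; the real
# directives_to_table returns a pandas DataFrame in a tagged table form
# whose exact column names may differ.
def directives_to_rows(conversion_directives):
    """Flatten {table: {record: {attribute: value}}} into a list of row dicts."""
    rows = []
    for table_name, records in conversion_directives.items():
        for record_name, attributes in records.items():
            row = {"table": table_name, "record": record_name}
            row.update(attributes)  # one column per conversion attribute
            rows.append(row)
    return rows

directives = {"ANALYSIS": {"ANALYSIS_TYPE": {"value": "MS", "type": "str"}}}
rows = directives_to_rows(directives)
```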
- messes.convert.convert.handle_code_field(input_json: dict, conversion_table: str, conversion_record_name: str, conversion_attributes: dict, required: bool, silent: bool = False) Any [source]
If conversion_attributes has code and/or import fields then import and run the code appropriately.
- Parameters
input_json (dict) – dict that the code is likely to operate on.
conversion_table (str) – the name of the table the conversion record came from, used for good error messaging.
conversion_record_name (str) – the name of the conversion record, used for good error messaging.
conversion_attributes (dict) – the fields and values of the conversion record.
required (bool) – if True then any problems during execution are errors and the program should exit, else it’s just a warning.
silent (bool) – if True don’t print warning messages.
- Returns
the result of eval() or None if there was no “code” field in conversion_attributes.
- Return type
Any
- messes.convert.convert.update(original_dict: dict, upgrade_dict: dict) dict [source]
Update a dictionary in a nested fashion.
- Parameters
original_dict (dict) – the dictionary to update.
upgrade_dict (dict) – the dictionary of updates to apply.
- Returns
the updated dictionary.
- Return type
dict
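A nested update of this kind (and the behavior described for the --update option) can be sketched as a recursive merge: only keys present in the upgrade dict are changed, and nested dicts are merged rather than replaced wholesale. This is a minimal sketch of the semantics, not the actual messes implementation.

```python
# Minimal sketch of a nested dictionary update: keys absent from
# upgrade_dict are left untouched, and nested dicts merge recursively.
def nested_update(original_dict, upgrade_dict):
    for key, value in upgrade_dict.items():
        if isinstance(value, dict) and isinstance(original_dict.get(key), dict):
            nested_update(original_dict[key], value)  # merge nested dicts
        else:
            original_dict[key] = value  # leaf values are overwritten
    return original_dict

built_in = {"ANALYSIS": {"ANALYSIS_TYPE": {"value": "MS"},
                         "ANALYSIS_ID": {"value": "AN000001"}}}
user_update = {"ANALYSIS": {"ANALYSIS_ID": {"value": "AN000002"}}}
merged = nested_update(built_in, user_update)
```

Note how ANALYSIS_TYPE survives the merge even though the update dict never mentions it; an override, by contrast, would discard the built-in dict entirely.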
Validates user input, erroring early and allowing the rest of the program to assume inputs are sanitized.
- messes.convert.user_input_checking.validate_conversion_directives(conversion_directives: dict, schema: dict)[source]
Validate conversion directives.
Wraps around jsonschema.validate() to give more human readable errors for most validation errors.
Functions For mwtab Format
- messes.convert.mwtab_functions.create_sample_lineages(input_json: dict, entity_table_name: str = 'entity', parent_key: str = 'parent_id') dict [source]
Determine all the ancestors, parents, and siblings for each entity in the entity table.
The returned dictionary is of the form:
{entity_id: {"ancestors": [ancestor0, ancestor1, ...],
"parents": [parent0, parent1, ...],
"siblings": [sibling0, sibling1, ...]},
...
}
Parents are the immediate ancestors an entity comes from; they are also included in the ancestors list.
- Parameters
input_json (dict) – the data to build the lineages from.
entity_table_name (str) – the name of the table in input_json where the entities are.
parent_key (str) – the field that points to the parent of the record.
- Returns
a dictionary where the keys are the entity ids and the values are a dictionary of its ancestors, parents, and siblings.
- Return type
dict
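The ancestor/parent/sibling computation can be sketched from parent_id links alone: parents come straight from the field, ancestors are found by walking parent links upward, and siblings are entities that share a parent. This is a hypothetical re-implementation for illustration; the real create_sample_lineages may handle multiple parents and edge cases differently.

```python
# Hypothetical sketch of building lineages from parent_id links.
def sketch_lineages(entities, parent_key="parent_id"):
    parents = {eid: [rec[parent_key]] if parent_key in rec else []
               for eid, rec in entities.items()}

    def ancestors(eid):
        # Walk parent links upward; parents appear first, then their ancestors.
        found = []
        for parent in parents[eid]:
            found.append(parent)
            found.extend(ancestors(parent))
        return found

    return {eid: {"ancestors": ancestors(eid),
                  "parents": parents[eid],
                  "siblings": [other for other in entities
                               if other != eid
                               and set(parents[other]) & set(parents[eid])]}
            for eid in entities}

entities = {"subject1": {},
            "sample1": {"parent_id": "subject1"},
            "sample2": {"parent_id": "subject1"},
            "sample3": {"parent_id": "sample1"}}
lineages = sketch_lineages(entities)
```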
- messes.convert.mwtab_functions.create_subject_sample_factors(input_json: dict, measurement_table_name: str = 'measurement', sibling_match_field: str = 'protocol.id', sibling_match_value: str = 'protein_extraction', sample_id_key: str = 'entity.id', entity_table_name: str = 'entity', entity_type_key: str = 'type', subject_type_value: str = 'subject', parent_key: str = 'parent_id', factor_table_name: str = 'factor', factor_field_key: str = 'field', factor_allowed_values_key: str = 'allowed_values', protocol_table_name: str = 'protocol', protocol_field: str = 'protocol.id', protocol_type_field: str = 'type', measurement_type_value: str = 'measurement', data_files_key: str = 'data_files', data_files_attribute_key: str = 'data_files%entity_id', lineage_field_exclusion_list: list[str] | tuple[str] = ('study.id', 'project.id', 'parent_id')) list[dict] [source]
Create the SUBJECT_SAMPLE_FACTORS section of the mwTab JSON.
- Parameters
input_json (dict) – the data to build from.
measurement_table_name (str) – the name of the table in input_json where the measurements are.
sibling_match_field (str) – the field to use to determine if a sibling should be added to the SSF.
sibling_match_value (str) – the value to use to determine if a sibling should be added to the SSF.
sample_id_key (str) – the field in the measurement that has the sample id associated with it.
entity_table_name (str) – the name of the table in input_json where the entities are.
entity_type_key (str) – the field in entity records where the type is located.
subject_type_value (str) – the value in the type key that means the entity is a subject.
parent_key (str) – the field that points to the parent of the record.
factor_table_name (str) – the name of the table in input_json where the factors are.
factor_field_key (str) – the field in factor records that tells what the factor field is in other records.
factor_allowed_values_key (str) – the field in factor records where the allowed values for that factor are.
protocol_table_name (str) – the name of the table in input_json where the protocols are.
protocol_field (str) – the field in records that contains the protocol(s) of the record.
protocol_type_field (str) – the field in protocol records where the type is located.
measurement_type_value (str) – the value in the type key that means the protocol is a measurement type.
data_files_key (str) – the field in a measurement type protocol record where the file names are located.
data_files_attribute_key (str) – the field in a measurement type protocol record where the corresponding entity_id to raw file names are located.
lineage_field_exclusion_list (list[str] | tuple[str]) – the fields in entity records that should not be added as additional data.
- Returns
a list of SUBJECT_SAMPLE_FACTORS.
- Return type
list[dict]
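To make the returned list[dict] concrete, a single SUBJECT_SAMPLE_FACTORS entry pairs a subject with a sample, its factors, and any additional lineage data. The key names below follow the mwTab JSON convention as commonly seen in mwTab files; they are assumptions for illustration, not taken from the messes source.

```python
# Hypothetical shape of one SUBJECT_SAMPLE_FACTORS entry; key names are
# assumed from the mwTab JSON convention, not from the messes source.
ssf_entry = {
    "Subject ID": "subject1",
    "Sample ID": "sample1",
    "Factors": {"Treatment": "naive"},
    "Additional sample data": {"RAW_FILE_NAME": "sample1.raw"},
}
subject_sample_factors = [ssf_entry]  # the function returns a list of these
```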