API

This package contains the following modules:

extract
validate
convert

extract

Extract data from Excel workbooks, csv files, and JSON files.

Usage:

messes extract <metadata_source>... [--delete <metadata_section>...] [options]
messes extract --help

<metadata_source> - tagged input metadata source as csv/json filename, or
    xlsx_filename[:worksheet_name|regular_expression], or
    google_sheets_url[:worksheet_name|regular_expression].
    The "#export" worksheet name is the default.

Options:
-h, --help
  • show this help documentation.

-v, --version
  • show the version.

--silent
  • print no warning messages.

--output <filename_json>
  • output json filename.

--compare <filename_json>
  • compare extracted metadata to given JSONized metadata.

--modify <source>
  • modification directives worksheet name, regular expression, csv/json filename, or
    xlsx_filename[:worksheet_name|regular_expression] or
    google_sheets_url[:worksheet_name|regular_expression] [default: #modify].

--end-modify <source>
  • apply modification directives after all metadata merging. Requires csv/json filename, or
    xlsx_filename[:worksheet_name|regular_expression], or
    google_sheets_url[:worksheet_name|regular_expression].

--automate <source>
  • automation directives worksheet name, regular expression, csv/json filename, or
    xlsx_filename[:worksheet_name|regular_expression] or
    google_sheets_url[:worksheet_name|regular_expression] [default: #automate].

--save-directives <filename_json>
  • output filename with modification and automation directives in JSON format.

--save-export <filetype>
  • output the export worksheet with suffix "_export" and with the indicated xlsx/csv format extension.

--show <show_option>
  • show a part of the metadata. See options below.

--delete <metadata_section>...
  • delete a section of the JSONized metadata. Section format is
    tableKey or tableKey,IDKey or tableKey,IDKey,fieldName.
    These can be regular expressions.

--keep <metadata_tables>
  • only keep the selected tables and delete the rest. Table format is
    tableKey,tableKey,... The tableKey can be a regular expression.

--file-cleaning <remove_regex>
  • a string or regular expression used to remove characters in input files.
    Removes unicode and newline characters by default; enter "None" to
    disable [default: _x([0-9a-fA-F]{4})_|\n].

Show Options:

tables  - show tables in the extracted metadata.
lineage - show parent-child lineages per table.
all     - show every option.

Regular Expression Format:

Regular expressions have the form "r'...'" on the command line. The re.match function is used, which matches from the beginning of a string, meaning that a regular expression matches as if it starts with "^".
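This anchoring behavior can be seen directly with Python's re module:

```python
import re

# re.match only matches from the beginning of the string, so every
# pattern behaves as if it started with "^".
assert re.match(r"sample", "sample_01") is not None
assert re.match(r"^sample", "sample_01") is not None  # equivalent pattern
assert re.match(r"01", "sample_01") is None           # "01" is not at the start
```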

Directives JSON Format:
{
  "modification" : {
    table : {
      field : {
        "(exact|regex|levenshtein)-(first|first-nowarn|unique|all)" : {
          field_value : {
            "assign" : { field : field_value, ... },
            "append" : { field : field_value, ... },
            "prepend" : { field : field_value, ... },
            "regex" : { field : regex_pair, ... },
            "delete" : [ field, ... ],
            "rename" : { old_field : new_field }
          }
        }
      }
    }
  },
  "automation" : [
    {
      "header_tag_descriptions" : [
        {
          "header" : column_description,
          "tag" : tag_description,
          "required" : true|false
        }
      ],
      "exclusion_test" : exclusion_value,
      "insert" : [ [ cell_content, ... ] ]
    }
  ]
}
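As a concrete illustration, a minimal directives structure might look like the following; the "sample" table, "treatment" field, values, and tag names are invented for illustration, not taken from a real study:

```python
# Hypothetical directive: in the "sample" table, wherever the "treatment"
# field exactly equals "ctrl" (expecting a unique match), assign "control".
directives = {
    "modification": {
        "sample": {
            "treatment": {
                "exact-unique": {
                    "ctrl": {"assign": {"treatment": "control"}},
                },
            },
        },
    },
    "automation": [
        {
            "header_tag_descriptions": [
                {"header": "Sample ID", "tag": "#sample.id", "required": True},
            ],
        },
    ],
}

assert "exact-unique" in directives["modification"]["sample"]["treatment"]
```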

class messes.extract.extract.ColumnOperand(value: str | int)[source]

Represents specific worksheet cells in a given column as operands.

Initializer

Parameters

value (str | int) – a string or int that represents the value of the operand.

class messes.extract.extract.Evaluator(evalString: str, useFieldTests: bool = True, listAsString: bool = False)[source]

Creates object that calls eval with a given record.

Initializer

Parameters
  • evalString (str) – string of the form eval(...) to deliver to eval(); "eval(" and ")" will be removed.

  • useFieldTests (bool) – whether to use field tests in field name modification.

  • listAsString (bool) – whether to convert a list into a single string.

evaluate(record: dict) str | list[source]

Return eval results for the given record.

Parameters

record (dict) – record from TagParser.extraction.

Returns

The results from eval() with the record’s contents.

Return type

str | list

hasRequiredFields(record: dict) bool[source]

Returns whether the record has all required fields.

Parameters

record (dict) – record from TagParser.extraction.

Returns

True if the record has all required fields, False otherwise.

Return type

bool

static isEvalString(evalString: str) re.Match | None[source]

Tests whether the evalString is of the form r"^eval(...)$".

Parameters

evalString (str) – a string to determine whether or not it is of the eval variety.

Returns

An re.Match object if the evalString is indeed an eval string, or None if it is not.

Return type

re.Match | None
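A minimal sketch of this test, assuming the r"^eval(...)$" form described above; the real isEvalString may differ in detail:

```python
import re

def is_eval_string(eval_string: str):
    """Return an re.Match if eval_string looks like eval(...), else None."""
    return re.match(r"^eval\(.*\)$", eval_string)

assert is_eval_string("eval(#sample.id + '-extra')") is not None
assert is_eval_string("plain value") is None
```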

class messes.extract.extract.FieldMaker(field: str)[source]

Creates objects that convert specific information from a worksheet row into a field via concatenation of a list of operands.

Initializer

Parameters

field (str) – name of a field in a record from TagParser.extraction.

create(record: dict, row: Series) str[source]

Creates field-value and adds to record using row and record.

Parameters
  • record (dict) – record from TagParser.extraction.

  • row (Series) – pandas Series that is a row from metadata being parsed.

Returns

Value created by applying all operands in self.operands and written into record[self.field].

Return type

str

shallowClone() FieldMaker[source]

Returns clone with shallow copy of operands.

Returns

A copy of self, but with a shallow copy of operands.

Return type

FieldMaker
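The operand-concatenation idea behind FieldMaker can be sketched as follows; the callables standing in for ColumnOperand, LiteralOperand, and VariableOperand are simplified stand-ins, not the real classes:

```python
# Each operand produces a string from the row (or a literal), and the
# results are concatenated into the record's field.
def make_field(record: dict, row: dict, field: str, operands) -> str:
    record[field] = "".join(op(row) for op in operands)
    return record[field]

row = {0: "mouse", 1: "liver"}
operands = [
    lambda r: r[0],  # stand-in for a ColumnOperand
    lambda r: "-",   # stand-in for a LiteralOperand
    lambda r: r[1],  # stand-in for another ColumnOperand
]
record = {}
assert make_field(record, row, "id", operands) == "mouse-liver"
assert record["id"] == "mouse-liver"
```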

class messes.extract.extract.ListFieldMaker(field: str)[source]

Creates objects that convert specific information from a worksheet row into a list field via appending of a list of operands.

Initializer

Parameters

field (str) – name of a field in a record from TagParser.extraction.

create(record: dict, row: Series) list[source]

Creates field-value and adds to record using row and record.

Parameters
  • record (dict) – record from TagParser.extraction.

  • row (Series) – pandas Series that is a row from metadata being parsed.

Returns

Value created by applying all operands in self.operands and written into record[self.field].

Return type

list

shallowClone() ListFieldMaker[source]

Returns clone with shallow copy of operands.

Returns

A copy of self, but with a shallow copy of operands.

Return type

ListFieldMaker

class messes.extract.extract.LiteralOperand(value: str | int)[source]

Represents string literal operands.

Initializer

Parameters

value (str | int) – a string or int that represents the value of the operand.

class messes.extract.extract.Operand(value: str | int)[source]

Class of objects that create string operands for concatenation operations.

Initializer

Parameters

value (str | int) – a string or int that represents the value of the operand.

class messes.extract.extract.RecordMaker[source]

Creates objects that convert worksheet rows into records for specific tables.

Initializer

addColumnOperand(columnIndex: int)[source]

Add columnIndex as a column variable operand to the last FieldMaker.

Parameters

columnIndex (int) – column number to add.

addField(table: str, field: str, fieldMakerClass: messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker = <class 'messes.extract.extract.FieldMaker'>)[source]

Creates and adds new FieldMaker object.

Parameters
addGlobalField(table: str, field: str, literal: str, fieldMakerClass: messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker = <class 'messes.extract.extract.FieldMaker'>)[source]

Creates and adds new FieldMaker with literal operand that will be used as global fields for all records created from a row.

Parameters
addLiteralOperand(literal: str)[source]

Add literal as an operand to the last FieldMaker.

Parameters

literal (str) – value to append.

addVariableOperand(table: str, field: str)[source]

Add field as a variable operand to the last FieldMaker.

Parameters
  • table (str) – table name to add.

  • field (str) – field name to add.

static child(example: RecordMaker, table: str, parentIDIndex: int) RecordMaker[source]

Returns child object derived from an example object.

Parameters
  • example (RecordMaker) – RecordMaker with global literal fields.

  • table (str) – table where the child record will go.

  • parentIDIndex (int) – column index for parentID of the child record.

Returns

RecordMaker to make a new child record.

Return type

RecordMaker

create(row: Series) tuple[str, dict][source]

Returns record created from given row.

Parameters

row (Series) – pandas Series that is a row from metadata being parsed.

Returns

The table string and created record in a tuple.

Return type

tuple[str, dict]

field(table: str, field: str) messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None[source]

Returns FieldMaker for table.field.

Parameters
  • table (str) – table name to look for FieldMaker.

  • field (str) – field name to look for FieldMaker.

Returns

The FieldMaker for the table.field.

Return type

messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None

hasField(table: str, field: str, offset: int = 0) bool[source]

Returns whether a given table.field exists in the current RecordMaker.

Parameters
  • table (str) – table name to look for.

  • field (str) – field name to look for.

  • offset (int) – offset from end to stop looking for #table.field.

Returns

True if table.field exists, False otherwise.

Return type

bool

hasShortField(field: str, offset: int = 0) bool[source]

Returns whether a given field exists in the current RecordMaker.

Parameters
  • field (str) – field name to look for.

  • offset (int) – offset from end to stop looking for #table.field.

Returns

True if field exists, False otherwise.

Return type

bool

hasValidID() bool[source]

Returns whether there is a valid id field.

Returns

True if there is a valid id field, False otherwise.

Return type

bool

isInvalidDuplicateField(table: str, field: str, fieldMakerClass: messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker) bool[source]

Returns whether a given table.field is an invalid duplicate in the current RecordMaker.

Parameters
Returns

True if table.field is an invalid duplicate, False otherwise.

Return type

bool

isLastField(table: str, field: str) bool[source]

Returns whether the last FieldMaker is for table.field.

Parameters
  • table (str) – table name to look for.

  • field (str) – field name to look for.

Returns

True if the last FieldMaker is for table.field, False otherwise.

Return type

bool

properField(table: str, field: str) str[source]

Returns proper field name based on given table and field and internal self.table.

Parameters
  • table (str) – table name to check against internal table name and build proper field name with.

  • field (str) – field name to build proper field name with.

Returns

“table.field” with the appropriate table.

Return type

str

shortField(field: str) messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None[source]

Returns FieldMaker for field.

Parameters

field (str) – field name to look for FieldMaker.

Returns

The FieldMaker for the field.

Return type

messes.extract.extract.FieldMaker | messes.extract.extract.ListFieldMaker | None

class messes.extract.extract.TagParser[source]

Creates parser objects that convert tagged .xlsx worksheets into nested dictionary structures for metadata capture.

compare(otherMetadata: dict, groupSize: int = 5, file: TextIO | None = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) bool[source]

Compare current metadata to other metadata.

Parameters
  • otherMetadata (dict) – dict to compare with self.extraction.

  • groupSize (int) – number of record ids to print on a single line before printing more on a new line.

  • file (TextIO | None) – the IO to print messages to, if None then just return True or False instead of printing messages.

Returns

True if otherMetadata and self.extraction are different, False otherwise.

Return type

bool

deleteMetadata(sections: list[list[str]])[source]

Delete sections of metadata based on given section descriptions.

Parameters

sections (list[list[str]]) – list of sections that are lists of strings. The strings should be regular expressions.

findParent(parentID: str) tuple[str, dict] | None[source]

Returns parent record for given parentID.

Parameters

parentID (str) – the id to look for in the records of self.extraction.

Returns

None if the parentID was not found, (tableKey,parentRecord) if it was.

Return type

tuple[str, dict] | None

generateLineages() dict[source]

Generates and returns parent-child record lineages.

Returns

lineages by tableKey.

Return type

dict

static hasFileExtension(string: str) bool[source]

Tests whether the string has a file extension.

Parameters

string (str) – string to test.

Returns

True if .xls, .xlsx, .xlsm, .csv, or .json is in string, False otherwise.

Return type

bool
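The documented check amounts to a substring test, sketched here; the real static method may differ in detail:

```python
def has_file_extension(string: str) -> bool:
    """True if a recognized tabular/JSON extension appears in the string."""
    return any(ext in string for ext in (".xls", ".xlsx", ".xlsm", ".csv", ".json"))

assert has_file_extension("metadata.xlsx:Sheet1")
assert not has_file_extension("docs.google.com/spreadsheets/d/abc123")
```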

static isComparable(value1: str, value2: str) bool[source]

Compares the two values first as strings and then as floats if convertible.

Parameters
  • value1 (str) – first value to compare.

  • value2 (str) – second value to compare.

Return type

bool
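A sketch of the string-then-float comparison; the docstring leaves the exact return semantics implicit, so the equality interpretation below is an assumption:

```python
def is_comparable(value1: str, value2: str) -> bool:
    # Assumed semantics: equal as strings, or equal as floats when both convert.
    if value1 == value2:
        return True
    try:
        return float(value1) == float(value2)
    except ValueError:
        return False

assert is_comparable("1.0", "1")      # unequal as strings, equal as floats
assert not is_comparable("abc", "abd")
```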

static isGoogleSheetsFile(string: str) bool[source]

Tests whether the string is a Google Sheets URL.

Parameters

string (str) – string to test.

Returns

True if docs.google.com/spreadsheets/d/ is in string, False otherwise.

Return type

bool

static loadSheet(fileName: str | TextIO, sheetName: str, removeRegex: str | None = None, isDefaultSearch: bool = False) tuple[str, str, pandas.core.frame.DataFrame] | None[source]

Load and return worksheet as a pandas data frame.

Parameters
  • fileName (str | TextIO) – filename or sys.stdin to read a csv from stdin.

  • sheetName (str) – sheet name for an Excel file, ignored if not an Excel file. Can be a regular expression to search for a sheet.

  • removeRegex (str | None) – a string to pass to DataFrame.replace() to replace characters with an empty string in the dataframe that is read in. Can be a regex. Set to None to not replace anything.

  • isDefaultSearch (bool) – whether or not the sheetName is using default values, determines whether to print some messages.

Returns

None if the worksheet is empty, else (fileName, sheetName, dataFrame)

Raises

Exception – If fileName is invalid.

Return type

tuple[str, str, pandas.core.frame.DataFrame] | None

merge(newMetadata: dict)[source]

Merges new metadata with current metadata.

Parameters

newMetadata (dict) – dict to merge with self.extraction dict.

modify(modificationDirectives: dict)[source]

Applies modificationDirectives to the extracted metadata.

Parameters

modificationDirectives (dict) – contains the modifications to apply.

parseSheet(fileName: str, sheetName: str, worksheet: DataFrame)[source]

Extracts useful metadata from the worksheet and puts it in the extraction dictionary.

Parameters
  • fileName (str) – name of the file, used for error messages.

  • sheetName (str) – name of the Excel sheet, used for error messages.

  • worksheet (DataFrame) – the data from the file name and sheet name.

static printLineages(lineages: ~collections.defaultdict, indentation: int, groupSize: int = 5, file: ~typing.TextIO = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Prints the given lineages.

Parameters
  • lineages (defaultdict) – dictionary where the keys are table names and values are a dictionary of parentID and children.

  • indentation (int) – number of spaces of indentation to print.

  • groupSize (int) – number of childIDs to print per line.

  • file (TextIO) – the IO to print messages to.

readDirectives(source: str, sheetName: str, directiveType: str, removeRegex: str | None, isDefaultSearch: bool = False) dict[source]

Read directives source of a given directive type.

Parameters
  • source (str) – file path.

  • sheetName (str) – sheet name for an Excel file, ignored if not an Excel file.

  • directiveType (str) – either “modification” or “automation” to call the correct parsing function.

  • removeRegex (str | None) – a string to pass to DataFrame.replace() to replace characters with an empty string in the dataframe that is read in. Can be a regex. Set to None to not replace anything. Passed to loadSheet.

  • isDefaultSearch (bool) – whether or not the source is using default values, passed to loadSheet for message printing.

Returns

The directives that were read in.

Return type

dict

readMetadata(metadataSource: str, automationSource: str, automateDefaulted: bool, modificationSource: str, modifyDefaulted: bool, removeRegex: str | None, saveExtension: str = None)[source]

Reads metadata from source.

Parameters
  • metadataSource (str) – file path to metadata file with possibly a sheetname if appropriate.

  • automationSource (str) – file path to automation file or a sheetname.

  • automateDefaulted (bool) – whether the automation source is the default value or not, passed to readDirectives for message printing.

  • modificationSource (str) – file path to modification file or a sheetname.

  • modifyDefaulted (bool) – whether the modification source is the default value or not, passed to readDirectives for message printing.

  • removeRegex (str | None) – a string to pass to DataFrame.replace() to replace characters with an empty string in the dataframe that is read in. Can be a regex. Set to None to not replace anything. Passed to loadSheet and readDirectives.

  • saveExtension (str) – if “csv” saves the export as a csv file, else saves it as an Excel file.

saveSheet(fileName: str, sheetName: str, worksheet: DataFrame, saveExtension: str)[source]

Save given worksheet in the given format.

Parameters
  • fileName (str) – file name or path to save to.

  • sheetName (str) – name to give the sheet if saving as Excel.

  • worksheet (DataFrame) – data to save.

  • saveExtension (str) – if “csv” save as csv file, else save as Excel.

tagSheet(automationDirectives: dict, worksheet: DataFrame, silent: bool) DataFrame[source]

Add tags to the worksheet using the given automation directives.

Parameters
  • automationDirectives (dict) – a dictionary used to place the tags in the appropriate places.

  • worksheet (DataFrame) – the DataFrame in which to place the tags.

  • silent (bool) – if True don’t print warnings.

Returns

The modified worksheet.

Return type

DataFrame

exception messes.extract.extract.TagParserError(message: str, fileName: str, sheetName: str, rowIndex: int, columnIndex: int, endMessage: str = '')[source]

Exception class for errors thrown by TagParser.

Parameters
  • message (str) – start of the message for the exception.

  • fileName (str) – the file name where the exception happened.

  • sheetName (str) – the sheet name in the Excel file where the exception happened.

  • rowIndex (int) – the row index in the tabular file where the exception happened.

  • columnIndex (int) – the column index in the tabular file where the exception happened.

  • endMessage (str) – the optional end of the message for the exception.

static columnName(columnIndex: int) str[source]

Returns Excel-style column name for columnIndex (integer).

Parameters

columnIndex (int) – index of the column in the spreadsheet.

Returns

If columnIndex is less than 0, return ""; else return the capital letter(s) of the Excel column. Ex: columnIndex = 3 returns "D".

Return type

str
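The documented behavior matches standard bijective base-26 column naming; a sketch, assuming 0-based column indexes as in the example above:

```python
def column_name(column_index: int) -> str:
    """Excel-style column name: 0 -> "A", 3 -> "D", 26 -> "AA"; negatives -> ""."""
    if column_index < 0:
        return ""
    name = ""
    column_index += 1  # switch to 1-based bijective base-26
    while column_index > 0:
        column_index, remainder = divmod(column_index - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name

assert column_name(3) == "D"
assert column_name(26) == "AA"
assert column_name(-1) == ""
```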

class messes.extract.extract.VariableOperand(value: str | int)[source]

Represents #table.record%attribute variable operands.

Initializer

Parameters

value (str | int) – a string or int that represents the value of the operand.

messes.extract.extract.xstr(s: str | None) str[source]

Returns str(s) or “” if s is None.

Parameters

s (str | None) – input string or None.

Returns

str(s) or “” if s is None.

Return type

str
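The documented behavior is equivalent to this one-liner:

```python
def xstr(s):
    """Return str(s), or "" if s is None."""
    return "" if s is None else str(s)

assert xstr(None) == ""
assert xstr(3) == "3"
assert xstr("abc") == "abc"
```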

validate

Validate JSON files.

Usage:
messes validate json <input_JSON> [--pds=<pds> [--csv | --xlsx | --json | --gs] | --no_base_schema]
                                  [--no_extra_checks] [--additional=<add_schema>...] [--format=<format>] [--silent=<level>]
messes validate save-schema <output_schema> [--input=<input_JSON>]
                                  [--pds=<pds> [--csv | --xlsx | --json | --gs]] [--format=<format>] [--silent=<level>]
messes validate schema <input_schema>
messes validate pds <pds> [--csv | --xlsx | --json | --gs] [--silent=<level>] [--save=<output_name>]
messes validate pds-to-table <pds_json> <output_name> [<output_filetype>]
messes validate pds-to-json <pds_tabular> [--csv | --xlsx | --gs] <output_name>
messes validate cd-to-json-schema <conversion_directives> [--csv | --xlsx | --json | --gs] <output_schema>
messes validate --help

<input_JSON> - if '-' read from standard input.
<pds> - can be a JSON, csv, xlsx, or Google Sheets file. If xlsx or Google Sheets, the default sheet name to read in is #validate;
    to specify a different sheet name, separate it from the file name with a colon, ex: file_name.xlsx:sheet_name.
    If '-' read from standard input.
<input_schema> - must be a valid JSON Schema file. If '-' read from standard input.
<output_schema> - if '-' save to standard output.
<output_name> - path to save tabular pds to, if '-' save to standard output as CSV.
<output_filetype> - "xlsx" or "csv", defaults to "csv".
<conversion_directives> - can be a JSON, csv, xlsx, or Google Sheets file. If xlsx or Google Sheets,
    the default sheet name to read in is #convert; to specify a different sheet name, separate it from the file name with a colon, ex: file_name.xlsx:sheet_name.
    If '-' read from standard input.

Options:
-h, --help
  • show this screen.

-v, --version
  • show version.

--silent <level>
  • if "full" silence all warnings,
    if "nuisance" silence warnings that are more likely to be a nuisance,
    if "none" do not silence warnings [default: none].

--pds <pds>
  • a protocol-dependent schema file, can be a JSON, csv, or xlsx file.
    If xlsx, the default sheet name to read in is #validate; to specify a
    different sheet name, separate it from the file name with a colon,
    ex: file_name.xlsx:sheet_name.

--csv
  • indicates that the protocol-dependent schema file is a csv (comma delimited) file.

--xlsx
  • indicates that the protocol-dependent schema file is an xlsx (Excel) file.

--json
  • indicates that the protocol-dependent schema file is a JSON file.

--gs
  • indicates that the protocol-dependent schema file is a Google Sheets file.

If a file type is not given then it will be guessed from the file extension.

--additional <add_schema>
  • an additional JSON Schema file that will be used to validate <input_JSON>.

--format <format>
  • additional validation done for the desired supported format.

Current supported formats:

mwtab

--no_base_schema
  • don’t validate with the base JSON schema.

--no_extra_checks
  • only do JSON Schema validation and nothing else.

--input <input_JSON>
  • optionally give an input JSON file to save-schema to reproduce the
    schema used to validate in the json command.

--save <output_name>
  • save the JSON Schema created from the protocol-dependent schema.

The "json" command will validate the <input_JSON> against the internal base_schema, and optional schemas provided by the --pds and --additional options. To validate only against a provided schema, use the --additional and --no_base_schema options.

The "save-schema" command will save the internal base_schema to the <output_schema> location. If --pds is given, then it will be parsed and placed into the base_schema. If --input is given, the protocols table will be added in with the PDS to reproduce what happens in the json command. If --format is used, then that format schema is saved instead of the base_schema.

The "schema" command will validate the <input_schema> against the JSON Schema meta schema.

The "pds" command will validate that the <pds> file is a valid protocol-dependent schema file. If the --save option is given, then the built JSON Schema is saved.

The "pds-to-table" command will read in a protocol-dependent schema in JSON form and save it out in tabular form.

The "pds-to-json" command will read in a protocol-dependent schema in tabular form and save it out in JSON form.

The "cd-to-json-schema" command will read in conversion directives and create a JSON Schema template file that can be filled in and used to validate files that will be converted using those directives.

messes.validate.validate.SS_protocol_check(input_json: dict[str, Any] | list | str | int | float | None) None[source]

Validates the subjects and samples protocols.

Loops over the entity table in input_json and makes sure that each sample/subject has protocols of the correct type depending on its inheritance. Samples that have a sample parent must have a sample_prep type protocol. Samples that have a subject parent must have a collection type protocol. Subjects must have a treatment type protocol.

Parameters

input_json (dict[str, Any] | list | str | int | float | None) – the JSON to validate.

Return type

None
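The inheritance rules above can be sketched as a small lookup; the record and entity structures here are hypothetical simplifications of the real JSON, not the actual messes data model:

```python
def required_protocol_type(record: dict, entities: dict) -> str:
    """Which protocol type the described rules require for an entity record."""
    if record["type"] == "subject":
        return "treatment"
    parent = entities.get(record.get("parent_id"))
    if parent is not None and parent["type"] == "sample":
        return "sample_prep"
    return "collection"  # sample with a subject parent

entities = {"s1": {"type": "sample"}, "sub1": {"type": "subject"}}
assert required_protocol_type({"type": "subject"}, entities) == "treatment"
assert required_protocol_type({"type": "sample", "parent_id": "s1"}, entities) == "sample_prep"
assert required_protocol_type({"type": "sample", "parent_id": "sub1"}, entities) == "collection"
```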

messes.validate.validate.add_protocols_to_PDS(protocol_table: dict, pds: dict[str, Any] | list | str | int | float | None, silent: str) dict[str, Any] | list | str | int | float | None[source]

Add the protocols from the table to the protocol-dependent schema.

Parameters
  • protocol_table (dict) – the protocol table from the input JSON.

  • pds (dict[str, Any] | list | str | int | float | None) – the protocol-dependent schema in JSON form.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Returns

The updated protocol-dependent schema.

Return type

dict[str, Any] | list | str | int | float | None

messes.validate.validate.build_PD_schema(pds: dict[str, Any] | list | str | int | float | None) dict[str, Any] | list | str | int | float | None[source]

Build a JSON schema from the protocol-dependent schema.

Parameters

pds (dict[str, Any] | list | str | int | float | None) – the protocol-dependent schema in JSON form.

Returns

A JSON schema created by combining the base schema and a schema created from the protocol-dependent schema.

Return type

dict[str, Any] | list | str | int | float | None

messes.validate.validate.check(self, instance: object, format: str) None[source]

Check whether the instance conforms to the given format.

Modified from jsonschema.FormatChecker.check. Used to raise an error on the custom “integer”, “str_integer”, “numeric”, and “str_numeric” formats so their values can be cast to int and float appropriately.

Parameters
  • instance (object) – the instance to check

  • format (str) – the format that instance should conform to

Raises

FormatError – if the instance does not conform to format; if the instance does conform to one of the custom "integer", "str_integer", "numeric", or "str_numeric" formats (so that its value can be cast); or if the instance is not a string and the format is "str_integer" or "str_numeric".

Return type

None

messes.validate.validate.convert_formats(validator: Validator, instance: dict | str | list) dict | str | list[source]

Convert “integer” and “numeric” formats to int and float.

Special function to iterate over JSON schema errors and if the custom “integer”, “str_integer”, “numeric”, and “str_numeric” formats are found, converts that value in the instance to the appropriate type. If the value is not a string and the format is “str_integer” or “str_numeric”, prints an error.

Parameters
  • validator (Validator) – Validator from the jsonschema library to run iter_errors() on.

  • instance (dict | str | list) – the instance to have its values converted.

Returns

The modified instance.

Return type

dict | str | list

messes.validate.validate.create_validator(schema: dict[str, Any] | list | str | int | float | None) Validator[source]

Create a validator for the given schema.

Parameters

schema (dict[str, Any] | list | str | int | float | None) – the JSON schema to create a validator for.

Returns

A jsonschema.protocols.Validator to validate the schema with an added format checker that is aware of the custom formats “integer”, “str_integer”, “numeric”, and “str_numeric”.

Return type

Validator

messes.validate.validate.factors_checks(input_json: dict[str, Any] | list | str | int | float | None, silent: str) None[source]

Validates some logic about the factors.

Checks that every factor in the factor table is used at least once by an entity, that the values in the factor field are allowed values, whether there is more than 1 allowed value in the factor field, and that factor fields are str or list types.

Parameters
  • input_json (dict[str, Any] | list | str | int | float | None) – the JSON to validate.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Return type

None

messes.validate.validate.id_check(JSON_file: dict[str, Any] | list | str | int | float | None) None[source]

Validate id fields for records in JSON_file.

Loops over JSON_file and makes sure each field with a period in the name is an id, that each id points to an existing id in another table that exists in JSON_file, that each “parent_id” field points to another record that exists in the same table, and that each “id” field has a value that is the same as the name of the record.

There is a special check for the “entity” table that checks that subject types have a sample type parent.

Parameters

JSON_file (dict[str, Any] | list | str | int | float | None) – the JSON to validate ids for.

Return type

None

messes.validate.validate.indexes_of_duplicates_in_list(list_of_interest: list, value_to_find: Any) list[int][source]

Returns a list of all of the indexes in list_of_interest where the value equals value_to_find.

Parameters
  • list_of_interest (list) – list to find indexes in.

  • value_to_find (Any) – value to look for in list_of_interest and find its index.

Returns

A list of all the indexes where value_to_find is in list_of_interest.

Return type

list[int]
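The documented behavior is equivalent to a single comprehension:

```python
from typing import Any

def indexes_of_duplicates_in_list(list_of_interest: list, value_to_find: Any) -> list:
    """All indexes in list_of_interest where the value equals value_to_find."""
    return [i for i, value in enumerate(list_of_interest) if value == value_to_find]

assert indexes_of_duplicates_in_list(["a", "b", "a", "c"], "a") == [0, 2]
assert indexes_of_duplicates_in_list(["a", "b"], "z") == []
```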

messes.validate.validate.iterate_string_or_list(str_or_list: str | list) list[source]

If str_or_list is a string then make it into a list and return the items for looping.

If str_or_list is a list then return it as is.

Parameters

str_or_list (str | list) – a string to return as a list containing that string or a list to return as is.

Returns

str_or_list as a list.

Return type

list
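The documented behavior is equivalent to:

```python
def iterate_string_or_list(str_or_list):
    """Wrap a bare string in a list; return a list unchanged."""
    return [str_or_list] if isinstance(str_or_list, str) else str_or_list

assert iterate_string_or_list("x") == ["x"]
assert iterate_string_or_list(["x", "y"]) == ["x", "y"]
```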

messes.validate.validate.measurement_protocol_check(input_json: dict[str, Any] | list | str | int | float | None) None[source]

Loops over the measurement table in input_json and makes sure that each measurement has at least one measurement type protocol.

Parameters

input_json (dict[str, Any] | list | str | int | float | None) – the JSON to validate.

Return type

None

messes.validate.validate.mwtab_checks(input_json: dict) None[source]

Check that the input_json is ready for mwtab conversion.

Run checks that cannot be done by JSON Schema. They are the following:
  • Check that at least 1 protocol has the "machine_type" field.
  • Check that the first collection protocol has a "sample_type" field.
  • Check that there is at least 1 protocol each of collection type, treatment type, and sample_prep type.
  • Check that the first subject has the "species", "species_type", and "taxonomy_id" fields.

Parameters

input_json (dict) – the JSON to perform the checks on.

Return type

None

messes.validate.validate.print_better_error_messages(errors_generator: Iterable[ValidationError]) bool[source]

Print better error messages for jsonschema validation errors.

Parameters

errors_generator (Iterable[ValidationError]) – the generator returned from validator.iter_errors().

Returns

True if there were errors, False otherwise.

Return type

bool

messes.validate.validate.protocol_all_used_check(input_json: dict[str, Any] | list | str | int | float | None, tables_with_protocols: list[str]) None[source]

Validates that all protocols in the protocol table are used at least once.

Compiles a list of all of the protocols used by the records in tables_with_protocols and checks that every protocol in the protocol table is in that list. A warning is printed for any protocol that appears in the protocol table but is not used by any record.

Parameters
  • input_json (dict[str, Any] | list | str | int | float | None) – the JSON to validate.

  • tables_with_protocols (list[str]) – the tables in input_json that have records with “protocol.id” fields.

Return type

None

messes.validate.validate.protocol_description_check(input_json: dict[str, Any] | list | str | int | float | None) None[source]

Checks that every description field for the protocols in the protocol table of the metadata is unique.

Parameters

input_json (dict[str, Any] | list | str | int | float | None) – the JSON to validate.

Return type

None
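The uniqueness check can be sketched by grouping protocol IDs by description (assuming the protocol table has the shape {protocol_id: {"description": …}, …}; the helper name and table shape are assumptions):

```python
from collections import defaultdict

def find_duplicate_descriptions(protocol_table: dict) -> dict:
    """Map each description used more than once to the protocol IDs that share it."""
    by_description = defaultdict(list)
    for protocol_id, attributes in protocol_table.items():
        by_description[attributes.get("description")].append(protocol_id)
    return {description: ids for description, ids in by_description.items() if len(ids) > 1}
```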

messes.validate.validate.read_and_validate_PDS(filepath: str, is_csv: bool, is_xlsx: bool, is_json: bool, is_gs: bool, no_last_message: bool, silent: str) dict[str, Any] | list | str | int | float | None[source]

Read in the protocol-dependent schema from filepath and validate it.

Parameters
  • filepath (str) – the path to the protocol-dependent schema or “-” meaning to read from stdin.

  • is_csv (bool) – whether the protocol-dependent schema is a csv file, used for reading from stdin.

  • is_xlsx (bool) – whether the protocol-dependent schema is a xlsx file.

  • is_json (bool) – whether the protocol-dependent schema is a json file, used for reading from stdin.

  • is_gs (bool) – whether the protocol-dependent schema is a Google Sheets file.

  • no_last_message (bool) – if True do not print a message about the protocol-dependent schema being invalid and execution stopping.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Returns

The protocol-dependent schema.

Raises

SystemExit – Raised if filepath does not exist or there is an error reading the file.

Return type

dict[str, Any] | list | str | int | float | None

messes.validate.validate.read_in_JSON_file(filepath: str, description: str) dict[str, Any] | list | str | int | float | None[source]

Read in a JSON file from filepath.

Parameters
  • filepath (str) – the path to the JSON file or “-” meaning to read from stdin.

  • description (str) – a name for the JSON file to print more specific error messages.

Returns

The JSON file.

Raises

SystemExit – Raised if filepath does not exist or there is an error reading the file.

Return type

dict[str, Any] | list | str | int | float | None
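The read-from-file-or-stdin contract can be sketched as below (a simplified version; the real function's error messages differ):

```python
import json
import sys

def read_json(filepath: str, description: str):
    """Load JSON from filepath, or from stdin when filepath is "-"; exit on failure."""
    try:
        if filepath == "-":
            return json.load(sys.stdin)
        with open(filepath, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError) as error:
        print(f"Error: could not read the {description} file: {error}")
        sys.exit(1)
```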

messes.validate.validate.read_json_or_tabular_file(filepath: str, is_csv: bool, is_xlsx: bool, is_json: bool, is_gs: bool, file_title: str, default_sheet_name: str, silent: str) dict[str, Any] | list | str | int | float | None[source]

Read in a file from filepath.

Parameters
  • filepath (str) – the path to the file or “-” meaning to read from stdin.

  • is_csv (bool) – whether the file is a csv file, used for reading from stdin.

  • is_xlsx (bool) – whether the file is a xlsx file.

  • is_json (bool) – whether the file is a json file, used for reading from stdin.

  • is_gs (bool) – whether the file is a Google Sheets file.

  • file_title (str) – a string to use for printing error messages about the file.

  • default_sheet_name (str) – sheet name to default to for Excel and Google Sheets files.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Returns

The file contents.

Raises

SystemExit – Raised if filepath does not exist or there is an error reading the file.

Return type

dict[str, Any] | list | str | int | float | None

messes.validate.validate.run_conversion_directives_to_json_schema_command(conversion_directives_source: str, is_csv: bool, is_xlsx: bool, is_json: bool, is_gs: bool, output_schema_path: str, silent: str) None[source]

Run the cd-to-json command.

Parameters
  • conversion_directives_source (str) – either a filepath or “-” to read from stdin.

  • is_csv (bool) – if True the conversion_directives_source is a csv file.

  • is_xlsx (bool) – if True the conversion_directives_source is an xlsx file.

  • is_json (bool) – if True the conversion_directives_source is a JSON file.

  • is_gs (bool) – if True the conversion_directives_source is a Google Sheets file.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

  • output_schema_path (str) – the path to save the generated JSON Schema to.

Return type

None

messes.validate.validate.run_json_command(input_json_source: str, pds_source: str | None, additional_schema_sources: list[str], no_base_schema: bool = False, no_extra_checks: bool = False, is_csv: bool = False, is_xlsx: bool = False, is_json: bool = False, is_gs: bool = False, silent: str = 'none', format_check: str | None = None) None[source]

Run the json command.

Parameters
  • input_json_source (str) – either a filepath or “-” to read from stdin.

  • pds_source (str | None) – either a filepath or “-” to read from stdin, if not None.

  • additional_schema_sources (list[str]) – either a filepath or “-” to read from stdin, if not None.

  • no_base_schema (bool) – if True do not validate with the base_schema, ignored if pds_source is given.

  • no_extra_checks (bool) – if True only do JSON Schema validations.

  • is_csv (bool) – if True the pds_source is a csv file.

  • is_xlsx (bool) – if True the pds_source is an xlsx file.

  • is_json (bool) – if True the pds_source is a JSON file.

  • is_gs (bool) – if True the pds_source is a Google Sheets file.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

  • format_check (str | None) –

Return type

None

messes.validate.validate.run_pds_command(pds_source: str, output_path: str | None = None, is_csv: bool = False, is_xlsx: bool = False, is_json: bool = False, is_gs: bool = False, silent: str = 'none') None[source]

Run the pds command.

Parameters
  • pds_source (str) – either a filepath or “-” to read from stdin.

  • output_path (str | None) – if given then save the JSON Schema from the PDS.

  • is_csv (bool) – if True the pds_source is a csv file.

  • is_xlsx (bool) – if True the pds_source is an xlsx file.

  • is_json (bool) – if True the pds_source is a JSON file.

  • is_gs (bool) – if True the pds_source is a Google Sheets file.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Return type

None

messes.validate.validate.run_pds_to_json_command(pds_source: str, is_csv: bool, is_xlsx: bool, is_gs: bool, output_path: str, silent: str = 'none') None[source]

Run the pds-to-json command.

Parameters
  • pds_source (str) – either a filepath or “-” to read from stdin.

  • is_csv (bool) – if True the pds_source is a csv file.

  • is_xlsx (bool) – if True the pds_source is an xlsx file.

  • is_gs (bool) – if True the pds_source is a Google Sheets file.

  • output_path (str) – either a filepath or “-” to write to stdout.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Return type

None

messes.validate.validate.run_pds_to_table_command(pds_source: str, output_path: str, output_filetype: str, silent: str = 'none') None[source]

Run the pds-to-table command.

Parameters
  • pds_source (str) – either a filepath or “-” to read from stdin.

  • output_path (str) – either a filepath or “-” to write to stdout.

  • output_filetype (str) – either “xlsx” or “csv”.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Return type

None

messes.validate.validate.run_save_schema_command(pds_source: str | None, output_schema_path: str, input_json_path: str | None, is_csv: bool = False, is_xlsx: bool = False, is_json: bool = False, is_gs: bool = False, silent: str = 'none', format_check: str | None = None) None[source]

Run the save-schema command.

Parameters
  • pds_source (str | None) – either a filepath or “-” to read from stdin, if not None.

  • output_schema_path (str) – the path to save the output JSON to.

  • input_json_path (str | None) – either a filepath or “-” to read from stdin, if not None.

  • is_csv (bool) – if True the pds_source is a csv file.

  • is_xlsx (bool) – if True the pds_source is an xlsx file.

  • is_json (bool) – if True the pds_source is a JSON file.

  • is_gs (bool) – if True the pds_source is a Google Sheets file.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

  • format_check (str | None) –

Return type

None

messes.validate.validate.run_schema_command(input_schema_source: str) None[source]

Run the schema command.

Parameters

input_schema_source (str) – the path to the JSON Schema file to read and validate.

Return type

None

messes.validate.validate.save_out_JSON_file(filepath: str, json_to_save: dict) None[source]

Handle renaming and directing JSON to the correct output.

Parameters
  • filepath (str) – the path to save the JSON file to or “-” meaning to write to stdout.

  • json_to_save (dict) – the JSON to save out.

Return type

None

messes.validate.validate.validate_JSON_schema(user_json_schema: dict[str, Any] | list | str | int | float | None) bool[source]

Validate an arbitrary JSON schema.

Parameters

user_json_schema (dict[str, Any] | list | str | int | float | None) – JSON schema to validate.

Returns

True if there were validation errors, False otherwise.

Return type

bool

messes.validate.validate.validate_PDS_parent_protocols(pds: dict[str, Any] | list | str | int | float | None, silent: str) bool[source]

Validate the parent_protocols table of the protocol-dependent schema.

Parameters
  • pds (dict[str, Any] | list | str | int | float | None) – the protocol-dependent schema in JSON form.

  • silent (str) – if “full” do not print any warnings, if “nuisance” do not print nuisance warnings.

Returns

True if there were errors (warnings don’t count), False otherwise.

Return type

bool

messes.validate.validate.validate_parent_id(table: dict, table_name: str, entity_name: str, check_type: bool, type_keyword: str = 'type') bool[source]

Validate the “parent_id” fields for the table.

Parameters
  • table (dict) – the table to validate {record_name:{attribute1:value1, …}, …}.

  • table_name (str) – the name of the table, used for printing better error messages.

  • entity_name (str) – name of the entities of the table, used for printing better error messages.

  • check_type (bool) – if True check that the type of the parent is the same as the child.

  • type_keyword (str) – the keyword to use to check the types of the parent and child.

Returns

True if there were errors, False otherwise.

Return type

bool
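A simplified sketch of the documented contract, assuming a missing parent or a type mismatch counts as an error (the real function's messages and edge cases differ):

```python
def validate_parent_id(table: dict, table_name: str, entity_name: str,
                       check_type: bool, type_keyword: str = "type") -> bool:
    """Check every parent_id points at an existing record with a matching type."""
    has_errors = False
    for record_name, attributes in table.items():
        parent_id = attributes.get("parent_id")
        if parent_id is None:
            continue
        if parent_id not in table:
            print(f"Error: the {entity_name} '{record_name}' in the '{table_name}' "
                  f"table has a parent_id, '{parent_id}', that does not exist.")
            has_errors = True
        elif check_type and table[parent_id].get(type_keyword) != attributes.get(type_keyword):
            print(f"Error: the {entity_name} '{record_name}' has a different "
                  f"{type_keyword} than its parent '{parent_id}'.")
            has_errors = True
    return has_errors
```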

convert

Convert JSON data to another JSON format.

Usage:

messes convert mwtab (ms | nmr | nmr_binned) <input_JSON> <output_name> [--update <conversion_directives> | --override <conversion_directives>] [--silent]
messes convert save-directives mwtab (ms | nmr | nmr_binned) <output_filetype> [<output_name>]
messes convert generic <input_JSON> <output_name> <conversion_directives> [--silent]
messes convert --help

<conversion_directives> - can be a JSON, csv, xlsx, or Google Sheets file. If xlsx or Google Sheets, the default sheet name to read in is #convert;

to specify a different sheet name, separate it from the file name with a colon, e.g. file_name.xlsx:sheet_name.

<output_filetype> - “json”, “xlsx”, or “csv”

Options:
-h, --help
  • show this screen.

-v, --version
  • show version.

--silent
  • silence all warnings.

--update <conversion_directives>
  • conversion directives that will be used to update the built-in directives for the format.

This is intended for simple changes such as updating the value of the analysis ID. You only have to specify what needs to change; any values left out of the update directives won’t be changed. If you need to remove directives, use the override option.

--override <conversion_directives>
  • conversion directives that will be used to override the built-in directives for the format.

The built-in directives will not be used and these will be used instead.

The general command structure for convert is convert <format>, which converts an input JSON file to the supported format. These commands save both the JSON conversion and the final format file.

The generic command works the same as the supported formats, except that the user must supply conversion directives specifying how to convert the input JSON to the desired output JSON. Only an output JSON is saved.

The save-directives command is used to print the default conversion directives used by convert for any of the supported formats. <output_filetype> can be one of “json”, “xlsx”, or “csv”. The file is saved as “format_conversion_directives.ext”, where “.ext” is replaced with “.json”, “.xlsx”, or “.csv” depending on the value of <output_filetype>, unless <output_name> is given.

messes.convert.convert.compute_matrix_value(input_json: dict, conversion_table: str, conversion_record_name: str, conversion_attributes: dict, required: bool, silent: bool = False) list[dict] | None[source]

Determine the matrix value for the conversion directive.

Parameters
  • input_json (dict) – the data to build the matrix from.

  • conversion_table (str) – the name of the table the conversion record came from, used for good error messaging.

  • conversion_record_name (str) – the name of the conversion record, used for good error messaging.

  • conversion_attributes (dict) – the fields and values of the conversion record.

  • required (bool) – if True then any problems during execution are errors and the program should exit, else it’s just a warning.

  • silent (bool) – if True don’t print warning messages.

Returns

the list of dicts for the directive or None if there was a problem and the directive is not required.

Return type

list[dict] | None

messes.convert.convert.compute_string_value(input_json: dict, conversion_table: str, conversion_record_name: str, conversion_attributes: dict, required: bool, silent: bool = False) str | None[source]

Determine the string value for the conversion directive.

Parameters
  • input_json (dict) – the data to build the value from.

  • conversion_table (str) – the name of the table the conversion record came from, used for good error messaging.

  • conversion_record_name (str) – the name of the conversion record, used for good error messaging.

  • conversion_attributes (dict) – the fields and values of the conversion record.

  • required (bool) – if True then any problems during execution are errors and the program should exit, else it’s just a warning.

  • silent (bool) – if True don’t print warning messages.

Returns

the str value for the directive or None if there was a problem and the directive is not required.

Return type

str | None

messes.convert.convert.directives_to_table(conversion_directives: dict) DataFrame[source]

Convert conversion directives to a tagged table form.

Parameters

conversion_directives (dict) – the conversion directives to transform.

Returns

a pandas DataFrame that can be saved to csv or xlsx.

Return type

DataFrame

messes.convert.convert.handle_code_field(input_json: dict, conversion_table: str, conversion_record_name: str, conversion_attributes: dict, required: bool, silent: bool = False) Any[source]

If conversion_attributes has code and/or import fields then import and run the code appropriately.

Parameters
  • input_json (dict) – dict that the code is likely to operate on.

  • conversion_table (str) – the name of the table the conversion record came from, used for good error messaging.

  • conversion_record_name (str) – the name of the conversion record, used for good error messaging.

  • conversion_attributes (dict) – the fields and values of the conversion record.

  • required (bool) – if True then any problems during execution are errors and the program should exit, else it’s just a warning.

  • silent (bool) – if True don’t print warning messages.

Returns

the result of eval() or None if there was no “code” field in conversion_attributes.

Return type

Any
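The code/import mechanics can be sketched as below (a stripped-down illustration assuming the import field names an importable module; the real function also threads the table and record names into its error messages and honors the required/silent flags):

```python
import importlib

def handle_code_field(input_json: dict, conversion_attributes: dict):
    """Import any requested module, then eval() the directive's code string."""
    if "code" not in conversion_attributes:
        return None
    # input_json is exposed to the evaluated code so directives can reference it.
    namespace = {"input_json": input_json}
    module_name = conversion_attributes.get("import")
    if module_name:
        namespace[module_name] = importlib.import_module(module_name)
    return eval(conversion_attributes["code"], namespace)
```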

messes.convert.convert.update(original_dict: dict, upgrade_dict: dict) dict[source]

Update a dictionary in a nested fashion.

Parameters
  • original_dict (dict) – the dictionary to update.

  • upgrade_dict (dict) – the dictionary to update values from.

Returns

original_dict, after it has been updated with the values from upgrade_dict.

Return type

dict
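A common implementation of this nested merge (the package's version may differ in details):

```python
def update(original_dict: dict, upgrade_dict: dict) -> dict:
    """Recursively merge upgrade_dict into original_dict and return it."""
    for key, value in upgrade_dict.items():
        if isinstance(value, dict) and isinstance(original_dict.get(key), dict):
            update(original_dict[key], value)  # descend instead of overwriting
        else:
            original_dict[key] = value
    return original_dict
```

Unlike a plain dict.update(), sibling keys inside nested dictionaries survive the merge, which is what makes the --update option suitable for small tweaks to the built-in directives.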

Validates user input, erroring early and allowing the rest of the program to assume inputs are sanitized.

messes.convert.user_input_checking.validate_conversion_directives(conversion_directives: dict, schema: dict)[source]

Validate conversion directives.

Wraps around jsonschema.validate() to give more human readable errors for most validation errors.

Parameters
  • conversion_directives (dict) – instance to validate.

  • schema (dict) – JSON Schema to validate the instance with.

Raises

jsonschema.ValidationError – any validation error that isn’t given a friendlier message is re-raised as the original error.

Functions For mwtab Format

messes.convert.mwtab_functions.create_sample_lineages(input_json: dict, entity_table_name: str = 'entity', parent_key: str = 'parent_id') dict[source]

Determine all the ancestors, parents, and siblings for each entity in the entity table.

The returned dictionary is of the form:

{entity_id: {“ancestors”: [ancestor0, ancestor1, …],
             “parents”: [parent0, parent1, …],
             “siblings”: [sibling0, sibling1, …]},
 …}

Parents are the immediate ancestors an entity comes from; they are also included in the ancestors list.

Parameters
  • input_json (dict) – the dictionary where the entity table is.

  • entity_table_name (str) – the name of the entity table in input_json.

  • parent_key (str) – the field name for the field that points to the entity’s parent.

Returns

a dictionary where the keys are the entity ids and the values are a dictionary of its ancestors, parents, and siblings.

Return type

dict
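One way the described structure could be computed, assuming each entity has at most one parent_id and siblings share that parent (a sketch, not the package's implementation):

```python
def create_sample_lineages(input_json: dict, entity_table_name: str = "entity",
                           parent_key: str = "parent_id") -> dict:
    """Build the ancestors/parents/siblings dictionary from parent_id links."""
    entities = input_json[entity_table_name]
    lineages = {}
    for entity_id, attributes in entities.items():
        # Walk parent links upward to collect all ancestors.
        ancestors = []
        current = attributes.get(parent_key)
        while current is not None:
            ancestors.append(current)
            current = entities.get(current, {}).get(parent_key)
        parent = attributes.get(parent_key)
        siblings = [other_id for other_id, other in entities.items()
                    if parent is not None and other_id != entity_id
                    and other.get(parent_key) == parent]
        lineages[entity_id] = {"ancestors": ancestors,
                               "parents": [] if parent is None else [parent],
                               "siblings": siblings}
    return lineages
```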

messes.convert.mwtab_functions.create_subject_sample_factors(input_json: dict, measurement_table_name: str = 'measurement', sibling_match_field: str = 'protocol.id', sibling_match_value: str = 'protein_extraction', sample_id_key: str = 'entity.id', entity_table_name: str = 'entity', entity_type_key: str = 'type', subject_type_value: str = 'subject', parent_key: str = 'parent_id', factor_table_name: str = 'factor', factor_field_key: str = 'field', factor_allowed_values_key: str = 'allowed_values', protocol_table_name: str = 'protocol', protocol_field: str = 'protocol.id', protocol_type_field: str = 'type', measurement_type_value: str = 'measurement', data_files_key: str = 'data_files', data_files_attribute_key: str = 'data_files%entity_id', lineage_field_exclusion_list: list[str] | tuple[str] = ('study.id', 'project.id', 'parent_id')) list[dict][source]

Create the SUBJECT_SAMPLE_FACTORS section of the mwTab JSON.

Parameters
  • input_json (dict) – the data to build from.

  • measurement_table_name (str) – the name of the table in input_json where the measurements are.

  • sibling_match_field (str) – the field to use to determine if a sibling should be added to the SSF.

  • sibling_match_value (str) – the value to use to determine if a sibling should be added to the SSF.

  • sample_id_key (str) – the field in the measurement that has the sample id associated with it.

  • entity_table_name (str) – the name of the table in input_json where the entities are.

  • entity_type_key (str) – the field in entity records where the type is located.

  • subject_type_value (str) – the value in the type key that means the entity is a subject.

  • parent_key (str) – the field that points to the parent of the record.

  • factor_table_name (str) – the name of the table in input_json where the factors are.

  • factor_field_key (str) – the field in factor records that tells what the factor field is in other records.

  • factor_allowed_values_key (str) – the field in factor records where the allowed values for that factor are.

  • protocol_table_name (str) – the name of the table in input_json where the protocols are.

  • protocol_field (str) – the field in records that contains the protocol(s) of the record.

  • protocol_type_field (str) – the field in protocol records where the type is located.

  • measurement_type_value (str) – the value in the type key that means the protocol is a measurement type.

  • data_files_key (str) – the field in a measurement type protocol record where the file names are located.

  • data_files_attribute_key (str) – the field in a measurement type protocol record where the corresponding entity_id to raw file names are located.

  • lineage_field_exclusion_list (list[str] | tuple[str]) – the fields in entity records that should not be added as additional data.

Returns

a list of SUBJECT_SAMPLE_FACTORS.

Return type

list[dict]