The mwtab API Reference#

Routines for working with mwTab format files used by the Metabolomics Workbench.

This package includes the following modules:

mwtab: This module provides the MWTabFile class, which is a Python dictionary representation of a Metabolomics Workbench mwtab file. Data can be accessed directly from the MWTabFile instance using bracket accessors.
cli: This module provides the command-line interface for the mwtab package.
tokenizer: This module provides the tokenizer() generator that generates tuples of key-value pairs from mwtab files.
fileio: This module provides the read_files() generator to open files from different sources (single file/multiple files on a local machine, directory/archive of files, URL address of a file).
converter: This module provides the Converter class that is responsible for the conversion of mwTab formatted files into their JSON representation and vice versa.
mwschema: This module provides JSON schema definitions for the mwTab formatted files, i.e. specifies required and optional keys as well as data types.
validator: This module provides routines to validate mwTab formatted files based on schema definitions as well as checks for file self-consistency.
mwrest: This module provides the GenericMWURL class, which is a Python dictionary representation of a Metabolomics Workbench REST URL. The class is used to validate query parameters and to generate a URL path which can be used to request data from Metabolomics Workbench through their REST API.
metadata_column_matching: This module provides the ColumnFinder class, which is composed of a NameMatcher and a ValueMatcher. They are used to match column names and values, respectively. Matching is done using regular expressions, the “in” operator, the equality operator, and type matching for values. This module also includes the “column_finders” dictionary, a dictionary of ColumnFinders created to match the most common columns found in the Metabolomics Workbench datasets. More information about this module can be found on the Metadata Column Matching page.

mwtab.mwtab#

This module provides the MWTabFile class that stores the data from a single mwTab formatted file in the form of a dict. Data can be accessed directly from the MWTabFile instance using bracket accessors.

The data is divided into a series of “sections” which each contain a number of “key-value”-like pairs. Also, the file contains a specially formatted SUBJECT_SAMPLE_FACTOR block and blocks of data between *_START and *_END.

class mwtab.mwtab.MWTabFile(source, duplicate_keys=False, force=False, *args, **kwds)[source]#

MWTabFile class that stores data from a single mwTab formatted file in the form of a dictionary.

Parameters:

source – A string that should be the file path to the mwtab file that will be read in.
duplicate_keys – If True, use a special dictionary type that can handle duplicate keys. If you are using this class to build an mwtab file by hand, don’t set this to True. This was added because some files already uploaded to the Metabolomics Workbench erroneously have duplicate keys, and the class needed to be able to read them and write them back out correctly.
force – If True, replace non-dictionary values in METABOLITES_DATA, METABOLITES, and EXTENDED tables with empty dicts on JSON read in.

Attributes:

source – A string that should be the file path to the mwtab file that was read in.
study_id – A managed property. The study ID is stored in the METABOLOMICS WORKBENCH key of this class and the JSON version of an mwTab file. This property is provided as a convenience to access the study ID.
analysis_id – A managed property. The analysis ID is stored in the METABOLOMICS WORKBENCH key of this class and the JSON version of an mwTab file. This property is provided as a convenience to access the analysis ID.
header – A managed property, provided as a convenience for viewing the header line of the file. You can also set the METABOLOMICS WORKBENCH key by assigning to this property: the assigned string is parsed into the dictionary that belongs in the METABOLOMICS WORKBENCH key, assuming the string is a correctly generated header line.
data_section_key – A simple property that will give you the key to the data section. Either one of “MS_METABOLITE_DATA”, “NMR_METABOLITE_DATA”, or “NMR_BINNED_DATA”, or None if none of those were found.
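The managed properties described above can be pictured with a small dict subclass. This is only a sketch and not the library's code: the class name `HeaderSketch` and the nested key names "STUDY_ID" and "ANALYSIS_ID" are assumptions made for illustration.

```python
class HeaderSketch(dict):
    """Hypothetical stand-in for MWTabFile; not the real class."""

    @property
    def study_id(self):
        # Assumed layout: the IDs live under the METABOLOMICS WORKBENCH key.
        return self.get("METABOLOMICS WORKBENCH", {}).get("STUDY_ID")

    @property
    def analysis_id(self):
        return self.get("METABOLOMICS WORKBENCH", {}).get("ANALYSIS_ID")


record = HeaderSketch({
    "METABOLOMICS WORKBENCH": {"STUDY_ID": "ST000001", "ANALYSIS_ID": "AN000001"},
})

# The property and the bracket accessor read the same underlying data.
print(record.study_id, record["METABOLOMICS WORKBENCH"]["ANALYSIS_ID"])
```

Because the properties read from the dictionary on every access, they stay in sync with any change made through bracket accessors.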

Special Notes:

In general this class has the same structure as mwTab JSON, but there are a few exceptions to that. One is that the _RESULTS_FILE subsection, whether in the MS or NM section, is a dictionary in this class with keys for the elements found. In the mwTab JSON this is just a string. The possible keys the dictionary could have are: “filename”, “UNITS”, “Has m/z”, “Has RT”, and “RT units”. If the _RESULTS_FILE line did not have these keys, then they won’t be in the dictionary.
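To make the string-versus-dictionary difference concrete, the sketch below splits a _RESULTS_FILE value into the keys listed above. It assumes the value is tab-separated, with the file name first and the remaining fields written as "key:value"; that layout, the function name `parse_results_file`, and the sample file name are all assumptions for illustration, not the library's parser.

```python
def parse_results_file(value):
    """Split an assumed tab-separated _RESULTS_FILE value into a dict."""
    parsed = {}
    for field in value.split("\t"):
        field = field.strip()
        if not field:
            continue
        key, sep, val = field.partition(":")
        if sep:
            # "UNITS:Peak area" -> {"UNITS": "Peak area"}
            parsed[key] = val
        else:
            # A field without a colon is treated as the file name.
            parsed["filename"] = field
    return parsed


full = parse_results_file(
    "example_results.txt\tUNITS:Peak area\tHas m/z:Yes\tHas RT:Yes\tRT units:Minutes"
)
partial_line = parse_results_file("example_results.txt\tUNITS:Peak area")
print(full)
print(partial_line)
```

Keys that are absent from the line are simply absent from the resulting dictionary, which matches the behavior described above.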

The Metabolomics Workbench has deprecated mwTab files with NMR_BINNED_DATA sections, but for the few that do exist if you read them in using this class, the dictionaries in the [‘NMR_BINNED_DATA’][‘Data’] list of dicts will have keys for both “Metabolite” and “Bin range(ppm)”. This is because the JSON version from the Metabolomics Workbench uses “Bin range(ppm)” and not “Metabolite”, but we wanted to present a seamless unified interface for this class regardless of the analysis type. They print out with only the “Bin range(ppm)” keys to match what the Metabolomics Workbench provides, but internally both keys will be there. If they somehow become different, you will see a message about it when you try to write the file out.

validate(ms_schema=mwschema.ms_required_schema, nmr_schema=mwschema.nmr_required_schema, verbose=True)[source]#

Validate the instance.

Parameters:

ms_schema (dict) – jsonschema to validate both the base parts of the file and the MS specific parts of the file.
nmr_schema (dict) – jsonschema to validate both the base parts of the file and the NMR specific parts of the file.
verbose (bool) – whether to be verbose or not.

Returns:

Error messages as a single string and error messages in JSON form. If verbose is True, then the single string will be None.

Return type:

(<class ‘str’>, list[dict])

property data_section_key#

Easily determine the data_section_key.

The key will be one of “MS_METABOLITE_DATA”, “NMR_METABOLITE_DATA”, or “NMR_BINNED_DATA”, but will be None if none of those keys are found.
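The lookup described here amounts to checking for the three known keys. A minimal sketch, assuming a plain dictionary as input; the function name `find_data_section_key` and the order in which the keys are checked are illustrative, not the library's implementation:

```python
DATA_SECTION_KEYS = ("MS_METABOLITE_DATA", "NMR_METABOLITE_DATA", "NMR_BINNED_DATA")


def find_data_section_key(mwtab_dict):
    """Return the first known data section key present, or None."""
    return next((key for key in DATA_SECTION_KEYS if key in mwtab_dict), None)


ms_file = {"PROJECT": {}, "MS_METABOLITE_DATA": {"Data": []}}
no_data = {"PROJECT": {}}
print(find_data_section_key(ms_file), find_data_section_key(no_data))
```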

set_table_from_pandas(df, table_name, clear_header=False)[source]#

Set the given table_name from a pandas.DataFrame.

table_name must be one of “Metabolites”, “Extended”, or “Data”.

Parameters:

df (pandas.DataFrame) – pandas.DataFrame that will be used to update the object.
table_name (str) – the name of the table to set from df.
clear_header (bool) – if True, sets the appropriate header to None, otherwise sets it to the df columns.

Returns:

None

Return type:

None

set_metabolites_from_pandas(df, clear_header=False)[source]#

Update MWTabFile based on provided pandas.DataFrame.

Overwrite the current list of dicts in self[data_section_key][‘Metabolites’] with the values in df. Also overwrites self._metabolite_header with the columns in df, excluding the first column. df is assumed to have the first column as the ‘Metabolites’ column.

Parameters:

df (pandas.DataFrame) – pandas.DataFrame that will be used to update the object.
clear_header (bool) – if True, sets _metabolite_header to None, otherwise sets it to the df columns.

Returns:

None

Return type:

None

set_extended_from_pandas(df, clear_header=False)[source]#

Update MWTabFile based on provided pandas.DataFrame.

Overwrite the current list of dicts in self[data_section_key][‘Extended’] with the values in df. Also overwrites self._extended_metabolite_header with the columns in df, excluding the first column. df is assumed to have the first column as the ‘Metabolites’ column.

Parameters:

df (pandas.DataFrame) – pandas.DataFrame that will be used to update the object.
clear_header (bool) – if True, sets _extended_metabolite_header to None, otherwise sets it to the df columns.

Returns:

None

Return type:

None

set_metabolites_data_from_pandas(df, clear_header=False)[source]#

Update MWTabFile based on provided pandas.DataFrame.

Overwrite the current list of dicts in self[data_section_key][‘Data’] with the values in df. Also overwrites self._samples with the columns in df, excluding the first column. df is assumed to have the first column as the ‘Metabolites’ column.

Parameters:

df (pandas.DataFrame) – pandas.DataFrame that will be used to update the object.
clear_header (bool) – if True, sets _samples to None, otherwise sets it to the df columns.

Returns:

None

Return type:

None

get_table_as_pandas(table_name)[source]#

Return the given table_name as a pandas.DataFrame.

table_name must be one of “Metabolites”, “Extended”, or “Data”. Note that if there are duplicate column names, they will have a string appended to the end of the name like {{{_d+_}}}.

Parameters:: table_name (str) – the name of the table to return as a pandas.DataFrame.
Returns:: The list of dicts for the given table_name as a pandas.DataFrame.
Return type:: pandas.DataFrame

get_metabolites_as_pandas()[source]#

Return the Metabolites table as a pandas.DataFrame.

Note that if there are duplicate column names, they will have a string appended to the end of the name like {{{_d+_}}}.

Returns:: The list of dicts for the Metabolites table as a pandas.DataFrame.
Return type:: pandas.DataFrame

get_extended_as_pandas()[source]#

Return the Extended table as a pandas.DataFrame.

Note that if there are duplicate column names, they will have a string appended to the end of the name like {{{_d+_}}}.

Returns:: The list of dicts for the Extended table as a pandas.DataFrame.
Return type:: pandas.DataFrame

get_metabolites_data_as_pandas()[source]#

Return the Data table as a pandas.DataFrame.

Note that if there are duplicate column names, they will have a string appended to the end of the name like {{{_d+_}}}.

Returns:: The list of dicts for the Data table as a pandas.DataFrame.
Return type:: pandas.DataFrame

classmethod from_dict(input_dict)[source]#

Create a new MWTabFile instance from input_dict.

Parameters:: input_dict (dict) – Dictionary to create the new instance from.
Returns:: New instance of MWTabFile
Return type:: MWTabFile

read_from_str(input_str)[source]#

Read input_str into a MWTabFile instance.

Returns:: None
Return type:: None

read(filehandle)[source]#

Read data into a MWTabFile instance.

Parameters:: filehandle (io.TextIOWrapper, gzip.GzipFile, bz2.BZ2File, zipfile.ZipFile) – file-like object.
Returns:: None
Return type:: None

write(filehandle, file_format)[source]#

Write MWTabFile data into file.

Parameters:

filehandle (io.TextIOWrapper) – file-like object.
file_format (str) – Format to use to write data: mwtab or json.

Returns:

None

Return type:

None

writestr(file_format)[source]#

Write MWTabFile data into string.

Parameters:: file_format (str) – Format to use to write data: mwtab or json.
Returns:: String representing the MWTabFile instance.
Return type:: str

print_file(f=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, file_format='mwtab')[source]#

Print MWTabFile into a file or stdout.

Parameters:

f (io.StringIO) – writable file-like stream.
file_format (str) – Format to use: mwtab or json.

Returns:

None

Return type:

None

print_subject_sample_factors(section_key, f=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, file_format='mwtab')[source]#

Print mwtab SUBJECT_SAMPLE_FACTORS section into a file or stdout.

Parameters:

section_key (str) – Section name.
f (io.StringIO) – writable file-like stream.
file_format (str) – Format to use: mwtab or json.

Returns:

None

Return type:

None

print_block(section_key, f=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, file_format='mwtab')[source]#

Print mwtab section into a file or stdout.

Parameters:

section_key (str) – Section name.
f (io.StringIO) – writable file-like stream.
file_format (str) – Format to use: mwtab or json.

Returns:

None

Return type:

None

The mwtab Command Line Interface#

Usage:
    mwtab -h | --help
    mwtab --version
    mwtab convert (<from-path> <to-path>) [--from-format=<format>] [--to-format=<format>] [--mw-rest=<url>] [--force] [--verbose]
    mwtab validate <from-path> [--to-path=<path>] [--mw-rest=<url>] [--force] [--silent]
    mwtab download url <url> [--to-path=<path>] [--verbose]
    mwtab download study all [--to-path=<path>] [--input-item=<item>] [--output-format=<format>] [--mw-rest=<url>] [--verbose]
    mwtab download study <input-value> [--to-path=<path>] [--input-item=<item>] [--output-item=<item>] [--output-format=<format>] [--mw-rest=<url>] [--verbose]
    mwtab download (study | compound | refmet | gene | protein) <input-item> <input-value> <output-item> [--output-format=<format>] [--to-path=<path>] [--mw-rest=<url>] [--verbose]
    mwtab download moverz <input-item> <m/z-value> <ion-type-value> <m/z-tolerance-value> [--to-path=<path>] [--mw-rest=<url>] [--verbose]
    mwtab download exactmass <LIPID-abbreviation> <ion-type-value> [--to-path=<path>] [--mw-rest=<url>] [--verbose]
    mwtab extract metadata <from-path> <to-path> <key> ... [--to-format=<format>] [--no-header] [--force]
    mwtab extract metabolites <from-path> <to-path> (<key> <value>) ... [--to-format=<format>] [--no-header] [--force]

Options:
    -h, --help                           Show this screen.
    --version                            Show version.
    --verbose                            Print what files are processing.
    --silent                             Silence all standard output.
    --from-format=<format>               Input file format, available formats: mwtab, json [default: mwtab].
    --to-format=<format>                 Output file format [default: json].
                                         Available formats for convert:
                                             mwtab, json.
                                         Available formats for extract:
                                             json, csv.
    --mw-rest=<url>                      URL to MW REST interface
                                            [default: https://www.metabolomicsworkbench.org/rest/].
    --to-path=<path>                     Directory to save outputs into. Defaults to the current working directory.
                                         For the validate command, if the given path ends in '.json', then
                                         all JSON file outputs will be condensed into that 1 file. Also for
                                         the validate command no output files are saved unless this option is given.
    --prefix=<prefix>                    Prefix to add at the beginning of the output file name. Defaults to no prefix.
    --suffix=<suffix>                    Suffix to add at the end of the output file name. Defaults to no suffix.
    --context=<context>                  Type of resource to access from MW REST interface, available contexts: study,
                                         compound, refmet, gene, protein, moverz, exactmass [default: study].
    --input-item=<item>                  Item to search Metabolomics Workbench with.
    --output-item=<item>                 Item to be retrieved from Metabolomics Workbench.
    --output-format=<format>             Format for item to be retrieved in, available formats: mwtab, json.
    --no-header                          Do not include header at the top of csv formatted files.
    --force                              Ignore non-dictionary values in METABOLITES_DATA, METABOLITES, and EXTENDED tables for JSON files.

    For extraction <to-path> can take a "-" which will use stdout.
    All <from-path>'s can be single files, directories, or URLs.

Documentation webpage: https://moseleybioinformaticslab.github.io/mwtab/
GitHub webpage: https://github.com/MoseleyBioinformaticsLab/mwtab

mwtab.cli.cli(cmdargs)[source]#

Implements the command line interface.

Parameters:: cmdargs (dict) – dictionary of command line arguments.

mwtab.tokenizer#

This module provides the tokenizer() lexical analyzer for mwTab format syntax. It is implemented as a Python generator-based state machine which generates (yields) tokens one at a time when next() is invoked on a tokenizer() instance.

Each token is a tuple of “key-value”-like pairs, tuple of SUBJECT_SAMPLE_FACTORS or tuple of data deposited between *_START and *_END blocks.

mwtab.tokenizer.tokenizer(text, dict_type=None)[source]#

A lexical analyzer for the mwtab formatted files.

Parameters:

text (str) – mwTab formatted text.
dict_type – the type of dictionary to use, default is dict.

Returns:

Tuples of data.

Return type:

namedtuple
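The generator-based design can be illustrated with a toy tokenizer. This is a deliberately simplified sketch, not the library's tokenizer: it assumes tab-separated key-value lines and "#"-prefixed section lines, and the names `toy_tokenizer` and the "#SECTION" marker key are invented for illustration.

```python
from collections import namedtuple

KeyValue = namedtuple("KeyValue", ["key", "value"])


def toy_tokenizer(text):
    """Yield one KeyValue token per non-empty line of the input text."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Section lines are reported under a marker key (an assumption).
            yield KeyValue("#SECTION", line[1:])
        else:
            key, _, value = line.partition("\t")
            yield KeyValue(key.strip(), value.strip())


sample = "#PROJECT\nPR:PROJECT_TITLE\tExample title\n"
tokens = toy_tokenizer(sample)
first = next(tokens)   # tokens are produced lazily, one per next() call
second = next(tokens)
print(first, second)
```

Because the function is a generator, no line is processed until the consumer asks for the next token, which is the property the module description relies on.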

mwtab.fileio#

This module provides routines for reading mwTab formatted files from different kinds of sources:

Single mwTab formatted file on a local machine.

Directory containing multiple mwTab formatted files.

Compressed zip/tar archive of mwTab formatted files.

URL address of mwTab formatted file.

ANALYSIS_ID of mwTab formatted file.

mwtab.fileio.read_files(sources, *, read_class=<class 'mwtab.mwtab.MWTabFile'>, class_kwds={'duplicate_keys': True}, return_exceptions=False)#

Read from sources using the given read_class.

This function is mainly intended to be used with functools.partial to create a read function for a particular class.

Parameters:

sources (str | list[str]) – A string or list of strings to read from.
read_class (type) – A class with a read() method to instantiate to read from source.
class_kwds (dict) – A dictionary of keyword arguments to pass to the class constructor.
return_exceptions (bool) – Whether to yield a tuple with file instance and exception or just the file instance.

Returns:

Returns the instantiated class and any exceptions, or None and any exceptions, or the source and any exceptions.

Return type:

tuple[Any, Exception] | Any
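The functools.partial pattern mentioned above can be sketched with stand-ins. Nothing below is the library's code: `read_sources` is a toy generator that only mirrors the keyword-only shape of read_files(), `UpperFile` is a hypothetical reader class, and each source string is treated as the file content itself rather than as a path or URL.

```python
import io
from functools import partial


class UpperFile:
    """Hypothetical reader class: stores the upper-cased text it reads."""

    def __init__(self, source, strict=False):
        self.source = source
        self.strict = strict
        self.text = None

    def read(self, filehandle):
        self.text = filehandle.read().upper()


def read_sources(sources, *, read_class, class_kwds=None, return_exceptions=False):
    """Toy generator with the same keyword-only parameters as read_files()."""
    if isinstance(sources, str):
        sources = [sources]
    for source in sources:
        instance, error = None, None
        try:
            instance = read_class(source, **(class_kwds or {}))
            instance.read(io.StringIO(source))  # stand-in for opening the source
        except Exception as exc:
            instance, error = None, exc
        yield (instance, error) if return_exceptions else instance


# Bind the class and its constructor arguments once, then reuse the reader.
read_upper = partial(read_sources, read_class=UpperFile, class_kwds={"strict": True})
files = list(read_upper(["abc", "def"]))
print([f.text for f in files])
```

Binding read_class and class_kwds up front leaves a callable that only needs the sources, which is the convenience the keyword-only signature is designed for.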

mwtab.converter#

This module provides functionality for converting between the Metabolomics Workbench mwTab formatted file and its equivalent JSONized representation.

The following conversions are possible:

Local files:

One-to-one file conversions:
- textfile - to - textfile
- textfile - to - textfile.gz
- textfile - to - textfile.bz2
- textfile.gz - to - textfile
- textfile.gz - to - textfile.gz
- textfile.gz - to - textfile.bz2
- textfile.bz2 - to - textfile
- textfile.bz2 - to - textfile.gz
- textfile.bz2 - to - textfile.bz2
- textfile / textfile.gz / textfile.bz2 - to - textfile.zip / textfile.tar / textfile.tar.gz / textfile.tar.bz2 (TypeError: One-to-many conversion)
Many-to-many files conversions:
- Directories:
  
  directory - to - directory
  
  directory - to - directory.zip
  
  directory - to - directory.tar
  
  directory - to - directory.tar.bz2
  
  directory - to - directory.tar.gz
  
  directory - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- Zipfiles:
  
  zipfile.zip - to - directory
  
  zipfile.zip - to - zipfile.zip
  
  zipfile.zip - to - tarfile.tar
  
  zipfile.zip - to - tarfile.tar.gz
  
  zipfile.zip - to - tarfile.tar.bz2
  
  zipfile.zip - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- Tarfiles:
  
  tarfile.tar - to - directory
  
  tarfile.tar - to - zipfile.zip
  
  tarfile.tar - to - tarfile.tar
  
  tarfile.tar - to - tarfile.tar.gz
  
  tarfile.tar - to - tarfile.tar.bz2
  
  tarfile.tar - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
  
  tarfile.tar.gz - to - directory
  
  tarfile.tar.gz - to - zipfile.zip
  
  tarfile.tar.gz - to - tarfile.tar
  
  tarfile.tar.gz - to - tarfile.tar.gz
  
  tarfile.tar.gz - to - tarfile.tar.bz2
  
  tarfile.tar.gz - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
  
  tarfile.tar.bz2 - to - directory
  
  tarfile.tar.bz2 - to - zipfile.zip
  
  tarfile.tar.bz2 - to - tarfile.tar
  
  tarfile.tar.bz2 - to - tarfile.tar.gz
  
  tarfile.tar.bz2 - to - tarfile.tar.bz2
  
  tarfile.tar.bz2 - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)

URL files:

One-to-one file conversions:
- analysis_id - to - textfile
- analysis_id - to - textfile.gz
- analysis_id - to - textfile.bz2
- analysis_id - to - textfile.zip / textfile.tar / textfile.tar.gz / textfile.tar.bz2 (TypeError: One-to-many conversion)
- textfileurl - to - textfile
- textfileurl - to - textfile.gz
- textfileurl - to - textfile.bz2
- textfileurl.gz - to - textfile
- textfileurl.gz - to - textfile.gz
- textfileurl.gz - to - textfile.bz2
- textfileurl.bz2 - to - textfile
- textfileurl.bz2 - to - textfile.gz
- textfileurl.bz2 - to - textfile.bz2
- textfileurl / textfileurl.gz / textfileurl.bz2 - to - textfile.zip / textfile.tar / textfile.tar.gz / textfile.tar.bz2 (TypeError: One-to-many conversion)
Many-to-many files conversions:
- Zipfiles:
  
  zipfileurl.zip - to - directory
  
  zipfileurl.zip - to - zipfile.zip
  
  zipfileurl.zip - to - tarfile.tar
  
  zipfileurl.zip - to - tarfile.tar.gz
  
  zipfileurl.zip - to - tarfile.tar.bz2
  
  zipfileurl.zip - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- Tarfiles:
  
  tarfileurl.tar - to - directory
  
  tarfileurl.tar - to - zipfile.zip
  
  tarfileurl.tar - to - tarfile.tar
  
  tarfileurl.tar - to - tarfile.tar.gz
  
  tarfileurl.tar - to - tarfile.tar.bz2
  
  tarfileurl.tar - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
  
  tarfileurl.tar.gz - to - directory
  
  tarfileurl.tar.gz - to - zipfile.zip
  
  tarfileurl.tar.gz - to - tarfile.tar
  
  tarfileurl.tar.gz - to - tarfile.tar.gz
  
  tarfileurl.tar.gz - to - tarfile.tar.bz2
  
  tarfileurl.tar.gz - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
  
  tarfileurl.tar.bz2 - to - directory
  
  tarfileurl.tar.bz2 - to - zipfile.zip
  
  tarfileurl.tar.bz2 - to - tarfile.tar
  
  tarfileurl.tar.bz2 - to - tarfile.tar.gz
  
  tarfileurl.tar.bz2 - to - tarfile.tar.bz2
  
  tarfileurl.tar.bz2 - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
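The lists above follow one rule: a single-file source cannot be written to a multi-file container, and a multi-file source cannot be written to a single compressed file. The sketch below encodes that rule by file suffix. It is an illustration under assumptions, not the Converter's logic: the helper names are invented, directories are flagged explicitly rather than detected, and combinations not listed above (for example a single file written to a directory) are not claimed to behave this way in the library.

```python
MANY_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tar.bz2")


def path_kind(path, is_dir=False):
    """Classify a path as holding 'one' file or 'many' files."""
    if is_dir or path.lower().endswith(MANY_SUFFIXES):
        return "many"
    # Plain files and single-file compression (.gz, .bz2) hold one file.
    return "one"


def check_conversion(from_path, to_path, from_is_dir=False, to_is_dir=False):
    """Raise TypeError for the one-to-many and many-to-one cases."""
    source = path_kind(from_path, from_is_dir)
    target = path_kind(to_path, to_is_dir)
    if source == "one" and target == "many":
        raise TypeError("One-to-many conversion")
    if source == "many" and target == "one":
        raise TypeError("Many-to-one conversion")
    return source, target


print(check_conversion("analysis.txt", "analysis.json.gz"))
print(check_conversion("studies.tar.bz2", "out_dir", to_is_dir=True))
```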

class mwtab.converter.Translator(from_path, to_path, from_format=None, to_format=None, force=False)[source]#: Translator abstract class.

class mwtab.converter.MWTabFileToMWTabFile(from_path, to_path, from_format=None, to_format=None, force=False)[source]#: Translator concrete class that can convert between mwTab and JSON formats.

class mwtab.converter.Converter(from_path, to_path, from_format='mwtab', to_format='json', force=False)[source]#

Converter class to convert mwTab files from mwTab to JSON or from JSON to mwTab format.

convert()[source]#

Convert file(s) from mwTab format to JSON format or from JSON format to mwTab format.

Returns:: None
Return type:: None

mwtab.validator#

This module contains routines to validate consistency of the mwTab formatted files, e.g. make sure that Samples and Factors identifiers are consistent across the file, and make sure that all required key-value pairs are present.

mwtab.validator.validate_file(mwtabfile, ms_schema, nmr_schema, verbose=False)[source]#

Validate mwTab formatted file.

Note that some of the validations are pretty strict to account for the majority of cases, but if warranted could be ignored. For example, COLUMN_PRESSURE in the CHROMATOGRAPHY section will print a warning if the value is not a single number or range of numbers followed by a unit, but there might be some situations where the method is complex and thus the column pressure is not static. So something like “60 bar at starting conditions. 180 bar at %A” would be required to accurately describe the COLUMN_PRESSURE, and would be valid. So in these kinds of situations the warning printed can safely be ignored.

Parameters:

mwtabfile (MWTabFile) – The file to be validated.
ms_schema (dict) – jsonschema to validate both the base parts of the file and the MS specific parts of the file.
nmr_schema (dict) – jsonschema to validate both the base parts of the file and the NMR specific parts of the file.
verbose (bool) – whether to be verbose or not.

Returns:

Error messages as a single string and error messages in JSON form. If verbose is True, then the single string will be None.

Return type:

(<class ‘str’>, list[dict])
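The COLUMN_PRESSURE note above describes a check for "a single number or range of numbers followed by a unit". A regular expression along the following lines would behave that way; the pattern and the function name are assumptions for illustration and are not taken from the validator's source.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# A number, an optional "-number" range, then a purely alphabetic unit.
PRESSURE_PATTERN = re.compile(rf"^{NUMBER}(?:\s*-\s*{NUMBER})?\s*[A-Za-z]+$")


def looks_like_simple_pressure(value):
    """Return True if value is a number or numeric range followed by a unit."""
    return PRESSURE_PATTERN.match(value.strip()) is not None


for text in ("60 bar", "60-180 bar", "60 bar at starting conditions. 180 bar at %A"):
    print(text, "->", looks_like_simple_pressure(text))
```

The third value fails the pattern even though it is a legitimate description, which is exactly the kind of warning the note says can be ignored.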

mwtab.mwrest#

This module provides routines for accessing the Metabolomics Workbench REST API.

See https://www.metabolomicsworkbench.org/tools/MWRestAPIv1.0.pdf for details.

mwtab.mwrest.analysis_ids(base_url='https://www.metabolomicsworkbench.org/rest/')[source]#

Method for retrieving a list of analysis ids for every current analysis in Metabolomics Workbench.

Parameters:: base_url (str) – Base url to Metabolomics Workbench REST API.
Returns:: List of every available Metabolomics Workbench analysis identifier.
Return type:: list

mwtab.mwrest.study_ids(base_url='https://www.metabolomicsworkbench.org/rest/')[source]#

Method for retrieving a list of study ids for every current study in Metabolomics Workbench.

Parameters:: base_url (str) – Base url to Metabolomics Workbench REST API.
Returns:: List of every available Metabolomics Workbench study identifier.
Return type:: list

mwtab.mwrest.generate_mwtab_urls(input_items, base_url='https://www.metabolomicsworkbench.org/rest/', output_format='txt', return_exceptions=False)[source]#

Method for generating URLs to be used to retrieve mwtab files for analyses and studies through the REST API of the Metabolomics Workbench database.

Parameters:

input_items (list) – List of Metabolomics Workbench input values for mwTab files.
base_url (str) – Base url to Metabolomics Workbench REST API.
output_format (str) – Output format for the mwTab files to be retrieved in.
return_exceptions (bool) – Whether to yield a tuple with url and exception or just the url.

Returns:

Metabolomics Workbench REST URL string(s).

Return type:

str

class mwtab.mwrest.GenericMWURL(rest_params, base_url='https://www.metabolomicsworkbench.org/rest/')[source]#

GenericMWURL class that stores and validates parameters specifying a Metabolomics Workbench REST URL.

Metabolomics REST API requests are performed using URL requests in the form of: https://www.metabolomicsworkbench.org/rest/context/input_specification/output_specification

where:
    if context = "study" | "compound" | "refmet" | "gene" | "protein"
        input_specification = input_item/input_value
        output_specification = output_item/[output_format]
    elif context = "moverz"
        input_specification = input_item/input_value1/input_value2/input_value3
            input_item = "LIPIDS" | "MB" | "REFMET"
            input_value1 = m/z_value
            input_value2 = ion_type_value
            input_value3 = m/z_tolerance_value
        output_specification = output_format
            output_format = "txt"
    elif context =  "exactmass"
        input_specification = input_item/input_value1/input_value2
            input_item = "LIPIDS" | "MB" | "REFMET"
            input_value1 = LIPID_abbreviation
            input_value2 = ion_type_value
        output_specification = None
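The grammar above maps directly onto string concatenation. The sketch below assembles URLs for the first two branches only; the function name `build_rest_url` and the dictionary key names are hypothetical, and unlike GenericMWURL it performs no validation or URL encoding of the values.

```python
BASE_URL = "https://www.metabolomicsworkbench.org/rest/"
STANDARD_CONTEXTS = {"study", "compound", "refmet", "gene", "protein"}


def build_rest_url(params, base_url=BASE_URL):
    """Join context, input specification, and output specification into a URL."""
    context = params["context"]
    if context in STANDARD_CONTEXTS:
        parts = [context, params["input_item"], params["input_value"], params["output_item"]]
        if params.get("output_format"):  # output_format is optional here
            parts.append(params["output_format"])
    elif context == "moverz":
        parts = [
            context,
            params["input_item"],
            params["m/z_value"],
            params["ion_type_value"],
            params["m/z_tolerance_value"],
            "txt",  # the only output_format listed for moverz
        ]
    else:
        raise ValueError(f"context not covered by this sketch: {context!r}")
    return base_url + "/".join(str(part) for part in parts)


study_url = build_rest_url({
    "context": "study",
    "input_item": "analysis_id",
    "input_value": "AN000001",
    "output_item": "mwtab",
    "output_format": "txt",
})
print(study_url)
```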

class mwtab.mwrest.MWRESTFile(source)[source]#

MWRESTFile class that stores data from a single file download through Metabolomics Workbench’s REST API.

Mirrors MWTabFile.

read(filehandle)[source]#

Read data into a MWRESTFile instance.

Parameters:: filehandle (io.TextIOWrapper, gzip.GzipFile, bz2.BZ2File, zipfile.ZipFile) – file-like object.
Returns:: None
Return type:: None

write(filehandle)[source]#

Write MWRESTFile data into file.

Parameters:: filehandle (io.TextIOWrapper) – file-like object.
Returns:: None
Return type:: None

mwtab.mwextract#

This module provides a number of functions and classes for extracting and saving data and metadata stored in mwTab formatted files in the form of MWTabFile.

class mwtab.mwextract.ItemMatcher(full_key, value_comparison)[source]#: ItemMatcher class that can be called to match items from mwTab formatted files in the form of MWTabFile.

class mwtab.mwextract.ReGeXMatcher(full_key, value_comparison)[source]#: ReGeXMatcher class that can be called to match items from mwTab formatted files in the form of MWTabFile using regular expressions.

mwtab.mwextract.generate_matchers(items)[source]#

Construct a generator that yields matchers (ItemMatcher or ReGeXMatcher).

Parameters:: items (iterable) – Iterable object containing key value pairs to match.
Returns:: Yields a Matcher object for each given item.
Return type:: ItemMatcher or ReGeXMatcher

mwtab.mwextract.extract_metabolites(sources, matcher_generator)[source]#

Extract metabolite data from mwTab formatted files in the form of MWTabFile.

Parameters:

sources (generator) – Generator of mwtab file objects (MWTabFile).
matcher_generator (generator) – Generator of matcher objects (ItemMatcher or ReGeXMatcher).

Returns:

Extracted metabolites dictionary.

Return type:

dict

mwtab.mwextract.extract_metadata(mwtabfile, keys)[source]#

Extract metadata from mwTab formatted files in the form of MWTabFile.

Parameters:

mwtabfile (MWTabFile) – mwTab file object for metadata to be extracted from.
keys (list) – List of metadata field keys for metadata values to be extracted.

Returns:

Extracted metadata dictionary.

Return type:

dict

mwtab.mwextract.write_metadata_csv(to_path, extracted_values, no_header=False)[source]#

Write extracted metadata dict into csv file.

Example:

    "metadata","value1","value2"
    "SUBJECT_TYPE","Human","Plant"

Parameters:

to_path (str) – Path to output file.
extracted_values (dict) – Metadata dictionary to be saved.
no_header (bool) – If true header is not included, otherwise header is included.

Returns:

None

Return type:

None
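The CSV layout described for write_metadata_csv can be reproduced with the standard csv module. The sketch below is an assumption-laden illustration rather than the library's function: `metadata_to_csv` is a hypothetical name, it returns a string instead of writing to to_path, and it sorts the values so that the output is deterministic.

```python
import csv
import io


def metadata_to_csv(extracted_values, no_header=False):
    """Render a {key: collection of values} dict as fully quoted CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    rows = [[key, *sorted(values)] for key, values in extracted_values.items()]
    if not no_header:
        # One "valueN" column per value in the longest row.
        width = max((len(row) - 1 for row in rows), default=0)
        writer.writerow(["metadata"] + [f"value{i}" for i in range(1, width + 1)])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = metadata_to_csv({"SUBJECT_TYPE": {"Human", "Plant"}})
print(csv_text)
```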

mwtab.mwextract.write_metabolites_csv(to_path, extracted_values, no_header=False)[source]#

Write extracted metabolites data dict into csv file.

Example:

    "metabolite_name","num-studies","num_analyses","num_samples"
    "1,2,4-benzenetriol","1","1","24"
    "1-monostearin","1","1","24"
    …

Parameters:

to_path (str) – Path to output file.
extracted_values (dict) – Metabolites data dictionary to be saved.
no_header (bool) – If true header is not included, otherwise header is included.

Returns:

None

Return type:

None

class mwtab.mwextract.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]#

SetEncoder class for encoding Python sets (set) into JSON serializable lists (list).

default(obj)[source]#

Method for encoding Python objects. If object passed is a set, converts the set to JSON serializable lists or calls base implementation.

Parameters:: obj (object) – Python object to be json encoded.
Returns:: JSON serializable object.
Return type:: dict, list, tuple, str, int, float, bool, or None
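The technique described here is the standard json.JSONEncoder extension point: override default() and fall back to the base class for anything else. A minimal sketch, using the hypothetical class name `SetToListEncoder` so it is not mistaken for the library's own SetEncoder:

```python
import json


class SetToListEncoder(json.JSONEncoder):
    """Encode sets as JSON arrays; defer everything else to the base class."""

    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        # The base implementation raises TypeError for unsupported objects.
        return super().default(obj)


encoded = json.dumps({"SUBJECT_TYPE": {"Human"}}, cls=SetToListEncoder)
print(encoded)
```

Note that set iteration order is not guaranteed, so the order of elements in the resulting JSON array may vary for sets with more than one element.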

mwtab.mwextract.write_json(to_path, extracted_dict)[source]#

Write extracted data or metadata dict into json file.

Metabolites example:
{
    "1,2,4-benzenetriol": {
        "ST000001": {
            "AN000001": [
                "LabF_115816",
                ...
            ]
        }
    }
}

Metadata example:
{
    "SUBJECT_TYPE": [
        "Plant",
        "Human"
    ]
}

Parameters:

to_path (str) – Path to output file.
extracted_dict (dict) – Metabolites data or metadata dictionary to be saved.

Returns:

None

Return type:

None

Metadata Column Matching#

Regular expressions, functions, and classes to match column names and values in mwtab METABOLITES blocks.

More information can be found on the Metadata Column Matching page.

mwtab.metadata_column_matching.WRAP_STRING = '[^a-zA-Z0-9]'#: Used for wrapping regexes in certain functions.

class mwtab.metadata_column_matching.NameMatcher(regex_search_strings=None, not_regex_search_strings=None, regex_search_sets=None, in_strings=None, not_in_strings=None, in_string_sets=None, exact_strings=None)[source]#

Used to filter names that match certain criteria.

Mostly intended to be used through the ColumnFinder class. Created for the purpose of matching tabular column names based on regular expressions and “in” criteria.

Parameters:

regex_search_strings (None | list[str]) – A collection of strings to deliver to re.search() to match a column name. If any string in the collection matches, then the name is matched. This does not simply look for any of the strings within a column name to match. Each string is wrapped with WRAP_STRING before searching, so the string ‘bar’ would not be found in the column name “foobarbaz”, but would be found in the name “foo bar baz”.
not_regex_search_strings (None | list[str]) – The same as regex_search_strings, except a match to a column name eliminates that name. Attributes that begin with “not” take precedence over the others. So if a column name matches a string in regex_search_strings and not_regex_search_strings, then it will be filtered OUT.
regex_search_sets (None | list[list[str]]) – A collection of sets of strings. Every string in a set must be found in the column name for that set to match, and a match on any one set is sufficient. For example, [(‘foo’, ‘bar’), (‘baz’, ‘asd’)] will match the name “foo bar” or “bar foo”, but not “foobar”, due to the aforementioned wrapping with WRAP_STRING. The names “foo”, “baz”, or “asd” would not match either, but “var asd baz” would.
in_strings (None | list[str]) – Similar to regex_search_strings except instead of using re.search() the “in” operator is used. For example, [‘foo’] would match the column name “a fool”, since ‘foo’ is in “a fool”.
not_in_strings (None | list[str]) – The same as in_strings, but matches to a column name eliminate or filter OUT that name.
in_string_sets (None | list[list[str]]) – The same as regex_search_sets, but each string in a set is determined to match using the “in” operator instead of re.search(). For example, [(‘foo’, ‘bar’), (‘baz’, ‘asd’)] WILL match the column name “foobar” because both ‘foo’ and ‘bar’ are in the name.
exact_strings (None | list[str]) – A collection of strings that must exactly match the column name. For example, [‘foo’, ‘bar’] would only match the column names “foo” or “bar”.
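
The difference between the “search” style and “in” style parameters can be sketched with the standard re module. This is not the package's code. The helpers wrapped_search and wrapped_set_search are hypothetical, and the sketch assumes that wrapping means the string must be bounded on each side by a WRAP_STRING character or by the start or end of the name, which is consistent with the examples given in the parameter descriptions.

```python
import re

WRAP_STRING = '[^a-zA-Z0-9]'  # value documented for this module

def wrapped_search(string, name):
    """Hypothetical helper: find string only when it is delimited inside name."""
    pattern = '(' + WRAP_STRING + '|^)' + string + '(' + WRAP_STRING + '|$)'
    return re.search(pattern, name) is not None

def wrapped_set_search(string_sets, name):
    """Hypothetical helper: a set matches when all of its strings are found."""
    return any(all(wrapped_search(s, name) for s in string_set)
               for string_set in string_sets)

print(wrapped_search('bar', 'foobarbaz'))    # False
print(wrapped_search('bar', 'foo bar baz'))  # True
print('foo' in 'a fool')                     # True
print(wrapped_search('foo', 'a fool'))       # False

sets = [('foo', 'bar'), ('baz', 'asd')]
print(wrapped_set_search(sets, 'bar foo'))      # True
print(wrapped_set_search(sets, 'foobar'))       # False
print(wrapped_set_search(sets, 'var asd baz'))  # True
```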

Examples

Find a column for “moverz”.

>>> NameMatcher(regex_search_strings = ['m/z', 'mz', 'moverz', 'mx'],
...             not_regex_search_strings = ['id'],
...             in_strings = ['m.z', 'calcmz', 'medmz', 'm_z', 'obsmz', 'mass to charge', 'mass over z'],
...             not_in_strings = ['spec', 'pectrum', 'structure', 'regno', 'retention'])

This is a real example based on the datasets in the Metabolomics Workbench. We can examine some of the strings to illustrate the attributes’ function. The “id” string needs to be in not_regex_search_strings, rather than not_in_strings, because “id” is a very small substring that could easily be in a longer word. Putting it in not_regex_search_strings means it will most likely match “ID” fields, such as “PubChem ID”. Note that it is recommended to lowercase all column names before filtering and thus use lower case strings, but in general NameMatcher is case sensitive. The “spec” string is in not_in_strings, rather than not_regex_search_strings, because the risk of it being in a name that should not be filtered out is low. It also catches both the full word “spectrum” and its common abbreviation “spec”. Hopefully, these 2 explanations of “id” and “spec” have illustrated some of the tradeoffs and advantages of the “in” style attributes versus the “search” style ones.

Find a column for “retention time”.

>>> NameMatcher(regex_search_strings = ['rt'],
...             regex_search_sets = [['ret', 'time']],
...             in_strings = ['rtimes', 'r.t.', 'medrt', 'rtsec', 'bestrt', 'compoundrt', 'rtmed'],
...             in_string_sets = [['retention', 'time'], ['rentetion', 'time'], ['retension', 'time']],
...             not_in_strings = ['type', 'error', 'index', 'delta', 'feature', 'm/z'])

This is another real example based on the datasets in the Metabolomics Workbench. It illustrates the “set” style attributes quite well. For multi-word column names the “set” style attributes are usually what you want to use. It is possible to give a string like “retention time” (note the space character) to an attribute like “in_strings”, but this is more fragile than it seems and won’t match some common alternate spellings or mistakes, such as “retention_time” or “retention  time” (two spaces). Using “set” style attributes means you don’t have to add as many strings to an attribute like “in_strings”. You can still see some repetition in the in_string_sets attribute here, though, to cover the many misspellings of “retention”. “set” style attributes are not a good fit if the strings in the set must appear in a certain order. The set [‘ret’, ‘time’] will match ‘ret’ and ‘time’ in any order. Generally, this will not be a problem because there aren’t many instances where you will get a false positive match for a multi-word column due to the order of the words.
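
The fragility of a single multi-word string compared with a “set” can be seen with plain “in” checks. The two helper functions and the column names below are hypothetical and only mimic the documented “in” semantics; they are not the package's implementation.

```python
def in_strings_match(name, in_strings):
    """Hypothetical helper: any single string is contained in name."""
    return any(s in name for s in in_strings)

def in_string_sets_match(name, in_string_sets):
    """Hypothetical helper: all strings of at least one set are contained in name."""
    return any(all(s in name for s in string_set) for string_set in in_string_sets)

single = ['retention time']
sets = [['retention', 'time'], ['rentetion', 'time'], ['retension', 'time']]

# Illustrative column names, already lowercased.
names = ['retention time', 'retention_time', 'time of retention', 'rentetion time']
for name in names:
    print(name, in_strings_match(name, single), in_string_sets_match(name, sets))
```

Only the first name is matched by the single string, while all four are matched by the sets, including the one where the word order is reversed.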

Find a column for “other_id”.

>>> NameMatcher(not_regex_search_strings = ['cas'],
...             in_strings = ['other'],
...             in_string_sets = [['database', 'identifier'], ['chemical', 'id'], ['cmpd', 'id'],
...                               ['database', 'id'], ['database', 'match'], ['local', 'id'],
...                               ['row', 'id'], ['comp', 'id'], ['chem', 'id'], ['chro', 'lib', 'id'],
...                               ['lib', 'id']],
...             not_in_strings = ['type', 'pubchem', 'chemspider', 'kegg'],
...             exact_strings = ['id'],)

This is another real example based on the datasets in the Metabolomics Workbench. It is shown to demonstrate the “exact_strings” attribute. There are many columns that contain the “id” string. There are specific database ID columns, such as those from PubChem or KEGG, but there are often lesser known or individual lab IDs. This example is trying to lump many of the lesser ones into a single “other_id” column. Trying to have “id” in an in_strings or regex_search_strings attribute would cause far too many false positive matches for reasons described in the first example, but there are columns simply labeled “ID”, so the only recourse is to use the exact_strings attribute to match them exactly.

Typical usage.

>>> df = pandas.read_csv('some_file.csv')
>>> name_matcher = NameMatcher(exact_strings = ['foo'])
>>> modified_columns = {column_name: column_name.lower().strip() for column_name in df.columns}
>>> matching_columns = name_matcher.dict_match(modified_columns)

NameMatcher is really meant to be used as part of a ColumnFinder, but this example uses it directly for simplicity. The instantiated NameMatcher is also very simple in this example because it is trying to show the usage of the dict_match method more than anything else. dict_match requires a dictionary as input, rather than a simple list so that column names can be modified if necessary for easier matching, but then still be linked back to the original name in the dataframe.

Attributes:

regex_search_strings (list[str]) – The current list of strings used for regex searching.
not_regex_search_strings (list[str]) – The current list of strings used for regex searching to exclude names.
regex_search_sets (list[list[str]]) – The current list of string sets used for regex searching.
in_strings (list[str]) – The current list of strings used for “in” operator matching.
not_in_strings (list[str]) – The current list of strings used for “in” operator matching to exclude names.
in_string_sets (list[list[str]]) – The current list of string sets used for “in” operator matching.
exact_strings (list[str]) – The current collection of strings used for “==” operator matching.

dict_match(name_map)[source]#

Return a list of names that match based on the NameMatcher attributes.

Find all names in name_map that match. name_map should be a dictionary of original names to modified names. The value is used for matching, but the key is what will be returned. The regex, “in”, and exact string attributes are ORed together, meaning a match on any of them is sufficient, except for the “not” attributes. If a column name is matched by a “not” attribute, then it overrides other matches and the name will be filtered out.

Parameters: name_map (dict[str, str]) – A dictionary of original names to the modified version of that name to use for matching.
Returns: A list of names that match based on the NameMatcher attributes.
Return type: list[str]
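
The OR and “not” logic described for dict_match can be illustrated with a reduced stand-in. The function below is hypothetical and supports only the “in” and exact criteria; it is meant to show how keys and values of name_map are used and how an exclusion overrides a positive match, not to reproduce the real method.

```python
def sketch_dict_match(name_map, in_strings=(), not_in_strings=(), exact_strings=()):
    """Hypothetical, reduced stand-in for the matching logic described above."""
    matches = []
    for original, modified in name_map.items():
        # Positive criteria are ORed together and tested on the modified name.
        positive = (any(s in modified for s in in_strings)
                    or modified in exact_strings)
        # Any "not" hit overrides the positive criteria.
        excluded = any(s in modified for s in not_in_strings)
        if positive and not excluded:
            # The original name (the key) is what gets returned.
            matches.append(original)
    return matches

# Illustrative column names mapped to their lowercased, stripped forms.
name_map = {' M/Z ': 'm/z', 'Spectrum m/z': 'spectrum m/z', 'ID': 'id', 'Name': 'name'}
result = sketch_dict_match(name_map,
                           in_strings=['m/z'],
                           not_in_strings=['spec'],
                           exact_strings=['id'])
print(result)
```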

class mwtab.metadata_column_matching.ValueMatcher(values_type=None, values_regex=None, values_inverse_regex=None)[source]#

Used to find a mask for certain values in a column.

Mostly intended to be used through the ColumnFinder class. Created for the purpose of matching tabular column data based on regular expressions and type criteria.

Parameters:

values_type (None | str) – A string whose only relevant values are ‘integer’, ‘numeric’, and ‘non-numeric’. ‘integer’ will only match values in a column that are integer numbers. ‘numeric’ will only match values that are numbers, this includes integers. ‘non-numeric’ will only match values that are non-numeric. Numeric values can be in the value, but cannot be the whole value. For example, ‘123 id’ is considered non-numeric.
values_regex (None | str) – A regular expression to positively identify values in a column.
values_inverse_regex (None | str) – A regular expression to negatively identify values in a column. This is mutually exclusive with values_regex. If both are given, values_regex takes precedence and values_inverse_regex is ignored. values_type can be combined with either regex, and values must match both criteria to match overall.
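
How the package decides whether a value is an integer, numeric, or non-numeric is not spelled out here, so the following is only a sketch under stated assumptions: a value counts as an integer if int() accepts its string form and as numeric if float() does. The helpers classify and matches_type are hypothetical, and edge cases such as 'nan' or '1_000' are handled however Python's own conversions handle them.

```python
def classify(value):
    """Hypothetical helper returning 'integer', 'numeric', or 'non-numeric'."""
    text = str(value).strip()
    try:
        int(text)
        return 'integer'
    except ValueError:
        pass
    try:
        float(text)
        return 'numeric'
    except ValueError:
        return 'non-numeric'

def matches_type(value, values_type):
    """Hypothetical helper applying the three documented values_type options."""
    kind = classify(value)
    if values_type == 'integer':
        return kind == 'integer'
    if values_type == 'numeric':
        # Integers are also numeric.
        return kind in ('integer', 'numeric')
    if values_type == 'non-numeric':
        return kind == 'non-numeric'
    raise ValueError('unsupported values_type: ' + repr(values_type))

for value in [1, '1', '1.5', 'foo', '123 id']:
    print(repr(value), classify(value))
```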

Examples

Simple type example.

>>> vm = ValueMatcher(values_type = 'numeric')
>>> test = pandas.Series([1, '1', 'foo'])
>>> vm.series_match(test)
0     True
1     True
2    False
dtype: bool

This ValueMatcher is very simple and will only match numeric values. Note that numeric values in string form are also recognized as numeric. This is intentional.

Simple regex example.

>>> vm = ValueMatcher(values_regex = 'foo.*')
>>> test = pandas.Series(['foo', 'bar', 'foobar', 1])
>>> vm.series_match(test)
0     True
1    False
2     True
3    False
dtype: bool

Simple inverse regex example.

>>> vm = ValueMatcher(values_inverse_regex = 'foo.*')
>>> test = pandas.Series(['foo', 'bar', 'foobar', 1])
>>> vm.series_match(test)
0    False
1     True
2    False
3    False
dtype: bool

Note that 1 is False in both regex examples. In general this was designed with strings in mind, so it is recommended to convert all values to strings in any Series delivered to series_match. It is also HIGHLY recommended to use the ‘string[pyarrow]’ dtype when using the regex attributes. This dtype uses much faster regular expression algorithms and can be orders of magnitude faster than Python’s built-in regular expressions. Some features of regular expressions cannot be used with the ‘string[pyarrow]’ dtype though, for example, lookahead assertions. More information can be found at https://pypi.org/project/re2/.

Attributes:

values_type – The current type of the values being matched.
values_regex – The regular expression to positively identify values in a column.
values_inverse_regex – The regular expression used to exclude values in a column.

series_match(series, na_values=None, match_na_values=True)[source]#

Return a mask for the series based on type and regex matching.

“values_regex” and “values_inverse_regex” are mutually exclusive and “values_regex” will take precedence if both are given. “values_type” and one of the regex parameters can both be used, the intermediate masks are ANDed together. “values_type” can only be “integer”, “numeric”, or “non-numeric” to match those types, respectively.

Parameters:

series (Series) – series to match values based on type and/or regex.
na_values (list | None) – list of values to consider NA values.
match_na_values (bool) – If True, NA values will be considered a match and return True, False otherwise.

Returns:

A pandas Series the same length as “series” with Boolean values that can be used to select the matching values in the series.

Return type:

Series
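
The precedence and AND rules described for series_match can be sketched on a plain list. This is a simplified, hypothetical stand-in: it assumes re.search semantics, treats non-string values as non-matching for both regex options (consistent with the value 1 being False in the regex examples above), takes the type test as a caller-supplied function, and ignores NA handling entirely.

```python
import re

def sketch_mask(values, type_check=None, values_regex=None, values_inverse_regex=None):
    """Hypothetical stand-in: build a Boolean mask from type and regex criteria."""
    mask = []
    for value in values:
        ok = True
        if type_check is not None:
            ok = ok and bool(type_check(value))
        if values_regex is not None:
            # values_regex takes precedence; the inverse regex is ignored.
            ok = ok and isinstance(value, str) and re.search(values_regex, value) is not None
        elif values_inverse_regex is not None:
            ok = ok and isinstance(value, str) and re.search(values_inverse_regex, value) is None
        mask.append(ok)
    return mask

test = ['foo', 'bar', 'foobar', 1]
print(sketch_mask(test, values_regex='foo.*'))          # [True, False, True, False]
print(sketch_mask(test, values_inverse_regex='foo.*'))  # [False, True, False, False]
# Both regexes given: the result equals the values_regex-only mask.
print(sketch_mask(test, values_regex='foo.*', values_inverse_regex='foo.*'))
# Type and regex criteria are ANDed together.
print(sketch_mask(['123', '12a', 'abc'], type_check=lambda v: v.isdigit(), values_regex='^1'))
```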

class mwtab.metadata_column_matching.ColumnFinder(standard_name, name_matcher, value_matcher)[source]#

Used to find columns in a DataFrame that match a NameMatcher and values in the column that match a ValueMatcher.

This is pretty much just a convenient way to keep the standard_name, NameMatcher, and ValueMatcher together in a single object. Convenience methods to utilize the NameMatcher and ValueMatcher are provided as name_dict_match and values_series_match, respectively.

Parameters:

standard_name (str) – A string to give a standard name to the column you are trying to find. Not used by any methods.
name_matcher (NameMatcher) – The NameMatcher object used to match column names.
value_matcher (ValueMatcher) – The ValueMatcher object used to match column values.

Examples

Basic usage.

>>> df = pandas.DataFrame({'foo':[1, 2, 'asdf'], 'bar':[1, 2, 3]})
>>> df
    foo  bar
0     1    1
1     2    2
2  asdf    3
>>> column_finder = ColumnFinder('FOO', NameMatcher(exact_strings = ['foo']), ValueMatcher(values_type = 'numeric'))
>>> modified_columns = {column_name: column_name.lower().strip() for column_name in df.columns}
>>> matching_columns = column_finder.name_dict_match(modified_columns)
>>> matched_column_name = matching_columns[0]
>>> matched_column_name
'foo'
>>> matching_values = column_finder.values_series_match(df.loc[:, matched_column_name])
>>> matching_values
0     True
1     True
2    False
dtype: bool

Attributes:

standard_name – The standard name of the column being searched for.
name_matcher – The NameMatcher object used to match column names.
value_matcher – The ValueMatcher object used to match column values.

name_dict_match(name_map)[source]#

Convenience method to use the dict_match method for name_matcher.

values_series_match(series, na_values=None, match_na_values=True)[source]#

Convenience method to use the series_match method for value_matcher.

mwtab.metadata_column_matching.make_list_regex(element_regex, delimiter, quoted_elements=False, empty_string=False)[source]#

Creates a regular expression that will match a list of element_regex delimited by delimiter.

Note that delimiter can be a regular expression like (,|;) to match 2 different types of delimiters. If quoted_elements is True, then allow element_regex to be surrounded by single or double quotes. Note that this allows mixed elements, so quoted and unquoted elements are both allowed in the same list. If empty_string is True, then the list regex will match a single element_regex and the empty string. empty_string = True will actually match anything, but the length of the match for strings that are not appropriate will be 0. So this parameter could be useful in some edge case scenarios, but you must investigate the specific match more closely. If the match is the empty string, but the given string is not itself the empty string, then it is not really a match.

Parameters:

element_regex (str) – A regular expression in the form of a string that matches the elements of the list to match.
delimiter (str) – The character(s) that separate list elements.
quoted_elements (bool) – If True, list elements can be surrounded by single or double quotes.
empty_string (bool) – If True, then allow the returned regular expression to match the empty string.

Returns:

A regular expression in str form that will match a list of element_regexes delimited by delimiter.

Return type:

str

Examples

Regular expression to match a list of 4 digit numbers.

>>> regex = make_list_regex(r'\d\d\d\d', r',')
>>> regex
'((\\d\\d\\d\\d\\s*,\\s*)+(\\d\\d\\d\\d\\s*|\\s*))'
>>> bool(re.match(regex, '1234'))
False
>>> bool(re.match(regex, '1234, 5678'))
True
>>> bool(re.match(regex, ''))
False

Allow the empty string.

>>> regex = make_list_regex(r'\d\d\d\d', r',', empty_string = True)
>>> regex
'((\\d\\d\\d\\d\\s*,\\s*)*(\\d\\d\\d\\d\\s*|\\s*))'
>>> bool(re.match(regex, '1234'))
True
>>> bool(re.match(regex, '1234, 5678'))
True
>>> bool(re.match(regex, ''))
True
>>> bool(re.match(regex, 'asdf'))
True
>>> re.match(regex, 'asdf')
<re.Match object; span=(0, 0), match=''>

Allow numbers to be surrounded with quotation marks.

>>> regex = make_list_regex(r'\d\d\d\d', r',', quoted_elements = True)
>>> regex
'(((\\d\\d\\d\\d\\s*,\\s*)+(\\d\\d\\d\\d\\s*|\\s*))|((\'\\d\\d\\d\\d\'\\s*,\\s*)+(\'\\d\\d\\d\\d\'\\s*|\\s*))|(("\\d\\d\\d\\d"\\s*,\\s*)+("\\d\\d\\d\\d"\\s*|\\s*)))'
>>> bool(re.match(regex, '1234'))
False
>>> bool(re.match(regex, '1234, 5678'))
True
>>> bool(re.match(regex, '1234, "5678"'))
True
>>> bool(re.match(regex, '\'1234\', "5678"'))
True
>>> bool(re.match(regex, ''))
False
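
The caveat about empty_string = True, where any string produces a zero-length match, can be handled with a small check on the match object. The sketch below hard-codes the pattern shown in the empty string example rather than calling make_list_regex, and the helper is_meaningful_match is hypothetical; it applies the rule stated in the description, namely that an empty match only counts when the input itself is empty.

```python
import re

# Pattern copied from the empty_string = True example above.
regex = '((\\d\\d\\d\\d\\s*,\\s*)*(\\d\\d\\d\\d\\s*|\\s*))'

def is_meaningful_match(regex, text):
    """Hypothetical helper: reject zero-length matches on non-empty input."""
    match = re.match(regex, text)
    if match is None:
        return False
    return not (match.group(0) == '' and text != '')

print(is_meaningful_match(regex, '1234'))        # True
print(is_meaningful_match(regex, '1234, 5678'))  # True
print(is_meaningful_match(regex, ''))            # True
print(is_meaningful_match(regex, 'asdf'))        # False
```

This check still accepts strings that merely start with a valid list. When the whole string must be a list, re.fullmatch is a stricter alternative.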