The mwtab API Reference#
Routines for working with mwTab format files used by the
Metabolomics Workbench.
This package includes the following modules:
mwtab – This module provides the MWTabFile class, which is a Python dictionary representation of a Metabolomics Workbench mwTab file. Data can be accessed directly from the MWTabFile instance using bracket accessors.
cli – This module provides the command-line interface for the mwtab package.
tokenizer – This module provides the tokenizer() generator, which generates tuples of key-value pairs from mwTab files.
fileio – This module provides the read_files() generator to open files from different sources (single file/multiple files on a local machine, directory/archive of files, URL address of a file).
converter – This module provides the Converter class, which is responsible for converting mwTab formatted files into their JSON representation and vice versa.
mwschema – This module provides JSON schema definitions for mwTab formatted files, i.e. it specifies required and optional keys as well as data types.
validator – This module provides routines to validate mwTab formatted files based on schema definitions, as well as checks for file self-consistency.
mwrest – This module provides the GenericMWURL class, which is a Python dictionary representation of a Metabolomics Workbench REST URL. The class is used to validate query parameters and to generate a URL path which can be used to request data from the Metabolomics Workbench through its REST API.
metadata_column_matching – This module provides the ColumnFinder class, which is composed of a NameMatcher and a ValueMatcher. They are used to match column names and values, respectively. Matching is done using regular expressions, the “in” operator, the equality operator, and type matching for values. This module also includes the “column_finders” dictionary, a dictionary of ColumnFinders created to match the most common columns found in the Metabolomics Workbench datasets. More information about this module can be found on the Metadata Column Matching page.
mwtab.mwtab#
This module provides the MWTabFile class
that stores the data from a single mwTab formatted file in the
form of a dict. Data can be accessed
directly from the MWTabFile instance using
bracket accessors.
The data is divided into a series of “sections” which each contain a
number of “key-value”-like pairs. Also, the file contains a specially
formatted SUBJECT_SAMPLE_FACTORS block and blocks of data between
*_START and *_END.
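As a rough illustration of that layout, the sketch below uses a plain dict as a stand-in for a parsed file. It is not a real MWTabFile instance: the section names follow the ones mentioned in this reference, but every value is invented for the example and the exact nesting in a real file may differ.

```python
# Stand-in for a parsed file: a plain dict, NOT a real MWTabFile instance.
# All values below are invented for illustration.
mwtab_like = {
    "METABOLOMICS WORKBENCH": {"STUDY_ID": "ST000001", "ANALYSIS_ID": "AN000001"},
    "PROJECT": {"PROJECT_TITLE": "Example project"},
    "SUBJECT_SAMPLE_FACTORS": [
        {"Subject ID": "-", "Sample ID": "S1", "Factors": {"Treatment": "Control"}},
    ],
    "MS_METABOLITE_DATA": {
        "Units": "peak area",
        "Data": [{"Metabolite": "glucose", "S1": "1024.5"}],
    },
}

# Sections are reached with bracket accessors, then keys within the section.
study_id = mwtab_like["METABOLOMICS WORKBENCH"]["STUDY_ID"]
first_row = mwtab_like["MS_METABOLITE_DATA"]["Data"][0]
print(study_id, first_row["Metabolite"])
```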
- class mwtab.mwtab.MWTabFile(source, duplicate_keys=False, force=False, *args, **kwds)[source]#
MWTabFile class that stores data from a single mwTab formatted file in the form of a dictionary.
- Parameters:
source – A string that should be the file path to the mwtab file that will be read in.
duplicate_keys – If True, use a special dictionary type that can handle duplicate keys. If you are using this class to build an mwtab file by hand, don’t set this to True. This was added because some files already uploaded to the Metabolomics Workbench erroneously have duplicate keys, and the class needed to be able to read them and write them back out correctly.
force – If True, replace non-dictionary values in METABOLITES_DATA, METABOLITES, and EXTENDED tables with empty dicts on JSON read in.
- Attributes:
source – A string that should be the file path to the mwtab file that was read in.
study_id – A managed property. The study ID is stored in the METABOLOMICS WORKBENCH key of this class and the JSON version of an mwTab file. This property is provided as a convenience to access the study ID.
analysis_id – A managed property. The analysis ID is stored in the METABOLOMICS WORKBENCH key of this class and the JSON version of an mwTab file. This property is provided as a convenience to access the analysis ID.
header – A managed property. It is provided as a convenience to be able to view the header line of the file. You can also set the METABOLOMICS WORKBENCH key by setting this property. It will parse the string you assign into the dictionary that belongs in the METABOLOMICS WORKBENCH key, assuming the provided string is a correctly generated header line.
data_section_key – A simple property that will give you the key to the data section. Either one of “MS_METABOLITE_DATA”, “NMR_METABOLITE_DATA”, or “NMR_BINNED_DATA”, or None if none of those were found.
- Special Notes:
In general this class has the same structure as mwTab JSON, but there are a few exceptions to that. One is that the _RESULTS_FILE subsection, whether in the MS or NM section, is a dictionary in this class with keys for the elements found. In the mwTab JSON this is just a string. The possible keys the dictionary could have are: “filename”, “UNITS”, “Has m/z”, “Has RT”, and “RT units”. If the _RESULTS_FILE line did not have these keys, then they won’t be in the dictionary.
The Metabolomics Workbench has deprecated mwTab files with NMR_BINNED_DATA sections, but for the few that do exist, if you read them in using this class, the dictionaries in the [‘NMR_BINNED_DATA’][‘Data’] list of dicts will have keys for both “Metabolite” and “Bin range(ppm)”. This is because the JSON version from the Metabolomics Workbench uses “Bin range(ppm)” and not “Metabolite”, but we wanted to present a seamless unified interface for this class regardless of the analysis type. They print out with only the “Bin range(ppm)” keys to match what the Metabolomics Workbench provides, but internally both keys will be there. If they somehow become different, you will see a message about it when you try to write the file out.
- validate(ms_schema=mwschema.ms_required_schema, nmr_schema=mwschema.nmr_required_schema, verbose=True)[source]#
Validate the instance.
- Parameters:
- Returns:
Error messages as a single string and error messages in JSON form. If verbose is True, then the single string will be None.
- Return type:
- property data_section_key#
Easily determine the data_section_key.
The key will be one of “MS_METABOLITE_DATA”, “NMR_METABOLITE_DATA”, or “NMR_BINNED_DATA”, but will be None if none of those keys are found.
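The lookup described above can be sketched with a small helper. This is not the library's implementation: `find_data_section_key` is a hypothetical name, and the order in which the three keys are checked is an assumption.

```python
# The three possible data section keys named in this reference.
DATA_SECTION_KEYS = ("MS_METABOLITE_DATA", "NMR_METABOLITE_DATA", "NMR_BINNED_DATA")


def find_data_section_key(mapping):
    """Return the first known data section key present in mapping, else None.

    Hypothetical helper; the precedence order is an assumption.
    """
    for key in DATA_SECTION_KEYS:
        if key in mapping:
            return key
    return None


print(find_data_section_key({"PROJECT": {}, "NMR_BINNED_DATA": {}}))
print(find_data_section_key({"PROJECT": {}}))
```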
- set_table_from_pandas(df, table_name, clear_header=False)[source]#
Update the table given by table_name from the provided pandas.DataFrame.
table_name must be one of “Metabolites”, “Extended”, or “Data”.
- set_metabolites_from_pandas(df, clear_header=False)[source]#
Update MWTabFile based on provided pandas.DataFrame.
Overwrite the current list of dicts in self[data_section_key][‘Metabolites’] with the values in df. Also overwrites self._metabolite_header with the columns in df, excluding the first column. df is assumed to have the first column as the ‘Metabolites’ column.
- set_extended_from_pandas(df, clear_header=False)[source]#
Update MWTabFile based on provided pandas.DataFrame.
Overwrite the current list of dicts in self[data_section_key][‘Extended’] with the values in df. Also overwrites self._extended_metabolite_header with the columns in df, excluding the first column. df is assumed to have the first column as the ‘Metabolites’ column.
- set_metabolites_data_from_pandas(df, clear_header=False)[source]#
Update MWTabFile based on provided pandas.DataFrame.
Overwrite the current list of dicts in self[data_section_key][‘Data’] with the values in df. Also overwrites self._samples with the columns in df, excluding the first column. df is assumed to have the first column as the ‘Metabolites’ column.
- get_table_as_pandas(table_name)[source]#
Return the given table_name as a pandas.DataFrame.
table_name must be one of “Metabolites”, “Extended”, or “Data”. Note that if there are duplicate column names, they will have a string matching {{{_\d+_}}} (a number between underscores inside triple braces) appended to the end of the name.
- Parameters:
table_name (str) – the name of the table to return as a pandas.DataFrame.
- Returns:
The list of dicts for the given table_name as a pandas.DataFrame.
- Return type:
pandas.DataFrame
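If the original column names are needed after duplicates have been renamed, the appended string can be stripped with a regular expression. The sketch below assumes the suffix looks like `{{{_1_}}}` (a number between underscores inside triple braces); `strip_duplicate_suffix` is a hypothetical helper, not part of the package.

```python
import re

# Assumed suffix format: three opening braces, underscore, digits,
# underscore, three closing braces, at the very end of the name.
DUPLICATE_SUFFIX = re.compile(r"\{\{\{_\d+_\}\}\}$")


def strip_duplicate_suffix(column_name):
    """Remove a trailing duplicate marker from a column name, if present."""
    return DUPLICATE_SUFFIX.sub("", column_name)


print(strip_duplicate_suffix("retention time{{{_1_}}}"))
print(strip_duplicate_suffix("retention time"))
```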
- get_metabolites_as_pandas()[source]#
Return the Metabolites table as a pandas.DataFrame.
Note that if there are duplicate column names, they will have a string matching {{{_\d+_}}} (a number between underscores inside triple braces) appended to the end of the name.
- Returns:
The list of dicts for the Metabolites table as a pandas.DataFrame.
- Return type:
pandas.DataFrame
- get_extended_as_pandas()[source]#
Return the Extended table as a pandas.DataFrame.
Note that if there are duplicate column names, they will have a string matching {{{_\d+_}}} (a number between underscores inside triple braces) appended to the end of the name.
- Returns:
The list of dicts for the Extended table as a pandas.DataFrame.
- Return type:
pandas.DataFrame
- get_metabolites_data_as_pandas()[source]#
Return the Data table as a pandas.DataFrame.
Note that if there are duplicate column names, they will have a string matching {{{_\d+_}}} (a number between underscores inside triple braces) appended to the end of the name.
- Returns:
The list of dicts for the Data table as a pandas.DataFrame.
- Return type:
pandas.DataFrame
- read_from_str(input_str)[source]#
Read input_str into a MWTabFile instance.
- Returns:
None
- Return type:
- read(filehandle)[source]#
Read data into a MWTabFile instance.
- Parameters:
filehandle (io.TextIOWrapper, gzip.GzipFile, bz2.BZ2File, zipfile.ZipFile) – file-like object.
- Returns:
None
- Return type:
- write(filehandle, file_format)[source]#
Write MWTabFile data into file.
- Parameters:
filehandle (io.TextIOWrapper) – file-like object.
file_format (str) – Format to use to write data: mwtab or json.
- Returns:
None
- Return type:
- print_file(f=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, file_format='mwtab')[source]#
Print MWTabFile into a file or stdout.
- Parameters:
f (io.StringIO) – writable file-like stream.
file_format (str) – Format to use: mwtab or json.
- Returns:
None
- Return type:
- print_subject_sample_factors(section_key, f=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, file_format='mwtab')[source]#
Print mwtab SUBJECT_SAMPLE_FACTORS section into a file or stdout.
- Parameters:
section_key (str) – Section name.
f (io.StringIO) – writable file-like stream.
file_format (str) – Format to use: mwtab or json.
- Returns:
None
- Return type:
- print_block(section_key, f=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, file_format='mwtab')[source]#
Print mwtab section into a file or stdout.
- Parameters:
section_key (str) – Section name.
f (io.StringIO) – writable file-like stream.
file_format (str) – Format to use: mwtab or json.
- Returns:
None
- Return type:
The mwtab Command Line Interface#
Usage:
mwtab -h | --help
mwtab --version
mwtab convert (<from-path> <to-path>) [--from-format=<format>] [--to-format=<format>] [--mw-rest=<url>] [--force] [--verbose]
mwtab validate <from-path> [--to-path=<path>] [--mw-rest=<url>] [--force] [--silent]
mwtab download url <url> [--to-path=<path>] [--verbose]
mwtab download study all [--to-path=<path>] [--input-item=<item>] [--output-format=<format>] [--mw-rest=<url>] [--verbose]
mwtab download study <input-value> [--to-path=<path>] [--input-item=<item>] [--output-item=<item>] [--output-format=<format>] [--mw-rest=<url>] [--verbose]
mwtab download (study | compound | refmet | gene | protein) <input-item> <input-value> <output-item> [--output-format=<format>] [--to-path=<path>] [--mw-rest=<url>] [--verbose]
mwtab download moverz <input-item> <m/z-value> <ion-type-value> <m/z-tolerance-value> [--to-path=<path>] [--mw-rest=<url>] [--verbose]
mwtab download exactmass <LIPID-abbreviation> <ion-type-value> [--to-path=<path>] [--mw-rest=<url>] [--verbose]
mwtab extract metadata <from-path> <to-path> <key> ... [--to-format=<format>] [--no-header] [--force]
mwtab extract metabolites <from-path> <to-path> (<key> <value>) ... [--to-format=<format>] [--no-header] [--force]
Options:
-h, --help Show this screen.
--version Show version.
--verbose Print what files are processing.
--silent Silence all standard output.
--from-format=<format> Input file format, available formats: mwtab, json [default: mwtab].
--to-format=<format> Output file format [default: json].
Available formats for convert:
mwtab, json.
Available formats for extract:
json, csv.
--mw-rest=<url> URL to MW REST interface
[default: https://www.metabolomicsworkbench.org/rest/].
--to-path=<path> Directory to save outputs into. Defaults to the current working directory.
For the validate command, if the given path ends in '.json', then
all JSON file outputs will be condensed into that 1 file. Also for
the validate command no output files are saved unless this option is given.
--prefix=<prefix> Prefix to add at the beginning of the output file name. Defaults to no prefix.
--suffix=<suffix> Suffix to add at the end of the output file name. Defaults to no suffix.
--context=<context> Type of resource to access from MW REST interface, available contexts: study,
compound, refmet, gene, protein, moverz, exactmass [default: study].
--input-item=<item> Item to search Metabolomics Workbench with.
--output-item=<item> Item to be retrieved from Metabolomics Workbench.
--output-format=<format> Format for item to be retrieved in, available formats: mwtab, json.
--no-header Omit the header at the top of csv formatted files.
--force Ignore non-dictionary values in METABOLITES_DATA, METABOLITES, and EXTENDED tables for JSON files.
For extraction <to-path> can take a "-" which will use stdout.
All <from-path>'s can be single files, directories, or URLs.
Documentation webpage: https://moseleybioinformaticslab.github.io/mwtab/
GitHub webpage: https://github.com/MoseleyBioinformaticsLab/mwtab
- mwtab.cli.cli(cmdargs)[source]#
Implements the command line interface.
- Parameters:
cmdargs (dict) – dictionary of command line arguments.
mwtab.tokenizer#
This module provides the tokenizer() lexical analyzer for
mwTab format syntax. It is implemented as a Python generator-based state
machine which generates (yields) tokens one at a time when next()
is invoked on a tokenizer() instance.
Each token is a tuple of “key-value”-like pairs, a tuple of
SUBJECT_SAMPLE_FACTORS, or a tuple of data deposited between
*_START and *_END blocks.
- mwtab.tokenizer.tokenizer(text, dict_type=None)[source]#
A lexical analyzer for the mwtab formatted files.
- Parameters:
text (str) – mwTab formatted text.
dict_type – the type of dictionary to use, default is dict.
- Returns:
Tuples of data.
- Return type:
namedtuple
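To make the generator-based design concrete, here is a minimal sketch of a tokenizer for a deliberately simplified, mwTab-like text. It is not the real mwTab grammar and not the package's implementation: `toy_tokenizer`, the `KeyValue` tuple, and the `#BLOCK_START`, `#ROW`, and `#BLOCK_END` marker keys are all invented for the example.

```python
from collections import namedtuple

# Hypothetical token type; the real tokenizer defines its own namedtuples.
KeyValue = namedtuple("KeyValue", ["key", "value"])


def toy_tokenizer(text):
    """Yield KeyValue tokens one at a time from a simplified mwTab-like text."""
    in_block = False  # state: are we between *_START and *_END?
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.endswith("_START"):
            in_block = True
            yield KeyValue("#BLOCK_START", line[: -len("_START")])
        elif line.endswith("_END"):
            in_block = False
            yield KeyValue("#BLOCK_END", line[: -len("_END")])
        elif in_block:
            # Inside a data block every line is a tab-separated row.
            yield KeyValue("#ROW", tuple(line.split("\t")))
        else:
            # Outside a block, split the line into a key and a value.
            key, _, value = line.partition("\t")
            yield KeyValue(key.strip(), value.strip())


sample = (
    "PR:PROJECT_TITLE\tExample\n"
    "MS_METABOLITE_DATA_START\n"
    "Samples\tS1\n"
    "glucose\t1.5\n"
    "MS_METABOLITE_DATA_END\n"
)
tokens = toy_tokenizer(sample)
first = next(tokens)  # tokens are produced lazily, one per next() call
rest = list(tokens)
print(first)
print(rest)
```

Because the function is a generator, nothing is parsed until next() is called, which is what lets a caller stop early or process large files incrementally.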
mwtab.fileio#
This module provides routines for reading mwTab formatted files
from different kinds of sources:
Single mwTab formatted file on a local machine.
Directory containing multiple mwTab formatted files.
Compressed zip/tar archive of mwTab formatted files.
URL address of mwTab formatted file.
ANALYSIS_ID of mwTab formatted file.
- mwtab.fileio.read_files(sources, *, read_class=<class 'mwtab.mwtab.MWTabFile'>, class_kwds={'duplicate_keys': True}, return_exceptions=False)#
Read from sources using the given read_class.
This was really created so that functools.partial can be used to create a read method for a particular class.
- Parameters:
sources (str | list[str]) – A string or list of strings to read from.
read_class (type) – A class with a read() method to instantiate to read from source.
class_kwds (dict) – A dictionary of keyword arguments to pass to the class constructor.
return_exceptions (bool) – Whether to yield a tuple with file instance and exception or just the file instance.
- Returns:
Returns the instantiated class and any exceptions, or None and any exceptions, or the source and any exceptions.
- Return type:
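The functools.partial pattern mentioned above can be sketched without the package itself. In the sketch below, `read_sources` is a simplified stand-in for read_files() that takes already-open file-like objects, and `TextReader` is a hypothetical class with a read() method; neither is part of mwtab.

```python
import io
from functools import partial


class TextReader:
    """Hypothetical reader class exposing a read() method."""

    def __init__(self, source, upper=False):
        self.source = source
        self.upper = upper
        self.text = None

    def read(self, filehandle):
        data = filehandle.read()
        self.text = data.upper() if self.upper else data


def read_sources(sources, *, read_class, class_kwds=None):
    """Simplified stand-in for read_files(): sources are (name, handle) pairs."""
    class_kwds = class_kwds or {}
    for name, handle in sources:
        instance = read_class(name, **class_kwds)
        instance.read(handle)
        yield instance


# Bind the class and its keyword arguments once, then reuse the reader.
read_upper = partial(read_sources, read_class=TextReader, class_kwds={"upper": True})
results = list(read_upper([("a.txt", io.StringIO("hello"))]))
print(results[0].source, results[0].text)
```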
mwtab.converter#
This module provides functionality for converting between the
Metabolomics Workbench mwTab formatted file and its equivalent
JSONized representation.
The following conversions are possible:
- Local files:
- One-to-one file conversions:
textfile - to - textfile
textfile - to - textfile.gz
textfile - to - textfile.bz2
textfile.gz - to - textfile
textfile.gz - to - textfile.gz
textfile.gz - to - textfile.bz2
textfile.bz2 - to - textfile
textfile.bz2 - to - textfile.gz
textfile.bz2 - to - textfile.bz2
textfile / textfile.gz / textfile.bz2 - to - textfile.zip / textfile.tar / textfile.tar.gz / textfile.tar.bz2 (TypeError: One-to-many conversion)
- Many-to-many files conversions:
- Directories:
directory - to - directory
directory - to - directory.zip
directory - to - directory.tar
directory - to - directory.tar.bz2
directory - to - directory.tar.gz
directory - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- Zipfiles:
zipfile.zip - to - directory
zipfile.zip - to - zipfile.zip
zipfile.zip - to - tarfile.tar
zipfile.zip - to - tarfile.tar.gz
zipfile.zip - to - tarfile.tar.bz2
zipfile.zip - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- Tarfiles:
tarfile.tar - to - directory
tarfile.tar - to - zipfile.zip
tarfile.tar - to - tarfile.tar
tarfile.tar - to - tarfile.tar.gz
tarfile.tar - to - tarfile.tar.bz2
tarfile.tar - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
tarfile.tar.gz - to - directory
tarfile.tar.gz - to - zipfile.zip
tarfile.tar.gz - to - tarfile.tar
tarfile.tar.gz - to - tarfile.tar.gz
tarfile.tar.gz - to - tarfile.tar.bz2
tarfile.tar.gz - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
tarfile.tar.bz2 - to - directory
tarfile.tar.bz2 - to - zipfile.zip
tarfile.tar.bz2 - to - tarfile.tar
tarfile.tar.bz2 - to - tarfile.tar.gz
tarfile.tar.bz2 - to - tarfile.tar.bz2
tarfile.tar.bz2 - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- URL files:
- One-to-one file conversions:
analysis_id - to - textfile
analysis_id - to - textfile.gz
analysis_id - to - textfile.bz2
analysis_id - to - textfile.zip / textfile.tar / textfile.tar.gz / textfile.tar.bz2 (TypeError: One-to-many conversion)
textfileurl - to - textfile
textfileurl - to - textfile.gz
textfileurl - to - textfile.bz2
textfileurl.gz - to - textfile
textfileurl.gz - to - textfile.gz
textfileurl.gz - to - textfile.bz2
textfileurl.bz2 - to - textfile
textfileurl.bz2 - to - textfile.gz
textfileurl.bz2 - to - textfile.bz2
textfileurl / textfileurl.gz / textfileurl.bz2 - to - textfile.zip / textfile.tar / textfile.tar.gz / textfile.tar.bz2 (TypeError: One-to-many conversion)
- Many-to-many files conversions:
- Zipfiles:
zipfileurl.zip - to - directory
zipfileurl.zip - to - zipfile.zip
zipfileurl.zip - to - tarfile.tar
zipfileurl.zip - to - tarfile.tar.gz
zipfileurl.zip - to - tarfile.tar.bz2
zipfileurl.zip - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
- Tarfiles:
tarfileurl.tar - to - directory
tarfileurl.tar - to - zipfile.zip
tarfileurl.tar - to - tarfile.tar
tarfileurl.tar - to - tarfile.tar.gz
tarfileurl.tar - to - tarfile.tar.bz2
tarfileurl.tar - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
tarfileurl.tar.gz - to - directory
tarfileurl.tar.gz - to - zipfile.zip
tarfileurl.tar.gz - to - tarfile.tar
tarfileurl.tar.gz - to - tarfile.tar.gz
tarfileurl.tar.gz - to - tarfile.tar.bz2
tarfileurl.tar.gz - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
tarfileurl.tar.bz2 - to - directory
tarfileurl.tar.bz2 - to - zipfile.zip
tarfileurl.tar.bz2 - to - tarfile.tar
tarfileurl.tar.bz2 - to - tarfile.tar.gz
tarfileurl.tar.bz2 - to - tarfile.tar.bz2
tarfileurl.tar.bz2 - to - directory.gz / directory.bz2 (TypeError: Many-to-one conversion)
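The rule behind the two TypeError cases in the list above is that a single-file source cannot be written to a multi-file container, and a multi-file source cannot be written to a single compressed file. The sketch below classifies paths by name only and applies that rule. It is an assumption-laden illustration: `path_kind` and `check_conversion` are hypothetical helpers, and the package may decide this differently, for example by inspecting the file contents.

```python
# Extensions that hold many files; anything else is treated as a single file.
MANY_EXTENSIONS = (".zip", ".tar", ".tar.gz", ".tar.bz2")


def path_kind(path, is_directory=False):
    """Classify a path as holding 'many' files or 'one' file, by name only."""
    if is_directory or path.endswith(MANY_EXTENSIONS):
        return "many"
    return "one"


def check_conversion(from_path, to_path, from_is_dir=False, to_is_dir=False):
    """Raise TypeError for one-to-many or many-to-one conversions."""
    from_kind = path_kind(from_path, from_is_dir)
    to_kind = path_kind(to_path, to_is_dir)
    if from_kind == "one" and to_kind == "many":
        raise TypeError("One-to-many conversion")
    if from_kind == "many" and to_kind == "one":
        raise TypeError("Many-to-one conversion")
    return from_kind, to_kind


print(check_conversion("analysis.txt", "analysis.json.gz"))
print(check_conversion("archive.tar.gz", "archive.zip"))
```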
- class mwtab.converter.Translator(from_path, to_path, from_format=None, to_format=None, force=False)[source]#
Translator abstract class.
- class mwtab.converter.MWTabFileToMWTabFile(from_path, to_path, from_format=None, to_format=None, force=False)[source]#
Translator concrete class that can convert between mwTab and JSON formats.
- class mwtab.converter.Converter(from_path, to_path, from_format='mwtab', to_format='json', force=False)[source]#
Converter class to convert mwTab files from mwTab to JSON or from JSON to mwTab format.
mwtab.validator#
This module contains routines to validate consistency of the mwTab
formatted files, e.g. make sure that Samples and Factors
identifiers are consistent across the file, and make sure that all
required key-value pairs are present.
- mwtab.validator.validate_file(mwtabfile, ms_schema, nmr_schema, verbose=False)[source]#
Validate mwTab formatted file.
Note that some of the validations are pretty strict to account for the majority of cases, but if warranted could be ignored. For example, COLUMN_PRESSURE in the CHROMATOGRAPHY section will print a warning if the value is not a single number or range of numbers followed by a unit, but there might be some situations where the method is complex and thus the column pressure is not static. So something like “60 bar at starting conditions. 180 bar at %A” would be required to accurately describe the COLUMN_PRESSURE, and would be valid. So in these kinds of situations the warning printed can safely be ignored.
- Parameters:
mwtabfile (MWTabFile) – The file to be validated.
ms_schema (dict) – jsonschema to validate both the base parts of the file and the MS specific parts of the file.
nmr_schema (dict) – jsonschema to validate both the base parts of the file and the NMR specific parts of the file.
verbose (bool) – whether to be verbose or not.
- Returns:
Error messages as a single string and error messages in JSON form. If verbose is True, then the single string will be None.
- Return type:
mwtab.mwrest#
This module provides routines for accessing the Metabolomics Workbench REST API.
See https://www.metabolomicsworkbench.org/tools/MWRestAPIv1.0.pdf for details.
- mwtab.mwrest.analysis_ids(base_url='https://www.metabolomicsworkbench.org/rest/')[source]#
Method for retrieving a list of analysis ids for every current analysis in Metabolomics Workbench.
- mwtab.mwrest.study_ids(base_url='https://www.metabolomicsworkbench.org/rest/')[source]#
Method for retrieving a list of study ids for every current study in Metabolomics Workbench.
- mwtab.mwrest.generate_mwtab_urls(input_items, base_url='https://www.metabolomicsworkbench.org/rest/', output_format='txt', return_exceptions=False)[source]#
Method for generating URLs to be used to retrieve mwtab files for analyses and studies through the REST API of the Metabolomics Workbench database.
- Parameters:
input_items (list) – List of Metabolomics Workbench input values for mwTab files.
base_url (str) – Base url to Metabolomics Workbench REST API.
output_format (str) – Output format for the mwTab files to be retrieved in.
return_exceptions (bool) – Whether to yield a tuple with url and exception or just the url.
- Returns:
Metabolomics Workbench REST URL string(s).
- Return type:
- class mwtab.mwrest.GenericMWURL(rest_params, base_url='https://www.metabolomicsworkbench.org/rest/')[source]#
GenericMWURL class that stores and validates parameters specifying a Metabolomics Workbench REST URL.
- Metabolomics REST API requests are performed using URL requests in the form of
https://www.metabolomicsworkbench.org/rest/context/input_specification/output_specification
where:
if context = "study" | "compound" | "refmet" | "gene" | "protein"
    input_specification = input_item/input_value
    output_specification = output_item/[output_format]
elif context = "moverz"
    input_specification = input_item/input_value1/input_value2/input_value3
        input_item = "LIPIDS" | "MB" | "REFMET"
        input_value1 = m/z_value
        input_value2 = ion_type_value
        input_value3 = m/z_tolerance_value
    output_specification = output_format
        output_format = "txt"
elif context = "exactmass"
    input_specification = input_item/input_value1/input_value2
        input_item = "LIPIDS" | "MB" | "REFMET"
        input_value1 = LIPID_abbreviation
        input_value2 = ion_type_value
    output_specification = None
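The URL layout above amounts to joining path segments in a fixed order. The sketch below shows that joining step only. `build_rest_url` is a hypothetical helper, not the GenericMWURL API, and it performs none of the validation the class is described as doing; the input values are illustrative.

```python
BASE_URL = "https://www.metabolomicsworkbench.org/rest/"


def build_rest_url(context, input_item, input_values, output_item=None,
                   output_format=None, base_url=BASE_URL):
    """Join context/input_specification/output_specification into one URL.

    Hypothetical helper: no validation or escaping is performed.
    """
    parts = [context, input_item, *input_values]
    if output_item:
        parts.append(output_item)
    if output_format:
        parts.append(output_format)
    return base_url.rstrip("/") + "/" + "/".join(str(part) for part in parts)


# context = "study": input_item/input_value, then output_item.
study_url = build_rest_url("study", "study_id", ["ST000001"], output_item="summary")
# context = "moverz": three input values, then the "txt" output format.
moverz_url = build_rest_url("moverz", "MB", [635.52, "M+H", 0.5], output_format="txt")
print(study_url)
print(moverz_url)
```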
- class mwtab.mwrest.MWRESTFile(source)[source]#
MWRESTFile class that stores data from a single file download through Metabolomics Workbench’s REST API.
Mirrors MWTabFile.
- read(filehandle)[source]#
Read data into a MWRESTFile instance.
- Parameters:
filehandle (io.TextIOWrapper, gzip.GzipFile, bz2.BZ2File, zipfile.ZipFile) – file-like object.
- Returns:
None
- Return type:
- write(filehandle)[source]#
Write MWRESTFile data into file.
- Parameters:
filehandle (io.TextIOWrapper) – file-like object.
- Returns:
None
- Return type:
mwtab.mwextract#
This module provides a number of functions and classes for extracting and saving data and metadata
stored in mwTab formatted files in the form of MWTabFile.
- class mwtab.mwextract.ItemMatcher(full_key, value_comparison)[source]#
ItemMatcher class that can be called to match items from mwTab formatted files in the form of MWTabFile.
- class mwtab.mwextract.ReGeXMatcher(full_key, value_comparison)[source]#
ReGeXMatcher class that can be called to match items from mwTab formatted files in the form of MWTabFile using regular expressions.
- mwtab.mwextract.generate_matchers(items)[source]#
Construct a generator that yields Matchers (ItemMatcher or ReGeXMatcher).
- Parameters:
items (iterable) – Iterable object containing key value pairs to match.
- Returns:
Yields a Matcher object for each given item.
- Return type:
ItemMatcher or ReGeXMatcher
- mwtab.mwextract.extract_metabolites(sources, matcher_generator)[source]#
Extract metabolite data from mwTab formatted files in the form of MWTabFile.
- Parameters:
sources (generator) – Generator of mwtab file objects (MWTabFile).
matcher_generator (generator) – Generator of matcher objects (ItemMatcher or ReGeXMatcher).
- Returns:
Extracted metabolites dictionary.
- Return type:
- mwtab.mwextract.extract_metadata(mwtabfile, keys)[source]#
Extract metadata from mwTab formatted files in the form of MWTabFile.
- mwtab.mwextract.write_metadata_csv(to_path, extracted_values, no_header=False)[source]#
Write extracted metadata dict into csv file.
Example:
"metadata","value1","value2"
"SUBJECT_TYPE","Human","Plant"
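The output shown in that example can be reproduced with the standard csv module. The sketch below is a guess at the shape of the data, not the package's code: `metadata_rows` is a hypothetical helper, it assumes the extracted values map each key to a set of strings, and it assumes no_header=True simply omits the first row.

```python
import csv
import io


def metadata_rows(extracted_values, no_header=False):
    """Build csv rows from a {key: set_of_values} dictionary (assumed shape)."""
    rows = []
    if not no_header:
        longest = max((len(values) for values in extracted_values.values()), default=0)
        rows.append(["metadata"] + [f"value{i}" for i in range(1, longest + 1)])
    for key, values in extracted_values.items():
        # Sets are unordered, so sort for a reproducible row.
        rows.append([key] + sorted(values))
    return rows


buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(metadata_rows({"SUBJECT_TYPE": {"Human", "Plant"}}))
output = buffer.getvalue()
print(output)
```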
- mwtab.mwextract.write_metabolites_csv(to_path, extracted_values, no_header=False)[source]#
Write extracted metabolites data dict into csv file.
Example:
"metabolite_name","num-studies","num_analyses","num_samples"
"1,2,4-benzenetriol","1","1","24"
"1-monostearin","1","1","24"
…
- class mwtab.mwextract.SetEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]#
SetEncoder class for encoding Python sets (set) into JSON serializable objects (list).
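The usual way to do this with the standard library is to subclass json.JSONEncoder and override default(). The sketch below shows that technique under a different name, `SetToListEncoder`, to make clear it is an illustration and not the package's SetEncoder; sorting the members is an added choice here to keep the output stable.

```python
import json


class SetToListEncoder(json.JSONEncoder):
    """Illustrative encoder that serializes sets as lists."""

    def default(self, obj):
        if isinstance(obj, (set, frozenset)):
            # Sorted for a stable output; assumes the members are sortable.
            return sorted(obj)
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


encoded = json.dumps({"SUBJECT_TYPE": {"Plant", "Human"}}, cls=SetToListEncoder)
print(encoded)
```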
- mwtab.mwextract.write_json(to_path, extracted_dict)[source]#
Write extracted data or metadata dict into json file.
Metabolites example:
{
    "1,2,4-benzenetriol": {
        "ST000001": {
            "AN000001": [
                "LabF_115816",
                ...
            ]
        }
    }
}
Metadata example:
{
    "SUBJECT_TYPE": [
        "Plant",
        "Human"
    ]
}
Metadata Column Matching#
Regular expressions, functions, and classes to match column names and values in mwtab METABOLITES blocks.
More information can be found on the Metadata Column Matching page.
- mwtab.metadata_column_matching.WRAP_STRING = '[^a-zA-Z0-9]'#
Used for wrapping regexes in certain functions.
- class mwtab.metadata_column_matching.NameMatcher(regex_search_strings=None, not_regex_search_strings=None, regex_search_sets=None, in_strings=None, not_in_strings=None, in_string_sets=None, exact_strings=None)[source]#
Used to filter names that match certain criteria.
Mostly intended to be used through the ColumnFinder class. Created for the purpose of matching tabular column names based on regular expressions and “in” criteria.
- Parameters:
regex_search_strings (None | list[str]) – A collection of strings to deliver to re.search() to match a column name. If any string in the collection matches, then the name is matched. This does not simply look for any of the strings within a column name to match. Each string is wrapped with WRAP_STRING before searching, so the string ‘bar’ would not be found in the column name “foobarbaz”, but would be found in the name “foo bar baz”.
not_regex_search_strings (None | list[str]) – The same as regex_search_strings, except a match to a column name eliminates that name. Attributes that begin with “not” take precedence over the others. So if a column name matches a string in regex_search_strings and not_regex_search_strings, then it will be filtered OUT.
regex_search_sets (None | list[list[str]]) – A collection of sets of strings. Each string in the set of strings must be found in the column name to match, but any set could be found. For example, [(‘foo’, ‘bar’), (‘baz’, ‘asd’)] will match the name “foo bar” or “bar foo”, but not “foobar”, due to the aforementioned wrapping with WRAP_STRING. The names “foo”, “baz”, or “asd” would not match either, but “var asd baz” would.
in_strings (None | list[str]) – Similar to regex_search_strings except instead of using re.search() the “in” operator is used. For example, [‘foo’] would match the column name “a fool”, since ‘foo’ is in “a fool”.
not_in_strings (None | list[str]) – The same as in_strings, but matches to a column name eliminate or filter OUT that name.
in_string_sets (None | list[list[str]]) – The same as regex_search_sets, but each string in a set is determined to match using the “in” operator instead of re.search(). For example, [(‘foo’, ‘bar’), (‘baz’, ‘asd’)] WILL match the column name “foobar” because both ‘foo’ and ‘bar’ are in the name.
exact_strings (None | list[str]) – A collection of strings that must exactly match the column name. For example, [‘foo’, ‘bar’] would only match the column names “foo” or “bar”.
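The difference between the “search” style and “in” style parameters can be checked with a few lines of plain Python. In the sketch below, the way a string is wrapped with WRAP_STRING (allowing the start or end of the name as a boundary) is an assumption about how the wrapping works, and `wrapped_search`, `matches_regex_sets`, and `matches_in_sets` are hypothetical helpers rather than NameMatcher methods.

```python
import re

WRAP_STRING = "[^a-zA-Z0-9]"


def wrapped_search(search_string, name):
    """Search for search_string bounded by non-alphanumerics or the name's ends."""
    # Assumed wrapping: the library may build this pattern differently.
    wrapped = f"(?:^|{WRAP_STRING}){search_string}(?:{WRAP_STRING}|$)"
    return re.search(wrapped, name) is not None


def matches_regex_sets(string_sets, name):
    """True if every string of at least one set is found with wrapped_search."""
    return any(all(wrapped_search(s, name) for s in group) for group in string_sets)


def matches_in_sets(string_sets, name):
    """True if every string of at least one set is a substring of name."""
    return any(all(s in name for s in group) for group in string_sets)


sets = [("foo", "bar"), ("baz", "asd")]
print(wrapped_search("bar", "foobarbaz"), wrapped_search("bar", "foo bar baz"))
print(matches_regex_sets(sets, "bar foo"), matches_regex_sets(sets, "foobar"))
print(matches_in_sets(sets, "foobar"), "foo" in "a fool")
```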
Examples
Find a column for “moverz”.
>>> NameMatcher(regex_search_strings = ['m/z', 'mz', 'moverz', 'mx'],
...             not_regex_search_strings = ['id'],
...             in_strings = ['m.z', 'calcmz', 'medmz', 'm_z', 'obsmz', 'mass to charge', 'mass over z'],
...             not_in_strings = ['spec', 'pectrum', 'structure', 'regno', 'retention'])
This is a real example based on the datasets in the Metabolomics Workbench. We can examine some of the strings to illustrate the attributes’ function. The “id” string needs to be in not_regex_search_strings, rather than not_in_strings, because “id” is a very small substring that could easily be in a longer word. Putting it in not_regex_search_strings means it will most likely match only “ID” fields, such as “PubChem ID”. Note that it is recommended to lowercase all column names before filtering and thus use lowercase strings, but in general NameMatcher is case sensitive. The “spec” string is in not_in_strings, rather than not_regex_search_strings, because the risk of it being in a name that should not be filtered out is low. Also, it catches both the full word “spectrum” and its common abbreviation “spec”. Hopefully, these 2 explanations of “id” and “spec” have illustrated some of the tradeoffs and advantages of the “in” style attributes versus the “search” style ones.
Find a column for “retention time”.
>>> NameMatcher(regex_search_strings = ['rt'],
...             regex_search_sets = [['ret', 'time']],
...             in_strings = ['rtimes', 'r.t.', 'medrt', 'rtsec', 'bestrt', 'compoundrt', 'rtmed'],
...             in_string_sets = [['retention', 'time'], ['rentetion', 'time'], ['retension', 'time']],
...             not_in_strings = ['type', 'error', 'index', 'delta', 'feature', 'm/z'])
This is another real example based on the datasets in the Metabolomics Workbench. It illustrates the “set” style attributes quite well. For multi-word column names the “set” style attributes are usually what you want to use. It is possible to give a string like “retention time”, note the space character, to an attribute like “in_strings”, but this is more fragile than it seems and won’t match some common alternate spellings or mistakes, such as “retention_time” or “retention time”. Using “set” style attributes means you don’t have to add as many strings to an attribute like “in_strings”. You can still see some repetition in the in_string_sets attribute here, though, to cover the many misspellings of “retention”. “set” style attributes are not a good fit if the strings in the set must appear in a certain order, though. The set [‘ret’, ‘time’] will match ‘ret’ and ‘time’ in any order. Generally, this will not be a problem because there aren’t many instances where you will get a false positive match for a multi-word column due to the order of the words.
Find a column for “other_id”.
>>> NameMatcher(not_regex_search_strings = ['cas'],
...             in_strings = ['other'],
...             in_string_sets = [['database', 'identifier'], ['chemical', 'id'], ['cmpd', 'id'],
...                               ['database', 'id'], ['database', 'match'], ['local', 'id'],
...                               ['row', 'id'], ['comp', 'id'], ['chem', 'id'], ['chro', 'lib', 'id'],
...                               ['lib', 'id']],
...             not_in_strings = ['type', 'pubchem', 'chemspider', 'kegg'],
...             exact_strings = ['id'],)
This is another real example based on the datasets in the Metabolomics Workbench. It is shown to demonstrate the “exact_strings” attribute. There are many columns that contain the “id” string. There are specific database ID columns, such as those from PubChem or KEGG, but there are often lesser known or individual lab IDs. This example is trying to lump many of the lesser ones into a single “other_id” column. Trying to have “id” in an in_strings or regex_search_strings attribute would cause far too many false positive matches for reasons described in the first example, but there are columns simply labeled “ID”, so the only recourse is to use the exact_strings attribute to match them exactly.
Typical usage.
>>> df = pandas.read_csv('some_file.csv')
>>> name_matcher = NameMatcher(exact_strings = ['foo'])
>>> modified_columns = {column_name: column_name.lower().strip() for column_name in df.columns}
>>> matching_columns = name_matcher.dict_match(modified_columns)
NameMatcher is really meant to be used as part of a ColumnFinder, but this example uses it directly for simplicity. The instantiated NameMatcher is also very simple in this example because it is trying to show the usage of the dict_match method more than anything else. dict_match requires a dictionary as input, rather than a simple list so that column names can be modified if necessary for easier matching, but then still be linked back to the original name in the dataframe.
- Attributes:
regex_search_strings (list[str]) – The current list of strings used for regex searching.
not_regex_search_strings (list[str]) – The current list of strings used for regex searching to exclude names.
regex_search_sets (list[list[str]]) – The current list of string sets used for regex searching.
in_strings (list[str]) – The current list of strings used for “in” operator matching.
not_in_strings (list[str]) – The current list of strings used for “in” operator matching to exclude names.
in_string_sets (list[list[str]]) – The current list of string sets used for “in” operator matching.
exact_strings (list[str]) – The current collection of strings used for “==” operator matching.
- dict_match(name_map)[source]#
Return a list of names that match based on the NameMatcher attributes.
Find all names in name_map that match. name_map should be a dictionary of original names to modified names. The value is used for matching, but the key is what will be returned. The regex, “in”, and exact string attributes are ORed together, meaning a match on any one of them is enough. The “not” attributes are the exception: if a column name is matched by a “not” attribute, that overrides any other match and the name is filtered out.
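The matching rules described for dict_match can be sketched as follows. sketch_dict_match is a hypothetical stand-in that covers only three of the attribute kinds (in_strings, exact_strings, not_in_strings); it follows the description above and is not the library's code.

```python
def sketch_dict_match(name_map, in_strings=(), exact_strings=(), not_in_strings=()):
    """Return the original names whose modified names match."""
    matches = []
    for original, modified in name_map.items():
        # Positive criteria are ORed together.
        positive = (any(s in modified for s in in_strings)
                    or any(s == modified for s in exact_strings))
        # A "not" criterion overrides any positive match.
        excluded = any(s in modified for s in not_in_strings)
        if positive and not excluded:
            matches.append(original)
    return matches

columns = [' ID ', 'Other ID', 'Other PubChem ID', 'Lipid Class']
# Keys keep the original spelling, values are normalized for matching.
name_map = {c: c.lower().strip() for c in columns}

result = sketch_dict_match(name_map,
                           in_strings=['other'],
                           exact_strings=['id'],
                           not_in_strings=['pubchem'])
print(result)
```

The returned names are the dictionary keys, so they can be used directly to index the original DataFrame even though matching was done on the lowercased, stripped values.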
- class mwtab.metadata_column_matching.ValueMatcher(values_type=None, values_regex=None, values_inverse_regex=None)[source]#
Used to find a mask for certain values in a column.
Mostly intended to be used through the ColumnFinder class. Created for the purpose of matching tabular column data based on regular expressions and type criteria.
- Parameters:
values_type (None | str) – A string whose only relevant values are ‘integer’, ‘numeric’, and ‘non-numeric’. ‘integer’ will only match values in a column that are integer numbers. ‘numeric’ will only match values that are numbers, including integers. ‘non-numeric’ will only match values that are not numbers. A non-numeric value may contain digits, but the digits cannot make up the whole value. For example, ‘123 id’ is considered non-numeric.
values_regex (None | str) – A regular expression to positively identify values in a column.
values_inverse_regex (None | str) – A regular expression to negatively identify values in a column. This is mutually exclusive with values_regex. If both are given, values_regex takes precedence and values_inverse_regex is ignored. values_type can be combined with either regex, and a value must satisfy both criteria to match overall.
Examples
Simple type example.
>>> vm = ValueMatcher(values_type = 'numeric')
>>> test = pandas.Series([1, '1', 'foo'])
>>> vm.series_match(test)
0     True
1     True
2    False
dtype: bool
This ValueMatcher is very simple and will only match numeric values. Note that numeric values in string form are also recognized as numeric. This is intentional.
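The three values_type categories can be approximated in plain Python. This is a rough sketch under the assumption that a value counts as numeric when its string form parses as a number; the helper names are hypothetical, and edge cases such as ‘nan’ or ‘1_000’ are ignored here and may be treated differently by mwtab.

```python
def is_numeric(value):
    # Numeric if the whole string form parses as a float.
    try:
        float(str(value).strip())
        return True
    except ValueError:
        return False

def is_integer(value):
    # Integer if the whole string form parses as an int.
    try:
        int(str(value).strip())
        return True
    except ValueError:
        return False

values = [1, '1', '2.5', 'foo', '123 id']

numeric_mask = [is_numeric(v) for v in values]
integer_mask = [is_integer(v) for v in values]
non_numeric_mask = [not is_numeric(v) for v in values]

print(numeric_mask)
print(integer_mask)
print(non_numeric_mask)
```

Note that ‘123 id’ lands in the non-numeric group even though it contains digits, matching the parameter description.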
Simple regex example.
>>> vm = ValueMatcher(values_regex = 'foo.*')
>>> test = pandas.Series(['foo', 'bar', 'foobar', 1])
>>> vm.series_match(test)
0     True
1    False
2     True
3    False
dtype: bool
Simple inverse regex example.
>>> vm = ValueMatcher(values_inverse_regex = 'foo.*')
>>> test = pandas.Series(['foo', 'bar', 'foobar', 1])
>>> vm.series_match(test)
0    False
1     True
2    False
3    False
dtype: bool
Note that 1 is False in both regex examples. In general this was designed with strings in mind, so it is recommended to convert all values to strings in any Series passed to series_match. It is also HIGHLY recommended to use the ‘string[pyarrow]’ dtype when using the regex attributes. This dtype uses much faster regular expression algorithms and can be orders of magnitude faster than Python’s built-in regular expressions. Some regular expression features, such as lookahead assertions, cannot be used with the ‘string[pyarrow]’ dtype though. More information can be found at https://pypi.org/project/re2/.
- Attributes:
values_type – The current type of the values being matched.
values_regex – The regular expression to positively identify values in a column.
values_inverse_regex – The regular expression to negatively identify (exclude) values in a column.
- series_match(series, na_values=None, match_na_values=True)[source]#
Return a mask for the series based on type and regex matching.
“values_regex” and “values_inverse_regex” are mutually exclusive, and “values_regex” will take precedence if both are given. “values_type” and one of the regex parameters can be used together, in which case the intermediate masks are ANDed together. “values_type” can only be “integer”, “numeric”, or “non-numeric” to match those types, respectively.
- Parameters:
- Returns:
A pandas Series the same length as “series” with Boolean values that can be used to select the matching values in the series.
- Return type:
Series
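How the intermediate masks described for series_match combine can be sketched with plain lists. sketch_mask is a hypothetical helper: it converts every value to a string first, uses re.match for the regex test (an assumption, the library may anchor differently), and leaves out the na_values handling entirely.

```python
import re

def _is_numeric(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def sketch_mask(values, values_type=None, values_regex=None,
                values_inverse_regex=None):
    texts = [str(v) for v in values]
    mask = [True] * len(texts)
    # Type mask (only two of the three types are sketched here).
    if values_type == 'numeric':
        mask = [m and _is_numeric(t) for m, t in zip(mask, texts)]
    elif values_type == 'non-numeric':
        mask = [m and not _is_numeric(t) for m, t in zip(mask, texts)]
    # Regex mask, ANDed with the type mask. values_regex wins if both
    # regexes are supplied; the inverse regex is then ignored.
    if values_regex is not None:
        mask = [m and re.match(values_regex, t) is not None
                for m, t in zip(mask, texts)]
    elif values_inverse_regex is not None:
        mask = [m and re.match(values_inverse_regex, t) is None
                for m, t in zip(mask, texts)]
    return mask

values = ['foo1', 'bar', '42', 'foobar']
combined = sketch_mask(values, values_type='non-numeric', values_regex='foo.*')
both_given = sketch_mask(values, values_regex='foo.*', values_inverse_regex='foo.*')
inverse_only = sketch_mask(values, values_inverse_regex='foo.*')
print(combined, both_given, inverse_only)
```

Because both_given equals the plain values_regex result, the sketch shows the precedence rule: the inverse regex has no effect when values_regex is present.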
- class mwtab.metadata_column_matching.ColumnFinder(standard_name, name_matcher, value_matcher)[source]#
Used to find columns in a DataFrame that match a NameMatcher and values in the column that match a ValueMatcher.
This is pretty much just a convenient way to keep the standard_name, NameMatcher, and ValueMatcher together in a single object. Convenience methods to utilize the NameMatcher and ValueMatcher are provided as name_dict_match and values_series_match, respectively.
- Parameters:
standard_name (str) – A string to give a standard name to the column you are trying to find. Not used by any methods.
name_matcher (NameMatcher) – The NameMatcher object used to match column names.
value_matcher (ValueMatcher) – The ValueMatcher object used to match column values.
Examples
Basic usage.
>>> df = pandas.DataFrame({'foo':[1, 2, 'asdf'], 'bar':[1, 2, 3]})
>>> df
    foo  bar
0     1    1
1     2    2
2  asdf    3
>>> column_finder = ColumnFinder('FOO', NameMatcher(exact_strings = ['foo']), ValueMatcher(values_type = 'numeric'))
>>> modified_columns = {column_name: column_name.lower().strip() for column_name in df.columns}
>>> matching_columns = column_finder.name_dict_match(modified_columns)
>>> matched_column_name = matching_columns[0]
>>> matched_column_name
'foo'
>>> matching_values = column_finder.values_series_match(df.loc[:, matched_column_name])
>>> matching_values
0     True
1     True
2    False
dtype: bool
- Attributes:
standard_name – The standard name of the column being sought.
name_matcher – The NameMatcher object used to match column names.
value_matcher – The ValueMatcher object used to match column values.
- mwtab.metadata_column_matching.make_list_regex(element_regex, delimiter, quoted_elements=False, empty_string=False)[source]#
Creates a regular expression that will match a list of element_regex delimited by delimiter.
Note that delimiter can be a regular expression like (,|;) to match 2 different types of delimiters. If quoted_elements is True, then allow element_regex to be surrounded by single or double quotes. Note that this allows mixed elements, so quoted and unquoted elements are both allowed in the same list. If empty_string is True, then the list regex will match a single element_regex and the empty string. empty_string = True will actually match anything, but the length of the match for strings that are not appropriate will be 0. So this parameter could be useful in some edge case scenarios, but you must investigate the specific match more closely. If the match is the empty string, but the given string is not itself the empty string, then it is not really a match.
- Parameters:
element_regex (str) – A regular expression in the form of a string that matches the elements of the list to match.
delimiter (str) – The character(s) that separate list elements.
quoted_elements (bool) – If True, list elements can be surrounded by single or double quotes.
empty_string (bool) – If True, then allow the returned regular expression to match the empty string.
- Returns:
A regular expression in str form that will match a list of element_regexes delimited by delimiter.
- Return type:
str
Examples
Regular expression to match a list of 4 digit numbers.
>>> regex = make_list_regex(r'\d\d\d\d', r',')
>>> regex
'((\\d\\d\\d\\d\\s*,\\s*)+(\\d\\d\\d\\d\\s*|\\s*))'
>>> bool(re.match(regex, '1234'))
False
>>> bool(re.match(regex, '1234, 5678'))
True
>>> bool(re.match(regex, ''))
False
Allow the empty string.
>>> regex = make_list_regex(r'\d\d\d\d', r',', empty_string = True)
>>> regex
'((\\d\\d\\d\\d\\s*,\\s*)*(\\d\\d\\d\\d\\s*|\\s*))'
>>> bool(re.match(regex, '1234'))
True
>>> bool(re.match(regex, '1234, 5678'))
True
>>> bool(re.match(regex, ''))
True
>>> bool(re.match(regex, 'asdf'))
True
>>> re.match(regex, 'asdf')
<re.Match object; span=(0, 0), match=''>
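Since the empty_string form matches any input with a zero-length match, a caller has to inspect the match itself. The following is a minimal sketch of that check using only the re module and the regular expression string shown in the example above; is_real_match is a hypothetical helper, not part of mwtab.

```python
import re

# The regex string produced in the empty_string example above.
regex = r'((\d\d\d\d\s*,\s*)*(\d\d\d\d\s*|\s*))'

def is_real_match(text):
    match = re.match(regex, text)
    # An empty match only counts when the input itself is empty.
    return match is not None and (match.group(0) != '' or text == '')

checks = {text: is_real_match(text) for text in ['1234', '1234, 5678', '', 'asdf']}
print(checks)

# re.fullmatch is a stricter alternative: it also rejects trailing junk
# that a prefix match with re.match would let through.
prefix_ok = is_real_match('1234, abcd')
full_ok = re.fullmatch(regex, '1234, abcd') is not None
print(prefix_ok, full_ok)
```

Which check is appropriate depends on whether partial (prefix) matches are acceptable for the column being examined.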
Allow numbers to be surrounded with quotation marks.
>>> regex = make_list_regex(r'\d\d\d\d', r',', quoted_elements = True)
>>> regex
'(((\\d\\d\\d\\d\\s*,\\s*)+(\\d\\d\\d\\d\\s*|\\s*))|((\'\\d\\d\\d\\d\'\\s*,\\s*)+(\'\\d\\d\\d\\d\'\\s*|\\s*))|(("\\d\\d\\d\\d"\\s*,\\s*)+("\\d\\d\\d\\d"\\s*|\\s*)))'
>>> bool(re.match(regex, '1234'))
False
>>> bool(re.match(regex, '1234, 5678'))
True
>>> bool(re.match(regex, '1234, "5678"'))
True
>>> bool(re.match(regex, '\'1234\', "5678"'))
True
>>> bool(re.match(regex, ''))
False
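The structure of the returned strings can be reproduced with a short sketch. sketch_list_regex is a hypothetical reconstruction inferred only from the three outputs shown in the examples above; how the real function combines quoted_elements with empty_string is not shown there, so that combination is an assumption here.

```python
import re

def sketch_list_regex(element_regex, delimiter,
                      quoted_elements=False, empty_string=False):
    # "+" requires at least one delimited element; "*" also allows none.
    quantifier = '*' if empty_string else '+'

    def one_list(elem):
        return rf'(({elem}\s*{delimiter}\s*){quantifier}({elem}\s*|\s*))'

    if not quoted_elements:
        return one_list(element_regex)
    # One alternative each for unquoted, single-quoted, double-quoted.
    variants = [element_regex, f"'{element_regex}'", f'"{element_regex}"']
    return '(' + '|'.join(one_list(v) for v in variants) + ')'

plain = sketch_list_regex(r'\d\d\d\d', r',')
with_empty = sketch_list_regex(r'\d\d\d\d', r',', empty_string=True)
quoted = sketch_list_regex(r'\d\d\d\d', r',', quoted_elements=True)

print(plain)
print(with_empty)
print(quoted)
print(bool(re.match(plain, '1234, 5678')))
```

Seen this way, the only difference between the default and empty_string forms is the quantifier on the delimited group, which is why a single element without a trailing delimiter does not match in the default form.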