Metadata Column Matching#

The mwTab format gives a lot of freedom to uploaders in what kind of metadata associated with the assigned metabolites can be given. While this has its merits, a big drawback is the amount of different names given for the same column of information. For example, using both ‘m/z’ and ‘moverz’. Another drawback is the variety of value formats for some columns. For example, a column containing HMDB IDs might leave the ‘HMDB’ prefix off of the beginning of the ID, or the ‘HMDB’ prefix might be lower case. To address these issues a significant amount of work was done to data mine the most used and most useful columns and using this information code was created to be able to identify the columns under their various names as well as evaluate the format of their values.

This code is utilized in the validation to inform users about the state of their metadata columns in the METABOLITES section. In order to do this we chose standardized column names and value formats based on the information obtained during data mining, so there were some executive decisions we made in deciding what the standard names and values are or should be. We did our best to base them on the data already in the Metabolomics Workbench.

We did our best to test this code on the data in the Metabolomics Workbench, but it is not perfect. Due to the nature of regular expressions and trying to match a variety of names and values, inevitably there are going to be false positives and negatives. You may see a validation warning about a column name or value in error. If you do find incorrect warnings or issues that are not being warned about in the validation, please visit our GitHub Issues and create a new issue for it.

Reusing Match Code#

Although the mwtab package is largely meant to be used through its CLI, the code created for column name and value matching could have many more use cases outside of this package. This section will explain the different parts of the code and how they might be used outside of this package.

The 3 significant parts are: 1. ColumnFinder class 2. Regular expressions and associated functions 3. Dictionary of ColumnFinders for specific mwTab columns

ColumnFinder Class#

Column matching has 2 components, matching column names and matching column values, therefore, a ColumnFinder has a NameMatcher and ValueMatcher to handle those functions. All NameMatcher attributes are lists of strings or lists of lists of strings and all are used in its only method, dict_match. All ValueMatcher attributes are strings and all are used in its only method, series_match. ColumnFinder takes both a NameMatcher and ValueMatcher, along with a standard_name attribute. The “standard_name” attribute is not used by any method and is simply there to tie the instance of the class to a name. The Metadata Column Matching page for metadata_column_matching.py contains more detailed information and examples on each class.

Regular Expressions#

In order to create some of the complex regular expressions needed to validate column names and values, a modular approach to constructing them was taken. This means the smaller building block regular expressions could have use outside of this particular one. There is also a function used heavily in creating these regular expressions, make_list_regex. The Metadata Column Matching page for metadata_column_matching.py contains more detailed information about using this function. There are too many regular expressions to explain each one, so the source code for creating them is included below. The names are largely self explanatory, and being able to see the regular expression for each name can help clear up confusion.

INTEGER = r'-?\d+'
FLOAT = r'-?\d*\.\d+'
SCIENTIFIC_NOTATION = r'-?\d*\.\d+E(-|\+)?\d+'
NUMS = '(' + FLOAT + '|' + SCIENTIFIC_NOTATION + '|' + INTEGER + ')'
NUM_RANGE = NUMS + r'(-|\sto\s|−)' + NUMS
LIST_OF_NUMS = make_list_regex(NUMS, ',')
BRACKETED_LIST_OF_NUMS = r'\[' + LIST_OF_NUMS + r'\]'
PARENTHESIZED_LIST_OF_NUMS = r'\(' + LIST_OF_NUMS + r'\)'
LIST_OF_NUMS_UNDERSCORE = make_list_regex(NUMS, '_')
LIST_OF_NUMS_SLASH = make_list_regex(NUMS, '/')
POSITIVE_NUMS = NUMS.replace('-?', '')
POSITIVE_INTS = r'\d+'
LIST_OF_POS_INTS = make_list_regex(POSITIVE_INTS, ',')
LIST_OF_POS_INTS_OR = make_list_regex(POSITIVE_INTS, 'or')
LIST_OF_POS_INTS_BAR = make_list_regex(POSITIVE_INTS, r'\|')
LIST_OF_POS_INTS_SLASH = make_list_regex(POSITIVE_INTS, '/')
LIST_OF_POS_INTS_SEMICOLON = make_list_regex(POSITIVE_INTS, ';')
LIST_OF_POS_INTS_SPACE = make_list_regex(POSITIVE_INTS, ' ')

POSITIVE_FLOATS = r'\d*.\d+'
POSITIVE_SCIENTIFIC_NOTATION = r'\d*\.\d*E(-|\+)?\d+'
POSITIVE_FLOAT_RANGE = POSITIVE_FLOATS + r'\s*(_|-)\s*' + POSITIVE_FLOATS
LIST_OF_POS_FLOATS_UNDERSCORE = make_list_regex(POSITIVE_FLOATS, '_')
POS_FLOAT_PAIRS = '(' +  POSITIVE_FLOATS + r'(_|@)' + POSITIVE_FLOATS + ')'
LIST_OF_POS_FLOAT_PAIRS_UNDERSCORE = make_list_regex(POS_FLOAT_PAIRS, '_')
LIST_OF_POS_FLOAT_PAIRS_NO_SPACE = make_list_regex(POS_FLOAT_PAIRS, '')
LIST_OF_POS_FLOAT_PAIRS_MIXED = make_list_regex(POS_FLOAT_PAIRS, '(//|,)')
POS_INT_FLOAT_PAIR = '(' +  POSITIVE_INTS + r'_' + POSITIVE_FLOATS + ')'

ELEMENT_SYMBOL = (r'([BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|'
                  r'C[adeflos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|'
                  r'N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb)')
ELEMENT_COUNT = r'([1-9]\d*)*'
FORMULA_ELEMENT = ELEMENT_SYMBOL + ELEMENT_COUNT
FORMULA = '(' + FORMULA_ELEMENT + ')+'
LIST_OF_FORMULAS = make_list_regex(FORMULA, ',', True)
BRACKETED_LIST_OF_FORMULAS = r'\[' + LIST_OF_FORMULAS + r'\]'
# C12H22O11, C12 H22 O11, [13C]4H7NO4, [13]C6 H14 [15]N4 O2, [C13]C4H6O5, C21C(13)2H38N7O17P3S, 12C12+14N4+16O19+1H32
ISOTOPIC_NUM = r'\d+'
# Add dueterium.
ISOTOPIC_SYMBOL = ELEMENT_SYMBOL[0:-1] + r'|D)'
ISOTOPIC_ELEMENT = ISOTOPIC_SYMBOL + ELEMENT_COUNT
ISOTOPIC_FORMULA = '(' + ISOTOPIC_ELEMENT + '|' + \
                   r'\[' + ISOTOPIC_NUM + ISOTOPIC_SYMBOL + r'\]' + ELEMENT_COUNT + '|' + \
                   r'\[' + ISOTOPIC_NUM + r'\]' + ISOTOPIC_ELEMENT + '|' + \
                   r'\[' + ISOTOPIC_SYMBOL + ISOTOPIC_NUM + r'\]' + ELEMENT_COUNT + '|' + \
                   ISOTOPIC_SYMBOL + r'\(' + ISOTOPIC_NUM + r'\)' + ELEMENT_COUNT + '|' + \
                   make_list_regex(ISOTOPIC_NUM + ISOTOPIC_SYMBOL + ELEMENT_COUNT, r'\+') + ')+'
LIST_OF_ISOTOPIC_FORMULAS = make_list_regex(ISOTOPIC_FORMULA, ',', True)
BRACKETED_LIST_OF_ISOTOPIC_FORMULAS = r'\[' + LIST_OF_ISOTOPIC_FORMULAS + r'\]'
# C12H22O11+, [C12H22O11]+
CHARGE_FORMULA = r'(\[' + FORMULA + r'\](-|\+)' + '|' + FORMULA + r'(-|\+)' + ')'
# CH3(CH2)16COOH
GROUP_FORMULA = '(' + FORMULA_ELEMENT + '|' + r'\(' + FORMULA + r'\)\d+' + r')+'

ORGANIC_ELEMENT_SYMBOL = r'([CHNOPS])'
ORGANIC_FORMULA_ELEMENT = ORGANIC_ELEMENT_SYMBOL + ELEMENT_COUNT
ORGANIC_FORMULA = '(' + ORGANIC_FORMULA_ELEMENT + '){4,}'

SMILES_ELEMENT_SYMBOL = r'\d?[a-zA-Z][a-z]?'
# Note the , , , and  symbols at the end of the character set are '\x01', 'x02', '\x03', and '\x04'. 
# I don't think they are apart of the SMILES characters, but they appeared in the SMILES of some datasets.
# There are some SMILES in AN003143 like 'C[n]1c[n]cc1C[C@@H](NC(=O)CCN)C(O)=O |&1:7|'. 
# I don't think the end bit between '|' is part of a legit SMILES, but I added it to pass this dataset.
SMILES = r'(\[?' + SMILES_ELEMENT_SYMBOL + r'[0-9@+\-[\]()\\/%=#$.*]*)+' + r'( \|[0-9:&,w]+\|)?'
LIST_OF_SMILES_SEMICOLON = make_list_regex(SMILES, ';')

INCHIKEY = r'(InChIKey=)?[a-zA-Z]{14}-[a-zA-Z]{10}-[a-zA-Z]?'
INCHIKEY_OR_NULL = '(' + INCHIKEY + r'|null|No record)'
LIST_OF_INCHIKEYS = '(' + INCHIKEY + r'\s*,\s*)+' + '(' + INCHIKEY + r'\s*|\s*)'
LIST_OF_INCHIKEYS_SLASH = LIST_OF_INCHIKEYS.replace(',', '/')

# InChI=1 or InChI=1 where "1" is a version number and "S" means it's a standard InChI.
INCHI_HEAD = r'InChI='
INCHI_VERSION = r'\d+S?'
INCHI_FORMULA = r'/' + FORMULA
INCHI_SKELETAL_LAYER = r'/c(\d+([-,()])?)+'
INCHI_HYDROGEN_LAYER = r'/h' + '(' + make_list_regex(r'\d+(-\d+)?H?\d*', ',') + r')?' + r'(\(H\d*-?(,\d+)+\))*'
INCHI_CHARGE_LAYER = r'/q(-|\+)\d+'
INCHI_PROTONATION_LAYER = r'/p(-|\+)\d+'
INCHI_STEREOCHEMISTRY_LAYER = r'/t' + make_list_regex(r'(\d+(-|\+|\?|u)|M)', ',')
INCHI_STEREOCHEMISTRY_SUBLAYER1 = r'/m\d+'
INCHI_STEREOCHEMISTRY_SUBLAYER2 = r'/s\d+'
INCHI_FIXED_HYDROGEN_LAYER = r'/f(' + FORMULA + r')?' + \
                             '(' + INCHI_HYDROGEN_LAYER + r')?' + \
                             '(' + INCHI_STEREOCHEMISTRY_SUBLAYER2 + r')?' + \
                             '(' + INCHI_CHARGE_LAYER + r')?'
INCHI_DOUBLEBOND_LAYER = r'/b' + make_list_regex(r'\d+(-|\+)\d+(-|\+)', ',')
INCHI_ISOTOPIC_LAYER = r'/i' + make_list_regex(r'(\d+(-|\+)\d+|\d+[A-Z]\d*|\d+(-|\+)\d+[A-Z]?\d*)', ',') + '(' + INCHI_STEREOCHEMISTRY_SUBLAYER2 + r')?'

FULL_INCHI = '(' + INCHI_HEAD + r')?' + \
             INCHI_VERSION + \
             '(' + INCHI_FORMULA + r')?' + \
             '(' + INCHI_SKELETAL_LAYER + r')?' + \
             '(' + INCHI_HYDROGEN_LAYER + r')?' + \
             '(' + INCHI_CHARGE_LAYER + r')?' + \
             '(' + INCHI_PROTONATION_LAYER + r')?' + \
             '(' + INCHI_DOUBLEBOND_LAYER + r')?' + \
             '(' + INCHI_STEREOCHEMISTRY_LAYER + r')?' + \
             '(' + INCHI_STEREOCHEMISTRY_SUBLAYER1 + r')?' + \
             '(' + INCHI_STEREOCHEMISTRY_SUBLAYER2 + r')?' + \
             '(' + INCHI_ISOTOPIC_LAYER + r')?' + \
             '(' + INCHI_FIXED_HYDROGEN_LAYER + r')?'

CHEAP_INCHI = r'\s*' + '(' + INCHI_HEAD + r')?' + INCHI_VERSION + r'((/c|/h|/i|/t|/m|/s|/q|/f|/p|/b|/[A-Z])(\S)+)+' + r'\s*'
LIST_OF_INCHI = make_list_regex(CHEAP_INCHI, ',', True)
BRACKETED_LIST_OF_INCHI = r'\[' + LIST_OF_INCHI + r'\]'

# M+H, [M+H]+, -H(-), M+AGN+H, M+Acid, (M-H)/2, [M+NH4] +_[M+Na]+, [M+H–H2O]+, [M-2H](2-)
# Cat, Hac, and Chol-head are wierd special cases.
ION_ELEMENTS = r'(\d?(m|M)' + '|' + \
               r'(-|\+)?\d*' + FORMULA + r'\d*(\((-|\+)\))?' +  '|' + \
               r'[ [a-zA-Z]*(A|a)cid' + '|' + \
               r'Cat' + '|' + r'Chol-head' + '|' + r'Hac' + '|' + r'\di' + '|' + r'FA|NA|A' + ')'
ION = '(' + make_list_regex(ION_ELEMENTS, r'(-|\+)?') + ')'
ION = '(' + ION + '|' + \
      r'\[' + ION + r'\]\(?\d?\s?(-|\+)?\d?\)?' + '|' + \
      ION + r'\](-|\+)?' + '|' + \
      r'\(' + ION + r'\)((/\d)|(-|\+))?' + ')'
LIST_OF_IONS = make_list_regex(ION, ',')
LIST_OF_IONS_SPACE = make_list_regex(ION, ' ')
LIST_OF_IONS_UNDERSCORE = make_list_regex(ION, '_')
LIST_OF_IONS_MIXED = make_list_regex(ION, '(_| )')
LIST_OF_IONS_NO_DELIMITER = make_list_regex(ION, '')
BRACKETED_LIST_OF_IONS = r'\[' + make_list_regex(ION, ',', True) + r'\]'

# There are values like CA1511 that I can't confirm are KEGG values. You can't find them in the compound database, but they appear often.
# The prefixes CE, UP, Z, and U are the same. There is a common mistake of only having 4 numbers after C, so that is accounted for as well.
KEGG = '(' + r'(cpd:)?[CDMGKRZU]0?\d{5}\?{0,2}' + '|' + r'(DG|ko)\d{5}' + '|' + r'(CA|CE|UP|C)\d{4}' + '|' + r'(NA|n/a)' + ')'
LIST_OF_KEGG = make_list_regex(KEGG, ',')
LIST_OF_KEGG_SEMICOLON = make_list_regex(KEGG, ';')
LIST_OF_KEGG_SLASH = make_list_regex(KEGG, '/')
LIST_OF_KEGG_SPACE = make_list_regex(KEGG, ' ')
LIST_OF_KEGG_DOUBLE_SLASH = make_list_regex(KEGG, '//')
LIST_OF_KEGG_UNDERSCORE = make_list_regex(KEGG, '_')
LIST_OF_KEGG_HYPHEN = make_list_regex(KEGG, '-')
LIST_OF_KEGG_MIXED = make_list_regex(KEGG, '(/|,)')
LIST_OF_KEGG_BAR = make_list_regex(KEGG, r'(\|)')

HMDB = '(' + r'(HMDB|HDMB|YMDB|HMBD)\d+(\*|\?)?' + '|' + r'n/a' + ')'
LIST_OF_HMDB = make_list_regex(HMDB, ',')
LIST_OF_HMDB_SLASH = make_list_regex(HMDB, '/')
LIST_OF_HMDB_AMPERSAND = make_list_regex(HMDB, '&')
LIST_OF_HMDB_SEMICOLON = make_list_regex(HMDB, ';')
LIST_OF_HMDB_SPACE = make_list_regex(HMDB, ' ')
LIST_OF_HMDB_UNDERSCORE = make_list_regex(HMDB, '_')
HMDB_INT = r'\d{,5}'
LIST_OF_HMDB_INTS = make_list_regex(HMDB_INT, ',')
LIST_OF_HMDB_INTS_SLASH = make_list_regex(HMDB_INT, '/')

LIPID_MAPS = '(' + r'LM(PK|ST|GL|FA|SP|GP|PR|SL)[0-9A-Z]{8,10}\*?' + '|' + r'(ST|FA|PR|GP|PK|GL|SP)\d{4,6}-' + FORMULA + ')'
LIST_OF_LMP = make_list_regex(LIPID_MAPS, ',')
LIST_OF_LMP_UNDERSCORE = make_list_regex(LIPID_MAPS, '_')
LIST_OF_LMP_SLASH = make_list_regex(LIPID_MAPS, '/')

# When using Excel, a CAS number can get mistaken for a date and it will automatically change the value.
DATE = r'\d{1,2}/\d{1,2}/(\d{4}|\d{2})'
CAS = r'(CAS: ?)?\d+-\d\d-0?\d' + '|' + DATE
LIST_OF_CAS = make_list_regex(CAS, ',')
LIST_OF_CAS_SEMICOLON = make_list_regex(CAS, ';')

column_finders Dictionary#

The column_finders dictionary is the culmination of the ColumnFinder class and regular expressions for the purpose of finding and evaluating/validating columns in Metabolomics Workbench datasets. The keys are standardized column names, and the values are a ColumnFinder class to find and validate that column. The entire library of datasets in the Metbolomics Workbench was used to determine the most relevant and most abundant columns that should go into the column_finders dictionary. The column_finders dictionary is likely to be useful for other similar metabolomics data, but may vary outside of that usecase. Some columns, such as database ID columns, like PubChem or KEGG, are likely to be more widely useable, but any user would have to test and make that determination for themselves. The list of standardized column names available in the dictionary is given below.

[
  "moverz_quant",
  "mass",
  "parent_moverz_quant",
  "mass_spectrum",
  "composite_mass_spectrum",
  "inchi_key",
  "inchi",
  "smiles",
  "formula",
  "compound",
  "name",
  "refmet",
  "class",
  "pathway",
  "pathway_sortorder",
  "ion",
  "adduct",
  "species",
  "pubchem_id",
  "kegg_id",
  "hmdb_id",
  "lm_id",
  "chemspider_id",
  "metlin_id",
  "cas_number",
  "binbase_id",
  "chebi_id",
  "mw_regno",
  "mzcloud_id",
  "identifier",
  "other_id",
  "other_id_type",
  "retention_time",
  "delta_rt",
  "retention_index",
  "retention_index_type",
  "abbreviation",
  "assignment_certainty",
  "comment",
  "assignment_method",
  "isotopologue",
  "isotopologue_type",
  "peak_description",
  "peak_pattern",
  "transient_peak",
  "transient_peak_type",
  "fish_coverage",
  "msi_category",
  "annotations",
  "istd",
  "platform",
  "ms_method",
  "polarity",
  "esi_mode",
  "ionization_mode",
  "frequency"
]