Tagging

Terminology

  • header - a descriptive label at the top of a column in a table of data.

    • Example: SampleID

  • tag - used to describe fields in a record of data. They are a semantic descriptive header used to parse a column (field) of data from a table.

    • Example: #measurement.compound

  • table - in this section, table can refer to either a table in a data file or a SQL style table in the output. Often there are many small tables in a data file that are combined into a larger one in the extract output.

  • directives - directions given to MESSES commands using tagged files or their JSONized forms. A table and its tags taken as a whole.

Introduction

In order to extract data from arbitrary tables programmatically, some kind of system has to be devised. This could be something as simple as requiring a table be on the very first row and for the starting row to have column names for every column, but a system such as that would then be fragile. We decided to create a more robust system that could handle more complicated data arrangements and reduce the verbosity to a minimum. The system that was devised was an extra layer of tags added on top of existing tables that tell the extract command how to transform the tabular data into key-based record data similar to JSON or SQL databases. The specific table schema this system was designed for is covered in the Experiment Description Specification section, but it is general enough that it can be used with most table schema.

This initial system served its function well, but it became clear that more functionality was sorely needed. Namely, both a way to add tags programmatically to data and a way to modify record values was needed, so the system was expanded to provide facilities to do both. Ultimately there are 3 parts to the tagging system that are distinct from one another but have similar syntax and ideas. The “export” part is the system used to add directly on top of existing tabular data. It is the base system that must be used for the extraction to work at all. The “automation” part that is used to automate adding “export” tags to tabular data. Based on the header values in your data, you can use “automation” tags to add the “export” tags automatically. A good use case for this is when you have data output by a program in a consistent way. Instead of manually adding export tags to the program output each time, you can create an “automation” page that will add the “export” tags for you. The last “modification” part is the system to modify record values. It can be used to prepend, append, delete, overwrite, or regex substitute values. An example use-case would be to update old naming conventions. Validly tagged files in their tabular or JSON form can be referred to as directives as they direct the actions of the program. To reduce confusion between tags and directives “tags” should generally refer to the extra text added above tables, while “directives” are the tables and tags taken as a whole. Each row of a tagged table is an individual directive.

Each part of the tagging system has to be in their own sheet or file for the extract command. By default, export tags are expected to be in a sheet named ‘#export’, if given an Excel file without specifying a sheet name. If given a CSV file, then this file is expected to have export tags. Modification tags are expected to be in a sheet named ‘#modify’ by default, but can be specified using the –modify option. The option can be used to specify either a different sheet name in the given Excel file, a different Excel file, a different Excel file with a different sheet name, a Google Sheets file, a Google Sheets file with a different sheet name, a JSON file, or a CSV file. Automation tags are similarly specifiedusing the –automate option or otherwise expected in a sheet named ‘#automate’ by default.

Each part of the tagging system are explained below with examples. Examples using them with the extract command are in the Tutorial section and there are full run examples in the “examples” folder of the GitHub repository.

Export Tags

  • Tag rows are identified by putting ‘#tags’ in the left most column of the data file.

    • Typically, there will be nothing underneath of ‘#tags’ except ‘#ignore’ for rows to ignore.

    • You may need to add a blank column to the left of your data to accommodate this.

    • ‘#tags’ always needs to be on the very left most column, even if your table has several blank columns before it starts.

  • Cells on this row will have tags over the columns to be exported.

  • Subsequent non-blank rows are interpreted as data unless the tag ‘#ignore’ is used in the left most column.

Basic Example:

#tags

#sample.id

#.weight;#%units=”g”

#.protocol.id

#ignore

Sample ID

Weight

Treatment

Keyword Tags

  • #tags - identifies tag header rows. Must be in the first column of a row.

    • Additional tag-value pairs in a cell with #tags are applied to all child records in this section, but are ignored for non-child records.

  • #ignore - ignore this row. Must be put in the first column of a row.

    • Use this to ignore additional header explanation rows and unused data records within a block of records.

  • #table - change the working table to the indicated table name.

Column Field Format

  • Tags begin with a ‘#’ symbol and identify a specific relational database table, field, and optionally a field-attribute.

    • Format: #table_name.field_name

      • Example: #sample.dry_weight

        • sample - the sample data relational table (workbench worksheet)

        • dry_weight - the field name.

    • Format: #table

      • Example: #sample

        • sample - the sample data relational table (workbench worksheet)

    • Format: #.field_name

      • Example: #.dry_weight

        • dry_weight - the field name, for the table last identified.

    • Format: #table_name.field_name%attribute_name

      • Example: #sample.dry_weight%units

        • sample - the sample data relational table (workbench worksheet)

        • dry_weight - the field name.

        • units - units attribute for the field.

    • Format: #.field_name%attribute_name

      • Example: #.dry_weight%units

        • dry_weight - the field name, for the last table identified.

        • units - units attribute for the field.

    • Format: #%attribute_name

      • Example: #%units

        • units - units attribute for a field previously identified in a table previously identified.

Value Field Format

  • Values for a tag field can be specified directly using an equal sign “=”.

    • Format: #table_name.field_name=field_value

      • Example: #study.title=”labeled mouse study 278”

        • study - the study table.

        • “labeled mouse study 278” - value for the study title field.

    • Note that there is a special case for “id” fields. “id” fields cannot be set with an “=”.

    • Format: #table_name.field_name%attribute_name=attribute_value

      • Example: #sample.dry_weight%units=g

        • sample - sample table

        • dry_weight - dry_weight field

        • units - attribute for dry weight with a value of “g”.

  • Multiple tags can be specified in a single cell using a semicolon “;”.

    • However, only one tag can be left without a direct value for unambiguous interpretation of a column.

    • Format: #table_name.field_name;#table_name.field_name=field_value;…

      • Example: #sample.dry_weight;#.dry_weight%units=mg

        • sample - the sample data table.

        • dry_weight - field for the column interpretation

        • units - field with a direct value of “mg”.

  • Values can be combined into a single value using a plus sign “+”.

    • Format: #table_name.field_name;#table_name.field_name=field_value+field_value

      • Example: #study.title=”labeled mouse study 278 “+#.type

        • study - the study table.

        • “labeled mouse study 278 ” - part of value for the study title field.

        • #.type - part of value of the study title field taken from the study type field.

ID Field Format

  • ID tags indicate an identifier for a record in a table.

  • The ID must uniquely identify a record.

  • There must be an ID tag in every tag row.

  • Using the same ID in multiple tag rows will add to the existing record.

    • Fields with the same name for the same ID across multiple tag rows will automatically become list fields, and new values will be added to the list.

  • Records can have record IDs from other tables or the same table as fields.

    • Format: #table.id

      • Example: #sample.id

        • sample - type of id tag. This often is a table name.

        • id - indicates that this is an id tag.

    • Format: #.id

      • Short format can be used when the table is already specified.

List Field Format

  • List field tags begin with ‘*#’ (asterisk followed by the pound sign) and identify a specific relational database table, field, and optionally a field-attribute that has multiple values.

  • They have the same format as normal column field tags.

    • Format: *#table_name.field_name

      • Example: *#sample.dry_weight

        • sample - the sample data relational table (workbench worksheet)

        • dry_weight - the field name.

  • Individual values are separated by commas “,” both in the column cells or in the column tag value.

    • Format: *#table_name.field_name=field_value,field_value,…

      • Example: *#study.labeling=13C,15N

        • study - the study table.

        • 13C,15N - two values for the study labeling field.

  • List field tags can be listed multiple times in a record, with each value(s) appended.

Child Tag Format

  • Child record tags provides a mechanism for indicating parent-child relationships between records in the same table or between tables.

    • The child tag indicates the creation of a new record.

    • Subsequent normal tags identify fields in the new child record.

    • A special parentID field is added using the first ID tag indicated in the header row.

    • Format: #%child.id=id_sub_string

      • The value for id_sub_string will be appended to the ID of the child’s parent (parentID) to create the child ID.

Example:

#tags

#sample.id

#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg

#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg

KO labelled_1

4.2

8.5

KO labelled_2

4.7

9.7

Output JSON:

{
  "sample": {
    "KO labelled_1": {
      "id": "KO labelled_1"
    },
    "KO labelled_1-media-0h": {
      "dry_weight": "4.2",
      "dry_weight%units": "mg",
      "id": "KO labelled_1-media-0h",
      "parentID": "KO labelled_1"
    },
    "KO labelled_1-media-3h": {
      "dry_weight": "8.5",
      "dry_weight%units": "mg",
      "id": "KO labelled_1-media-3h",
      "parentID": "KO labelled_1"
    },
    "KO labelled_2": {
      "id": "KO labelled_2"
    },
    "KO labelled_2-media-0h": {
      "dry_weight": "4.7",
      "dry_weight%units": "mg",
      "id": "KO labelled_2-media-0h",
      "parentID": "KO labelled_2"
    },
    "KO labelled_2-media-3h": {
      "dry_weight": "9.7",
      "dry_weight%units": "mg",
      "id": "KO labelled_2-media-3h",
      "parentID": "KO labelled_2"
    }
  }
}

Field Tracking Tags

  • Field tracking tags provide a mechanism for copying the latest field value from one table into the records of another.

    • The tag indicates which table’s records to add to and which field to track from another table.

    • The latest value for the field seen while parsing will be added.

    • Useful for adding project and study ids to records in a document with multiple projects or studies.

    • Format: #table%track=table.field

    • Example: #sample%track=project.id

      • will add the project.id field to every sample record.

    • A list format can also be used:

      • #table%track=table.field1,table.field2,…

    • Example: #sample%track=project.id,study.id

    • Fields can also be untracked after tracking to stop adding the field to records.

    • Format: #table%untrack=table.field

    • The list format also works for untrack.

    • If a tracked field is specifically given in a table the given value is used over the tracked value.

Example:

#tags

#sample%track=project.id

#tags

#project.id

Project 1

#tags

#sample.id

01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1

02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2

Output JSON:

{
  "project": {
    "Project 1": {
      "id": "Project 1"
    }
  },
  "sample": {
    "01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1": {
      "id": "01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1",
      "project.id": "Project 1",
    },
    "02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2": {
      "id": "02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2",
      "project.id": "Project 1",
    }
  }
}

Modification Tags

Similar to export tags, the modification tag rows are indicated by #tags in the left most column, and #ignore can be used to ignore rows. The general idea behind the modification system is that you first use tags to indicate a field in a table to match to. Then underneath that tag indicate the value in that field to match to. Then another tag in the same row will indicate both what field to modify in the record that has the matching field and what modification to do. Underneath that tag will have the value to do the modification with.

Basic Examples:

#tags

#measurement.compound.value

#measurement.compound.assign

#match=all

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid

This example replaces the “compound” field with value “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd” in “measurement” table records with “(S)-2-Acetolactate Glutaric acid Methylsuccinic acid”.

#tags

#measurement.compound.value

#measurement.id.regex

#match=all

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

r’(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd’,r’(S)-2-Acetolactate Glutaric acid Methylsuccinic acid’

This example does a regex substitution on the “id” field of records in the “measurement” table if their “compound” field matches “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd”. In the “id” field “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd” is substituted with “(S)-2-Acetolactate Glutaric acid Methylsuccinic acid”.

Value Tag

  • All modification tag rows must start with a value tag after #tags.

  • This tag indicates which table and field to compare with.

  • The value underneath the tag will be compared with the value in the indicated field for all of the records in the indicated table to determine if a match is made.

    • Signature:

      • #[table_name].field_name[%attribute].value

Comparison Type

The comparison type is controlled through the #comparison tag. It can be used to specify the type for each row individually, or for the whole column by using the = sign. Ex. #comparison=exact

There are 4 comparison types, “exact”, “regex”, “levenshtein”, and “regex|exact”.

  • exact is a simple exact comparison between the comparison value and field value. The given comparison value must be exactly what is in the field value (i.e. a “==” comparison).

  • regex expects a regex to be in the comparison value and will print a message if it is not. The given regex will be delivered to re.search() for matching with field values.

  • levenshtein calculates the levenshtein distance between the comparison value and record field values and matches to the field values with the minimum distance.

    • This means it always matches to something even if the values seem wildly different, so be aware of possible unexpected results.

  • regex|exact is an intelligent combination of regex and exact. If a regex is specified for the comparison value it will be detected and the type will be regex for that comparison value only, otherwise the comparison type will be exact.

  • If a type is specified then that type is used regardless of the comparison value, so a regex string with an exact comparison type will try to match exactly.

  • If the comparison tag is not specified, then the type defaults to “regex|exact”.

  • exact type modifications are executed first, then regex type, and lastly levenshtein type. Within each type, modifications are executed in order from first to last.

Match Type

Match behavior can be altered further using the #match tag. It can be used to specify the type for each row individually, or for the whole column by using the = sign. Ex. #match=all

There are 4 match types, “first”, “first-nowarn”, “unique”, and “all”.

  • “first” - the modification is performed only for the first record matched, additional matches beyond the first will print a warning.

  • “first-nowarn” - the same as first, but won’t print warnings.

  • “unique” - the modification is only performed if 1 and only 1 record matched.

    • For levenshtein, this means that only 1 field value can have the minimum distance, if 2 values share the minimum distance then the action won’t take place.

  • “all” - the modification is done to every record that matches.

  • If the match tag is not specified, then the type defaults to “first”.

Modifications

There are 6 modifications that can be done, “assign”, “append”, “prepend”, “regex”, “delete”, and “rename”.

  • assign - will overwrite whatever value is in the field with the indicated assignment value.

    • Signatures:

      • #[table_name].field_name[%attribute].assign

      • *#[table_name].field_name[%attribute].assign

    • If the indicated assignment field does not exist in the record then it will be added to the record.

    • An eval function can be used in the form “eval(…)”.

      • “#field_name#” and “#r’…’#” can be used to construct the assignment value for the record.

      • All Python language operators can be used. But remember to use “float(#field_name#)” to convert strings to floating point numbers.

      • Example: eval(float(#intensity#) / float(#normalization#) * 5)

      • evals that return a list of strings will convert the field to a list field.

    • Add an asterisk, ‘*’, to the front of the tag to interpret the assignment value as a list and assign that list value to the field.

      • An eval function can be used, but it must return a list of strings.

    • The assign modification can be used to change list types to non list types and vice versa.

      • This can lead to an issue where some records have a list type for the field and some do not.

      • If that is not intended, then be sure to construct the assign tag such that it matches the type of the field.

      • For instance, make sure evals return a list if the field should be a list type.

    Example:

    #tags

    #measurement.compound.value

    #measurement.compound.assign

    (S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

    (S)-2-Acetolactate Glutaric acid Methylsuccinic acid

  • append - will append the indicated value to the value in the indicated field.

    • Signatures:

      • #[table_name].field_name[%attribute].append

      • *#[table_name].field_name[%attribute].append

    • If the indicated append field does not exist in the record, then it will be added to the record.

    • If the field value is a list and the append value is not a list, then the append value will be appended to each value in the list.

    • Add an asterisk, ‘*’, to the front of the tag to interpret the append value as a list.

      • When the append value is a list, the behavior is more complicated.

      • For each value in the field value list, the append value in the append list at the same index will be appended to the field value.

      • Examples:

        • field_value = [“a”, “b”] append_value = [“c”, “d”] result = [“ac”, “bd”]

        • field_value = [“a”, “b”] append_value = [“c”, “d”, “e”] result = [“ac”, “bd”]

        • field_value = [“a”, “b”, “e”] append_value = [“c”, “d”] result = [“ac”, “bd”, “e”]

    Example:

    #tags

    #measurement.compound.value

    #measurement.sample.id.append

    (S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

    (S)-2-Acetolactate Glutaric acid Methylsuccinic acid

  • prepend - will prepend the indicated value to the value in the indicated field.

    • Signatures:

      • #[table_name].field_name[%attribute].prepend

      • *#[table_name].field_name[%attribute].prepend

    • If the indicated prepend field does not exist in the record, then it will be added to the record.

    • If the field value is a list and the prepend value is not a list, then the prepend value will be prepended to each value in the list.

    • Add an asterisk, ‘*’, to the front of the tag to interpret the prepend value as a list.

      • When the prepend value is a list, the behavior is more complicated.

      • For each value in the field value list, the prepend value in the prepend list at the same index will be prepended to the field value.

      • Examples:

        • field_value = [“a”, “b”] prepend_value = [“c”, “d”] result = [“ca”, “db”]

        • field_value = [“a”, “b”] prepend_value = [“c”, “d”, “e”] result = [“ca”, “db”]

        • field_value = [“a”, “b”, “e”] prepend_value = [“c”, “d”] result = [“ca”, “db”, “e”]

    Example:

    #tags

    #measurement.compound.value

    #measurement.sample.id.prepend

    (S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

    (S)-2-Acetolactate Glutaric acid Methylsuccinic acid

  • regex - will do a regex substitution on the indicated field using the indicated values.

    • Signatures:

      • #[table_name].field_name[%attribute].regex

    • If the indicated regex field does not exist in the record, then a warning will be printed.

    • If the field value is a list, then the regex substitution will be done on each element in the list.

    Example:

    #tags

    #measurement.compound.value

    #measurement.id.regex

    (S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

    r’(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd’,r’(S)-2-Acetolactate Glutaric acid Methylsuccinic acid’

  • delete - will remove the field from the record.

    • Signatures:

      • #[table_name].field_name[%attribute].delete

    • “id” fields cannot be deleted. An error will be raised during parsing if it is attempted.

    • No value is needed under the tag.

    Example:

    #tags

    #measurement.compound.value

    #measurement.mol_formula.delete

    (S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

  • rename - will rename the field in the record.

    • Signatures:

      • #[table_name].field_name[%attribute].rename.field_name[%attribute]

    • “id” fields cannot be renamed. An error will be raised during parsing if it is attempted.

    • Fields cannot be renamed to the same name. An error will be raised during parsing if it is attempted.

    • No value is needed under the tag.

    Example:

    #tags

    #measurement.compound.value

    #measurement.mol_formula.rename.molecular_formula

    (S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

Note

ID fields are special.

  • Modifications to id fields will be propagated to the table key of that record.

  • They cannot be deleted or renamed.

  • They are the only time a period, ‘.’, can appear in the field_name.

    • Ex. #measurement.intensity.units.assign is malformed, but #measurement.entity.id.assign is not.

Important Points:

  • Tags in the same row must have the same table. An error will be raised during parsing if they don’t.

  • The value tag must be before the modification tags.

  • Modifications are confined to the matched record. It is not possible to modify a record based on another record’s fields or values.

  • Modifications can be chained together, so that the same field can have multiple modifications.

    • This can be utilized effectively, but can also cause hard to diagnose unexpected output.

    • Some warnings are printed when fields are modified twice in a way that doesn’t make sense, such as a delete modification after an assign modification, but all other chained modifications are assumed to be intended.

Tag Format Reference:

  • #table_name.field_name[%attribute].value - identifies table_name containing field_name (with possible attribute name) and associated column with value to match. A regular expression can be given with r’…’.

  • #comparison - identifies column with type of comparison (exact, regex, levenshtein, regex|exact). Default regex|exact.

  • #comparison=type - type of comparison (exact, regex, levenshtein, regex|exact).

  • #match - identifies column with type of match (first, first-nowarn, unique, all). Default first.

  • #match=type - type of match (first, first-nowarn, unique, all).

  • #[table_name].field_name[%attribute].assign identifies field to assign and associated column with its value.

  • *#[table_name].field_name[%attribute].assign identifies field to assign and that the associated column values are a list type.

  • #[table_name].field_name[%attribute].append identifies field to append to and associated column with its value.

  • *#[table_name].field_name[%attribute].append identifies field to append to and that the associated column values are a list type.

  • *#[table_name].field_name[%attribute].prepend identifies field to prepend to and associated column with its value.

  • *#[table_name].field_name[%attribute].prepend identifies field to prepend to and that the associated column values are a list type.

  • #[table_name].field_name[%attribute].regex identifies field to apply regex substitution to and associated column with the pair of regex strings of the form r”…”,r”…”.

  • #[table_name].field_name[%attribute].delete identifies field to delete.

  • #[table_name].field_name[%attribute].rename.field_name[%attribute] identifies field to rename.

Automation Tags

Automation tag rows, like the other tag rows, are indicated by #tags in the left most column with #ignore used to ignore rows, but #insert and #end tags are also introduced. There are 2 main functions in the automation system. One is to specify a table of header-tag pairs that will be used to automatically add the tags associated with the headers underneath of the headers when it finds them in the data. The other is to specify a block of rows to add to the data exactly as is. This is what introduces the #insert and #end tags.

Insert

The insertion functionality is easy to understand. You simply write whatever you want to add into the data and add #insert above it in the left most column and #end below it in the left most column. Everything in between #insert and #end is simply added as is into the data before it is parsed by the export tagging. A good use case for this is when you have a standard protocol that always needs to be added to some data. Instead of copying it in manually, you can add it to an automation sheet/file and deliver it to the extract command so it can add it for you. The thing to be careful of is to make sure everything in the insert block is valid under the export tagging. It can be tricky to debug a tagging error here because extract won’t be able to tell you that the issue is in the insert block.

Example:

#insert

#tags

#protocol.id

#.type

#.instrument

#.ion_mode

#.ionization

#.instrument_type

ICMS1

MS

Orbitrap Fusion

NEGATIVE

ESI

IC-FTMS

#end

Header Tagging

The header tagging allows you to automatically put export tags under a cell in tabular data based on the value in the cell. Typically, a table will already have descriptive human readable headers to identify what type of data is in the column. These headers are used to match to and put the associated export tags under them. Any row that has a header match where export tags are added is automatically ignored with the #ignore tag. Just like modification tags and export tags, #tags is used to denote the start of a tag block. An entire block is matched as a whole to a row in the data, so if you have multiple tables to add tags to you should created multiple tag blocks. There are additional tags to help control how a blank is matched, detailed below.

Example:

Data:

Compound

Mol_Formula

C_isomers

SamplID

Intensity

protein_mg

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

0

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

7989221.834

0.618176844244679

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

1

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

289287.7334

0.618176844244679

Header Tags:

#tags

#header

#add

Compound+”-13C”+C_isomers+”-“+SamplID

#measurement.id

Compound+”-13C”+C_isomers

#measurement.assignment

Compound

#measurement.compound

Mol_Formula

#measurement.formula

SamplID

#sample.id

“13C”+C_isomers

#measurement.isotopologue;#%type=”13C”

Intensity

#measurement.raw_intensity;#%type=”spectrometer peak area”

eval(float(#Intensity#) / float(#protein_mg#))

#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g”

#protocol.id=ICMS1

After Automation:

#ignore

Compound

Mol_Formula

C_isomers

SamplID

Intensity

#tags

#measurement.id

#measurement.assignment

#measurement.compound

#measurement.formula

#sample.id

#measurement.isotopologue;#%type=”13C”

#measurement.raw_intensity;#%type=”spectrometer peak area”

#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g”

#protocol.id=ICMS1

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

0

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

13C0

7989221.834

12923845.19

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

1

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

13C1

289287.7334

467969.2165

JSON Output:

{
  "measurement": {
    "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench": {
      "assignment": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0",
      "compound": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd",
      "formula": "C5H8O4",
      "id": "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench",
      "intensity": "12923845.19",
      "intensity%type": "natural abundance corrected and protein normalized peak area",
      "intensity%units": "area/g",
      "isotopologue": "13C0",
      "isotopologue%type": "13C",
      "protocol.id": "ICMS1",
      "raw_intensity": "7989221.83386388",
      "raw_intensity%type": "spectrometer peak area",
      "sample.id": "01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench"
    },
    "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench": {
      "assignment": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1",
      "compound": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd",
      "formula": "C5H8O4",
      "id": "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench",
      "intensity": "467969.2165",
      "intensity%type": "natural abundance corrected and protein normalized peak area",
      "intensity%units": "area/g",
      "isotopologue": "13C1",
      "isotopologue%type": "13C",
      "protocol.id": "ICMS1",
      "raw_intensity": "289287.733437356",
      "raw_intensity%type": "spectrometer peak area",
      "sample.id": "01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench"
    }
  }
}

The Tags

  • #header - header value to match to. Can be a regular expression of the form r’…’.

    • All tag blocks must contain a #header tag.

    • Additional column headers can be included with blank corresponding tags to help make header row identification unique.

    • Cell contents are stripped of leading and trailing white space before comparison with the header value.

    • A new column will be created if the headers and literals in quotes are combined with plus signs.

      • Example: Name+”-“+Isopotologue+”-“+r’^d+w+ Isotope$’

      • This functionality means that certain characters can’t be used for literal matching outside of a regex.

      • For example, if a header name in a data table is “protein+solvent”, then you can’t simply put protein+solvent under #header because it will be interpreted as a concatenation of a “protein” header and a separate “solvent” header.

      • The easiest way to solve this issue is to use a regular expression. r’protein+solvent’ will match the header correctly.

      • In general, if you are having difficulty matching a header, try using a regex.

    • An eval function can be used in the form “eval(…)”.

      • “#header_name#” and “#r’…’#” can be used to indicate specific columns in the row.

      • All Python language operators can be used. But remember to use “float(#field_name#)” to convert strings to floating point numbers.

      • Example: eval(float(#Intensity#) / float(#r’.*Normalization’#) * 5)

      • If the eval returns a list of items, it is converted into a string separated by semicolons.

        • If the corresponding tag is a list tag, then this will become a list.

  • #add - tags to add in an inserted row below the column header row.

    • All tag blocks must contain a #add tag.

    • The actual value under #add does not have to be a valid tag, the value will be copied as is.

    • Leave this value blank to add headers that are required to match a block, but don’t need tags.

  • #required - true|false whether this header description is required. The default is that all header descriptions are required.

    • Example:

    #tags

    #header

    #add | #required

    Compound+”-13C”+C_isomers+”-“+SamplID

    #measurement.id | true

    Compound+”-13C”+C_isomers

    #measurement.assignment | true

    Compound

    #measurement.compound | true

    Mol_Formula

    #measurement.formula | false

    SamplID

    #sample.id | true

    If the Mol_Formula header is not found, the tags will still be added, but without the Mol_Formula ones.

  • #exclude=test_string - test string or regular expression to use for excluding a given header row.

    • If a header matches the exclude string or regex, then the tags are not inserted regardless of whether the headers match.

    • Example:

    #tags

    #header

    #add | #exclude=r’Cell Type|Mouse Species’

    Compound+”-13C”+C_isomers+”-“+SamplID

    #measurement.id |

    Compound+”-13C”+C_isomers

    #measurement.assignment |

    Compound

    #measurement.compound |

    Mol_Formula

    #measurement.formula |

    SamplID

    #sample.id |

    If the “Cell Type” or “Mouse Species” headers are in the row, then don’t add the tags.

Insertions can be inside of header tag blocks, so they are inserted only when a match is made to the header rows. They function just like when they are on their own, except that there is an additional #multiple tag that can be used to control whether the insert happens every time the tag block matches or only the first time.

  • #multiple=true - will insert on every match.

  • #multiple=false - will only insert on the first match, this is the default behavior if #multiple is not specified.

Common Use Case Examples

For a real example you can see some in the examples folder of the GitHub repository under the extract folder. The following are excerpts from those examples.

Export Tags

Tagging Subject Entities

#tags

#entity.id;#entity.species=”Mus musculus”;#.species_type=Mouse;#.taxonomy_id=10090;#.type=subject

*#.protocol.id

#entity.replicate

#entity.time_point

#ignore

mouse number

protocol

replicate

time_point

01

01_A0_naive_0days_UKy_GCH_rep1

A0

naive

1

0

02

02_A1_naive_0days_UKy_GCH_rep2

A1

naive

2

0

03

03_A2_naive_0days_UKy_GCH_rep3

A2

naive

3

0

04

04_B0_syngenic_42days_UKy_GCH_rep1

B0

syngenic

1

42

05

05_B1_syngenic_42days_UKy_GCH_rep2

B1

syngenic

2

42

06

06_B2_syngenic_42days_UKy_GCH_rep3

B2

syngenic

3

42

07

07_C1-1_allogenic_42days_UKy_GCH_rep1

C1-1

allogenic

1

42

08

08_C1-2_allogenic_42days_UKy_GCH_rep2

C1-2

allogenic

2

42

You can see that the original header row is ignored with “#ignore” and the important columns are tagged with appropriate field names. Not all of the columns are tagged. What needs to be tagged or what is important will vary according to your needs and application. MESSES is largely meant for converting for deposition into a repository so information that is not important or relevant to the repository you are targeting will likely be untagged. Note that the *#.protocol.id tag is a list field even though it has single values underneath. This is be consistent with field types as other records can have multiple protocols associated with them.

Creating Samples From Parents

#tags

#entity.id

#entity%child.id=-polar-ICMS_A;#.replicate=1; #%type=”analytical”;#.weight; #%units=g;*#.protocol.id=polar_extraction,IC-FTMS_preparation; #entity.type=sample

#ignore

Slice #

Parent Sample ID

polar tare (5ml tube)

polar+tare

g polar ext

g polar FTMS A

g polar FTMS B

g polar ICMS A

g polar ICMS B

01

01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1

3.1476

4.8772

1.7296

0.0983

0.1056

0.2084

0.2083

02

02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2

3.1782

4.8490

1.6708

0.0983

0.0973

0.2001

0.1949

03

03_A2_Spleen_naive_0days_170427_UKy_GCH_rep3

3.1621

4.8485

1.6864

0.1029

0.1041

0.1951

0.2043

04

04_B0_Spleen_syngenic_42days_170427_UKy_GCH_rep1

3.1479

4.7828

1.6349

0.0938

0.0958

0.1885

0.1936

05

05_B1_Spleen_syngenic_42days_170427_UKy_GCH_rep2

3.1685

4.9067

1.7382

0.0938

0.0994

0.2010

0.2003

06

06_B2_Spleen_syngenic_42days_170427_UKy_GCH_rep3

3.1735

4.8483

1.6748

0.1003

0.0914

0.2000

0.1891

07

07_C1-1_Spleen_allogenic_42days_170427_UKy_GCH_rep1

3.1764

4.7631

1.5867

0.0980

0.0916

0.1859

0.1868

08

08_C1-2_Spleen_allogenic_42days_170427_UKy_GCH_rep2

3.1617

4.8176

1.6559

0.0955

0.0957

0.1982

0.1922

The entities in the 3rd column under the #entity.id tag are assumed to exist elsewhere. The %child tag is used to create a new sample that captures the weight of the polar extraction for the ICMS_A aliquot. The other aliquots are not captured because they are not needed or were not used for the study that is being deposited, but it is a good exercise for the reader to add tags to those columns based on the one shown for ICSM_A.

Modification and Automation Tags

The examples here will both use a common set of measurement data.

Measurement Data:

Compound

Mol_Formula

C_isomers

SamplID

Intensity

Renormalized

Total.NA.Removed

Corrected

Predicted

Comments

Quantified_uM_ratio

Quantified_uM_sequence_ratio

reconstitution_volume_uL

injection_volume_uL

protein_mg

icms_split_ratio

Amount_ProteinAdj_uMol_g_protein_RatioBased

Amount_ProteinAdj_uMol_g_protein_SequenceBased

CommentQuantification

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

0

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

7989221.834

8447352.892

8490014.365

8447352.892

7839899.288

NA

0

0

20

10

0.618176844

0.25575733

0

0

Legend

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

1

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

289287.7334

0

8490014.365

0

439597.5524

NA

0

0

20

10

0.618176844

0.25575733

0

0

Compound: name of assigned metabolite, noStd means assigment was NOT verified with standard compound

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

2

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

43815.55259

34916.99767

8490014.365

34916.99767

42996.61803

NA

0

0

20

10

0.618176844

0.25575733

0

0

Mol_Formula: molucular formula od assigned compound

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

3

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

6555.515601

5426.466785

8490014.365

5426.466785

6432.98974

NA

0

0

20

10

0.618176844

0.25575733

0

0

C_isomers and/or N_isomers: additional Da units that were measured in a targeted manner for the compound in question

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

4

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

2049.939279

1939.047016

8490014.365

1939.047016

2011.62489

NA

0

0

20

10

0.618176844

0.25575733

0

0

Intensity: raw XIC intensity data

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

C5H8O4

5

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

401.1818948

378.961431

8490014.365

378.961431

393.6836049

NA

0

0

20

10

0.618176844

0.25575733

0

0

Renormalized: Natural abundance Corrected XIC intensity data

13-BPG

C3H8O10P2

0

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

348209.1176

360055.818

560907.2249

360055.818

348209.1176

NA

0.055515033

0.0225354

20

10

0.618176844

0.25575733

0.00702263

0.002850719

Amount_ProteinAdj_uMol_g_protein_RatioBased = amount of quantified isotoplogue in sample in umol/g protein?protein, calculated as Quantified_uM x (RecVol/1000/SplitRatio/mg protein)

13-BPG

C3H8O10P2

1

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

14984.22348

3343.128038

560907.2249

3343.128038

14984.22348

NA

0.000515459

0.000209242

20

10

0.618176844

0.25575733

6.52E-05

2.65E-05

Amount_ProteinAdj_uMol_g_protein_SequenceBased = amount of quantified isotoplogue in sample in umol/g protein?protein, calculated as Quantified_uM x (RecVol/1000/SplitRatio/mg protein)

13-BPG

C3H8O10P2

2

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

11385.033

11305.71062

560907.2249

11305.71062

11385.033

NA

0.001743166

0.000707609

20

10

0.618176844

0.25575733

0.00022051

8.95E-05

13-BPG

C3H8O10P2

3

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

186328.8509

186202.5683

560907.2249

186202.5683

186328.8509

NA

0.028709553

0.011654163

20

10

0.618176844

0.25575733

0.003631747

0.001474247

Not all of the columns will be used and this is just a small portion of a much larger dataset.

Adding a Protocol

#insert

#tags

#protocol.id

#.type

#.machine_type

#.instrument

#.ion_mode

#.ionization

#.instrument_type

#.description

#.chromatography_instrument_name

#.chromatography_type

#.column_name

#.chromatography_description

#.parent_protocol

ICMS1

measurement

MS

Orbitrap Fusion

NEGATIVE

ESI

IC-FTMS

ICMS Analytical Experiment with detection of compounds by comparison to standards.

Thermo Dionex ICS-5000+

Targeted IC

Dionex IonPac AS11-HC-4um 2 mm i.d. x 250 mm

Targeted IC

IC-FTMS_measurement

#end

This would be one set of directives in the #automate sheet or file. The description tag has been shortened.

Automating Tags

#tags

#header

#add

Compound+”-13C”+C_isomers+”-“+SamplID

#measurement.id

Compound+”-13C”+C_isomers

#measurement.assignment

Compound

#measurement.compound

Mol_Formula

#measurement.formula

SamplID

#.entity.id

“13C”+C_isomers

#measurement.isotopologue;#%type=”13C”

Intensity

#measurement.raw_intensity;#%type=”spectrometer peak area”

Renormalized

#measurement.corrected_raw_intensity;#%type=”natural abundance corrected peak area”

Quantified_uM_sequence_ratio

#measurement.concentration;#%units=uM;#%type=”calculated from standard”

Amount_ProteinAdj_uMol_g_protein_SequenceBased

#measurement.normalized_concentration;#%type=”protein normalized”;#%units=”uMol/g”

eval(float(#Renormalized#) / float(#protein_mg#))

#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g”

#.protocol.id=ICMS1

You can see how the values in the #header column correspond to headers in the measurement data, and how the tags in the #add column will be newly created or added on top of the column in the measurement data.

Changing Entity IDs

#tags

#measurement.entity.id.value

#measurement.entity.id.assign

#unique=false

#match=all

01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench

01_A0_Colon_naive_0days_170427_UKy_GCH_rep1-polar-ICMS_A

02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench

02_A1_Colon_naive_0days_170427_UKy_GCH_rep2-polar-ICMS_A

03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench

03_A2_Colon_naive_0days_170427_UKy_GCH_rep3-polar-ICMS_A

This has been truncated, but you can see how it will overwrite the entity.id field for all records that match the “.value” column. Something like this may be necessary when there is a name mismatch between metadata and measurement data.

Changing IDs

#tags

#measurement.id.value

#measurement.id.regex

#unique=false

#comparison=regex

r’01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench’

r’01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench’,r’01_A0_Colon_naive_0days_170427_UKy_GCH_rep1-polar-ICMS_A’

r’02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench’

r’02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench’,r’02_A1_Colon_naive_0days_170427_UKy_GCH_rep2-polar-ICMS_A’

r’03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench’

r’03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench’,r’03_A2_Colon_naive_0days_170427_UKy_GCH_rep3-polar-ICMS_A’

This is similar to and partly because of the previous example that changes the entity.id field. You can see in the Automating Tags example that the id field is created with a concatenation of the column that will become the entity.id field (SamplID). In order to make things match we need to make the same change in the id field, but we can’t simply use an assign, thus the use of regex.

Adding Additional Fields

#tags

#measurement.assignment.value

#measurement.assignment%method.assign

#unique=false

#match=all

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0

database

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1

database

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C2

database

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C3

database

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C4

database

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C5

database

13-BPG-13C0

direct_to_standard

13-BPG-13C1

indirect_to_standard

13-BPG-13C2

indirect_to_standard

13-BPG-13C3

indirect_to_standard

This will add an additional “assignment%method” field to the measurement records whose “assignment” field matches the “.value” column.