Tagging

Terminology

header - a descriptive label at the top of a column in a table of data.
- Example: SampleID
tag - used to describe fields in a record of data. They are a semantic descriptive header used to parse a column (field) of data from a table.
- Example: #measurement.compound
table - in this section, table can refer to either a table in a data file or a SQL style table in the output. Often there are many small tables in a data file that are combined into a larger one in the extract output.
directives - directions given to MESSES commands using tagged files or their JSONized forms. A table and its tags taken as a whole.

Introduction

In order to extract data from arbitrary tables programmatically, some kind of system has to be devised. This could be something as simple as requiring a table be on the very first row and for the starting row to have column names for every column, but a system such as that would then be fragile. We decided to create a more robust system that could handle more complicated data arrangements and reduce the verbosity to a minimum. The system that was devised was an extra layer of tags added on top of existing tables that tell the extract command how to transform the tabular data into key-based record data similar to JSON or SQL databases. The specific table schema this system was designed for is covered in the Experiment Description Specification section, but it is general enough that it can be used with most table schema.

This initial system served its function well, but it became clear that more functionality was sorely needed. Namely, both a way to add tags programmatically to data and a way to modify record values was needed, so the system was expanded to provide facilities to do both. Ultimately there are 3 parts to the tagging system that are distinct from one another but have similar syntax and ideas. The “export” part is the system used to add directly on top of existing tabular data. It is the base system that must be used for the extraction to work at all. The “automation” part that is used to automate adding “export” tags to tabular data. Based on the header values in your data, you can use “automation” tags to add the “export” tags automatically. A good use case for this is when you have data output by a program in a consistent way. Instead of manually adding export tags to the program output each time, you can create an “automation” page that will add the “export” tags for you. The last “modification” part is the system to modify record values. It can be used to prepend, append, delete, overwrite, or regex substitute values. An example use-case would be to update old naming conventions. Validly tagged files in their tabular or JSON form can be referred to as directives as they direct the actions of the program. To reduce confusion between tags and directives “tags” should generally refer to the extra text added above tables, while “directives” are the tables and tags taken as a whole. Each row of a tagged table is an individual directive.

Each part of the tagging system has to be in their own sheet or file for the extract command. By default, export tags are expected to be in a sheet named ‘#export’, if given an Excel file without specifying a sheet name. If given a CSV file, then this file is expected to have export tags. Modification tags are expected to be in a sheet named ‘#modify’ by default, but can be specified using the –modify option. The option can be used to specify either a different sheet name in the given Excel file, a different Excel file, a different Excel file with a different sheet name, a Google Sheets file, a Google Sheets file with a different sheet name, a JSON file, or a CSV file. Automation tags are similarly specifiedusing the –automate option or otherwise expected in a sheet named ‘#automate’ by default.

Each part of the tagging system are explained below with examples. Examples using them with the extract command are in the Tutorial section and there are full run examples in the “examples” folder of the GitHub repository.

Export Tags

Tag rows are identified by putting ‘#tags’ in the left most column of the data file.
- Typically, there will be nothing underneath of ‘#tags’ except ‘#ignore’ for rows to ignore.
- You may need to add a blank column to the left of your data to accommodate this.
- ‘#tags’ always needs to be on the very left most column, even if your table has several blank columns before it starts.
Cells on this row will have tags over the columns to be exported.
Subsequent non-blank rows are interpreted as data unless the tag ‘#ignore’ is used in the left most column.

Basic Example:

#tags	#sample.id	#.weight;#%units=”g”	#.protocol.id
#ignore	Sample ID	Weight	Treatment
	…	…	…

Keyword Tags

#tags - identifies tag header rows. Must be in the first column of a row.
- Additional tag-value pairs in a cell with #tags are applied to all child records in this section, but are ignored for non-child records.
#ignore - ignore this row. Must be put in the first column of a row.
- Use this to ignore additional header explanation rows and unused data records within a block of records.
#table - change the working table to the indicated table name.

Column Field Format

Tags begin with a ‘#’ symbol and identify a specific relational database table, field, and optionally a field-attribute.
- Format: #table_name.field_name
  Example: #sample.dry_weight
  
  sample - the sample data relational table (workbench worksheet)
  
  dry_weight - the field name.
- Format: #table
  Example: #sample
  
  sample - the sample data relational table (workbench worksheet)
- Format: #.field_name
  Example: #.dry_weight
  
  dry_weight - the field name, for the table last identified.
- Format: #table_name.field_name%attribute_name
  Example: #sample.dry_weight%units
  
  sample - the sample data relational table (workbench worksheet)
  
  dry_weight - the field name.
  
  units - units attribute for the field.
- Format: #.field_name%attribute_name
  Example: #.dry_weight%units
  
  dry_weight - the field name, for the last table identified.
  
  units - units attribute for the field.
- Format: #%attribute_name
  Example: #%units
  
  units - units attribute for a field previously identified in a table previously identified.

Value Field Format

Values for a tag field can be specified directly using an equal sign “=”.
- Format: #table_name.field_name=field_value
  Example: #study.title=”labeled mouse study 278”
  
  study - the study table.
  
  “labeled mouse study 278” - value for the study title field.
- Note that there is a special case for “id” fields. “id” fields cannot be set with an “=”.
- Format: #table_name.field_name%attribute_name=attribute_value
  Example: #sample.dry_weight%units=g
  
  sample - sample table
  
  dry_weight - dry_weight field
  
  units - attribute for dry weight with a value of “g”.
Multiple tags can be specified in a single cell using a semicolon “;”.
- However, only one tag can be left without a direct value for unambiguous interpretation of a column.
- Format: #table_name.field_name;#table_name.field_name=field_value;…
  Example: #sample.dry_weight;#.dry_weight%units=mg
  
  sample - the sample data table.
  
  dry_weight - field for the column interpretation
  
  units - field with a direct value of “mg”.
Values can be combined into a single value using a plus sign “+”.
- Format: #table_name.field_name;#table_name.field_name=field_value+field_value
  Example: #study.title=”labeled mouse study 278 “+#.type
  
  study - the study table.
  
  “labeled mouse study 278 ” - part of value for the study title field.
  
  #.type - part of value of the study title field taken from the study type field.

ID Field Format

ID tags indicate an identifier for a record in a table.
The ID must uniquely identify a record.
There must be an ID tag in every tag row.
Using the same ID in multiple tag rows will add to the existing record.
- Fields with the same name for the same ID across multiple tag rows will automatically become list fields, and new values will be added to the list.
Records can have record IDs from other tables or the same table as fields.
- Format: #table.id
  Example: #sample.id
  
  sample - type of id tag. This often is a table name.
  
  id - indicates that this is an id tag.
- Format: #.id
  Short format can be used when the table is already specified.

List Field Format

List field tags begin with ‘*#’ (asterisk followed by the pound sign) and identify a specific relational database table, field, and optionally a field-attribute that has multiple values.
They have the same format as normal column field tags.
- Format: *#table_name.field_name
  Example: *#sample.dry_weight
  
  sample - the sample data relational table (workbench worksheet)
  
  dry_weight - the field name.
Individual values are separated by commas “,” both in the column cells or in the column tag value.
- Format: *#table_name.field_name=field_value,field_value,…
  Example: *#study.labeling=13C,15N
  
  study - the study table.
  
  13C,15N - two values for the study labeling field.
List field tags can be listed multiple times in a record, with each value(s) appended.

Child Tag Format

Child record tags provides a mechanism for indicating parent-child relationships between records in the same table or between tables.
- The child tag indicates the creation of a new record.
- Subsequent normal tags identify fields in the new child record.
- A special parentID field is added using the first ID tag indicated in the header row.
- Format: #%child.id=id_sub_string
  The value for id_sub_string will be appended to the ID of the child’s parent (parentID) to create the child ID.

Example:

#tags	#sample.id	#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg	#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg
	KO labelled_1	4.2	8.5
	KO labelled_2	4.7	9.7
	…	…	…

Output JSON:

{
  "sample": {
    "KO labelled_1": {
      "id": "KO labelled_1"
    },
    "KO labelled_1-media-0h": {
      "dry_weight": "4.2",
      "dry_weight%units": "mg",
      "id": "KO labelled_1-media-0h",
      "parentID": "KO labelled_1"
    },
    "KO labelled_1-media-3h": {
      "dry_weight": "8.5",
      "dry_weight%units": "mg",
      "id": "KO labelled_1-media-3h",
      "parentID": "KO labelled_1"
    },
    "KO labelled_2": {
      "id": "KO labelled_2"
    },
    "KO labelled_2-media-0h": {
      "dry_weight": "4.7",
      "dry_weight%units": "mg",
      "id": "KO labelled_2-media-0h",
      "parentID": "KO labelled_2"
    },
    "KO labelled_2-media-3h": {
      "dry_weight": "9.7",
      "dry_weight%units": "mg",
      "id": "KO labelled_2-media-3h",
      "parentID": "KO labelled_2"
    }
  }
}

Field Tracking Tags

Field tracking tags provide a mechanism for copying the latest field value from one table into the records of another.
- The tag indicates which table’s records to add to and which field to track from another table.
- The latest value for the field seen while parsing will be added.
- Useful for adding project and study ids to records in a document with multiple projects or studies.
- Format: #table%track=table.field
- Example: #sample%track=project.id
  will add the project.id field to every sample record.
- A list format can also be used:
  #table%track=table.field1,table.field2,…
- Example: #sample%track=project.id,study.id
- Fields can also be untracked after tracking to stop adding the field to records.
- Format: #table%untrack=table.field
- The list format also works for untrack.
- If a tracked field is specifically given in a table the given value is used over the tracked value.

Example:

#tags	#sample%track=project.id

#tags	#project.id
	Project 1

#tags	#sample.id
	01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1
	02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2

Output JSON:

{
  "project": {
    "Project 1": {
      "id": "Project 1"
    }
  },
  "sample": {
    "01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1": {
      "id": "01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1",
      "project.id": "Project 1",
    },
    "02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2": {
      "id": "02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2",
      "project.id": "Project 1",
    }
  }
}

Modification Tags

Similar to export tags, the modification tag rows are indicated by #tags in the left most column, and #ignore can be used to ignore rows. The general idea behind the modification system is that you first use tags to indicate a field in a table to match to. Then underneath that tag indicate the value in that field to match to. Then another tag in the same row will indicate both what field to modify in the record that has the matching field and what modification to do. Underneath that tag will have the value to do the modification with.

Basic Examples:

#tags	#measurement.compound.value	#measurement.compound.assign	#match=all
	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid
	…	…

This example replaces the “compound” field with value “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd” in “measurement” table records with “(S)-2-Acetolactate Glutaric acid Methylsuccinic acid”.

#tags	#measurement.compound.value	#measurement.id.regex	#match=all
	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	r’(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd’,r’(S)-2-Acetolactate Glutaric acid Methylsuccinic acid’
	…	…

This example does a regex substitution on the “id” field of records in the “measurement” table if their “compound” field matches “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd”. In the “id” field “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd” is substituted with “(S)-2-Acetolactate Glutaric acid Methylsuccinic acid”.

Value Tag

All modification tag rows must start with a value tag after #tags.
This tag indicates which table and field to compare with.
The value underneath the tag will be compared with the value in the indicated field for all of the records in the indicated table to determine if a match is made.
- Signature:
  #[table_name].field_name[%attribute].value

Comparison Type

The comparison type is controlled through the #comparison tag. It can be used to specify the type for each row individually, or for the whole column by using the = sign. Ex. #comparison=exact

There are 4 comparison types, “exact”, “regex”, “levenshtein”, and “regex|exact”.

exact is a simple exact comparison between the comparison value and field value. The given comparison value must be exactly what is in the field value (i.e. a “==” comparison).

regex expects a regex to be in the comparison value and will print a message if it is not. The given regex will be delivered to re.search() for matching with field values.

levenshtein calculates the levenshtein distance between the comparison value and record field values and matches to the field values with the minimum distance.

This means it always matches to something even if the values seem wildly different, so be aware of possible unexpected results.

regex|exact is an intelligent combination of regex and exact. If a regex is specified for the comparison value it will be detected and the type will be regex for that comparison value only, otherwise the comparison type will be exact.

If a type is specified then that type is used regardless of the comparison value, so a regex string with an exact comparison type will try to match exactly.

If the comparison tag is not specified, then the type defaults to “regex|exact”.

exact type modifications are executed first, then regex type, and lastly levenshtein type. Within each type, modifications are executed in order from first to last.

Match Type

Match behavior can be altered further using the #match tag. It can be used to specify the type for each row individually, or for the whole column by using the = sign. Ex. #match=all

There are 4 match types, “first”, “first-nowarn”, “unique”, and “all”.

“first” - the modification is performed only for the first record matched, additional matches beyond the first will print a warning.

“first-nowarn” - the same as first, but won’t print warnings.

“unique” - the modification is only performed if 1 and only 1 record matched.

For levenshtein, this means that only 1 field value can have the minimum distance, if 2 values share the minimum distance then the action won’t take place.

“all” - the modification is done to every record that matches.

If the match tag is not specified, then the type defaults to “first”.

Modifications

There are 6 modifications that can be done, “assign”, “append”, “prepend”, “regex”, “delete”, and “rename”.

assign - will overwrite whatever value is in the field with the indicated assignment value.

Signatures:

#[table_name].field_name[%attribute].assign

*#[table_name].field_name[%attribute].assign

If the indicated assignment field does not exist in the record then it will be added to the record.

An eval function can be used in the form “eval(…)”.

“#field_name#” and “#r’…’#” can be used to construct the assignment value for the record.

All Python language operators can be used. But remember to use “float(#field_name#)” to convert strings to floating point numbers.

Example: eval(float(#intensity#) / float(#normalization#) * 5)

evals that return a list of strings will convert the field to a list field.

Add an asterisk, ‘*’, to the front of the tag to interpret the assignment value as a list and assign that list value to the field.

An eval function can be used, but it must return a list of strings.

The assign modification can be used to change list types to non list types and vice versa.

This can lead to an issue where some records have a list type for the field and some do not.

If that is not intended, then be sure to construct the assign tag such that it matches the type of the field.

For instance, make sure evals return a list if the field should be a list type.

Example:

#tags

#measurement.compound.value

#measurement.compound.assign

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid

…

…

append - will append the indicated value to the value in the indicated field.

Signatures:

#[table_name].field_name[%attribute].append

*#[table_name].field_name[%attribute].append

If the indicated append field does not exist in the record, then it will be added to the record.

If the field value is a list and the append value is not a list, then the append value will be appended to each value in the list.

Add an asterisk, ‘*’, to the front of the tag to interpret the append value as a list.

When the append value is a list, the behavior is more complicated.

For each value in the field value list, the append value in the append list at the same index will be appended to the field value.

Examples:

field_value = [“a”, “b”] append_value = [“c”, “d”] result = [“ac”, “bd”]

field_value = [“a”, “b”] append_value = [“c”, “d”, “e”] result = [“ac”, “bd”]

field_value = [“a”, “b”, “e”] append_value = [“c”, “d”] result = [“ac”, “bd”, “e”]

Example:

#tags

#measurement.compound.value

#measurement.sample.id.append

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid

…

…

prepend - will prepend the indicated value to the value in the indicated field.

Signatures:

#[table_name].field_name[%attribute].prepend

*#[table_name].field_name[%attribute].prepend

If the indicated prepend field does not exist in the record, then it will be added to the record.

If the field value is a list and the prepend value is not a list, then the prepend value will be prepended to each value in the list.

Add an asterisk, ‘*’, to the front of the tag to interpret the prepend value as a list.

When the prepend value is a list, the behavior is more complicated.

For each value in the field value list, the prepend value in the prepend list at the same index will be prepended to the field value.

Examples:

field_value = [“a”, “b”] prepend_value = [“c”, “d”] result = [“ca”, “db”]

field_value = [“a”, “b”] prepend_value = [“c”, “d”, “e”] result = [“ca”, “db”]

field_value = [“a”, “b”, “e”] prepend_value = [“c”, “d”] result = [“ca”, “db”, “e”]

Example:

#tags

#measurement.compound.value

#measurement.sample.id.prepend

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

(S)-2-Acetolactate Glutaric acid Methylsuccinic acid

…

…

regex - will do a regex substitution on the indicated field using the indicated values.

Signatures:

#[table_name].field_name[%attribute].regex

If the indicated regex field does not exist in the record, then a warning will be printed.

If the field value is a list, then the regex substitution will be done on each element in the list.

Example:

#tags

#measurement.compound.value

#measurement.id.regex

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

r’(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd’,r’(S)-2-Acetolactate Glutaric acid Methylsuccinic acid’

…

…

delete - will remove the field from the record.
- Signatures:
  #[table_name].field_name[%attribute].delete
- “id” fields cannot be deleted. An error will be raised during parsing if it is attempted.
- No value is needed under the tag.
Example:

#tags

#measurement.compound.value

#measurement.mol_formula.delete

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

…

…
rename - will rename the field in the record.
- Signatures:
  #[table_name].field_name[%attribute].rename.field_name[%attribute]
- “id” fields cannot be renamed. An error will be raised during parsing if it is attempted.
- Fields cannot be renamed to the same name. An error will be raised during parsing if it is attempted.
- No value is needed under the tag.
Example:

#tags

#measurement.compound.value

#measurement.mol_formula.rename.molecular_formula

(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd

…

…

Note

ID fields are special.

Modifications to id fields will be propagated to the table key of that record.
They cannot be deleted or renamed.
They are the only time a period, ‘.’, can appear in the field_name.
- Ex. #measurement.intensity.units.assign is malformed, but #measurement.entity.id.assign is not.

Important Points:

Tags in the same row must have the same table. An error will be raised during parsing if they don’t.
The value tag must be before the modification tags.
Modifications are confined to the matched record. It is not possible to modify a record based on another record’s fields or values.
Modifications can be chained together, so that the same field can have multiple modifications.
- This can be utilized effectively, but can also cause hard to diagnose unexpected output.
- Some warnings are printed when fields are modified twice in a way that doesn’t make sense, such as a delete modification after an assign modification, but all other chained modifications are assumed to be intended.

Tag Format Reference:

#table_name.field_name[%attribute].value - identifies table_name containing field_name (with possible attribute name) and associated column with value to match. A regular expression can be given with r’…’.
#comparison - identifies column with type of comparison (exact, regex, levenshtein, regex|exact). Default regex|exact.
#comparison=type - type of comparison (exact, regex, levenshtein, regex|exact).
#match - identifies column with type of match (first, first-nowarn, unique, all). Default first.
#match=type - type of match (first, first-nowarn, unique, all).
#[table_name].field_name[%attribute].assign identifies field to assign and associated column with its value.
*#[table_name].field_name[%attribute].assign identifies field to assign and that the associated column values are a list type.
#[table_name].field_name[%attribute].append identifies field to append to and associated column with its value.
*#[table_name].field_name[%attribute].append identifies field to append to and that the associated column values are a list type.
*#[table_name].field_name[%attribute].prepend identifies field to prepend to and associated column with its value.
*#[table_name].field_name[%attribute].prepend identifies field to prepend to and that the associated column values are a list type.
#[table_name].field_name[%attribute].regex identifies field to apply regex substitution to and associated column with the pair of regex strings of the form r”…”,r”…”.
#[table_name].field_name[%attribute].delete identifies field to delete.
#[table_name].field_name[%attribute].rename.field_name[%attribute] identifies field to rename.

Automation Tags

Automation tag rows, like the other tag rows, are indicated by #tags in the left most column with #ignore used to ignore rows, but #insert and #end tags are also introduced. There are 2 main functions in the automation system. One is to specify a table of header-tag pairs that will be used to automatically add the tags associated with the headers underneath of the headers when it finds them in the data. The other is to specify a block of rows to add to the data exactly as is. This is what introduces the #insert and #end tags.

Insert

The insertion functionality is easy to understand. You simply write whatever you want to add into the data and add #insert above it in the left most column and #end below it in the left most column. Everything in between #insert and #end is simply added as is into the data before it is parsed by the export tagging. A good use case for this is when you have a standard protocol that always needs to be added to some data. Instead of copying it in manually, you can add it to an automation sheet/file and deliver it to the extract command so it can add it for you. The thing to be careful of is to make sure everything in the insert block is valid under the export tagging. It can be tricky to debug a tagging error here because extract won’t be able to tell you that the issue is in the insert block.

Example:

#insert
#tags	#protocol.id	#.type	#.instrument	#.ion_mode	#.ionization	#.instrument_type
	ICMS1	MS	Orbitrap Fusion	NEGATIVE	ESI	IC-FTMS

#end

Header Tagging

The header tagging allows you to automatically put export tags under a cell in tabular data based on the value in the cell. Typically, a table will already have descriptive human readable headers to identify what type of data is in the column. These headers are used to match to and put the associated export tags under them. Any row that has a header match where export tags are added is automatically ignored with the #ignore tag. Just like modification tags and export tags, #tags is used to denote the start of a tag block. An entire block is matched as a whole to a row in the data, so if you have multiple tables to add tags to you should created multiple tag blocks. There are additional tags to help control how a blank is matched, detailed below.

Example:

Data:

Compound	Mol_Formula	C_isomers	SamplID	Intensity	protein_mg
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	0	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	7989221.834	0.618176844244679
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	1	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	289287.7334	0.618176844244679

Header Tags:

#tags	#header	#add
	Compound+”-13C”+C_isomers+”-“+SamplID	#measurement.id
	Compound+”-13C”+C_isomers	#measurement.assignment
	Compound	#measurement.compound
	Mol_Formula	#measurement.formula
	SamplID	#sample.id
	“13C”+C_isomers	#measurement.isotopologue;#%type=”13C”
	Intensity	#measurement.raw_intensity;#%type=”spectrometer peak area”
	eval(float(#Intensity#) / float(#protein_mg#))	#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g”
		#protocol.id=ICMS1

After Automation:

#ignore			Compound	Mol_Formula	C_isomers	SamplID		Intensity
#tags	#measurement.id	#measurement.assignment	#measurement.compound	#measurement.formula		#sample.id	#measurement.isotopologue;#%type=”13C”	#measurement.raw_intensity;#%type=”spectrometer peak area”	#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g”	#protocol.id=ICMS1
	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	0	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	13C0	7989221.834	12923845.19
	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1	(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	1	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	13C1	289287.7334	467969.2165

JSON Output:

{
  "measurement": {
    "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench": {
      "assignment": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0",
      "compound": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd",
      "formula": "C5H8O4",
      "id": "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench",
      "intensity": "12923845.19",
      "intensity%type": "natural abundance corrected and protein normalized peak area",
      "intensity%units": "area/g",
      "isotopologue": "13C0",
      "isotopologue%type": "13C",
      "protocol.id": "ICMS1",
      "raw_intensity": "7989221.83386388",
      "raw_intensity%type": "spectrometer peak area",
      "sample.id": "01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench"
    },
    "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench": {
      "assignment": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1",
      "compound": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd",
      "formula": "C5H8O4",
      "id": "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench",
      "intensity": "467969.2165",
      "intensity%type": "natural abundance corrected and protein normalized peak area",
      "intensity%units": "area/g",
      "isotopologue": "13C1",
      "isotopologue%type": "13C",
      "protocol.id": "ICMS1",
      "raw_intensity": "289287.733437356",
      "raw_intensity%type": "spectrometer peak area",
      "sample.id": "01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench"
    }
  }
}

The Tags

#header - header value to match to. Can be a regular expression of the form r’…’.
- All tag blocks must contain a #header tag.
- Additional column headers can be included with blank corresponding tags to help make header row identification unique.
- Cell contents are stripped of leading and trailing white space before comparison with the header value.
- A new column will be created if the headers and literals in quotes are combined with plus signs.
  Example: Name+”-“+Isopotologue+”-“+r’^d+w+ Isotope$’
  
  This functionality means that certain characters can’t be used for literal matching outside of a regex.
  
  For example, if a header name in a data table is “protein+solvent”, then you can’t simply put protein+solvent under #header because it will be interpreted as a concatenation of a “protein” header and a separate “solvent” header.
  
  The easiest way to solve this issue is to use a regular expression. r’protein+solvent’ will match the header correctly.
  
  In general, if you are having difficulty matching a header, try using a regex.
- An eval function can be used in the form “eval(…)”.
  “#header_name#” and “#r’…’#” can be used to indicate specific columns in the row.
  
  All Python language operators can be used. But remember to use “float(#field_name#)” to convert strings to floating point numbers.
  
  Example: eval(float(#Intensity#) / float(#r’.*Normalization’#) * 5)
  
  If the eval returns a list of items, it is converted into a string separated by semicolons.
  
  If the corresponding tag is a list tag, then this will become a list.
#add - tags to add in an inserted row below the column header row.
- All tag blocks must contain a #add tag.
- The actual value under #add does not have to be a valid tag, the value will be copied as is.
- Leave this value blank to add headers that are required to match a block, but don’t need tags.

#required - true|false whether this header description is required. The default is that all header descriptions are required.

Example:

#tags

#header

#add | #required

Compound+”-13C”+C_isomers+”-“+SamplID

#measurement.id | true

Compound+”-13C”+C_isomers

#measurement.assignment | true

Compound

#measurement.compound | true

Mol_Formula

#measurement.formula | false

SamplID

#sample.id | true

If the Mol_Formula header is not found, the tags will still be added, but without the Mol_Formula ones.

#exclude=test_string - test string or regular expression to use for excluding a given header row.

If a header matches the exclude string or regex, then the tags are not inserted regardless of whether the headers match.

Example:

#tags

#header

#add | #exclude=r’Cell Type|Mouse Species’

Compound+”-13C”+C_isomers+”-“+SamplID

#measurement.id |

Compound+”-13C”+C_isomers

#measurement.assignment |

Compound

#measurement.compound |

Mol_Formula

#measurement.formula |

SamplID

#sample.id |

If the “Cell Type” or “Mouse Species” headers are in the row, then don’t add the tags.

Insertions can be inside of header tag blocks, so they are inserted only when a match is made to the header rows. They function just like when they are on their own, except that there is an additional #multiple tag that can be used to control whether the insert happens every time the tag block matches or only the first time.

#multiple=true - will insert on every match.
#multiple=false - will only insert on the first match, this is the default behavior if #multiple is not specified.

Common Use Case Examples

For a real example you can see some in the examples folder of the GitHub repository under the extract folder. The following are excerpts from those examples.

Export Tags

Tagging Subject Entities

#tags		#entity.id;#entity.species=”Mus musculus”;#.species_type=Mouse;#.taxonomy_id=10090;#.type=subject		*#.protocol.id	#entity.replicate	#entity.time_point
#ignore			mouse number	protocol	replicate	time_point
	01	01_A0_naive_0days_UKy_GCH_rep1	A0	naive	1	0
	02	02_A1_naive_0days_UKy_GCH_rep2	A1	naive	2	0
	03	03_A2_naive_0days_UKy_GCH_rep3	A2	naive	3	0
	04	04_B0_syngenic_42days_UKy_GCH_rep1	B0	syngenic	1	42
	05	05_B1_syngenic_42days_UKy_GCH_rep2	B1	syngenic	2	42
	06	06_B2_syngenic_42days_UKy_GCH_rep3	B2	syngenic	3	42
	07	07_C1-1_allogenic_42days_UKy_GCH_rep1	C1-1	allogenic	1	42
	08	08_C1-2_allogenic_42days_UKy_GCH_rep2	C1-2	allogenic	2	42

You can see that the original header row is ignored with “#ignore” and the important columns are tagged with appropriate field names. Not all of the columns are tagged. What needs to be tagged or what is important will vary according to your needs and application. MESSES is largely meant for converting for deposition into a repository so information that is not important or relevant to the repository you are targeting will likely be untagged. Note that the *#.protocol.id tag is a list field even though it has single values underneath. This is be consistent with field types as other records can have multiple protocols associated with them.

Creating Samples From Parents

#tags		#entity.id						#entity%child.id=-polar-ICMS_A;#.replicate=1; #%type=”analytical”;#.weight; #%units=g;*#.protocol.id=polar_extraction,IC-FTMS_preparation; #entity.type=sample
#ignore	Slice #	Parent Sample ID	polar tare (5ml tube)	polar+tare	g polar ext	g polar FTMS A	g polar FTMS B	g polar ICMS A	g polar ICMS B
	01	01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1	3.1476	4.8772	1.7296	0.0983	0.1056	0.2084	0.2083
	02	02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2	3.1782	4.8490	1.6708	0.0983	0.0973	0.2001	0.1949
	03	03_A2_Spleen_naive_0days_170427_UKy_GCH_rep3	3.1621	4.8485	1.6864	0.1029	0.1041	0.1951	0.2043
	04	04_B0_Spleen_syngenic_42days_170427_UKy_GCH_rep1	3.1479	4.7828	1.6349	0.0938	0.0958	0.1885	0.1936
	05	05_B1_Spleen_syngenic_42days_170427_UKy_GCH_rep2	3.1685	4.9067	1.7382	0.0938	0.0994	0.2010	0.2003
	06	06_B2_Spleen_syngenic_42days_170427_UKy_GCH_rep3	3.1735	4.8483	1.6748	0.1003	0.0914	0.2000	0.1891
	07	07_C1-1_Spleen_allogenic_42days_170427_UKy_GCH_rep1	3.1764	4.7631	1.5867	0.0980	0.0916	0.1859	0.1868
	08	08_C1-2_Spleen_allogenic_42days_170427_UKy_GCH_rep2	3.1617	4.8176	1.6559	0.0955	0.0957	0.1982	0.1922

The entities in the 3rd column under the #entity.id tag are assumed to exist elsewhere. The %child tag is used to create a new sample that captures the weight of the polar extraction for the ICMS_A aliquot. The other aliquots are not captured because they are not needed or were not used for the study that is being deposited, but it is a good exercise for the reader to add tags to those columns based on the one shown for ICSM_A.

Modification and Automation Tags

The examples here will both use a common set of measurement data.

Measurement Data:

Compound	Mol_Formula	C_isomers	SamplID	Intensity	Renormalized	Total.NA.Removed	Corrected	Predicted	Comments	Quantified_uM_ratio	Quantified_uM_sequence_ratio	reconstitution_volume_uL	injection_volume_uL	protein_mg	icms_split_ratio	Amount_ProteinAdj_uMol_g_protein_RatioBased	Amount_ProteinAdj_uMol_g_protein_SequenceBased	CommentQuantification
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	0	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	7989221.834	8447352.892	8490014.365	8447352.892	7839899.288	NA	0	0	20	10	0.618176844	0.25575733	0	0	Legend
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	1	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	289287.7334	0	8490014.365	0	439597.5524	NA	0	0	20	10	0.618176844	0.25575733	0	0	Compound: name of assigned metabolite, noStd means assigment was NOT verified with standard compound
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	2	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	43815.55259	34916.99767	8490014.365	34916.99767	42996.61803	NA	0	0	20	10	0.618176844	0.25575733	0	0	Mol_Formula: molucular formula od assigned compound
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	3	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	6555.515601	5426.466785	8490014.365	5426.466785	6432.98974	NA	0	0	20	10	0.618176844	0.25575733	0	0	C_isomers and/or N_isomers: additional Da units that were measured in a targeted manner for the compound in question
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	4	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	2049.939279	1939.047016	8490014.365	1939.047016	2011.62489	NA	0	0	20	10	0.618176844	0.25575733	0	0	Intensity: raw XIC intensity data
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd	C5H8O4	5	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	401.1818948	378.961431	8490014.365	378.961431	393.6836049	NA	0	0	20	10	0.618176844	0.25575733	0	0	Renormalized: Natural abundance Corrected XIC intensity data
13-BPG	C3H8O10P2	0	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	348209.1176	360055.818	560907.2249	360055.818	348209.1176	NA	0.055515033	0.0225354	20	10	0.618176844	0.25575733	0.00702263	0.002850719	Amount_ProteinAdj_uMol_g_protein_RatioBased = amount of quantified isotoplogue in sample in umol/g protein?protein, calculated as Quantified_uM x (RecVol/1000/SplitRatio/mg protein)
13-BPG	C3H8O10P2	1	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	14984.22348	3343.128038	560907.2249	3343.128038	14984.22348	NA	0.000515459	0.000209242	20	10	0.618176844	0.25575733	6.52E-05	2.65E-05	Amount_ProteinAdj_uMol_g_protein_SequenceBased = amount of quantified isotoplogue in sample in umol/g protein?protein, calculated as Quantified_uM x (RecVol/1000/SplitRatio/mg protein)
13-BPG	C3H8O10P2	2	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	11385.033	11305.71062	560907.2249	11305.71062	11385.033	NA	0.001743166	0.000707609	20	10	0.618176844	0.25575733	0.00022051	8.95E-05
13-BPG	C3H8O10P2	3	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	186328.8509	186202.5683	560907.2249	186202.5683	186328.8509	NA	0.028709553	0.011654163	20	10	0.618176844	0.25575733	0.003631747	0.001474247

Not all of the columns will be used and this is just a small portion of a much larger dataset.

Adding a Protocol

#insert
#tags	#protocol.id	#.type	#.machine_type	#.instrument	#.ion_mode	#.ionization	#.instrument_type	#.description	#.chromatography_instrument_name	#.chromatography_type	#.column_name	#.chromatography_description	#.parent_protocol
	ICMS1	measurement	MS	Orbitrap Fusion	NEGATIVE	ESI	IC-FTMS	ICMS Analytical Experiment with detection of compounds by comparison to standards.	Thermo Dionex ICS-5000+	Targeted IC	Dionex IonPac AS11-HC-4um 2 mm i.d. x 250 mm	Targeted IC	IC-FTMS_measurement

#end

This would be one set of directives in the #automate sheet or file. The description tag has been shortened.

Automating Tags

#tags	#header		#add
	Compound+”-13C”+C_isomers+”-“+SamplID		#measurement.id
	Compound+”-13C”+C_isomers		#measurement.assignment
	Compound		#measurement.compound
	Mol_Formula		#measurement.formula
	SamplID		#.entity.id
	“13C”+C_isomers		#measurement.isotopologue;#%type=”13C”
	Intensity		#measurement.raw_intensity;#%type=”spectrometer peak area”
	Renormalized		#measurement.corrected_raw_intensity;#%type=”natural abundance corrected peak area”
	Quantified_uM_sequence_ratio		#measurement.concentration;#%units=uM;#%type=”calculated from standard”
	Amount_ProteinAdj_uMol_g_protein_SequenceBased		#measurement.normalized_concentration;#%type=”protein normalized”;#%units=”uMol/g”
	eval(float(#Renormalized#) / float(#protein_mg#))		#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g”
			#.protocol.id=ICMS1

You can see how the values in the #header column correspond to headers in the measurement data, and how the tags in the #add column will be newly created or added on top of the column in the measurement data.

Changing Entity IDs

#tags	#measurement.entity.id.value	#measurement.entity.id.assign	#unique=false	#match=all
	01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench	01_A0_Colon_naive_0days_170427_UKy_GCH_rep1-polar-ICMS_A
	02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench	02_A1_Colon_naive_0days_170427_UKy_GCH_rep2-polar-ICMS_A
	03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench	03_A2_Colon_naive_0days_170427_UKy_GCH_rep3-polar-ICMS_A

This has been truncated, but you can see how it will overwrite the entity.id field for all records that match the “.value” column. Something like this may be necessary when there is a name mismatch between metadata and measurement data.

Changing IDs

#tags	#measurement.id.value	#measurement.id.regex	#unique=false	#comparison=regex
	r’01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench’	r’01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench’,r’01_A0_Colon_naive_0days_170427_UKy_GCH_rep1-polar-ICMS_A’
	r’02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench’	r’02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench’,r’02_A1_Colon_naive_0days_170427_UKy_GCH_rep2-polar-ICMS_A’
	r’03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench’	r’03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench’,r’03_A2_Colon_naive_0days_170427_UKy_GCH_rep3-polar-ICMS_A’

This is similar to and partly because of the previous example that changes the entity.id field. You can see in the Automating Tags example that the id field is created with a concatenation of the column that will become the entity.id field (SamplID). In order to make things match we need to make the same change in the id field, but we can’t simply use an assign, thus the use of regex.

Adding Additional Fields

#tags	#measurement.assignment.value	#measurement.assignment%method.assign	#unique=false	#match=all
	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0	database
	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1	database
	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C2	database
	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C3	database
	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C4	database
	(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C5	database
	13-BPG-13C0	direct_to_standard
	13-BPG-13C1	indirect_to_standard
	13-BPG-13C2	indirect_to_standard
	13-BPG-13C3	indirect_to_standard

This will add an additional “assignment%method” field to the measurement records whose “assignment” field matches the “.value” column.