Tagging
Terminology
header - a descriptive label at the top of a column in a table of data.
Example: SampleID
tag - used to describe fields in a record of data. They are a semantic descriptive header used to parse a column (field) of data from a table.
Example: #measurement.compound
table - in this section, table can refer to either a table in a data file or a SQL style table in the output. Often there are many small tables in a data file that are combined into a larger one in the extract output.
directives - directions given to MESSES commands using tagged files or their JSONized forms. A table and its tags taken as a whole.
Introduction
In order to extract data from arbitrary tables programmatically, some kind of system has to be devised. This could be something as simple as requiring a table be on the very first row and for the starting row to have column names for every column, but a system such as that would then be fragile. We decided to create a more robust system that could handle more complicated data arrangements and reduce the verbosity to a minimum. The system that was devised was an extra layer of tags added on top of existing tables that tell the extract command how to transform the tabular data into key-based record data similar to JSON or SQL databases. The specific table schema this system was designed for is covered in the Experiment Description Specification section, but it is general enough that it can be used with most table schema.
This initial system served its function well, but it became clear that more functionality was sorely needed. Namely, both a way to add tags programmatically to data and a way to modify record values was needed, so the system was expanded to provide facilities to do both. Ultimately there are 3 parts to the tagging system that are distinct from one another but have similar syntax and ideas. The “export” part is the system used to add directly on top of existing tabular data. It is the base system that must be used for the extraction to work at all. The “automation” part that is used to automate adding “export” tags to tabular data. Based on the header values in your data, you can use “automation” tags to add the “export” tags automatically. A good use case for this is when you have data output by a program in a consistent way. Instead of manually adding export tags to the program output each time, you can create an “automation” page that will add the “export” tags for you. The last “modification” part is the system to modify record values. It can be used to prepend, append, delete, overwrite, or regex substitute values. An example use-case would be to update old naming conventions. Validly tagged files in their tabular or JSON form can be referred to as directives as they direct the actions of the program. To reduce confusion between tags and directives “tags” should generally refer to the extra text added above tables, while “directives” are the tables and tags taken as a whole. Each row of a tagged table is an individual directive.
Each part of the tagging system has to be in their own sheet or file for the extract command. By default, export tags are expected to be in a sheet named ‘#export’, if given an Excel file without specifying a sheet name. If given a CSV file, then this file is expected to have export tags. Modification tags are expected to be in a sheet named ‘#modify’ by default, but can be specified using the –modify option. The option can be used to specify either a different sheet name in the given Excel file, a different Excel file, a different Excel file with a different sheet name, a Google Sheets file, a Google Sheets file with a different sheet name, a JSON file, or a CSV file. Automation tags are similarly specifiedusing the –automate option or otherwise expected in a sheet named ‘#automate’ by default.
Each part of the tagging system are explained below with examples. Examples using them with the extract command are in the Tutorial section and there are full run examples in the “examples” folder of the GitHub repository.
Export Tags
Tag rows are identified by putting ‘#tags’ in the left most column of the data file.
Typically, there will be nothing underneath of ‘#tags’ except ‘#ignore’ for rows to ignore.
You may need to add a blank column to the left of your data to accommodate this.
‘#tags’ always needs to be on the very left most column, even if your table has several blank columns before it starts.
Cells on this row will have tags over the columns to be exported.
Subsequent non-blank rows are interpreted as data unless the tag ‘#ignore’ is used in the left most column.
Tags are read and executed from left to right, so order can matter.
For example, an “id” field must be specified before any “child” fields.
Basic Example:
#tags |
#sample.id |
#.weight;#%units=”g” |
#.protocol.id |
|---|---|---|---|
#ignore |
Sample ID |
Weight |
Treatment |
… |
… |
… |
Keyword Tags
#tags - identifies tag header rows. Must be in the first column of a row.
Additional tag-value pairs in a cell with #tags are applied to all child records in this section, but are ignored for non-child records.
#transpose - indicates that the table below this tag needs to be transposed to align with the provided tags.
Should be specified after “#tags”, i.e. “#tags;#transpose”.
#ignore - ignore this row. Must be put in the first column of a row.
Use this to ignore additional header explanation rows and unused data records within a block of records.
#table - change the working table to the indicated table name.
Column Field Format
Tags begin with a ‘#’ symbol and identify a specific relational database table, field, and optionally a field-attribute.
Format: #table_name.field_name
Example: #sample.dry_weight
sample - the sample data relational table (workbench worksheet)
dry_weight - the field name.
Format: #table
Example: #sample
sample - the sample data relational table (workbench worksheet)
Format: #.field_name
Example: #.dry_weight
dry_weight - the field name, for the table last identified.
Format: #table_name.field_name%attribute_name
Example: #sample.dry_weight%units
sample - the sample data relational table (workbench worksheet)
dry_weight - the field name.
units - units attribute for the field.
Format: #.field_name%attribute_name
Example: #.dry_weight%units
dry_weight - the field name, for the last table identified.
units - units attribute for the field.
Format: #%attribute_name
Example: #%units
units - units attribute for a field previously identified in a table previously identified.
Value Field Format
Values for a tag field can be specified directly using an equal sign “=”.
Format: #table_name.field_name=field_value
Example: #study.title=”labeled mouse study 278”
study - the study table.
“labeled mouse study 278” - value for the study title field.
Note that there is a special case for “id” fields. “id” fields cannot be set with an “=”.
Format: #table_name.field_name%attribute_name=attribute_value
Example: #sample.dry_weight%units=g
sample - sample table
dry_weight - dry_weight field
units - attribute for dry weight with a value of “g”.
Multiple tags can be specified in a single cell using a semicolon “;”.
However, only one tag can be left without a direct value for unambiguous interpretation of a column.
Format: #table_name.field_name;#table_name.field_name=field_value;…
Example: #sample.dry_weight;#.dry_weight%units=mg
sample - the sample data table.
dry_weight - field for the column interpretation
units - field with a direct value of “mg”.
Values can be combined into a single value using a plus sign “+”.
Format: #table_name.field_name;#table_name.field_name=field_value+field_value
Example: #study.title=”labeled mouse study 278 “+#.type
study - the study table.
“labeled mouse study 278 “ - part of value for the study title field.
#.type - part of value of the study title field taken from the study type field.
ID Field Format
ID tags indicate an identifier for a record in a table.
The ID must uniquely identify a record.
There must be an ID tag in every tag row.
Using the same ID in multiple tag rows will add to the existing record.
Fields with the same name for the same ID across multiple tag rows will automatically become list fields, and new values will be added to the list.
Records can have record IDs from other tables or the same table as fields.
Format: #table.id
Example: #sample.id
sample - type of id tag. This often is a table name.
id - indicates that this is an id tag.
Format: #.id
Short format can be used when the table is already specified.
List Field Format
List field tags begin with ‘*#’ (asterisk followed by the pound sign) and identify a specific relational database table, field, and optionally a field-attribute that has multiple values.
They have the same format as normal column field tags.
Format: *#table_name.field_name
Example: *#sample.dry_weight
sample - the sample data relational table (workbench worksheet)
dry_weight - the field name.
Individual values are separated by commas “,” both in the column cells or in the column tag value.
Format: *#table_name.field_name=field_value,field_value,…
Example: *#study.labeling=13C,15N
study - the study table.
13C,15N - two values for the study labeling field.
List field tags can be listed multiple times in a record, with each value(s) appended.
Child Tag Format
Child record tags provides a mechanism for indicating parent-child relationships between records in the same table.
The child tag indicates the creation of a new record.
Subsequent normal tags identify fields in the new child record.
A special parent_id field is added using the first ID tag indicated in the header row.
Format: #%child.id=id_sub_string
The value for id_sub_string will be appended to the ID of the child’s parent (parent_id) to create the child ID.
Example 1:
#tags |
#sample.id |
#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg |
#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg |
KO labelled_1 |
4.2 |
8.5 |
|
KO labelled_2 |
4.7 |
9.7 |
|
… |
… |
… |
Output JSON:
{
"sample": {
"KO labelled_1": {
"id": "KO labelled_1"
},
"KO labelled_1-media-0h": {
"dry_weight": "4.2",
"dry_weight%units": "mg",
"id": "KO labelled_1-media-0h",
"parent_id": "KO labelled_1"
},
"KO labelled_1-media-3h": {
"dry_weight": "8.5",
"dry_weight%units": "mg",
"id": "KO labelled_1-media-3h",
"parent_id": "KO labelled_1"
},
"KO labelled_2": {
"id": "KO labelled_2"
},
"KO labelled_2-media-0h": {
"dry_weight": "4.7",
"dry_weight%units": "mg",
"id": "KO labelled_2-media-0h",
"parent_id": "KO labelled_2"
},
"KO labelled_2-media-3h": {
"dry_weight": "9.7",
"dry_weight%units": "mg",
"id": "KO labelled_2-media-3h",
"parent_id": "KO labelled_2"
}
}
}
Example 2: This is not the expected way to use child tags, but it can work if you only need to capture the parent child relationship.
#tags |
#sample.id |
#%child.id;#.field1=value1 |
#%child.id;#.field1=value2 |
KO labelled_1 |
child1 |
child2 |
Output JSON:
{
"sample": {
"child1": {
"field1": "value1",
"id": "child1",
"parent_id": "KO labelled_1"
},
"-child2": {
"field1": "value2",
"id": "child2",
"parent_id": "KO labelled_1"
},
"KO labelled_1": {
"id": "KO labelled_1"
}
}
}
crecord Tags
Short for column record, this tag provides a way to indicate that a column should become a new record.
Similar to child tags, but without the parent-child relationship.
All fields specified to the left of the crecord tag will be included in the newly created records, but each row in the crecord column will create a new record.
In the case of multiple crecord tags, additional fields to the right of the crecord will only be added to the crecord on the left of that field.
An example that utilizes automation tags is in the Common Use Case Examples.
Example:
#tags |
#measurement.assignment |
#.ion_species |
#.inchi_key |
#.msi_level |
#.m_z |
#.ret_time |
#.esi_mode |
#%crecord.id=#.assignment+”-sample1”;#.intensity;#.intensity%units=”peak area” |
#%crecord.id=#.assignment+”-sample2”;#.intensity;#.intensity%units=”peak area” |
|---|---|---|---|---|---|---|---|---|---|
#ignore |
annotation |
ion species |
InChiKey |
MSI level |
m/z |
ret.time (min) |
ESI mode |
1 |
2 |
(dehydro)ornithine |
[M+H]+ |
KFZHLVSVDLIACX-UHFFFAOYSA-N |
3 |
131.0872 |
2.363 |
ESI pos |
349 |
995 |
|
.beta.-Nicotinamide adenine dinucleotide |
[M+2H]2+ |
BAWFJGJZGIEFAR-OWXIGJDBSA-N |
2 |
332.5639 |
2.423 |
ESI pos |
229 |
1030 |
Output JSON:
{
"measurement": {
"(dehydro)ornithine-sample1": {
"assignment": "(dehydro)ornithine",
"esi_mode": "ESI pos",
"id": "(dehydro)ornithine-sample1",
"inchi_key": "KFZHLVSVDLIACX-UHFFFAOYSA-N",
"intensity": "349",
"intensity%units": "peak area",
"ion_species": "[M+H]+",
"m_z": "131.0872",
"msi_level": "3",
"ret_time": "2.363"
},
"(dehydro)ornithine-sample2": {
"assignment": "(dehydro)ornithine",
"esi_mode": "ESI pos",
"id": "(dehydro)ornithine-sample2",
"inchi_key": "KFZHLVSVDLIACX-UHFFFAOYSA-N",
"intensity": "995",
"intensity%units": "peak area",
"ion_species": "[M+H]+",
"m_z": "131.0872",
"msi_level": "3",
"ret_time": "2.363"
},
".beta.-Nicotinamide adenine dinucleotide-sample1": {
"assignment": ".beta.-Nicotinamide adenine dinucleotide",
"esi_mode": "ESI pos",
"id": ".beta.-Nicotinamide adenine dinucleotide-sample1",
"inchi_key": "BAWFJGJZGIEFAR-OWXIGJDBSA-N",
"intensity": "229",
"intensity%units": "peak area",
"ion_species": "[M+2H]2+",
"m_z": "332.5639",
"msi_level": "2",
"ret_time": "2.423"
},
".beta.-Nicotinamide adenine dinucleotide-sample2": {
"assignment": ".beta.-Nicotinamide adenine dinucleotide",
"esi_mode": "ESI pos",
"id": ".beta.-Nicotinamide adenine dinucleotide-sample2",
"inchi_key": "BAWFJGJZGIEFAR-OWXIGJDBSA-N",
"intensity": "1030",
"intensity%units": "peak area",
"ion_species": "[M+2H]2+",
"m_z": "332.5639",
"msi_level": "2",
"ret_time": "2.423"
}
}
}
Field Tracking Tags
Field tracking tags provide a mechanism for copying the latest field value from one table into the records of another.
The tag indicates which table’s records to add to and which field to track from another table.
The latest value for the field seen while parsing will be added.
Useful for adding project and study ids to records in a document with multiple projects or studies.
Format: #table%track=table.field
Example: #sample%track=project.id
will add the project.id field to every sample record.
A list format can also be used:
#table%track=table.field1,table.field2,…
Example: #sample%track=project.id,study.id
Fields can also be untracked after tracking to stop adding the field to records.
Format: #table%untrack=table.field
The list format also works for untrack.
If a tracked field is specifically given in a table the given value is used over the tracked value.
Example:
#tags |
#sample%track=project.id |
#tags |
#project.id |
Project 1 |
|
#tags |
#sample.id |
01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1 |
|
02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2 |
Output JSON:
{
"project": {
"Project 1": {
"id": "Project 1"
}
},
"sample": {
"01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1": {
"id": "01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1",
"project.id": "Project 1",
},
"02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2": {
"id": "02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2",
"project.id": "Project 1",
}
}
}
Transposed Data
The transpose tag provides a way to read in data where the headers are row-wise instead of column-wise.
The transpose tag assumes there is a header column in the 2nd position from the left and inserts an ignore tag in the first row after transposing the data.
Example:
#tags;#transpose |
#sample.id |
#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg |
#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg |
Sample Name |
KO labelled_1 |
KO labelled_2 |
|
Dry Weight 0h |
4.2 |
4.7 |
|
Dry Weight 3h |
8.5 |
9.7 |
Will become:
#tags |
#sample.id |
#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg |
#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg |
#ignore |
Sample Name |
Dry Weight 0h |
Dry Weight 3h |
KO labelled_1 |
4.2 |
8.5 |
|
KO labelled_2 |
4.7 |
9.7 |
Modification Tags
Similar to export tags, the modification tag rows are indicated by #tags in the left most column, and #ignore can be used to ignore rows. The general idea behind the modification system is that you first use tags to indicate a field in a table to match to. Then underneath that tag indicate the value in that field to match to. Then another tag in the same row will indicate both what field to modify in the record that has the matching field and what modification to do. Underneath that tag will have the value to do the modification with.
Basic Examples:
#tags |
#measurement.compound.value |
#measurement.compound.assign |
#match=all |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid |
||
… |
… |
This example replaces the “compound” field with value “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd” in “measurement” table records with “(S)-2-Acetolactate Glutaric acid Methylsuccinic acid”.
#tags |
#measurement.compound.value |
#measurement.id.regex |
#match=all |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
r’\(S\)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd’,r’(S)-2-Acetolactate Glutaric acid Methylsuccinic acid’ |
||
… |
… |
This example does a regex substitution on the “id” field of records in the “measurement” table if their “compound” field matches “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd”. In the “id” field “(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd” is substituted with “(S)-2-Acetolactate Glutaric acid Methylsuccinic acid”.
Value Tag
All modification tag rows must start with a value tag after #tags.
This tag indicates which table and field to compare with.
The value underneath the tag will be compared with the value in the indicated field for all of the records in the indicated table to determine if a match is made.
Signature:
#[table_name].field_name[%attribute].value
Comparison Type
The comparison type is controlled through the #comparison tag. It can be used to specify the type for each row individually, or for the whole column by using the = sign. Ex. #comparison=exact
There are 4 comparison types, “exact”, “regex”, “levenshtein”, and “regex|exact”.
exact is a simple exact comparison between the comparison value and field value. The given comparison value must be exactly what is in the field value (i.e. a “==” comparison).
regex expects a regex to be in the comparison value and will print a message if it is not. The given regex will be delivered to re.search() for matching with field values.
levenshtein calculates the levenshtein distance between the comparison value and record field values and matches to the field values with the minimum distance.
This means it always matches to something even if the values seem wildly different, so be aware of possible unexpected results.
regex|exact is an intelligent combination of regex and exact. If a regex is specified for the comparison value it will be detected and the type will be regex for that comparison value only, otherwise the comparison type will be exact.
If a type is specified then that type is used regardless of the comparison value, so a regex string with an exact comparison type will try to match exactly.
If the comparison tag is not specified, then the type defaults to “regex|exact”.
exact type modifications are executed first, then regex type, and lastly levenshtein type. Within each type, modifications are executed in order from first to last.
Match Type
Match behavior can be altered further using the #match tag. It can be used to specify the type for each row individually, or for the whole column by using the = sign. Ex. #match=all
There are 4 match types, “first”, “first-nowarn”, “unique”, and “all”.
“first” - the modification is performed only for the first record matched, additional matches beyond the first will print a warning.
“first-nowarn” - the same as first, but won’t print warnings.
“unique” - the modification is only performed if 1 and only 1 record matched.
For levenshtein, this means that only 1 field value can have the minimum distance, if 2 values share the minimum distance then the action won’t take place.
“all” - the modification is done to every record that matches.
If the match tag is not specified, then the type defaults to “first”.
Modifications
There are 6 modifications that can be done, “assign”, “append”, “prepend”, “regex”, “delete”, and “rename”.
assign - will overwrite whatever value is in the field with the indicated assignment value.
Signatures:
#[table_name].field_name[%attribute].assign
*#[table_name].field_name[%attribute].assign
If the indicated assignment field does not exist in the record then it will be added to the record.
An eval function can be used in the form “eval(…)”.
“#field_name#” and “#r’…’#” can be used to construct the assignment value for the record.
All Python language operators can be used. But remember to use “float(#field_name#)” to convert strings to floating point numbers.
Example: eval(float(#intensity#) / float(#normalization#) * 5)
evals that return a list of strings will convert the field to a list field.
Add an asterisk, ‘*’, to the front of the tag to interpret the assignment value as a list and assign that list value to the field.
An eval function can be used, but it must return a list of strings.
The assign modification can be used to change list types to non list types and vice versa.
This can lead to an issue where some records have a list type for the field and some do not.
If that is not intended, then be sure to construct the assign tag such that it matches the type of the field.
For instance, make sure evals return a list if the field should be a list type.
Example:
#tags
#measurement.compound.value
#measurement.compound.assign
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid
…
…
append - will append the indicated value to the value in the indicated field.
Signatures:
#[table_name].field_name[%attribute].append
*#[table_name].field_name[%attribute].append
If the indicated append field does not exist in the record, then it will be added to the record.
If the field value is a list and the append value is not a list, then the append value will be appended to each value in the list.
Add an asterisk, ‘*’, to the front of the tag to interpret the append value as a list.
When the append value is a list, the behavior is more complicated.
For each value in the field value list, the append value in the append list at the same index will be appended to the field value.
Examples:
field_value = [“a”, “b”] append_value = [“c”, “d”] result = [“ac”, “bd”]
field_value = [“a”, “b”] append_value = [“c”, “d”, “e”] result = [“ac”, “bd”]
field_value = [“a”, “b”, “e”] append_value = [“c”, “d”] result = [“ac”, “bd”, “e”]
Example:
#tags
#measurement.compound.value
#measurement.sample.id.append
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid
…
…
prepend - will prepend the indicated value to the value in the indicated field.
Signatures:
#[table_name].field_name[%attribute].prepend
*#[table_name].field_name[%attribute].prepend
If the indicated prepend field does not exist in the record, then it will be added to the record.
If the field value is a list and the prepend value is not a list, then the prepend value will be prepended to each value in the list.
Add an asterisk, ‘*’, to the front of the tag to interpret the prepend value as a list.
When the prepend value is a list, the behavior is more complicated.
For each value in the field value list, the prepend value in the prepend list at the same index will be prepended to the field value.
Examples:
field_value = [“a”, “b”] prepend_value = [“c”, “d”] result = [“ca”, “db”]
field_value = [“a”, “b”] prepend_value = [“c”, “d”, “e”] result = [“ca”, “db”]
field_value = [“a”, “b”, “e”] prepend_value = [“c”, “d”] result = [“ca”, “db”, “e”]
Example:
#tags
#measurement.compound.value
#measurement.sample.id.prepend
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid
…
…
regex - will do a regex substitution on the indicated field using the indicated values.
Signatures:
#[table_name].field_name[%attribute].regex
If the indicated regex field does not exist in the record, then a warning will be printed.
If the field value is a list, then the regex substitution will be done on each element in the list.
Example:
#tags
#measurement.compound.value
#measurement.id.regex
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd
r’\(S\)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd’,r’(S)-2-Acetolactate Glutaric acid Methylsuccinic acid’
…
…
delete - will remove the field from the record.
Signatures:
#[table_name].field_name[%attribute].delete
“id” fields cannot be deleted. An error will be raised during parsing if it is attempted.
No value is needed under the tag.
Example:
#tags
#measurement.compound.value
#measurement.mol_formula.delete
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd
…
…
rename - will rename the field in the record.
Signatures:
#[table_name].field_name[%attribute].rename.field_name[%attribute]
“id” fields cannot be renamed. An error will be raised during parsing if it is attempted.
Fields cannot be renamed to the same name. An error will be raised during parsing if it is attempted.
No value is needed under the tag.
Example:
#tags
#measurement.compound.value
#measurement.mol_formula.rename.molecular_formula
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd
…
…
Note
ID fields are special.
Modifications to id fields will be propagated to the table key of that record.
They cannot be deleted or renamed.
They are the only time a period, ‘.’, can appear in the field_name.
Ex. #measurement.intensity.units.assign is malformed, but #measurement.entity.id.assign is not.
Important Points:
Tags in the same row must have the same table. An error will be raised during parsing if they don’t.
The value tag must be before the modification tags.
Modifications are confined to the matched record. It is not possible to modify a record based on another record’s fields or values.
Modifications can be chained together, so that the same field can have multiple modifications.
This can be utilized effectively, but can also cause hard to diagnose unexpected output.
Some warnings are printed when fields are modified twice in a way that doesn’t make sense, such as a delete modification after an assign modification, but all other chained modifications are assumed to be intended.
Tag Format Reference:
#table_name.field_name[%attribute].value - identifies table_name containing field_name (with possible attribute name) and associated column with value to match. A regular expression can be given with r’…’.
#comparison - identifies column with type of comparison (exact, regex, levenshtein, regex|exact). Default regex|exact.
#comparison=type - type of comparison (exact, regex, levenshtein, regex|exact).
#match - identifies column with type of match (first, first-nowarn, unique, all). Default first.
#match=type - type of match (first, first-nowarn, unique, all).
#[table_name].field_name[%attribute].assign identifies field to assign and associated column with its value.
*#[table_name].field_name[%attribute].assign identifies field to assign and that the associated column values are a list type.
#[table_name].field_name[%attribute].append identifies field to append to and associated column with its value.
*#[table_name].field_name[%attribute].append identifies field to append to and that the associated column values are a list type.
#[table_name].field_name[%attribute].prepend identifies field to prepend to and associated column with its value.
*#[table_name].field_name[%attribute].prepend identifies field to prepend to and that the associated column values are a list type.
#[table_name].field_name[%attribute].regex identifies field to apply regex substitution to and associated column with the pair of regex strings of the form r”…”,r”…”.
#[table_name].field_name[%attribute].delete identifies field to delete.
#[table_name].field_name[%attribute].rename.field_name[%attribute] identifies field to rename.
Automation Tags
Automation tag rows, like the other tag rows, are indicated by #tags in the left most column with #ignore used to ignore rows, but #insert and #end tags are also introduced. There are 2 main functions in the automation system. One is to specify a table of header-tag pairs that will be used to automatically add the tags associated with the headers underneath of the headers when it finds them in the data. The other is to specify a block of rows to add to the data exactly as is. This is what introduces the #insert and #end tags.
Insert
The insertion functionality is easy to understand. You simply write whatever you want to add into the data and add #insert above it in the left most column and #end below it in the left most column. Everything in between #insert and #end is simply added as is into the data before it is parsed by the export tagging. A good use case for this is when you have a standard protocol that always needs to be added to some data. Instead of copying it in manually, you can add it to an automation sheet/file and deliver it to the extract command so it can add it for you. The thing to be careful of is to make sure everything in the insert block is valid under the export tagging. It can be tricky to debug a tagging error here because extract won’t be able to tell you that the issue is in the insert block.
Example:
#insert |
||||||
#tags |
#protocol.id |
#.type |
#.instrument |
#.ion_mode |
#.ionization |
#.instrument_type |
ICMS1 |
MS |
Orbitrap Fusion |
NEGATIVE |
ESI |
IC-FTMS |
|
#end |
Header Tagging
The header tagging allows you to automatically put export tags under a cell in tabular data based on the value in the cell. Typically, a table will already have descriptive human readable headers to identify what type of data is in the column. These headers are used to match to and put the associated export tags under them. Any row that has a header match where export tags are added is automatically ignored with the #ignore tag. Just like modification tags and export tags, #tags is used to denote the start of a tag block. An entire block is matched as a whole to a row in the data, so if you have multiple tables to add tags to you should created multiple tag blocks. There are additional tags to help control how a blank is matched, detailed below.
Example:
Data:
Compound |
Mol_Formula |
C_isomers |
SamplID |
Intensity |
protein_mg |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
0 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
7989221.834 |
0.618176844244679 |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
1 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
289287.7334 |
0.618176844244679 |
Header Tags:
#tags |
#header |
#add |
Compound+”-13C”+C_isomers+”-“+SamplID |
#measurement.id |
|
Compound+”-13C”+C_isomers |
#measurement.assignment |
|
Compound |
#measurement.compound |
|
Mol_Formula |
#measurement.formula |
|
SamplID |
#sample.id |
|
“13C”+C_isomers |
#measurement.isotopologue;#%type=”13C” |
|
Intensity |
#measurement.raw_intensity;#%type=”spectrometer peak area” |
|
eval(float(#Intensity#) / float(#protein_mg#)) |
#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g” |
|
#protocol.id=ICMS1 |
After Automation:
#tags |
#measurement.id |
#measurement.assignment |
#measurement.compound |
#measurement.formula |
#sample.id |
#measurement.isotopologue;#%type=”13C” |
#measurement.raw_intensity;#%type=”spectrometer peak area” |
#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g” |
#protocol.id=ICMS1 |
|
#ignore |
Compound |
Mol_Formula |
C_isomers |
SamplID |
Intensity |
|||||
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0 |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
0 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
13C0 |
7989221.834 |
12923845.19 |
||
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1 |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
1 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
13C1 |
289287.7334 |
467969.2165 |
JSON Output:
{
"measurement": {
"(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench": {
"assignment": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C0",
"compound": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd",
"formula": "C5H8O4",
"id": "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench",
"intensity": "12923845.19",
"intensity%type": "natural abundance corrected and protein normalized peak area",
"intensity%units": "area/g",
"isotopologue": "13C0",
"isotopologue%type": "13C",
"protocol.id": "ICMS1",
"raw_intensity": "7989221.83386388",
"raw_intensity%type": "spectrometer peak area",
"sample.id": "01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench"
},
"(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench": {
"assignment": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd-13C1",
"compound": "(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd",
"formula": "C5H8O4",
"id": "(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1-01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench",
"intensity": "467969.2165",
"intensity%type": "natural abundance corrected and protein normalized peak area",
"intensity%units": "area/g",
"isotopologue": "13C1",
"isotopologue%type": "13C",
"protocol.id": "ICMS1",
"raw_intensity": "289287.733437356",
"raw_intensity%type": "spectrometer peak area",
"sample.id": "01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench"
}
}
}
The Tags
#header - header value to match to. Can be a regular expression of the form r’…’.
All tag blocks must contain a #header tag.
Additional column headers can be included with blank corresponding tags to help make header row identification unique.
Cell contents are stripped of leading and trailing white space before comparison with the header value.
A new column will be created if the headers, literals in quotes, and/or regular expressions are combined with plus signs.
Example: Name+”-“+Isopotologue+”-“+r’^\d+\w+ Isotope$’
This functionality means that certain characters can’t be used for literal matching outside of a regex.
For example, if a header name in a data table is “protein+solvent”, then you can’t simply put protein+solvent under #header because it will be interpreted as a concatenation of a “protein” header and a separate “solvent” header.
The easiest way to solve this issue is to use a regular expression. r’protein\+solvent’ will match the header correctly.
In general, if you are having difficulty matching a header, try using a regex.
If multiple headers or regular expressions in the #header column match the same header in the data, then there will be an error because 2 sets of tags cannot be placed in the same spot.
The above does not apply when creating a new column, however. The same column in the data can be used to create multiple new columns.
An eval function can be used in the form “eval(…)”.
“#header_name#” and “#r’…’#” can be used to indicate specific columns in the row.
All Python language operators can be used. But remember to use “float(#field_name#)” to convert strings to floating point numbers.
Example: eval(float(#Intensity#) / float(#r’.*Normalization’#) * 5)
If the eval returns a list of items, it is converted into a string separated by semicolons.
If the corresponding tag is a list tag, then this will become a list.
#add - tags to add in an inserted row below the column header row.
All tag blocks must contain a #add tag.
The actual value under #add does not have to be a valid tag, the value will be copied as is.
Leave this value blank to add headers that are required to match a block, but don’t need tags.
“#HEADER#” and “#INCREMENT#” are 2 reserved words that can be used to build tags based on the headers.
#HEADER# will be replaced with the value of the header that was matched to the value given in #header.
#INCREMENT# will be replaced with the number of the match.
For example, if the value in #header is r’Sample_\d’, the value for #INCREMENT# will be 1 for first header to match, 2 for the second, etc.
The number only goes up when the exact same header to be created tries to be created.
For example, in the below example the increment went to 2 because there were 2 “G” columns in the data so #measurment.field3_G was attempted to be created twice, so there is a field3_G_1 and field3_G_2.
Example:
Data:
A
B
C
D
E
F
G
G
a1
b1
c1
d1
e1
f1
g1
g2
Header Tags:
#tags
#header
#add
#allow_duplicates
A
#measurement.id
r’F|G’
#measurement.field3_#HEADER#_#INCREMENT#
true
r’B|C’+r’F|G’
#measurement.#HEADER#_#INCREMENT#
true
After Automation:
#tags
#measurement.id
#measurement.field3_F_1
#measurement.field3_G_1
#measurement.field3_G_2
#measurement.BF_1
#measurement.BG_1
#measurement.BG_2
#measurement.CF_1
#measurement.CG_1
#measurement.CG_2
#ignore
A
B
C
D
E
F
G
G
a1
b1
c1
d1
e1
f1
g1
g2
b1f1
b1g1
b1g2
c1f1
c1g1
c1g2
This is a contrived example to illustrate how the reserved words work, but a more common way to use these are in the Common Use Case Examples below. Note that the header “r’B|C’+r’F|G’” created 6 new columns instead of just 1 because a new one is created for each combination of matches to the regular expressions.
#required - true|false whether this header description is required. The default is that all header descriptions are required.
Example:
#tags
#header
#add
#required
Compound+”-13C”+C_isomers+”-“+SamplID
#measurement.id
true
Compound+”-13C”+C_isomers
#measurement.assignment
true
Compound
#measurement.compound
true
Mol_Formula
#measurement.formula
false
SamplID
#sample.id
true
If the Mol_Formula header is not found, the tags will still be added, but without the Mol_Formula ones.
#allow_duplicates# - true|false whether this header description can match multiple times in the same row and add tags to each match. The default is that duplicates are not allowed.
If this is true, then #HEADER# and/or #INCREMENT# must be used in the tag(s) under #add to ensure the uniqueness of tags.
See the example above in the #add section to see an example using this tag.
#exclude=test_string - test string or regular expression to use for excluding a given header row.
If a header matches the exclude string or regex, then the tags are not inserted regardless of whether the headers match.
Example:
#tags
#header
#add
#exclude=r’Cell Type|Mouse Species’
Compound+”-13C”+C_isomers+”-“+SamplID
#measurement.id
Compound+”-13C”+C_isomers
#measurement.assignment
Compound
#measurement.compound
Mol_Formula
#measurement.formula
SamplID
#sample.id
If the “Cell Type” or “Mouse Species” headers are in the row, then don’t add the tags.
Insertions can be inside of header tag blocks, so they are inserted only when a match is made to the header rows. They function just like when they are on their own, except that there is an additional #multiple tag that can be used to control whether the insert happens every time the tag block matches or only the first time.
#multiple=true - will insert on every match.
#multiple=false - will only insert on the first match, this is the default behavior if #multiple is not specified.
#transpose - indicates that the table to add tags to is row wise instead of column wise.
Example:
Data:
Sample Name
KO labelled_1
KO labelled_2
Dry Weight 0h
4.2
4.7
Dry Weight 3h
8.5
9.7
Header Tags:
#tags;#transpose
#header
#add
Sample Name
#sample.id
Dry Weight 0h
#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg
Dry Weight 3h
#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg
After Automation:
#tags;#transpose
#sample.id
#%child.id=-media-0h;#.dry_weight;#.dry_weight%units=mg
#%child.id=-media-3h;#.dry_weight;#.dry_weight%units=mg
Sample Name
KO labelled_1
KO labelled_2
Dry Weight 0h
4.2
4.7
Dry Weight 3h
8.5
9.7
#copy - indicates that the table to add tags to should be copied before adding tags.
This is useful for when you need to use the same data multiple times to create new records.
The below example uses one core table to create a parent entities, children, and grandchildren.
Example:
Data:
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
Header Tags:
#tags;#copy
#header
#add
patient
#entity.id;#.type=sample;#.sample_type=experimental
Sample volume
#entity.resected_tissue_weight;#%units=mg
*#.protocol.id=resected_tissue_collection,tissue_quench,frozen_tissue_slice
#entity%child.id=-polar_extraction;#.type=sample;#.sample_type=experimental;*#.protocol.id=polar_extraction
#tags;#copy
#header
#add
patient+’-polar_extraction’
#entity.id
#entity%child.id=-metabolism;#.type=sample;*#.protocol.id=GC-MS_preparation;#.injection_volume=0.5;#%units=uL;#.reconstitution_volume=100;#%units=uL;#.extracted_volume=110;#%units=uL
After Automation:
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
#tags
#entity.id
#entity%child.id=-metabolism;#.type=sample;*#.protocol.id=GC-MS_preparation;#.injection_volume=0.5;#%units=uL;#.reconstitution_volume=100;#%units=uL;#.extracted_volume=110;#%units=uL
#ignore
patient
Sample volume
Patient 1
10.8 mg
Patient 1-polar_extraction
Patient 2
25.2 mg
Patient 2-polar_extraction
#tags
#entity.id;#.type=sample;#.sample_type=experimental
#entity.resected_tissue_weight;#%units=mg
*#.protocol.id=resected_tissue_collection,tissue_quench,frozen_tissue_slice
#entity%child.id=-polar_extraction;#.type=sample;#.sample_type=experimental;*#.protocol.id=polar_extraction
#ignore
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
JSON Output:
{ "entity": { "Patient 1": { "id": "Patient 1", "protocol.id": [ "resected_tissue_collection", "tissue_quench", "frozen_tissue_slice" ], "resected_tissue_weight": "10.8 mg", "resected_tissue_weight%units": "mg", "sample_type": "experimental", "type": "sample" }, "Patient 1-polar_extraction": { "id": "Patient 1-polar_extraction", "parent_id": "Patient 1", "protocol.id": [ "polar_extraction" ], "sample_type": "experimental", "type": "sample" }, "Patient 1-polar_extraction-metabolism": { "extracted_volume": "110", "extracted_volume%units": "uL", "id": "Patient 1-polar_extraction-metabolism", "injection_volume": "0.5", "injection_volume%units": "uL", "parent_id": "Patient 1-polar_extraction", "protocol.id": [ "GC-MS_preparation" ], "reconstitution_volume": "100", "reconstitution_volume%units": "uL", "type": "sample" }, "Patient 2": { "id": "Patient 2", "protocol.id": [ "resected_tissue_collection", "tissue_quench", "frozen_tissue_slice" ], "resected_tissue_weight": "25.2 mg", "resected_tissue_weight%units": "mg", "sample_type": "experimental", "type": "sample" }, "Patient 2-polar_extraction": { "id": "Patient 2-polar_extraction", "parent_id": "Patient 2", "protocol.id": [ "polar_extraction" ], "sample_type": "experimental", "type": "sample" }, "Patient 2-polar_extraction-metabolism": { "extracted_volume": "110", "extracted_volume%units": "uL", "id": "Patient 2-polar_extraction-metabolism", "injection_volume": "0.5", "injection_volume%units": "uL", "parent_id": "Patient 2-polar_extraction", "protocol.id": [ "GC-MS_preparation" ], "reconstitution_volume": "100", "reconstitution_volume%units": "uL", "type": "sample" } } }
#filter - indicates that the data to be tagged should be filtered before being tagged.
If a #filter tag is used, the #copy tag must also be used.
The general structure of the #filter tag is #filter=expressions.
There are 3 kinds of expressions.
Match using an identical single value will only keep the rows where the indicated column has the indicated value.
Ex: column1:value1
Match using a regular expression will keep the rows where the value in the indicated column matches the given regular expression.
Ex: column1:r’your_regex’
Match using the keyword “unique” will keep the rows with first instance of a unique value in the indicated column.
Ex: column1:unique
You can put double quotes around an expression value to enforce an indentical match.
Ex: column1:”unique” will match only rows that have the value unique (without double quotes).
Logical and (&) and or (|) operators can be used to filter with multiple columns.
Ex: column1:unique & column2:value1
Example:
Data:
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
Header Tags:
#tags;#copy
#header
#add
#filter=patient:Patient 1
patient
#entity.id;#.type=sample;#.sample_type=experimental
Sample volume
#entity.resected_tissue_weight;#%units=mg
*#.protocol.id=resected_tissue_collection,tissue_quench,frozen_tissue_slice
After Automation:
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
#tags
#entity.id;#.type=sample;#.sample_type=experimental
#entity.resected_tissue_weight;#%units=mg
*#.protocol.id=resected_tissue_collection,tissue_quench,frozen_tissue_slice
#ignore
patient
Sample volume
Patient 1
10.8 mg
JSON Output:
{ "entity": { "Patient 1": { "id": "Patient 1", "protocol.id": [ "resected_tissue_collection", "tissue_quench", "frozen_tissue_slice" ], "resected_tissue_weight": "10.8 mg", "resected_tissue_weight%units": "mg", "sample_type": "experimental", "type": "sample" } } }
#sort - indicates that the data to be tagged should be sorted before being tagged.
If a #sort tag is used, the #copy tag must also be used.
The general structure of the #sort tag is #sort=column_name1:ascending,column_name2:descending.
The part after the “=” is a comma separated list of column headers and either “ascending” or “descending”, separated by a colon.
The order of the columns in the list is the order they will be sorted in.
Ex: #sort=column_name1:ascending,column_name2:descending will sort column_name1 in ascending order and then column_name2 in descending order.
Example:
Data:
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
Header Tags:
#tags;#copy
#header
#add
#sort=patient:descending
patient
#entity.id;#.type=sample;#.sample_type=experimental
Sample volume
#entity.resected_tissue_weight;#%units=mg
*#.protocol.id=resected_tissue_collection,tissue_quench,frozen_tissue_slice
After Automation:
patient
Sample volume
Patient 1
10.8 mg
Patient 2
25.2 mg
#tags
#entity.id;#.type=sample;#.sample_type=experimental
#entity.resected_tissue_weight;#%units=mg
*#.protocol.id=resected_tissue_collection,tissue_quench,frozen_tissue_slice
#ignore
patient
Sample volume
Patient 2
25.2 mg
Patient 1
10.8 mg
JSON Output:
{ "entity": { "Patient 1": { "id": "Patient 1", "protocol.id": [ "resected_tissue_collection", "tissue_quench", "frozen_tissue_slice" ], "resected_tissue_weight": "10.8 mg", "resected_tissue_weight%units": "mg", "sample_type": "experimental", "type": "sample" }, "Patient 2": { "id": "Patient 2", "protocol.id": [ "resected_tissue_collection", "tissue_quench", "frozen_tissue_slice" ], "resected_tissue_weight": "25.2 mg", "resected_tissue_weight%units": "mg", "sample_type": "experimental", "type": "sample" } } }
Note that the JSON output is not in descending order. This is because the JSON is sorted before saving so that there is consistency if key orderings are different. Sorting can still be used for other things such as sorting the order of lists.
Common Use Case Examples
For a real example you can see some in the examples folder of the GitHub repository under the extract folder. The following are excerpts from those examples.
Export Tags
Tagging Subject Entities
#tags |
#entity.id;#entity.species=”Mus musculus”;#.species_type=Mouse;#.taxonomy_id=10090;#.type=subject |
*#.protocol.id |
#entity.replicate |
#entity.time_point |
|||
|---|---|---|---|---|---|---|---|
#ignore |
mouse number |
protocol |
replicate |
time_point |
|||
01 |
01_A0_naive_0days_UKy_GCH_rep1 |
A0 |
naive |
1 |
0 |
||
02 |
02_A1_naive_0days_UKy_GCH_rep2 |
A1 |
naive |
2 |
0 |
||
03 |
03_A2_naive_0days_UKy_GCH_rep3 |
A2 |
naive |
3 |
0 |
||
04 |
04_B0_syngenic_42days_UKy_GCH_rep1 |
B0 |
syngenic |
1 |
42 |
||
05 |
05_B1_syngenic_42days_UKy_GCH_rep2 |
B1 |
syngenic |
2 |
42 |
||
06 |
06_B2_syngenic_42days_UKy_GCH_rep3 |
B2 |
syngenic |
3 |
42 |
||
07 |
07_C1-1_allogenic_42days_UKy_GCH_rep1 |
C1-1 |
allogenic |
1 |
42 |
||
08 |
08_C1-2_allogenic_42days_UKy_GCH_rep2 |
C1-2 |
allogenic |
2 |
42 |
You can see that the original header row is ignored with “#ignore” and the important columns are tagged with appropriate field names. Not all of the columns are tagged. What needs to be tagged or what is important will vary according to your needs and application. MESSES is largely meant for converting for deposition into a repository so information that is not important or relevant to the repository you are targeting will likely be untagged. Note that the *#.protocol.id tag is a list field even though it has single values underneath. This is be consistent with field types as other records can have multiple protocols associated with them.
Creating Samples From Parents
#tags |
#entity.id |
#entity%child.id=-polar-ICMS_A;#.replicate=1; #%type=”analytical”;#.weight; #%units=g;*#.protocol.id=polar_extraction,IC-FTMS_preparation; #entity.type=sample |
|||||||
|---|---|---|---|---|---|---|---|---|---|
#ignore |
Slice # |
Parent Sample ID |
polar tare (5ml tube) |
polar+tare |
g polar ext |
g polar FTMS A |
g polar FTMS B |
g polar ICMS A |
g polar ICMS B |
01 |
01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1 |
3.1476 |
4.8772 |
1.7296 |
0.0983 |
0.1056 |
0.2084 |
0.2083 |
|
02 |
02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2 |
3.1782 |
4.8490 |
1.6708 |
0.0983 |
0.0973 |
0.2001 |
0.1949 |
|
03 |
03_A2_Spleen_naive_0days_170427_UKy_GCH_rep3 |
3.1621 |
4.8485 |
1.6864 |
0.1029 |
0.1041 |
0.1951 |
0.2043 |
|
04 |
04_B0_Spleen_syngenic_42days_170427_UKy_GCH_rep1 |
3.1479 |
4.7828 |
1.6349 |
0.0938 |
0.0958 |
0.1885 |
0.1936 |
|
05 |
05_B1_Spleen_syngenic_42days_170427_UKy_GCH_rep2 |
3.1685 |
4.9067 |
1.7382 |
0.0938 |
0.0994 |
0.2010 |
0.2003 |
|
06 |
06_B2_Spleen_syngenic_42days_170427_UKy_GCH_rep3 |
3.1735 |
4.8483 |
1.6748 |
0.1003 |
0.0914 |
0.2000 |
0.1891 |
|
07 |
07_C1-1_Spleen_allogenic_42days_170427_UKy_GCH_rep1 |
3.1764 |
4.7631 |
1.5867 |
0.0980 |
0.0916 |
0.1859 |
0.1868 |
|
08 |
08_C1-2_Spleen_allogenic_42days_170427_UKy_GCH_rep2 |
3.1617 |
4.8176 |
1.6559 |
0.0955 |
0.0957 |
0.1982 |
0.1922 |
The entities in the 3rd column under the #entity.id tag are assumed to exist elsewhere. The %child tag is used to create a new sample that captures the weight of the polar extraction for the ICMS_A aliquot. The other aliquots are not captured because they are not needed or were not used for the study that is being deposited, but it is a good exercise for the reader to add tags to those columns based on the one shown for ICSM_A.
Modification and Automation Tags
The examples here will both use a common set of measurement data.
Measurement Data:
Compound |
Mol_Formula |
C_isomers |
SamplID |
Intensity |
Renormalized |
Total.NA.Removed |
Corrected |
Predicted |
Comments |
Quantified_uM_ratio |
Quantified_uM_sequence_ratio |
reconstitution_volume_uL |
injection_volume_uL |
protein_mg |
icms_split_ratio |
Amount_ProteinAdj_uMol_g_protein_RatioBased |
Amount_ProteinAdj_uMol_g_protein_SequenceBased |
CommentQuantification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
0 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
7989221.834 |
8447352.892 |
8490014.365 |
8447352.892 |
7839899.288 |
NA |
0 |
0 |
20 |
10 |
0.618176844 |
0.25575733 |
0 |
0 |
Legend |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
1 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
289287.7334 |
0 |
8490014.365 |
0 |
439597.5524 |
NA |
0 |
0 |
20 |
10 |
0.618176844 |
0.25575733 |
0 |
0 |
Compound: name of assigned metabolite, noStd means assigment was NOT verified with standard compound |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
2 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
43815.55259 |
34916.99767 |
8490014.365 |
34916.99767 |
42996.61803 |
NA |
0 |
0 |
20 |
10 |
0.618176844 |
0.25575733 |
0 |
0 |
Mol_Formula: molucular formula od assigned compound |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
3 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
6555.515601 |
5426.466785 |
8490014.365 |
5426.466785 |
6432.98974 |
NA |
0 |
0 |
20 |
10 |
0.618176844 |
0.25575733 |
0 |
0 |
C_isomers and/or N_isomers: additional Da units that were measured in a targeted manner for the compound in question |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
4 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
2049.939279 |
1939.047016 |
8490014.365 |
1939.047016 |
2011.62489 |
NA |
0 |
0 |
20 |
10 |
0.618176844 |
0.25575733 |
0 |
0 |
Intensity: raw XIC intensity data |
(S)-2-Acetolactate_Glutaric acid_Methylsuccinic acid_MP_NoStd |
C5H8O4 |
5 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
401.1818948 |
378.961431 |
8490014.365 |
378.961431 |
393.6836049 |
NA |
0 |
0 |
20 |
10 |
0.618176844 |
0.25575733 |
0 |
0 |
Renormalized: Natural abundance Corrected XIC intensity data |
13-BPG |
C3H8O10P2 |
0 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
348209.1176 |
360055.818 |
560907.2249 |
360055.818 |
348209.1176 |
NA |
0.055515033 |
0.0225354 |
20 |
10 |
0.618176844 |
0.25575733 |
0.00702263 |
0.002850719 |
Amount_ProteinAdj_uMol_g_protein_RatioBased = amount of quantified isotoplogue in sample in umol/g protein?protein, calculated as Quantified_uM x (RecVol/1000/SplitRatio/mg protein) |
13-BPG |
C3H8O10P2 |
1 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
14984.22348 |
3343.128038 |
560907.2249 |
3343.128038 |
14984.22348 |
NA |
0.000515459 |
0.000209242 |
20 |
10 |
0.618176844 |
0.25575733 |
6.52E-05 |
2.65E-05 |
Amount_ProteinAdj_uMol_g_protein_SequenceBased = amount of quantified isotoplogue in sample in umol/g protein?protein, calculated as Quantified_uM x (RecVol/1000/SplitRatio/mg protein) |
13-BPG |
C3H8O10P2 |
2 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
11385.033 |
11305.71062 |
560907.2249 |
11305.71062 |
11385.033 |
NA |
0.001743166 |
0.000707609 |
20 |
10 |
0.618176844 |
0.25575733 |
0.00022051 |
8.95E-05 |
|
13-BPG |
C3H8O10P2 |
3 |
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
186328.8509 |
186202.5683 |
560907.2249 |
186202.5683 |
186328.8509 |
NA |
0.028709553 |
0.011654163 |
20 |
10 |
0.618176844 |
0.25575733 |
0.003631747 |
0.001474247 |
Not all of the columns will be used and this is just a small portion of a much larger dataset.
Adding a Protocol
#insert |
||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#tags |
#protocol.id |
#.type |
#.machine_type |
#.instrument |
#.ion_mode |
#.ionization |
#.instrument_type |
#.description |
#.chromatography_instrument_name |
#.chromatography_type |
#.column_name |
#.chromatography_description |
#.parent_protocol |
|
ICMS1 |
measurement |
MS |
Orbitrap Fusion |
NEGATIVE |
ESI |
IC-FTMS |
ICMS Analytical Experiment with detection of compounds by comparison to standards. |
Thermo Dionex ICS-5000+ |
Targeted IC |
Dionex IonPac AS11-HC-4um 2 mm i.d. x 250 mm |
Targeted IC |
IC-FTMS_measurement |
||
#end |
This would be one set of directives in the #automate sheet or file. The description tag has been shortened.
Automating Tags
#tags |
#header |
#add |
|
|---|---|---|---|
Compound+”-13C”+C_isomers+”-“+SamplID |
#measurement.id |
||
Compound+”-13C”+C_isomers |
#measurement.assignment |
||
Compound |
#measurement.compound |
||
Mol_Formula |
#measurement.formula |
||
SamplID |
#.entity.id |
||
“13C”+C_isomers |
#measurement.isotopologue;#%type=”13C” |
||
Intensity |
#measurement.raw_intensity;#%type=”spectrometer peak area” |
||
Renormalized |
#measurement.corrected_raw_intensity;#%type=”natural abundance corrected peak area” |
||
Quantified_uM_sequence_ratio |
#measurement.concentration;#%units=uM;#%type=”calculated from standard” |
||
Amount_ProteinAdj_uMol_g_protein_SequenceBased |
#measurement.normalized_concentration;#%type=”protein normalized”;#%units=”uMol/g” |
||
eval(float(#Renormalized#) / float(#protein_mg#)) |
#measurement.intensity;#%type=”natural abundance corrected and protein normalized peak area”;#%units=”area/g” |
||
#.protocol.id=ICMS1 |
You can see how the values in the #header column correspond to headers in the measurement data, and how the tags in the #add column will be newly created or added on top of the column in the measurement data.
Changing Entity IDs
#tags |
#measurement.entity.id.value |
#measurement.entity.id.assign |
#unique=false |
#match=all |
|---|---|---|---|---|
01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench |
01_A0_Colon_naive_0days_170427_UKy_GCH_rep1-polar-ICMS_A |
|||
02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench |
02_A1_Colon_naive_0days_170427_UKy_GCH_rep2-polar-ICMS_A |
|||
03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench |
03_A2_Colon_naive_0days_170427_UKy_GCH_rep3-polar-ICMS_A |
This has been truncated, but you can see how it will overwrite the entity.id field for all records that match the “.value” column. Something like this may be necessary when there is a name mismatch between metadata and measurement data.
Changing IDs
#tags |
#measurement.id.value |
#measurement.id.regex |
#unique=false |
#comparison=regex |
|---|---|---|---|---|
r’01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench’ |
r’01_A0_Colon_T03-2017_naive_170427_UKy_GCB_rep1-quench’,r’01_A0_Colon_naive_0days_170427_UKy_GCH_rep1-polar-ICMS_A’ |
|||
r’02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench’ |
r’02_A1_Colon_T03-2017_naive_170427_UKy_GCB_rep2-quench’,r’02_A1_Colon_naive_0days_170427_UKy_GCH_rep2-polar-ICMS_A’ |
|||
r’03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench’ |
r’03_A2_Colon_T03-2017_naive_170427_UKy_GCB_rep3-quench’,r’03_A2_Colon_naive_0days_170427_UKy_GCH_rep3-polar-ICMS_A’ |
This is similar to and partly because of the previous example that changes the entity.id field. You can see in the Automating Tags example that the id field is created with a concatenation of the column that will become the entity.id field (SamplID). In order to make things match we need to make the same change in the id field, but we can’t simply use an assign, thus the use of regex.
Adding Additional Fields
#tags |
#measurement.assignment.value |
#measurement.assignment%method.assign |
#unique=false |
#match=all |
|---|---|---|---|---|
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C0 |
database |
|||
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C1 |
database |
|||
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C2 |
database |
|||
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C3 |
database |
|||
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C4 |
database |
|||
(S)-2-Acetolactate Glutaric acid Methylsuccinic acid-13C5 |
database |
|||
13-BPG-13C0 |
direct_to_standard |
|||
13-BPG-13C1 |
indirect_to_standard |
|||
13-BPG-13C2 |
indirect_to_standard |
|||
13-BPG-13C3 |
indirect_to_standard |
This will add an additional “assignment%method” field to the measurement records whose “assignment” field matches the “.value” column.
The next example uses a new set of measurement data.
Automating crecords
Measurement Data:
identifier |
annotation |
ion species |
InChiKey |
MSI level |
m/z |
ret.time (min) |
ESI mode |
1 |
2 |
3 |
Pool |
Pool |
Blank |
Blank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2.36_131.09 |
(dehydro)ornithine |
[M+H]+ |
KFZHLVSVDLIACX-UHFFFAOYSA-N |
3 |
131.0872 |
2.36 |
ESI pos |
349 |
995 |
597 |
651 |
722 |
195 |
216 |
2.42_332.56 |
.beta.-Nicotinamide adenine dinucleotide |
[M+2H]2+ |
BAWFJGJZGIEFAR-OWXIGJDBSA-N |
2 |
332.5639 |
2.42 |
ESI pos |
229 |
1030 |
1482 |
1242 |
1409 |
6 |
6 |
Automation Tags:
#tags |
#header |
#add |
#allow_duplicates |
|---|---|---|---|
identifier |
#measurement.identifier |
||
annotation |
#measurement.assignment |
||
ion species |
#measurement.ion |
||
InChiKey |
#measurement.inchi_key |
||
MSI level |
#measurement.msi_level;#.msi_level%type=MSI_2014 |
||
m/z |
#measurement.moverz_quant |
||
ret.time (min) |
#measurement.retention_time;#%units=min |
||
ESI mode |
#measurement.esi_mode |
||
r’^d+$’ |
#%crecord.id=#.identifier+”-“+#.assignment+”-#HEADER#”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”#HEADER#”;#.protocol.id=LC-MS-MS1 |
TRUE |
|
r’^Pool|Blank$’ |
#%crecord.id=#.identifier+”-“+#.assignment+”-#HEADER#-#INCREMENT#”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”#HEADER#_#INCREMENT#”;#.protocol.id=LC-MS-MS1 |
TRUE |
After Automation:
#tags |
#measurement.identifier |
#measurement.assignment |
#measurement.ion |
#measurement.inchi_key |
#measurement.msi_level;#.msi_level%type=MSI_2014 |
#measurement.moverz_quant |
#measurement.retention_time;#%units=min |
#measurement.esi_mode |
#%crecord.id=#.identifier+”-“+#.assignment+”-1”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”1”;#.protocol.id=LC-MS-MS1 |
#%crecord.id=#.identifier+”-“+#.assignment+”-2”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”2”;#.protocol.id=LC-MS-MS1 |
#%crecord.id=#.identifier+”-“+#.assignment+”-3”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”3”;#.protocol.id=LC-MS-MS1 |
#%crecord.id=#.identifier+”-“+#.assignment+”-Pool-1”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”Pool_1”;#.protocol.id=LC-MS-MS1 |
#%crecord.id=#.identifier+”-“+#.assignment+”-Pool-2”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”Pool_2”;#.protocol.id=LC-MS-MS1 |
#%crecord.id=#.identifier+”-“+#.assignment+”-Blank-1”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”Blank_1”;#.protocol.id=LC-MS-MS1 |
#%crecord.id=#.identifier+”-“+#.assignment+”-Blank-2”+”-“+#.esi_mode;#.intensity;#.intensity%units=”peak height”;#.intensity%type=”peak height normalized to the sum of internal standards”;#.entity.id=”Blank_2”;#.protocol.id=LC-MS-MS1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#ignore |
identifier |
annotation |
ion species |
InChiKey |
MSI level |
m/z |
ret.time (min) |
ESI mode |
1 |
2 |
3 |
Pool |
Pool |
Blank |
Blank |
2.36_131.09 |
(dehydro)ornithine |
[M+H]+ |
KFZHLVSVDLIACX-UHFFFAOYSA-N |
3 |
131.0872 |
2.363 |
ESI pos |
349 |
995 |
597 |
651 |
722 |
195 |
216 |
|
2.42_332.56 |
.beta.-Nicotinamide adenine dinucleotide |
[M+2H]2+ |
BAWFJGJZGIEFAR-OWXIGJDBSA-N |
2 |
332.5639 |
2.423 |
ESI pos |
229 |
1030 |
1482 |
1242 |
1409 |
6 |
6 |
|
This is a real example taken from an upload of West Coast Metabolomics data. The full example is in the examples folder on GitHub.