Tagging
Terminology
header - a descriptive label at the top of a column in a table of data.
Example: SampleID
tag - a semantic, descriptive header that describes a field in a record of data and is used to parse a column (field) of data from a table.
Example: #measurement.compound
table - in this section, table can refer to either a table in a data file or a SQL style table in the output. Often there are many small tables in a data file that are combined into a larger one in the extract output.
directives - directions given to MESSES commands using tagged files or their JSONized forms. A table and its tags taken as a whole.
Introduction
Extracting data from arbitrary tables programmatically requires some kind of system. It could be as simple as requiring that a table start on the very first row and that the starting row have a name for every column, but such a system would be fragile. We decided to create a more robust system that can handle more complicated data arrangements while keeping verbosity to a minimum. The result is an extra layer of tags, added on top of existing tables, that tells the extract command how to transform the tabular data into key-based record data similar to JSON or SQL databases. The specific table schema this system was designed for is covered in the Experiment Description Specification section, but the system is general enough to be used with most table schemas.
This initial system served its function well, but it became clear that more functionality was needed. Namely, a way to add tags programmatically to data and a way to modify record values were both needed, so the system was expanded to provide both. Ultimately, the tagging system has 3 parts that are distinct from one another but share similar syntax and ideas.
The “export” part is the system of tags added directly on top of existing tabular data. It is the base system that must be used for the extraction to work at all.
The “automation” part is used to automate adding “export” tags to tabular data. Based on the header values in your data, you can use “automation” tags to add the “export” tags automatically. A good use case is data that a program outputs in a consistent way. Instead of manually adding export tags to the program output each time, you can create an “automation” page that adds the “export” tags for you.
The “modification” part is the system used to modify record values. It can prepend, append, delete, overwrite, or regex substitute values. An example use case is updating old naming conventions.
Validly tagged files, in their tabular or JSON form, can be referred to as directives because they direct the actions of the program. To reduce confusion between tags and directives, “tags” should generally refer to the extra text added above tables, while “directives” are the tables and tags taken as a whole. Each row of a tagged table is an individual directive.
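As a rough illustration of the five modification operations named above (prepend, append, delete, overwrite, and regex substitute), the sketch below applies each one to a single field of a record held as a Python dictionary. It is not the MESSES implementation and does not use modification tag syntax; the function name `modify_value`, its parameters, and the sample record are all hypothetical.

```python
import re


def modify_value(record, field, operation, argument=None, pattern=None):
    """Return a copy of `record` with one modification applied to `field`."""
    updated = dict(record)
    if operation == "delete":
        updated.pop(field, None)  # remove the field entirely
        return updated
    current = updated.get(field, "")
    if operation == "prepend":
        updated[field] = argument + current
    elif operation == "append":
        updated[field] = current + argument
    elif operation == "overwrite":
        updated[field] = argument
    elif operation == "regex":
        # Replace every match of `pattern` in the current value.
        updated[field] = re.sub(pattern, argument, current)
    else:
        raise ValueError(f"unknown operation: {operation}")
    return updated


# Hypothetical record using an old naming convention for the species.
record = {"id": "sample_1", "species": "mouse"}
renamed = modify_value(record, "species", "regex", "Mus musculus", pattern=r"^mouse$")
print(renamed["species"])
```

Returning a copy rather than changing the record in place keeps the original value available for comparison, which is convenient when checking that a renaming rule did what was intended.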
Each part of the tagging system has to be in its own sheet or file for the extract command. When given an Excel file without a specified sheet name, the extract command expects export tags in a sheet named ‘#export’. A CSV file is expected to contain export tags. Modification tags are expected in a sheet named ‘#modify’ by default, but a different source can be specified using the --modify option. The option can specify a different sheet name in the given Excel file, a different Excel file, a different Excel file with a different sheet name, a Google Sheets file, a Google Sheets file with a different sheet name, a JSON file, or a CSV file. Automation tags are similarly specified using the --automate option, or are otherwise expected in a sheet named ‘#automate’.
Each part of the tagging system is explained below with examples. Examples of using them with the extract command are in the Tutorial section, and full run examples are in the “examples” folder of the GitHub repository.
Common Use Case Examples
Real examples can be found in the examples folder of the GitHub repository, under the extract folder. The following are excerpts from those examples.
Export Tags
Tagging Subject Entities
| #tags | | #entity.id;#entity.species="Mus musculus";#.species_type=Mouse;#.taxonomy_id=10090;#.type=subject | | *#.protocol.id | #entity.replicate | #entity.time_point | |
|---|---|---|---|---|---|---|---|
| #ignore | | | mouse number | protocol | replicate | time_point | |
| | 01 | 01_A0_naive_0days_UKy_GCH_rep1 | A0 | naive | 1 | 0 | |
| | 02 | 02_A1_naive_0days_UKy_GCH_rep2 | A1 | naive | 2 | 0 | |
| | 03 | 03_A2_naive_0days_UKy_GCH_rep3 | A2 | naive | 3 | 0 | |
| | 04 | 04_B0_syngenic_42days_UKy_GCH_rep1 | B0 | syngenic | 1 | 42 | |
| | 05 | 05_B1_syngenic_42days_UKy_GCH_rep2 | B1 | syngenic | 2 | 42 | |
| | 06 | 06_B2_syngenic_42days_UKy_GCH_rep3 | B2 | syngenic | 3 | 42 | |
| | 07 | 07_C1-1_allogenic_42days_UKy_GCH_rep1 | C1-1 | allogenic | 1 | 42 | |
| | 08 | 08_C1-2_allogenic_42days_UKy_GCH_rep2 | C1-2 | allogenic | 2 | 42 | |
You can see that the original header row is ignored with “#ignore” and the important columns are tagged with appropriate field names. Not all of the columns are tagged. What needs to be tagged, or what is important, will vary according to your needs and application. MESSES is largely meant for converting data for deposition into a repository, so information that is not important or relevant to the repository you are targeting will likely be left untagged. Note that the *#.protocol.id tag is a list field even though it has single values underneath. This is to be consistent with field types, as other records can have multiple protocols associated with them.
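To make the mapping from tagged columns to records more concrete, the sketch below reads two data rows of the subject table and builds key-based records from them. It is a simplified illustration, not the MESSES parser: the functions `parse_tag_cell` and `extract_records` are hypothetical, only one table per tag row is handled, and the shape of the output is an assumption. The row layout is also assumed: marker cells such as #tags and #ignore sit in the first column, data rows leave that cell blank, and each tag sits directly above the column it describes.

```python
def parse_tag_cell(cell, current_table):
    """Split one tag cell into (table, field, literal, is_list) tuples."""
    parsed = []
    for token in cell.split(";"):
        token = token.strip()
        if not token:
            continue
        is_list = token.startswith("*")  # '*' marks a list field
        name, sep, literal = token.lstrip("*").partition("=")
        table, _, field = name.strip()[1:].partition(".")  # drop the leading '#'
        current_table = table or current_table  # '#.field' reuses the last table
        value = literal.strip().strip('"') if sep else None
        parsed.append((current_table, field, value, is_list))
    return parsed, current_table


def extract_records(rows):
    """Turn the rows of one tagged table into {table: {id: record}}."""
    output, tagged_columns = {}, []
    for row in rows:
        marker = row[0].strip()
        if marker == "#tags":
            tagged_columns, table = [], None
            for index, cell in enumerate(row):
                if index == 0:
                    continue  # the marker cell itself carries no tag
                parsed, table = parse_tag_cell(cell, table)
                if parsed:
                    tagged_columns.append((index, parsed))
            continue
        if marker == "#ignore" or not tagged_columns:
            continue  # skip the original header row and untagged rows
        record, table_name = {}, None
        for index, parsed in tagged_columns:
            for table, field, literal, is_list in parsed:
                # A literal in the tag wins; otherwise read the cell below the tag.
                value = literal if literal is not None else row[index].strip()
                record[field] = [v.strip() for v in value.split(",")] if is_list else value
                table_name = table_name or table
        output.setdefault(table_name, {})[record["id"]] = record
    return output


rows = [
    ["#tags", "", '#entity.id;#entity.species="Mus musculus";#.species_type=Mouse;'
     "#.taxonomy_id=10090;#.type=subject", "", "*#.protocol.id",
     "#entity.replicate", "#entity.time_point", ""],
    ["#ignore", "", "", "mouse number", "protocol", "replicate", "time_point", ""],
    ["", "01", "01_A0_naive_0days_UKy_GCH_rep1", "A0", "naive", "1", "0", ""],
    ["", "04", "04_B0_syngenic_42days_UKy_GCH_rep1", "B0", "syngenic", "1", "42", ""],
]
records = extract_records(rows)
print(records["entity"]["01_A0_naive_0days_UKy_GCH_rep1"])
```

In this sketch the untagged columns never reach the output, and the protocol value is stored as a one-element list, which mirrors the point about keeping the list field type consistent across records.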
Creating Samples From Parents
| #tags | | #entity.id | | | | | | #entity%child.id=-polar-ICMS_A;#.replicate=1;#%type="analytical";#.weight;#%units=g;*#.protocol.id=polar_extraction,IC-FTMS_preparation;#entity.type=sample | |
|---|---|---|---|---|---|---|---|---|---|
| #ignore | Slice # | Parent Sample ID | polar tare (5ml tube) | polar+tare | g polar ext | g polar FTMS A | g polar FTMS B | g polar ICMS A | g polar ICMS B |
| | 01 | 01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1 | 3.1476 | 4.8772 | 1.7296 | 0.0983 | 0.1056 | 0.2084 | 0.2083 |
| | 02 | 02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2 | 3.1782 | 4.8490 | 1.6708 | 0.0983 | 0.0973 | 0.2001 | 0.1949 |
| | 03 | 03_A2_Spleen_naive_0days_170427_UKy_GCH_rep3 | 3.1621 | 4.8485 | 1.6864 | 0.1029 | 0.1041 | 0.1951 | 0.2043 |
| | 04 | 04_B0_Spleen_syngenic_42days_170427_UKy_GCH_rep1 | 3.1479 | 4.7828 | 1.6349 | 0.0938 | 0.0958 | 0.1885 | 0.1936 |
| | 05 | 05_B1_Spleen_syngenic_42days_170427_UKy_GCH_rep2 | 3.1685 | 4.9067 | 1.7382 | 0.0938 | 0.0994 | 0.2010 | 0.2003 |
| | 06 | 06_B2_Spleen_syngenic_42days_170427_UKy_GCH_rep3 | 3.1735 | 4.8483 | 1.6748 | 0.1003 | 0.0914 | 0.2000 | 0.1891 |
| | 07 | 07_C1-1_Spleen_allogenic_42days_170427_UKy_GCH_rep1 | 3.1764 | 4.7631 | 1.5867 | 0.0980 | 0.0916 | 0.1859 | 0.1868 |
| | 08 | 08_C1-2_Spleen_allogenic_42days_170427_UKy_GCH_rep2 | 3.1617 | 4.8176 | 1.6559 | 0.0955 | 0.0957 | 0.1982 | 0.1922 |
The entities in the 3rd column, under the #entity.id tag, are assumed to exist elsewhere. The %child tag is used to create a new sample that captures the weight of the polar extraction for the ICMS_A aliquot. The other aliquots are not captured because they are not needed or were not used for the study being deposited, but adding tags to those columns, based on the one shown for ICMS_A, is a good exercise for the reader.
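The sketch below shows one way to picture the child records that the %child tag describes for the ICMS_A aliquot. It rests on several assumptions that the text above does not confirm: that the child's id is the parent id with the literal -polar-ICMS_A appended, that the child refers back to its parent through a field named `parent_id`, and that field attributes such as units are stored under keys like `weight%units`. The function `make_child_sample` is hypothetical and is not part of MESSES.

```python
def make_child_sample(parent_id, weight):
    """Build the child record implied by the %child tag for one data row."""
    return {
        "id": parent_id + "-polar-ICMS_A",  # assumed: literal appended to the parent id
        "parent_id": parent_id,             # assumed: link back to the parent entity
        "replicate": "1",
        "replicate%type": "analytical",     # assumed naming for a field attribute
        "weight": weight,                   # value read from the tagged column
        "weight%units": "g",
        "protocol.id": ["polar_extraction", "IC-FTMS_preparation"],
        "type": "sample",
    }


# (Parent Sample ID, g polar ICMS A) pairs from the first two data rows.
rows = [
    ("01_A0_Spleen_naive_0days_170427_UKy_GCH_rep1", "0.2084"),
    ("02_A1_Spleen_naive_0days_170427_UKy_GCH_rep2", "0.2001"),
]
children = {}
for parent_id, weight in rows:
    child = make_child_sample(parent_id, weight)
    children[child["id"]] = child

for child_id in children:
    print(child_id, children[child_id]["weight"])
```

Tagging another aliquot column would follow the same pattern with a different id suffix and a different source column for the weight.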