User Guide

MESSES


MESSES (Metadata from Experimental SpreadSheets Extraction System) is a Python package that facilitates the conversion of tabular data into other formats. We call it MESSES because we try to convert other people’s metadata messes into clean, well-structured, JSONized metadata. It was initially created to pull mass spectrometry (MS) and nuclear magnetic resonance (NMR) experimental data into a database, but has been generalized to work with all tabular data. The key to this is the tagging system. Simply add a layer of tags to any tabular data and MESSES can transform it into an intermediate JSON representation and then convert it to any of the supported formats.

Currently Supported Formats:

The process of going from raw experimental data to submission to an online repository is not an easy one, and MESSES was created to make it easier. MESSES breaks the process into 3 steps: extract, validate, and convert. The extraction step adds a layer of tags to your raw tabular data (the tagging may be automatable) and then extracts the data into a JSONized form that is more interoperable and more standardized. The validation step ensures that the extracted data is valid against the Experiment Description Specification, the Protocol Dependent Schema, any additional JSON schema you wish to provide, and a built-in schema specific to the format you wish to convert to. The conversion step converts the extracted data to the form that is accepted by the online repository. The initial learning curve is steep, but once the extraction, validation, and conversion settings are worked out, this process can be easily added to your data generation and analysis workflows.

Although any kind of data schema can be used for extraction into JSON, conversion to another format from the extracted JSON does rely on the data being in a specific schema. A generalized schema was developed for MESSES that should be able to comprehensively describe most experimental designs and data. This schema is described in the Experiment Description Specification section of the documentation. But original data entry, manual tagging of tabular data, and even automated tagging facilities can be messy, generating errors in the extracted JSONized representation. So MESSES includes a validate command to help make sure your data is in line with your project parameters and data schema.

The MESSES package is primarily designed as a command-line tool to convert raw tabular data (Excel or CSV formatted) into other well-structured data formats. But the package can be used as a library and extended to handle additional data conversion use-cases.

Installation

The MESSES package runs under Python 3.10+. Use pip to install. Starting with Python 3.4, pip is included by default. Be sure to use the latest version of pip as older versions are known to have issues grabbing all dependencies.

Install on Linux, Mac OS X

python3 -m pip install messes

Install on Windows

py -3 -m pip install messes

Upgrade on Linux, Mac OS X

python3 -m pip install messes --upgrade

Upgrade on Windows

py -3 -m pip install messes --upgrade

Note: If py is not installed on Windows (e.g. Python was installed via the Windows store rather than from the official Python website), the installation command is the same as Linux and Mac OS X.

Note: If the messes console script is not found on Windows, the CLI can be used via python3 -m messes or py -3.10 -m messes or path\to\console\script\messes.exe. Alternatively, the directory where the console script is located can be added to the Path environment variable. For example, the console script may be installed at:

c:\users\<username>\appdata\local\programs\python\python310\Scripts\
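To confirm that an installation worked, a short check like the following may help. It is a minimal sketch using only the Python standard library. The function name install_report is hypothetical and not part of MESSES, and the sketch assumes the installed distribution is named messes, as in the pip commands above.

```python
import shutil
import sys
import sysconfig
from importlib import metadata


def install_report(package="messes", min_python=(3, 10)):
    """Collect basic facts about this Python and a pip-installed package."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # not installed for this interpreter
    return {
        # True when the running interpreter meets the minimum version.
        "python_ok": sys.version_info[:2] >= min_python,
        "package_version": version,
        # Full path of the console script if it is on PATH, else None.
        "console_script": shutil.which(package),
        # Default scripts directory for this interpreter
        # (a --user install may place scripts elsewhere).
        "scripts_dir": sysconfig.get_path("scripts"),
    }


if __name__ == "__main__":
    for key, value in install_report().items():
        print(f"{key}: {value}")
```

If console_script is None while package_version is set, the package is installed but its script directory is probably not on the Path environment variable, which is the situation described in the note above.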

Quickstart

It is unlikely that your data is already tagged and ready to be converted, so it is highly recommended to first read the documentation on tagging and the Experiment Description Specification so that you can tag your data properly.

The expected workflow is: (1) use the “extract” command to transform your tabular data into JSON, (2) use the “validate” command to validate the JSON based on your specific project schema, (3) fix errors and warnings in the original data, (4) repeat steps 1-3 until there are no more errors, and (5) use the “convert” command to transform the validated JSON into your final preferred data format. The validate command can be skipped, but this is not recommended.

A basic error-free run may look like:

messes extract your_data.csv --output your_data.json
messes validate json your_data.json --pds your_schema.json --format desired_format
messes convert desired_format your_data.json your_format_data
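The three commands above can also be driven from a script. The following is a minimal sketch, not part of MESSES: build_commands and run_workflow are hypothetical helper names, the arguments are the same placeholders used above, and the sketch assumes each command signals failure with a nonzero exit status, which should be checked against the MESSES documentation.

```python
import subprocess


def build_commands(table, pds, fmt, json_out, converted_out):
    """Return the extract, validate, and convert command lines in order."""
    return [
        ["messes", "extract", table, "--output", json_out],
        ["messes", "validate", "json", json_out, "--pds", pds, "--format", fmt],
        ["messes", "convert", fmt, json_out, converted_out],
    ]


def run_workflow(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first nonzero exit status.

    Returns the command that failed, or None if every command succeeded.
    Assumption: a nonzero exit status means the step reported a problem.
    """
    for command in commands:
        completed = runner(command)
        if completed.returncode != 0:
            return command
    return None
```

Calling run_workflow(build_commands("your_data.csv", "your_schema.json", "desired_format", "your_data.json", "your_format_data")) would run the same sequence as the commands above. Stopping at the first failing step matches the workflow of fixing the original data and rerunning before converting.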

MESSES’s behavior can be quite complex, so you are strongly encouraged to read the guide and tutorial. There are also examples available in the examples folder on the GitHub repository and on figshare.

Mac OS Note

When you try to run the program on Mac OS, you may get an SSL error.

certificate verify failed: unable to get local issuer certificate

This is due to a change in Mac OS and Python. To fix it, go to your Python folder in Applications (/Applications/Python 3.x) and run the Install Certificates.command shell command found there. This should fix the issue.
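If the error persists, it can help to see where Python looks for CA certificates. The following is a minimal diagnostic sketch using the standard ssl module. The helper name ca_locations is hypothetical and not part of MESSES.

```python
import os
import ssl


def ca_locations():
    """Report where this Python's OpenSSL looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        # Resolved CA bundle file, or None if the file does not exist.
        "cafile": paths.cafile,
        # Resolved CA directory, or None if the directory does not exist.
        "capath": paths.capath,
        # Hard-coded location OpenSSL was built to use.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_cafile_exists": os.path.exists(paths.openssl_cafile),
    }


if __name__ == "__main__":
    for key, value in ca_locations().items():
        print(f"{key}: {value}")
```

A cafile of None suggests that no default CA bundle file is in place for this interpreter, which is consistent with the certificate error above.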

License

This package is distributed under the BSD license.

Get the source code

Code is available on GitHub: https://github.com/MoseleyBioinformaticsLab/messes

You can either clone the public repository:

$ git clone https://github.com/MoseleyBioinformaticsLab/messes.git

Or, download the tarball and/or zipball:

$ curl -OL https://github.com/MoseleyBioinformaticsLab/messes/tarball/main

$ curl -OL https://github.com/MoseleyBioinformaticsLab/messes/zipball/main

Once you have a copy of the source, you can embed it in your own Python package, or install it into your system site-packages easily:

$ python3 setup.py install

Dependencies

The MESSES package depends on several Python libraries. The pip command will install all dependencies automatically, but if you wish to install them manually, run the following commands:

  • docopt for creating the command-line interface.
    • To install docopt, run the following:

      python3 -m pip install docopt  # On Linux, Mac OS X
      py -3 -m pip install docopt    # On Windows
      
  • jsonschema for validating JSON.
    • To install the jsonschema Python library, run the following:

      python3 -m pip install jsonschema  # On Linux, Mac OS X
      py -3 -m pip install jsonschema    # On Windows
      
  • pandas for easy data manipulation.
    • To install the pandas Python library, run the following:

      python3 -m pip install pandas  # On Linux, Mac OS X
      py -3 -m pip install pandas    # On Windows
      
  • openpyxl for saving Excel files in pandas.
    • To install the openpyxl Python library, run the following:

      python3 -m pip install openpyxl  # On Linux, Mac OS X
      py -3 -m pip install openpyxl    # On Windows
      
  • xlsxwriter for saving Excel files in pandas.
    • To install the xlsxwriter Python library, run the following:

      python3 -m pip install xlsxwriter  # On Linux, Mac OS X
      py -3 -m pip install xlsxwriter    # On Windows
      
  • jellyfish for approximate string matching.
    • To install the jellyfish Python library, run the following:

      python3 -m pip install jellyfish  # On Linux, Mac OS X
      py -3 -m pip install jellyfish    # On Windows
      
  • mwtab for working with mwTab formatted files.
    • To install the mwtab Python library, run the following:

      python3 -m pip install mwtab  # On Linux, Mac OS X
      py -3 -m pip install mwtab    # On Windows
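
After a manual install, a quick check like the following can show which of the libraries listed above are still missing. It is a minimal sketch using only the standard library. The name missing_modules is hypothetical, and the sketch assumes each library's import name matches the name used in the pip commands above.

```python
from importlib.util import find_spec

# Assumption: import names equal the pip names listed above.
DEPENDENCIES = [
    "docopt", "jsonschema", "pandas", "openpyxl",
    "xlsxwriter", "jellyfish", "mwtab",
]


def missing_modules(names=DEPENDENCIES):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```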
      

Developers

Developers who wish to contribute should do so through GitHub Issues and pull requests.