recount3 Local Run Helpers • rc3helpers

rc3helpers is a little package designed around making it easier work with local samples in a few ways:

Running recount-pump and recount-unify;
Copying the output of recount-unify into the directory structure that allows importing local data into the recount3 R package.
Checking that something hasn’t happened to your *.fq.gz files during download or copying between locations.

Installation

You can install the development version of rc3helpers from GitHub with:

# install.packages("pak")
pak::pak("MoseleyBioinformaticsLab/rc3helpers")

Running Pump & Unify

This assumes you’ve already gotten everything setup for running pump and unify locally. You can find instructions for that on the monorail-external README.

I realize that the rc3_run_pump_unify function call takes a lot of arguments. However, given that I’m horrible at writing shell scripts, and typically have more than a single sample to run on local hardware, not in the cloud like the Langmead lab does, this saves me a lot of typing as I only have to give it the directory of samples, besides the other locations.

# NOT RUN, for example only
unify_output = rc3_run_pump_unify(
  fasta = "path/to/fasta.fq.gz",
  outputs = "path/to/outputs/dir",
  monorail = "path/to/monorail/repo",
  recount_pump = "path/to/recount-pump.XXX.sif",
  recount_unify = "path/to/recount-unify.XXX.sif",
  reference_path = "path/to/references",
  reference = "hg38",
  studyid = "studyid",
  shortid = "shortid",
  ncore = 2,
  run = "both",
  delete_oldpump = "yes",
  run_or_show = "run"
)

When you are processing a larger number of samples (I’ve seen it happen with 17 samples), one or more of the samples might fail during the pump stage. recount-unify should die at that point from checking the samples, as rc3_run_pump_unify defaults to expecting the same samples in both the pump and unify stage. You should verify what caused the sample to error by checking the log file, normally in path/to/outputs/output/sampleid_att0/std.out. As long as it doesn’t say decompression failed, you should be able to delete the sample directory, and then set run = "pump", delete_oldpump = "no" to just do the pump part on the samples that previously failed. When that is complete, you can then do it again, with run = "unify", and it will generate the unify results.

Custom Organisms

For the monorail argument, this can either be the top level directory that is normally monorail-external, or a vector of length 2, giving the paths to both of the shell scripts for orchestrating recount-pump and recount-unify respectively. This is useful when you are working with a custom organism outside of human and mouse. For example, I have been working with zebrafish, so I created a slightly modified script that handles the recount-unify step properly for zebrafish, and I would define the monorail argument like this, so that the unify step calls the correct shell script:

monorail = c("monorail-external/singularity/run_recount_pump.sh", 
             "monorail-external/sngularity/run_recount_unify_mz11.sh")

Samples and IDs

This script currently expects each sample to have two fasta files that are named:

sample_id_1.fq.gz
sample_id_2.fq.gz

monorail-external also assumes that the last two places of any identifier such as sample_id, study_id or shortid do not contain an underscore or any other non-alphanumeric character, i.e. they should contain [a-z] and or [0-9].

Copying Unify Outputs

Two important notes:

As of recount3 v 1.22.0, there is no option to have an organism outside of human or mouse, so we always use a directory of one of those two organisms.
This also means that only certain annotations / references are supported. Therefore, we create symbolic links to the files that recount3 expects to be there.

I previously ran two human SRA samples in local mode through the pump and unify workflow, with the short_id sratest.

library(rc3helpers)
tmp_dir = tempdir()
rc3_dirs = rc3_setup_directory(base_dir = tmp_dir, short_id = "sratest")
#> ℹ Using "/tmp/Rtmpdlp16N" to create underlying folders ...
#> ℹ Creating directories: "/tmp/Rtmpdlp16N/human/data_sources/sratest/base_sums", "/tmp/Rtmpdlp16N/human/data_sources/sratest/exon_sums", "/tmp/Rtmpdlp16N/human/data_sources/sratest/gene_sums", "/tmp/Rtmpdlp16N/human/data_sources/sratest/junctions", "/tmp/Rtmpdlp16N/human/data_sources/sratest/metadata", "/tmp/Rtmpdlp16N/human/annotations/gene_sums", and "/tmp/Rtmpdlp16N/human/annotations/exon_sums"

The directories created should look like this:

human
├── annotations
│   ├── exon_sums
│   └── gene_sums
└── data_sources
    └── sratest
        ├── base_sums
        ├── exon_sums
        ├── gene_sums
        ├── junctions
        └── metadata

The unify output looks like this:

├── exon_sums_per_study
│   └── st
│       └── sratest
│           ├── sratest.exon_sums.sratest.ERCC.gz
│           ├── sratest.exon_sums.sratest.F006.gz
│           ├── sratest.exon_sums.sratest.G026.gz
│           ├── sratest.exon_sums.sratest.G029.gz
│           ├── sratest.exon_sums.sratest.R109.gz
│           └── sratest.exon_sums.sratest.SIRV.gz
├── gene_sums_per_study
│   └── st
│       └── sratest
│           ├── sratest.gene_sums.sratest.ERCC.gz
│           ├── sratest.gene_sums.sratest.F006.gz
│           ├── sratest.gene_sums.sratest.G026.gz
│           ├── sratest.gene_sums.sratest.G029.gz
│           ├── sratest.gene_sums.sratest.R109.gz
│           └── sratest.gene_sums.sratest.SIRV.gz
├── ids.tsv
├── junction_counts_per_study
│   └── st
│       └── sratest
│           ├── sratest.junctions.sratest.ALL.ID.gz
│           ├── sratest.junctions.sratest.ALL.MM.gz
│           ├── sratest.junctions.sratest.ALL.RR.gz
│           ├── sratest.junctions.sratest.UNIQUE.ID.gz
│           ├── sratest.junctions.sratest.UNIQUE.MM.gz
│           └── sratest.junctions.sratest.UNIQUE.RR.gz
├── junctions.bgz
├── junctions.bgz.tbi
├── junctions.sqlite
|
|
|

Where this is a model organism, you are going to need the reference sets of things downloaded too. We can download the gene and exon sum annotations from recount3, in our case for G026.

wget http://duffel.rail.bio/recount3/human/new_annotations/gene_sums/human.gene_sums.G026.gtf.gz
wget http://duffel.rail.bio/recount3/human/new_annotations/exon_sums/human.exon_sums.G026.gtf.gz

We can pass the list of directories created by rc3_setup_directories to rc3_copy_unify_outputs:

rc3_dirs |>
  rc3_copy_unify_outputs(
    short_id = "sratest",
    unify_directory = "/path/to/unify",
    references_directory = "/path/to/references/hg38"
  )

Which will create this directory structure. You may see an error when working with human data, as the function tries to link a file that actually already exists. That’s OK.

human
├── annotations                                                                                                                                       
│   ├──exon_sums                                         
│   │   ├── human.exon_sums.G026.gtf.gz                                                                                                                    
│   └── gene_sums                                                                                                                                
│       ├── human.gene_sums.G026.gtf.gz                                                                                                                 
└── data_sources
    └── sratest
        ├── base_sums
        ├── exon_sums
        │   └── st
        │       └── sratest
        │           ├── sratest.exon_sums.sratest.ERCC.gz
        │           ├── sratest.exon_sums.sratest.F006.gz
        │           ├── sratest.exon_sums.sratest.G026.gz
        │           ├── sratest.exon_sums.sratest.G029.gz
        │           ├── sratest.exon_sums.sratest.R109.gz
        │           └── sratest.exon_sums.sratest.SIRV.gz
        ├── gene_sums                                                                                                                                    
        │   └── st                                                                                                                                                 
        │       └── sratest
        │           ├── sratest.gene_sums.sratest.ERCC.gz
        │           ├── sratest.gene_sums.sratest.F006.gz
        │           ├── sratest.gene_sums.sratest.G026.gz
        │           ├── sratest.gene_sums.sratest.G029.gz
        │           ├── sratest.gene_sums.sratest.R109.gz
        │           └── sratest.gene_sums.sratest.SIRV.gz
        ├── junctions
        │   └── st
        │       └── sratest
        │           ├── sratest.junctions.sratest.ALL.ID.gz
        │           ├── sratest.junctions.sratest.ALL.MM.gz
        │           ├── sratest.junctions.sratest.ALL.RR.gz
        │           ├── sratest.junctions.sratest.UNIQUE.ID.gz
        │           ├── sratest.junctions.sratest.UNIQUE.MM.gz
        │           └── sratest.junctions.sratest.UNIQUE.RR.gz
        └── metadata
            ├── sratest.recount_project.MD.gz
            └── st
                └── sratest
                    ├── sratest.recount_project.sratest.MD.gz
                    ├── sratest.recount_qc.sratest.MD.gz
                    └── sratest.sratest.sratest.MD.gz

Then we can load up recount3 and start working with this dataset:

library(recount3)
hp = available_projects(recount3_url = tmp_dir)
rse_genes = create_rse(hp[1, ], recount3_url = tmp_dir)

Checking fq.gz

Sometimes, if there is an issue with your filesystem, or the disks where data is being stored, when you copy or move a file, something can go wrong with the file. recount-pump depends on decompressing your fasta files into the docker container to run, so if anything gets corrupted with the fq.gz file, it just won’t run.

So, assuming you have gzip installed on your system, we can check either an entire directory or particular files before running recount-pump and unify.

# Not run
file_checks = rc3_check_gz(fasta)

This returns a data.frame, with any error messages emitted from gzip for each file. Ideally, you see “NA” in every entry before running the pump and unify steps.