CLI Reference

The recount3 command-line tool implements a discover -> manifest -> materialize workflow for the recount3 data mirror.

Summary

Use recount3 to:

  • ids - Emit unique sample and project IDs.

  • search - Discover resources and print a machine-readable manifest (JSONL or TSV).

  • download - Materialize resources from a manifest (dir or .zip).

  • bundle - Operate on multiple resources (e.g., stack count matrices).

  • smoke-test - Small connectivity test for CI / local validation.

Quick start

Discover a handful of gene-level count files, save a manifest, and download:

recount3 search gene-exon \
    organism=human data_source=sra genomic_unit=gene project=SRP009615 \
    --format=jsonl > manifest.jsonl

recount3 download --from=manifest.jsonl --dest=./downloads --jobs=8

Or stream directly, without an intermediate file:

recount3 search annotations \
    organism=human genomic_unit=gene annotation_extension=G026 \
    --format=jsonl | \
    recount3 download --from=- --dest=./annots
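
Because the manifest is line-oriented JSON, a small script can sit between search and download to narrow the selection. The following is a minimal sketch and not part of recount3: filter_manifest is a hypothetical helper, and the sample records are illustrative stand-ins that use only field names from the manifest format described in this reference.

```python
import json
from typing import Callable, Iterable, Iterator


def filter_manifest(
    lines: Iterable[str],
    keep: Callable[[dict], bool],
) -> Iterator[str]:
    """Yield manifest lines whose decoded record satisfies ``keep``."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        if keep(record):
            # Re-serialize compactly so each record stays on one line.
            yield json.dumps(record, separators=(",", ":"))


# Illustrative records; real ones come from `recount3 search ... --format=jsonl`.
sample = [
    '{"genomic_unit": "gene", "project": "SRP009615", "url": "..."}',
    '{"genomic_unit": "exon", "project": "SRP009615", "url": "..."}',
]

gene_only = list(filter_manifest(sample, lambda r: r.get("genomic_unit") == "gene"))
for line in gene_only:
    print(line)
```

In a pipeline, the same function could read sys.stdin and write to sys.stdout, so that recount3 download --from=- consumes the filtered stream.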

Commands

ids
  Emit unique ID lists. By default prints to stdout.

  Flags:
    --organism=human|mouse|""   Empty means all organisms.
    --samples-out=<file>        Write samples to a text file (else stdout).
    --projects-out=<file>       Write projects to a text file (else stdout).

search
  Discover resources and print a manifest (JSONL or TSV). Filters are passed
  as space-separated key=value tokens.

  Output:
    By default results are written to stdout (pipe-friendly). Use
    --output <file> to write a specific file, or --outdir <dir> to
    create a timestamped filename in that directory.

  Modes and required filters:
    annotations   organism, genomic_unit, annotation_extension
    gene-exon     organism, data_source, genomic_unit, project
                  (optional: annotation_extension; default G026)
    junctions     organism, data_source, project
                  (optional: junction_type=ALL, junction_extension=MM)
    metadata      organism, data_source, table_name, project
    bigwig        organism, data_source, project, sample
    project       organism, data_source, project
                  Optional filters:
                    genomic_unit=gene,exon
                    annotation=default|all|gencode_v26,gencode_v29
                      Human-readable name or convenience alias.
                      'default' -> primary annotation (G026 for
                      human, M023 for mouse). 'all' -> every
                      available annotation. A comma list of
                      names (e.g. gencode_v26) or raw extension
                      codes (e.g. G026) also works.
                    annotation_extension=G026,G029
                      Raw annotation file-extension codes. When
                      set, overrides 'annotation' completely.
                      Use this when you already know the exact
                      code(s) you need.
                    junction_type=ALL
                    junction_extension=MM,RR,ID
                    include_metadata=true|false
                    include_bigwig=true|false
    sources       organism
    source-meta   organism, data_source

  Example:
    recount3 search junctions \
        organism=human data_source=sra project=SRP009615 \
        junction_type=ALL junction_extension=MM --format=tsv
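
The table of modes and required filters can be read as a simple lookup. The sketch below shows one way such a check might work, assuming nothing about the tool's real internals: REQUIRED_FILTERS, parse_filters, and missing_filters are hypothetical names introduced for illustration, and only the required (not the optional) filters are modeled.

```python
# Required filters per search mode, transcribed from the table above.
REQUIRED_FILTERS = {
    "annotations": ("organism", "genomic_unit", "annotation_extension"),
    "gene-exon": ("organism", "data_source", "genomic_unit", "project"),
    "junctions": ("organism", "data_source", "project"),
    "metadata": ("organism", "data_source", "table_name", "project"),
    "bigwig": ("organism", "data_source", "project", "sample"),
    "project": ("organism", "data_source", "project"),
    "sources": ("organism",),
    "source-meta": ("organism", "data_source"),
}


def parse_filters(tokens):
    """Turn ['organism=human', ...] into {'organism': 'human', ...}."""
    filters = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        filters[key] = value
    return filters


def missing_filters(mode, filters):
    """Return the required filter names that were not supplied, in table order."""
    return [name for name in REQUIRED_FILTERS[mode] if name not in filters]


filters = parse_filters(["organism=human", "data_source=sra"])
print(missing_filters("junctions", filters))  # ['project']
```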

download
  Materialize resources from a manifest file or one inline JSON object.
  Writes one JSONL progress event per resource to stdout.

  Source:
    --from=<path>|-       Read JSONL manifest from file or stdin ('-').
    --inline='<json>'     One JSON object for a single resource.

  Destination:
    --dest=<dir-or-zip>   Directory or .zip file path.
    --overwrite           Overwrite existing files (dir mode only).

  Behavior:
    --jobs=<n>            Max parallel downloads (default 8).
    --cache=MODE          Cache behavior (default: enable). MODE is one of:
                          enable - use cache; disable - bypass cache;
                          update - force re-download then cache.
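
When --dest names a .zip file, each resource is stored under the arcname carried by its manifest record. The standard-library sketch below illustrates that idea only: add_to_archive is a hypothetical helper, the record is a made-up placeholder, and the payload stands in for bytes that a real run would fetch from the record's url.

```python
import io
import zipfile


def add_to_archive(archive: zipfile.ZipFile, record: dict, payload: bytes) -> None:
    """Store payload under the record's arcname inside the archive."""
    archive.writestr(record["arcname"], payload)


# Placeholder record and payload; a real record comes from a manifest line
# and the payload from fetching record["url"].
record = {"url": "http://example.invalid/demo.gz", "arcname": "demo/dir/demo.gz"}

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    add_to_archive(archive, record, b"placeholder bytes")

# Reopen the in-memory archive and list what it contains.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
print(names)
```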

bundle stack-counts
  Concatenate compatible count matrices (gene/exon or junctions).

  Required:
    --from=<manifest>     JSONL manifest (or '-' for stdin).
    --out=<path>          Output file (.csv, .tsv, .tsv.gz, or .parquet).

  Options:
    --compat=family|feature    Compatibility mode (default: family).
    --join=inner|outer         Pandas join type (default: inner).
    --axis=0|1                 Concatenate rows (0) or columns (1).
    --verify-integrity         Fail on duplicate index after concat.
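
The --join, --axis, and --verify-integrity options correspond in name to parameters of pandas.concat. The sketch below uses toy matrices to show what those parameters do in pandas itself; it assumes, without confirmation from this reference, that stack-counts behaves like a plain pandas.concat call, and it does not model --compat.

```python
import pandas as pd

# Two toy count matrices: rows are features, columns are samples.
a = pd.DataFrame({"S1": [1, 2, 3]}, index=["g1", "g2", "g3"])
b = pd.DataFrame({"S2": [4, 5]}, index=["g2", "g3"])

# axis=1 places samples side by side; join="inner" keeps shared features only.
inner = pd.concat([a, b], axis=1, join="inner", verify_integrity=True)

# join="outer" keeps every feature and fills the gaps with NaN.
outer = pd.concat([a, b], axis=1, join="outer")

print(inner)
print(outer)

# verify_integrity=True raises ValueError when the concatenated axis
# (here: the column labels) contains duplicates.
try:
    pd.concat([a, a], axis=1, verify_integrity=True)
    duplicate_detected = False
except ValueError as exc:
    duplicate_detected = True
    print("duplicate labels:", exc)
```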

smoke-test
  Download a few tiny files to verify connectivity and configuration.

  Options:
    --dest=<dir>          Destination directory (default ./recount3-smoke).
    --limit=<n>           Number of resources to attempt (default 1).

Input and output formats

JSONL (a.k.a. NDJSON) – one JSON object per line; ideal for streaming, grepping, and piping. Used for both search output and download input.

Each manifest line contains all resource description fields plus two convenience keys: url (the fully qualified HTTP URL) and arcname (the destination path inside a .zip archive). For example (one record, wrapped for readability):

{"resource_type":"count_files_gene_or_exon","organism":"human",
 "data_source":"sra","genomic_unit":"gene","project":"SRP009615",
 "sample":null,"annotation_extension":"G026","junction_type":null,
 "junction_extension":null,"table_name":null,
 "url":".../sra/gene_sums/15/SRP009615/sra.gene_sums.SRP009615.G026.gz",
 "arcname":"human/data_sources/.../sra.gene_sums.SRP009615.G026.gz"}

download writes one progress event per resource to stdout:

{"url":"...","status":"ok","dest":"/path/to/file"}
{"url":"...","status":"skipped","dest":"/existing/file"}
{"url":"...","status":"error","dest":null,"error":"<repr>"}
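
Since each progress event is one JSON object per line, the stream is easy to post-process. The following sketch tallies events by status and collects failed URLs; summarize_events is a hypothetical helper, and the sample events reuse the shapes shown above with placeholder URLs.

```python
import json
from collections import Counter


def summarize_events(lines):
    """Count events by status and collect the URLs that failed."""
    counts = Counter()
    failed = []
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        counts[event["status"]] += 1
        if event["status"] == "error":
            failed.append(event["url"])
    return counts, failed


events = [
    '{"url":"u1","status":"ok","dest":"/path/to/file"}',
    '{"url":"u2","status":"skipped","dest":"/existing/file"}',
    '{"url":"u3","status":"error","dest":null,"error":"<repr>"}',
]
counts, failed = summarize_events(events)
print(dict(counts), failed)
```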

TSV – tab-separated text for quick human scanning or spreadsheet import. TSV is available for search --format=tsv only; download expects JSONL.

Configuration

Configuration is centralized in Config. Values come from, in order of decreasing precedence: CLI flags, environment variables, then library defaults. The relevant environment variables are:

RECOUNT3_URL               Base URL (trailing slash added automatically)
RECOUNT3_CACHE_DIR         Directory for the on-disk cache
RECOUNT3_CACHE_DISABLE     "1" disables cache, anything else enables
RECOUNT3_HTTP_TIMEOUT      HTTP timeout in seconds (int)
RECOUNT3_MAX_RETRIES       Max retry attempts for transient errors (int)
RECOUNT3_INSECURE_SSL      "1" to disable TLS verification (unsafe; https
                           base URLs only, no-op for default http mirror)
RECOUNT3_USER_AGENT        Custom HTTP User-Agent string
RECOUNT3_CHUNK_SIZE        Streaming chunk size in bytes

Global flags mirror these settings: --base-url, --cache-dir, --timeout, --retries, --insecure-ssl, --user-agent, --chunk-size.
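
The precedence rule (CLI flag, then environment variable, then library default) can be sketched as follows. This is an illustration under stated assumptions, not the actual Config implementation: resolve_setting, normalize_base_url, and cache_enabled are hypothetical names, and the numeric values are placeholders rather than the library's real defaults.

```python
import os


def resolve_setting(cli_value, env_name, default, environ=None):
    """Return the CLI value if given, else the environment value, else the default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if env_name in environ:
        return environ[env_name]
    return default


def normalize_base_url(url):
    """Append a trailing slash when it is missing, as described for RECOUNT3_URL."""
    return url if url.endswith("/") else url + "/"


def cache_enabled(environ):
    """Only the exact string "1" in RECOUNT3_CACHE_DISABLE turns the cache off."""
    return environ.get("RECOUNT3_CACHE_DISABLE") != "1"


env = {"RECOUNT3_HTTP_TIMEOUT": "30", "RECOUNT3_CACHE_DISABLE": "0"}
# 60, 10, and 3 are placeholder numbers, not the library's real defaults.
print(resolve_setting(60, "RECOUNT3_HTTP_TIMEOUT", 10, env))    # CLI flag wins
print(resolve_setting(None, "RECOUNT3_HTTP_TIMEOUT", 10, env))  # environment value
print(resolve_setting(None, "RECOUNT3_MAX_RETRIES", 3, env))    # default
print(cache_enabled(env))
```

Note that environment values arrive as strings, so a real implementation would still need to convert settings documented as int.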

Logging

Logging defaults to INFO. Use --quiet to raise the threshold to WARNING or --verbose to lower it to DEBUG. Log messages use pattern-string (%-style) formatting rather than f-strings, per the Google Python Style Guide, and include greppable context (e.g., url=..., dest=...).
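
For readers unfamiliar with pattern-string logging, here is a short standard-library sketch of the style. The logger name, URL, and path are illustrative assumptions and do not reflect the tool's actual log output.

```python
import logging

log = logging.getLogger("recount3.example")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
log.addHandler(handler)
log.setLevel(logging.INFO)  # the documented default threshold

url = "http://example.invalid/file.gz"
dest = "/tmp/file.gz"

# Pattern-string style: arguments are interpolated only if the record is emitted.
log.info("downloaded url=%s dest=%s", url, dest)

# DEBUG is below the INFO threshold, so this message is never formatted.
log.debug("cache lookup url=%s", url)
```

Passing the values as arguments, rather than building the string up front, keeps the key=value context greppable and avoids formatting work for suppressed levels.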

Exit codes

0    Success
1    Malformed --inline JSON in download
2    Fatal error (missing filters, I/O failures, bad configuration; also
     argparse validation errors such as unrecognized flags)
3    Partial failure in download (some items failed)
130  Interrupted (Ctrl-C)
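
Scripts that wrap the CLI can branch on these codes. The sketch below names them in an enum; ExitCode and should_retry are hypothetical, and the retry policy is an assumption chosen for illustration, not a documented recommendation.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in this reference; the class itself is illustrative."""

    SUCCESS = 0
    BAD_INLINE_JSON = 1
    FATAL = 2
    PARTIAL_FAILURE = 3
    INTERRUPTED = 130


def should_retry(returncode: int) -> bool:
    """Assumed policy: only a partial download failure is worth re-running."""
    return returncode == ExitCode.PARTIAL_FAILURE


for code in (0, 2, 3):
    print(code, ExitCode(code).name, should_retry(code))
```

In practice the return code would come from the finished process, for example the returncode attribute of the object returned by subprocess.run.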

Security and safety

  • TLS verification is on by default for https:// mirrors. --insecure-ssl disables it and should only be used to debug certificate issues. It applies only to https:// base URLs and is a no-op for the default http:// Duffel mirror; the AWS Open Data and JHU IDIES https mirrors have valid certificates and need no flag.

  • The cache reduces repeated downloads. Choose --cache=disable to bypass it when correctness requires a direct fetch.

Performance tips

  • Increase --jobs to improve throughput when network-bound.

  • Keep the cache enabled for repeated workflows.

  • Use streaming pipelines with JSONL and standard tools (jq, grep, head/tail) to avoid loading everything into memory.

Example recipes

List human data sources, then download the source-level metadata for SRA:

recount3 search sources organism=human --format=jsonl > sources.jsonl
recount3 search source-meta organism=human data_source=sra \
    --format=jsonl > meta.jsonl
recount3 download --from=meta.jsonl --dest=./meta

Stack gene-level matrices across samples and write Parquet:

recount3 search gene-exon \
    organism=human data_source=sra genomic_unit=gene project=SRP009615 \
    --format=jsonl > counts.jsonl
recount3 bundle stack-counts --from=counts.jsonl --compat=family \
    --join=inner --axis=1 --out=counts.parquet

Troubleshooting

  • “Missing required filters”: Check the mode-specific filter list above.

  • “json.JSONDecodeError”: Ensure your manifest is valid JSONL. Each line must be one JSON object.

  • Permission/Path errors: Verify that --dest exists (or, for a .zip, that its parent directory exists) and is writable; on shared filesystems, lower --jobs to reduce I/O contention.

  • TLS/SSL errors: Try updating CA certs, or as a last resort temporarily use --insecure-ssl to isolate the issue.
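
To track down a json.JSONDecodeError, it helps to know which manifest line is malformed. The following standard-library sketch reports offending line numbers; find_invalid_lines is a hypothetical helper and not a recount3 command, and the sample manifest is deliberately broken for demonstration.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, message) pairs for lines that are not JSON objects."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped here; a stricter check could flag them
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems


manifest = [
    '{"url": "...", "arcname": "a.gz"}',
    '{"url": "...", "arcname": ',  # truncated record
    '["not", "an", "object"]',     # valid JSON, but not an object
]
problems = find_invalid_lines(manifest)
print(problems)
```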

Import safety

The CLI module only defines functions and constants and performs no I/O at import time, so it is safe to import under pydoc and in unit tests.

Full usage

Run any subcommand with --help for the full option list:

recount3 --help
recount3 search --help
recount3 download --help
recount3 bundle stack-counts --help
recount3 bundle se --help
recount3 bundle rse --help
recount3 smoke-test --help