Tutorial
This tutorial walks through the recount3 Python API end-to-end: resource
discovery, downloading, assembly of
SummarizedExperiment /
RangedSummarizedExperiment objects, sample
metadata merging, count normalization and scaling, BigWig coverage access, and
management of the on-disk cache.
For the recount3 command-line tool, see CLI Reference. For full per-symbol
documentation, see API Reference.
Installation
The core package depends only on NumPy, pandas, and SciPy. The two optional extras enable features used throughout this tutorial:
python3 -m pip install recount3 # core only
python3 -m pip install "recount3[biocpy]" # + SummarizedExperiment
python3 -m pip install "recount3[bigwig]" # + pyBigWig
python3 -m pip install "recount3[biocpy,bigwig]"
What each extra enables:
biocpy is required for create_rse(), to_summarized_experiment(), to_ranged_summarized_experiment(), and every helper in recount3.se that returns or operates on a BiocPy object.
bigwig is required only when you call load() on a BigWig resource or use the BigWigFile reader directly.
If an optional dependency is missing, the affected function raises
ImportError on first use; the remainder of the package stays importable
and functional.
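The deferred-failure behavior described above is a common pattern for optional dependencies. The following is a minimal sketch of that pattern using only the standard library; `require_optional` is a hypothetical helper written for illustration and is not part of the recount3 API.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or raise ImportError with an install hint.

    Hypothetical helper: the name and message format are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f'install it with: pip install "recount3[{extra}]"'
        ) from exc


# A module that is installed is imported and returned as usual.
json_module = require_optional("json", extra="biocpy")

# A missing module fails only at the point where the feature is used.
try:
    require_optional("recount3_missing_dependency_example", extra="bigwig")
except ImportError as err:
    hint = str(err)
```

Because the import happens inside the function, the rest of a package built this way stays importable when the extra is absent.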
Quick start
Note
Every example in this tutorial retrieves data from a live recount3 mirror
and therefore requires network access. Downloaded files are cached under
~/.cache/recount3/files (see Cache and configuration), so
re-running an example reuses the local copy rather than downloading again.
The most direct path from a project identifier to an analysis-ready BiocPy
object is create_rse(). It requires the biocpy extra and
performs discovery, downloads, metadata merging, and range assembly in a single
call:
from recount3 import create_rse
rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",
)
print(rse.shape) # (n_features, n_samples)
print(rse.get_column_names()[:5])
This single call is sufficient for the most common workflow; it is expanded in
Layer 1: Building experiments with create_rse below. The remainder of this tutorial describes the steps that
create_rse performs internally and the lower-level components to use when
finer control is required.
The three layers of the API
recount3 exposes the same workflow at three levels of abstraction:
| Layer | Primary entry point | Recommended when |
|---|---|---|
| High-level: BiocPy builders | create_rse() | You want one project as a RangedSummarizedExperiment. |
| Mid-level: bundles | R3ResourceBundle | You combine multiple projects, filter resources, or stack matrices yourself. |
| Low-level: resources | R3Resource | You want fine-grained control over a single file’s URL, download, or parser. |
Each layer is a thin wrapper around the next. create_rse calls
R3ResourceBundle.discover internally; R3ResourceBundle aggregates
R3Resource objects. Because the layers share a common set of types, they
interoperate freely: a bundle obtained from discover can be filtered at
Layer 2 and then handed to the same builders that create_rse invokes.
Layer 1: Building experiments with create_rse
create_rse() is the recommended entry point for the most
common workflow: one project, one organism, one annotation, one assembled
RangedSummarizedExperiment. Requires the
biocpy extra.
Gene-level RSE (default)
from recount3 import create_rse
rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",  # or "gencode_v29", "fantom6_cat", "refseq", "ercc", "sirv"
)
You may pass the raw extension code instead of a label:
rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_extension="G026",
)
When both are supplied, annotation_extension takes precedence. Discover the
available labels with annotation_options():
from recount3 import annotation_options
annotation_options("human")
# {'gencode_v26': 'G026', 'gencode_v29': 'G029', 'fantom6_cat': 'F006',
# 'refseq': 'R109', 'ercc': 'ERCC', 'sirv': 'SIRV'}
annotation_options("mouse")
# {'gencode_v23': 'M023'}
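The precedence rule above (an explicit annotation_extension wins over annotation_label) can be sketched as a small lookup. This is an illustration only: `resolve_annotation` is a hypothetical function, the mapping is copied from the annotation_options("human") output shown above, and the sketch omits the default that create_rse resolves when neither argument is given.

```python
# Mapping copied from the annotation_options("human") output above.
HUMAN_ANNOTATIONS = {
    "gencode_v26": "G026",
    "gencode_v29": "G029",
    "fantom6_cat": "F006",
    "refseq": "R109",
    "ercc": "ERCC",
    "sirv": "SIRV",
}


def resolve_annotation(label=None, extension=None, options=HUMAN_ANNOTATIONS):
    """Return the extension code, preferring an explicit extension over a label."""
    if extension is not None:
        return extension
    if label is None:
        # The real API resolves a default here; this sketch does not model it.
        raise ValueError("pass annotation_label or annotation_extension")
    if label not in options:
        raise ValueError(
            f"unknown annotation label {label!r}; expected one of {sorted(options)}"
        )
    return options[label]


from_label = resolve_annotation(label="gencode_v26")
both_given = resolve_annotation(label="gencode_v26", extension="G029")
```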
Exon-level and junction-level
exon_rse = create_rse(
    project="SRP009615",
    organism="human",
    genomic_unit="exon",
    annotation_label="gencode_v26",
)
junction_rse = create_rse(
    project="SRP009615",
    organism="human",
    genomic_unit="junction",
)
For junctions, recount3 prefers the RR sidecar for genomic
coordinates; pass prefer_rr_junction_coordinates=False to disable
this.
Falling back to a plain SummarizedExperiment
If genomic ranges cannot be derived (for example, the GTF is missing or an
RR file is unavailable), create_rse raises ValueError by
default. Pass allow_fallback_to_se=True to receive a plain
SummarizedExperiment in that case:
experiment = create_rse(
    project="SRP009615",
    organism="human",
    allow_fallback_to_se=True,
)
Operations performed by create_rse
For one (organism, data_source, project) triple, it:
discovers gene/exon/junction counts, the matching annotation GTF, and the five project metadata tables;
downloads everything into the on-disk cache;
stacks the count matrix into a feature × sample DataFrame;
merges, namespaces, and aligns the metadata tables to the count columns (including a BigWigURL column constructed per sample);
parses the GTF (or RR file, for junctions) to attach genomic ranges;
constructs the BiocPy object.
To deviate from any of those steps (multiple projects, custom metadata filtering, stacking only some matrices, or a different join policy), use Layer 2.
Layer 2: Resource bundles
R3ResourceBundle is a container of
R3Resource objects with helpers for filtering,
loading, stacking, and converting to BiocPy objects.
Discovering resources for one or more projects
from recount3 import R3ResourceBundle
bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
)
print(f"Found {len(bundle.resources)} resources.")
# Found 10 resources.
By default, discover returns gene + exon counts, the default annotation
GTF for each of those units (gene and exon), the five metadata tables, and the
default junction artifact (MM). For SRP009615 this is the ten resources
counted above: 2 counts + 2 annotation GTFs + 5 metadata tables + 1 junction
file. Override with:
bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
    genomic_units=("gene",),
    annotations=("G026", "G029"),  # or "default" / "all"
    junction_exts=("MM", "RR"),
    include_metadata=True,
    include_bigwig=False,
)
Multi-project bundles
Pass an iterable for any of organism, data_source, or
project. discover computes the Cartesian product and produces a
single combined bundle:
multi = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project=["SRP009615", "SRP001558"],
    genomic_units=("gene",),
)
print(f"Combined: {len(multi.resources)} resources, 2 projects.")
# Combined: 15 resources, 2 projects.
The count is 15, not 16: each project contributes 7 project-specific resources
(1 gene count + 1 junction file + 5 metadata tables), and the gene annotation
GTF is shared across both projects, so it is deduplicated to a single resource
(7 × 2 + 1 = 15). Note that the junction artifact is included by default
regardless of genomic_units; pass junction_exts=() to omit it.
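The arithmetic above can be reproduced with a small model of the Cartesian product followed by deduplication. The tuple keys below are hypothetical stand-ins for resources, not recount3 objects; the only assumption carried over from the text is that the annotation GTF depends on the organism and annotation code but not on the project.

```python
from itertools import product

# Table names as used in the search_metadata_files example in this tutorial.
METADATA_TABLES = ("recount_project", "recount_qc", "recount_seq_qc",
                   "recount_pred", "sra")


def enumerate_resources(organisms, data_sources, projects):
    """Model gene-only discovery: one key per resource, deduplicated by a set."""
    resources = set()
    for organism, source, project in product(organisms, data_sources, projects):
        resources.add(("gene_counts", organism, source, project))
        resources.add(("junctions_MM", organism, source, project))
        for table in METADATA_TABLES:
            resources.add(("metadata", organism, source, project, table))
        # Not project-specific: the same key is produced for every project,
        # so the set keeps a single copy.
        resources.add(("annotation_gtf", organism, "G026"))
    return resources


one_project = enumerate_resources(["human"], ["sra"], ["SRP009615"])
two_projects = enumerate_resources(["human"], ["sra"], ["SRP009615", "SRP001558"])
print(len(one_project), len(two_projects))
```

In this model a single project yields 8 keys and two projects yield 15, because the shared annotation key is counted once.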
When a bundle spans more than one (organism, data_source, project)
triple, its organism/data_source/project attributes are left
as None to avoid misrepresenting its identity; per-resource fields
remain authoritative.
Supported values:
organism: "human", "mouse"
data_source: "sra", "gtex", "tcga"
Filtering bundles
filter() returns a new bundle; the original is not mutated. Each keyword
maps to a field on the underlying R3ResourceDescription and
accepts any FieldSpec:
a single string: exact match
an iterable of strings: membership test
a callable (value) -> bool: predicate
gene_counts = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
)
gene_or_exon = bundle.filter(genomic_unit=["gene", "exon"])
gencode_only = bundle.filter(
    annotation_extension=lambda ext: ext and ext.startswith("G"),
)
no_metadata = bundle.filter(resource_type="metadata_files", invert=True)
Convenience aliases provide shortcuts for the most common filters:
only_counts(),
only_metadata(),
bigwigs(),
exclude_metadata().
Note
Filtering on a field that a resource does not have (for example,
filtering on genomic_unit when metadata files have no genomic unit)
excludes those resources from the result. Combine filters explicitly
when this matters: bundle.filter(resource_type=..., genomic_unit=...).
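The three FieldSpec forms, and the way a missing field drops a resource, can be modeled with plain dictionaries. This is a sketch under stated assumptions: `matches` and `filter_records` are hypothetical names, a missing field is represented as None, and `invert` is assumed to negate the combined result of all keyword tests.

```python
def matches(spec, value) -> bool:
    """Test one field value against a string, an iterable, or a predicate."""
    if isinstance(spec, str):
        return value == spec          # exact match; None never equals a string
    if callable(spec):
        return bool(spec(value))      # predicate decides, including for None
    return value in set(spec)         # membership test


def filter_records(records, invert=False, **specs):
    """Keep records for which every keyword spec matches (or none, if inverted)."""
    kept = []
    for record in records:
        ok = all(matches(spec, record.get(field)) for field, spec in specs.items())
        if ok != invert:
            kept.append(record)
    return kept


records = [
    {"resource_type": "count_files_gene_or_exon", "genomic_unit": "gene",
     "annotation_extension": "G026"},
    {"resource_type": "count_files_gene_or_exon", "genomic_unit": "exon",
     "annotation_extension": "G026"},
    {"resource_type": "metadata_files"},  # no genomic_unit field at all
]

genes = filter_records(records, genomic_unit="gene")
gene_or_exon = filter_records(records, genomic_unit=["gene", "exon"])
gencode = filter_records(
    records, annotation_extension=lambda ext: ext and ext.startswith("G")
)
no_metadata = filter_records(records, resource_type="metadata_files", invert=True)
```

Note that the metadata record is excluded from `gene_or_exon` simply because it has no genomic unit, which mirrors the behavior described in the note above.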
Stacking count matrices
stack_count_matrices() concatenates count
DataFrames. It does not take a genomic_unit argument, so filter the
bundle first to choose which family you want:
gene_counts_df = (
    bundle
    .filter(resource_type="count_files_gene_or_exon", genomic_unit="gene")
    .stack_count_matrices(compat="feature")
)
print(gene_counts_df.shape)  # (n_features, n_samples)
# (63856, 12)
junction_counts_df = (
    bundle
    .filter(resource_type="count_files_junctions", junction_extension="MM")
    .stack_count_matrices()
)
Compatibility checking is controlled by compat:
compat="family" (default): gene/exon may mix with gene/exon; junctions stay with junctions.
compat="feature": stricter; the feature space must match exactly: the same genomic unit (gene versus exon) for gene/exon counts, or the same junction subtype for junctions. (The annotation build is not constrained.)
Mixing incompatible resources raises
CompatibilityError.
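One way to picture the two modes is as two grouping keys: resources are compatible when they all map to the same key. The sketch below is an illustration of that idea, not the library's implementation; the dictionaries, `check_compat`, and the local `CompatibilityError` class are stand-ins defined here.

```python
class CompatibilityError(ValueError):
    """Local stand-in for the package's CompatibilityError."""


def family_key(resource):
    # Gene and exon counts share one family; junctions form another.
    return "junction" if resource["genomic_unit"] == "junction" else "gene_or_exon"


def feature_key(resource):
    # Stricter: separate gene from exon, and junction subtypes from each other.
    if resource["genomic_unit"] == "junction":
        return ("junction", resource.get("junction_extension"))
    return ("gene_or_exon", resource["genomic_unit"])


def check_compat(resources, compat="family"):
    if compat not in ("family", "feature"):
        raise ValueError(f"unknown compat mode: {compat!r}")
    key = family_key if compat == "family" else feature_key
    keys = {key(r) for r in resources}
    if len(keys) > 1:
        raise CompatibilityError(
            f"incompatible resources under compat={compat!r}: {sorted(map(str, keys))}"
        )


def is_compatible(resources, compat):
    try:
        check_compat(resources, compat)
    except CompatibilityError:
        return False
    return True


gene = {"genomic_unit": "gene"}
exon = {"genomic_unit": "exon"}
junction = {"genomic_unit": "junction", "junction_extension": "MM"}
```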
Building SummarizedExperiment / RangedSummarizedExperiment from a bundle
The bundle methods below are what create_rse calls internally:
se = bundle.to_summarized_experiment(genomic_unit="gene")
rse = bundle.to_ranged_summarized_experiment(
    genomic_unit="gene",
    annotation_extension="G026",
    allow_fallback_to_se=False,
)
The same functions are available as standalone wrappers in
recount3.se (build_summarized_experiment(),
build_ranged_summarized_experiment()) for symmetry with
create_rse.
Downloading a bundle’s files in parallel
download() materializes every resource in a
bundle to a local destination. Because retrieval is I/O-bound, resources are
fetched concurrently by a pool of worker threads sized by max_workers
(default 8). This is the same mechanism used by the recount3 download
command-line tool:
bundle.download(dest="./downloads", max_workers=8)
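The worker-pool approach described above can be sketched with the standard library's ThreadPoolExecutor. This is a generic illustration of concurrent I/O-bound retrieval, not the package's code: `download_all` and `fake_fetch` are hypothetical, and the fetch function only sleeps instead of touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a thread pool and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results


def fake_fetch(url):
    """Stand-in for a network request: wait briefly, return a fake byte count."""
    time.sleep(0.01)
    return len(url)


urls = [f"http://example.invalid/file_{i}.gz" for i in range(20)]
sizes = download_all(urls, fake_fetch, max_workers=8)
```

Threads suit this workload because each worker spends most of its time waiting on I/O rather than holding the interpreter.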
dest may be a directory (each resource written as a separate file) or a
path ending in .zip (resources written into a single archive). The
cache keyword (named cache_mode on R3Resource.download()) accepts the
same three values as that method: "enable", "update", and "disable".
Layer 3: Individual resources
R3Resource is the lowest level: one file, one URL, one
cache entry, one parser. Use it when you want to download or load a
specific file without going through the bundle machinery.
A resource is built from a description. Descriptions are typed
dataclasses with field validation; the recommended constructor is the
R3ResourceDescription factory, which routes to the
appropriate subclass based on resource_type:
from recount3 import R3Resource, R3ResourceDescription
desc = R3ResourceDescription(
    resource_type="count_files_gene_or_exon",
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP009615",
    annotation_extension="G026",  # required for gene/exon counts
)
res = R3Resource(desc)
print(res.url) # fully-qualified URL on the recount3 mirror
# http://duffel.rail.bio/recount3/human/data_sources/sra/gene_sums/15/SRP009615/sra.gene_sums.SRP009615.G026.gz
res.download(path=None, cache_mode="enable") # cache only, no local copy
df = res.load() # pandas.DataFrame
print(df.shape)
The full description catalog:
Resource type |
Description class |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Downloading
download() has three forms, controlled by
path:
res.download(path=None) # cache only
res.download(path="/data/recount3") # copy into a directory
res.download(path="/data/recount3.zip") # append to a ZIP archive
cache_mode controls cache interaction:
"enable" (default): use the cached copy if present; download if not.
"update": force a fresh download, then overwrite the cache.
"disable": bypass the cache entirely (only valid when path is a directory or .zip).
Loading
load() parses the cached file. The return type
depends on the resource:
Resource type |
|
|---|---|
Gene/exon counts |
|
Junction MM (with ID sidecar) |
|
Junction ID / RR |
|
Metadata tables / source listings |
|
BigWig |
The parsed object is cached on the resource; subsequent load() calls
return the same instance until you call
clear_loaded() (or pass force=True).
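The caching behavior described above amounts to memoizing the parsed object on the instance. A minimal sketch of that idea follows; `CachedLoader` and `parse_file` are hypothetical and only model the load(), force=True, and clear_loaded() semantics stated in the text.

```python
_UNSET = object()  # sentinel, so that a parsed value of None is still cached


class CachedLoader:
    """Parse once, then hand back the same object until cleared or forced."""

    def __init__(self, parse):
        self._parse = parse
        self._loaded = _UNSET

    def load(self, force=False):
        if force or self._loaded is _UNSET:
            self._loaded = self._parse()
        return self._loaded

    def clear_loaded(self):
        self._loaded = _UNSET


calls = []


def parse_file():
    """Stand-in parser that records how many times it has run."""
    calls.append(1)
    return {"parse_number": len(calls)}


loader = CachedLoader(parse_file)
first = loader.load()
second = loader.load()            # same instance, parser not re-run
third = loader.load(force=True)   # parser re-run, new instance
loader.clear_loaded()
fourth = loader.load()            # parser re-run after clearing
```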
Searching without a bundle
If you want a flat list of resources rather than a bundle, the
recount3.search helpers return list[R3Resource] directly. Each
takes StringOrIterable for every parameter and
returns one resource per Cartesian-product combination:
from recount3 import (
    search_count_files_gene_or_exon,
    search_metadata_files,
    search_bigwig_files,
)
counts = search_count_files_gene_or_exon(
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP009615",
    annotation_extension="G026",
)
meta = search_metadata_files(
    organism="human",
    data_source="sra",
    project="SRP009615",
    table_name=("recount_project", "recount_qc", "recount_seq_qc",
                "recount_pred", "sra"),
)
bigwigs = search_bigwig_files(
    organism="human",
    data_source="sra",
    project="SRP009615",
    sample=["SRR387777", "SRR387778"],
)
The single-call equivalent is search_project_all()
(used internally by R3ResourceBundle.discover).
Working with sample metadata
When create_rse or
to_ranged_summarized_experiment()
assembles an RSE, it merges all available per-project metadata tables
into column_data, namespacing non-key columns by their table of
origin (e.g. recount_qc__star__all_mapped_reads).
Access it as a pandas DataFrame:
col_df = rse.get_column_data().to_pandas()
col_df.columns[:10]
Expanding SRA sample attributes
In an assembled RSE, SRA samples carry an sra__sample_attributes column
(the sra metadata table namespaced with __ as described above) that
encodes key–value pairs in the form "age;;67.78|disease;;Control|...".
recount3.se.expand_sra_attributes() parses these into separate columns.
It accepts either a DataFrame or an SE/RSE object, and recognizes both the
namespaced sra__sample_attributes name and the R-style
sra.sample_attributes spelling. Each parsed attribute becomes a new column
named sra_attribute.<key> (for example, sra_attribute.disease):
from recount3.se import expand_sra_attributes
rse2 = expand_sra_attributes(rse)
col_df = rse2.get_column_data().to_pandas()
sra_cols = [c for c in col_df.columns if c.startswith("sra_attribute.")]
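To make the encoding concrete, the following sketch parses one attribute string of the form shown above into sra_attribute.<key> entries. It is a simplified stand-in, not expand_sra_attributes itself: `parse_sample_attributes` is a hypothetical name, and any key normalization the real helper may perform is not modeled.

```python
def parse_sample_attributes(raw):
    """Split 'key;;value|key;;value' into a dict of sra_attribute.<key> entries."""
    parsed = {}
    if not raw:
        return parsed
    for pair in raw.split("|"):
        key, separator, value = pair.partition(";;")
        if not separator:
            continue  # skip fragments that are not key;;value pairs
        parsed[f"sra_attribute.{key.strip()}"] = value.strip()
    return parsed


row = parse_sample_attributes("age;;67.78|disease;;Control")
print(row)
```

Values stay as strings in this sketch; converting columns such as age to numbers is left to the caller.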
Normalization and scaling
recount3 distributes coverage-sum counts ("raw_counts" assay), not
read counts. recount3.se provides recount3-compatible helpers to
convert and normalize them. All require a
RangedSummarizedExperiment.
Approximate read counts
from recount3.se import compute_read_counts
reads = compute_read_counts(rse) # pandas DataFrame, integer-rounded
Values are rounded to whole reads by default; pass round_to_integers=False
to retain the fractional estimates.
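As a rough picture of the conversion, the sketch below divides each sample's coverage sums by that sample's average mapped read length. This follows the approach of the R recount3 package as best understood here and is an assumption about what compute_read_counts does internally; `approximate_read_counts` and its inputs are hypothetical plain-Python stand-ins.

```python
def approximate_read_counts(coverage_sums, avg_mapped_lengths, round_to_integers=True):
    """Convert a feature x sample matrix of coverage sums to approximate reads.

    coverage_sums: list of rows (features), each a list of per-sample values.
    avg_mapped_lengths: one average mapped read length per sample (column).
    """
    reads = []
    for row in coverage_sums:
        converted = [value / length for value, length in zip(row, avg_mapped_lengths)]
        if round_to_integers:
            # Python's round() rounds exact halves to the nearest even integer.
            converted = [round(v) for v in converted]
        reads.append(converted)
    return reads


coverage = [[200.0, 150.0], [260.0, 75.0]]
lengths = [100.0, 75.0]
rounded = approximate_read_counts(coverage, lengths)
fractional = approximate_read_counts(coverage, lengths, round_to_integers=False)
```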
Per-sample scale factors
Two methods are supported, matching the R recount3 reference:
from recount3.se import compute_scale_factors, transform_counts
sf_auc = compute_scale_factors(rse, by="auc")
sf_mapreads = compute_scale_factors(rse, by="mapped_reads")
Apply scale factors to the assay:
scaled = transform_counts(rse, by="auc") # default
scaled = transform_counts(rse, by="mapped_reads", target_read_count=4e7)
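For intuition, AUC-based scaling can be sketched as multiplying each sample column by target / AUC, so that every sample is brought to the same total coverage. This mirrors the R recount3 convention as understood here and is an assumption about the Python helpers; `scale_by_auc` is a hypothetical function, and the mapped_reads method is not modeled.

```python
def scale_by_auc(counts, auc, target_read_count=4e7):
    """Scale a feature x sample matrix so each sample matches the target size.

    counts: list of rows (features), each a list of per-sample values.
    auc: one area-under-coverage value per sample (column).
    """
    factors = [target_read_count / a for a in auc]
    return [[value * factor for value, factor in zip(row, factors)] for row in counts]


counts = [[10.0, 10.0], [4.0, 1.0]]
auc = [8e7, 2e7]  # made-up values: sample 1 is sequenced deeper than sample 2
scaled_counts = scale_by_auc(counts, auc)
```

Here the deeper sample is scaled down by 0.5 and the shallower one scaled up by 2.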
TPM (gene/exon only, needs feature widths)
from recount3.se import compute_tpm
tpm = compute_tpm(rse) # requires rowRanges with widths
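The reason feature widths are needed is visible in the standard TPM definition: counts are first divided by width, then rescaled so each sample sums to one million. The sketch below implements that textbook formula for a single sample; `tpm` is a hypothetical function, and whether compute_tpm applies additional steps beforehand is not assumed.

```python
def tpm(counts, widths):
    """Transcripts per million for one sample.

    counts: per-feature read counts; widths: per-feature lengths in bases.
    """
    rates = [count / width for count, width in zip(counts, widths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1e6 for rate in rates]


values = tpm([10, 20, 30], [1000, 2000, 1000])
```

The first two features receive the same TPM even though their counts differ, because the second feature is twice as long.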
recount3.se.is_paired_end() and the other helpers documented in
recount3.se accept either a DataFrame of metadata or an SE/RSE
object. See the API reference for full signatures.
BigWig coverage
Per-sample BigWig coverage files are not included by default; pass
include_bigwig=True (or use the search_bigwig_files helper)
to add them. Requires the bigwig extra.
bundle = R3ResourceBundle.discover(
organism="human",
data_source="sra",
project="SRP009615",
include_bigwig=True,
)
for res, bw in bundle.iter_bigwig():
    with bw:
        print(res.description.sample, bw.chroms("chr1"))
        mean_chr1 = bw.stats("chr1", 0, 1_000_000, type="mean")[0]
BigWigFile is a thin wrapper around
pyBigWig. Its main methods are chroms(), header(), values(),
stats(), intervals(), and close():
bw_res = bundle.bigwigs().resources[0]
bw = bw_res.load() # BigWigFile
with bw:  # closes the handle on exit
    values = bw.values("chr1", 0, 1000, numpy=True)
Cache and configuration
Downloaded files are stored under ~/.cache/recount3/files by default.
The recount3.config helpers let you inspect and prune the cache:
from recount3 import (
    recount3_cache,
    recount3_cache_files,
    recount3_cache_rm,
)
print(recount3_cache()) # cache directory Path
files = recount3_cache_files(pattern="*.gtf.gz")
# Dry-run a deletion first:
to_remove = recount3_cache_rm(
    predicate=lambda p: ".junctions." in p.name,
    dry_run=True,
)
recount3_cache_rm(predicate=lambda p: ".junctions." in p.name)
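The dry-run-then-delete workflow above can be tried safely on a throwaway directory. The sketch below is a generic model using pathlib and tempfile: `prune_cache` is a hypothetical function, the file names are made up, and nothing here touches the real recount3 cache.

```python
import tempfile
from pathlib import Path


def prune_cache(cache_dir, predicate, dry_run=False):
    """Return files matching predicate; delete them unless dry_run is set."""
    matches = sorted(
        p for p in Path(cache_dir).rglob("*") if p.is_file() and predicate(p)
    )
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


def is_junction(path):
    return ".junctions." in path.name


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "example.junctions.MM.gz").write_text("placeholder")
    (root / "example.gene_sums.G026.gz").write_text("placeholder")

    planned = [p.name for p in prune_cache(root, is_junction, dry_run=True)]
    after_dry_run = sorted(p.name for p in root.iterdir())

    removed = [p.name for p in prune_cache(root, is_junction)]
    after_removal = sorted(p.name for p in root.iterdir())
```

Reviewing the dry-run list before the second call is the point of the pattern: the predicate is evaluated identically in both passes.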
Configuration precedence is, from lowest to highest: library defaults,
environment variables, an explicit Config passed to a
resource or search function. The supported environment variables are:
Variable |
Effect |
|---|---|
|
Base URL of the recount3 mirror |
|
On-disk cache directory |
|
|
|
Network timeout (seconds) |
|
Transient-error retry attempts |
|
|
|
Custom |
|
Streaming chunk size (bytes) |
Note
recount3 publishes the same file layout on several interchangeable public
mirrors, so RECOUNT3_URL may point at any of them: the Duffel load
balancer (http://duffel.rail.bio/recount3/, the default), AWS Open Data
(https://recount-opendata.s3.amazonaws.com/recount3/release/), or JHU
IDIES (https://data.idies.jhu.edu/recount3/data/). RECOUNT3_INSECURE_SSL
affects only https mirrors; it is a no-op for the default http mirror.
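The precedence order described in this section (library default, then environment variable, then explicit configuration) can be sketched for a single setting. Only RECOUNT3_URL and the default mirror URL are taken from the text; `resolve_setting` and the lookup tables are hypothetical, and the environment is passed in as a dictionary so the example does not depend on the real process environment.

```python
import os

DEFAULTS = {"base_url": "http://duffel.rail.bio/recount3/"}
ENV_VARS = {"base_url": "RECOUNT3_URL"}


def resolve_setting(name, explicit=None, environ=None):
    """Resolve one setting: explicit value, else environment, else default."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    env_name = ENV_VARS.get(name)
    if env_name and environ.get(env_name):
        return environ[env_name]
    return DEFAULTS[name]


jhu = "https://data.idies.jhu.edu/recount3/data/"
aws = "https://recount-opendata.s3.amazonaws.com/recount3/release/"

from_default = resolve_setting("base_url", environ={})
from_env = resolve_setting("base_url", environ={"RECOUNT3_URL": jhu})
from_explicit = resolve_setting("base_url", explicit=aws,
                                environ={"RECOUNT3_URL": jhu})
```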
For programmatic use, construct a Config and pass it
explicitly:
from pathlib import Path
from recount3 import Config, R3Resource, R3GeneOrExonCounts
cfg = Config(
    base_url="http://duffel.rail.bio/recount3/",
    timeout=60,
    insecure_ssl=False,
    max_retries=5,
    user_agent="my-pipeline/0.1",
    cache_dir=Path("/scratch/recount3_cache"),
    cache_disabled=False,
    chunk_size=1024 * 1024,
)
res = R3Resource(
    R3GeneOrExonCounts(
        organism="human", data_source="sra", genomic_unit="gene",
        project="SRP009615", annotation_extension="G026",
    ),
    config=cfg,
)
Errors and troubleshooting
All recount3 exceptions derive from Recount3Error,
so a single except clause catches every package-specific failure:
Exception |
Raised when |
|---|---|
|
Bad config (env var, cache dir, option combinations) |
|
Network/I-O failure during download |
|
Cached file parsed empty, malformed, or shape-mismatched |
|
Incompatible resources combined in a stack/build |
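The benefit of a shared base class is that callers can handle every package failure in one place. The sketch below demonstrates the pattern with local stand-in classes that reuse the names Recount3Error and CompatibilityError from this section; it does not import recount3, and `run_step` is a hypothetical wrapper.

```python
class Recount3Error(Exception):
    """Local stand-in for the package's base exception."""


class CompatibilityError(Recount3Error):
    """Local stand-in for one of the package's specific exceptions."""


def run_step(step):
    """Run a callable and report package-specific failures uniformly."""
    try:
        step()
    except Recount3Error as exc:
        # Any subclass of the base exception lands here.
        return f"recount3 failure: {type(exc).__name__}: {exc}"
    return "ok"


def bad_stack():
    raise CompatibilityError("gene counts cannot be stacked with junctions")


def fine():
    return None


failure_report = run_step(bad_stack)
success_report = run_step(fine)
```

Unrelated exceptions such as KeyError are deliberately not caught, so programming errors still surface normally.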
Common pitfalls
ImportError: summarizedexperiment is required
Install the BiocPy extra: pip install "recount3[biocpy]".
KeyError: Missing required field: annotation_extension
Gene and exon descriptions need an annotation code. Pass it explicitly (annotation_extension="G026") or use create_rse, which resolves a default for you.
TypeError: stack_count_matrices() got an unexpected keyword 'genomic_unit'
Filter the bundle before calling stack: bundle.filter(genomic_unit="gene").stack_count_matrices().
ValueError: Could not derive genomic ranges …
The required GTF (gene/exon) or RR file (junction) was not in the bundle, or the annotation code does not match the count files. Either include the right annotation, or pass allow_fallback_to_se=True.
CompatibilityError: Incompatible count families …
You tried to stack gene/exon counts together with junctions. Filter to one family first, or stack each family separately.
Where to go next
API Reference: full per-symbol reference for all public modules.
CLI Reference: the recount3 command-line tool, which mirrors this API as a discover -> manifest -> materialize workflow.
The recount3 raw-files documentation describes the underlying file layout (URLs, sharding, annotation codes). Note that this upstream page (not this tutorial) contains several inaccuracies.