Tutorial
This tutorial walks through common workflows using the recount3 Python
package. All examples assume the package is installed:
pip install "recount3[biocpy]"
For BigWig support add the bigwig extra as well:
pip install "recount3[biocpy,bigwig]"
Discovering Resources for a Project
R3ResourceBundle is the main entry point.
Call discover() to fetch every
resource associated with a project:
from recount3 import R3ResourceBundle
bundle = R3ResourceBundle.discover(
organism="human",
data_source="sra",
project="SRP009615",
)
print(f"Found {len(bundle.resources)} resources.")
Supported organisms are "human" and "mouse". Supported data sources
are "sra", "gtex", and "tcga".
Stacking Count Matrices
Use stack_count_matrices() to
concatenate all gene-level (or exon-level, junction-level) count files
from a bundle into a single pandas.DataFrame:
counts = bundle.only_counts().stack_count_matrices(genomic_unit="gene")
print(counts.shape) # (n_genes, n_samples)
Building a SummarizedExperiment
Requires the biocpy extra (biocframe, summarizedexperiment).
se = bundle.to_summarized_experiment(genomic_unit="gene")
print(se)
Building a RangedSummarizedExperiment
A RangedSummarizedExperiment attaches
genomic coordinate ranges to each feature row. A GTF annotation file is
required; include it in the bundle via search_project_all (done
automatically by discover()) or
add an annotation resource manually.
rse = bundle.to_ranged_summarized_experiment(genomic_unit="gene")
print(rse)
print(rse.shape) # (n_genes, n_samples)
print(rse.colnames[:5]) # first five sample IDs
Accessing Metadata
Column metadata (sample annotations) is merged from every available metadata table and aligned to the count columns automatically:
col_data = rse.coldata.to_pandas()
print(col_data.columns.tolist())
Filtering Bundles
Bundles are immutable snapshots. Use filter()
to narrow down by any description field:
gene_only = bundle.filter(genomic_unit="gene")
annots = bundle.filter(resource_type="annotations")
metadata = bundle.only_metadata()
Working with Individual Resources
Each R3Resource knows its URL and can be
downloaded and loaded independently:
from recount3 import R3GeneOrExonCounts, R3Resource
desc = R3GeneOrExonCounts(
organism="human",
data_source="sra",
genomic_unit="gene",
project="SRP009615",
sample="SRR387777",
)
res = R3Resource(desc)
res.download(path=None, cache_mode="enable") # saves to local cache
df = res.load() # returns pd.DataFrame
print(df.head())
Searching Without a Bundle
The recount3.search helpers return lists of
R3Resource objects directly:
from recount3 import search_count_files_gene_or_exon
resources = search_count_files_gene_or_exon(
organism="human",
data_source="sra",
genomic_unit="gene",
project="SRP009615",
)
print(f"{len(resources)} count files found.")
Cache Management
Downloaded files are cached under ~/.cache/recount3/files by default.
Use the recount3.config helpers to inspect or clear the cache:
from recount3 import recount3_cache, recount3_cache_files, recount3_cache_rm
print(recount3_cache()) # Path to the cache directory
print(recount3_cache_files()) # List all cached files
# Remove only junction files (dry run first):
removed = recount3_cache_rm(
predicate=lambda p: ".junctions." in p.name,
dry_run=True,
)
print(f"Would remove {len(removed)} files.")
The cache directory and other settings can be overridden via environment
variables (RECOUNT3_CACHE_DIR, RECOUNT3_URL, etc.) or by
constructing a custom Config and passing it to
any resource or search function.
SRA Attribute Expansion
recount3 stores SRA sample attributes as a single pipe-delimited string
column. recount3.se.build_ranged_summarized_experiment() expands
them automatically, but you can also call the helper directly:
from recount3.se import expand_sra_attributes
rse_expanded = expand_sra_attributes(rse)
col_data = rse_expanded.coldata.to_pandas()
# Columns like "sra_attribute.age", "sra_attribute.disease" are now present.