API Reference

Typed Python API for the recount3 RNA-seq data repository.

recount3 is a large-scale, uniformly processed RNA-seq resource covering tens of thousands of human and mouse samples across multiple data sources (SRA, GTEx, TCGA). This package provides a small, typed Python API for discovering, downloading, and loading recount3 data files.

The public surface is intentionally flat: core classes and helper functions are re-exported here for discovery and convenience.

Typical usage example: high-level (recommended for most cases):

from recount3 import create_rse

rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",
)

Typical usage example: lower-level (multi-project or custom workflows):

from recount3 import R3ResourceBundle

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project=["SRP009615", "SRP001558"],
)
counts = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
).stack_count_matrices()

Note

Several features require optional dependencies:

BiocPy (biocframe, summarizedexperiment, genomicranges): required for create_rse(), to_ranged_summarized_experiment(), and the compute utilities in recount3.se. Install with pip install "recount3[biocpy]".
pyBigWig: required for BigWig file access via BigWigFile. Install with pip install "recount3[bigwig]".

recount3.config

Configuration helpers for recount3.

This module centralizes configuration so that environment-dependent values are not hidden as mutable module globals. Values are read once via default_config() and can be overridden by constructing Config directly. CLI flags in recount3.cli take the highest precedence and override both environment variables and Config defaults.

The module also exposes three cache utility functions: recount3_cache(), recount3_cache_files(), and recount3_cache_rm().

Environment variables (all optional):

RECOUNT3_URL: base URL of the recount3 mirror (default: http://duffel.rail.bio/recount3/).
RECOUNT3_CACHE_DIR: directory for the on-disk file cache (default: ~/.cache/recount3/files).
RECOUNT3_CACHE_DISABLE: set to "1" to disable caching entirely.
RECOUNT3_HTTP_TIMEOUT: HTTP request timeout in seconds (default: 60).
RECOUNT3_MAX_RETRIES: maximum retry attempts for transient errors (default: 3).
RECOUNT3_INSECURE_SSL: set to "1" to skip TLS certificate verification. This only affects https:// base URLs; it is a no-op for the default http:// Duffel mirror (see “Mirrors” below).
RECOUNT3_USER_AGENT: custom User-Agent header string.
RECOUNT3_CHUNK_SIZE: streaming chunk size in bytes (default: 1048576, i.e. 1 MiB).

Mirrors:

recount3 publishes the same relative file layout on several interchangeable public mirrors; RECOUNT3_URL / base_url works with any of them unchanged:

Duffel load balancer (default): http://duffel.rail.bio/recount3/
AWS Open Data: https://recount-opendata.s3.amazonaws.com/recount3/release/
JHU IDIES (Dataverse): https://data.idies.jhu.edu/recount3/data/

The package is coupled to recount3’s layout convention rather than to any one host, so switching mirrors is purely a base_url change. TLS settings apply only to https:// endpoints: the default Duffel mirror is plain http (no TLS, so insecure_ssl is irrelevant), the AWS and JHU mirrors are https with valid certificates (no flag needed), and RECOUNT3_INSECURE_SSL / --insecure-ssl is meaningful only for an https endpoint presenting an untrusted or self-signed certificate.
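As an illustration of why switching mirrors is only a base_url change, the following standard-library sketch joins one relative path onto each mirror listed above. The relative path is a made-up placeholder rather than a real recount3 file, and needs_tls_settings is a hypothetical helper that is not part of the package.

```python
from urllib.parse import urljoin, urlsplit

# The three public mirrors listed above.
MIRRORS = {
    "duffel": "http://duffel.rail.bio/recount3/",
    "aws": "https://recount-opendata.s3.amazonaws.com/recount3/release/",
    "jhu": "https://data.idies.jhu.edu/recount3/data/",
}

# Placeholder relative path (assumption, for illustration only).
RELATIVE_PATH = "some/relative/file.gz"


def needs_tls_settings(base_url: str) -> bool:
    """Hypothetical helper: TLS options only matter for https:// endpoints."""
    return urlsplit(base_url).scheme == "https"


# Because every base URL ends with "/", the relative path is appended as-is.
urls = {name: urljoin(base, RELATIVE_PATH) for name, base in MIRRORS.items()}

for name, url in urls.items():
    relevant = needs_tls_settings(MIRRORS[name])
    print(f"{name}: {url} (TLS settings relevant: {relevant})")
```

The trailing slash on each base URL matters here: without it, urljoin would replace the last path segment instead of appending to it.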

Typical usage example:

import dataclasses
from pathlib import Path
from recount3.config import default_config, recount3_cache, recount3_cache_rm

# Read the current cache directory (creates it if absent):
cache_dir = recount3_cache()

# Use a custom cache location for this session (override one field of
# the environment-derived defaults; Config is immutable):
cfg = dataclasses.replace(
    default_config(), cache_dir=Path("/scratch/recount3_cache")
)
custom_cache_dir = recount3_cache(cfg)

# Remove cached files matching a predicate (dry run first, then delete):
to_delete = recount3_cache_rm(predicate=lambda p: "sra" in str(p), dry_run=True)
recount3_cache_rm(predicate=lambda p: "sra" in str(p))

class recount3.config.Config(base_url: str, timeout: int, insecure_ssl: bool, max_retries: int, user_agent: str, cache_dir: Path, cache_disabled: bool, chunk_size: int = 1048576)[source]

Immutable configuration bag.

Parameters:

base_url (str)
timeout (int)
insecure_ssl (bool)
max_retries (int)
user_agent (str)
cache_dir (Path)
cache_disabled (bool)
chunk_size (int)

base_url

Base URL for the recount3 mirror (ends with a trailing slash). Defaults to the Duffel load balancer; the AWS Open Data and JHU IDIES mirrors serve the same layout (see the module docstring “Mirrors”).

Type:: str

timeout

Network timeout in seconds.

Type:: int

insecure_ssl

True to disable TLS certificate verification (not recommended). Only affects https:// base URLs; a no-op for the default http:// Duffel mirror.

Type:: bool

max_retries

Max HTTP retry attempts for transient errors.

Type:: int

user_agent

Custom HTTP User-Agent.

Type:: str

cache_dir

Cache directory for downloaded files.

Type:: pathlib.Path

cache_disabled

If True, disable cache behavior globally.

Type:: bool

chunk_size

Default chunk size in bytes for streaming copies.

Type:: int

recount3.config.default_config() → Config[source]

Return configuration constructed from environment variables.

Returns:: A Config populated from the environment.
Return type:: Config

Notes

Values are parsed to sensible types and the base URL is normalized to include a trailing slash (matching the original behavior).
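A minimal sketch of this kind of parsing, assuming the environment variable names and defaults documented above. SketchConfig and sketch_default_config are hypothetical stand-ins (covering only a subset of fields), not the package's implementation; in particular, treating only the literal "1" as true is an assumption based on the variable descriptions.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for Config (subset of fields)."""

    base_url: str
    timeout: int
    max_retries: int
    insecure_ssl: bool
    cache_dir: Path
    cache_disabled: bool
    chunk_size: int


def sketch_default_config(env: Mapping[str, str] = os.environ) -> SketchConfig:
    """Parse string-valued environment entries into typed fields."""
    base_url = env.get("RECOUNT3_URL", "http://duffel.rail.bio/recount3/")
    if not base_url.endswith("/"):
        base_url += "/"  # normalize to a trailing slash
    cache_dir = env.get("RECOUNT3_CACHE_DIR", "~/.cache/recount3/files")
    return SketchConfig(
        base_url=base_url,
        timeout=int(env.get("RECOUNT3_HTTP_TIMEOUT", "60")),
        max_retries=int(env.get("RECOUNT3_MAX_RETRIES", "3")),
        insecure_ssl=env.get("RECOUNT3_INSECURE_SSL") == "1",
        cache_dir=Path(cache_dir).expanduser(),
        cache_disabled=env.get("RECOUNT3_CACHE_DISABLE") == "1",
        chunk_size=int(env.get("RECOUNT3_CHUNK_SIZE", str(1024 * 1024))),
    )


# Pass an explicit mapping instead of the real environment for the demo.
cfg = sketch_default_config(
    {
        "RECOUNT3_URL": "https://example.org/recount3",
        "RECOUNT3_HTTP_TIMEOUT": "30",
        "RECOUNT3_CACHE_DIR": "/tmp/recount3_sketch",
    }
)
print(cfg.base_url, cfg.timeout, cfg.chunk_size)
```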

recount3.config.recount3_cache(config: Config | None = None) → Path[source]

Return the cache directory used for recount3 downloads.

This helper normalizes and materializes the cache directory based on the provided configuration (or the default configuration when omitted).

Parameters:: config (Config | None) – Optional configuration. If None, default_config() is used.
Returns:: Absolute pathlib.Path to the cache directory.
Return type:: Path

recount3.config.recount3_cache_files(config: Config | None = None, *, pattern: str | None = None) → list[Path][source]

List cached files managed by recount3.

Parameters:

config (Config | None) – Optional configuration. If None, default_config() is used.
pattern (str | None) – Optional glob-style pattern (as accepted by pathlib.Path.rglob()) to filter files relative to the cache root, for example "*.tsv.gz" or "*__SRP123456*". If None, all files are returned.

Returns:

A sorted list of pathlib.Path objects pointing to cached files. If the cache directory does not exist yet, an empty list is returned.

Return type:

list[Path]
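The listing behavior described above can be approximated with pathlib alone. In this sketch, sketch_cache_files is a hypothetical function operating on a temporary directory; the file names are invented and the real cache layout may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional


def sketch_cache_files(cache_root: Path, pattern: Optional[str] = None) -> list[Path]:
    """Return a sorted list of files under cache_root, optionally glob-filtered."""
    if not cache_root.is_dir():
        return []  # cache not created yet
    return sorted(p for p in cache_root.rglob(pattern or "*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ab").mkdir()
    (root / "ab" / "counts__SRP123456.tsv.gz").write_bytes(b"")
    (root / "ab" / "notes.txt").write_bytes(b"")

    all_names = [p.name for p in sketch_cache_files(root)]
    gz_names = [p.name for p in sketch_cache_files(root, "*.tsv.gz")]
    missing = sketch_cache_files(root / "does_not_exist")

print(all_names, gz_names, missing)
```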

recount3.config.recount3_cache_rm(*, config: Config | None = None, predicate: Callable[[Path], bool] | None = None, dry_run: bool = False) → list[Path][source]

Remove cached files that match a predicate.

This helper is analogous to the R-side recount3_cache_rm(): it walks the cache directory and removes any file for which predicate returns True. Directories are left in place.

Parameters:

config (Config | None) – Optional configuration. If None, default_config() is used.
predicate (Callable[[Path], bool] | None) – Callable taking a pathlib.Path and returning True if the file should be removed. If None, all cached files are selected.
dry_run (bool) – If True, do not delete any files and only report which paths would be removed.

Returns:

A sorted list of pathlib.Path objects that were removed (or would be removed when dry_run is True).

Raises:

OSError – If filesystem operations fail during deletion.

Return type:

list[Path]

Examples

List everything that would be removed, without deleting:

to_delete = recount3_cache_rm(dry_run=True)

Remove all cached files (empty the cache):

recount3_cache_rm()

Remove only files related to the "sra" data source:

recount3_cache_rm(predicate=lambda p: "sra" in str(p))
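To make the dry-run and predicate semantics concrete without touching a real cache, here is a self-contained sketch on a temporary directory. sketch_cache_rm is a hypothetical re-creation of the documented behavior (files removed, directories left in place), not the package's code.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def sketch_cache_rm(
    cache_root: Path,
    predicate: Optional[Callable[[Path], bool]] = None,
    dry_run: bool = False,
) -> list[Path]:
    """Select files matching predicate; delete them unless dry_run is set."""
    selected = sorted(
        p
        for p in cache_root.rglob("*")
        if p.is_file() and (predicate is None or predicate(p))
    )
    if not dry_run:
        for p in selected:
            p.unlink()  # files only; directories are left in place
    return selected


def in_sra_dir(p: Path) -> bool:
    return p.parent.name == "sra"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("sra", "gtex"):
        (root / sub).mkdir()
        (root / sub / f"{sub}.gz").write_bytes(b"x")

    preview = sketch_cache_rm(root, predicate=in_sra_dir, dry_run=True)
    still_there = (root / "sra" / "sra.gz").exists()  # dry run deleted nothing
    removed = sketch_cache_rm(root, predicate=in_sra_dir)
    remaining = sorted(p.name for p in root.rglob("*") if p.is_file())
    dir_kept = (root / "sra").is_dir()

print(remaining, dir_kept)
```

Running the dry run with the same predicate as the real call is what makes the preview meaningful: the two returned lists are identical.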

recount3.resource

Resource orchestration: URL resolution, caching, downloading, and loading.

This module defines R3Resource, the central class of the recount3 package. Every downloadable file (count matrices, metadata tables, annotation GTFs, BigWig coverage files, junction counts) is represented as an R3Resource. The class manages the full lifecycle of a resource:

Description -> URL: an R3ResourceDescription provides the structured parameters (organism, project, genomic unit, …) that are used to construct the deterministic recount3 mirror URL.
URL -> cache: download() fetches the file over HTTP and stores it in a persistent on-disk cache keyed by URL hash.
Cache -> materialization: the cached file can be hard-linked or copied to a user-supplied directory, or appended to a ZIP archive.
Cache -> in-memory object: load() parses the cached file into an appropriate Python object (see Notes below).
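One way a cache "keyed by URL hash" could work is sketched below. The hash function (SHA-256), the 16-character prefix, and the file naming are all assumptions made for illustration; the package's actual key scheme is not specified here. The URLs are placeholders.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlsplit


def sketch_cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a deterministic cache file path (hypothetical scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    basename = Path(urlsplit(url).path).name
    return cache_dir / f"{digest}__{basename}"


cache_dir = Path("/tmp/recount3_sketch")
url_a = "http://duffel.rail.bio/recount3/some/relative/file.gz"
url_b = "https://data.idies.jhu.edu/recount3/data/some/relative/file.gz"

path_a = sketch_cache_path(cache_dir, url_a)
path_b = sketch_cache_path(cache_dir, url_b)

# Same URL -> same path; a different mirror URL -> a different cache entry.
print(path_a.name)
print(path_b.name)
```

Under a scheme like this, the same file fetched from two mirrors occupies two cache entries, because the key is the full URL rather than the relative path.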

Typical usage example:

from recount3 import R3Resource, R3GeneOrExonCounts

desc = R3GeneOrExonCounts(
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP009615",
    annotation_extension="G026",
)
res = R3Resource(desc)

# Cache the file (no local copy):
res.download(path=None, cache_mode="enable")

# Or copy into a directory:
dest = res.download(path="/data/recount3")

# Parse into a DataFrame (for count resources):
df = res.load()

Note

Downloads are protected by a module-level threading.Lock, so multiple threads can safely call download() on resources sharing a common cache path without corrupting the cache.
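The locking pattern described in this note can be sketched as follows. Everything here is hypothetical scaffolding: sketch_fetch stands in for the HTTP transfer by writing fixed bytes, and the check-then-fetch step inside the lock is an assumed design, not a description of the package's internals.

```python
import tempfile
import threading
from pathlib import Path

_DOWNLOAD_LOCK = threading.Lock()  # module-level lock shared by all callers
fetch_log: list[str] = []  # records how many real "fetches" happened


def sketch_fetch(dest: Path) -> None:
    """Stand-in for the HTTP transfer: write to a temp name, then rename."""
    fetch_log.append(str(dest))
    part = dest.with_suffix(dest.suffix + ".part")
    part.write_bytes(b"payload")
    part.replace(dest)


def sketch_download(dest: Path) -> Path:
    """Fetch dest at most once, even when called from many threads."""
    with _DOWNLOAD_LOCK:
        if not dest.exists():
            sketch_fetch(dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    dest_path = Path(tmp) / "file.gz"
    threads = [
        threading.Thread(target=sketch_download, args=(dest_path,))
        for _ in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    content = dest_path.read_bytes()

print(len(fetch_log), content)
```

A single global lock serializes all downloads; it trades throughput for simplicity, which is a reasonable choice when correctness of the shared cache matters more than parallel transfer speed.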

Note

load() returns different types depending on the resource type:

Gene/exon count resources -> pandas.DataFrame (features x samples).
Junction MM resources -> a sparse-backed pandas.DataFrame (built via pandas.DataFrame.sparse.from_spmatrix()).
Junction ID/RR resources -> pandas.DataFrame.
BigWig resources -> BigWigFile.

recount3.resource.build_url(resource_type: str, *, config: Config | None = None, **fields: Any) → str[source]

Build the absolute URL for a resource from its description fields.

Constructs the appropriate R3ResourceDescription subclass via the type registry and joins its repository-relative path onto the configured base URL. The per-resource-type path logic lives in each description’s url_path(); this function only supplies the base URL and the join, mirroring what R3Resource.__post_init__() does. Useful when only a URL is needed and a full R3Resource would be overkill (e.g. populating a BigWigURL metadata column).

Parameters:

resource_type (str) – Registered resource-type key (e.g. "bigwig_files").
config (Config | None) – Optional Config; the global default is used when omitted.
**fields (Any) – Description fields for the resource type (organism, data_source, project, sample, …).

Returns:

The full, absolute network URL for the resource.

Raises:

KeyError, ValueError – If resource_type is unknown or the supplied fields are invalid for it (propagated from the description factory).

Return type:

str
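The registry-plus-join mechanism described above can be illustrated with a toy version. SketchBigWig, SKETCH_REGISTRY, and sketch_build_url are hypothetical, the path layout returned by url_path() is invented, and the sample identifier is a placeholder; only the overall shape (look up the type, build the description, join onto the base URL) follows the description above.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class SketchBigWig:
    """Hypothetical description class; the path layout is invented."""

    organism: str
    project: str
    sample: str

    def url_path(self) -> str:
        return f"{self.organism}/{self.project}/{self.sample}.bw"


# Hypothetical type registry: resource-type key -> description class.
SKETCH_REGISTRY = {"bigwig_files": SketchBigWig}


def sketch_build_url(resource_type: str, *, base_url: str, **fields: str) -> str:
    """Look up the description class, build it, and join onto base_url."""
    try:
        cls = SKETCH_REGISTRY[resource_type]
    except KeyError:
        raise KeyError(f"unknown resource type: {resource_type!r}") from None
    try:
        description = cls(**fields)
    except TypeError as exc:  # missing or unexpected fields
        raise ValueError(f"invalid fields for {resource_type!r}: {exc}") from exc
    return urljoin(base_url, description.url_path())


url = sketch_build_url(
    "bigwig_files",
    base_url="http://example.org/recount3/",
    organism="human",
    project="SRP009615",
    sample="SAMPLE_ID",  # placeholder sample identifier
)
print(url)
```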

class recount3.resource.R3Resource(description: R3ResourceDescription, url: str | None = None, filepath: str | None = None, config: Config | None = None)[source]

Manage one recount3 file: URL, cache, materialization, loading.

Parameters:

description (R3ResourceDescription)
url (str | None)
filepath (str | None)
config (Config | None)

description

The R3ResourceDescription that identifies the resource and determines its repository-relative path.

Type:: recount3._descriptions.R3ResourceDescription

url

The full, absolute network URL pointing to the remote resource. If not explicitly provided during initialization, it is derived by joining the configured base URL with the description’s relative URL.

Type:: str | None

filepath

An optional string representing the absolute local path where the resource was successfully materialized (either copied or linked).

Type:: str | None

config

Optional Config controlling network and cache behavior. If omitted, the global default configuration is used.

Type:: recount3.config.Config | None

classmethod from_mapping(mapping: Mapping[str, Any], *, config: Config | None = None) → R3Resource[source]

Rehydrate an R3Resource from a serialized mapping.

Builds the resource’s R3ResourceDescription from the mapping’s description fields and returns a configured resource. Derived convenience keys this class emits when serialized (url, arcname) are ignored; the canonical URL is recomputed from the description and configuration.

Parameters:

mapping (Mapping[str, Any]) – JSON-like mapping for a single resource (e.g. a JSONL line), containing resource_type and the fields for that type.
config (Config | None) – Optional Config for URL/cache behavior; the global default is used when omitted.

Returns:

A configured R3Resource.

Raises:

KeyError – If resource_type is missing.
ValueError – If the fields are invalid for that resource type.

Return type:

R3Resource
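The handling of derived keys can be sketched with plain dictionaries. sketch_description_fields is a hypothetical helper showing only the key-filtering step; the stale url and arcname values are invented to show that they are dropped rather than trusted.

```python
import json

# Convenience keys emitted on serialization; recomputed on rehydration.
DERIVED_KEYS = {"url", "arcname"}


def sketch_description_fields(mapping: dict) -> dict:
    """Drop derived keys and require resource_type (hypothetical helper)."""
    if "resource_type" not in mapping:
        raise KeyError("resource_type")
    return {k: v for k, v in mapping.items() if k not in DERIVED_KEYS}


# One JSONL line, as it might have been written by an earlier session.
line = json.dumps(
    {
        "resource_type": "count_files_gene_or_exon",
        "organism": "human",
        "data_source": "sra",
        "genomic_unit": "gene",
        "project": "SRP009615",
        "annotation_extension": "G026",
        "url": "http://stale.example.org/old-location",  # invented stale value
        "arcname": "old/name.gz",  # invented stale value
    }
)

fields = sketch_description_fields(json.loads(line))
print(sorted(fields))
```

Recomputing the URL from the description means a manifest written against one mirror can be replayed against another by changing only the configuration.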

property arcname: str: Deterministic ZIP arcname derived from the URL path.

download(path: str | None = None, *, cache_mode: Literal['enable', 'disable', 'update'] = 'enable', overwrite: bool = False, chunk_size: int | None = None) → str | None[source]

Ensure resource availability and optionally materialize it.

Transitions the remote resource to the local system. Caches the file, writes it to a specific directory, or appends it to a ZIP archive depending on the arguments provided.

Parameters:

path (str | None) – Target destination. If None, performs a cache-only download. If a directory path, links or copies the file there. If a '.zip' path, appends the file to the archive under arcname.
cache_mode (Literal['enable', 'disable', 'update']) – Caching behavior. 'enable' uses the existing cache, 'disable' streams directly to path without caching, and 'update' forces a cache refresh before materialization.
overwrite (bool) – If True, replaces existing files at the destination.
chunk_size (int | None) – Byte size for streaming operations. Defaults to the configured chunk size.

Returns:

The final file path if materialized to a directory. None if performing a cache-only download or appending to a ZIP archive.

Raises:

ValueError – If the argument combination is invalid (e.g., path=None with cache_mode='disable') or path has an unsupported format.

Return type:

str | None

Examples

Cache the file without copying it anywhere:

res.download(path=None, cache_mode="enable")

Copy the cached file into a local directory:

dest = res.download(path="/data/recount3")

Append the file to a ZIP archive:

res.download(path="/data/recount3.zip")

Force a cache refresh before copying:

dest = res.download(path="/data/recount3", cache_mode="update")
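The three destinations (cache only, directory, ZIP archive) amount to a dispatch on the path argument. The sketch below assumes an already-cached file and uses a hypothetical sketch_materialize function; how the real method treats an existing destination (the overwrite flag) is not modeled, and the arcname is a placeholder.

```python
import os
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


def sketch_materialize(cached: Path, path: Optional[str], arcname: str) -> Optional[str]:
    """Dispatch on path: None, a '.zip' archive, or a directory."""
    if path is None:
        return None  # cache-only: nothing to materialize
    if path.lower().endswith(".zip"):
        with zipfile.ZipFile(path, mode="a") as zf:  # "a" creates if missing
            zf.write(cached, arcname=arcname)
        return None
    dest_dir = Path(path)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / cached.name
    if dest.exists():
        return str(dest)  # sketch only: leave an existing file untouched
    try:
        os.link(cached, dest)  # hard link when the filesystem allows it
    except OSError:
        shutil.copy2(cached, dest)  # otherwise fall back to a copy
    return str(dest)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache" / "counts.gz"
    cached.parent.mkdir()
    cached.write_bytes(b"data")

    r_cache_only = sketch_materialize(cached, None, "human/counts.gz")
    r_dir = sketch_materialize(cached, str(root / "out"), "human/counts.gz")
    dir_bytes = Path(r_dir).read_bytes()
    r_zip = sketch_materialize(cached, str(root / "bundle.zip"), "human/counts.gz")
    with zipfile.ZipFile(root / "bundle.zip") as zf:
        zip_names = zf.namelist()

print(r_cache_only, Path(r_dir).name, r_zip, zip_names)
```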

load(*, force: bool = False) → object[source]

Parse the resource into an appropriate in-memory Python object.

Downloads and caches the resource if missing. Uses the resource description to determine the parsing strategy (e.g., BigWig, tabular counts, junctions). Caches the resulting object internally to prevent redundant disk I/O.

Parameters:

force (bool) – If True, bypasses the in-memory object cache and re-parses data directly from disk.

Returns:

The parsed object. Tabular and junction counts return a pandas.DataFrame. BigWig files return a recount3._bigwig.BigWigFile instance.

Raises:

FileNotFoundError – If the file is missing from the cache after download.
LoadError – If parsing fails, matrix shapes mismatch, or the resource type is currently unsupported.

Return type:

object

Examples

Load a gene/exon count matrix as a DataFrame:

counts_df = gene_count_res.load()  # -> pd.DataFrame

Load a BigWig coverage file (close it when done):

bw = bigwig_res.load()  # -> BigWigFile
vals = bw.values("chr1", 0, 1_000_000)
bw.close()

Re-parse from disk, bypassing the in-memory cache:

counts_df = res.load(force=True)

is_loaded() → bool[source]

Check if the resource currently holds a parsed in-memory object.

Returns:: True if an object is cached in memory, False otherwise.
Return type:: bool

get_loaded() → object | None[source]

Retrieve the parsed in-memory object without triggering disk I/O.

Returns:: The loaded object if present, otherwise None.
Return type:: object | None

clear_loaded() → None[source]

Evict the in-memory cache and close file handles if open.

Does not delete or modify the on-disk file cache.

Return type:: None
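The interplay of load(), is_loaded(), get_loaded(), and clear_loaded() is a small memoization pattern. SketchResource below is a hypothetical stand-in in which a plain callable replaces the download and parsing step; it illustrates the contract described above rather than the real class.

```python
from typing import Any, Callable, Optional


class SketchResource:
    """Hypothetical stand-in showing only the in-memory caching contract."""

    def __init__(self, parse: Callable[[], Any]) -> None:
        self._parse = parse  # stands in for download + file parsing
        self._loaded: Optional[Any] = None

    def load(self, *, force: bool = False) -> Any:
        if force or self._loaded is None:
            self._loaded = self._parse()
        return self._loaded

    def is_loaded(self) -> bool:
        return self._loaded is not None

    def get_loaded(self) -> Optional[Any]:
        return self._loaded  # never triggers parsing

    def clear_loaded(self) -> None:
        close = getattr(self._loaded, "close", None)
        if callable(close):
            close()  # release file handles (e.g. an open BigWig)
        self._loaded = None


parse_calls: list[int] = []


def parse() -> dict:
    parse_calls.append(1)
    return {"gene_1": 10}


res = SketchResource(parse)
before = res.get_loaded()  # None: nothing parsed yet
first = res.load()
second = res.load()  # served from memory, no re-parse
calls_after_two_loads = len(parse_calls)
res.load(force=True)  # re-parse on demand
calls_after_force = len(parse_calls)
res.clear_loaded()
print(before, first is second, calls_after_two_loads, calls_after_force, res.is_loaded())
```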

recount3.bundle

Resource bundles, project discovery, and concatenation helpers.

This module defines R3ResourceBundle, a general-purpose container for groups of R3Resource objects.

Bundles support lazy loading, filtering by description fields, project-aware discovery, and high-level helpers for combining recount3 resources into BiocPy objects such as SummarizedExperiment and RangedSummarizedExperiment.

When discovery covers exactly one (organism, data_source, project) triple, the bundle’s organism, data_source, and project attributes are set accordingly. For multi-project bundles these attributes are None to avoid misrepresenting the identity.

Filtering with FieldSpec

filter() accepts a FieldSpec for each description field. The following forms are accepted:

A string: exact match (e.g. genomic_unit="gene").
An iterable of strings: keep if the field is any of the given values.
A callable: called with the field value; truthy return keeps the resource.
None (default): no filtering on that field.
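The four accepted forms can be summarized in a few lines. sketch_match is a hypothetical function written from the list above, not the package's match_spec(); note that the string case must be tested before the iterable case, because a string is itself iterable.

```python
from typing import Any


def sketch_match(value: Any, spec: Any) -> bool:
    """Evaluate one field value against one FieldSpec-style filter."""
    if spec is None:
        return True  # no filtering on this field
    if isinstance(spec, str):
        return value == spec  # exact match
    if callable(spec):
        return bool(spec(value))  # truthy result keeps the resource
    return value in set(spec)  # iterable of allowed values


results = {
    "none": sketch_match("gene", None),
    "exact_hit": sketch_match("gene", "gene"),
    "exact_miss": sketch_match("exon", "gene"),
    "iterable_hit": sketch_match("exon", ["gene", "exon"]),
    "iterable_miss": sketch_match("junction", ["gene", "exon"]),
    "callable_hit": sketch_match("count_files_junctions", lambda t: "count" in t),
    "callable_miss": sketch_match("metadata_files", lambda t: "count" in t),
}
print(results)
```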

Typical usage example:

from recount3 import R3ResourceBundle

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
)

# Filter to gene-level count resources and stack into a DataFrame:
counts = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
).stack_count_matrices()

# Filter with a callable predicate:
meta_only = bundle.filter(
    resource_type=lambda t: t == "metadata_files"
)

Note

The to_summarized_experiment() and to_ranged_summarized_experiment() methods require the BiocPy package summarizedexperiment, which might be difficult to install on Windows. Install with:

pip install "recount3[biocpy]"

class recount3.bundle.R3ResourceBundle(resources: list[R3Resource] = <factory>, organism: str | None = None, data_source: str | None = None, project: str | None = None)[source]

Container for a set of recount3.resource.R3Resource objects.

Bundles act as the primary orchestration primitive in this package. They keep track of a collection of resources and provide helpers for loading, filtering, project-aware workflows, and high-level operations such as stacking matrices or building BiocPy objects.

A bundle may optionally be associated with a single project identity via the organism, data_source, and project attributes. If multiple projects are combined into one bundle, these attributes are left as None.

Parameters:

resources (list[R3Resource])
organism (str | None)
data_source (str | None)
project (str | None)

resources

The list of resources contained in the bundle.

Type:: list[recount3.resource.R3Resource]

organism

Optional organism identifier (for example, "human" or "mouse") when the bundle is project-scoped.

Type:: str | None

data_source

Optional data source name (for example, "sra", "gtex", or "tcga") when the bundle is project-scoped.

Type:: str | None

project

Optional study or project identifier (for example, "SRP009615") when the bundle is project-scoped.

Type:: str | None

classmethod discover(*, organism: str | Iterable[str], data_source: str | Iterable[str], project: str | Iterable[str], genomic_units: tuple[str, ...] = ('gene', 'exon'), annotations: str | Iterable[str] = 'default', junction_exts: tuple[str, ...] = ('MM',), junction_type: str = 'ALL', include_metadata: bool = True, include_bigwig: bool = False, strict: bool = True, deduplicate: bool = True) → R3ResourceBundle[source]

Discover resources for one or more projects and return a bundle.

Discovery can operate on a single (organism, data_source, project) triple or on the Cartesian product of multiple values for each identifier.

When discovery spans more than one project, the returned bundle will contain resources from all projects and the bundle-level organism, data_source, and project attributes will be left as None to avoid misrepresenting the identity.
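The expansion into triples and the resulting bundle identity can be sketched with itertools.product. The helper names below are hypothetical, and the sketch covers only the identifier expansion, not the actual resource lookup.

```python
from itertools import product
from typing import Iterable, Optional, Union

StrOrMany = Union[str, Iterable[str]]
Triple = tuple[str, str, str]


def sketch_as_tuple(value: StrOrMany) -> tuple[str, ...]:
    """Treat a lone string as a one-element collection."""
    return (value,) if isinstance(value, str) else tuple(value)


def sketch_triples(organism: StrOrMany, data_source: StrOrMany, project: StrOrMany) -> list[Triple]:
    """Cartesian product of the three identifier collections."""
    return list(
        product(
            sketch_as_tuple(organism),
            sketch_as_tuple(data_source),
            sketch_as_tuple(project),
        )
    )


def sketch_identity(triples: list[Triple]) -> tuple[Optional[str], ...]:
    """Bundle-level identity: set only when exactly one triple was covered."""
    return triples[0] if len(triples) == 1 else (None, None, None)


single = sketch_triples("human", "sra", "SRP009615")
multi = sketch_triples("human", "sra", ["SRP009615", "SRP001558"])

print(single, sketch_identity(single))
print(multi, sketch_identity(multi))
```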

Parameters:

organism (str | Iterable[str]) – Single organism name or iterable of names.
data_source (str | Iterable[str]) – Single data source or iterable of data sources.
project (str | Iterable[str]) – Single project identifier or iterable of identifiers.
genomic_units (tuple[str, ...]) – Gene expression feature levels to include; for example, ("gene", "exon").
annotations (str | Iterable[str]) – Annotation selection; either "default", "all", a single annotation code, or an iterable of codes or labels understood by recount3.search.annotation_ext().
junction_exts (tuple[str, ...]) – Junction file extensions to include; typically ("MM",) for junction counts, with "RR" and/or "ID" added for coordinates or IDs.
junction_type (str) – Junction type selector, such as "ALL".
include_metadata (bool) – Whether to include the five project metadata tables in the result.
include_bigwig (bool) – Whether to include per-sample BigWig coverage resources.
strict (bool) – If True, propagate errors for invalid inputs or missing projects. If False, attempts that fail validation are skipped.
deduplicate (bool) – If True, remove duplicated resources across discovered projects.

Returns:

A new R3ResourceBundle populated with discovered resources. When discovery covers exactly one (organism, data_source, project) triple, the resulting bundle’s organism, data_source, and project attributes are set accordingly.

Raises:

ValueError – If organism, data_source, or project evaluates to an empty collection after normalization.
recount3.errors.ConfigurationError – If the underlying search logic reports configuration problems.

Return type:

R3ResourceBundle

Examples

Discover all default resources for a single project:

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
)

Discover gene counts only across two projects:

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project=["SRP009615", "SRP001558"],
    genomic_units=("gene",),
)

Include BigWig coverage files alongside counts:

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
    include_bigwig=True,
)

add(res: R3Resource) → None[source]

Add a resource to the bundle.

Parameters:: res (R3Resource) – The resource to append to resources.
Return type:: None

extend(resources_iter: Iterable[R3Resource]) → None[source]

Extend the bundle with additional resources.

Parameters:: resources_iter (Iterable[R3Resource]) – Iterable of resources to add to the bundle.
Return type:: None

load(*, strict: bool = True, force: bool = False) → R3ResourceBundle[source]

Load all resources and cache their data on each instance.

This method iterates over resources and calls recount3.resource.R3Resource.load() on each one.

Parameters:

strict (bool) – If True, stop at the first exception and re-raise it. If False, skip resources that fail to load.
force (bool) – If True, force a reload even when data is already cached on the resource.

Returns:

This R3ResourceBundle instance, to enable chaining.

Return type:

R3ResourceBundle

iter_loaded(*, resource_type: str | None = None, autoload: bool = False) → Iterator[tuple[R3Resource, Any]][source]

Yield (resource, data) pairs for resources with loaded data.

When autoload is True, resources that have not yet been loaded are passed through R3Resource.load() before yielding.

Parameters:

resource_type (str | None) – Optional resource-type filter applied to res.description.resource_type.
autoload (bool) – If True, automatically load resources that have not yet been loaded.

Yields:

Tuples of (resource, loaded_data) for each resource that matches the optional resource_type filter and either already has cached data or can be loaded successfully.

Return type:

Iterator[tuple[R3Resource, Any]]

iter_bigwig(*, autoload: bool = True) → Iterator[tuple[R3Resource, BigWigFile]][source]

Yield (resource, bigwig) pairs for BigWig resources.

Parameters:: autoload (bool) – If True, automatically load BigWig resources that have not yet been loaded.
Yields:: Pairs of recount3.resource.R3Resource and recount3._bigwig.BigWigFile objects.
Return type:: Iterator[tuple[R3Resource, BigWigFile]]

get_loaded(*, resource_type: str | None = None, autoload: bool = False) → list[Any][source]

Return loaded data objects for resources in the bundle.

Parameters:

resource_type (str | None) – Optional resource-type filter applied to res.description.resource_type.
autoload (bool) – If True, automatically load any resources that are not yet loaded.

Returns:

A list of loaded data objects corresponding to resources in the bundle.

Return type:

list[Any]

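filter(*, resource_type: str | Iterable[str] | Callable[[Any], bool] | None = None, organism: str | Iterable[str] | Callable[[Any], bool] | None = None, data_source: str | Iterable[str] | Callable[[Any], bool] | None = None, genomic_unit: str | Iterable[str] | Callable[[Any], bool] | None = None, project: str | Iterable[str] | Callable[[Any], bool] | None = None, sample: str | Iterable[str] | Callable[[Any], bool] | None = None, table_name: str | Iterable[str] | Callable[[Any], bool] | None = None, junction_type: str | Iterable[str] | Callable[[Any], bool] | None = None, annotation_extension: str | Iterable[str] | Callable[[Any], bool] | None = None, junction_extension: str | Iterable[str] | Callable[[Any], bool] | None = None, predicate: Callable[[R3Resource], bool] | None = None, invert: bool = False) → R3ResourceBundle[source]

Return a new bundle containing resources that match the given criteria.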

Each keyword argument corresponds to an attribute on recount3._descriptions.R3ResourceDescription. Values are interpreted using recount3.search.match_spec(), allowing simple values, iterables of values, or callables.

Parameters:

resource_type (str | Iterable[str] | Callable[[Any], bool] | None) – Resource type filter.
organism (str | Iterable[str] | Callable[[Any], bool] | None) – Organism filter.
data_source (str | Iterable[str] | Callable[[Any], bool] | None) – Data source filter.
genomic_unit (str | Iterable[str] | Callable[[Any], bool] | None) – Genomic unit filter.
project (str | Iterable[str] | Callable[[Any], bool] | None) – Project identifier filter.
sample (str | Iterable[str] | Callable[[Any], bool] | None) – Sample identifier filter.
table_name (str | Iterable[str] | Callable[[Any], bool] | None) – Metadata table name filter.
junction_type (str | Iterable[str] | Callable[[Any], bool] | None) – Junction type filter.
annotation_extension (str | Iterable[str] | Callable[[Any], bool] | None) – Annotation code filter.
junction_extension (str | Iterable[str] | Callable[[Any], bool] | None) – Junction extension filter.
predicate (Callable[[R3Resource], bool] | None) – Optional callback that receives each resource and returns True if it should be kept.
invert (bool) – If True, invert the final match decision.

Returns:

A new R3ResourceBundle containing only the resources that match all supplied filters and the optional predicate.

Return type:

R3ResourceBundle

Examples

Keep only gene-level resources:

gene_bundle = bundle.filter(genomic_unit="gene")

Keep gene or exon resources (iterable form):

ge_bundle = bundle.filter(genomic_unit=["gene", "exon"])

Keep resources whose type contains “count” (callable form):

counts = bundle.filter(
    resource_type=lambda t: "count" in (t or "")
)

Invert a filter to exclude metadata tables:

no_meta = bundle.filter(
    resource_type="metadata_files", invert=True
)
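Because invert applies to the final match decision, combining it with several filters keeps every resource that fails at least one of them. The sketch below demonstrates this with a hypothetical sketch_filter that supports only exact-match field values; the SimpleNamespace objects stand in for resource descriptions.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def sketch_filter(
    resources: list,
    *,
    predicate: Optional[Callable[[Any], bool]] = None,
    invert: bool = False,
    **field_specs: str,
) -> list:
    """AND together all field filters and the predicate, then apply invert."""
    kept = []
    for res in resources:
        ok = all(
            getattr(res, name, None) == wanted
            for name, wanted in field_specs.items()
        )
        if ok and predicate is not None:
            ok = bool(predicate(res))
        if ok != invert:  # invert flips the combined decision
            kept.append(res)
    return kept


gene = SimpleNamespace(resource_type="count_files_gene_or_exon", genomic_unit="gene")
exon = SimpleNamespace(resource_type="count_files_gene_or_exon", genomic_unit="exon")
meta = SimpleNamespace(resource_type="metadata_files", genomic_unit=None)
everything = [gene, exon, meta]

only_gene = sketch_filter(
    everything, resource_type="count_files_gene_or_exon", genomic_unit="gene"
)
not_gene_counts = sketch_filter(
    everything,
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
    invert=True,
)
no_meta = sketch_filter(everything, resource_type="metadata_files", invert=True)
print(len(only_gene), len(not_gene_counts), len(no_meta))
```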

only_counts() → R3ResourceBundle[source]

Return a bundle restricted to gene/exon or junction count files.

Returns:: A new R3ResourceBundle containing only resources whose resource_type is "count_files_gene_or_exon" or "count_files_junctions".
Return type:: R3ResourceBundle

only_metadata() → R3ResourceBundle[source]

Return a bundle restricted to metadata resources.

Returns:: A new R3ResourceBundle containing only resources whose resource_type is "metadata_files".
Return type:: R3ResourceBundle

exclude_metadata() → R3ResourceBundle[source]

Return a bundle with metadata resources removed.

Returns:: A new R3ResourceBundle that excludes resources whose resource_type is "metadata_files".
Return type:: R3ResourceBundle

where(predicate: Callable[[R3Resource], bool]) → R3ResourceBundle[source]

Predicate-based helper that forwards to filter().

Parameters:: predicate (Callable[[R3Resource], bool]) – Function that receives each resource and returns True if it should be retained in the result.
Returns:: A new R3ResourceBundle with only resources for which predicate returned True.
Return type:: R3ResourceBundle

counts() → R3ResourceBundle[source]

Return a sub-bundle containing only count-file resources.

This is a convenience alias for only_counts().

Return type:: R3ResourceBundle

metadata() → R3ResourceBundle[source]

Return a sub-bundle containing only metadata resources.

This is a convenience alias for only_metadata().

Return type:: R3ResourceBundle

bigwigs() → R3ResourceBundle[source]

Return a sub-bundle containing only BigWig resources.

Returns:: A new R3ResourceBundle containing only resources whose type is "bigwig_files".
Return type:: R3ResourceBundle

samples(*, organism: str | None = None, data_source: str | None = None, project: str | None = None) → list[str][source]

Return the list of sample identifiers associated with a project.

The default behavior uses the bundle’s stored project identity, as recorded by discover(). Explicit keyword arguments can be provided to override or define the identity when the bundle was not created by discover().

Parameters:

organism (str | None) – Optional organism identifier override.
data_source (str | None) – Optional data source override.
project (str | None) – Optional project identifier override.

Returns:

A sorted list of sample identifiers for the resolved project.

Raises:

ValueError – If the project cannot be resolved or validated.

Return type:

list[str]

stack_count_matrices(*, join_policy: str = 'inner', axis: int = 1, verify_integrity: bool = False, autoload: bool = True, compat: Literal['family', 'feature'] = 'family') → DataFrame[source]

Concatenate count matrices (gene/exon or junction) as DataFrames.

Parameters:

join_policy (str) – Join policy passed to pandas.concat().
axis (int) – Concatenation axis passed to pandas.concat().
verify_integrity (bool) – If True, raise when labels are not unique along the concatenation axis.
autoload (bool) – If True, automatically load resources prior to concatenation.
compat (Literal['family', 'feature']) – Compatibility mode. "family" enforces that all inputs come from the same high-level family (gene/exon or junction), while "feature" enforces an identical feature space (for example, same genomic unit and junction subtype).

Returns:

A pandas.DataFrame containing the concatenated count matrices.

Raises:

recount3.errors.CompatibilityError – If incompatible count resources are mixed in a way that violates compat.
TypeError – If a loaded object is not a pandas.DataFrame.
ValueError – If no applicable resources are present or if no loaded count matrices are found.

Return type:

DataFrame

Examples

Stack gene counts across all projects in the bundle:

df = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
).stack_count_matrices()

Require an identical feature space; fails if gene and exon are mixed (the annotation build is not constrained):

df = bundle.filter(
    resource_type="count_files_gene_or_exon"
).stack_count_matrices(compat="feature")
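The difference between the two compat modes can be sketched as a grouping check performed before concatenation. SketchCompatibilityError, FAMILY, and sketch_check_compat are hypothetical; in particular, which fields make up the "feature" key is an assumption based on the parameter description (genomic unit and junction subtype).

```python
from types import SimpleNamespace


class SketchCompatibilityError(Exception):
    """Hypothetical stand-in for recount3.errors.CompatibilityError."""


# Assumed mapping from resource type to high-level family.
FAMILY = {
    "count_files_gene_or_exon": "gene_or_exon",
    "count_files_junctions": "junction",
}


def sketch_check_compat(resources: list, compat: str = "family") -> None:
    """Raise if the resources do not share one family or feature space."""
    if compat == "family":
        keys = {FAMILY[r.resource_type] for r in resources}
    elif compat == "feature":
        keys = {
            (
                FAMILY[r.resource_type],
                getattr(r, "genomic_unit", None),
                getattr(r, "junction_type", None),
            )
            for r in resources
        }
    else:
        raise ValueError(f"unknown compat mode: {compat!r}")
    if len(keys) > 1:
        raise SketchCompatibilityError(f"mixed {compat} keys: {sorted(map(str, keys))}")


gene = SimpleNamespace(resource_type="count_files_gene_or_exon", genomic_unit="gene")
exon = SimpleNamespace(resource_type="count_files_gene_or_exon", genomic_unit="exon")

sketch_check_compat([gene, exon], compat="family")  # same family: passes

try:
    sketch_check_compat([gene, exon], compat="feature")
    feature_mix_rejected = False
except SketchCompatibilityError:
    feature_mix_rejected = True

print(feature_mix_rejected)
```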

to_summarized_experiment(*, genomic_unit: str, annotation_extension: str | None = None, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True) → summarizedexperiment.SummarizedExperiment[source]

Build a BiocPy SummarizedExperiment from this bundle.

This method stacks compatible count matrices, merges available sample metadata, and constructs a BiocPy SummarizedExperiment using a compatibility-aware constructor that supports multiple versions of the summarizedexperiment package.

Parameters:

genomic_unit (str) – Genomic unit to summarize, such as "gene", "exon", or "junction".
annotation_extension (str | None) – Optional annotation code for gene or exon summarizations (for example, "G026"). When provided and genomic_unit is gene or exon, only count resources with matching annotation are used.
assay_name (str) – Name assigned to the coverage-sum assay within the SummarizedExperiment (default: "raw_counts").
join_policy (str) – Join policy used when concatenating counts across resources.
autoload (bool) – If True, load resources when needed.

Returns:

A BiocPy SummarizedExperiment instance.

Raises:

ImportError – If BiocPy packages are not installed.
ValueError – If no counts are found or shapes are inconsistent.
TypeError – If the underlying SummarizedExperiment constructor rejects all compatibility variants.

Return type:

summarizedexperiment.SummarizedExperiment
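
One way a "compatibility-aware constructor" can work is to try several keyword spellings in turn and raise TypeError only when every variant is rejected. The sketch below illustrates that pattern with a stand-in class; the keyword names (rowData/colData versus row_data/column_data) and the helper construct_with_variants are assumptions for illustration, not confirmed signatures of the summarizedexperiment package or of recount3.

```python
def construct_with_variants(ctor, variants):
    """Try each keyword-argument variant in order; return the first success."""
    errors = []
    for kwargs in variants:
        try:
            return ctor(**kwargs)
        except TypeError as exc:  # signature mismatch for this variant
            errors.append(str(exc))
    raise TypeError("constructor rejected all variants: " + "; ".join(errors))


class NewStyleSE:
    """Hypothetical stand-in that only accepts snake_case keywords."""

    def __init__(self, *, assays, row_data, column_data):
        self.assays = assays
        self.row_data = row_data
        self.column_data = column_data


assays = {"raw_counts": [[1, 2]]}
variants = [
    {"assays": assays, "rowData": ["g1"], "colData": ["s1", "s2"]},
    {"assays": assays, "row_data": ["g1"], "column_data": ["s1", "s2"]},
]
se = construct_with_variants(NewStyleSE, variants)  # second variant succeeds
```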

to_ranged_summarized_experiment(*, genomic_unit: str, annotation_extension: str | None = None, prefer_rr_junction_coordinates: bool = True, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False) → summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]

Build a BiocPy RangedSummarizedExperiment when possible.

For "gene" and "exon" genomic units, row ranges are derived from a matching GTF(.gz) annotation resource. For "junction", this method prefers an RR table (junction coordinates) when available.

When row ranges cannot be resolved and allow_fallback_to_se is True, a plain SummarizedExperiment is returned instead.

Parameters:

genomic_unit (str) – One of "gene", "exon", or "junction".
annotation_extension (Optional[str]) – Annotation code for gene/exon assays, if desired.
prefer_rr_junction_coordinates (bool) – If True, prefer RR junction files for coordinate definitions when they are available.
assay_name (str) – Name assigned to the coverage-sum assay within the SummarizedExperiment (default: "raw_counts").
join_policy (str) – Join policy across projects when stacking.
autoload (bool) – If True, load resources transparently.
allow_fallback_to_se (bool) – If True, construct a plain SummarizedExperiment when genomic ranges cannot be derived for the requested combination.

Returns:

A RangedSummarizedExperiment instance, or a plain SummarizedExperiment when allow_fallback_to_se is True and ranges are unavailable.

Raises:

ImportError – If BiocPy packages are not installed.
ValueError – If counts are missing or ranges cannot be determined and allow_fallback_to_se is False.
TypeError – If the RangedSummarizedExperiment constructor rejects all compatibility variants.

Return type:

summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment

download(*, dest: str = '.', overwrite: bool = False, cache: Literal['enable', 'disable', 'update'] = 'enable', max_workers: int = 8) → None[source]

Download all resources in the bundle to a local destination.

This method is a convenience wrapper around recount3.resource.R3Resource.download() for each contained resource. Downloading is I/O-bound, so by default the resources are fetched concurrently using a pool of worker threads (the same mechanism the recount3 download CLI command uses). For more advanced workflows (per-resource event logs, JSONL progress) prefer the command-line interface, recount3 download.

Concurrency is safe because per-resource downloads are coordinated by a shared threading.Lock over the on-disk cache and by per-path locks for .zip archives, and files are materialized atomically. See recount3.resource for details.

Parameters:

dest (str) – Destination directory or .zip path. When a directory is provided, each resource is materialized as a separate file under that directory. When a path ending in .zip is provided, resources are written into that archive.
overwrite (bool) – If True, allow overwriting existing files in directory mode.
cache (Literal['enable', 'disable', 'update']) – Cache behavior: "enable", "disable", or "update" as defined by recount3.types.CacheMode.
max_workers (int) – Maximum number of parallel download threads. Values <= 1 download sequentially. The effective worker count is also capped at the number of resources in the bundle.

Raises:

ValueError – Propagated from underlying resource download failures, for example when an unsupported cache mode is selected.

Return type:

None
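
The worker cap and the atomic materialization described above can be sketched with the standard library alone. This is a minimal illustration, not recount3's code: download_all, write_atomically, and the in-memory payloads that stand in for network responses are hypothetical, and the shared cache lock and per-path .zip locks are deliberately left out.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_atomically(path, data):
    """Write to a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def download_all(items, dest, max_workers=8):
    """Materialize (name, payload) pairs under dest and return the paths."""
    def fetch(item):
        name, payload = item
        target = os.path.join(dest, name)
        write_atomically(target, payload)
        return target

    workers = min(max_workers, len(items))  # cap at the number of resources
    if workers <= 1:
        return [fetch(item) for item in items]  # sequential path
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, items))


with tempfile.TemporaryDirectory() as tmpdir:
    paths = download_all([("a.bin", b"A"), ("b.bin", b"B")], tmpdir, max_workers=8)
    written = sorted(os.listdir(tmpdir))
```

Writing to a temporary file and renaming it means a reader never observes a partially written file, which is why concurrent workers can share a destination directory.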

recount3.search

Discovery and search helpers for the recount3 data repository.

This module provides two tiers of functions:

Tier 1: Type-specific search wrappers

These functions accept field values and return a list of R3Resource objects:

search_annotations(): annotation GTF files
search_count_files_gene_or_exon(): gene/exon count matrices
search_count_files_junctions(): junction count files (MM/ID/RR)
search_metadata_files(): per-project metadata tables
search_bigwig_files(): per-sample BigWig coverage files
search_data_sources(): organism-level data-source index
search_data_source_metadata(): data-source metadata listings
search_project_all(): convenience orchestrator that calls all of the above for a given project and returns a combined resource list

Tier 2: Discovery helpers

These functions download and parse metadata to return structured results. All are part of the public top-level surface (re-exported from recount3):

available_samples(): DataFrame of samples available across data sources
available_projects(): DataFrame of projects with sample counts
project_homes(): DataFrame mapping projects to their data-source home URLs
samples_for_project(): sample IDs for a specific project
create_sample_project_lists(): build per-data-source sample/project lists (backs the ids CLI command)
annotation_options(): mapping of annotation names to extension codes
annotation_ext(): resolve a name or code to a canonical extension string

StringOrIterable and the Cartesian product pattern

All Tier 1 functions accept StringOrIterable for each parameter: either a single string or an iterable of strings. When iterables are supplied, the function computes the Cartesian product across all parameters and returns one resource per unique combination. For example, passing project=["SRP009615", "SRP001558"] and genomic_unit=["gene", "exon"] produces four resources.
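
The Cartesian product pattern can be illustrated with itertools.product. The helpers as_tuple and combinations below are hypothetical names used only for this sketch; they show the expansion logic, not how recount3 builds its resources.

```python
from itertools import product


def as_tuple(value):
    """Normalize a single string or an iterable of strings to a tuple."""
    return (value,) if isinstance(value, str) else tuple(value)


def combinations(**fields):
    """Yield one dict per unique combination of the supplied field values."""
    names = list(fields)
    seen = set()
    for combo in product(*(as_tuple(fields[n]) for n in names)):
        if combo not in seen:
            seen.add(combo)
            yield dict(zip(names, combo))


combos = list(combinations(
    organism="human",
    project=["SRP009615", "SRP001558"],
    genomic_unit=["gene", "exon"],
))
# 1 organism x 2 projects x 2 genomic units = 4 combinations
```

Treating a bare string as a one-element tuple matters here: iterating over "human" directly would expand it into single characters.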

Annotation names and extension codes

Gene/exon count resources are keyed by an annotation extension code (e.g. "G026"). Human-readable annotation labels (e.g. "gencode_v26") are resolved to their codes via annotation_ext(). Use annotation_options() to list all available mappings for an organism.

Typical usage example:

from recount3 import search_count_files_gene_or_exon, search_project_all

# Single project, gene-level counts:
resources = search_count_files_gene_or_exon(
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP009615",
)

# All resource types for a project in one call:
all_res = search_project_all(
    organism="human",
    data_source="sra",
    project="SRP009615",
)

recount3.search.match_spec(value: object | None, spec: str | Iterable[str] | Callable[[Any], bool] | None) → bool[source]

Return True if value satisfies the selection spec.

Parameters:

value (object | None) – The candidate value to test.
spec (str | Iterable[str] | Callable[[Any], bool] | None) –
The selection specification. Accepted forms:
- None: matches everything.
- A callable: called with value; truthy return means match.
- A string or iterable of strings: matches if value is among the normalized tuple of strings.

Returns:

True if value matches spec, False otherwise.

Return type:

bool
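
The three accepted spec forms can be summarized in a few lines. The function below, match_spec_sketch, is a hypothetical re-implementation for illustration; the real match_spec() may apply additional normalization that is not documented here.

```python
def match_spec_sketch(value, spec):
    """Illustrative version of the three accepted spec forms."""
    if spec is None:
        return True  # no constraint: everything matches
    if callable(spec):
        return bool(spec(value))  # predicate form
    allowed = (spec,) if isinstance(spec, str) else tuple(spec)
    return value in allowed  # membership form


checks = [
    match_spec_sketch("gene", None),
    match_spec_sketch("gene", "gene"),
    match_spec_sketch("exon", ["gene", "junction"]),
    match_spec_sketch("SRP009615", lambda v: str(v).startswith("SRP")),
]
# checks == [True, True, False, True]
```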

recount3.search.search_annotations(*, organism: str | Iterable[str], genomic_unit: str | Iterable[str], annotation_extension: str | Iterable[str], strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return R3Resource objects for annotation GTF files.

Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
genomic_unit (str | Iterable[str]) – One or more genomic units (e.g. "gene", "exon").
annotation_extension (str | Iterable[str]) – One or more annotation extension strings (e.g. "G026").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique (organism, genomic_unit, annotation_extension) combination.

Return type:

list[R3Resource]
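
The strict and deduplicate flags shared by the Tier 1 functions can be sketched as follows. This is an assumption-laden illustration: build_resources and demo_url are hypothetical helpers, the example:// URL layout is made up, and plain URL strings stand in for R3Resource objects.

```python
def build_resources(combos, make_url, strict=True, deduplicate=True):
    """Build one URL per combination, honoring strict and deduplicate."""
    urls, seen = [], set()
    for combo in combos:
        try:
            url = make_url(**combo)
        except Exception:
            if strict:
                raise  # strict mode: surface the failure to the caller
            continue  # lenient mode: skip the invalid combination
        if deduplicate and url in seen:
            continue  # drop a URL that duplicates an earlier entry
        seen.add(url)
        urls.append(url)
    return urls


def demo_url(organism, genomic_unit):
    """Hypothetical URL builder; the scheme and layout are made up."""
    if genomic_unit not in {"gene", "exon"}:
        raise ValueError(f"unsupported genomic unit: {genomic_unit}")
    return f"example://{organism}/{genomic_unit}"


combos = [
    {"organism": "human", "genomic_unit": "gene"},
    {"organism": "human", "genomic_unit": "bogus"},
    {"organism": "human", "genomic_unit": "gene"},
]
lenient = build_resources(combos, demo_url, strict=False)
# lenient == ["example://human/gene"]
```

With strict=True the same input would raise on the second combination, which is usually the safer default when a typo in a field value should not pass silently.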

recount3.search.search_count_files_gene_or_exon(*, organism: str | Iterable[str], data_source: str | Iterable[str], genomic_unit: str | Iterable[str], project: str | Iterable[str], annotation_extension: str | Iterable[str] = ('G026',), strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return gene/exon count R3Resource objects.

Each resource points to a per-project gene or exon count matrix. Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
genomic_unit (str | Iterable[str]) – One or more genomic units ("gene" or "exon").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
annotation_extension (str | Iterable[str]) – One or more annotation extension strings. Defaults to ("G026",).
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique (organism, data_source, genomic_unit, project, annotation_extension) combination.

Return type:

list[R3Resource]

Examples

Single project, gene-level counts:

resources = search_count_files_gene_or_exon(
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP009615",
)

Multiple projects return one resource per project (Cartesian product):

resources = search_count_files_gene_or_exon(
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project=["SRP009615", "SRP001558"],
)

recount3.search.search_count_files_junctions(*, organism: str | Iterable[str], data_source: str | Iterable[str], project: str | Iterable[str], junction_type: str | Iterable[str] = 'ALL', junction_extension: str | Iterable[str] = 'MM', strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return junction count R3Resource objects.

Each resource points to a per-project junction count file. Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination. Junction files are distributed as a triplet of sidecar files sharing a common stem: a MatrixMarket matrix (.MM.gz), a sample-ID table (.ID.gz), and a row-ranges table (.RR.gz). The junction_extension selects which of these three files is the primary resource.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
junction_type (str | Iterable[str]) – One or more junction-type tokens. Defaults to "ALL".
junction_extension (str | Iterable[str]) – One or more file-extension tokens selecting the junction sidecar file to retrieve. Accepted values are "MM" (count matrix), "ID" (sample IDs), and "RR" (row ranges). Defaults to "MM".
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique (organism, data_source, project, junction_type, junction_extension) combination.

Return type:

list[R3Resource]
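
Because the three junction files share a stem and differ only in the extension token, the siblings of any one file can be derived by swapping that token. The helper junction_sidecars and the file name used below are hypothetical; the sketch assumes names of the form stem.TOKEN.gz as described above.

```python
def junction_sidecars(name):
    """Return the MM/ID/RR sibling names for one junction file name."""
    stem, token, suffix = name.rsplit(".", 2)
    if token not in {"MM", "ID", "RR"} or suffix != "gz":
        raise ValueError(f"not a junction sidecar name: {name}")
    return {ext: f"{stem}.{ext}.gz" for ext in ("MM", "ID", "RR")}


siblings = junction_sidecars("demo_project.ALL.MM.gz")
# siblings["RR"] == "demo_project.ALL.RR.gz"
```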

recount3.search.search_metadata_files(*, organism: str | Iterable[str], data_source: str | Iterable[str], table_name: str | Iterable[str], project: str | Iterable[str], strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return project metadata R3Resource objects.

Each resource points to a per-project metadata table. Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
table_name (str | Iterable[str]) – One or more metadata table name suffixes (e.g. "recount_project", "recount_qc").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique (organism, data_source, table_name, project) combination.

Return type:

list[R3Resource]

recount3.search.search_bigwig_files(*, organism: str | Iterable[str], data_source: str | Iterable[str], project: str | Iterable[str], sample: str | Iterable[str], strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return BigWig coverage R3Resource objects.

Each resource points to a per-sample BigWig coverage file. Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
sample (str | Iterable[str]) – One or more sample identifiers (e.g. a rail_id or SRR accession).
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique (organism, data_source, project, sample) combination.

Return type:

list[R3Resource]

recount3.search.search_data_sources(*, organism: str | Iterable[str], strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return data-source index R3Resource objects.

Each resource resolves to the homes_index file for one organism, which lists the available data sources for that organism.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique organism.

Return type:

list[R3Resource]

recount3.search.search_data_source_metadata(*, organism: str | Iterable[str], data_source: str | Iterable[str], strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Return data-source metadata R3Resource objects.

Each resource resolves to the recount_project metadata file for one (organism, data_source) pair, which enumerates all projects within that data source.

Parameters:

organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.

Returns:

A list of R3Resource objects, one per unique (organism, data_source) combination.

Return type:

list[R3Resource]

recount3.search.create_sample_project_lists(organism: str = '') → tuple[list[str], list[str]][source]

Return (samples, projects) discovered from metadata tables.

This is a compatibility wrapper around available_samples() and available_projects(). It preserves the original ids CLI behavior (simple ID lists) while benefiting from the richer and more robust metadata parsing in those functions.

Parameters:

organism (str) – Optional organism filter. Accepts “human” or “mouse” (case-insensitive). An empty string means “all supported organisms”.

Returns:

A tuple (samples, projects), where each element is a sorted list of unique identifier strings.

Return type:

tuple[list[str], list[str]]

recount3.search.available_samples(*, organism: str = 'human', data_sources: str | Iterable[str] | None = None, strict: bool = True) → DataFrame[source]

Return a sample overview similar to recount3::available_samples().

This reads per-data-source *.recount_project.MD.gz tables and returns a normalized DataFrame describing all samples for the requested organism.

Parameters:

organism (str) – Organism to query (“human” or “mouse”).
data_sources (str | Iterable[str] | None) – Optional subset of data sources to include (for example, “sra”, “gtex”, “tcga”). By default all known data sources are used.
strict (bool) – If True, raise a ValueError when no metadata can be found. If False, return an empty DataFrame instead.

Returns:

A DataFrame with at least the following columns when available:

external_id: Sample identifier in the original source.
project: Project or study identifier.
organism: Canonical organism label (“human” or “mouse”).
file_source: Origin of the raw data (basename only).
date_processed: Processing date in YYYY-MM-DD format.
project_home: recount3 project home path.
project_type: High-level project type (for example, “data_sources” or “collections”).

Additional columns present in the raw metadata are preserved.

Return type:

DataFrame

Raises:

ValueError – If inputs are invalid or no metadata resources are found and strict is True.
RuntimeError – If metadata resources are found but all fail to load.

recount3.search.available_projects(*, organism: str = 'human', data_sources: str | Iterable[str] | None = None, strict: bool = True) → DataFrame[source]

Return a project overview like recount3::available_projects().

This aggregates the sample-level metadata from available_samples() and summarizes it at the project level.

Parameters:

organism (str) – Organism to query (“human” or “mouse”).
data_sources (str | Iterable[str] | None) – Optional subset of data sources to include.
strict (bool) – Passed through to available_samples().

Returns:

A DataFrame with one row per project and at least the following columns:

project: Project or study identifier.
organism: Canonical organism label.
file_source: Origin of the raw data (basename only).
project_home: recount3 project home path.
project_type: High-level project type (for example, “data_sources” or “collections”).
n_samples: Number of samples in the project.

Additional project-level columns are preserved.

Return type:

DataFrame
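
The roll-up from samples to projects, and from projects to project homes, can be sketched with plain dictionaries. The rows and identifiers below are made up, and the real functions return pandas DataFrames with more columns; this only illustrates how n_samples and n_projects relate to the sample-level table.

```python
from collections import Counter

samples = [
    {"external_id": "S1", "project": "P1", "project_home": "data_sources/sra"},
    {"external_id": "S2", "project": "P1", "project_home": "data_sources/sra"},
    {"external_id": "S3", "project": "P2", "project_home": "data_sources/sra"},
]

# One row per project, with the number of samples it contains.
per_project = Counter((s["project"], s["project_home"]) for s in samples)
projects = [
    {"project": project, "project_home": home, "n_samples": n}
    for (project, home), n in sorted(per_project.items())
]

# One row per project home, with the number of projects that use it.
per_home = Counter(p["project_home"] for p in projects)
homes = [
    {"project_home": home, "n_projects": n}
    for home, n in sorted(per_home.items())
]
```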

recount3.search.project_homes(*, organism: str = 'human', data_sources: str | Iterable[str] | None = None, strict: bool = True) → DataFrame[source]

Return a project home summary similar to recount3::project_homes().

This is a thin layer on top of available_projects() that collapses projects down to unique project_home paths.

Parameters:

organism (str) – Organism to query (“human” or “mouse”).
data_sources (str | Iterable[str] | None) – Optional subset of data sources to include.
strict (bool) – Passed through to available_projects().

Returns:

A DataFrame with one row per project home and at least the following columns:

project_home: recount3 project home path.
project_type: High-level project type (for example, “data_sources” or “collections”).
organism: Canonical organism label.
file_source: Data source label when available.
n_projects: Number of projects using this home.

Additional columns from available_projects() may appear.

Return type:

DataFrame

recount3.search.annotation_options(organism: str) → dict[str, str][source]

Return annotation options for a given organism.

Parameters:

organism (str) – Organism name (“human” or “mouse”), case-insensitive.

Returns:

A new dict mapping canonical annotation names (for example, “gencode_v26”) to recount3 annotation file extensions (for example, “G026”).

Raises:

ValueError – If the organism is not recognized.

Return type:

dict[str, str]

recount3.search.annotation_ext(organism: str, annotation: str) → str[source]

Return the recount3 annotation extension for a given annotation.

This helper is analogous to the R recount3::annotation_ext() function. It accepts either a canonical annotation name (for example, “gencode_v26”) or a raw extension code (for example, “G026”).

Parameters:

organism (str) – Organism name (“human” or “mouse”), case-insensitive.
annotation (str) – Annotation name or extension code.

Returns:

The recount3 annotation file extension (for example, “G026”).

Raises:

ValueError – If the organism or annotation is not recognized.

Return type:

str
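
Accepting either a label or a code amounts to a lookup with a pass-through for values that are already codes. The sketch below assumes a two-entry table; only the gencode_v26 to G026 pairing is stated in this reference, the gencode_v29 to G029 pairing is an assumption, and resolve_annotation is a hypothetical name. The authoritative table is whatever annotation_options() returns.

```python
# Illustrative subset only; the real table comes from annotation_options().
HUMAN_ANNOTATIONS = {"gencode_v26": "G026", "gencode_v29": "G029"}


def resolve_annotation(annotation, options=HUMAN_ANNOTATIONS):
    """Accept either a label or an extension code and return the code."""
    if annotation in options:
        return options[annotation]  # label given: look up its code
    if annotation in options.values():
        return annotation  # already a code: pass it through
    raise ValueError(f"unrecognized annotation: {annotation!r}")


codes = [resolve_annotation("gencode_v26"), resolve_annotation("G029")]
# codes == ["G026", "G029"]
```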

recount3.search.samples_for_project(*, organism: str, data_source: str, project: str) → list[str][source]

Return sample identifiers for a given project.

This helper reads the data-source-level metadata table and extracts sample IDs for the requested project by consulting common column names: “sample”, “sample_id”, “run”, “run_accession”, and “external_id”.

This follows the recommendation in the R documentation to consult project-level metadata first, while drawing sample IDs from data-source metadata when assembling per-sample coverage files.

Parameters:

organism (str) – “human” or “mouse”.
data_source (str) – “sra”, “gtex”, or “tcga”.
project (str) – Study identifier (for example, “SRP009615”).

Returns:

Sorted list of unique sample identifiers.

Raises:

ValueError – If the project cannot be validated against the metadata.

Return type:

list[str]
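
The column lookup described above can be sketched on a list of metadata rows. This is an illustration only: sample_ids is a hypothetical helper, the "project" column name and the order in which candidate columns are tried are assumptions, and the sample identifiers are made up.

```python
CANDIDATE_COLUMNS = ("sample", "sample_id", "run", "run_accession", "external_id")


def sample_ids(rows, project, project_column="project"):
    """Return sorted unique sample IDs for one project from metadata rows."""
    matching = [row for row in rows if row.get(project_column) == project]
    if not matching:
        raise ValueError(f"project not found in metadata: {project}")
    # Use the first candidate column that the metadata actually provides.
    column = next((c for c in CANDIDATE_COLUMNS if c in matching[0]), None)
    if column is None:
        raise ValueError("no known sample-ID column in metadata")
    return sorted({row[column] for row in matching if row.get(column)})


rows = [
    {"project": "SRP009615", "external_id": "SRR_B"},
    {"project": "SRP009615", "external_id": "SRR_A"},
    {"project": "SRP009615", "external_id": "SRR_A"},
    {"project": "SRP001558", "external_id": "SRR_C"},
]
ids = sample_ids(rows, "SRP009615")
# ids == ["SRR_A", "SRR_B"]
```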

recount3.search.search_project_all(*, organism: str, data_source: str, project: str, genomic_units: Iterable[str] = ('gene', 'exon'), annotations: str | Iterable[str] = 'default', junction_type: str = 'ALL', junction_extension: Iterable[str] = ('MM',), include_metadata: bool = True, include_bigwig: bool = False, strict: bool = True, deduplicate: bool = True) → list[R3Resource][source]

Enumerate all files for a project (counts, junctions, metadata, BigWig).

This function composes the existing search helpers to implement a one-shot, project-scoped discovery routine. It adheres to recount3’s raw file layout documented at: https://rna.recount.bio/docs/raw-files.html (Sections 6.2-6.4).

Parameters:

organism (str) – “human” or “mouse”.
data_source (str) – “sra”, “gtex”, or “tcga”.
project (str) – Study identifier (for example, “SRP009615”).
genomic_units (Iterable[str]) – Which expression levels to include. Defaults to both.
annotations (str | Iterable[str]) – “default”, “all”, comma-separated string, or an iterable of annotation file extensions (for example, (“G026”, “G029”)).
junction_type (str) – Junction type; typically “ALL”.
junction_extension (Iterable[str]) – Iterable of junction artifacts to include: “MM” (counts), “RR” (coordinates), “ID” (sample IDs).
include_metadata (bool) – Whether to include the five metadata tables.
include_bigwig (bool) – Whether to include per-sample BigWig coverage files.
strict (bool) – If True, raise on invalid parameters; else skip broken items.
deduplicate (bool) – If True, drop duplicates across resource families.

Returns:

A list of R3Resource objects covering the requested bundle.

Raises:

ValueError – If validation fails (for example, missing project).

Return type:

list[R3Resource]

recount3.se

High-level builders and utilities for SummarizedExperiment objects.

This module provides helpers for constructing BiocPy SummarizedExperiment and RangedSummarizedExperiment objects from recount3 projects, plus utilities for working with SRA-style sample attributes and per-sample scaling.

The heavy lifting is implemented as methods on R3ResourceBundle:

to_summarized_experiment()
to_ranged_summarized_experiment()

The functions in this module are thin wrappers around those methods (for example, create_rse()) and convenience utilities that operate on BiocPy objects directly:

expand_sra_attributes(): parse SRA key;;value|... attribute strings into separate columns on a metadata DataFrame or SE/RSE object.
compute_read_counts(): convert coverage-sum counts to approximate read counts using average mapped read length.
compute_tpm(): compute Transcripts Per Million from an RSE with genomic ranges (uses feature widths from row_ranges).
compute_scale_factors(): compute per-sample AUC- or mapped-reads-based scale factors.
transform_counts(): apply scale factors to a count matrix.
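
The key;;value|... encoding handled by expand_sra_attributes() can be parsed in a few lines. The sketch below is a hypothetical stand-in, not the package's parser: parse_sra_attributes is an invented name, the example string is made up, and any key normalization the real function performs is not shown. The default prefix matches the attribute_column_prefix default in the function signature.

```python
def parse_sra_attributes(encoded, prefix="sra_attribute."):
    """Split 'key;;value|key;;value' into a dict of prefixed column names."""
    columns = {}
    for pair in encoded.split("|"):
        if ";;" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split(";;", 1)
        columns[prefix + key.strip()] = value.strip()
    return columns


row = parse_sra_attributes("age;;67.78|cell type;;fibroblast")
# row == {"sra_attribute.age": "67.78", "sra_attribute.cell type": "fibroblast"}
```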

Typical usage example:

from recount3 import create_rse
from recount3.se import compute_scale_factors, transform_counts

rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",
)
sf = compute_scale_factors(rse)  # per-sample factors (for inspection)
scaled = transform_counts(rse, by="auc")  # apply scaling to the matrix
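
The arithmetic behind compute_read_counts() and compute_tpm() can be shown on a single sample. The sketch below uses the textbook definitions (coverage sum divided by average mapped read length, then length-normalized rates scaled to one million); the function names, the input values, and the absence of rounding are assumptions, and the package's own functions may differ in detail.

```python
def read_counts(coverage_sums, avg_mapped_read_length):
    """Approximate read counts: coverage sum divided by mean read length."""
    return [c / avg_mapped_read_length for c in coverage_sums]


def tpm(counts, widths):
    """Textbook TPM for one sample: length-normalize, then scale to 1e6."""
    rates = [c / w for c, w in zip(counts, widths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


coverage = [1000.0, 3000.0]  # per-feature coverage sums for one sample
widths = [100, 300]          # feature widths in base pairs
reads = read_counts(coverage, avg_mapped_read_length=50.0)
values = tpm(reads, widths)  # both features have the same rate per base
```

The second feature has three times the coverage but is also three times as long, so after length normalization both features receive the same TPM.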

Note

Most functions in this module require BiocPy packages (biocframe, summarizedexperiment, genomicranges). Install them with:

pip install "recount3[biocpy]"

recount3.se.build_summarized_experiment(bundle: R3ResourceBundle, *, genomic_unit: str, annotation_extension: str | None = None, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True) → summarizedexperiment.SummarizedExperiment[source]

Create a SummarizedExperiment.

Builds the experiment from a resource bundle.

This is a convenience wrapper around recount3.bundle.R3ResourceBundle.to_summarized_experiment().

Parameters:

bundle (R3ResourceBundle) – Resource bundle containing counts and metadata.
genomic_unit (str) – One of "gene", "exon", or "junction".
annotation_extension (str | None) – Optional annotation code for gene/exon assays (for example, "G026").
assay_name (str) – Name for the count assay in the SE.
join_policy (str) – Join policy across projects (pandas concatenation join).
autoload (bool) – If True, load resources transparently.

Returns:

A summarizedexperiment.SummarizedExperiment instance.

Return type:

summarizedexperiment.SummarizedExperiment

recount3.se.build_ranged_summarized_experiment(bundle: R3ResourceBundle, *, genomic_unit: str, annotation_extension: str | None = None, prefer_rr_junction_coordinates: bool = True, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False) → summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]

Create a RangedSummarizedExperiment.

Returns a ranged experiment when genomic ranges can be resolved. This is a convenience wrapper around recount3.bundle.R3ResourceBundle.to_ranged_summarized_experiment().

Parameters:

bundle (R3ResourceBundle) – Resources for counts, metadata, and (optionally) annotations.
genomic_unit (str) – One of "gene", "exon", or "junction".
annotation_extension (str | None) – Annotation code for gene/exon, if desired.
prefer_rr_junction_coordinates (bool) – Prefer RR for junction coordinates when present.
assay_name (str) – Name for the count assay in the output.
join_policy (str) – Join policy across projects when stacking.
autoload (bool) – If True, load resources transparently.
allow_fallback_to_se (bool) – If True, return a plain SE when ranges are unavailable.

Returns:

A summarizedexperiment.RangedSummarizedExperiment object, or a plain summarizedexperiment.SummarizedExperiment when allow_fallback_to_se is True and ranges cannot be resolved.

Return type:

summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment

recount3.se.create_ranged_summarized_experiment(*, project: str, genomic_unit: str = 'gene', organism: str = 'human', data_source: str = 'sra', annotation_label: str | None = None, annotation_extension: str | None = None, junction_type: str = 'ALL', junction_extensions: Sequence[str] | None = None, include_metadata: bool = True, include_bigwig: bool = False, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False, strict: bool = True) → summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]

High-level helper that mirrors recount3’s create_rse() in R.

This function hides the intermediate bundle construction step by:

Discovering all resources for a project via recount3.bundle.R3ResourceBundle.discover().

Stacking expression matrices across projects and samples.

Resolving genomic ranges (GTF for gene/exon; RR for junctions).

Building a BiocPy RangedSummarizedExperiment (or SE fallback) using the bundle methods.

Parameters:

project (str) – Study or project identifier (for example, "SRP009615").
genomic_unit (str) – One of {"gene", "exon", "junction"}.
organism (str) – Organism identifier ("human" or "mouse").
data_source (str) – Data source ("sra", "gtex", or "tcga").
annotation_label (str | None) – Optional human-readable annotation label for gene/exon assays, such as "gencode_v26" or "gencode_v29". Ignored for junction-level assays.
annotation_extension (str | None) – Optional explicit annotation extension (for example, "G026"). When provided, this takes precedence over annotation_label. Ignored for junction-level assays.
junction_type (str) – Junction type; typically "ALL".
junction_extensions (Sequence[str] | None) – Iterable of junction artifact extensions to include (for example, ("MM",) or ("MM", "RR")). If None, the default depends on genomic_unit: ("MM", "RR") for "junction" (so genomic ranges can be attached to each row), and ("MM",) for "gene" / "exon" (the RR sidecar does not apply to those units).
include_metadata (bool) – Whether to include the five project metadata tables in the underlying bundle (recommended).
include_bigwig (bool) – Whether to include per-sample BigWig coverage resources in the bundle. These can be large.
join_policy (str) – Join policy across projects when stacking count matrices (passed to build_ranged_summarized_experiment()).
autoload (bool) – If True, load resources transparently.
allow_fallback_to_se (bool) – If True, construct a plain SE when genomic ranges cannot be derived for the requested combination.
strict (bool) – If True, propagate validation errors from the search layer (for example, missing projects or incompatible combinations).
assay_name (str) – Name for the count assay in the output (default: "raw_counts").

Returns:

A summarizedexperiment.RangedSummarizedExperiment object, or a plain summarizedexperiment.SummarizedExperiment when allow_fallback_to_se is True and ranges cannot be resolved.

Raises:

ImportError – If BiocPy packages are not installed.
ValueError – If inputs are invalid or no counts are found.
TypeError – If the underlying RangedSummarizedExperiment constructor rejects all compatibility variants.

Return type:

summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment

Examples

Build an RSE for a human SRA gene-count project (GENCODE 26):

rse = create_ranged_summarized_experiment(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",
)

Use the raw annotation extension instead of a label:

rse = create_ranged_summarized_experiment(
    project="SRP009615",
    organism="human",
    annotation_extension="G026",
)

Build a junction-level RSE:

rse = create_ranged_summarized_experiment(
    project="SRP009615",
    organism="human",
    genomic_unit="junction",
)

recount3.se.create_rse(*, project: str, genomic_unit: str = 'gene', organism: str = 'human', data_source: str = 'sra', annotation_label: str | None = None, annotation_extension: str | None = None, junction_type: str = 'ALL', junction_extensions: Sequence[str] | None = None, include_metadata: bool = True, include_bigwig: bool = False, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False, strict: bool = True) → summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]

Alias for create_ranged_summarized_experiment().

The parameters and behavior are identical; see that function for full documentation.

Examples

Build an RSE for a human SRA gene-count project:

rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",
)

Use the raw extension string instead of a label:

rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_extension="G026",
)

Parameters:

project (str)
genomic_unit (str)
organism (str)
data_source (str)
annotation_label (str | None)
annotation_extension (str | None)
junction_type (str)
junction_extensions (Sequence[str] | None)
include_metadata (bool)
include_bigwig (bool)
assay_name (str)
join_policy (str)
autoload (bool)
allow_fallback_to_se (bool)
strict (bool)

Return type:

summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment

recount3.se.expand_sra_attributes(experiment_or_coldata: summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame, *, sra_attributes_column: str = 'sra.sample_attributes', attribute_column_prefix: str = 'sra_attribute.') → summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame[source]

Expand encoded SRA sample attributes into separate columns.

This function mirrors the recount3 R helper expand_sra_attributes().

It understands the SRA encoding used by recount3, where a single column (typically 'sra.sample_attributes') contains strings of the form:

"age;;67.78|biomaterial_provider;;LIBD|disease;;Control|..."

Each key;;value pair becomes a new column named {attribute_column_prefix}{key} (with spaces in key replaced by '_'), and the parsed values are stored per sample. The original column is preserved.
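
The encoding can be illustrated with a small standalone parser. This is a sketch of the parsing rule described above, not the function's actual implementation; parse_sra_attributes is a hypothetical name, and skipping malformed pairs is an assumption.

```python
def parse_sra_attributes(encoded, prefix="sra_attribute."):
    """Parse one 'key;;value|key;;value' string into {column_name: value}."""
    parsed = {}
    if not encoded:
        return parsed
    for pair in encoded.split("|"):
        key, separator, value = pair.partition(";;")
        if not separator:
            # Assumption: pairs without the ';;' delimiter are skipped.
            continue
        # Spaces in the key are replaced by '_' and the prefix is prepended.
        parsed[prefix + key.replace(" ", "_")] = value
    return parsed


row = parse_sra_attributes("age;;67.78|biomaterial_provider;;LIBD|disease;;Control")
for column, value in row.items():
    print(column, value)
```

Applying such a parser to every sample and collecting the keys yields one new column per attribute.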

The function supports two calling styles:

Passing a pandas.DataFrame of column metadata, in which case a new DataFrame is returned with extra columns.
Passing a BiocPy summarizedexperiment.SummarizedExperiment or summarizedexperiment.RangedSummarizedExperiment, in which case a new object of the same class is returned with updated column_data.

Parameters:

experiment_or_coldata (summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame) – Column metadata DataFrame or a BiocPy SE/RSE object.
sra_attributes_column (str) – Name of the column that holds the raw SRA attribute strings.
attribute_column_prefix (str) – Prefix to prepend to the generated attribute column names.

Returns:

A new object of the same type as experiment_or_coldata with additional columns corresponding to parsed SRA attributes. If the requested column is missing, experiment_or_coldata is returned unchanged.

Raises:

ImportError – If a SummarizedExperiment / RangedSummarizedExperiment is supplied but BiocPy packages are not installed.
AttributeError – If the BiocPy object does not expose column_data or set_column_data in the expected API.
TypeError – If experiment_or_coldata is neither a DataFrame nor a supported BiocPy experiment object.

Return type:

summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame

Examples

Expand attributes on a metadata DataFrame:

expanded_df = expand_sra_attributes(col_data_df)

Expand attributes directly on an RSE (returns a new RSE):

rse2 = expand_sra_attributes(rse)

Inspect the new attribute columns:

attr_cols = [c for c in rse2.column_data.column_names
             if c.startswith("sra_attribute.")]

recount3.se.compute_read_counts(rse: Any, round_to_integers: bool = True, avg_mapped_read_length_column: str = 'recount_qc.star.average_mapped_length') → DataFrame[source]

Convert coverage-sum counts into approximate read/fragment counts.

The “raw_counts” assay used by recount3-style resources represents summed per-base coverage over each feature (for example, total coverage across all bases in a gene), not the number of reads/fragments overlapping the feature. A common approximation for converting coverage-sum values into read/fragment counts is to divide by the sample’s average mapped read length.

For each feature i and sample j:

read_counts[i, j] = raw_counts[i, j] / avg_mapped_read_length_column[j]

This operation is performed column-wise: each sample column in the assay is divided by a single scalar derived from that sample’s metadata.

Rounding is optional. When enabled, values are rounded to 0 decimals to produce integer-like counts, which is convenient for downstream methods that assume counts are integer-valued.
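
The column-wise division can be sketched with plain Python lists. The function name approximate_read_counts and the toy numbers are illustrative assumptions; the real function operates on an RSE and returns a DataFrame.

```python
def approximate_read_counts(raw_counts, avg_mapped_read_lengths,
                            round_to_integers=True):
    """Divide each sample column by that sample's average mapped read length.

    raw_counts is a features x samples list of lists (coverage sums).
    """
    result = []
    for row in raw_counts:
        new_row = [value / length
                   for value, length in zip(row, avg_mapped_read_lengths)]
        if round_to_integers:
            # Round to 0 decimals to obtain integer-like counts.
            new_row = [float(round(value)) for value in new_row]
        result.append(new_row)
    return result


raw = [[1000.0, 300.0], [260.0, 0.0]]   # two features, two samples
lengths = [100.0, 75.0]                 # one scalar per sample
print(approximate_read_counts(raw, lengths))
print(approximate_read_counts(raw, lengths, round_to_integers=False))
```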

Parameters:

rse (Any) – A RangedSummarizedExperiment-like object containing a “raw_counts” assay and sample metadata in col_data.
round_to_integers (bool) – If True, round the resulting values to 0 decimals.
avg_mapped_read_length_column (str) – Name of the metadata column containing average mapped read length per sample. The column must be present in col_data and must contain numeric values.

Returns:

A DataFrame of approximate read counts with the same shape as the “raw_counts” assay (features x samples). Row and column names are preserved when available.

Raises:

TypeError – If rse is not a RangedSummarizedExperiment, or if round_to_integers is not a bool.
ValueError – If the “raw_counts” assay is missing, if avg_mapped_read_length_column is missing from col_data, or if the assay and metadata dimensions do not align.

Return type:

DataFrame

recount3.se.compute_tpm(rse: summarizedexperiment.RangedSummarizedExperiment) → pd.DataFrame[source]

Compute Transcripts Per Million (TPM) from raw coverage sums.

TPM is calculated as:

Approximate Read Counts = Coverage AUC / Avg Read Length
RPK = Read Counts / (Feature Length / 1000)
Scale Factor = Sum(RPK) / 1,000,000
TPM = RPK / Scale Factor
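
The four steps can be traced for a single sample with plain Python. This is a sketch under stated assumptions: tpm_for_sample is a hypothetical name, and the coverage sums, feature lengths, and read length are made-up toy values.

```python
def tpm_for_sample(coverage_sums, feature_lengths_bp, avg_read_length):
    """Compute TPM for one sample following the four steps above."""
    # Step 1: approximate read counts from coverage sums.
    read_counts = [c / avg_read_length for c in coverage_sums]
    # Step 2: reads per kilobase of feature length.
    rpk = [r / (length / 1000.0)
           for r, length in zip(read_counts, feature_lengths_bp)]
    # Step 3: per-sample scale factor.
    scale_factor = sum(rpk) / 1_000_000.0
    # Step 4: TPM values, which sum to one million within the sample.
    return [value / scale_factor for value in rpk]


tpm = tpm_for_sample(
    coverage_sums=[2000.0, 6000.0],
    feature_lengths_bp=[1000.0, 2000.0],
    avg_read_length=100.0,
)
print(tpm, sum(tpm))
```

Because the scale factor is computed per sample, TPM values within each sample sum to one million regardless of sequencing depth.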

Parameters:

rse (summarizedexperiment.RangedSummarizedExperiment) – A RangedSummarizedExperiment object containing raw coverage sums. Must have feature widths defined in rowRanges.

Returns:

A DataFrame of TPM values.

Raises:

TypeError – If rse is not a RangedSummarizedExperiment (needs rowRanges).
ValueError – If feature widths or read lengths are missing.

Return type:

pd.DataFrame

Examples

Compute TPM from an RSE built with create_rse():

rse = create_rse(project="SRP009615", organism="human",
                 annotation_label="gencode_v26")
tpm_df = compute_tpm(rse)

recount3.se.is_paired_end(sample_metadata_source: Any, avg_mapped_read_length_column: str = 'recount_qc.star.average_mapped_length', avg_read_length_column: str = 'recount_seq_qc.avg_len') → Series[source]

Infer paired-end status, matching recount3::is_paired_end().

In recount3 (R), paired-end status is inferred as follows:

ratio <- round(avg_mapped_read_length / avg_read_length, 0)
result <- ratio == 2

ratio must be 1 (single-end) or 2 (paired-end); any other value yields NA with a warning. The result is named by external_id (names(result) = external_id).
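
The same rule for a single sample, as a minimal Python sketch. infer_paired_end is a hypothetical name; returning None for unknown status (instead of pd.NA with a warning) and treating a zero read length as unknown are simplifying assumptions.

```python
def infer_paired_end(avg_mapped_read_length, avg_read_length):
    """Return True (paired-end), False (single-end), or None (unknown)."""
    if avg_mapped_read_length is None or avg_read_length is None:
        return None
    if avg_read_length == 0:
        # Assumption: a zero read length cannot be classified.
        return None
    ratio = round(avg_mapped_read_length / avg_read_length)
    if ratio == 2:
        return True
    if ratio == 1:
        return False
    # Any other ratio is treated as unknown.
    return None


print(infer_paired_end(198.4, 100.0))
print(infer_paired_end(98.0, 100.0))
print(infer_paired_end(310.0, 100.0))
```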

Parameters:

sample_metadata_source (Any) – Sample metadata (DataFrame) or a (Ranged)SummarizedExperiment-like object with col_data.to_pandas().
avg_mapped_read_length_column (str) – Metadata column with average mapped length.
avg_read_length_column (str) – Metadata column containing average read length.

Returns:

A pandas Series of dtype “boolean” (True/False/pd.NA), indexed by external_id.

Raises:

ValueError – If required metadata columns are missing or non-numeric.

Return type:

Series

recount3.se.compute_scale_factors(sample_metadata_source: Any, by: str = 'auc', target_read_count: float = 40000000.0, target_read_length_bp: float = 100, auc_column: str = 'recount_qc.bc_auc.all_reads_all_bases', avg_mapped_read_length_column: str = 'recount_qc.star.average_mapped_length', mapped_reads_column: str = 'recount_qc.star.all_mapped_reads', paired_end_status: Sequence[bool] | Series | None = None) → Series[source]

Compute per-sample scaling factors for coverage-sum counts.

This function produces one scalar scale factor per sample. The intended use is to multiply each sample column of a coverage-sum count matrix by the corresponding factor to make samples comparable.

Let C[i, j] be the coverage-sum count for feature i in sample j. If s[j] is the scale factor for sample j, scaled counts are computed as:

scaled[i, j] = C[i, j] * s[j]

Scale factors are derived from sample metadata. Samples are identified by the external_id column, and the returned Series is indexed by external_id.

Two scaling methods are supported:

by="auc"

Uses a per-sample total coverage metric (AUC) to scale each sample to a common target_read_count:

s[j] = target_read_count / auc[j]

This method preserves relative feature coverage within each sample while adjusting overall sample magnitude to be comparable across samples.

by="mapped_reads"

Uses mapped read counts and read length to normalize samples to a common target_read_count and a common target read length target_read_length_bp:

s[j] = target_read_count * target_read_length_bp * paired_multiplier[j] / (mapped_reads[j] * (avg_mapped_read_length[j] ** 2))

paired_multiplier is:

2 for paired-end samples
1 for single-end samples
missing for samples whose paired-end status cannot be inferred

If paired_end_status is not provided, paired-end status is inferred from metadata by comparing average mapped length to average read length:

ratio = round(avg_mapped_read_length / avg_read_length)

ratio == 2 indicates paired-end and ratio == 1 indicates single-end. Other ratios are treated as unknown and produce missing paired multipliers.

Missing values in required metadata propagate to missing scale factors. Non-numeric metadata values raise an error.
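
Both formulas can be checked by hand for a single sample. The sketch below uses hypothetical helper names and toy metadata values, and represents missing values as None rather than as pandas missing values.

```python
def auc_scale_factor(auc, target_read_count=40_000_000.0):
    """s[j] = target_read_count / auc[j]; missing input stays missing."""
    if auc is None:
        return None
    return target_read_count / auc


def mapped_reads_scale_factor(mapped_reads, avg_mapped_read_length, paired_end,
                              target_read_count=40_000_000.0,
                              target_read_length_bp=100.0):
    """Scale factor for by="mapped_reads"; None propagates as missing."""
    if None in (mapped_reads, avg_mapped_read_length, paired_end):
        return None
    paired_multiplier = 2 if paired_end else 1
    return (target_read_count * target_read_length_bp * paired_multiplier
            / (mapped_reads * avg_mapped_read_length ** 2))


print(auc_scale_factor(2.0e9))
print(mapped_reads_scale_factor(5.0e7, 200.0, paired_end=True))
print(mapped_reads_scale_factor(5.0e7, 200.0, paired_end=None))
```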

Parameters:

sample_metadata_source (Any) – Sample metadata as a DataFrame, or an object with col_data.to_pandas() that yields sample metadata.
by (str) – Scaling method: “auc” or “mapped_reads”.
target_read_count (float) – Target library size used to compute scale factors. Interpreted as the number of single-end reads to scale each sample to.
target_read_length_bp (float) – Target read length used only when by="mapped_reads".
auc_column (str) – Metadata column name for the per-sample AUC metric.
avg_mapped_read_length_column (str) – Metadata column name for average mapped read length per sample.
mapped_reads_column (str) – Metadata column name for mapped read counts per sample.
paired_end_status (Sequence[bool] | Series | None) – Optional paired-end indicator per sample. If provided, it must align with the samples in external_id. If omitted, paired-end status is inferred from metadata.

Returns:

A pandas Series of scale factors indexed by external_id. The Series name is “scale_factor”.

Raises:

ValueError – If by is invalid, required metadata columns are missing, or non-numeric metadata values are present.
TypeError – If target_read_count or target_read_length_bp are not numeric scalars.

Return type:

Series

Examples

AUC-based scaling (default):

sf = compute_scale_factors(rse)  # inspect per-sample factors
scaled = transform_counts(rse, by="auc")  # apply scaling

Mapped-reads-based scaling:

sf = compute_scale_factors(rse, by="mapped_reads")

recount3.se.transform_counts(rse: Any, by: str = 'auc', target_read_count: float = 40000000.0, target_read_length_bp: float = 100, round_to_integers: bool = True, **kwargs: Any) → DataFrame[source]

Scale coverage-sum counts to a common library size.

recount3 “raw_counts” represent summed per-base coverage over each feature, not read/fragment counts. This function converts those coverage-sum values into scaled counts that are comparable across samples by multiplying each sample column by a sample-specific scale factor.

Scaling is applied independently per sample (per column). For feature i and sample j, the returned matrix contains:

scaled[i, j] = raw_counts[i, j] * scale_factor[j]

The scale factors are computed from sample metadata (col_data) using one of two methods:

by="auc":

scale_factor[j] = target_read_count / auc[j]

where auc is a per-sample total coverage metric. This method scales each sample to have total coverage approximately equal to target_read_count.

by="mapped_reads":

scale_factor[j] = target_read_count * target_read_length_bp * paired_multiplier[j] / (mapped_reads[j] * (avg_mapped_read_length[j] ** 2))

where paired_multiplier is 2 for paired-end samples, 1 for single-end samples, and missing when paired-end status cannot be inferred. This method incorporates mapped reads and read length so that samples with different read lengths are normalized onto the same target read length target_read_length_bp.

The returned values remain in the same feature-by-sample shape as the input. If round_to_integers is True, values are rounded to integer-like counts.
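
Applying per-sample factors to a matrix is a column-wise multiplication. The sketch below uses plain lists, a hypothetical scale_counts helper, and toy AUC values; note that Python's built-in round is used here, which may differ from the library's rounding on exact ties.

```python
def scale_counts(raw_counts, scale_factors, round_to_integers=True):
    """Multiply each sample column by its scale factor (features x samples)."""
    scaled = []
    for row in raw_counts:
        new_row = [value * factor for value, factor in zip(row, scale_factors)]
        if round_to_integers:
            new_row = [round(value) for value in new_row]
        scaled.append(new_row)
    return scaled


auc = [2.0e9, 1.0e9]                           # toy per-sample AUC values
factors = [40_000_000.0 / a for a in auc]      # by="auc" scale factors
raw = [[1000.0, 500.0], [250.0, 125.0]]        # coverage sums
scaled = scale_counts(raw, factors)
print(scaled)
```

In this toy input the second sample has half the coverage of the first, so after scaling both columns agree.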

Parameters:

rse (Any) – A RangedSummarizedExperiment-like object containing a “raw_counts” assay and sample metadata in col_data.
by (str) – Scaling method: “auc” or “mapped_reads”.
target_read_count (float) – Target library size used to compute scale factors. Interpreted as the number of single-end reads to scale each sample to.
target_read_length_bp (float) – Target read length used only when by="mapped_reads".
round_to_integers (bool) – If True, round scaled values to 0 decimals.
**kwargs (Any) – Additional parameters forwarded to compute_scale_factors(). Use this to override metadata column names (for example, auc_column=…, mapped_reads_column=…, avg_mapped_read_length_column=…) or to provide paired_end_status=… when paired-end status should not be inferred.

Returns:

A DataFrame of scaled counts with the same dimensions as assay(“raw_counts”). Row and column names are preserved when available.

Raises:

ValueError – If rse is not a RangedSummarizedExperiment, if the required assay or metadata columns are missing, if by is invalid, or if the assay and metadata dimensions do not align.
TypeError – If round_to_integers is not a bool, or if numeric parameters are not valid scalars.

Return type:

DataFrame

recount3._descriptions

Resource descriptions and duffel URL-path construction.

This module defines a small set of resource description dataclasses for recount3. A description is a validated, immutable-ish bundle of parameters (organism, project, etc.) that can deterministically construct the relative path to a resource in the repository.

The main entry point is R3ResourceDescription, which acts as a multi-factory: instantiating R3ResourceDescription(resource_type=...) returns an instance of the registered concrete subclass for that resource_type.

Registered resource types

`resource_type` string	Class
`"annotations"`	`R3Annotations`
`"count_files_gene_or_exon"`	`R3GeneOrExonCounts`
`"count_files_junctions"`	`R3JunctionCounts`
`"metadata_files"`	`R3ProjectMetadata`
`"bigwig_files"`	`R3BigWig`
`"data_sources"`	`R3DataSources`
`"data_source_metadata"`	`R3DataSourceMetadata`

Valid values for the organism and data_source fields are exposed as the module-level constants VALID_ORGANISMS and VALID_DATA_SOURCES.

Typical usage example:

from recount3._descriptions import R3ResourceDescription

desc = R3ResourceDescription(
    resource_type="count_files_gene_or_exon",
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP107565",
    annotation_extension="G026",
)
path = desc.url_path()

# Resource type can also be passed as the first positional argument:
desc = R3ResourceDescription("data_sources", organism="human")

class recount3._descriptions.R3ResourceDescription(*args: Any, **kwargs: Any)[source]

Abstract base class and multi-factory for recount3 resource descriptors.

Instantiate this class with a resource_type to obtain an instance of the registered concrete subclass.

Example

>>> desc = R3ResourceDescription(
...     resource_type="annotations",
...     organism="human",
...     genomic_unit="gene",
...     annotation_extension="G026",
... )
>>> isinstance(desc, R3Annotations)
True

Concrete subclasses should:

Inherit from both _R3CommonFields and R3ResourceDescription.
Validate required parameters in __post_init__.
Implement url_path() to return the duffel-relative path.

_TYPE_REGISTRY

Mapping from resource-type strings to concrete classes. Mutable to allow dynamic registration.

Type: dict[str, type[recount3._descriptions.R3ResourceDescription]]

_RESOURCE_TYPE

Registration string key. Injected into concrete subclasses by the register_type() decorator.

Type: str

resource_type

Resource-type discriminator. Declared here for static typing; implemented by _R3CommonFields.

Type: str

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args (Any) – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs (Any) – Keyword arguments used to initialize the selected dataclass subclass.

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

Return type:

R3ResourceDescription
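
The multi-factory behavior can be sketched with a registry and __new__. This is an illustrative pattern only, not the package's implementation: SketchDescription and SketchDataSources are hypothetical classes, and plain attributes stand in for the real dataclass fields.

```python
class SketchDescription:
    """Hypothetical base class acting as a factory for registered subclasses."""

    _registry = {}

    def __new__(cls, *args, **kwargs):
        if cls is not SketchDescription:
            # Direct instantiation of a concrete subclass.
            return super().__new__(cls)
        resource_type = kwargs.get("resource_type") or (args[0] if args else None)
        if not resource_type:
            raise KeyError("resource_type is missing or empty")
        if resource_type not in cls._registry:
            raise ValueError(f"unregistered resource_type: {resource_type!r}")
        return super().__new__(cls._registry[resource_type])

    def __init__(self, resource_type=None, **fields):
        self.resource_type = resource_type or self._resource_type
        self.fields = fields

    @classmethod
    def register_type(cls, resource_type):
        def decorator(subclass):
            cls._registry[resource_type] = subclass
            subclass._resource_type = resource_type
            return subclass
        return decorator


@SketchDescription.register_type("data_sources")
class SketchDataSources(SketchDescription):
    def url_path(self):
        return f"{self.fields['organism']}/homes_index"


desc = SketchDescription("data_sources", organism="human")
print(type(desc).__name__, desc.url_path())
```

Returning an instance of a subclass from __new__ means Python still calls __init__ on it, so the keyword and positional forms both initialize the concrete class.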

classmethod register_type(resource_type: str) → Callable[[type[R3ResourceDescription]], type[R3ResourceDescription]][source]

Registers a concrete subclass for a given resource_type.

This decorator binds resource_type to the provided subclass and also sets an internal _RESOURCE_TYPE attribute on that subclass.

Parameters:

resource_type (str) – Resource-type string used as the factory key.

Returns:

A decorator that registers the decorated subclass.

Return type:

Callable[[type[R3ResourceDescription]], type[R3ResourceDescription]]

Example

>>> @R3ResourceDescription.register_type("annotations")
... @dataclasses.dataclass(slots=True)
... class R3Annotations(_R3CommonFields, R3ResourceDescription):
...     ...

url_path() → str[source]

Return the duffel-relative URL path for this resource.

R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.

This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.

Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Raises:

NotImplementedError – Always, in the base class. Concrete subclasses override this method.

Return type:

str
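
Joining a relative path onto the base URL might look like the following. This is a sketch, not the package's own joining logic: join_url is a hypothetical helper, and the base URL is the documented RECOUNT3_URL default.

```python
def join_url(base_url, relative_path):
    """Join a duffel-relative path onto a base URL with exactly one slash."""
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


base = "http://duffel.rail.bio/recount3/"
url = join_url(base, "human/homes_index")
print(url)
```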

class recount3._descriptions.R3Annotations(*args: Any, **kwargs: Any)[source]

Descriptor for annotation GTF files.

Required fields:

organism
genomic_unit
annotation_extension

Duffel layout:

{organism}/annotations/{genomic_unit}_sums/{organism}.{genomic_unit}_sums.{annotation_extension}.gtf.gz
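
The layout expands mechanically from the three required fields. The helper below is a hypothetical stand-in for url_path(), written only to show the substitution.

```python
def annotation_url_path(organism, genomic_unit, annotation_extension):
    """Build the annotation path from the documented Duffel layout."""
    directory = f"{organism}/annotations/{genomic_unit}_sums"
    filename = f"{organism}.{genomic_unit}_sums.{annotation_extension}.gtf.gz"
    return f"{directory}/{filename}"


path = annotation_url_path("human", "gene", "G026")
print(path)
```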

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource, following the Duffel layout documented for this class.

The returned path has no leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Return type:

str

class recount3._descriptions.R3GeneOrExonCounts(*args: Any, **kwargs: Any)[source]

Descriptor for per-project gene/exon count matrices.

Required fields:

organism
data_source
genomic_unit
project
annotation_extension

Duffel layout:

{organism}/data_sources/{data_source}/{genomic_unit}_sums/{_project_shard(project)}/{project}/{data_source}.{genomic_unit}_sums.{project}.{annotation_extension}.gz
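
A sketch of how this layout expands for a concrete project. The behavior of _project_shard() is not documented here; the project_shard helper below assumes the shard is the last two characters of the project ID, which is an assumption for illustration and may not match the private helper.

```python
def project_shard(project):
    """Assumed shard rule: the last two characters of the project ID."""
    return project[-2:]


def gene_or_exon_counts_url_path(organism, data_source, genomic_unit,
                                 project, annotation_extension):
    """Hypothetical stand-in for url_path(), following the documented layout."""
    directory = (
        f"{organism}/data_sources/{data_source}/{genomic_unit}_sums/"
        f"{project_shard(project)}/{project}"
    )
    filename = (
        f"{data_source}.{genomic_unit}_sums.{project}.{annotation_extension}.gz"
    )
    return f"{directory}/{filename}"


path = gene_or_exon_counts_url_path("human", "sra", "gene", "SRP107565", "G026")
print(path)
```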

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource, following the Duffel layout documented for this class.

The returned path has no leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Return type:

str

class recount3._descriptions.R3JunctionCounts(*args: Any, **kwargs: Any)[source]

Descriptor for per-project junction count files.

Required fields:

organism
data_source
project
junction_type
junction_extension

Duffel layout:

{organism}/data_sources/{data_source}/junctions/{_project_shard(project)}/{project}/{data_source}.junctions.{project}.{junction_type}.{junction_extension}.gz

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource, following the Duffel layout documented for this class.

The returned path has no leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Return type:

str

class recount3._descriptions.R3ProjectMetadata(*args: Any, **kwargs: Any)[source]

Descriptor for per-project metadata tables.

Required fields:

organism
data_source
project
table_name

Duffel layout:

{organism}/data_sources/{data_source}/metadata/{_project_shard(project)}/{project}/{data_source}.{table_name}.{project}.MD.gz

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource, following the Duffel layout documented for this class.

The returned path has no leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Return type:

str

class recount3._descriptions.R3BigWig(*args: Any, **kwargs: Any)[source]

Descriptor for per-sample BigWig coverage files.

Required fields:

organism
data_source
project
sample

Duffel layout:

{organism}/data_sources/{data_source}/base_sums/{_project_shard(project)}/{project}/{_sample_shard(sample, data_source)}/{data_source}.base_sums.{project}_{sample}.ALL.bw

The sample shard subdirectory uses a different offset for GTEx samples compared to SRA/TCGA samples; see _sample_shard().

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource, following the Duffel layout documented for this class.

The returned path has no leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Return type:

str

class recount3._descriptions.R3DataSources(*args: Any, **kwargs: Any)[source]

Descriptor for the organism-level data-source index (homes_index).

Required fields:

organism

Duffel layout:

{organism}/homes_index

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as:

keyword argument resource_type=…, or
the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource, following the Duffel layout documented for this class.

The returned path has no leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns:

A duffel-relative path string, no leading slash.

Return type:

str

class recount3._descriptions.R3DataSourceMetadata(*args: Any, **kwargs: Any)[source]

Descriptor for source-level metadata listings.

Required fields:

organism
data_source

Duffel layout:

{organism}/data_sources/{data_source}/metadata/{data_source}.recount_project.MD.gz

Constructs and returns an instance of the appropriate subclass.

The factory accepts the resource type either as the keyword argument resource_type=… or as the first positional argument.

Parameters:

*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)

Returns:

An instance of the concrete subclass registered for the selected resource type.

Raises:

KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.

url_path() → str[source]

Return the duffel-relative URL path for this resource.

R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.

This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.

Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).

Returns: A duffel-relative path string, no leading slash.
Raises: NotImplementedError – Always, in the base class. Concrete subclasses override this method.
Return type: str

recount3.errors

Domain-specific exceptions for recount3.

All exceptions inherit from Recount3Error, so callers can catch the base class to handle any recount3-specific failure, or catch a subclass to handle a specific failure mode:

Recount3Error: base class for all recount3 exceptions.
ConfigurationError: invalid or missing configuration (bad env var values, inaccessible cache directory, unsupported option combinations).
DownloadError: a network or I/O failure occurred while downloading a resource.
LoadError: a resource was downloaded but could not be parsed (empty file, unexpected format, shape mismatch, missing columns).
CompatibilityError: resources that are incompatible with each other were combined in an operation such as stack_count_matrices().

Example

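Catch a specific failure mode first, then fall back to the base class for any other recount3 error:

from recount3.errors import Recount3Error, DownloadError

# res is assumed to be an existing R3Resource.
try:
    res.download(path="/data")
except DownloadError as exc:
    # Most specific handler first: network or I/O failure during download.
    print(f"Network failure: {exc}")
except Recount3Error as exc:
    # The base class catches every other recount3-specific error.
    print(f"Unexpected recount3 error: {exc}")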

exception recount3.errors.Recount3Error[source]: Base class for recount3-related errors.

exception recount3.errors.ConfigurationError[source]: Raised when configuration is invalid or inconsistent.

exception recount3.errors.DownloadError[source]: Raised when a resource fails to download.

exception recount3.errors.LoadError[source]: Raised when a resource fails to load or parse.

exception recount3.errors.CompatibilityError[source]: Raised when resources are incompatible for combined operations.

recount3.types

Common type aliases and literals used throughout the recount3 API.

recount3.types.CacheMode

Literal "enable" | "disable" | "update". Controls how download() interacts with the on-disk cache:

"enable": use the existing cached file; download only if missing (default).
"disable": bypass the cache entirely and stream the file directly to the destination.
"update": force a fresh download even if a cached copy already exists, then cache the result.

Type: TypeAlias
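The three modes can be read as a small decision rule. The sketch below is only an illustration of the descriptions above: the alias is re-declared locally rather than imported, and should_download() is a hypothetical helper, not a recount3 function.

```python
from typing import Literal, get_args

# Local re-declaration for illustration; the real alias lives in recount3.types.
CacheMode = Literal["enable", "disable", "update"]


def should_download(mode: str, is_cached: bool) -> bool:
    """Return True when a network fetch is needed under the given mode."""
    if mode not in get_args(CacheMode):
        raise ValueError(f"unknown cache mode: {mode!r}")
    if mode == "enable":
        # Reuse the cached file; fetch only when it is missing.
        return not is_cached
    # "disable" bypasses the cache and "update" refreshes it: both always fetch.
    return True


print(should_download("enable", is_cached=True))
```

The difference between "disable" and "update" lies in what happens after the fetch (whether the result is written to the cache), which this helper does not model.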

recount3.types.CompatibilityMode

Literal "family" | "feature". Controls validation in stack_count_matrices():

"family": all count resources must belong to the same high-level family (gene/exon versus junctions).
"feature": stricter: all resources must additionally share an identical feature space – the same genomic unit for gene/exon resources, or the same junction type and extension for junction resources. (Note: this does not constrain the annotation extension, so mixed-annotation gene/exon builds are not rejected.)

Type: TypeAlias
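A minimal sketch of the two validation levels follows, assuming a simplified record type. CountResource and compatible() are hypothetical names, and the field values used are arbitrary labels. The real check in stack_count_matrices() may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CountResource:
    """Hypothetical, simplified view of a count resource."""

    family: str  # "gene_or_exon" or "junctions"
    genomic_unit: Optional[str] = None
    junction_type: Optional[str] = None
    junction_extension: Optional[str] = None
    annotation_extension: Optional[str] = None


def compatible(resources: Sequence[CountResource], mode: str = "family") -> bool:
    """Check resources under the "family" or the stricter "feature" rule."""
    families = {r.family for r in resources}
    if len(families) > 1:
        return False  # gene/exon mixed with junctions fails in both modes
    if mode == "family":
        return True
    if families == {"gene_or_exon"}:
        # Same genomic unit required; the annotation extension is not compared.
        return len({r.genomic_unit for r in resources}) <= 1
    return len({(r.junction_type, r.junction_extension) for r in resources}) <= 1


genes_a = CountResource("gene_or_exon", genomic_unit="gene", annotation_extension="A")
genes_b = CountResource("gene_or_exon", genomic_unit="gene", annotation_extension="B")
exons = CountResource("gene_or_exon", genomic_unit="exon", annotation_extension="A")
print(compatible([genes_a, genes_b], "feature"), compatible([genes_a, exons], "feature"))
```

The two gene resources pass the "feature" check despite different annotation extensions, which mirrors the caveat in the note above.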

recount3.types.StringOrIterable

str | Iterable[str]. Most search functions accept either a single string or an iterable of strings for each parameter. When an iterable is passed, the function computes the Cartesian product across all parameters and returns one R3Resource per combination.

Type: TypeAlias
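The Cartesian-product behavior can be illustrated with itertools.product. The helpers as_list() and expand() below are hypothetical and return plain dictionaries rather than R3Resource objects; they show only how single strings and iterables combine.

```python
from itertools import product
from typing import Dict, Iterable, List, Union

StringOrIterable = Union[str, Iterable[str]]  # local re-declaration for illustration


def as_list(value: StringOrIterable) -> List[str]:
    """Wrap a single string in a list; copy any other iterable."""
    return [value] if isinstance(value, str) else list(value)


def expand(**params: StringOrIterable) -> List[Dict[str, str]]:
    """Return one dict per combination of the supplied parameter values."""
    keys = list(params)
    value_lists = [as_list(params[key]) for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


combos = expand(organism="human", project=["SRP009615", "SRP001558"])
for combo in combos:
    print(combo)
```

The explicit isinstance check matters because a str is itself iterable; without it, "human" would expand into five single-character values.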

recount3.types.FieldSpec

StringOrIterable | Callable[[Any], bool] | None. The filter predicate accepted by filter() and match_spec(). Three forms are accepted:

None: no filtering; every value passes.
A string or iterable of strings: exact membership test against the field value.
A callable: called with the field value; a truthy return keeps the resource.

Type: TypeAlias
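The three accepted forms can be sketched as a single predicate. The function name matches() is hypothetical and is not the library's match_spec(); it only mirrors the rules listed above.

```python
from typing import Any, Callable, Iterable, Union

FieldSpec = Union[str, Iterable[str], Callable[[Any], bool], None]  # local copy


def matches(spec: FieldSpec, value: Any) -> bool:
    """Apply a FieldSpec-style filter to a single field value."""
    if spec is None:
        return True  # no filtering: every value passes
    if callable(spec):
        return bool(spec(value))  # a truthy result keeps the resource
    if isinstance(spec, str):
        return value == spec  # single string: exact match
    return any(value == item for item in spec)  # iterable: membership test


print(matches(["gene", "exon"], "gene"))
print(matches(lambda v: v is not None and v.startswith("SRP"), "SRP009615"))
```

Checking callable() before the string and iterable cases keeps the branches unambiguous, since a callable is never a str.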

recount3._bigwig

BigWig file access via a small wrapper around pyBigWig.

This is an internal module. For BigWig access via the public API, use load() on a BigWig R3Resource.

The optional dependency is imported through get_pybigwig_module().

Typical usage example:

>>> from pathlib import Path
>>> from recount3._bigwig import BigWigFile
>>> with BigWigFile(Path("example.bw")) as bw:
...     lengths = bw.chroms()
...     mean = bw.stats("chr1", 0, 1000)[0]

Note

Requires the optional pyBigWig package. Install with pip install "recount3[bigwig]". An ImportError is raised on first use if the package is not available. pyBigWig can be difficult to install on non-Linux systems.

class recount3._bigwig.BigWigFile(path: Path, mode: str = 'r')[source]

A lazily-opened BigWig reader with a small, typed API.

Instances are cheap to construct and do not open the file until the first method call that requires a live pyBigWig handle. The handle is cached for subsequent calls and can be explicitly released with close().

Parameters:

path (Path)
mode (str)

path

Filesystem path to a BigWig file (typically .bw). The file must exist when the handle is opened.

Type: pathlib.Path

mode

File mode passed to pyBigWig.open. Reading is the default ("r").

Type: str

close() → None[source]

Close the underlying pyBigWig handle if it is open.

This method is idempotent.

Return type: None

is_open() → bool[source]

Return True if the underlying handle is currently open.

Return type: bool
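The lazy-open lifecycle (open on first use, cache the handle, idempotent close()) can be sketched independently of pyBigWig. The LazyHandle class below is hypothetical and takes an arbitrary opener callable; it illustrates the pattern only and is not the BigWigFile implementation.

```python
import io
from typing import Any, Callable, Optional


class LazyHandle:
    """Hypothetical lazy wrapper: nothing is opened until a handle is needed."""

    def __init__(self, opener: Callable[[], Any]) -> None:
        self._opener = opener
        self._handle: Optional[Any] = None

    def _ensure_open(self) -> Any:
        # Open on first use, then reuse the cached handle.
        if self._handle is None:
            self._handle = self._opener()
        return self._handle

    def read(self) -> str:
        return self._ensure_open().read()

    def is_open(self) -> bool:
        return self._handle is not None

    def close(self) -> None:
        # Idempotent: closing an already closed wrapper does nothing.
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __enter__(self) -> "LazyHandle":
        return self

    def __exit__(self, *exc_info: Any) -> None:
        self.close()


open_calls = []


def opener() -> io.StringIO:
    open_calls.append(1)  # record each real open
    return io.StringIO("data")


with LazyHandle(opener) as handle:
    state_before = handle.is_open()
    content = handle.read()
    state_after = handle.is_open()
handle.close()  # second close is harmless
print(state_before, state_after, handle.is_open(), len(open_calls))
```

Construction stays cheap because the opener runs only inside _ensure_open(), and the context manager guarantees the handle is released on exit.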

chroms(chrom: str | None = None) → Mapping[str, int] | int | None[source]

Return chromosome lengths, or a single chromosome length.

Parameters: chrom (str | None) – If provided, return only the length for this chromosome.
Returns: If chrom is None, a mapping from chromosome name to length. Otherwise, the length for the requested chromosome. None may be returned if the chromosome is not present.
Return type: Mapping[str, int] | int | None

header() → dict[str, Any][source]

Return the BigWig header metadata.

Return type: dict[str, Any]

values(chrom: str, start: int, end: int, *, numpy: bool | None = None) → list[float] | Any[source]

Return per-base values over a half-open interval [start, end).

Parameters:

chrom (str) – Chromosome name.
start (int) – 0-based start coordinate (inclusive).
end (int) – 0-based end coordinate (exclusive).
numpy (bool | None) – Forwarded to pyBigWig.values. When True, pyBigWig may return a NumPy array depending on its configuration.

Returns:

Values returned by pyBigWig.values. When numpy is not True, this is typically a list[float]. When numpy is True, pyBigWig may return a NumPy array depending on its configuration.

Return type:

list[float] | Any

stats(chrom, start, end, type, n_bins, exact) → list[float | None]

Return summary statistic(s) over an interval or whole chromosome.

Parameters:

chrom (str) – Chromosome name.
start (int | None) – 0-based start coordinate (inclusive). If omitted, stats are computed over the whole chromosome.
end (int | None) – 0-based end coordinate (exclusive). If omitted, stats are computed over the whole chromosome.
type (str) – Statistic name understood by pyBigWig.stats (for example, "mean", "min", "max", "coverage").
n_bins (int | None) – If provided, request binned stats via the nBins argument.
exact (bool | None) – If provided, forward to the exact argument.

Returns:

A list of statistic values as returned by pyBigWig.stats. Values may be None for missing data regions.

Return type:

list[float | None]

intervals(chrom: str, start: int | None = None, end: int | None = None) → list[tuple[int, int, float]] | None | Any[source]

Return (start, end, value) intervals overlapping a region.

Parameters:

chrom (str) – Chromosome name.
start (int | None) – 0-based start coordinate (inclusive). If omitted, intervals for the entire chromosome may be returned.
end (int | None) – 0-based end coordinate (exclusive). If omitted, intervals for the entire chromosome may be returned.

Returns:

Intervals returned by pyBigWig.intervals. This is often a list of (start, end, value) tuples. None may be returned when no intervals overlap the requested region.

Return type:

list[tuple[int, int, float]] | None | Any
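To make the coordinate conventions concrete, the sketch below expands (start, end, value) intervals into per-base values over a half-open region [start, end). The helper intervals_to_values() is hypothetical and uses plain Python data; it does not call pyBigWig and is not how the library computes values(). Filling uncovered positions with NaN is an assumption of this sketch.

```python
import math
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int, float]


def intervals_to_values(
    intervals: Optional[Sequence[Interval]], start: int, end: int
) -> List[float]:
    """Expand intervals into one value per base over [start, end)."""
    values = [math.nan] * (end - start)  # uncovered positions stay NaN
    for i_start, i_end, value in intervals or []:
        # Clip each interval to the requested region before filling.
        lo = max(i_start, start)
        hi = min(i_end, end)
        for pos in range(lo, hi):
            values[pos - start] = value
    return values


per_base = intervals_to_values([(0, 3, 1.0), (5, 8, 2.0)], start=2, end=6)
print(per_base)
```

Because both coordinate systems are 0-based and half-open, the region [2, 6) yields exactly end - start = 4 values, and an interval ending at 3 covers position 2 but not position 3.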