API Reference
Typed Python API for the recount3 RNA-seq data repository.
recount3 is a large-scale uniformly processed RNA-seq resource covering tens of thousands of human and mouse samples across multiple data sources (SRA, GTEx, TCGA). This package provides a small, typed Python API for discovering, downloading, and loading recount3 data files.
The public surface is intentionally flat: core classes and helper functions are re-exported here for discovery and convenience.
Typical usage example: high-level (recommended for most cases):
from recount3 import create_rse

rse = create_rse(
    project="SRP009615",
    organism="human",
    annotation_label="gencode_v26",
)
Typical usage example: lower-level (multi-project or custom workflows):
from recount3 import R3ResourceBundle

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project=["SRP009615", "SRP012682"],
)
counts = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
).stack_count_matrices()
Note
Several features require optional dependencies:
BiocPy (biocframe, summarizedexperiment, genomicranges): required for create_rse(), to_ranged_summarized_experiment(), and the compute utilities in recount3.se. Install with pip install biocframe summarizedexperiment genomicranges.
pyBigWig: required for BigWig file access via BigWigFile. Install with pip install pyBigWig.
recount3.config
Configuration helpers for recount3.
This module centralizes configuration so that environment-dependent values
are not hidden as mutable module globals. Values are read once via
default_config() and can be overridden by constructing Config
directly. CLI flags in recount3.cli take the highest precedence and
override both environment variables and Config defaults.
The module also exposes three cache utility functions:
recount3_cache(), recount3_cache_files(), and
recount3_cache_rm().
- Environment variables (all optional):
RECOUNT3_URL: base URL of the duffel mirror (default: https://duffel.rail.bio/recount3/).
RECOUNT3_CACHE_DIR: directory for the on-disk file cache (default: ~/.cache/recount3).
RECOUNT3_CACHE_DISABLE: set to "1" to disable caching entirely.
RECOUNT3_HTTP_TIMEOUT: HTTP request timeout in seconds (default: 30).
RECOUNT3_MAX_RETRIES: maximum retry attempts for transient errors (default: 3).
RECOUNT3_INSECURE_SSL: set to "1" to skip TLS verification (unsafe; use only for debugging certificate issues).
RECOUNT3_USER_AGENT: custom User-Agent header string.
RECOUNT3_CHUNK_SIZE: streaming chunk size in bytes (default: 65536).
Typical usage example:
import dataclasses
from pathlib import Path

from recount3.config import default_config, recount3_cache, recount3_cache_rm

# Read the current cache directory (creates it if absent):
cache_dir = recount3_cache()

# Use a custom cache location for this session. The Config signature lists
# most fields as required, so start from default_config(); this assumes
# Config is a dataclass.
cfg = dataclasses.replace(default_config(), cache_dir=Path("/scratch/recount3_cache"))

# Remove cached files matching a predicate (dry run first):
def is_sra(p):
    return "sra" in str(p)

to_delete = recount3_cache_rm(predicate=is_sra, dry_run=True)
recount3_cache_rm(predicate=is_sra)
- class recount3.config.Config(base_url: str, timeout: int, insecure_ssl: bool, max_retries: int, user_agent: str, cache_dir: Path, cache_disabled: bool, chunk_size: int = 1048576)[source]
Immutable configuration bag.
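- Parameters:
base_url (str)
timeout (int)
insecure_ssl (bool)
max_retries (int)
user_agent (str)
cache_dir (Path)
cache_disabled (bool)
chunk_size (int)
- cache_dir
Cache directory for downloaded files.
- Type:
pathlib.Path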
- recount3.config.default_config() Config[source]
Return configuration constructed from environment variables.
Notes
Values are parsed to sensible types and the base URL is normalized to include a trailing slash (matching the original behavior).
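The parsing and normalization described above can be sketched as follows. This is a minimal illustration, not the package's implementation: SketchConfig and sketch_default_config are hypothetical names, only four of the documented fields are modelled, and the defaults are the ones listed under the environment variables.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for Config; only four fields are modelled."""

    base_url: str
    timeout: int
    cache_dir: Path
    cache_disabled: bool


def sketch_default_config(environ=None) -> SketchConfig:
    """Build a config from RECOUNT3_* variables, using the documented defaults."""
    env = os.environ if environ is None else environ
    base_url = env.get("RECOUNT3_URL", "https://duffel.rail.bio/recount3/")
    if not base_url.endswith("/"):
        base_url += "/"  # normalize to a trailing slash
    cache_dir = os.path.expanduser(env.get("RECOUNT3_CACHE_DIR", "~/.cache/recount3"))
    return SketchConfig(
        base_url=base_url,
        timeout=int(env.get("RECOUNT3_HTTP_TIMEOUT", "30")),
        cache_dir=Path(cache_dir),
        cache_disabled=env.get("RECOUNT3_CACHE_DISABLE") == "1",
    )


# Pass a plain dict instead of os.environ to keep the example deterministic.
cfg = sketch_default_config(
    {"RECOUNT3_URL": "https://example.org/mirror", "RECOUNT3_HTTP_TIMEOUT": "5"}
)
print(cfg.base_url, cfg.timeout, cfg.cache_disabled)
```

Freezing the dataclass mirrors the "immutable configuration bag" idea: overrides produce a new object instead of mutating shared state.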
- recount3.config.recount3_cache(config: Config | None = None) Path[source]
Return the cache directory used for recount3 downloads.
This helper normalizes and materializes the cache directory based on the provided configuration (or the default configuration when omitted).
- Parameters:
config (Config | None) – Optional configuration. If None, default_config() is used.
- Returns:
Absolute pathlib.Path to the cache directory.
- Return type:
pathlib.Path
- recount3.config.recount3_cache_files(config: Config | None = None, *, pattern: str | None = None) list[Path][source]
List cached files managed by recount3.
- Parameters:
config (Config | None) – Optional configuration. If None, default_config() is used.
pattern (str | None) – Optional glob-style pattern (as accepted by pathlib.Path.rglob()) to filter files relative to the cache root, for example "*.tsv.gz" or "*__SRP123456*". If None, all files are returned.
- Returns:
A sorted list of pathlib.Path objects pointing to cached files. If the cache directory does not exist yet, an empty list is returned.
- Return type:
list[pathlib.Path]
- recount3.config.recount3_cache_rm(*, config: Config | None = None, predicate: Callable[[Path], bool] | None = None, dry_run: bool = False) list[Path][source]
Remove cached files that match a predicate.
This helper is analogous to the R-side recount3_cache_rm(): it walks the cache directory and removes any file for which predicate returns True. Directories are left in place.
- Parameters:
config (Config | None) – Optional configuration. If None, default_config() is used.
predicate (Callable[[Path], bool] | None) – Callable taking a pathlib.Path and returning True if the file should be removed. If None, all cached files are selected.
dry_run (bool) – If True, do not delete any files and only report which paths would be removed.
- Returns:
A sorted list of pathlib.Path objects that were removed (or would be removed when dry_run is True).
- Raises:
OSError – If filesystem operations fail during deletion.
- Return type:
list[pathlib.Path]
Examples
List everything that would be removed, without deleting:
to_delete = recount3_cache_rm(dry_run=True)
Remove all cached files (empty the cache):
recount3_cache_rm()
Remove only files related to the "sra" data source:
recount3_cache_rm(predicate=lambda p: "sra" in str(p))
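For readers who want to see the walk-and-remove behaviour in isolation, here is a small standard-library sketch. sketch_cache_rm is a hypothetical re-implementation written from the description above (files only, directories kept, sorted result, dry-run support); it is not the package's code and operates on a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def sketch_cache_rm(
    root: Path,
    predicate: Optional[Callable[[Path], bool]] = None,
    dry_run: bool = False,
) -> list[Path]:
    """Hypothetical re-implementation of the documented behaviour."""
    if not root.exists():
        return []
    # Only files are candidates; directories are left in place.
    selected = sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and (predicate is None or predicate(p))
    )
    if not dry_run:
        for p in selected:
            p.unlink()
    return selected


def in_sra_dir(p: Path) -> bool:
    return p.parent.name == "sra"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sra").mkdir()
    (root / "sra" / "counts.tsv.gz").write_text("x")
    (root / "gtex.tsv.gz").write_text("y")

    preview = sketch_cache_rm(root, in_sra_dir, dry_run=True)
    still_there = (root / "sra" / "counts.tsv.gz").exists()
    removed = sketch_cache_rm(root, in_sra_dir)
    remaining = [p.name for p in sketch_cache_rm(root, dry_run=True)]
    sra_dir_kept = (root / "sra").is_dir()
```

Running the dry run with the same predicate as the real call is the safest pattern: the preview then lists exactly the files the second call deletes.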
recount3.resource
Resource orchestration: URL resolution, caching, downloading, and loading.
This module defines R3Resource, the central class of the recount3
package. Every downloadable file (count matrices, metadata tables, annotation
GTFs, BigWig coverage files, junction counts) is represented as an
R3Resource. The class manages the full lifecycle of a resource:
Description -> URL: an R3ResourceDescription provides the structured parameters (organism, project, genomic unit, …) that are used to construct the deterministic duffel-mirror URL.
URL -> cache: download() fetches the file over HTTP and stores it in a persistent on-disk cache keyed by URL hash.
Cache -> materialization: the cached file can be hard-linked or copied to a user-supplied directory, or appended to a ZIP archive.
Cache -> in-memory object: load() parses the cached file into an appropriate Python object (see Notes below).
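The first two lifecycle steps (joining a base URL with a relative path, then keying the cache by a hash of the URL) can be illustrated with the standard library. The hash algorithm, the 16-character prefix, the file-name layout, and the relative path below are all assumptions made for the sketch; the package's actual cache layout may differ.

```python
import hashlib
from pathlib import Path
from urllib.parse import urljoin, urlparse


def sketch_cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a deterministic cache file name (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    basename = Path(urlparse(url).path).name
    return cache_dir / f"{digest}__{basename}"


base_url = "https://duffel.rail.bio/recount3/"
relative = "human/data_sources/sra/example.gz"  # placeholder, not a real recount3 path
url = urljoin(base_url, relative)
path = sketch_cache_path(Path("/tmp/recount3_cache"), url)
print(url)
print(path.name)
```

Because the key depends only on the URL, two resources that resolve to the same URL share one cache entry, and repeated runs find the same file.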
Typical usage example:
from recount3 import R3Resource, R3GeneOrExonCounts

desc = R3GeneOrExonCounts(
    organism="human",
    data_source="sra",
    genomic_unit="gene",
    project="SRP009615",
    annotation_extension="G026",
)
res = R3Resource(desc)

# Cache the file (no local copy):
res.download(path=None, cache_mode="enable")

# Or copy into a directory:
dest = res.download(path="/data/recount3")

# Parse into a DataFrame (for count resources):
df = res.load()
Note
Downloads are protected by a module-level threading.Lock, so
multiple threads can safely call download() on
resources sharing a common cache path without corrupting the cache.
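The locking pattern in the note can be sketched as below. Everything here is a stand-in: the dictionary replaces the on-disk cache, fake_fetch replaces the HTTP request, and the names are hypothetical. The point is only that the check-and-fill step runs under one shared lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_download_lock = threading.Lock()  # module-level, shared by all callers
_cache: dict[str, bytes] = {}  # stand-in for the on-disk cache
fetch_calls: list[str] = []  # records every simulated HTTP request


def fake_fetch(url: str) -> bytes:
    """Placeholder for the HTTP request."""
    fetch_calls.append(url)
    return url.encode("utf-8")


def sketch_download(url: str) -> bytes:
    # The existence check and the write happen under the lock, so two
    # threads never populate the same cache entry at the same time.
    with _download_lock:
        if url not in _cache:
            _cache[url] = fake_fetch(url)
        return _cache[url]


with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(sketch_download, ["https://example.org/a"] * 20))
```

A single coarse lock serializes downloads, which trades some parallelism for a cache that cannot be corrupted by concurrent writers.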
Note
load() returns different types depending on the resource type:
Gene/exon count resources -> pandas.DataFrame (features x samples).
Junction MM resources -> scipy.sparse.csr_matrix.
Junction ID/RR resources -> pandas.DataFrame.
BigWig resources -> BigWigFile.
- class recount3.resource.R3Resource(description: R3ResourceDescription, url: str | None = None, filepath: str | None = None, config: Config | None = None)[source]
Resource that manages URL resolution, caching, materialization, and loading.
- Parameters:
description (R3ResourceDescription)
url (str | None)
filepath (str | None)
config (Config | None)
- description
An instance of R3ResourceDescription that specifies the metadata and the hierarchical path used to correctly locate and define the resource.
- url
The full, absolute network URL pointing to the remote resource. If not explicitly provided during initialization, it is derived by joining the configured base URL with the description’s relative URL.
- Type:
str | None
- filepath
An optional string representing the absolute local path where the resource was successfully materialized (either copied or linked).
- Type:
str | None
- config
An optional Config instance controlling network and cache behavior. If omitted, the default configuration is used.
- Type:
recount3.config.Config | None
- download(path: str | None = None, *, cache_mode: Literal['enable', 'disable', 'update'] = 'enable', overwrite: bool = False, chunk_size: int | None = None) str | None[source]
Ensure resource availability and optionally materialize it.
Transitions the remote resource to the local system. Caches the file, writes it to a specific directory, or appends it to a ZIP archive depending on the arguments provided.
- Parameters:
path (str | None) – Target destination. If None, performs a cache-only download. If a directory path, links or copies the file there. If a ".zip" path, adds the file to the archive using arcname.
cache_mode (Literal['enable', 'disable', 'update']) – Caching behavior. "enable" uses the existing cache, "disable" streams directly to path without caching, and "update" forces a cache refresh before materialization.
overwrite (bool) – If True, replaces existing files at the destination.
chunk_size (int | None) – Byte size for streaming operations. Defaults to the configured chunk size.
- Returns:
The final file path if materialized to a directory. None if performing a cache-only download or appending to a ZIP archive.
- Raises:
ValueError – If the argument combination is invalid (e.g., path=None with cache_mode="disable") or path has an unsupported format.
- Return type:
str | None
Examples
Cache the file without copying it anywhere:
res.download(path=None, cache_mode="enable")
Copy the cached file into a local directory:
dest = res.download(path="/data/recount3")
Append the file to a ZIP archive:
res.download(path="/data/recount3.zip")
Force a cache refresh before copying:
dest = res.download(path="/data/recount3", cache_mode="update")
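The argument rules for download() can be summarized as a small decision function. This is an illustrative sketch built from the parameter descriptions above, with hypothetical names; it only classifies a call and performs no I/O, and it does not model whatever the package treats as an "unsupported format".

```python
from typing import Optional

VALID_CACHE_MODES = ("enable", "disable", "update")


def sketch_plan_download(path: Optional[str], cache_mode: str = "enable") -> str:
    """Return which documented branch a call would take (illustrative only)."""
    if cache_mode not in VALID_CACHE_MODES:
        raise ValueError(f"unsupported cache_mode: {cache_mode!r}")
    if path is None:
        if cache_mode == "disable":
            # Nothing would be cached and there is nowhere to write.
            raise ValueError("path=None requires caching to be enabled")
        return "cache-only"
    if path.lower().endswith(".zip"):
        return "append-to-zip"
    return "materialize-in-directory"


plans = [
    sketch_plan_download(None),
    sketch_plan_download("/data/recount3.zip"),
    sketch_plan_download("/data/recount3", cache_mode="update"),
]

try:
    sketch_plan_download(None, cache_mode="disable")
    rejected = False
except ValueError:
    rejected = True
```

Per the Returns section, only the directory branch yields a file path; the cache-only and ZIP branches return None.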
- load(*, force: bool = False) object[source]
Parse the resource into an appropriate in-memory Python object.
Downloads and caches the resource if missing. Uses the resource description to determine the parsing strategy (e.g., BigWig, tabular counts, junctions). Caches the resulting object internally to prevent redundant disk I/O.
- Parameters:
force (bool) – If True, bypasses the in-memory object cache and re-parses data directly from disk.
- Returns:
The parsed object. Tabular and junction counts return a pandas.DataFrame. BigWig files return a recount3._bigwig.BigWigFile instance.
- Raises:
FileNotFoundError – The file is missing from the cache post-download.
LoadError – Parsing fails, matrix shapes mismatch, or the resource type is currently unsupported.
- Return type:
object
Examples
Load a gene/exon count matrix as a DataFrame:
counts_df = gene_count_res.load() # -> pd.DataFrame
Load a BigWig coverage file (close it when done):
bw = bigwig_res.load()  # -> BigWigFile
vals = bw.values("chr1", 0, 1_000_000)
bw.close()
Re-parse from disk, bypassing the in-memory cache:
counts_df = res.load(force=True)
- is_loaded() bool[source]
Check if the resource currently holds a parsed in-memory object.
- Returns:
True if an object is cached in memory, False otherwise.
- Return type:
bool
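The interplay between load(), force, and is_loaded() can be pictured with a small stand-in class. SketchResource is hypothetical and replaces file parsing with a callable; it only demonstrates the in-memory caching contract described above.

```python
from typing import Callable, Optional


class SketchResource:
    """Hypothetical stand-in for the load-caching behaviour of R3Resource."""

    def __init__(self, parser: Callable[[], object]) -> None:
        self._parser = parser  # stands in for parsing the cached file
        self._loaded: Optional[object] = None
        self._has_data = False

    def load(self, *, force: bool = False) -> object:
        # Parse only when nothing is cached yet, or when the caller insists.
        if force or not self._has_data:
            self._loaded = self._parser()
            self._has_data = True
        return self._loaded

    def is_loaded(self) -> bool:
        return self._has_data


parse_calls: list[int] = []


def parse() -> dict:
    parse_calls.append(1)
    return {"rows": 3}


res = SketchResource(parse)
before = res.is_loaded()
first = res.load()
second = res.load()  # served from memory, no second parse
third = res.load(force=True)  # re-parsed, a new object
```

A separate flag tracks the loaded state so that a parser returning None would still count as loaded.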
recount3.bundle
Resource bundles, project discovery, and concatenation helpers.
This module defines R3ResourceBundle, a general-purpose
container for groups of R3Resource objects.
Bundles support lazy loading, filtering by description fields,
project-aware discovery, and high-level helpers for combining recount3
resources into BiocPy objects such as
SummarizedExperiment and
RangedSummarizedExperiment.
When discovery covers exactly one (organism, data_source, project)
triple, the bundle’s organism, data_source, and project
attributes are set accordingly. For multi-project bundles these attributes
are None to avoid misrepresenting the identity.
- Filtering with FieldSpec
filter() accepts a FieldSpec for each description field. Four forms are accepted:
A string: exact match (e.g. genomic_unit="gene").
An iterable of strings: keep if the field is any of the given values.
A callable: called with the field value; truthy return keeps the resource.
None (default): no filtering on that field.
Typical usage example:
from recount3 import R3ResourceBundle

bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
)

# Filter to gene-level count resources and stack into a DataFrame:
counts = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
).stack_count_matrices()

# Filter with a callable predicate:
meta_only = bundle.filter(
    resource_type=lambda t: t == "metadata_files"
)
Note
The to_summarized_experiment() and
to_ranged_summarized_experiment() methods
require the BiocPy package summarizedexperiment, which might be
difficult to install on Windows. Install with:
pip install summarizedexperiment
- class recount3.bundle.R3ResourceBundle(resources: list[R3Resource] = <factory>, organism: str | None = None, data_source: str | None = None, project: str | None = None)[source]
Container for a set of recount3.resource.R3Resource objects.
Bundles act as the primary orchestration primitive in this package. They keep track of a collection of resources and provide helpers for loading, filtering, project-aware workflows, and high-level operations such as stacking matrices or building BiocPy objects.
A bundle may optionally be associated with a single project identity via the organism, data_source, and project attributes. If multiple projects are combined into one bundle, these attributes are left as None.
- Parameters:
resources (list[R3Resource])
organism (str | None)
data_source (str | None)
project (str | None)
- resources
The list of resources contained in the bundle.
- Type:
list[R3Resource]
- organism
Optional organism identifier (for example, "human" or "mouse") when the bundle is project-scoped.
- Type:
str | None
- data_source
Optional data source name (for example, "sra", "gtex", or "tcga") when the bundle is project-scoped.
- Type:
str | None
- project
Optional study or project identifier (for example, "SRP009615") when the bundle is project-scoped.
- Type:
str | None
- classmethod discover(*, organism: str | Iterable[str], data_source: str | Iterable[str], project: str | Iterable[str], genomic_units: tuple[str, ...] = ('gene', 'exon'), annotations: str | Iterable[str] = 'default', junction_exts: tuple[str, ...] = ('MM',), junction_type: str = 'ALL', include_metadata: bool = True, include_bigwig: bool = False, strict: bool = True, deduplicate: bool = True) R3ResourceBundle[source]
Discover resources for one or more projects and return a bundle.
This is a generalized version of recount3.project.R3Project.discover(). It can operate on a single (organism, data_source, project) triple or on the Cartesian product of multiple values for each identifier.
When discovery spans more than one project, the returned bundle will contain resources from all projects and the bundle-level organism, data_source, and project attributes will be left as None to avoid misrepresenting the identity.
- Parameters:
organism (str | Iterable[str]) – Single organism name or iterable of names.
data_source (str | Iterable[str]) – Single data source or iterable of data sources.
project (str | Iterable[str]) – Single project identifier or iterable of identifiers.
genomic_units (tuple[str, ...]) – Gene expression feature levels to include; for example, ("gene", "exon").
annotations (str | Iterable[str]) – Annotation selection; either "default", "all", a single annotation code, or an iterable of codes or labels understood by recount3.search.annotation_ext().
junction_exts (tuple[str, ...]) – Junction file extensions to include; typically ("MM",) for junction counts, with "RR" and/or "ID" added for coordinates or IDs.
junction_type (str) – Junction type selector, such as "ALL".
include_metadata (bool) – Whether to include the 5 project metadata tables in the result.
include_bigwig (bool) – Whether to include per-sample BigWig coverage resources.
strict (bool) – If True, propagate errors for invalid inputs or missing projects. If False, attempts that fail validation are skipped.
deduplicate (bool) – If True, remove duplicated resources across discovered projects.
- Returns:
A new R3ResourceBundle populated with discovered resources. When discovery covers exactly one (organism, data_source, project) triple, the resulting bundle’s organism, data_source, and project attributes are set accordingly.
- Raises:
ValueError – If organism, data_source, or project evaluates to an empty collection after normalization.
recount3.errors.ConfigurationError – If the underlying search logic reports configuration problems.
- Return type:
R3ResourceBundle
Examples
Discover all default resources for a single project:
bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
)
Discover gene counts only across two projects:
bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project=["SRP009615", "SRP012682"],
    genomic_units=("gene",),
)
Include BigWig coverage files alongside counts:
bundle = R3ResourceBundle.discover(
    organism="human",
    data_source="sra",
    project="SRP009615",
    include_bigwig=True,
)
- add(res: R3Resource) None[source]
Add a resource to the bundle.
- Parameters:
res (R3Resource) – The resource to append to resources.
- Return type:
None
- extend(resources_iter: Iterable[R3Resource]) None[source]
Extend the bundle with additional resources.
- Parameters:
resources_iter (Iterable[R3Resource]) – Iterable of resources to add to the bundle.
- Return type:
None
- load(*, strict: bool = True, force: bool = False) R3ResourceBundle[source]
Load all resources and cache their data on each instance.
This method iterates over resources and calls recount3.resource.R3Resource.load() on each one.
- Parameters:
strict (bool)
force (bool)
- Returns:
This R3ResourceBundle instance, to enable chaining.
- Return type:
R3ResourceBundle
- iter_loaded(*, resource_type: str | None = None, autoload: bool = False) Iterator[tuple[R3Resource, Any]][source]
Yield (resource, data) pairs for resources with loaded data.
When autoload is True, resources that have not yet been loaded are passed through R3Resource.load() before yielding.
- Parameters:
resource_type (str | None)
autoload (bool)
- Yields:
Tuples of (resource, loaded_data) for each resource that matches the optional resource_type filter and either already has cached data or can be loaded successfully.
- Return type:
Iterator[tuple[R3Resource, Any]]
- iter_bigwig(*, autoload: bool = True) Iterator[tuple[R3Resource, BigWigFile]][source]
Yield (resource, bigwig) pairs for BigWig resources.
- Parameters:
autoload (bool) – If True, automatically load BigWig resources that have not yet been loaded.
- Yields:
Pairs of recount3.resource.R3Resource and recount3._bigwig.BigWigFile objects.
- Return type:
Iterator[tuple[R3Resource, BigWigFile]]
- get_loaded(*, resource_type: str | None = None, autoload: bool = False) list[Any][source]
Return loaded data objects for resources in the bundle.
- Parameters:
resource_type (str | None)
autoload (bool)
- Returns:
A list of loaded data objects corresponding to resources in the bundle.
- Return type:
list[Any]
- filter(*, resource_type: str | Iterable[str] | Callable[[Any], bool] | None = None, organism: str | Iterable[str] | Callable[[Any], bool] | None = None, data_source: str | Iterable[str] | Callable[[Any], bool] | None = None, genomic_unit: str | Iterable[str] | Callable[[Any], bool] | None = None, project: str | Iterable[str] | Callable[[Any], bool] | None = None, sample: str | Iterable[str] | Callable[[Any], bool] | None = None, table_name: str | Iterable[str] | Callable[[Any], bool] | None = None, junction_type: str | Iterable[str] | Callable[[Any], bool] | None = None, annotation_extension: str | Iterable[str] | Callable[[Any], bool] | None = None, junction_extension: str | Iterable[str] | Callable[[Any], bool] | None = None, predicate: Callable[[R3Resource], bool] | None = None, invert: bool = False) R3ResourceBundle[source]
Return a new bundle containing resources that match criteria.
Each keyword argument corresponds to an attribute on recount3._descriptions.R3ResourceDescription. Values are interpreted using recount3.search.match_spec(), allowing simple values, iterables of values, or callables.
- Parameters:
resource_type (str | Iterable[str] | Callable[[Any], bool] | None) – Resource type filter.
organism (str | Iterable[str] | Callable[[Any], bool] | None) – Organism filter.
data_source (str | Iterable[str] | Callable[[Any], bool] | None) – Data source filter.
genomic_unit (str | Iterable[str] | Callable[[Any], bool] | None) – Genomic unit filter.
project (str | Iterable[str] | Callable[[Any], bool] | None) – Project identifier filter.
sample (str | Iterable[str] | Callable[[Any], bool] | None) – Sample identifier filter.
table_name (str | Iterable[str] | Callable[[Any], bool] | None) – Metadata table name filter.
junction_type (str | Iterable[str] | Callable[[Any], bool] | None) – Junction type filter.
annotation_extension (str | Iterable[str] | Callable[[Any], bool] | None) – Annotation code filter.
junction_extension (str | Iterable[str] | Callable[[Any], bool] | None) – Junction extension filter.
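predicate (Callable[[R3Resource], bool] | None) – Optional callback that receives each resource and returns True if it should be kept.
invert (bool) – If True, invert the selection so that matching resources are excluded (see the last example below).
- Returns:
A new R3ResourceBundle containing only the resources that match all supplied filters and the optional predicate.
- Return type:
R3ResourceBundle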
Examples
Keep only gene-level resources:
gene_bundle = bundle.filter(genomic_unit="gene")
Keep gene or exon resources (iterable form):
ge_bundle = bundle.filter(genomic_unit=["gene", "exon"])
Keep resources whose type contains "count" (callable form):
counts = bundle.filter(
    resource_type=lambda t: "count" in (t or "")
)
Invert a filter to exclude metadata tables:
no_meta = bundle.filter(
    resource_type="metadata_files",
    invert=True,
)
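How several field specs combine, and what invert does to the result, can be sketched without the package. Desc, matches, and sketch_filter are hypothetical names; the sketch assumes that all supplied specs must hold (logical AND) and that invert flips the combined decision, which matches the examples above but is not confirmed against the implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Desc:
    """Hypothetical description with two of the documented fields."""

    resource_type: str
    genomic_unit: Optional[str] = None


def matches(value, spec) -> bool:
    if spec is None:
        return True  # no filtering on this field
    if callable(spec):
        return bool(spec(value))
    if isinstance(spec, str):
        return value == spec  # a bare string is one value, not characters
    return value in tuple(spec)


def sketch_filter(items, *, invert=False, **specs):
    """Keep items whose fields satisfy every spec; invert flips the decision."""
    kept = []
    for item in items:
        ok = all(matches(getattr(item, name), spec) for name, spec in specs.items())
        if ok != invert:
            kept.append(item)
    return kept


items = [
    Desc("count_files_gene_or_exon", "gene"),
    Desc("count_files_gene_or_exon", "exon"),
    Desc("metadata_files"),
]
gene_counts = sketch_filter(
    items, resource_type="count_files_gene_or_exon", genomic_unit="gene"
)
no_meta = sketch_filter(items, resource_type="metadata_files", invert=True)
by_callable = sketch_filter(items, resource_type=lambda t: "count" in (t or ""))
```

Returning a new list instead of mutating the input mirrors the documented behaviour of filter(), which returns a new bundle.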
- only_counts() R3ResourceBundle[source]
Return a bundle restricted to gene/exon or junction count files.
- Returns:
A new R3ResourceBundle containing only resources whose resource_type is "count_files_gene_or_exon" or "count_files_junctions".
- Return type:
R3ResourceBundle
- only_metadata() R3ResourceBundle[source]
Return a bundle restricted to metadata resources.
- Returns:
A new R3ResourceBundle containing only resources whose resource_type is "metadata_files".
- Return type:
R3ResourceBundle
- exclude_metadata() R3ResourceBundle[source]
Return a bundle with metadata resources removed.
- Returns:
A new R3ResourceBundle that excludes resources whose resource_type is "metadata_files".
- Return type:
R3ResourceBundle
- where(predicate: Callable[[R3Resource], bool]) R3ResourceBundle[source]
Predicate-based helper that forwards to filter().
- Parameters:
predicate (Callable[[R3Resource], bool]) – Function that receives each resource and returns True if it should be retained in the result.
- Returns:
A new R3ResourceBundle with only resources for which predicate returned True.
- Return type:
R3ResourceBundle
- counts() R3ResourceBundle[source]
Return a sub-bundle containing only count-file resources.
This is a convenience alias for only_counts() maintained for API continuity with recount3.project.R3Project.
- Return type:
R3ResourceBundle
- metadata() R3ResourceBundle[source]
Return a sub-bundle containing only metadata resources.
This is a convenience alias for only_metadata() maintained for compatibility with recount3.project.R3Project.
- Return type:
R3ResourceBundle
- bigwigs() R3ResourceBundle[source]
Return a sub-bundle containing only BigWig resources.
- Returns:
A new R3ResourceBundle containing only resources whose type is "bigwig_files".
- Return type:
R3ResourceBundle
- samples(*, organism: str | None = None, data_source: str | None = None, project: str | None = None) list[str][source]
Return the list of sample identifiers associated with a project.
The default behavior uses the bundle’s stored project identity, as recorded by discover(). Explicit keyword arguments can be provided to override or define the identity when the bundle was not created by discover().
- Parameters:
organism (str | None)
data_source (str | None)
project (str | None)
- Returns:
A sorted list of sample identifiers for the resolved project.
- Raises:
ValueError – If the project cannot be resolved or validated.
- Return type:
list[str]
- stack_count_matrices(*, join_policy: str = 'inner', axis: int = 1, verify_integrity: bool = False, autoload: bool = True, compat: Literal['family', 'feature'] = 'family') DataFrame[source]
Concatenate count matrices (gene/exon or junction) as DataFrames.
- Parameters:
join_policy (str) – Join policy passed to pandas.concat().
axis (int) – Concatenation axis passed to pandas.concat().
verify_integrity (bool) – If True, raise when labels are not unique along the concatenation axis.
autoload (bool) – If True, automatically load resources prior to concatenation.
compat (Literal['family', 'feature']) – Compatibility mode. "family" enforces that all inputs come from the same high-level family (gene/exon or junction), while "feature" enforces an identical feature space (for example, same genomic unit and junction subtype).
- Returns:
A pandas.DataFrame containing the concatenated count matrices.
- Raises:
recount3.errors.CompatibilityError – If incompatible count resources are mixed in a way that violates compat.
TypeError – If a loaded object is not a pandas.DataFrame.
ValueError – If no applicable resources are present or if no loaded count matrices are found.
- Return type:
DataFrame
Examples
Stack gene counts across all projects in the bundle:
df = bundle.filter(
    resource_type="count_files_gene_or_exon",
    genomic_unit="gene",
).stack_count_matrices()
Require identical feature sets (faster; fails if annotations differ):
df = bundle.filter(
    resource_type="count_files_gene_or_exon"
).stack_count_matrices(compat="feature")
- to_summarized_experiment(*, genomic_unit: str, annotation_extension: str | None = None, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True) summarizedexperiment.SummarizedExperiment[source]
Build a BiocPy SummarizedExperiment from this bundle.
This method stacks compatible count matrices, merges available sample metadata, and constructs a BiocPy SummarizedExperiment using a compatibility-aware constructor that supports multiple versions of the summarizedexperiment package.
- Parameters:
genomic_unit (str) – Genomic unit to summarize, such as "gene", "exon", or "junction".
annotation_extension (Optional[str]) – Optional annotation code for gene or exon summarizations (for example, "G026"). When provided and genomic_unit is gene or exon, only count resources with matching annotation are used.
assay_name (str) – Name assigned to the coverage-sum assay within the SummarizedExperiment (default: "raw_counts").
join_policy (str) – Join policy used when concatenating counts across resources.
- Returns:
A BiocPy SummarizedExperiment instance.
- Raises:
ImportError – If BiocPy packages are not installed.
ValueError – If no counts are found or shapes are inconsistent.
TypeError – If the underlying SummarizedExperiment constructor rejects all compatibility variants.
- Return type:
summarizedexperiment.SummarizedExperiment
- to_ranged_summarized_experiment(*, genomic_unit: str, annotation_extension: str | None = None, prefer_rr_junction_coordinates: bool = True, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False) summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]
Build a BiocPy RangedSummarizedExperiment when possible.
For "gene" and "exon" genomic units, row ranges are derived from a matching GTF(.gz) annotation resource. For "junction", this method prefers an RR table (junction coordinates) when available.
When row ranges cannot be resolved and allow_fallback_to_se is True, a plain SummarizedExperiment is returned instead.
- Parameters:
genomic_unit (str) – One of "gene", "exon", or "junction".
annotation_extension (Optional[str]) – Annotation code for gene/exon assays, if desired.
prefer_rr_junction_coordinates (bool) – If True, prefer RR junction files for coordinate definitions when they are available.
assay_name (str) – Name assigned to the coverage-sum assay within the SummarizedExperiment (default: "raw_counts").
join_policy (str) – Join policy across projects when stacking.
allow_fallback_to_se (bool) – If True, construct a plain SummarizedExperiment when genomic ranges cannot be derived for the requested combination.
- Returns:
A RangedSummarizedExperiment instance, or a plain SummarizedExperiment when allow_fallback_to_se is True and ranges are unavailable.
- Raises:
ImportError – If BiocPy packages are not installed.
ValueError – If counts are missing or ranges cannot be determined and allow_fallback_to_se is False.
TypeError – If the RangedSummarizedExperiment constructor rejects all compatibility variants.
- Return type:
summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment
- download(*, dest: str = '.', overwrite: bool = False, cache: Literal['enable', 'disable', 'update'] = 'enable') None[source]
Download all resources in the bundle to a local destination.
This method is a convenience wrapper around recount3.resource.R3Resource.download() for each contained resource. For more advanced workflows (event logs, streaming to a ZIP archive, and so on) prefer the command-line interface, recount3 download.
- Parameters:
dest (str) – Destination directory or .zip path. When a directory is provided, each resource is materialized as a separate file under that directory. When a path ending in .zip is provided, resources are written into that archive.
overwrite (bool) – If True, allow overwriting existing files in directory mode.
cache (Literal['enable', 'disable', 'update']) – Cache behavior: "enable", "disable", or "update" as defined by recount3.types.CacheMode.
- Raises:
ValueError – Propagated from underlying resource download failures, for example when an unsupported cache mode is selected.
- Return type:
None
recount3.search
Discovery and search helpers for the recount3 data repository.
This module provides two tiers of functions:
- Tier 1: Type-specific search wrappers
These functions accept field values and return a list of R3Resource objects:
search_annotations(): annotation GTF files
search_count_files_gene_or_exon(): gene/exon count matrices
search_count_files_junctions(): junction count files (MM/ID/RR)
search_metadata_files(): per-project metadata tables
search_bigwig_files(): per-sample BigWig coverage files
search_data_sources(): organism-level data-source index
search_data_source_metadata(): data-source metadata listings
search_project_all(): convenience orchestrator that calls all of the above for a given project and returns a combined resource list
- Tier 2: Discovery helpers
These functions download and parse metadata to return structured results:
available_samples(): DataFrame of samples available across data sources
available_projects(): DataFrame of projects with sample counts
project_homes(): DataFrame mapping projects to their data-source home URLs
samples_for_project(): sample IDs for a specific project
annotation_options(): mapping of annotation names to extension codes
annotation_ext(): resolve a name or code to a canonical extension string
- StringOrIterable and the Cartesian product pattern
All Tier 1 functions accept StringOrIterable for each parameter: either a single string or an iterable of strings. When iterables are supplied, the function computes the Cartesian product across all parameters and returns one resource per unique combination. For example, passing project=["SRP009615", "SRP012682"] and genomic_unit=["gene", "exon"] produces four resources.
- Annotation names and extension codes
Gene/exon count resources are keyed by an annotation extension code (e.g. "G026"). Human-readable annotation labels (e.g. "gencode_v26") are resolved to their codes via annotation_ext(). Use annotation_options() to list all available mappings for an organism.
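The Cartesian product pattern can be reproduced with itertools.product. The helper names below are hypothetical and the sketch returns plain dictionaries of field values, not R3Resource objects; it shows only how single strings and iterables are normalized and expanded.

```python
from itertools import product
from typing import Iterable, Union

StringOrIterable = Union[str, Iterable[str]]


def as_tuple(value: StringOrIterable) -> tuple[str, ...]:
    # A bare string is one value, not an iterable of characters.
    return (value,) if isinstance(value, str) else tuple(value)


def sketch_combinations(**fields: StringOrIterable) -> list[dict[str, str]]:
    """Expand every field to a tuple and take the product across fields."""
    names = list(fields)
    axes = [as_tuple(fields[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*axes)]


combos = sketch_combinations(
    organism="human",
    data_source="sra",
    genomic_unit=["gene", "exon"],
    project=["SRP009615", "SRP012682"],
)
print(len(combos))
```

With one organism, one data source, two genomic units, and two projects, the product has 1 x 1 x 2 x 2 = 4 entries, which is the "four resources" case mentioned above.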
Typical usage example:
from recount3 import search_count_files_gene_or_exon, search_project_all
# Single project, gene-level counts:
resources = search_count_files_gene_or_exon(
organism="human",
data_source="sra",
genomic_unit="gene",
project="SRP009615",
)
# All resource types for a project in one call:
all_res = search_project_all(
organism="human",
data_source="sra",
project="SRP009615",
)
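The Cartesian product pattern described in the module overview can be sketched in plain Python. This is an illustration only, not the package's implementation: as_tuple() and expand_combinations() are hypothetical helper names, and the real search functions return R3Resource objects rather than dictionaries.

```python
from itertools import product


def as_tuple(value):
    """Normalize a single string or an iterable of strings to a tuple."""
    if isinstance(value, str):
        return (value,)
    return tuple(value)


def expand_combinations(**fields):
    """Return one dict per combination of the supplied field values."""
    names = list(fields)
    pools = [as_tuple(fields[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*pools)]


combos = expand_combinations(
    organism="human",
    data_source="sra",
    genomic_unit=["gene", "exon"],
    project=["SRP009615", "SRP012682"],
)
print(len(combos))  # 2 genomic units x 2 projects = 4 combinations
```

Treating a bare string as a one-element tuple matters here: iterating over the string directly would split it into characters.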
- recount3.search.match_spec(value: object | None, spec: str | Iterable[str] | Callable[[Any], bool] | None) bool[source]
Return True if value satisfies the selection spec.
- Parameters:
value (object | None) – The candidate value to test.
spec (str | Iterable[str] | Callable[[Any], bool] | None) –
The selection specification. Accepted forms:
None: matches everything.
A callable: called with value; truthy return means match.
A string or iterable of strings: matches if value is among the normalised tuple of strings.
- Returns:
True if value matches spec, False otherwise.
- Return type:
bool
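The three accepted forms of spec can be illustrated with a small stand-in function. This is a sketch of the documented semantics, not the library's code: match_spec_sketch is a hypothetical name, and it assumes plain equality against the normalized tuple (how the real function normalizes value is not stated here).

```python
def match_spec_sketch(value, spec):
    """Mimic the documented matching rules for None, callables, and strings."""
    if spec is None:
        return True  # no constraint: everything matches
    if callable(spec):
        return bool(spec(value))  # truthy return means match
    # A single string becomes a one-element tuple; other iterables are copied.
    allowed = (spec,) if isinstance(spec, str) else tuple(spec)
    return value in allowed


print(match_spec_sketch("gene", None))
print(match_spec_sketch("gene", ["gene", "exon"]))
print(match_spec_sketch("junction", lambda v: v.startswith("j")))
```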
- recount3.search.search_annotations(*, organism: str | Iterable[str], genomic_unit: str | Iterable[str], annotation_extension: str | Iterable[str], strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for annotation GTF files.
Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
genomic_unit (str | Iterable[str]) – One or more genomic units (e.g. "gene", "exon").
annotation_extension (str | Iterable[str]) – One or more annotation extension strings (e.g. "G026").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique (organism, genomic_unit, annotation_extension) combination.
- Return type:
list[R3Resource]
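The strict and deduplicate flags shared by the Tier 1 functions can be pictured with the loop below. It is a hedged sketch under assumptions: collect_resources() and toy_build() are hypothetical, and plain strings stand in for resolved resource URLs.

```python
def collect_resources(candidates, build, strict=True, deduplicate=True):
    """Build one result per candidate, honoring strict and deduplicate."""
    results, seen = [], set()
    for fields in candidates:
        try:
            url = build(**fields)
        except Exception:
            if strict:
                raise  # strict=True: surface the construction error
            continue  # strict=False: skip the invalid combination
        if deduplicate and url in seen:
            continue  # keep only the first entry for each URL
        seen.add(url)
        results.append(url)
    return results


def toy_build(organism, genomic_unit):
    """Hypothetical builder that rejects unknown organisms."""
    if organism not in ("human", "mouse"):
        raise ValueError(f"unsupported organism: {organism}")
    return f"{organism}/{genomic_unit}"


candidates = [
    {"organism": "human", "genomic_unit": "gene"},
    {"organism": "yeast", "genomic_unit": "gene"},
    {"organism": "human", "genomic_unit": "gene"},
]
lenient = collect_resources(candidates, toy_build, strict=False)
print(lenient)
```

With strict=False the invalid "yeast" entry is skipped and the repeated human/gene entry is dropped; with strict=True the same input raises ValueError.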
- recount3.search.search_count_files_gene_or_exon(*, organism: str | Iterable[str], data_source: str | Iterable[str], genomic_unit: str | Iterable[str], project: str | Iterable[str], annotation_extension: str | Iterable[str] = ('G026',), strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for per-project gene or exon count matrices.
Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
genomic_unit (str | Iterable[str]) – One or more genomic units ("gene" or "exon").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
annotation_extension (str | Iterable[str]) – One or more annotation extension strings. Defaults to ("G026",).
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique (organism, data_source, genomic_unit, project, annotation_extension) combination.
- Return type:
list[R3Resource]
Examples
Single project, gene-level counts:
resources = search_count_files_gene_or_exon(
organism="human",
data_source="sra",
genomic_unit="gene",
project="SRP009615",
)
Multiple projects: returns one resource per project (Cartesian product):
resources = search_count_files_gene_or_exon(
organism="human",
data_source="sra",
genomic_unit="gene",
project=["SRP009615", "SRP012682"],
)
- recount3.search.search_count_files_junctions(*, organism: str | Iterable[str], data_source: str | Iterable[str], project: str | Iterable[str], junction_type: str | Iterable[str] = 'ALL', junction_extension: str | Iterable[str] = 'MM', strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for per-project junction count files.
Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination. Junction files are distributed as a triplet of sidecar files sharing a common stem: a MatrixMarket matrix (.MM.gz), a sample-ID table (.ID.gz), and a row-ranges table (.RR.gz). The junction_extension selects which of these three files is the primary resource.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
junction_type (str | Iterable[str]) – One or more junction-type tokens. Defaults to "ALL".
junction_extension (str | Iterable[str]) – One or more file-extension tokens selecting the junction sidecar file to retrieve. Accepted values are "MM" (count matrix), "ID" (sample IDs), and "RR" (row ranges). Defaults to "MM".
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique (organism, data_source, project, junction_type, junction_extension) combination.
- Return type:
list[R3Resource]
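Because the three junction sidecar files share a common stem, the siblings of any one file can be derived by swapping the extension token. The sketch below shows that string manipulation only; sibling_junction_files() is a hypothetical helper and the file name is made up for illustration.

```python
JUNCTION_EXTENSIONS = ("MM", "ID", "RR")


def sibling_junction_files(path):
    """Map each extension token to the sidecar path sharing this stem."""
    for ext in JUNCTION_EXTENSIONS:
        suffix = f".{ext}.gz"
        if path.endswith(suffix):
            stem = path[: -len(suffix)]
            return {e: f"{stem}.{e}.gz" for e in JUNCTION_EXTENSIONS}
    raise ValueError(f"not a junction sidecar file: {path}")


# Hypothetical file name, used only to show the stem handling.
files = sibling_junction_files("example_project.ALL.MM.gz")
print(files["ID"])
print(files["RR"])
```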
- recount3.search.search_metadata_files(*, organism: str | Iterable[str], data_source: str | Iterable[str], table_name: str | Iterable[str], project: str | Iterable[str], strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for per-project metadata tables.
Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
table_name (str | Iterable[str]) – One or more metadata table name suffixes (e.g. "recount_project", "recount_qc").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique (organism, data_source, table_name, project) combination.
- Return type:
list[R3Resource]
- recount3.search.search_bigwig_files(*, organism: str | Iterable[str], data_source: str | Iterable[str], project: str | Iterable[str], sample: str | Iterable[str], strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for per-sample BigWig coverage files.
Constructs the Cartesian product of all provided values and returns one R3Resource per unique combination.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
project (str | Iterable[str]) – One or more project identifiers (e.g. "SRP009615").
sample (str | Iterable[str]) – One or more sample identifiers (e.g. a rail_id or SRR accession).
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique (organism, data_source, project, sample) combination.
- Return type:
list[R3Resource]
- recount3.search.search_data_sources(*, organism: str | Iterable[str], strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for the organism-level data-source index.
Each resource resolves to the homes_index file for one organism, which lists the available data sources for that organism.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique organism.
- Return type:
list[R3Resource]
- recount3.search.search_data_source_metadata(*, organism: str | Iterable[str], data_source: str | Iterable[str], strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Return R3Resource objects for data-source-level metadata listings.
Each resource resolves to the recount_project metadata file for one (organism, data_source) pair, which enumerates all projects within that data source.
- Parameters:
organism (str | Iterable[str]) – One or more organism names (e.g. "human", "mouse").
data_source (str | Iterable[str]) – One or more data-source identifiers (e.g. "sra", "gtex").
strict (bool) – If True (default), re-raise any exception encountered while constructing a resource. If False, silently skip invalid combinations.
deduplicate (bool) – If True (default), discard resources whose resolved URL duplicates an earlier entry in the result list.
- Returns:
A list of R3Resource objects, one per unique (organism, data_source) combination.
- Return type:
list[R3Resource]
- recount3.search.create_sample_project_lists(organism: str = '') tuple[list[str], list[str]][source]
Return (samples, projects) discovered from metadata tables.
This is a compatibility wrapper around available_samples() and available_projects(). It preserves the original ids CLI behavior (simple ID lists) but now benefits from the richer and more robust metadata parsing.
- recount3.search.available_samples(*, organism: str = 'human', data_sources: str | Iterable[str] | None = None, strict: bool = True) DataFrame[source]
Return a sample overview similar to recount3::available_samples().
This reads per-data-source *.recount_project.MD.gz tables and returns a normalized DataFrame describing all samples for the requested organism.
- Parameters:
organism (str) – Organism to query (“human” or “mouse”).
data_sources (str | Iterable[str] | None) – Optional subset of data sources to include (for example, “sra”, “gtex”, “tcga”). By default all known data sources are used.
strict (bool) – If True, raise a ValueError when no metadata can be found. If False, return an empty DataFrame instead.
- Returns:
A DataFrame with at least the following columns when available:
external_id: Sample identifier in the original source.
project: Project or study identifier.
organism: Canonical organism label (“human” or “mouse”).
file_source: Origin of the raw data (basename only).
date_processed: Processing date in YYYY-MM-DD format.
project_home: recount3 project home path.
project_type: High-level project type (for example, “data_sources” or “collections”).
Additional columns present in the raw metadata are preserved.
- Return type:
DataFrame
- Raises:
ValueError – If inputs are invalid or no metadata resources are found and strict is True.
RuntimeError – If metadata resources are found but all fail to load.
- recount3.search.available_projects(*, organism: str = 'human', data_sources: str | Iterable[str] | None = None, strict: bool = True) DataFrame[source]
Return a project overview like recount3::available_projects().
This aggregates the sample-level metadata from available_samples() and summarizes it at the project level.
- Parameters:
organism (str)
data_sources (str | Iterable[str] | None)
strict (bool)
- Returns:
A DataFrame with one row per project and at least the following columns:
project: Project or study identifier.
organism: Canonical organism label.
file_source: Origin of the raw data (basename only).
project_home: recount3 project home path.
project_type: High-level project type (for example, “data_sources” or “collections”).
n_samples: Number of samples in the project.
Additional project-level columns are preserved.
- Return type:
DataFrame
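The sample-to-project aggregation can be pictured with a few lines of standard-library Python. This is only a sketch: the real function works on DataFrames and keeps more columns, summarize_projects() is a hypothetical name, and the sample identifiers below are invented.

```python
from collections import Counter

# Invented sample-level rows with the two columns the summary needs.
sample_rows = [
    {"external_id": "sample_a", "project": "SRP009615"},
    {"external_id": "sample_b", "project": "SRP009615"},
    {"external_id": "sample_c", "project": "SRP012682"},
]


def summarize_projects(rows):
    """Collapse sample rows to one row per project with a sample count."""
    counts = Counter(row["project"] for row in rows)
    return [
        {"project": project, "n_samples": n}
        for project, n in sorted(counts.items())
    ]


summary = summarize_projects(sample_rows)
print(summary)
```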
- recount3.search.project_homes(*, organism: str = 'human', data_sources: str | Iterable[str] | None = None, strict: bool = True) DataFrame[source]
Return a project home summary similar to recount3::project_homes().
This is a thin layer on top of available_projects() that collapses projects down to unique project_home paths.
- Parameters:
organism (str)
data_sources (str | Iterable[str] | None)
strict (bool)
- Returns:
A DataFrame with one row per project home and at least the following columns:
project_home: recount3 project home path.
project_type: High-level project type (for example, “data_sources” or “collections”).
organism: Canonical organism label.
file_source: Data source label when available.
n_projects: Number of projects using this home.
Additional columns from available_projects() may appear.
- Return type:
DataFrame
- recount3.search.annotation_options(organism: str) dict[str, str][source]
Return annotation options for a given organism.
- Parameters:
organism (str) – Organism name (“human” or “mouse”), case-insensitive.
- Returns:
A new dict mapping canonical annotation names (for example, “gencode_v26”) to recount3 annotation file extensions (for example, “G026”).
- Raises:
ValueError – If the organism is not recognized.
- Return type:
dict[str, str]
- recount3.search.annotation_ext(organism: str, annotation: str) str[source]
Return the recount3 annotation extension for a given annotation.
This helper is analogous to the R recount3::annotation_ext() function. It accepts either a canonical annotation name (for example, “gencode_v26”) or a raw extension code (for example, “G026”).
- Parameters:
organism (str)
annotation (str)
- Returns:
The recount3 annotation file extension (for example, “G026”).
- Raises:
ValueError – If the organism or annotation is not recognized.
- Return type:
str
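Accepting either a name or a raw code amounts to a two-step lookup. The sketch below assumes exact, case-sensitive matching and uses an illustrative two-entry subset of the human mapping; resolve_annotation() and HUMAN_ANNOTATIONS are hypothetical names, and annotation_options() remains the authoritative source for the full mapping.

```python
# Illustrative subset only; call annotation_options("human") for the real mapping.
HUMAN_ANNOTATIONS = {"gencode_v26": "G026", "gencode_v29": "G029"}


def resolve_annotation(annotation, options=HUMAN_ANNOTATIONS):
    """Return the extension code for a canonical name or a raw code."""
    if annotation in options:
        return options[annotation]  # canonical name -> code
    if annotation in options.values():
        return annotation  # already an extension code
    raise ValueError(f"unrecognized annotation: {annotation!r}")


print(resolve_annotation("gencode_v26"))
print(resolve_annotation("G026"))
```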
- recount3.search.samples_for_project(*, organism: str, data_source: str, project: str) list[str][source]
Return sample identifiers for a given project.
This helper reads the data-source-level metadata table and extracts sample IDs for the requested project by consulting common column names: “sample”, “sample_id”, “run”, “run_accession”, and “external_id”.
The logic follows the recommendation in the R documentation: consult project-level metadata first, and draw sample IDs from data-source metadata when assembling per-sample coverage files.
- Parameters:
organism (str)
data_source (str)
project (str)
- Returns:
Sorted list of unique sample identifiers.
- Raises:
ValueError – If the project cannot be validated against the metadata.
- Return type:
list[str]
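The column-probing step can be sketched on plain dictionaries. Two assumptions are made here that the text above does not confirm: the first non-empty candidate column wins for each row, and rows carry a "project" column. sample_ids_for_project() is a hypothetical name and the row values are invented.

```python
CANDIDATE_COLUMNS = ("sample", "sample_id", "run", "run_accession", "external_id")


def sample_ids_for_project(rows, project):
    """Return sorted unique sample IDs for one project."""
    matching = [row for row in rows if row.get("project") == project]
    if not matching:
        raise ValueError(f"project not found in metadata: {project}")
    ids = set()
    for row in matching:
        for column in CANDIDATE_COLUMNS:
            value = row.get(column)
            if value:  # assumption: first non-empty candidate column wins
                ids.add(str(value))
                break
    return sorted(ids)


rows = [
    {"project": "P1", "external_id": "B"},
    {"project": "P1", "run": "A"},
    {"project": "P2", "external_id": "C"},
    {"project": "P1", "external_id": "B"},
]
ids = sample_ids_for_project(rows, "P1")
print(ids)
```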
- recount3.search.search_project_all(*, organism: str, data_source: str, project: str, genomic_units: Iterable[str] = ('gene', 'exon'), annotations: str | Iterable[str] = 'default', junction_type: str = 'ALL', junction_extension: Iterable[str] = ('MM',), include_metadata: bool = True, include_bigwig: bool = False, strict: bool = True, deduplicate: bool = True) list[R3Resource][source]
Enumerate all files for a project (counts, junctions, metadata, BigWig).
This function composes the existing search helpers to implement a one-shot, project-scoped discovery routine. It adheres to recount3’s raw file layout documented at: https://rna.recount.bio/docs/raw-files.html (Sections 6.2-6.4).
- Parameters:
organism (str) – “human” or “mouse”.
data_source (str) – “sra”, “gtex”, or “tcga”.
project (str) – Study identifier (for example, “SRP009615”).
genomic_units (Iterable[str]) – Which expression levels to include. Defaults to both.
annotations (str | Iterable[str]) – “default”, “all”, comma-separated string, or an iterable of annotation file extensions (for example, (“G026”, “G029”)).
junction_type (str) – Junction type; typically “ALL”.
junction_extension (Iterable[str]) – Iterable of junction artifacts to include: “MM” (counts), “RR” (coordinates), “ID” (sample IDs).
include_metadata (bool) – Whether to include the five metadata tables.
include_bigwig (bool) – Whether to include per-sample BigWig coverage files.
strict (bool) – If True, raise on invalid parameters; else skip broken items.
deduplicate (bool) – If True, drop duplicates across resource families.
- Returns:
A list of R3Resource objects covering the requested bundle.
- Raises:
ValueError – If validation fails (for example, missing project).
- Return type:
list[R3Resource]
recount3.se
High-level builders and utilities for SummarizedExperiment objects.
This module provides helpers for constructing BiocPy
SummarizedExperiment and
RangedSummarizedExperiment objects from
recount3 projects, plus utilities for working with SRA-style sample
attributes and per-sample scaling.
The heavy lifting is implemented as methods on R3ResourceBundle, including to_summarized_experiment() and to_ranged_summarized_experiment().
The functions in this module are thin wrappers around those methods (for example, create_rse()) and convenience utilities that operate on BiocPy objects directly:
expand_sra_attributes(): parse SRA key;;value|... attribute strings into separate columns on a metadata DataFrame or SE/RSE object.
compute_read_counts(): convert coverage-sum counts to approximate read counts using average mapped read length.
compute_tpm(): compute Transcripts Per Million from an RSE with genomic ranges (uses feature widths from row_ranges).
compute_scale_factors(): compute per-sample AUC- or mapped-reads-based scale factors.
transform_counts(): apply scale factors to a count matrix.
Typical usage example:
from recount3 import create_rse
from recount3.se import compute_scale_factors, transform_counts
rse = create_rse(
project="SRP009615",
organism="human",
annotation_label="gencode_v26",
)
sf = compute_scale_factors(rse)
scaled = transform_counts(rse, scale_factors=sf)
Note
Most functions in this module require BiocPy packages (biocframe, summarizedexperiment, genomicranges).
Install them with:
pip install biocframe summarizedexperiment genomicranges
- recount3.se.build_summarized_experiment(bundle: R3ResourceBundle, *, genomic_unit: str, annotation_extension: str | None = None, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True) summarizedexperiment.SummarizedExperiment[source]
Create a SummarizedExperiment from a resource bundle.
This is a convenience wrapper around recount3.bundle.R3ResourceBundle.to_summarized_experiment().
- Parameters:
bundle (R3ResourceBundle) – Resource bundle containing counts and metadata.
genomic_unit (str) – One of "gene", "exon", or "junction".
annotation_extension (str | None) – Optional annotation code for gene/exon assays (for example, "G026").
assay_name (str) – Name for the count assay in the SE.
join_policy (str) – Join policy across projects (pandas concatenation join).
- Returns:
A summarizedexperiment.SummarizedExperiment instance.
- Return type:
summarizedexperiment.SummarizedExperiment
- recount3.se.build_ranged_summarized_experiment(bundle: R3ResourceBundle, *, genomic_unit: str, annotation_extension: str | None = None, prefer_rr_junction_coordinates: bool = True, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False) summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]
Create a RangedSummarizedExperiment when ranges can be resolved.
This is a convenience wrapper around recount3.bundle.R3ResourceBundle.to_ranged_summarized_experiment().
- Parameters:
bundle (R3ResourceBundle) – Resources for counts, metadata, and (optionally) annotations.
genomic_unit (str) – One of "gene", "exon", or "junction".
annotation_extension (str | None) – Annotation code for gene/exon, if desired.
prefer_rr_junction_coordinates (bool) – Prefer RR for junction coordinates when present.
assay_name (str) – Name for the count assay in the output.
join_policy (str) – Join policy across projects when stacking.
allow_fallback_to_se (bool) – If True, return a plain SE when ranges are unavailable.
- Returns:
A summarizedexperiment.RangedSummarizedExperiment object, or a plain summarizedexperiment.SummarizedExperiment when allow_fallback_to_se is True and ranges cannot be resolved.
- Return type:
summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment
- recount3.se.create_ranged_summarized_experiment(*, project: str, genomic_unit: str = 'gene', organism: str = 'human', data_source: str = 'sra', annotation_label: str | None = None, annotation_extension: str | None = None, junction_type: str = 'ALL', junction_extensions: Sequence[str] | None = None, include_metadata: bool = True, include_bigwig: bool = False, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False, strict: bool = True) summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]
High-level helper that mirrors recount3’s create_rse() in R.
This function hides the intermediate bundle construction step by:
Discovering all resources for a project via recount3.bundle.R3ResourceBundle.discover().
Stacking expression matrices across projects and samples.
Resolving genomic ranges (GTF for gene/exon; RR for junctions).
Building a BiocPy RangedSummarizedExperiment (or SE fallback) using the bundle methods.
- Parameters:
project (str) – Study or project identifier (for example, "SRP009615").
genomic_unit (str) – One of {"gene", "exon", "junction"}.
organism (str) – Organism identifier ("human" or "mouse").
data_source (str) – Data source ("sra", "gtex", or "tcga").
annotation_label (str | None) – Optional human-readable annotation label for gene/exon assays, such as "gencode_v26" or "gencode_v29". Ignored for junction-level assays.
annotation_extension (str | None) – Optional explicit annotation extension (for example, "G026"). When provided, this takes precedence over annotation_label. Ignored for junction-level assays.
junction_type (str) – Junction type; typically "ALL".
junction_extensions (Sequence[str] | None) – Iterable of junction artifact extensions to include (for example, ("MM",) or ("MM", "RR")). If None, the default ("MM",) is used.
include_metadata (bool) – Whether to include the five project metadata tables in the underlying bundle (recommended).
include_bigwig (bool) – Whether to include per-sample BigWig coverage resources in the bundle. These can be large.
join_policy (str) – Join policy across projects when stacking count matrices (passed to build_ranged_summarized_experiment()).
allow_fallback_to_se (bool) – If True, construct a plain SE when genomic ranges cannot be derived for the requested combination.
strict (bool) – If True, propagate validation errors from the search layer (for example, missing projects or incompatible combinations).
assay_name (str)
- Returns:
A summarizedexperiment.RangedSummarizedExperiment object, or a plain summarizedexperiment.SummarizedExperiment when allow_fallback_to_se is True and ranges cannot be resolved.
- Raises:
ImportError – If BiocPy packages are not installed.
ValueError – If inputs are invalid or no counts are found.
TypeError – If the underlying RangedSummarizedExperiment constructor rejects all compatibility variants.
- Return type:
summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment
Examples
Build an RSE for a human SRA gene-count project (GENCODE 26):
rse = create_ranged_summarized_experiment(
project="SRP009615",
organism="human",
annotation_label="gencode_v26",
)
Use the raw annotation extension instead of a label:
rse = create_ranged_summarized_experiment(
project="SRP009615",
organism="human",
annotation_extension="G026",
)
Build a junction-level RSE:
rse = create_ranged_summarized_experiment(
project="SRP009615",
organism="human",
genomic_unit="junction",
)
- recount3.se.create_rse(*, project: str, genomic_unit: str = 'gene', organism: str = 'human', data_source: str = 'sra', annotation_label: str | None = None, annotation_extension: str | None = None, junction_type: str = 'ALL', junction_extensions: Sequence[str] | None = None, include_metadata: bool = True, include_bigwig: bool = False, assay_name: str = 'raw_counts', join_policy: str = 'inner', autoload: bool = True, allow_fallback_to_se: bool = False, strict: bool = True) summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment[source]
Alias for create_ranged_summarized_experiment().
The parameters and behavior are identical; see that function for full documentation.
Examples
Build an RSE for a human SRA gene-count project:
rse = create_rse(
project="SRP009615",
organism="human",
annotation_label="gencode_v26",
)
Use the raw extension string instead of a label:
rse = create_rse(
project="SRP009615",
organism="human",
annotation_extension="G026",
)
- Parameters:
project (str)
genomic_unit (str)
organism (str)
data_source (str)
annotation_label (str | None)
annotation_extension (str | None)
junction_type (str)
junction_extensions (Sequence[str] | None)
include_metadata (bool)
include_bigwig (bool)
assay_name (str)
join_policy (str)
autoload (bool)
allow_fallback_to_se (bool)
strict (bool)
- Return type:
summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment
- recount3.se.expand_sra_attributes(experiment_or_coldata: summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame, *, sra_attributes_column: str = 'sra.sample_attributes', attribute_column_prefix: str = 'sra_attribute.') summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame[source]
Expand encoded SRA sample attributes into separate columns.
This function mirrors the recount3 R helper expand_sra_attributes().
It understands the SRA encoding used by recount3, where a single column (typically 'sra.sample_attributes') contains strings of the form:
"age;;67.78|biomaterial_provider;;LIBD|disease;;Control|..."
Each key;;value pair becomes a new column named {attribute_column_prefix}{key} (with spaces in key replaced by '_'), and the parsed values are stored per sample. The original string column is preserved.
The function supports two calling styles:
Passing a pandas.DataFrame of column metadata, in which case a new DataFrame is returned with extra columns.
Passing a BiocPy summarizedexperiment.SummarizedExperiment or summarizedexperiment.RangedSummarizedExperiment, in which case a new object of the same class is returned with updated column_data.
- Parameters:
experiment_or_coldata (summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame) – Column metadata DataFrame or a BiocPy SE/RSE object.
sra_attributes_column (str) – Name of the column that holds the raw SRA attribute strings.
attribute_column_prefix (str) – Prefix to prepend to the generated attribute column names.
- Returns:
A new object of the same type as experiment_or_coldata with additional columns corresponding to parsed SRA attributes. If the requested column is missing, experiment_or_coldata is returned unchanged.
- Raises:
ImportError – If a SummarizedExperiment/RangedSummarizedExperiment is supplied but BiocPy packages are not installed.
AttributeError – If the BiocPy object does not expose column_data or set_column_data in the expected API.
TypeError – If experiment_or_coldata is neither a pandas DataFrame nor a supported BiocPy experiment object.
- Return type:
summarizedexperiment.RangedSummarizedExperiment | summarizedexperiment.SummarizedExperiment | pd.DataFrame
Examples
Expand attributes on a metadata DataFrame:
expanded_df = expand_sra_attributes(col_data_df)
Expand attributes directly on an RSE (returns a new RSE):
rse2 = expand_sra_attributes(rse)
Inspect the new attribute columns:
attr_cols = [c for c in rse2.column_data.column_names if c.startswith("sra_attribute.")]
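The key;;value|... encoding handled by expand_sra_attributes() can be parsed for a single sample as shown below. This is a standalone sketch, not the library's parser: parse_sra_attributes() is a hypothetical name, and silently skipping pairs without the ";;" separator is an assumption.

```python
def parse_sra_attributes(encoded, prefix="sra_attribute."):
    """Split one encoded attribute string into prefixed column names."""
    parsed = {}
    if not encoded:
        return parsed
    for pair in encoded.split("|"):
        key, separator, value = pair.partition(";;")
        if not separator:
            continue  # assumption: ignore fragments without ';;'
        parsed[prefix + key.replace(" ", "_")] = value
    return parsed


row = parse_sra_attributes(
    "age;;67.78|biomaterial_provider;;LIBD|disease;;Control"
)
print(row)
```

Values stay as strings in this sketch; converting columns such as age to numbers is left to the caller.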
- recount3.se.compute_read_counts(rse: Any, round_to_integers: bool = True, avg_mapped_read_length_column: str = 'recount_qc.star.average_mapped_length') DataFrame[source]
Convert coverage-sum counts into approximate read/fragment counts.
The “raw_counts” assay used by recount3-style resources represents summed per-base coverage over each feature (for example, total coverage across all bases in a gene), not the number of reads/fragments overlapping the feature. A common approximation for converting coverage-sum values into read/fragment counts is to divide by the sample’s average mapped read length.
For each feature i and sample j:
read_counts[i, j] = raw_counts[i, j] / avg_mapped_read_length_column[j]
This operation is performed column-wise: each sample column in the assay is divided by a single scalar derived from that sample’s metadata.
Rounding is optional. When enabled, values are rounded to 0 decimals to produce integer-like counts, which is convenient for downstream methods that assume counts are integer-valued.
- Parameters:
rse (Any) – A RangedSummarizedExperiment-like object containing a “raw_counts” assay and sample metadata in col_data.
round_to_integers (bool) – If True, round the resulting values to 0 decimals.
avg_mapped_read_length_column (str) – Name of the metadata column containing average mapped read length per sample. The column must be present in col_data and must contain numeric values.
- Returns:
A pandas DataFrame of approximate read counts with the same shape as the “raw_counts” assay (features x samples). Row and column names are preserved when available.
- Raises:
ValueError – If rse is not a RangedSummarizedExperiment, if the “raw_counts” assay is missing, if avg_mapped_read_length_column is missing from col_data, or if the assay and metadata dimensions do not align.
TypeError – If round_to_integers is not a bool.
- Return type:
DataFrame
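The column-wise division described for compute_read_counts() can be checked by hand on a tiny matrix. The sketch below uses nested lists instead of an RSE, made-up numbers, and a hypothetical function name; Python's built-in round() stands in for whatever rounding the library applies.

```python
def coverage_to_read_counts(raw_counts, avg_mapped_lengths, round_to_integers=True):
    """Divide each sample column by that sample's average mapped read length."""
    converted = []
    for feature_row in raw_counts:
        values = [
            coverage / length
            for coverage, length in zip(feature_row, avg_mapped_lengths)
        ]
        if round_to_integers:
            values = [round(v) for v in values]
        converted.append(values)
    return converted


# Rows are features, columns are samples (made-up coverage sums).
raw_counts = [[1000.0, 250.0], [5060.0, 0.0]]
avg_mapped_lengths = [100.0, 50.0]
read_counts = coverage_to_read_counts(raw_counts, avg_mapped_lengths)
print(read_counts)  # [[10, 5], [51, 0]]
```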
- recount3.se.compute_tpm(rse: summarizedexperiment.RangedSummarizedExperiment) pd.DataFrame[source]
Compute Transcripts Per Million (TPM) from raw coverage sums.
- TPM is calculated as:
Approximate Read Counts = Coverage AUC / Avg Read Length
RPK = Read Counts / (Feature Length / 1000)
Scale Factor = Sum(RPK) / 1,000,000
TPM = RPK / Scale Factor
- Parameters:
rse (summarizedexperiment.RangedSummarizedExperiment) – A RangedSummarizedExperiment object containing raw coverage sums. Must have feature widths defined in row_ranges.
- Returns:
A pandas DataFrame of TPM values.
- Raises:
TypeError – If rse is not a RangedSummarizedExperiment (needs row_ranges).
ValueError – If feature widths or read lengths are missing.
- Return type:
pd.DataFrame
Examples
Compute TPM from an RSE built with create_rse():
rse = create_rse(project="SRP009615", organism="human", annotation_label="gencode_v26")
tpm_df = compute_tpm(rse)
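The four TPM steps listed for compute_tpm() can be followed for one sample with plain lists. The numbers are invented, tpm_for_sample() is a hypothetical name, and a single average read length per sample is assumed.

```python
def tpm_for_sample(coverage_sums, feature_widths, avg_read_length):
    """Apply the documented TPM steps to one sample's coverage sums."""
    # Step 1: approximate read counts from coverage sums.
    read_counts = [c / avg_read_length for c in coverage_sums]
    # Step 2: reads per kilobase of feature length.
    rpk = [r / (w / 1000) for r, w in zip(read_counts, feature_widths)]
    # Step 3: per-sample scale factor.
    scale_factor = sum(rpk) / 1_000_000
    # Step 4: TPM.
    return [value / scale_factor for value in rpk]


# Two features of width 1000 bp and 2000 bp; average read length 100 bp.
tpm = tpm_for_sample([2000.0, 6000.0], [1000, 2000], 100.0)
print(tpm)  # roughly [400000, 600000]; the values sum to one million
```

Here the read counts are 20 and 60, the RPK values are 20 and 30, and dividing by their sum over one million gives the 40/60 split.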
- recount3.se.is_paired_end(sample_metadata_source: Any, avg_mapped_read_length_column: str = 'recount_qc.star.average_mapped_length', avg_read_length_column: str = 'recount_seq_qc.avg_len') Series[source]
Infer paired-end status, matching recount3::is_paired_end().
- In recount3 (R), paired-end status is inferred via:
ratio <- round(avg_mapped_read_length / avg_read_length, 0)
ratio must be 1 (single-end) or 2 (paired-end), otherwise NA with warning.
result <- ratio == 2, with names(result) = external_id.
- Parameters:
sample_metadata_source (Any)
avg_mapped_read_length_column (str)
avg_read_length_column (str)
- Returns:
A pandas Series of dtype “boolean” (True/False/pd.NA), indexed by external_id.
- Raises:
ValueError – If required metadata columns are missing or non-numeric.
- Return type:
Series
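The ratio rule can be tried on individual values. This sketch handles one sample at a time and returns None where the library returns pd.NA; infer_paired_end() is a hypothetical name and the lengths are invented.

```python
def infer_paired_end(avg_mapped_read_length, avg_read_length):
    """Return True (paired), False (single), or None (cannot be inferred)."""
    ratio = round(avg_mapped_read_length / avg_read_length)
    if ratio == 2:
        return True
    if ratio == 1:
        return False
    return None  # any other ratio is treated as unknown


print(infer_paired_end(198.0, 100.0))  # ratio rounds to 2
print(infer_paired_end(97.5, 100.0))   # ratio rounds to 1
print(infer_paired_end(40.0, 100.0))   # ratio rounds to 0
```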
- recount3.se.compute_scale_factors(sample_metadata_source: Any, by: str = 'auc', target_read_count: float = 40000000.0, target_read_length_bp: float = 100, auc_column: str = 'recount_qc.bc_auc.all_reads_all_bases', avg_mapped_read_length_column: str = 'recount_qc.star.average_mapped_length', mapped_reads_column: str = 'recount_qc.star.all_mapped_reads', paired_end_status: Sequence[bool] | Series | None = None) Series[source]
Compute per-sample scaling factors for coverage-sum counts.
This function produces one scalar scale factor per sample. The intended use is to multiply each sample column of a coverage-sum count matrix by the corresponding factor to make samples comparable.
Let C[i, j] be the coverage-sum count for feature i in sample j. If s[j] is the scale factor for sample j, scaled counts are computed as:
scaled[i, j] = C[i, j] * s[j]
Scale factors are derived from sample metadata. Samples are identified by the external_id column, and the returned Series is indexed by external_id.
Two scaling methods are supported:
by="auc": Uses a per-sample total coverage metric (AUC) to scale each sample to a common target_read_count:
s[j] = target_read_count / auc[j]
This method preserves relative feature coverage within each sample while adjusting overall sample magnitude to be comparable across samples.
by="mapped_reads": Uses mapped read counts and read length to normalize samples to a common target_read_count and a common target read length target_read_length_bp:
s[j] = target_read_count * target_read_length_bp * paired_multiplier[j] / (mapped_reads[j] * (avg_mapped_read_length[j] ** 2))
- paired_multiplier is:
2 for paired-end samples
1 for single-end samples
missing for samples whose paired-end status cannot be inferred
If paired_end_status is not provided, paired-end status is inferred from metadata by comparing average mapped length to average read length:
ratio = round(avg_mapped_read_length / avg_read_length)
ratio==2 indicates paired-end, ratio==1 indicates single-end. Other ratios are treated as unknown and produce missing paired multipliers.
Missing values in required metadata propagate to missing scale factors. Non-numeric metadata values raise an error.
- Parameters:
sample_metadata_source (Any) – Sample metadata as a pandas DataFrame, or an object with col_data.to_pandas() that yields sample metadata.
by (str) – Scaling method: “auc” or “mapped_reads”.
target_read_count (float) – Target library size used to compute scale factors. Interpreted as the number of single-end reads to scale each sample to.
target_read_length_bp (float) – Target read length used only when by="mapped_reads".
auc_column (str) – Metadata column name for the per-sample AUC metric.
avg_mapped_read_length_column (str) – Metadata column name for average mapped read length per sample.
mapped_reads_column (str) – Metadata column name for mapped read counts per sample.
paired_end_status (Sequence[bool] | Series | None) – Optional paired-end indicator per sample. If provided, it must align with the samples in external_id. If omitted, paired-end status is inferred from metadata.
- Returns:
A pandas Series of scale factors indexed by external_id. The Series name is “scale_factor”.
- Raises:
ValueError – If by is invalid, required metadata columns are missing, or non-numeric metadata values are present.
TypeError – If target_read_count or target_read_length_bp are not numeric scalars.
- Return type:
Series
Examples
AUC-based scaling (default):
sf = compute_scale_factors(rse)
scaled = transform_counts(rse, scale_factors=sf)
Mapped-reads-based scaling:
sf = compute_scale_factors(rse, by="mapped_reads")
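The two formulas can be checked by hand with a small pandas sketch. The helper names below are hypothetical and the inputs are made-up numbers; the sketch only mirrors the arithmetic stated in this section and assumes every Series shares the same external_id index.

```python
import pandas as pd


def scale_factors_auc(auc: pd.Series, target_read_count: float = 40e6) -> pd.Series:
    """s[j] = target_read_count / auc[j] (hypothetical helper)."""
    return (target_read_count / auc).rename("scale_factor")


def scale_factors_mapped_reads(
    mapped_reads: pd.Series,
    avg_mapped_read_length: pd.Series,
    paired: pd.Series,
    target_read_count: float = 40e6,
    target_read_length_bp: float = 100.0,
) -> pd.Series:
    """Mapped-reads formula from the text (hypothetical helper)."""
    # True -> 2.0, False -> 1.0, missing stays missing.
    multiplier = paired.astype("boolean").astype("Float64") + 1.0
    numerator = target_read_count * target_read_length_bp * multiplier
    denominator = mapped_reads * avg_mapped_read_length ** 2
    return (numerator / denominator).rename("scale_factor")
```

A sample with unknown paired-end status gets a missing multiplier, so its scale factor is missing as well, matching the propagation rule described above.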
- recount3.se.transform_counts(rse: Any, by: str = 'auc', target_read_count: float = 40000000.0, target_read_length_bp: float = 100, round_to_integers: bool = True, **kwargs: Any) DataFrame[source]
Scale coverage-sum counts to a common library size.
recount3 “raw_counts” represent summed per-base coverage over each feature, not read/fragment counts. This function converts those coverage-sum values into scaled counts that are comparable across samples by multiplying each sample column by a sample-specific scale factor.
Scaling is applied independently per sample (per column). For feature i and sample j, the returned matrix contains:
scaled[i, j] = raw_counts[i, j] * scale_factor[j]
The scale factors are computed from sample metadata (col_data) using one of two methods:
- by="auc":
scale_factor[j] = target_read_count / auc[j]
where auc is a per-sample total coverage metric. This method scales each sample to have total coverage approximately equal to target_read_count.
- by="mapped_reads":
scale_factor[j] = target_read_count * target_read_length_bp * paired_multiplier[j] / (mapped_reads[j] * (avg_mapped_read_length[j] ** 2))
where paired_multiplier is 2 for paired-end samples, 1 for single-end samples, and missing when paired-end status cannot be inferred. This method incorporates mapped reads and read length so that samples with different read lengths are normalized onto the same target read length target_read_length_bp.
The returned values remain in the same feature-by-sample shape as the input. If round_to_integers is True, values are rounded to integer-like counts.
- Parameters:
rse (Any) – A RangedSummarizedExperiment-like object containing a “raw_counts” assay and sample metadata in col_data.
by (str) – Scaling method: “auc” or “mapped_reads”.
target_read_count (float) – Target library size used to compute scale factors. Interpreted as the number of single-end reads to scale each sample to.
target_read_length_bp (float) – Target read length used only when by="mapped_reads".
round_to_integers (bool) – If True, round scaled values to 0 decimals.
**kwargs (Any) – Additional parameters forwarded to compute_scale_factors(). Use this to override metadata column names (for example, auc_column=…, mapped_reads_column=…, avg_mapped_read_length_column=…) or to provide paired_end_status=… when paired-end status should not be inferred.
- Returns:
A pandas DataFrame of scaled counts with the same dimensions as assay(“raw_counts”). Row and column names are preserved when available.
- Raises:
ValueError – If rse is not a RangedSummarizedExperiment, if the required assay or metadata columns are missing, if by is invalid, or if the assay and metadata dimensions do not align.
TypeError – If round_to_integers is not a bool, or if numeric parameters are not valid scalars.
- Return type:
DataFrame
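The per-column scaling step itself is a single broadcast multiplication. The sketch below assumes the raw counts are already a feature-by-sample DataFrame and the scale factors are a Series indexed by the same sample names; scale_counts is a hypothetical name, not part of recount3.

```python
import pandas as pd


def scale_counts(
    raw_counts: pd.DataFrame,
    scale_factors: pd.Series,
    round_to_integers: bool = True,
) -> pd.DataFrame:
    """scaled[i, j] = raw_counts[i, j] * scale_factor[j] (hypothetical helper)."""
    # axis="columns" aligns the Series index with the sample (column) names.
    scaled = raw_counts.mul(scale_factors, axis="columns")
    return scaled.round(0) if round_to_integers else scaled
```

Aligning on column labels rather than position means a reordered metadata table still scales each sample by its own factor.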
recount3._descriptions
Resource descriptions and duffel URL-path construction.
This module defines a small set of resource description dataclasses for recount3. A description is a validated, immutable-ish bundle of parameters (organism, project, etc.) that can deterministically construct the relative path to a resource in the repository.
The main entry point is R3ResourceDescription, which acts as a
multi-factory: instantiating R3ResourceDescription(resource_type=...)
returns an instance of the registered concrete subclass for that
resource_type.
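- Registered resource types
resource_type string          Class
"annotations"                 R3Annotations
"count_files_gene_or_exon"    R3GeneOrExonCounts
"count_files_junctions"       R3JunctionCounts
"metadata_files"              R3ProjectMetadata
"bigwig_files"                R3BigWig
"data_sources"                R3DataSources
"data_source_metadata"        R3DataSourceMetadata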
Valid values for the organism and data_source fields are exposed as
the module-level constants VALID_ORGANISMS and
VALID_DATA_SOURCES.
Typical usage:
from recount3._descriptions import R3ResourceDescription
desc = R3ResourceDescription(
resource_type="count_files_gene_or_exon",
organism="human",
data_source="sra",
genomic_unit="gene",
project="SRP107565",
annotation_extension="G026",
)
path = desc.url_path()
# Resource type can also be passed as the first positional argument:
desc = R3ResourceDescription("data_sources", organism="human")
- class recount3._descriptions.R3ResourceDescription(*args: Any, **kwargs: Any)[source]
Abstract base class and multi-factory for recount3 resource descriptors.
Instantiate this class with a resource_type to obtain an instance of the registered concrete subclass.
Example
>>> desc = R3ResourceDescription(
...     resource_type="annotations",
...     organism="human",
...     genomic_unit="gene",
...     annotation_extension="G026",
... )
>>> isinstance(desc, R3Annotations)
True
Concrete subclasses should:
- Inherit from both _R3CommonFields and R3ResourceDescription.
- Validate required parameters in __post_init__.
- Implement url_path() to return the duffel-relative path.
- _TYPE_REGISTRY
Mapping from resource-type strings to concrete classes. Mutable to allow dynamic registration.
- _RESOURCE_TYPE
Registration string key. Injected into concrete subclasses by the register_type() decorator.
- Type:
- resource_type
Resource-type discriminator. Declared here for static typing; implemented by _R3CommonFields.
- Type:
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args (Any) – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs (Any) – Keyword arguments used to initialize the selected dataclass subclass.
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- Return type:
- classmethod register_type(resource_type: str) Callable[[type[R3ResourceDescription]], type[R3ResourceDescription]][source]
Registers a concrete subclass for a given resource_type.
This decorator binds resource_type to the provided subclass and also sets an internal _RESOURCE_TYPE attribute on that subclass.
- Parameters:
resource_type (str) – Resource-type string used as the factory key.
- Returns:
A decorator that registers the decorated subclass.
- Return type:
Callable[[type[R3ResourceDescription]], type[R3ResourceDescription]]
Example
>>> @R3ResourceDescription.register_type("annotations")
... @dataclasses.dataclass(slots=True)
... class R3Annotations(_R3CommonFields, R3ResourceDescription):
...     ...
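The multi-factory pattern described here (a base class whose constructor dispatches to a registered subclass) can be sketched in a self-contained way. The class names Description and Annotations are hypothetical stand-ins, and the dispatch through __new__ is an assumption about how such a factory could work, not a copy of the recount3 source.

```python
from __future__ import annotations

import dataclasses
from typing import Any, Callable, ClassVar


class Description:
    """Hypothetical multi-factory base class (illustration only)."""

    _TYPE_REGISTRY: ClassVar[dict[str, type[Description]]] = {}

    def __new__(cls, *args: Any, **kwargs: Any) -> Description:
        if cls is not Description:
            # Direct instantiation of a concrete subclass: no dispatch.
            return super().__new__(cls)
        resource_type = kwargs.get("resource_type") or (args[0] if args else None)
        if not resource_type:
            raise KeyError("resource_type is missing or empty")
        try:
            concrete = cls._TYPE_REGISTRY[resource_type]
        except KeyError:
            raise ValueError(f"unregistered resource_type: {resource_type!r}") from None
        # Python then calls the concrete class's __init__ with the same arguments.
        return super().__new__(concrete)

    @classmethod
    def register_type(
        cls, resource_type: str
    ) -> Callable[[type[Description]], type[Description]]:
        def decorator(subclass: type[Description]) -> type[Description]:
            subclass._RESOURCE_TYPE = resource_type
            cls._TYPE_REGISTRY[resource_type] = subclass
            return subclass

        return decorator

    def url_path(self) -> str:
        raise NotImplementedError


@Description.register_type("annotations")
@dataclasses.dataclass
class Annotations(Description):
    resource_type: str
    organism: str
    genomic_unit: str
    annotation_extension: str

    def url_path(self) -> str:
        unit = f"{self.genomic_unit}_sums"
        return (
            f"{self.organism}/annotations/{unit}/"
            f"{self.organism}.{unit}.{self.annotation_extension}.gtf.gz"
        )
```

Keeping the registry on the base class lets new resource types be added with a decorator alone, without touching the factory logic.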
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
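Because url_path() returns a path with no leading slash, the caller is responsible for the separator. One way to join it onto the base URL is sketched below; join_url is a hypothetical helper, and the base URL is the documented default for RECOUNT3_URL.

```python
def join_url(base_url: str, relative_path: str) -> str:
    """Join a base URL and a duffel-relative path with exactly one slash."""
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


# Works whether or not the base URL carries a trailing slash.
full_url = join_url("https://duffel.rail.bio/recount3/", "human/homes_index")
```

Plain string joining is used here instead of urllib.parse.urljoin because urljoin drops the last path segment of a base URL that lacks a trailing slash.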
- class recount3._descriptions.R3Annotations(*args: Any, **kwargs: Any)[source]
Descriptor for annotation GTF files.
- Required fields:
organism
genomic_unit
annotation_extension
- Duffel layout:
{organism}/annotations/{genomic_unit}_sums/{organism}.{genomic_unit}_sums.{annotation_extension}.gtf.gz
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
- class recount3._descriptions.R3GeneOrExonCounts(*args: Any, **kwargs: Any)[source]
Descriptor for per-project gene/exon count matrices.
- Required fields:
organism
data_source
genomic_unit
project
annotation_extension
- Duffel layout:
{organism}/data_sources/{data_source}/{genomic_unit}_sums/{p2(project)}/{project}/{data_source}.{genomic_unit}_sums.{project}.{annotation_extension}.gz
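The layout can be turned into a path with a few f-strings. This is a sketch under one stated assumption: p2(project) is taken to be the last two characters of the project identifier, which this section does not define. Both function names are hypothetical.

```python
def p2(project: str) -> str:
    """ASSUMPTION: the shard directory is the last two characters of the project id."""
    return project[-2:]


def gene_or_exon_counts_path(
    organism: str,
    data_source: str,
    genomic_unit: str,
    project: str,
    annotation_extension: str,
) -> str:
    """Build the duffel-relative path following the layout shown in this section."""
    unit = f"{genomic_unit}_sums"
    return (
        f"{organism}/data_sources/{data_source}/{unit}/"
        f"{p2(project)}/{project}/"
        f"{data_source}.{unit}.{project}.{annotation_extension}.gz"
    )


path = gene_or_exon_counts_path("human", "sra", "gene", "SRP009615", "G026")
```

The result has no leading slash, consistent with the url_path() contract.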
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
- class recount3._descriptions.R3JunctionCounts(*args: Any, **kwargs: Any)[source]
Descriptor for per-project junction count files.
- Required fields:
organism
data_source
project
junction_type
junction_extension
- Duffel layout:
{organism}/data_sources/{data_source}/junctions/{p2(project)}/{project}/{data_source}.junctions.{project}.{junction_type}.{junction_extension}.gz
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
- class recount3._descriptions.R3ProjectMetadata(*args: Any, **kwargs: Any)[source]
Descriptor for per-project metadata tables.
- Required fields:
organism
data_source
project
table_name
- Duffel layout:
{organism}/data_sources/{data_source}/metadata/{p2(project)}/{project}/{data_source}.{table_name}.{project}.MD.gz
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
- class recount3._descriptions.R3BigWig(*args: Any, **kwargs: Any)[source]
Descriptor for per-sample BigWig coverage files.
- Required fields:
organism
data_source
project
sample
- Duffel layout:
{organism}/data_sources/{data_source}/base_sums/{p2(project)}/{project}/{shard(sample, data_source)}/{data_source}.base_sums.{project}_{sample}.ALL.bw
The sample shard subdirectory uses a different offset for GTEx samples compared to SRA/TCGA samples; see _sample_shard().
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
- class recount3._descriptions.R3DataSources(*args: Any, **kwargs: Any)[source]
Descriptor for the organism-level data-source index (homes_index).
- Required fields:
organism
- Duffel layout:
{organism}/homes_index
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
- class recount3._descriptions.R3DataSourceMetadata(*args: Any, **kwargs: Any)[source]
Descriptor for source-level metadata listings.
- Required fields:
organism
data_source
- Duffel layout:
{organism}/data_sources/{data_source}/metadata/{data_source}.recount_project.MD.gz
Constructs and returns an instance of the appropriate subclass.
The factory accepts the resource type either as:
- keyword argument resource_type=…, or
- the first positional argument.
- Parameters:
*args – Optional positional arguments; if present, args[0] may supply the resource type.
**kwargs – Keyword arguments used to initialize the selected dataclass subclass.
resource_type (str)
organism (str | None)
data_source (str | None)
genomic_unit (str | None)
project (str | None)
sample (str | None)
annotation_extension (str | None)
junction_type (str | None)
junction_extension (str | None)
table_name (str | None)
- Returns:
An instance of the concrete subclass registered for the selected resource type.
- Raises:
KeyError – If resource_type is missing or empty.
ValueError – If resource_type is not registered.
- url_path() str[source]
Return the duffel-relative URL path for this resource.
R3ResourceDescription is a multi-factory and abstract interface: calling R3ResourceDescription(resource_type=...) returns an instance of a concrete registered subclass (e.g. R3Annotations). Those subclasses implement url_path() to construct a deterministic path within the duffel repository layout.
This base-class method is not expected to be called directly; it exists to document the interface and provide a consistent method name across all description types.
Implementations must return a path without a leading slash. Callers should join this value onto the configured base URL (and add any separators as needed).
- Returns:
A duffel-relative path string, no leading slash.
- Raises:
NotImplementedError – Always, in the base class. Concrete subclasses override this method.
- Return type:
recount3.errors
Domain-specific exceptions for recount3.
All exceptions inherit from Recount3Error, so callers can catch
the base class to handle any recount3-specific failure, or catch a subclass
to handle a specific failure mode:
Recount3Error: base class for all recount3 exceptions.
ConfigurationError: invalid or missing configuration (bad env var values, inaccessible cache directory, unsupported option combinations).
DownloadError: a network or I/O failure occurred while downloading a resource.
LoadError: a resource was downloaded but could not be parsed (empty file, unexpected format, shape mismatch, missing columns).
CompatibilityError: resources that are incompatible with each other were combined in an operation such as stack_count_matrices().
Example
Handle a specific failure first, then fall back to the base class for any other recount3 error:
from recount3.errors import Recount3Error, DownloadError
# `res` is assumed to be an R3Resource obtained earlier.
try:
    res.download(path="/data")
except DownloadError as exc:
    print(f"Network failure: {exc}")
except Recount3Error as exc:
    print(f"Unexpected recount3 error: {exc}")
recount3.types
Common type aliases and literals used throughout the recount3 API.
- recount3.types.CacheMode
Literal "enable" | "disable" | "update". Controls how download() interacts with the on-disk cache:
"enable": use the existing cached file; download only if missing (default).
"disable": bypass the cache entirely and stream the file directly to the destination.
"update": force a fresh download even if a cached copy already exists, then cache the result.
- Type:
TypeAlias
- recount3.types.CompatibilityMode
Literal "family" | "feature". Controls validation in stack_count_matrices():
"family": all count resources must belong to the same high-level family (gene/exon versus junctions).
"feature": stricter; all resources must additionally share an identical feature space (same genomic unit and annotation).
- Type:
TypeAlias
- recount3.types.StringOrIterable
str | Iterable[str]. Most search functions accept either a single string or an iterable of strings for each parameter. When an iterable is passed, the function computes the Cartesian product across all parameters and returns one R3Resource per combination.
- Type:
TypeAlias
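The Cartesian-product expansion can be illustrated with itertools.product. The helper names below are hypothetical, and the sketch returns plain dictionaries rather than R3Resource objects; it only shows how single strings and iterables combine.

```python
import itertools
from typing import Iterable, Union

StringOrIterable = Union[str, Iterable[str]]


def _as_tuple(value: StringOrIterable) -> tuple:
    # A bare string is one value, not an iterable of characters.
    return (value,) if isinstance(value, str) else tuple(value)


def expand_combinations(**params: StringOrIterable) -> list:
    """Return one dict per combination of the supplied parameter values."""
    names = list(params)
    pools = [_as_tuple(params[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*pools)]


combos = expand_combinations(
    organism="human",
    project=["SRP009615", "SRP012682"],
    genomic_unit=("gene", "exon"),
)
```

With one organism, two projects, and two genomic units, the expansion yields four combinations.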
- recount3.types.FieldSpec
StringOrIterable | Callable[[Any], bool] | None. The filter predicate accepted by filter() and match_spec(). Three forms are accepted:
None: no filtering; every value passes.
A string or iterable of strings: exact membership test against the field value.
A callable: called with the field value; a truthy return keeps the resource.
- Type:
TypeAlias
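The three accepted forms map onto a short dispatch function. This is a sketch of the described semantics only; field_matches is a hypothetical name and is not the library's match_spec().

```python
from typing import Any, Callable, Iterable, Optional, Union

FieldSpec = Optional[Union[str, Iterable[str], Callable[[Any], bool]]]


def field_matches(spec: FieldSpec, value: Any) -> bool:
    """Apply one FieldSpec to one field value (illustrative sketch)."""
    if spec is None:
        return True                # no filtering
    if callable(spec):
        return bool(spec(value))   # truthy return keeps the resource
    if isinstance(spec, str):
        return value == spec       # single string: exact match
    return value in set(spec)      # iterable of strings: membership
```

The string case is checked before the generic iterable case so that "gene" is compared as a whole rather than character by character.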
recount3._bigwig
BigWig file access via a small wrapper around pyBigWig.
This is an internal module. For BigWig access via the public API, use
load() on a BigWig
R3Resource.
The optional dependency is imported through
get_pybigwig_module().
Typical usage:
>>> from pathlib import Path
>>> from recount3._bigwig import BigWigFile
>>> with BigWigFile(Path("example.bw")) as bw:
...     lengths = bw.chroms()
...     mean = bw.stats("chr1", 0, 1000)[0]
Note
Requires the optional pyBigWig package. Install with
pip install pyBigWig. An ImportError is raised on first use
if the package is not available. pyBigWig can be difficult to
install on non-Linux systems.
- class recount3._bigwig.BigWigFile(path: Path, mode: str = 'r')[source]
A lazily-opened BigWig reader with a small, typed API.
Instances are cheap to construct and do not open the file until the first method call that requires a live pyBigWig handle. The handle is cached for subsequent calls and can be explicitly released with close().
- path
Filesystem path to a BigWig file (typically .bw). The file must exist when the handle is opened.
- Type:
- close() None[source]
Close the underlying pyBigWig handle if it is open.
This method is idempotent.
- Return type:
None
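The lazy-open and idempotent-close behavior can be sketched independently of pyBigWig. The class below is a hypothetical illustration, not the BigWigFile source: the opener argument is an assumption introduced so the pattern can run without the optional dependency (in real use it would be something like pyBigWig.open).

```python
from pathlib import Path
from typing import Any, Callable, Optional


class LazyHandle:
    """Hypothetical lazy wrapper; not the actual BigWigFile implementation."""

    def __init__(self, path: Path, opener: Callable[[str], Any]) -> None:
        self.path = path
        self._opener = opener          # assumption: injected for illustration
        self._handle: Optional[Any] = None

    def _ensure_open(self) -> Any:
        # Open on first use, then reuse the cached handle.
        if self._handle is None:
            self._handle = self._opener(str(self.path))
        return self._handle

    def chroms(self) -> Any:
        return self._ensure_open().chroms()

    def close(self) -> None:
        # Idempotent: a second call finds no handle and does nothing.
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __enter__(self) -> "LazyHandle":
        return self

    def __exit__(self, *exc_info: Any) -> None:
        self.close()
```

Constructing the object never touches the filesystem, so a missing file only surfaces on the first method call that needs the handle.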
- chroms(chrom: str | None = None) Mapping[str, int] | int | None[source]
Return chromosome lengths, or a single chromosome length.
- Parameters:
chrom (str | None) – If provided, return only the length for this chromosome.
- Returns:
If chrom is None, a mapping from chromosome name to length. Otherwise, the length for the requested chromosome. None may be returned if the chromosome is not present.
- Return type:
- values(chrom: str, start: int, end: int, *, numpy: bool | None = None) list[float] | Any[source]
Return per-base values over a half-open interval [start, end).
- Parameters:
- Returns:
Values returned by pyBigWig.values. When numpy is not True, this is typically a list[float]. When numpy is True, pyBigWig may return a NumPy array depending on its configuration.
- Return type:
- stats(chrom: str, start: int | None = None, end: int | None = None, *, type: str = 'mean', n_bins: int | None = None, exact: bool | None = None) list[float | None][source]
Return summary statistic(s) over an interval or whole chromosome.
- Parameters:
chrom (str) – Chromosome name.
start (int | None) – 0-based start coordinate (inclusive). If omitted, stats are computed over the whole chromosome.
end (int | None) – 0-based end coordinate (exclusive). If omitted, stats are computed over the whole chromosome.
type (str) – Statistic name understood by pyBigWig.stats (for example, "mean", "min", "max", "coverage").
n_bins (int | None) – If provided, request binned stats via the nBins argument.
exact (bool | None) – If provided, forward to the exact argument.
- Returns:
A list of statistic values as returned by pyBigWig.stats. Values may be None for missing data regions.
- Return type:
- intervals(chrom: str, start: int | None = None, end: int | None = None) list[tuple[int, int, float]] | None | Any[source]
Return (start, end, value) intervals overlapping a region.
- Parameters:
- Returns:
Intervals returned by pyBigWig.intervals. This is often a list of (start, end, value) tuples. None may be returned when no intervals overlap the requested region.
- Return type: