Tutorial
========

This tutorial walks through common workflows using the ``recount3`` Python
package.  All examples assume the package is installed with the ``biocpy``
extra::

   pip install "recount3[biocpy]"

For BigWig support, add the ``bigwig`` extra as well::

   pip install "recount3[biocpy,bigwig]"


Discovering Resources for a Project
------------------------------------

:class:`~recount3.bundle.R3ResourceBundle` is the main entry point.
Call :meth:`~recount3.bundle.R3ResourceBundle.discover` to fetch every
resource associated with a project:

.. code:: python

   from recount3 import R3ResourceBundle

   bundle = R3ResourceBundle.discover(
       organism="human",
       data_source="sra",
       project="SRP009615",
   )
   print(f"Found {len(bundle.resources)} resources.")

Supported organisms are ``"human"`` and ``"mouse"``.  Supported data sources
are ``"sra"``, ``"gtex"``, and ``"tcga"``.
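
Because only these values are documented, it can be convenient to check
arguments before any network request is made.  The following is a minimal
sketch; ``check_project_args`` and the two constants are hypothetical names
for illustration and are not part of ``recount3``:

.. code:: python

   SUPPORTED_ORGANISMS = ("human", "mouse")
   SUPPORTED_DATA_SOURCES = ("sra", "gtex", "tcga")


   def check_project_args(organism: str, data_source: str) -> None:
       """Raise ValueError if an argument is outside the documented sets."""
       if organism not in SUPPORTED_ORGANISMS:
           raise ValueError(
               f"organism must be one of {SUPPORTED_ORGANISMS}, got {organism!r}"
           )
       if data_source not in SUPPORTED_DATA_SOURCES:
           raise ValueError(
               f"data_source must be one of {SUPPORTED_DATA_SOURCES}, "
               f"got {data_source!r}"
           )


   check_project_args("human", "sra")  # valid values: returns without error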


Stacking Count Matrices
-----------------------

Use :meth:`~recount3.bundle.R3ResourceBundle.stack_count_matrices` to
concatenate all count files for one genomic unit (gene, exon, or junction)
from a bundle into a single :class:`pandas.DataFrame`:

.. code:: python

   counts = bundle.only_counts().stack_count_matrices(genomic_unit="gene")
   print(counts.shape)   # (n_genes, n_samples)


Building a SummarizedExperiment
--------------------------------

This step requires the ``biocpy`` extra, which provides ``biocframe`` and
``summarizedexperiment``.

.. code:: python

   se = bundle.to_summarized_experiment(genomic_unit="gene")
   print(se)


Building a RangedSummarizedExperiment
---------------------------------------

A :class:`~summarizedexperiment.RangedSummarizedExperiment` attaches
genomic coordinate ranges to each feature row.  A GTF annotation file is
required; include it in the bundle via ``search_project_all`` (done
automatically by :meth:`~recount3.bundle.R3ResourceBundle.discover`) or
add an annotation resource manually.

.. code:: python

   rse = bundle.to_ranged_summarized_experiment(genomic_unit="gene")
   print(rse)
   print(rse.shape)          # (n_genes, n_samples)
   print(rse.colnames[:5])   # first five sample IDs


Accessing Metadata
------------------

Column metadata (sample annotations) is merged from every available
metadata table and aligned to the count columns automatically:

.. code:: python

   col_data = rse.coldata.to_pandas()
   print(col_data.columns.tolist())
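
To make "merged and aligned" concrete, the idea can be sketched with plain
dictionaries.  This is an illustration only, not the package's
implementation: ``align_metadata``, the sample IDs, and the column names are
all made up, and dropping samples that have no counts is an assumption.

.. code:: python

   def align_metadata(sample_ids, tables):
       """Merge per-sample tables and return rows ordered like ``sample_ids``.

       Each table maps a sample ID to a dict of {column name: value}.
       """
       merged = {sid: {} for sid in sample_ids}
       for table in tables:
           for sid, row in table.items():
               if sid in merged:       # ignore samples without count columns
                   merged[sid].update(row)
       return [merged[sid] for sid in sample_ids]


   count_columns = ["S2", "S1"]        # order of the count matrix columns
   project_meta = {"S1": {"title": "a"}, "S2": {"title": "b"}, "S3": {"title": "c"}}
   qc_meta = {"S1": {"reads": 10}}

   rows = align_metadata(count_columns, [project_meta, qc_meta])
   print(rows)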


Filtering Bundles
-----------------

Bundles are immutable snapshots.  Use :meth:`~recount3.bundle.R3ResourceBundle.filter`
to narrow a bundle by any description field; the original bundle is left
unchanged:

.. code:: python

   gene_only = bundle.filter(genomic_unit="gene")
   annots    = bundle.filter(resource_type="annotations")
   metadata  = bundle.only_metadata()
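
The immutable-filter pattern can be modelled in a few lines.  The sketch
below uses hypothetical stand-in classes (``ToyDescription``, ``ToyBundle``)
rather than the real ``recount3`` types, and the ``"counts"`` resource type
is a placeholder value:

.. code:: python

   from dataclasses import dataclass
   from typing import Optional, Tuple


   @dataclass(frozen=True)
   class ToyDescription:
       """Hypothetical stand-in for a resource description."""
       resource_type: str
       genomic_unit: Optional[str] = None


   @dataclass(frozen=True)
   class ToyBundle:
       """Hypothetical immutable bundle holding a tuple of descriptions."""
       resources: Tuple[ToyDescription, ...]

       def filter(self, **fields) -> "ToyBundle":
           # Keep resources whose attributes equal every requested value.
           kept = tuple(
               r for r in self.resources
               if all(getattr(r, k, None) == v for k, v in fields.items())
           )
           return ToyBundle(kept)  # a new object; ``self`` is not modified


   toy_bundle = ToyBundle((
       ToyDescription("counts", "gene"),
       ToyDescription("counts", "exon"),
       ToyDescription("annotations", "gene"),
   ))
   toy_gene_only = toy_bundle.filter(genomic_unit="gene")
   print(len(toy_gene_only.resources), len(toy_bundle.resources))

Because every call returns a new object, filters can be chained without
affecting bundles that other code still holds.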


Working with Individual Resources
-----------------------------------

Each :class:`~recount3.resource.R3Resource` knows its URL and can be
downloaded and loaded independently:

.. code:: python

   from recount3 import R3GeneOrExonCounts, R3Resource

   desc = R3GeneOrExonCounts(
       organism="human",
       data_source="sra",
       genomic_unit="gene",
       project="SRP009615",
       sample="SRR387777",
   )
   res = R3Resource(desc)
   res.download(path=None, cache_mode="enable")  # saves to local cache
   df = res.load()                               # returns pd.DataFrame
   print(df.head())


Searching Without a Bundle
---------------------------

The :mod:`recount3.search` helpers return lists of
:class:`~recount3.resource.R3Resource` objects directly:

.. code:: python

   from recount3 import search_count_files_gene_or_exon

   resources = search_count_files_gene_or_exon(
       organism="human",
       data_source="sra",
       genomic_unit="gene",
       project="SRP009615",
   )
   print(f"{len(resources)} count files found.")


Cache Management
----------------

Downloaded files are cached under ``~/.cache/recount3/files`` by default.
Use the :mod:`recount3.config` helpers to inspect or clear the cache:

.. code:: python

   from recount3 import recount3_cache, recount3_cache_files, recount3_cache_rm

   print(recount3_cache())          # Path to the cache directory
   print(recount3_cache_files())    # List all cached files

   # Dry run: report which junction files would be removed.
   removed = recount3_cache_rm(
       predicate=lambda p: ".junctions." in p.name,
       dry_run=True,
   )
   print(f"Would remove {len(removed)} files.")
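
The predicate and dry-run pattern can be tried safely on a throwaway
directory.  The sketch below is self-contained and does not call
``recount3``; ``remove_matching`` and the file names are hypothetical, and
the behaviour shown (a dry run returns the matches without deleting them) is
an assumption about how such a helper typically works:

.. code:: python

   import tempfile
   from pathlib import Path
   from typing import Callable, List


   def remove_matching(
       cache_dir: Path,
       predicate: Callable[[Path], bool],
       dry_run: bool = True,
   ) -> List[Path]:
       """Return files matching ``predicate``; delete them unless ``dry_run``."""
       matches = sorted(
           p for p in cache_dir.rglob("*") if p.is_file() and predicate(p)
       )
       if not dry_run:
           for path in matches:
               path.unlink()
       return matches


   def is_junction(path: Path) -> bool:
       return ".junctions." in path.name


   with tempfile.TemporaryDirectory() as tmp:
       root = Path(tmp)
       (root / "sample.junctions.tsv").touch()   # placeholder file names
       (root / "sample.gene.tsv").touch()

       would_remove = remove_matching(root, is_junction, dry_run=True)
       n_files_after_dry_run = len(list(root.iterdir()))

       removed = remove_matching(root, is_junction, dry_run=False)
       remaining = sorted(p.name for p in root.iterdir())

Running the dry run first makes it easy to confirm that the predicate
matches only the intended files before anything is deleted.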

The cache directory and other settings can be overridden via environment
variables (``RECOUNT3_CACHE_DIR``, ``RECOUNT3_URL``, etc.) or by
constructing a custom :class:`~recount3.config.Config` and passing it to
any resource or search function.
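
As a rough sketch of how an environment override might resolve, the helper
below prefers ``RECOUNT3_CACHE_DIR`` and falls back to the default location
mentioned above.  ``resolve_cache_dir`` is a hypothetical name, and the
precedence and the handling of an empty value are assumptions rather than
documented behaviour:

.. code:: python

   import os
   from pathlib import Path
   from typing import Mapping

   DEFAULT_CACHE_DIR = Path.home() / ".cache" / "recount3" / "files"


   def resolve_cache_dir(environ: Mapping[str, str] = os.environ) -> Path:
       """Use RECOUNT3_CACHE_DIR when set and non-empty, else the default."""
       override = environ.get("RECOUNT3_CACHE_DIR")
       if override:
           return Path(override).expanduser()
       return DEFAULT_CACHE_DIR


   print(resolve_cache_dir({"RECOUNT3_CACHE_DIR": "/tmp/r3-cache"}))
   print(resolve_cache_dir({}))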


SRA Attribute Expansion
-----------------------

recount3 stores SRA sample attributes as a single pipe-delimited string
column.  :func:`recount3.se.build_ranged_summarized_experiment` expands
them automatically, but you can also call the helper directly:

.. code:: python

   from recount3.se import expand_sra_attributes

   rse_expanded = expand_sra_attributes(rse)
   col_data = rse_expanded.coldata.to_pandas()
   # Columns like "sra_attribute.age", "sra_attribute.disease" are now present.
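
To show what the expansion does to a single value, here is a standalone
sketch that parses one attribute string into prefixed column names.  It does
not use ``recount3``: ``parse_sra_attributes`` is a hypothetical helper, the
example string is invented, and both the ``;;`` key/value separator and the
replacement of spaces with underscores are assumptions about the format.

.. code:: python

   def parse_sra_attributes(raw: str, prefix: str = "sra_attribute.") -> dict:
       """Split ``key;;value|key;;value`` into {prefixed key: value}."""
       columns = {}
       for pair in raw.split("|"):
           key, sep, value = pair.partition(";;")
           if not sep or not key.strip():
               continue  # skip empty or malformed segments
           columns[prefix + key.strip().replace(" ", "_")] = value.strip()
       return columns


   example = "age;;34|disease;;control|cell type;;T cell"
   parsed = parse_sra_attributes(example)
   print(sorted(parsed))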