diff --git a/CHANGELOG.md b/CHANGELOG.md index b05854b..88f17b9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,11 @@ # Changelog +## Version 0.4.1 - 0.4.3 + +- Helper methods to create sample mapping if not provided. +- Subset operations on samples. +- Update sphinx configuration to run snippets in the documentation. + ## Version 0.4.0 This is a complete rewrite of the package, following the functional paradigm from our [developer notes](https://github.com/BiocPy/developer_guide#use-functional-discipline). diff --git a/README.md b/README.md index dd6488a..c5040a7 100644 --- a/README.md +++ b/README.md @@ -4,11 +4,11 @@ # MultiAssayExperiment -Container class to represent and manage multi-omics genomic experiments. Follows Bioconductor's [MAE R/Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html). +Container class to represent and manage multi-omics genomic experiments. `MultiAssayExperiment` (MAE) simplifies the management of multiple experimental assays conducted on a shared set of specimens and follows Bioconductor's [MAE R/Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html). ## Install -Package is published to [PyPI](https://pypi.org/project/multiassayexperiment/) +To get started, install the package from [PyPI](https://pypi.org/project/multiassayexperiment/) ```shell pip install multiassayexperiment @@ -16,7 +16,20 @@ pip install multiassayexperiment ## Usage -First create mock sample data +An MAE contains three main entities, + +- **Primary information** (`column_data`): Bio-specimen/sample information. The `column_data` may provide information about patients, cell lines, or other biological units. Each row in this table represents an independent biological unit. It must contain an `index` that maps to the 'primary' in `sample_map`. + +- **Experiments** (`experiments`): Genomic data from each experiment. 
either a `SingleCellExperiment`, `SummarizedExperiment`, `RangedSummarizedExperiment` or any class that extends a `SummarizedExperiment`. + +- **Sample Map** (`sample_map`): Map biological units from `column_data` to the list of `experiments`. Must contain columns, + - **assay** provides the names of the different experiments performed on the biological units. All experiment names from `experiments` must be present in this column. + - **primary** contains the sample name. All names in this column must match with row labels from `column_data`. + - **colname** is the mapping of samples/cells within each experiment back to its biosample information in `column_data`. + + Each sample in ``column_data`` may map to one or more columns per assay. + +Let's start by first creating a few experiments: ```python from random import random @@ -67,7 +80,7 @@ sample_map = BiocFrame({ sample_data = BiocFrame({"samples": ["sample1", "sample2"]}, row_names= ["sample1", "sample2"]) ``` -Now we can create an instance of an MAE - +Finally, we can create a `MultiAssayExperiment` object: ```python from multiassayexperiment import MultiAssayExperiment diff --git a/docs/conf.py b/docs/conf.py index ad19490..6087fa1 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -72,6 +72,7 @@ "sphinx.ext.ifconfig", "sphinx.ext.mathjax", "sphinx.ext.napoleon", + "sphinx_autodoc_typehints", ] # Add any paths that contain templates here, relative to this directory. 
@@ -79,7 +80,8 @@ # Enable markdown -extensions.append("myst_parser") +# extensions.append("myst_parser") +extensions.append("myst_nb") # Configure MyST-Parser myst_enable_extensions = [ @@ -311,8 +313,9 @@ "anndata": ("https://anndata.readthedocs.io/en/latest/", None), "biocframe": ("https://biocpy.github.io/BiocFrame", None), "genomicranges": ("https://biocpy.github.io/GenomicRanges", None), - "singelcellexperiment": ("https://biocpy.github.io/SingleCellExperiment", None), + "singlecellexperiment": ("https://biocpy.github.io/SingleCellExperiment", None), "summarizedexperiment": ("https://biocpy.github.io/SummarizedExperiment", None), + "biocutils": ("https://biocpy.github.io/BiocUtils", None), } print(f"loading configurations for {project} {version} ...", file=sys.stderr) \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index aa42c3e..2eb7d00 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,10 +1,10 @@ # MultiAssayExperiment -Container class to represent multiple experiments and assays performed over a set of samples. follows Bioconductor's [MAE R/Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html). +Container class to represent and manage multi-omics genomic experiments. `MultiAssayExperiment` (MAE) simplifies the management of multiple experimental assays conducted on a shared set of specimens and follows Bioconductor's [MAE R/Package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html). 
## Install -Package is published to [PyPI](https://pypi.org/project/multiassayexperiment/) +To get started, install the package from [PyPI](https://pypi.org/project/multiassayexperiment/) ```shell pip install multiassayexperiment @@ -15,8 +15,7 @@ pip install multiassayexperiment ```{toctree} :maxdepth: 2 -Overview -Tutorial +Overview Module Reference Contributions & Help License diff --git a/docs/overview.md b/docs/overview.md new file mode 100644 index 0000000..645db42 --- /dev/null +++ b/docs/overview.md @@ -0,0 +1,330 @@ +--- +file_format: mystnb +kernelspec: + name: python +--- + +# Multiple experiments + +`MultiAssayExperiment` (MAE) simplifies the management of multiple experimental assays conducted on a shared set of specimens. + +:::{note} +These classes follow a functional paradigm for accessing or setting properties, with further details discussed in the [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html#functional-discipline) section. +::: + +## Construction + +An MAE contains three main entities, + +- **Primary information** (`column_data`): Bio-specimen/sample information. The `column_data` may provide information about patients, cell lines, or other biological units. Each row in this table represents an independent biological unit. It must contain an `index` that maps to the 'primary' in `sample_map`. + +- **Experiments** (`experiments`): Genomic data from each experiment, either a `SingleCellExperiment`, `SummarizedExperiment`, `RangedSummarizedExperiment` or any class that extends a `SummarizedExperiment`. + +- **Sample Map** (`sample_map`): Map biological units from `column_data` to the list of `experiments`. Must contain columns, + - **assay** provides the names of the different experiments performed on the biological units. All experiment names from `experiments` must be present in this column. + - **primary** contains the sample name. All names in this column must match with row labels from `column_data`. 
+ - **colname** is the mapping of samples/cells within each experiment back to its biosample information in `column_data`. + + Each sample in ``column_data`` may map to one or more columns per assay. + +Let's start by first creating a few experiments: + +```{code-cell} + +from random import random + +import numpy as np +from biocframe import BiocFrame +from genomicranges import GenomicRanges +from iranges import IRanges + +nrows = 200 +ncols = 6 +counts = np.random.rand(nrows, ncols) +gr = GenomicRanges( + seqnames=[ + "chr1", + "chr2", + "chr2", + "chr2", + "chr1", + "chr1", + "chr3", + "chr3", + "chr3", + "chr3", + ] * 20, + ranges=IRanges(range(100, 300), range(110, 310)), + strand = ["-", "+", "+", "*", "*", "+", "+", "+", "-", "-"] * 20, + mcols=BiocFrame({ + "score": range(0, 200), + "GC": [random() for _ in range(10)] * 20, + }) +) + +col_data_sce = BiocFrame({"treatment": ["ChIP", "Input"] * 3}, + row_names=[f"sce_{i}" for i in range(6)], +) + +col_data_se = BiocFrame({"treatment": ["ChIP", "Input"] * 3}, + row_names=[f"se_{i}" for i in range(6)], +) +``` + +More importantly, we need to provide `sample_map` information: + +```{code-cell} +sample_map = BiocFrame({ + "assay": ["sce", "se"] * 6, + "primary": ["sample1", "sample2"] * 6, + "colname": ["sce_0", "se_0", "sce_1", "se_1", "sce_2", "se_2", "sce_3", "se_3", "sce_4", "se_4", "sce_5", "se_5"] +}) + +sample_data = BiocFrame({"samples": ["sample1", "sample2"]}, row_names= ["sample1", "sample2"]) + +print(sample_map) +``` + + +Finally, we can create a `MultiAssayExperiment` object: + +```{code-cell} +from multiassayexperiment import MultiAssayExperiment +from singlecellexperiment import SingleCellExperiment +from summarizedexperiment import SummarizedExperiment + +tsce = SingleCellExperiment( + assays={"counts": counts}, row_data=gr.to_pandas(), column_data=col_data_sce +) + +tse2 = SummarizedExperiment( + assays={"counts": counts.copy()}, + row_data=gr.to_pandas().copy(), + column_data=col_data_se.copy(), +) + 
+mae = MultiAssayExperiment( + experiments={"sce": tsce, "se": tse2}, + column_data=sample_data, + sample_map=sample_map, + metadata={"could be": "anything"}, +) + +print(mae) +``` + +### No sample mapping? + +If both `column_data` and `sample_map` are `None`, the constructor naively creates a sample mapping, with each `experiment` considered to be an independent `sample`. We add a sample to `column_data` named using the pattern ``unknown_sample_{experiment_name}``. + +All cells from each experiment are considered to be from the same sample, and this is reflected in `sample_map`. + +:::{important} +***This is not a recommended approach, but if you don’t have sample mapping, then it doesn’t matter***. +::: + +```{code-cell} +mae = MultiAssayExperiment( + experiments={"sce": tsce, "se": tse2}, + metadata={"could be": "anything"}, +) + +print(mae) +``` + +### Interop with `anndata` or `mudata` + +We provide convenient methods to easily convert a `MuData` object into a `MultiAssayExperiment`. + +Let's create a `MuData` object: + +```{code-cell} + +import numpy as np +from anndata import AnnData + +np.random.seed(1) + +n, d, k = 1000, 100, 10 + +z = np.random.normal(loc=np.arange(k), scale=np.arange(k) * 2, size=(n, k)) +w = np.random.normal(size=(d, k)) +y = np.dot(z, w.T) + +adata = AnnData(y) +adata.obs_names = [f"obs_{i+1}" for i in range(n)] +adata.var_names = [f"var_{j+1}" for j in range(d)] + +d2 = 50 +w2 = np.random.normal(size=(d2, k)) +y2 = np.dot(z, w2.T) + +adata2 = AnnData(y2) +adata2.obs_names = [f"obs_{i+1}" for i in range(n)] +adata2.var_names = [f"var2_{j+1}" for j in range(d2)] + +from mudata import MuData +mdata = MuData({"rna": adata, "spatial": adata2}) + +print(mdata) +``` + +Let's convert this object to an `MAE`: + +```{code-cell} +from multiassayexperiment import MultiAssayExperiment + +mae_obj = MultiAssayExperiment.from_mudata(input=mdata) +print(mae_obj) +``` + + +## Getters/Setters + +Getters are available to access various attributes using either the 
property notation or functional style. + +```{code-cell} +# access experiment names +print("experiment names (as property): ", mae.experiment_names) +print("experiment names (functional style): ", mae.get_experiment_names()) + +# access sample data +print(mae.column_data) +``` + +Check out the [class documentation](https://biocpy.github.io/MultiAssayExperiment/api/multiassayexperiment.html#multiassayexperiment.MultiAssayExperiment.MultiAssayExperiment) for the full list of accessors and setters. + + +#### Row or column name accessors + +Helper methods are available to access row or column names across all experiments. Each returns a dictionary with experiment names as keys and the corresponding row or column names as values: + +```{code-cell} +from rich import print as pprint +pprint("row names:", mae.get_row_names()) +pprint("column names:", mae.get_column_names()) +``` + +#### Access an experiment + +One can access an experiment by name: + +```{code-cell} +print(mae.experiment("se")) +``` + +Additionally, you may access an experiment with the sample information included in the column data of the experiment: + +:::{note} +This creates a copy of the experiment. +::: + +```{code-cell} +expt_with_sample_info = mae.experiment("se", with_sample_data=True) +print(expt_with_sample_info) +``` + +:::{note} +For consistency with the R MAE's interface, we also provide the `get_with_column_data` method, which performs the same operation. +::: + +### Setters + +:::{important} +All property-based setters are `in_place` operations, with further details discussed in the [functional paradigm](https://biocpy.github.io/tutorial/chapters/philosophy.html#functional-discipline) section. +::: + +```{code-cell} +modified_column_data = mae.column_data.set_column("score", range(len(mae.column_data))) +modified_mae = mae.set_column_data(modified_column_data) +print(modified_mae) +``` + +Now, let's check the `column_data` on the original object. 
+ +```{code-cell} +print(mae.column_data) +``` + + +## Subsetting + +You can subset `MultiAssayExperiment` by using the subset (`[]`) operator. This operation accepts different slice input types, such as a boolean vector, a `slice` object, a list of indices, or names (if available) to subset. + +`MultiAssayExperiment` allows subsetting by three dimensions: `rows`, `columns`, and `experiments`. ***`sample_map` is automatically filtered during this operation***. + +### Subset by indices + +```{code-cell} +subset_mae = mae[1:5, 0:4] +print(subset_mae) +``` + +### Subset by experiments dimension + +The following creates a subset based on the experiments dimension: + +```{code-cell} +subset_mae = mae[1:5, 0:1, ["se"]] +print(subset_mae) +``` + +:::{note} +If you're wondering about why the experiment "se" has 0 columns, it's important to note that our MAE implementation does not remove columns from an experiment solely because none of the columns map to the samples of interest. This approach aims to prevent unexpected outcomes in complex subset operations. +::: + +## Helper functions + +The `MultiAssayExperiment` class also provides a few methods for sample management. + +### Complete cases + +The `complete_cases` function is designed to identify samples that contain data across all experiments. It produces a boolean vector with the same length as the number of samples in `column_data`. Each element in the vector is `True` if the sample is present in all experiments, or `False` otherwise. + +```{code-cell} +print(mae.complete_cases()) +``` + +You can use this boolean vector to select samples with complete data across all assays or experiments. + +```{code-cell} +subset_mae = mae[:, mae.complete_cases(),] +print(subset_mae) +``` + + +### Replicates + +This method identifies 'samples' with replicates within each experiment. 
The result is a dictionary where experiment names serve as keys, and the corresponding values indicate whether the sample is replicated within each experiment. + + +```{code-cell} +from rich import print as pprint # mainly for pretty printing +pprint(mae.replicated()) +``` + + +### Intersect rows + +The `intersect_rows` method finds common `row_names` across all experiments and returns a `MultiAssayExperiment` with those rows. + +```{code-cell} +common_rows_mae = mae.intersect_rows() +print(common_rows_mae) +``` + +If you are only interested in finding common `row_names` across all experiments: + +```{code-cell} +common_rows = mae.find_common_row_names() +print(common_rows) +``` + +### Empty MAE + +While the necessity of an empty `MultiAssayExperiment` might not be apparent, one can be created for the sake of consistency with the rest of the tutorials: + +```{code-cell} +mae = MultiAssayExperiment(experiments={}) +print(mae) +``` diff --git a/docs/requirements.txt b/docs/requirements.txt index 6d65a9b..c20cf60 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,7 +1,9 @@ furo +myst-nb # Requirements file for ReadTheDocs, check .readthedocs.yml. # To build the module reference correctly, make sure every external package # under `install_requires` in `setup.cfg` is also listed here! # sphinx_rtd_theme myst-parser[linkify] sphinx>=3.2.1 +sphinx-autodoc-typehints diff --git a/docs/tutorial.md b/docs/tutorial.md deleted file mode 100644 index b220834..0000000 --- a/docs/tutorial.md +++ /dev/null @@ -1,232 +0,0 @@ -# Tutorial - -Container class to represent and manage multi-omics genomic experiments. - -For more detailed description checkout the [MultiAssayExperiment Bioc/R package](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html)) - -# Construct an `MultiAssayExperiment` - -An MAE contains three main entities, - -- Primary information (`col_data`): Bio-specimen/sample information. 
The ``col_data`` may provide information about patients, cell lines, or - other biological units. -- Experiments (`experiments`): Genomic data from each experiment. either a `SingleCellExperiment`, `SummarizedExperiment`, `RangeSummarizedExperiment` or - any class that extends a `SummarizedExperiment`. -- Sample Map (`sample_map`): Map biological units from ``col_data`` to the list of ``experiments``. Must contain columns, - - - **assay** provides the names of the different experiments performed on the - biological units. All experiment names from ``experiments`` must be present in this column. - - **primary** contains the sample name. All names in this column must match with row labels from ``col_data``. - - **colname** is the mapping of samples/cells within each experiment back to its biosample information in ``col_data``. - -Lets create these objects - -```python -from biocframe import BiocFrame -from iranges import IRanges -import numpy as np -from genomicranges import GenomicRanges -from random import random - -nrows = 200 -ncols = 6 -counts = np.random.rand(nrows, ncols) -gr = GenomicRanges( - seqnames=[ - "chr1", - "chr2", - "chr2", - "chr2", - "chr1", - "chr1", - "chr3", - "chr3", - "chr3", - "chr3", - ] * 20, - ranges=IRanges(range(100, 300), range(110, 310)), - strand = ["-", "+", "+", "*", "*", "+", "+", "+", "-", "-"] * 20, - mcols=BiocFrame({ - "score": range(0, 200), - "GC": [random() for _ in range(10)] * 20, - }) -) - -col_data_sce = BiocFrame({"treatment": ["ChIP", "Input"] * 3}, - row_names=["sce"] * 6, -) - -col_data_se = BiocFrame({"treatment": ["ChIP", "Input"] * 3}, - row_names=["se"] * 6, -) - -sample_map = BiocFrame({ - "assay": ["sce", "se"] * 6, - "primary": ["sample1", "sample2"] * 6, - "colname": ["sce", "se"] * 6 -}) - -sample_data = BiocFrame({"samples": ["sample1", "sample2"]}, row_names=["sample1", "sample2"]) -``` - -Then, create various experiment classes, - -```python -from singlecellexperiment import SingleCellExperiment -from 
summarizedexperiment import SummarizedExperiment - -tsce = SingleCellExperiment( - assays={"counts": counts}, row_data=gr.to_pandas(), column_data=col_data_sce -) - -tse2 = SummarizedExperiment( - assays={"counts": counts.copy()}, - row_data=gr.to_pandas().copy(), - column_data=col_data_se.copy(), -) -``` - -Now that we have all the pieces together, we can now create an MAE, - -```python -from multiassayexperiment import MultiAssayExperiment - -mae = MultiAssayExperiment( - experiments={"sce": tsce, "se": tse2}, - column_data=sample_data, - sample_map=sample_map, - metadata={"could be": "anything"}, -) -``` - -To make your life easier, we also provide methods to naively create sample mapping from experiments. - -**_This is not a recommended approach, but if you don't have sample mapping, then it doesn't matter._** - -```python -import multiassayexperiment -maeObj = multiassayexperiment.make_mae(experiments={"sce": tsce, "se": tse2}) -``` - -## Import `MuData` and `AnnData` as `MultiAssayExperiment` - -If you have datasets stored as `MuData`, these can be easily converted to an MAE using the `from_mudata` method. 
- -Lets first construct `AnnData`` objects and then an MAE - -```python -import multiassayexperiment as mae -import numpy as np -from anndata import AnnData - -np.random.seed(1) - -n, d, k = 1000, 100, 10 - -z = np.random.normal(loc=np.arange(k), scale=np.arange(k) * 2, size=(n, k)) -w = np.random.normal(size=(d, k)) -y = np.dot(z, w.T) - -adata = AnnData(y) -adata.obs_names = [f"obs_{i+1}" for i in range(n)] -adata.var_names = [f"var_{j+1}" for j in range(d)] - -d2 = 50 -w2 = np.random.normal(size=(d2, k)) -y2 = np.dot(z, w2.T) - -adata2 = AnnData(y2) -adata2.obs_names = [f"obs_{i+1}" for i in range(n)] -adata2.var_names = [f"var2_{j+1}" for j in range(d2)] -``` - -we can now construct a `MuData` object and convert that to an MAE - -```python -from mudata import MuData -from multiassayexperiment import MultiAssayExperiment -mdata = MuData({"rna": adata, "spatial": adata2}) - -maeObj = MultiAssayExperiment.from_mudata(input=mdata) -``` - -Methods are also available to convert an `AnnData` object to `MAE`. - -```python -import multiassayexperiment -maeObj = multiassayexperiment.read_h5ad("tests/data/adata.h5ad") -``` - -# Accessors - -Multiple methods are available to access various slots of a `MultiAssayExperiment` object - -```python -mae.assays -mae.column_data -mae.sample_map -mae.experiments -mae.metadata -``` - -## Access experiments - -if you want to access a specific experiment - -```python -# access a specific experiment -mae.experiment("se") -``` - -This does not include the sample data stored in the MAE. If you want to include this information - -***Note: This creates a copy of the experiment object.*** - -```python -expt_with_sampleData = maeObj.experiment(experiment_name, with_sample_data=True) -``` - -# Slice a `MultiAssayExperiment` - -`MultiAssayExperiment` allows subsetting by `rows`, `columns`, and `experiments`. Samples are automatically sliced during this operation. 
- -The structure for slicing, - -``` -mae[rows, columns, experiments] -``` - -- rows, columns: accepts either a slice, list of indices or a dictionary to specify slices per experiment. -- experiments: accepts a list of experiment names to subset to. - -## Slice by row and column slices - -```python -maeObj[1:5, 0:4] -``` - -## Slice by rows, columns, experiments - -```python -maeObj[1:5, 0:4, ["spatial"]] -``` - -Checkout other methods that perform similar operations - `subset_by_rows`, `subset_by_columns` & `subset_by_experiments`. - -# Helper methods - -## completedCases - -This method returns a boolean vector that specifies which bio specimens have data across all experiments. - -```python -maeObj.completed_cases() -``` - -## replicated - -replicated identifies bio specimens that have multiple observations per experiment. - -```python -maeObj.replicated() -``` diff --git a/setup.cfg b/setup.cfg index e790e8e..5ee1f2b 100644 --- a/setup.cfg +++ b/setup.cfg @@ -49,9 +49,9 @@ python_requires = >=3.8 # For more information, check out https://semver.org/. install_requires = importlib-metadata; python_version<"3.8" - biocframe>=0.5.6,<0.6.0 - biocutils>=0.1.4,<0.2.0 - summarizedexperiment>=0.4.1,<0.5.0 + biocframe>=0.5.6 + biocutils>=0.1.4 + summarizedexperiment>=0.4.5 [options.packages.find] where = src