docs

cgoliver committed Sep 19, 2024
1 parent 5b245d1, commit f8d37b1
Showing 9 changed files with 497 additions and 50 deletions.
29 changes: 29 additions & 0 deletions docs/source/command_line.rst
Command Line Utilities
-------------------------


We provide several command line utilities which you can use to set up
the rnaglib environment.


Database building
~~~~~~~~~~~~~~~~~~~~~~~~

To build or update a local database of RNA structures along with their annotations,
you can use the ``rnaglib_prepare_data`` command line utility.


::

    $ rnaglib_prepare_data -s structures/ --tag first_build -o builds/ -d

Database Indexing
~~~~~~~~~~~~~~~~~~~

Indexing a database collects information about the annotations present in it,
enabling rapid access to particular RNAs with some desired properties::

    $ rnaglib_index
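
Conceptually, an index inverts the annotations: instead of asking each RNA for its properties, you ask a property for its RNAs. A minimal sketch of the idea in plain Python (illustrative only; the actual rnaglib index format may differ):

```python
# Sketch of an annotation index (NOT the actual rnaglib format):
# map (attribute, value) pairs to the set of RNAs carrying them,
# so lookups by property need no scan over the whole database.
def build_index(annotations):
    """annotations: dict mapping RNA name -> dict of attributes."""
    index = {}
    for name, attrs in annotations.items():
        for attr, value in attrs.items():
            index.setdefault((attr, value), set()).add(name)
    return index

annotations = {
    "1abc": {"has_modification": True, "resolution_ok": True},
    "2xyz": {"has_modification": False, "resolution_ok": True},
}
index = build_index(annotations)
# All RNAs with at least one chemical modification:
hits = index[("has_modification", True)]
```

A lookup like ``index[("has_modification", True)]`` then returns the matching RNA names directly.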


11 changes: 2 additions & 9 deletions docs/source/index.rst
   :caption: Tutorials
   :hidden:

   Machine Learning on Benchmark Datasets <tuto_tasks>
   What is an RNA 2.5D graph? <what_is>
   A tour of RNA 2.5D graphs <tuto_2.5d>
   ML Tasks API <tuto_custom_task>
   RNA Transforms <tuto_transforms>
   How is the data built? <tuto_build>


251 changes: 243 additions & 8 deletions docs/source/tuto_custom_task.rst
Using Tasks API
--------------------------

An ``rnaglib.Task`` object packages everything you need to train a model for a particular biological problem.

The key components of a ``Task`` you will use are:

* A dedicated dataset for the task
* Train/validation/test dataloaders
* Model evaluator

When you instantiate the task, it either calls the ``process()`` and ``split()`` methods to compute the necessary data or, if you have already run it before, loads the result stored in the ``root`` directory, which should be nearly instantaneous.

Once loading is complete, you only need to select a tensor representation (e.g. graph, voxel, point cloud), encode the underlying dataset with it, and iterate through the train loader. Note that whenever you update the task's dataset you should call ``set_loaders()`` so that the changes are reflected in the data served by the loaders::



    from rnaglib.tasks import ChemicalModification
    from rnaglib.transforms import GraphRepresentation

    ta = ChemicalModification(root='cm')
    ta.dataset.add_representation(GraphRepresentation(framework='pyg'))
    ta.set_loaders()

    for batch in ta.train_loader:
        pred = ta.dummy_model(batch['graph'])
        ...

    metrics = ta.evaluate(ta.dummy_model)


Once training is complete, you can pass your model to the task's ``evaluate()`` method, which returns a dictionary of metrics and performance values.

.. note::

Each task provides a ``dummy_model`` variable which you can use for testing out the task. It simply returns a random prediction of the appropriate shape.



Building Custom Tasks
-------------------------------------

If you would like to propose a new prediction task for the machine learning community, you just have to implement a few methods in a subclass of the ``Task`` class.

An instance of the ``Task`` class packages the following attributes:

- ``dataset``: full collection of RNAs to use in the task.
- ``splitter``: method for partitioning the dataset into train, validation, and test subsets.
- ``get_task_vars``: method for setting and encoding input and target variables.
- ``evaluate``: method which accepts a model and returns performance metrics.

Once the task processing is complete, all task data is dumped into ``root``, a path passed to the task's init method.

Here is a minimal template for a custom task::

    from rnaglib.tasks import Task
    from rnaglib.data_loading import RNADataset
    from rnaglib.splitters import Splitter
    from rnaglib.transforms import FeaturesComputer

    class MyTask(Task):

        def __init__(self, root):
            super().__init__(root)

        def process(self) -> RNADataset:
            # build the task's dataset
            # ...
            pass

        @property
        def default_splitter(self) -> Splitter:
            # return a splitter object to build train/val/test splits
            # ...
            pass

        def get_task_vars(self) -> FeaturesComputer:
            # compute the task's default input and target variables,
            # managed by creating a FeaturesComputer object
            # ...
            pass


In this tutorial we will walk through the steps to create a task whose aim is to predict, for each residue, whether or not it is chemically modified. In a more advanced example, we will build the task of predicting the Rfam classification of an RNA.

Types of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tasks can operate at the residue, edge, or whole-RNA level.
The boilerplate for evaluation and loading depends on the chosen level.
For that reason we provide sub-classes of the ``Task`` class which you can use to avoid re-coding such things.


Since chemical modifications are applied to residues, let's build a residue-level binary classification task::

    from rnaglib.tasks import ResidueClassificationTask

    class ChemicalModification(ResidueClassificationTask):
        ...




1. Create the task's dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each task needs to define which RNAs to use. Typically this involves filtering the full set of available RNAs to retain only those that contain certain annotations or pass criteria such as size, origin, or resolution.

You are free to do this in any way you like as long as after ``Task.process()`` is called, a list of ``.json`` graphs storing the RNA annotations the task needs is dumped into ``{root}/dataset``.

To make things easier you can take advantage of the ``rnaglib.Transforms`` library, which provides functionality for manipulating datasets of RNAs.

Let's define a ``Task.process()`` method which builds a dataset with a single criterion:

* Only keep RNAs that contain at least one chemically modified residue

The ``Transforms`` library provides a filter which checks that an RNA contains residues with a desired attribute value::

    from rnaglib.data_loading import RNADataset
    from rnaglib.tasks import ResidueClassificationTask
    from rnaglib.transforms import ResidueAttributeFilter
    from rnaglib.transforms import PDBIDNameTransform

    class ChemicalModification(ResidueClassificationTask):
        def process(self) -> RNADataset:
            # grab a full set of available RNAs
            rnas = RNADataset()

            # keep only RNAs with at least one modified residue
            res_filter = ResidueAttributeFilter(attribute='is_modified',
                                                value_checker=lambda val: val == True)
            rnas = res_filter(rnas)

            # name each RNA by its PDBID
            rnas = PDBIDNameTransform()(rnas)
            dataset = RNADataset(rnas=[r["rna"] for r in rnas])
            return dataset


Applying the filter gives us a new list containing only the RNAs that passed. The last thing we need to do is assign a ``name`` value to each RNA so that they can be properly managed by the ``RNADataset``; we use the ``PDBIDNameTransform`` to assign each item its PDBID as a name.

Now we create a new ``RNADataset`` object from the reduced list. The dataset object requires a list and not a generator, so we unroll the generator before passing it.

That's it: now you just return the new ``RNADataset`` object.
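
The filter-then-unroll pattern can be illustrated with plain-Python stand-ins (the ``ToyFilter`` class below is hypothetical, not part of the rnaglib API): a filter is a callable that yields passing items, and the resulting generator must be unrolled into a list before building the dataset:

```python
# Hypothetical stand-in for a Transforms-style filter: a callable
# that yields only the items passing a predicate.
class ToyFilter:
    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, items):
        # yields lazily, like a generator-based transform
        for item in items:
            if self.predicate(item):
                yield item

rnas = [
    {"name": "1abc", "is_modified": True},
    {"name": "2xyz", "is_modified": False},
]
keep_modified = ToyFilter(lambda rna: rna["is_modified"])
# Unroll the generator into a list before handing it to a dataset:
kept = list(keep_modified(rnas))
```

Here ``kept`` contains only the first RNA, since it is the only one passing the predicate.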

2. Set the task's variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Apart from the RNAs themselves, the task needs to know which variables are relevant. In particular we need to set the prediction target. Additionally we can set some default input features, which are always provided. The user can always add more input features if they desire by manipulating ``task.dataset.features_computer``, but at a minimum we need to define target variables::

    from rnaglib.data_loading import RNADataset
    from rnaglib.tasks import ResidueClassificationTask
    from rnaglib.transforms import ResidueAttributeFilter
    from rnaglib.transforms import PDBIDNameTransform
    from rnaglib.transforms import FeaturesComputer

    class ChemicalModification(ResidueClassificationTask):
        def process(self) -> RNADataset:
            ...

        def get_task_vars(self) -> FeaturesComputer:
            return FeaturesComputer(nt_features=['nt_code'], nt_targets=['is_modified'])


Here we simply have a nucleotide-level target, so we pass the ``'is_modified'`` attribute to the ``FeaturesComputer`` object, which takes care of selecting the residue attribute when encoding the RNA into tensor form. In addition we provide the nucleotide identity (``'nt_code'``) as a default input feature.
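
Schematically, the encoding a ``FeaturesComputer`` performs looks like the following toy sketch (illustrative only, not rnaglib's actual implementation): the input feature ``nt_code`` becomes a one-hot vector and the target ``is_modified`` becomes a binary label:

```python
# Toy sketch of residue-level feature/target encoding
# (not rnaglib's actual implementation).
NT_CODES = ["A", "C", "G", "U"]

def encode_residue(nt_code, is_modified):
    # one-hot encode the nucleotide identity
    one_hot = [1.0 if nt == nt_code else 0.0 for nt in NT_CODES]
    # binary target for chemical modification
    target = 1 if is_modified else 0
    return one_hot, target

features, target = encode_residue("G", True)
# features is a 4-dim one-hot vector, target is 0 or 1
```

Stacking such per-residue vectors over a whole RNA is what produces the tensors served by the loaders.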


3. Train/val/test splits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The last necessary step is to define the train, validation and test subsets of the whole dataset. Once these are set, the task's boilerplate will take care of generating the appropriate loaders.

To set the splits, you implement the ``default_splitter()`` method which returns a ``Splitter`` object. A ``Splitter`` object is simply a callable which accepts a dataset and returns three lists of indices representing the train, validation and test subsets.

You can select from the library of implemented splitters or implement your own.

For this example, we will split the RNAs by structural similarity using RNA-align::

    from rnaglib.data_loading import RNADataset
    from rnaglib.tasks import ResidueClassificationTask

    from rnaglib.transforms import ResidueAttributeFilter
    from rnaglib.transforms import PDBIDNameTransform
    from rnaglib.transforms import FeaturesComputer

    from rnaglib.splitters import Splitter, RNAalignSplitter

    class ChemicalModification(ResidueClassificationTask):
        def process(self) -> RNADataset:
            ...

        def get_task_vars(self) -> FeaturesComputer:
            return FeaturesComputer(nt_features=['nt_code'], nt_targets=['is_modified'])

        @property
        def default_splitter(self) -> Splitter:
            return RNAalignSplitter(similarity_threshold=0.6)


Now our splits will guarantee a maximum structural similarity of 0.6 between them.

Check out the ``Splitter`` class for a quick guide on how to create your own splitters.

Note that this only sets the default method for splitting the dataset. If a user wants to try a different splitter, it can be passed to the task's init.

That's it! Your task is now fully defined and can be used in model training and evaluation.

Here is the full task implementation::


    from rnaglib.data_loading import RNADataset
    from rnaglib.tasks import ResidueClassificationTask
    from rnaglib.transforms import FeaturesComputer
    from rnaglib.transforms import ResidueAttributeFilter
    from rnaglib.transforms import PDBIDNameTransform
    from rnaglib.splitters import Splitter, RNAalignSplitter


    class ChemicalModification(ResidueClassificationTask):
        """Residue-level binary classification task to predict whether or not a given
        residue is chemically modified.
        """

        target_var = "is_modified"

        def __init__(self, root, splitter=None, **kwargs):
            super().__init__(root=root, splitter=splitter, **kwargs)

        def get_task_vars(self):
            return FeaturesComputer(nt_targets=self.target_var)

        def process(self):
            rnas = ResidueAttributeFilter(
                attribute=self.target_var, value_checker=lambda val: val == True
            )(RNADataset(debug=self.debug))
            rnas = PDBIDNameTransform()(rnas)
            dataset = RNADataset(rnas=[r["rna"] for r in rnas])
            return dataset

        @property
        def default_splitter(self) -> Splitter:
            return RNAalignSplitter(similarity_threshold=0.6)


Customize Splitting
------------------------

We provide some pre-defined splitters for sequence and structure-based splitting. If you have other criteria for splitting you can subclass the ``Splitter`` class. All you have to do is implement the ``__call__()`` method which takes a dataset and returns three lists of indices::

    class Splitter:
        def __init__(self, split_train=0.7, split_valid=0.15, split_test=0.15):
            assert sum([split_train, split_valid, split_test]) == 1, "Splits don't sum to 1."
            self.split_train = split_train
            self.split_valid = split_valid
            self.split_test = split_test

        def __call__(self, dataset):
            return None, None, None

The ``__call__(self, dataset)`` method returns three lists of indices from the given ``dataset`` object.

The splitter can be initialized with the desired proportions of the dataset for each subset.
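
As an illustration, here is a hypothetical ``RandomSplitter`` built on this skeleton (the ``Splitter`` base class is repeated so the sketch runs standalone; in practice you would subclass the one shipped with rnaglib):

```python
import random

# Base class repeated from above so this sketch is self-contained.
class Splitter:
    def __init__(self, split_train=0.7, split_valid=0.15, split_test=0.15):
        assert sum([split_train, split_valid, split_test]) == 1, "Splits don't sum to 1."
        self.split_train = split_train
        self.split_valid = split_valid
        self.split_test = split_test

    def __call__(self, dataset):
        return None, None, None

class RandomSplitter(Splitter):
    """Hypothetical splitter assigning dataset indices uniformly at random."""

    def __init__(self, seed=0, **kwargs):
        super().__init__(**kwargs)
        self.seed = seed

    def __call__(self, dataset):
        indices = list(range(len(dataset)))
        # seeded shuffle so splits are reproducible
        random.Random(self.seed).shuffle(indices)
        n_train = int(self.split_train * len(indices))
        n_valid = int(self.split_valid * len(indices))
        train = indices[:n_train]
        valid = indices[n_train:n_train + n_valid]
        test = indices[n_train + n_valid:]
        return train, valid, test

splitter = RandomSplitter(seed=42, split_train=0.5, split_valid=0.25, split_test=0.25)
train, valid, test = splitter(range(100))
```

The three returned lists are disjoint and together cover every index of the dataset.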