docs: improved readSCP vignette upon refactoring

UCLouvain-CBIO · Apr 10, 2024 · 22d0ae9
1 parent 5e49dd3
commit 22d0ae9

Showing 9 changed files with 418 additions and 383 deletions.
diff --git a/vignettes/figures/readSCP_step1.png b/vignettes/figures/readSCP_step1.png
diff --git a/vignettes/figures/readSCP_step1.svg b/vignettes/figures/readSCP_step1.svg
diff --git a/vignettes/figures/readSCP_step2.png b/vignettes/figures/readSCP_step2.png
diff --git a/vignettes/figures/readSCP_step2.svg b/vignettes/figures/readSCP_step2.svg
diff --git a/vignettes/figures/readSCP_step3.png b/vignettes/figures/readSCP_step3.png
diff --git a/vignettes/figures/readSCP_step3.svg b/vignettes/figures/readSCP_step3.svg
diff --git a/vignettes/figures/readSCP_step4.png b/vignettes/figures/readSCP_step4.png
diff --git a/vignettes/figures/readSCP_step4.svg b/vignettes/figures/readSCP_step4.svg
diff --git a/vignettes/read_scp.Rmd b/vignettes/read_scp.Rmd
@@ -44,27 +44,24 @@ Because mass spectrometry (MS)-based single-cell proteomics (SCP) only
 captures the proteome of between one and a few tens of single-cells in
 a single run, the data is usually acquired across many MS batches.
 Therefore, the data for each run should conceptually be stored in its
-own container, that we here call an *assay*. The expected input for
+own container, that we here call a *set*. The expected input for
 working with the `scp` package is quantification data of peptide to
 spectrum matches (PSM). These data can then be processed to reconstruct
 peptide and protein data. The links between related features across
-different assays are stored to facilitate manipulation and
+different sets are stored to facilitate manipulation and
 visualization of PSM, peptide and protein data. This is
 conceptually shown below.
 
 ```{r, fig.cap="The `scp` framework relies on `SingleCellExperiment` and `QFeatures` objects", echo=FALSE, out.width='100%', fig.align='center'}
 knitr::include_graphics("./figures/SCP_framework.png")
 ```
 
-There are two input tables required for starting an analysis with
-`scp`:
+The main input table required for starting an analysis with `scp` is 
+called the `assayData`.
 
-1. The input table
-2. The sample table
+# `assayData` table
 
-# Input table
-
-The input table is generated after the identification and
+The `assayData` table is generated after the identification and
 quantification of the MS spectra by a pre-processing software such as
 MaxQuant, ProteomeDiscoverer or MSFragger (the
 [list](https://en.wikipedia.org/wiki/List_of_mass_spectrometry_software)
@@ -84,20 +81,20 @@ In this toy example, there are 1361 rows corresponding to features
 fields recorded by MaxQuant during the processing of the MS spectra.
 There are three types of columns:
 
-- Feature quantification: 1 to n (depending on technology)
+- Quantification columns (`quantCols`): 1 to n (depending on technology)
+- Run identifier column (`runCol`): *e.g.* file name
 - Feature annotations: *e.g.* peptide sequence, ion charge, protein name
-- Acquisition annotations: *e.g.* file name
 
-```{r, echo=FALSE, out.width='60%', fig.cap="Conceptual representation of the input table", fig.align = 'center'}
+```{r, echo=FALSE, out.width="450px", fig.cap="Conceptual representation of the `assayData` input table", fig.align = 'center'}
 knitr::include_graphics('figures/readSCP_inputTable.png')
 ```
 
-### Feature quantifications
+### Quantification columns (`quantCols`)
 
 The quantification data can be composed of one (in case of label-free
-acquisition) up to 16 columns (in case of TMT-16 multiplexing). The
-columns holding the quantification start with `Reporter.intensity.`
-followed by a number.
+acquisition) or multiple columns (in case of multiplexing). In the
+example data set, the columns holding the quantification, the 
+`quantCols`, start with `Reporter.intensity.` followed by a number.
 
 ```{r}
 (quantCols <- grep("Reporter.intensity.\\d", colnames(mqScpData),
@@ -113,10 +110,19 @@ come back to this later.
 head(mqScpData[, quantCols])
 ```
 
+### Run identifier column (`runCol`)
+
+This column provides the identifier of the MS runs in which each PSM 
+was acquired. MaxQuant uses the raw file name to identify the run.
+
+```{r}
+unique(mqScpData$Raw.file)
+```
+
 ### Feature annotations
 
-Most columns in the `mqScpData` table contain information used or
-generated during the identification of the MS spectra. For instance,
+The remaining columns in the `mqScpData` table contain information used
+or generated during the identification of the MS spectra. For instance,
 you may find the charge of the parent ion, the score and probability
 of a correct match between the MS spectrum and a peptide sequence, the
 sequence of the best matching peptide, its length, its modifications,
@@ -128,20 +134,9 @@ head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
                    "Retention.time", "Proteins")])
 ```
 
-### Acquisition annotations
-
-This type of annotation is related to the MS instrument. In MaxQuant,
-only the file name generated by the MS instrument is stored. There is one
-file for each MS run, hence the file name can be used as a batch
-identifier.
-
-```{r}
-unique(mqScpData$Raw.file)
-```
-
-# Sample table
+# `colData` table
 
-The sample table contains the experimental design generated by the
+The `colData` table contains the experimental design generated by the
 researcher. Each row of the sample table corresponds to a sample in
 the experiment and the columns correspond to the available annotations
 about the sample. We will here use the second example table:
@@ -151,100 +146,91 @@ data("sampleAnnotation")
 head(sampleAnnotation)
 ```
 
-This table may contain any information about the samples. For example,
+The `colData` table may contain any information about the samples. For example,
 useful information could be the type of sample that is analysed, a
 phenotype known from the experimental design, the MS batch, the
 acquisition date, MS settings used to acquire the sample, the LC
 batch, the sample preparation batch, etc... However, `scp`
-**requires** 2 specific fields in the sample annotations:
+**requires** 2 specific columns in the `colData` table:
 
-1. One column that tells `scp` the names of the columns in the feature
+1. `runCol`: this column provides the MS run names (that match the 
+   `Raw.file` column in the `assayData` table).
+2. `quantCols`: this column tells `scp` the name of the column in the feature
    data that holds the quantification of the corresponding sample.
-2. One column containing the MS run names (`Raw.file` in this case).
-   It must have the same name as the name of the column containing the
-   MS run names in the quantification table.
 
 These two columns allow `scp` to correctly split and match data that
 were acquired across multiple acquisition runs.
 
-```{r echo=FALSE, out.width='60%', fig.cap="Conceptual representation of the sample table", fig.align = 'center'}
+```{r echo=FALSE, out.width='450px', fig.cap="Conceptual representation of the sample table", fig.align = 'center'}
 knitr::include_graphics('figures/readSCP_sampleTable.png')
 ```
 
-# `readSCP`
+# `readSCP()`
 
-`readSCP` is the function that converts the sample and the feature
-data into a `QFeatures` object following the data structure described
+`readSCP` is the function that converts the `assayData` and the 
+`colData` into a `QFeatures` object following the data structure described
 above, that is storing the data belonging to each MS batch in a
-separate `SingleCellExperiment` object. We therefore provide the
-feature data, the sample data to the function as well as the name of
-the column that holds the batch name in both tables and the name of
-the column in the sample data that points to the quantification
-columns in the feature data.
+separate `SingleCellExperiment` object.
 
 ## Sample names
 
 `readSCP()` automatically assigns names that are unique across all
-samples in all assays. This is performed by appending the name of the
-batch where a given sample is found in with the name of the
+samples in all sets. This is performed by combining the name of the
+MS run where a given sample is found with the name of the
 quantification column for that sample. Suppose a sample belongs to
 batch `190222S_LCA9_X_FP94BM` and the quantification values in the
-feature data are found in the column called `Reporter.intensity.3`,
+`assayData` table are found in the column called `Reporter.intensity.3`,
 then the sample name will become
-`190222S_LCA9_X_FP94BMReporter.intensity.3`.
+`190222S_LCA9_X_FP94BM_Reporter.intensity.3`.
 
 ## Special case: empty samples
 
 In some rare cases, it can be beneficial to remove empty samples (all
-quantifications are `NA`) from the assays. Such samples can occur when
+quantifications are `NA`) from the sets. Such samples can occur when
 samples that were acquired with different multiplexing labels are
 merged in a single table. For instance, the SCoPE2 data we provide as
 an example contains runs that were acquired with two TMT protocols.
-The 3 first assays were acquired using the TMT-11 protocol and the
-last assay was acquired using a TMT-16 protocol. When exporting the
-table, the authors combined the data in a single table, were missing
-channels in the TMT-11 data are filled with `NA`. This is essential
-when working in table format, but since `scp` keeps the runs separated
-we can allow for different numbers of channels per run.  When setting
+The first 3 sets were acquired using the TMT-11 protocol and the
+last set was acquired using a TMT-16 protocol. The missing label
+channels in the TMT-11 data are filled with `NA`s. When setting
 `removeEmptyCols = TRUE`, `readSCP` automatically detects and removes
-columns that contain only `NA`s,
+columns containing only `NA`s.
 
 ## Running `readSCP`
 
 We convert the sample and the feature data into a `QFeatures` object
 in a single command thanks to `readSCP`.
 
 ```{r readSCP}
-scp <- readSCP(assayData = mqScpData,
+(scp <- readSCP(assayData = mqScpData,
                colData = sampleAnnotation,
                runCol = "Raw.file",
-               removeEmptyCols = TRUE)
-scp
+               removeEmptyCols = TRUE))
 ```
 
-We can see that the object returned by `readSCP()` is a `QFeatures`
-object containing 4 `SingleCellExperiment` assays that have been named
-after the 4 MS batches. Each assay contains either 11 or 16 columns
+The object returned by `readSCP()` is a `QFeatures`
+object containing 4 `SingleCellExperiment` sets that have been named
+after the 4 MS batches. Each set contains either 11 or 16 columns
 (samples) depending on the TMT labelling strategy and a variable
 number of rows (quantified PSMs). Each piece of information can easily
 be retrieved thanks to the `QFeatures` architecture.  As mentioned in
-the previous vignette, sample data is retrieved from the
-`colData`. Note that unique sample names were automatically generated
-by combining the batch name and quantitative column name:
+another
+[vignette](https://uclouvain-cbio.github.io/scp/articles/QFeatures_nutshell.html),
+the `colData` is retrieved using its dedicated function:
 
 ```{r colData}
 head(colData(scp))
 ```
 
 The feature annotations are retrieved from the `rowData`. Since the
-feature annotations are specific to each assay, we need to tell from
-which assay we want to get the `rowData`:
+feature annotations are specific to each set, we need to tell from
+which set we want to get the `rowData`:
 
 ```{r rowData}
 head(rowData(scp[["190222S_LCA9_X_FP94BM"]]))[, 1:5]
 ```
 
-Finally, we can also retrieve the quantification matrix for an assay
+Finally, we can also retrieve the quantification matrix for a set
 of interest:
 
 ```{r assay}
@@ -255,41 +241,42 @@ head(assay(scp, "190222S_LCA9_X_FP94BM"))
 
 `readSCP` proceeds as follows:
 
-1. The table must be provided by the user as a `data.frame` before
-   being converted to a `SingleCellExperiment` object. `readSCP()`
-   needs to know in which field(s) the quantitative data is
-   stored. Those field name(s) is/are provided by the `quantCols`
+1. The `assayData` table must be provided as a `data.frame`. 
+   `readSCP()` converts the table to a `SingleCellExperiment` object 
+   but it needs to know which column(s) store the quantitative data.
+   Those column name(s) is/are provided by the `quantCols`
    field in the annotation table (`colData` argument).
 
-```{r echo=FALSE, out.width='60%', fig.cap="Step1: Convert the input table to a `SingleCellExperiment` object", fig.align = 'center'}
+```{r echo=FALSE, out.width='450px', include=TRUE, fig.cap="Step1: Convert the input table to a `SingleCellExperiment` object", fig.align = 'center'}
 knitr::include_graphics('figures/readSCP_step1.png')
 ```
 
 2. The `SingleCellExperiment` object is then split according to the
    acquisition run. The split is performed depending on the `runCol`
    field in `assayData`. It is also indicated in the `runCol`
-   argument. In this case the data will be split according to the
+   argument. In this case, the data will be split according to the
    `Raw.file` column in `mqScpData`. `Raw.file` contains the names of
    the acquisition runs that were used by MaxQuant to retrieve the raw
    data files.
 
-```{r echo=FALSE, out.width='65%', fig.cap="Step2: Split by acquisition run", fig.align = 'center'}
+```{r echo=FALSE, out.width='500px', fig.cap="Step2: Split by acquisition run", fig.align = 'center'}
 knitr::include_graphics('figures/readSCP_step2.png')
 ```
 
 3. The sample annotations are generated from the supplied sample table
    (`colData` argument). Note that in order for `readSCP()` to
    correctly match the feature data with the annotations, `colData`
-   must contain the same `runCol` field with batch names.
+   must contain a `runCol` column with run names and a `quantCols` 
+   column with the names of the quantitative columns in `assayData`.
 
-```{r echo=FALSE, out.width='100%', fig.cap="Step3: Adding and matching the sample annotations", fig.align = 'center'}
+```{r echo=FALSE, out.width='700px', fig.cap="Step3: Adding and matching the sample annotations", fig.align = 'center'}
 knitr::include_graphics('figures/readSCP_step3.png')
 ```
 
-4. Finally, the split feature data and the sample annotations are
-   stored in a single `QFeatures` object.
+4. Finally, the `SingleCellExperiment` sets and the `colData` are
+   converted to a `QFeatures` object.
 
-```{r echo=FALSE, out.width='80%', fig.cap="Step4: Convert to a `QFeatures`", fig.align = 'center'}
+```{r echo=FALSE, out.width='600px', fig.cap="Step4: Converting to a `QFeatures`", fig.align = 'center'}
 knitr::include_graphics('figures/readSCP_step4.png')
 ```
 
@@ -302,7 +289,7 @@ sample data must also contain a column that points to the columns of
 the feature data that contains the quantifications. Since label-free
 SCP acquires one single-cell per run, this sample data column should
 point to the same column for all samples. Moreover, this means that each
-PSM assay will contain a single column.
+PSM set will contain a single column.
 
 # What about other input formats?
 
@@ -324,7 +311,7 @@ plexDIA/mTRAQ `Report.tsv` files generated by DIA-NN.
 
 For more information, see the `readQFeatures()` and
 `readQFeaturesFromDIANN()` manual pages, that describe the main
-principle that convern the data import and formatting.
+principles that concern the data import and formatting.
 
 # Need help?