docs: improved readSCP vignette upon refactoring
cvanderaa committed Apr 10, 2024
1 parent 5e49dd3 commit 22d0ae9
Showing 9 changed files with 418 additions and 383 deletions.
Binary file modified vignettes/figures/readSCP_step1.png
142 changes: 77 additions & 65 deletions vignettes/figures/readSCP_step1.svg
Binary file modified vignettes/figures/readSCP_step2.png
152 changes: 82 additions & 70 deletions vignettes/figures/readSCP_step2.svg
Binary file modified vignettes/figures/readSCP_step3.png
175 changes: 95 additions & 80 deletions vignettes/figures/readSCP_step3.svg
Binary file modified vignettes/figures/readSCP_step4.png
177 changes: 93 additions & 84 deletions vignettes/figures/readSCP_step4.svg
155 changes: 71 additions & 84 deletions vignettes/read_scp.Rmd
@@ -44,27 +44,24 @@ Because mass spectrometry (MS)-based single-cell proteomics (SCP) only
captures the proteome of between one and a few tens of single-cells in
a single run, the data is usually acquired across many MS batches.
Therefore, the data for each run should conceptually be stored in its
own container, that we here call an *assay*. The expected input for
own container, that we here call a *set*. The expected input for
working with the `scp` package is quantification data of peptide to
spectrum matches (PSM). These data can then be processed to reconstruct
peptide and protein data. The links between related features across
different assays are stored to facilitate manipulation and
different sets are stored to facilitate manipulation and
visualization of PSM, peptide and protein data. This is
conceptually shown below.

```{r, fig.cap="The `scp` framework relies on `SingleCellExperiment` and `QFeatures` objects", echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("./figures/SCP_framework.png")
```
There are two input tables required for starting an analysis with
`scp`:
The main input table required for starting an analysis with `scp` is
called the `assayData`.
1. The input table
2. The sample table
# `assayData` table
# Input table
The input table is generated after the identification and
The `assayData` table is generated after the identification and
quantification of the MS spectra by a pre-processing software such as
MaxQuant, ProteomeDiscoverer or MSFragger (the
[list](https://en.wikipedia.org/wiki/List_of_mass_spectrometry_software)
@@ -84,20 +81,20 @@ In this toy example, there are 1361 rows corresponding to features
fields recorded by MaxQuant during the processing of the MS spectra.
There are three types of columns:

- Feature quantification: 1 to n (depending on technology)
- Quantification columns (`quantCols`): 1 to n (depending on technology)
- Run identifier column (`runCol`): *e.g.* file name
- Feature annotations: *e.g.* peptide sequence, ion charge, protein name
- Acquisition annotations: *e.g.* file name

```{r, echo=FALSE, out.width='60%', fig.cap="Conceptual representation of the input table", fig.align = 'center'}
```{r, echo=FALSE, out.width="450px", fig.cap="Conceptual representation of the `assayData` input table", fig.align = 'center'}
knitr::include_graphics('figures/readSCP_inputTable.png')
```
### Feature quantifications
### Quantification columns (`quantCols`)
The quantification data can be composed of one (in case of label-free
acquisition) up to 16 columns (in case of TMT-16 multiplexing). The
columns holding the quantification start with `Reporter.intensity.`
followed by a number.
acquisition) or multiple columns (in case of multiplexing). In the
example data set, the columns holding the quantification, the
`quantCols`, start with `Reporter.intensity.` followed by a number.
```{r}
(quantCols <- grep("Reporter.intensity.\\d", colnames(mqScpData),
@@ -113,10 +110,19 @@ come back to this later.
head(mqScpData[, quantCols])
```
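
The column selection above can be tried without the example data. The following is a minimal sketch on a made-up table: the object names `toyPsms` and `toyQuantCols` and all values are assumptions for illustration, and the pattern is anchored more strictly than in the vignette so that similarly named columns are not picked up.

```r
# Toy PSM table mimicking a MaxQuant export (all values are invented).
toyPsms <- data.frame(
    Raw.file = c("run1", "run1", "run2"),
    Sequence = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEA"),
    Reporter.intensity.1 = c(10, 20, 30),
    Reporter.intensity.2 = c(11, NA, 31),
    Reporter.intensity.count.1 = c(1, 1, 1)
)

# Keep only the columns named "Reporter.intensity.<number>".
# The anchors (^ and $) exclude "Reporter.intensity.count.1".
toyQuantCols <- grep("^Reporter\\.intensity\\.[0-9]+$",
                     colnames(toyPsms), value = TRUE)
toyQuantCols
```

Escaping the dots and anchoring the pattern is a design choice for the sketch: an unanchored pattern may be sufficient when no look-alike columns exist in the table.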

### Run identifier column (`runCol`)

This column provides the identifier of the MS runs in which each PSM
was acquired. MaxQuant uses the raw file name to identify the run.

```{r}
unique(mqScpData$Raw.file)
```

### Feature annotations

Most columns in the `mqScpData` table contain information used or
generated during the identification of the MS spectra. For instance,
The remaining columns in the `mqScpData` table contain information used
or generated during the identification of the MS spectra. For instance,
you may find the charge of the parent ion, the score and probability
of a correct match between the MS spectrum and a peptide sequence, the
sequence of the best matching peptide, its length, its modifications,
@@ -128,20 +134,9 @@ head(mqScpData[, c("Charge", "Score", "PEP", "Sequence", "Length",
"Retention.time", "Proteins")])
```

### Acquisition annotations

This type of annotation is related to the MS instrument. In MaxQuant,
only the file name generated by the MS instrument is stored. There is one
file for each MS run, hence the file name can be used as a batch
identifier.

```{r}
unique(mqScpData$Raw.file)
```

# Sample table
# `colData` table

The sample table contains the experimental design generated by the
The `colData` table contains the experimental design generated by the
researcher. The rows of the sample table correspond to a sample in
the experiment and the columns correspond to the available annotations
about the sample. We will here use the second example table:
@@ -151,100 +146,91 @@ data("sampleAnnotation")
head(sampleAnnotation)
```

This table may contain any information about the samples. For example,
The `colData` table may contain any information about the samples. For example,
useful information could be the type of sample that is analysed, a
phenotype known from the experimental design, the MS batch, the
acquisition date, MS settings used to acquire the sample, the LC
batch, the sample preparation batch, etc... However, `scp`
**requires** 2 specific fields in the sample annotations:
**requires** 2 specific columns in the `colData` table:

1. One column that tells `scp` the names of the columns in the feature
1. `runCol`: this column provides the MS run names (that match the
`Raw.file` column in the `assayData` table).
2. `quantCols`: this column tells `scp` the names of the columns in the feature
data that hold the quantification of the corresponding sample.
2. One column containing the MS run names (`Raw.file` in this case).
It must have the same name as the name of the column containing the
MS run names in the quantification table.

These two columns allow `scp` to correctly split and match data that
were acquired across multiple acquisition runs.

```{r echo=FALSE, out.width='60%', fig.cap="Conceptual representation of the sample table", fig.align = 'center'}
```{r echo=FALSE, out.width='450px', fig.cap="Conceptual representation of the sample table", fig.align = 'center'}
knitr::include_graphics('figures/readSCP_sampleTable.png')
```

# `readSCP`
# `readSCP()`

`readSCP` is the function that converts the sample and the feature
data into a `QFeatures` object following the data structure described
`readSCP` is the function that converts the `assayData` and the
`colData` into a `QFeatures` object following the data structure described
above, that is storing the data belonging to each MS batch in a
separate `SingleCellExperiment` object. We therefore provide the
feature data, the sample data to the function as well as the name of
the column that holds the batch name in both tables and the name of
the column in the sample data that points to the quantification
columns in the feature data.
separate `SingleCellExperiment` object.

## Sample names

`readSCP()` automatically assigns names that are unique across all
samples in all assays. This is performed by appending the name of the
batch where a given sample is found in with the name of the
samples in all sets. This is performed by appending the name of the
MS run where a given sample is found with the name of the
quantification column for that sample. Suppose a sample belongs to
batch `190222S_LCA9_X_FP94BM` and the quantification values in the
feature data are found in the column called `Reporter.intensity.3`,
`assayData` table are found in the column called `Reporter.intensity.3`,
then the sample name will become
`190222S_LCA9_X_FP94BMReporter.intensity.3`.
`190222S_LCA9_X_FP94BM_Reporter.intensity.3`.
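
The naming rule can be reproduced in base R. This is only a sketch of the convention described here (run name and quantification column name joined by an underscore), not the code used inside `readSCP()`; the variable names are assumptions.

```r
run <- "190222S_LCA9_X_FP94BM"
quantCol <- "Reporter.intensity.3"

# Join the run name and the quantification column name with "_".
sampleName <- paste(run, quantCol, sep = "_")
sampleName
```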

## Special case: empty samples

In some rare cases, it can be beneficial to remove empty samples (all
quantifications are `NA`) from the assays. Such samples can occur when
quantifications are `NA`) from the sets. Such samples can occur when
samples that were acquired with different multiplexing labels are
merged in a single table. For instance, the SCoPE2 data we provide as
an example contains runs that were acquired with two TMT protocols.
The 3 first assays were acquired using the TMT-11 protocol and the
last assay was acquired using a TMT-16 protocol. When exporting the
table, the authors combined the data in a single table, were missing
channels in the TMT-11 data are filled with `NA`. This is essential
when working in table format, but since `scp` keeps the runs separated
we can allow for different numbers of channels per run. When setting
The 3 first sets were acquired using the TMT-11 protocol and the
last set was acquired using a TMT-16 protocol. The missing label
channels in the TMT-11 data are filled with `NA`s. When setting
`removeEmptyCols = TRUE`, `readSCP` automatically detects and removes
columns that contain only `NA`s,
columns containing only `NA`s.
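
To make the notion of an empty sample concrete, here is a base R sketch that drops all-`NA` columns from a made-up quantification matrix. It mimics what `removeEmptyCols = TRUE` is described as doing, but it is an assumption about the behaviour, not the package's implementation; `toyQuant` and its column names are invented.

```r
# Toy quantification matrix: channel "TMT12" was never used in this run.
toyQuant <- cbind(TMT10 = c(1.2, NA, 3.1),
                  TMT11 = c(0.8, 2.4, NA),
                  TMT12 = c(NA, NA, NA))

# A column is empty when it contains no observed (non-NA) value.
isEmpty <- colSums(!is.na(toyQuant)) == 0
toyQuantClean <- toyQuant[, !isEmpty, drop = FALSE]
colnames(toyQuantClean)
```

Columns with some missing values are kept; only columns without any observed value are removed.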

## Running `readSCP`

We convert the sample and the feature data into a `QFeatures` object
in a single command thanks to `readSCP`.

```{r readSCP}
scp <- readSCP(assayData = mqScpData,
(scp <- readSCP(assayData = mqScpData,
colData = sampleAnnotation,
runCol = "Raw.file",
removeEmptyCols = TRUE)
scp
removeEmptyCols = TRUE))
```

We can see that the object returned by `readSCP()` is a `QFeatures`
object containing 4 `SingleCellExperiment` assays that have been named
after the 4 MS batches. Each assay contains either 11 or 16 columns
The object returned by `readSCP()` is a `QFeatures`
object containing 4 `SingleCellExperiment` sets that have been named
after the 4 MS batches. Each set contains either 11 or 16 columns
(samples) depending on the TMT labelling strategy and a variable
number of rows (quantified PSMs). Each piece of information can easily
be retrieved thanks to the `QFeatures` architecture. As mentioned in
the previous vignette, sample data is retrieved from the
`colData`. Note that unique sample names were automatically generated
by combining the batch name and quantitative column name:
another
[vignette](https://uclouvain-cbio.github.io/scp/articles/QFeatures_nutshell.html),
the `colData` is retrieved using its dedicated function:

```{r colData}
head(colData(scp))
```

The feature annotations are retrieved from the `rowData`. Since the
feature annotations are specific to each assay, we need to tell from
which assay we want to get the `rowData`:
feature annotations are specific to each set, we need to tell from
which set we want to get the `rowData`:

```{r rowData}
head(rowData(scp[["190222S_LCA9_X_FP94BM"]]))[, 1:5]
```

Finally, we can also retrieve the quantification matrix for an assay
Finally, we can also retrieve the quantification matrix for a set
of interest:

```{r assay}
@@ -255,41 +241,42 @@ head(assay(scp, "190222S_LCA9_X_FP94BM"))

`readSCP` proceeds as follows:

1. The table must be provided by the user as a `data.frame` before
being converted to a `SingleCellExperiment` object. `readSCP()`
needs to know in which field(s) the quantitative data is
stored. Those field name(s) is/are provided by the `quantCols`
1. The `assayData` table must be provided as a `data.frame`.
`readSCP()` converts the table to a `SingleCellExperiment` object
but it needs to know which column(s) store the quantitative data.
Those column name(s) is/are provided by the `quantCols`
field in the annotation table (`colData` argument).

```{r echo=FALSE, out.width='60%', fig.cap="Step1: Convert the input table to a `SingleCellExperiment` object", fig.align = 'center'}
```{r echo=FALSE, out.width='450px', include=TRUE,fig.cap="Step1: Convert the input table to a `SingleCellExperiment` object", fig.align = 'center'}
knitr::include_graphics('figures/readSCP_step1.png')
```
2. The `SingleCellExperiment` object is then split according to the
acquisition run. The split is performed depending on the `runCol`
field in `assayData`. It is also indicated in the `runCol`
argument. In this case the data will be split according to the
argument. In this case, the data will be split according to the
`Raw.file` column in `mqScpData`. `Raw.file` contains the names of
the acquisition runs that were used by MaxQuant to retrieve the raw
data files.
```{r echo=FALSE, out.width='65%', fig.cap="Step2: Split by acquisition run", fig.align = 'center'}
```{r echo=FALSE, out.width='500px', fig.cap="Step2: Split by acquisition run", fig.align = 'center'}
knitr::include_graphics('figures/readSCP_step2.png')
```

3. The sample annotations are generated from the supplied sample table
(`colData` argument). Note that in order for `readSCP()` to
correctly match the feature data with the annotations, `colData`
must contain the same `runCol` field with batch names.
must contain a `runCol` column with run names and a `quantCols`
column with the names of the quantitative columns in `assayData`.

```{r echo=FALSE, out.width='100%', fig.cap="Step3: Adding and matching the sample annotations", fig.align = 'center'}
```{r echo=FALSE, out.width='700px', fig.cap="Step3: Adding and matching the sample annotations", fig.align = 'center'}
knitr::include_graphics('figures/readSCP_step3.png')
```

4. Finally, the split feature data and the sample annotations are
stored in a single `QFeatures` object.
4. Finally, the `SingleCellExperiment` sets and the `colData` are
converted to a `QFeatures` object.

```{r echo=FALSE, out.width='80%', fig.cap="Step4: Convert to a `QFeatures`", fig.align = 'center'}
```{r echo=FALSE, out.width='600px', fig.cap="Step4: Converting to a `QFeatures`", fig.align = 'center'}
knitr::include_graphics('figures/readSCP_step4.png')
```
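
The splitting logic of the steps above can be mimicked in base R. The sketch below is a simplified stand-in under stated assumptions: it uses plain matrices instead of `SingleCellExperiment` and `QFeatures` objects, and `toyAssay`, `toyQuantCols` and `quantByRun` are hypothetical names with invented values.

```r
# Toy PSM table with a run identifier and two quantification columns.
toyAssay <- data.frame(
    Raw.file = c("run1", "run1", "run2"),
    Sequence = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEA"),
    Reporter.intensity.1 = c(10, 20, 30),
    Reporter.intensity.2 = c(11, NA, 31)
)
toyQuantCols <- c("Reporter.intensity.1", "Reporter.intensity.2")

# Split the table by acquisition run (the role played by `runCol`).
byRun <- split(toyAssay, toyAssay$Raw.file)

# For each run, keep the quantification columns as a matrix and give
# the samples names that are unique across runs.
quantByRun <- lapply(byRun, function(x) {
    m <- as.matrix(x[, toyQuantCols, drop = FALSE])
    colnames(m) <- paste(x$Raw.file[1], toyQuantCols, sep = "_")
    m
})
names(quantByRun)
```

Each element of `quantByRun` corresponds to one set in the real object, where the feature annotations and the sample annotations are stored alongside the quantification matrix.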
@@ -302,7 +289,7 @@ sample data must also contain a column that points to the columns of
the feature data that contains the quantifications. Since label-free
SCP acquires one single-cell per run, this sample data column should
point to the same column for all samples. Moreover, this means that each
PSM assay will contain a single column.
PSM set will contain a single column.
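
For a label-free experiment, the sample table could look like the sketch below. The run names and the quantification column name `Intensity` are hypothetical, and the column names `runCol` and `quantCols` are assumed to be the ones expected after the refactoring.

```r
# One single cell per run: each run appears once, and every sample
# points to the same quantification column.
lfColData <- data.frame(
    runCol = c("cell1.raw", "cell2.raw", "cell3.raw"),
    quantCols = "Intensity"
)
lfColData
```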
# What about other input formats?
@@ -324,7 +311,7 @@ plexDIA/mTRAQ `Report.tsv` files generated by DIA-NN.
For more information, see the `readQFeatures()` and
`readQFeaturesFromDIANN()` manual pages, that describe the main
principle that convern the data import and formatting.
principle that concern the data import and formatting.
# Need help?