Pieter Verschaffelt edited this page Aug 6, 2024 · 3 revisions

Introducing support for non-tryptic peptides

We are happy to announce the release of Unipept Next, featuring a range of remarkable new enhancements and capabilities:

  • Faster! The taxonomic and functional analysis of a metaproteomics sample now takes even less time.
  • Support for missed cleavage handling in the Unipept API and CLI.
  • Support for the analysis of samples with semi-tryptic and non-tryptic peptides.
  • Completely new index structure for matching peptides with proteins, based on suffix arrays.

In-depth explanation and background information

Traditional relational database

Over the last year, we have been working on a new index structure that Unipept uses to match user-provided peptides with proteins in UniProtKB (and, through those proteins, with taxonomic and functional information). Previously, Unipept relied on a traditional relational database of pre-digested tryptic peptides, which only allows efficient retrieval of peptides that are actually present in that database. As a result, only perfectly cleaved tryptic peptides were supported. Over time, we added support for tryptic peptides with missed cleavages, albeit with a significant performance penalty, and still without support for semi-tryptic peptides or peptides that were not produced by tryptic cleavage at all.
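To make the limitation of a pre-digested index concrete, here is a minimal Python sketch. It is not Unipept's actual database schema or code: the function names, the two toy protein sequences and the identifiers `P1` and `P2` are made up for illustration, and the digestion rule (cleave after K or R, except before P) is the commonly used trypsin rule, assumed here.

```python
from collections import defaultdict


def tryptic_digest(protein):
    """Split a protein after every K or R that is not followed by P.

    Assumption: the commonly used trypsin cleavage rule.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def build_peptide_index(proteins):
    """Map every perfectly cleaved tryptic peptide to the proteins containing it."""
    index = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            index[peptide].add(protein_id)
    return index


# Hypothetical toy data, not real UniProtKB entries.
proteins = {"P1": "MKTAYIAKQRQISFVK", "P2": "GSAYIAKQW"}
index = build_peptide_index(proteins)

print(index.get("TAYIAK", set()))    # perfectly cleaved tryptic peptide: found in P1
print(index.get("AYIAK", set()))     # semi-tryptic peptide: not in the index
print(index.get("TAYIAKQR", set()))  # one missed cleavage: not in the index
```

Only exact keys of the index can be retrieved, so anything that is not a perfectly cleaved tryptic peptide needs extra work on top of such a lookup (or cannot be matched at all).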

Moving forwards to a suffix array

Instead of performing an in-silico tryptic digestion of the proteins in UniProtKB ahead of time, we decided to switch to a suffix array. This data structure allows for the efficient matching of short substrings in a large text, which lets us find out which proteins a peptide belongs to. Because the suffix array makes no assumptions about the type of input peptide it has to match, it also allows us to query non-tryptic peptides. One downside of suffix arrays, however, is that they require significantly more memory than traditional relational databases, since both the array itself and the additional information needed for efficient substring searches have to be stored. This leads to higher resource consumption, especially for large datasets, and may require specialized hardware or optimizations to keep the memory overhead manageable.
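The idea can be sketched in a few lines of Python. This is a simplified illustration, not the Unipept implementation: the construction is naive (sorting all suffixes directly), and the `-` separator, the function names and the toy proteins are assumptions made for the example.

```python
from bisect import bisect_right


def build_text(proteins):
    """Concatenate all proteins into one text, separated by '-'."""
    text, starts, ids = "", [], []
    for protein_id, sequence in proteins.items():
        starts.append(len(text))
        ids.append(protein_id)
        text += sequence + "-"
    return text, starts, ids


def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def matching_proteins(text, sa, starts, ids, peptide):
    """Translate text positions back to the proteins they fall in."""
    return {ids[bisect_right(starts, pos) - 1]
            for pos in find_occurrences(text, sa, peptide)}


# Hypothetical toy data, not real UniProtKB entries.
proteins = {"P1": "MKTAYIAKQRQISFVK", "P2": "GSAYIAKQW"}
text, starts, ids = build_text(proteins)
sa = build_suffix_array(text)

print(matching_proteins(text, sa, starts, ids, "TAYIAK"))  # tryptic peptide
print(matching_proteins(text, sa, starts, ids, "AYIAKQ"))  # non-tryptic peptide
```

Any substring of a protein can be looked up this way, regardless of where (or whether) trypsin would cleave. The sketch also shows where the memory goes: the array holds one integer per character of the concatenated text, in addition to the text itself.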

While a complete suffix array delivers the best performance, the index itself is large. Its size can be reduced by introducing a sampling step that creates a so-called sparse suffix array (SSA). This variant only stores every $k$-th suffix of the input text, which results in an SSA that is only $\frac{1}{k}$ of the original suffix array size. The disadvantage of an SSA is that it becomes impossible to search for peptides with fewer than $k$ amino acids, and that searching in general becomes slower. The length restriction is not a problem in practice as long as $k$ does not exceed 5, since Unipept already limits the search to peptides with 5 or more amino acids. That limit was introduced because mass spectrometers cannot read peptides with fewer than 5 amino acids. Furthermore, such extremely short peptides occur in a lot of proteins. Consequently, they yield extremely generic functional and taxonomic analysis results, which are not very informative for end-users.
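The trade-off can be illustrated with a small Python sketch. It shows one straightforward way to search a sparse suffix array, under assumptions of our own: the function names, the `-` separator, the toy text and the choice $k = 3$ are illustrative, and the search strategy (try every offset $0, \dots, k-1$ and verify each candidate) is a simple textbook approach, not necessarily what Unipept does internally.

```python
def build_sparse_suffix_array(text, k):
    """Keep only the suffixes that start at positions 0, k, 2k, ..."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])


def prefix_range(text, sa, pattern):
    """Return the slice bounds of the suffixes in `sa` starting with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def find_occurrences_sparse(text, ssa, k, pattern):
    """Find all occurrences of `pattern` using only the sampled suffixes."""
    if len(pattern) < k:
        # A shorter pattern may not cover any sampled position at all.
        raise ValueError("pattern must contain at least k characters")
    hits = set()
    for offset in range(k):  # up to k searches instead of one
        first, last = prefix_range(text, ssa, pattern[offset:])
        for pos in ssa[first:last]:
            start = pos - offset
            # Verify the characters that precede the sampled suffix.
            if start >= 0 and text[start:start + len(pattern)] == pattern:
                hits.add(start)
    return sorted(hits)


# Hypothetical toy text: two made-up proteins separated by '-'.
text = "MKTAYIAKQRQISFVK-GSAYIAKQW-"
k = 3
ssa = build_sparse_suffix_array(text, k)

print(len(text), len(ssa))  # the SSA stores one entry per k text positions
print(find_occurrences_sparse(text, ssa, k, "AYIAKQ"))
```

Every occurrence of a pattern with at least $k$ characters overlaps a sampled position within its first $k$ characters, so skipping 0 to $k-1$ leading characters and verifying them afterwards finds all matches. The extra searches and verifications are the reason queries get slower as $k$ grows.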

Performance comparison

Unipept Next achieves performance that is on par with Unipept 5.0 when analyzing traditional tryptic peptides. This means users can expect the same level of efficiency and speed for standard analyses. A significant advancement in Unipept Next, however, is that the "advanced missed cleavage handling" feature is now always enabled, without any performance penalty. Furthermore, because of the way the suffix array data structure works, the option can no longer be disabled.

Right now, the option is always checked in Unipept's user interface and can no longer be unchecked. In the future, the option will be removed entirely.