Important statistics

Pieter Verschaffelt edited this page Aug 25, 2023 · 5 revisions

The Unipept Database contains an enormous number of proteins and peptides, along with links between the different functional and taxonomic ontologies. This page summarizes some interesting statistics and provides context for some of the complex computations that are performed in the background.

Statistics overview

This document was last updated for the Unipept database constructed from UniProt 2023.3. Be cautious when extrapolating these numbers to more recent releases of UniProt.

  • Database version: Unipept 2023.3 (based on UniProt 2023.3)
  • Total proteins: 248 497 366
  • Average protein length: 348.88 amino acids
  • Total peptides: 1 342 470 765 (~1.3 billion)
  • Average peptide length: 18.83 amino acids

Protein counts distribution

Every peptide sequence in the database is associated with one or more proteins in which the sequence occurs. The number of proteins that a peptide is associated with can differ drastically. The following table presents an overview of how many peptides are associated with $n$ or more proteins:

| Number of associated proteins | Number of peptides |
| --- | --- |
| $\ge 1$ | 1 342 470 764 |
| $\ge 2$ | 355 979 324 |
| $\ge 10$ | 38 697 210 |
| $\ge 10^2$ | 2 921 879 |
| $\ge 10^3$ | 217 922 |
| $\ge 10^4$ | 13 008 |
| $\ge 10^5$ | 118 |
| $\ge 10^6$ | 0 |
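As an illustration of how cumulative counts like these could be derived, here is a minimal sketch. It assumes a mapping from each peptide sequence to its number of associated proteins is already available in memory; the function name `peptides_with_at_least` and the toy data are hypothetical and are not part of Unipept.

```python
def peptides_with_at_least(protein_counts, thresholds):
    """Count, for each threshold n, the peptides associated with n or more proteins.

    protein_counts: dict mapping a peptide sequence to its number of proteins
                    (hypothetical input format, chosen for this sketch).
    thresholds:     iterable of integer thresholds n.
    """
    return {
        n: sum(1 for count in protein_counts.values() if count >= n)
        for n in thresholds
    }


# Toy data, not taken from the Unipept database.
toy_counts = {"AAAK": 1, "CCCR": 2, "DDDK": 15, "EEEK": 120, "FFFR": 1}
result = peptides_with_at_least(toy_counts, [1, 2, 10, 100])
print(result)  # {1: 5, 2: 3, 10: 2, 100: 1}
```

For a table with over a billion rows, this aggregation would more realistically run inside the database itself; the sketch only shows the logic behind the thresholds.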

From this table, we can see that the number of peptides associated with a large number of proteins decreases rapidly. Only a little over 13k peptides occur in 10k or more proteins. For these peptides, we expect the lowest common ancestor (LCA) to be very generic, since the taxonomic diversity will probably be high. We cannot simply assume that this expectation holds, so we check it explicitly.

If we extract these 13k peptide sequences and query the database for the taxonomic ranks that are associated with these sequences, we find the following results:

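| NCBI taxonomy rank | Number of peptides |
| --- | --- |
| root | 12 369 |
| superkingdom | 43 |
| kingdom | 16 |
| subkingdom | 0 |
| superphylum | 0 |
| phylum | 8 |
| subphylum | 7 |
| superclass | 1 |
| class | 18 |
| subclass | 1 |
| superorder | 0 |
| order | 0 |
| infraorder | 1 |
| superfamily | 0 |
| family | 2 |
| subfamily | 0 |
| tribe | 1 |
| subtribe | 0 |
| genus | 55 |
| subgenus | 0 |
| species_group | 0 |
| species_subgroup | 0 |
| species | 200 |
| subspecies | 0 |
| strain | 1 |
| varietas | 0 |
| forma | 0 |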
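The tally above can be sketched as a filter followed by a count. The example below assumes two in-memory mappings, one from peptide to protein count and one from peptide to the rank of its LCA; both the input format and the name `rank_histogram` are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter


def rank_histogram(protein_counts, lca_rank, min_proteins):
    """Tally the LCA rank of every peptide that occurs in at least min_proteins proteins.

    protein_counts: dict mapping peptide -> number of associated proteins (assumed format).
    lca_rank:       dict mapping peptide -> NCBI rank name of its LCA (assumed format).
    """
    selected = (
        peptide
        for peptide, count in protein_counts.items()
        if count >= min_proteins
    )
    return Counter(lca_rank[peptide] for peptide in selected)


# Toy data, not taken from the Unipept database.
toy_counts = {"AAAK": 12_000, "CCCR": 5, "DDDK": 25_000, "EEEK": 10_000}
toy_ranks = {"AAAK": "root", "CCCR": "species", "DDDK": "species", "EEEK": "root"}

histogram = rank_histogram(toy_counts, toy_ranks, min_proteins=10_000)
print(histogram["root"], histogram["species"])  # 2 1
```

A `Counter` returns 0 for ranks that never occur, which matches the zero rows in the table.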

Remarkable here are the 200 peptide sequences with an LCA at the species rank. We did not expect this at first glance, so we need to take a closer look at which species these peptides are associated with.

| Number of peptides | LCA |
| --- | --- |
| 119 | Alphainfluenzavirus influenzae |
| 32 | Human immunodeficiency virus |
| 14 | Hepatitis B virus |
| 9 | Betainfluenzavirus influenzae |
| 4 | Orthoflavivirus denguei |
| 3 | Simian immunodeficiency virus |
| 1 | Alcidodes juglans |
| 1 | Bacillus subtilis |
| 1 | Bacteroides thetaiotaomicron |
| 1 | Cannabis sativa |
| 1 | Capsicum baccatum |
| 1 | Echinocucumis hispida |
| 1 | Geissoloma marginatum |
| 1 | Homo sapiens |
| 1 | Human immunodeficiency virus |
| 1 | Kalanchoe fedtschenkoi |
| 1 | Leucosceptrum canum |
| 1 | Loxia curvirostra |
| 1 | Marinilactibacillus piezotolerans |
| 1 | Melanocenchris jacquemontii |
| 1 | Merops nubicus |
| 1 | Morbillivirus hominis |
| 1 | Phalaenopsis pulcherrima |
| 1 | Phormidesmis priestleyi |
| 1 | Rhodobacter maris |

The majority of these LCAs are viral. For some of these viruses (such as HIV or influenza), a lot of research is conducted and an enormous number of different strains exist, all of which are present in the UniProt database. This explains why 200 peptides that occur in 10k or more proteins still have an LCA at the species rank.
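To make this reasoning concrete, here is a simplified sketch of a lowest common ancestor computation over lineages. It assumes each lineage is a list of taxon names ordered from the root downwards, and it takes the LCA to be the deepest taxon shared by all lineages. This is a textbook simplification and not necessarily how Unipept computes its LCAs; the function name and the shortened lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if nothing is shared.

    lineages: list of lists, each ordered from the root down to the taxon
              (assumed format for this sketch).
    """
    if not lineages:
        raise ValueError("at least one lineage is required")

    lca = None
    # zip(*lineages) walks all lineages level by level and stops at the shortest one.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break  # the lineages diverge at this level
        lca = level[0]
    return lca


# Shortened, illustrative lineages (not complete NCBI lineages).
influenza_strains = [
    ["root", "Viruses", "Alphainfluenzavirus", "Alphainfluenzavirus influenzae", "strain A"],
    ["root", "Viruses", "Alphainfluenzavirus", "Alphainfluenzavirus influenzae", "strain B"],
]
mixed_organisms = [
    ["root", "Bacteria", "Bacillus", "Bacillus subtilis"],
    ["root", "Eukaryota", "Homo", "Homo sapiens"],
]

print(lowest_common_ancestor(influenza_strains))  # Alphainfluenzavirus influenzae
print(lowest_common_ancestor(mixed_organisms))    # root
```

In the first case, thousands of proteins from different strains of one virus would still share a species-level ancestor, no matter how many proteins contain the peptide. In the second case, a peptide shared across distant organisms collapses to the root.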

To make future experiments easier to start, here are some example sequences for some of the organisms listed above:

Alphainfluenzavirus influenzae

EVHLYYLEK
QCFNPMLVELAEK
LEQSGLPVGGNEK
MMTNSQDTELSFTLTGDNTK
SMEYDAVATTHSWLPK

Human immunodeficiency virus

LGPENPYNTPVFALK
ALVELCTEMEK
QLLSGLVQQQSNLLR
AFSPEVLPMFSALSEGATPQDLNTMLNTVGGHQAAMQMLK
GSPALFQSSMTR

Hepatitis B virus

FLWEWASAR
LPVNRPLDWK
QPTPLSPPLR
LPMGVGLSPFLLAQFTSALCSVVR
EFGASVELLSFLPSDFFPSLR