
Shared flows

jmmut edited this page Oct 7, 2020 · 6 revisions

This section illustrates sequences of steps that are shared by multiple jobs.

The following flows are currently defined:

  • Generate VEP annotation
  • Calculate population statistics

Please note this section is a work in progress and more details about the structure of each flow will be added in the future.

Generate VEP annotation

Variant annotations are generated using Ensembl VEP, a binary completely independent of the EVA pipeline. As a result, each study could be annotated with a different version of VEP.

To annotate variants that have already been loaded, the database is traversed to find those lacking an annotation. The result of this traversal is a tab-separated file following the format described here.
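The traversal step can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the variants are plain dictionaries rather than database documents, the field names (`chr`, `start`, `end`, `ref`, `alt`, `annotation`) and the function `unannotated_to_vep_input` are hypothetical, and the column layout (chromosome, start, end, allele string, strand) is modeled on the Ensembl default VEP input format. The format linked above remains the authoritative description.

```python
import csv
import io


def unannotated_to_vep_input(variants, out):
    """Write one tab-separated line per variant that lacks an annotation.

    `variants` is an iterable of dicts (hypothetical schema); `out` is any
    writable text stream. Returns the number of lines written.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for variant in variants:
        # Skip variants that already carry a non-empty annotation.
        if variant.get("annotation"):
            continue
        allele_string = f'{variant["ref"]}/{variant["alt"]}'
        # Assumed layout: chromosome, start, end, alleles, strand.
        writer.writerow(
            [variant["chr"], variant["start"], variant["end"], allele_string, "+"]
        )
        written += 1
    return written


# Toy data standing in for the variants stored in the database.
variants = [
    {"chr": "1", "start": 1000, "end": 1000, "ref": "A", "alt": "G",
     "annotation": {"consequence": "missense_variant"}},
    {"chr": "1", "start": 2000, "end": 2000, "ref": "C", "alt": "T"},
    {"chr": "2", "start": 3000, "end": 3000, "ref": "G", "alt": "A",
     "annotation": None},
]

buffer = io.StringIO()
count = unannotated_to_vep_input(variants, buffer)
vep_input = buffer.getvalue()
print(vep_input, end="")
```

In this toy run only the second and third variants are written, because the first one already has an annotation.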

In addition to this tab-separated file, the following are also necessary to run VEP:

  • A FASTA file containing the sequence matched by the VCF
  • A VEP cache containing transcript locations, regulatory regions, SIFT/PolyPhen scores, etc.

FASTA files and VEP caches that are ready to be used together can be found on the Ensembl FTP site, here and here.

VEP creates a plain text file with the annotations, which is then read and loaded into the database along with the variants.

Annotation flow


Calculate population statistics

The process for calculating statistics starts by selecting a study and an analysis, after which the list of relevant variants is queried. For each variant, genotypes and frequencies are accumulated, along with general variant counts. These accumulated values are stored in local files and, in a subsequent step, the files are loaded into the MongoDB collections.

Statistics flow


At the moment, only basic variant and source (file) statistics are calculated:

Variant statistics

Variant statistics aggregate information across samples about a single variant, coming from a single source (where source is typically an analysis, roughly a VCF file). The statistics supported are:

  • Minor Allele Frequency (MAF, defined as "second highest frequency")
  • Which one is the MAF allele
  • Minimum Genotype Frequency (MGF, defined as the frequency of the genotype with the fewest appearances)
  • Which one is the MGF genotype
  • How many missing alleles there are
  • How many missing genotypes there are
  • Genotype counts (for each genotype, how many times it appears)
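
The statistics listed above can be sketched for a single variant as follows. This is a minimal illustration, not the pipeline's implementation. It assumes VCF-style genotype strings (for example `0/1` or `./.`), treats phased and unphased genotypes alike, counts a genotype as missing only when all of its alleles are missing, and reports a MAF of 0 when only one allele is observed. The function name `variant_statistics` and the result keys are hypothetical.

```python
from collections import Counter


def variant_statistics(genotypes):
    """Compute per-variant statistics from a list of genotype strings."""
    allele_counts = Counter()
    genotype_counts = Counter()
    missing_alleles = 0
    missing_genotypes = 0

    for genotype in genotypes:
        # Assumption: phasing is ignored, so "0|1" and "1/0" are the same.
        alleles = genotype.replace("|", "/").split("/")
        n_missing = sum(1 for allele in alleles if allele == ".")
        missing_alleles += n_missing
        if n_missing == len(alleles):
            # Assumption: a genotype is missing only if every allele is.
            missing_genotypes += 1
            continue
        allele_counts.update(a for a in alleles if a != ".")
        genotype_counts["/".join(sorted(alleles))] += 1

    # MAF: the second highest allele frequency (ties broken by allele name).
    total_alleles = sum(allele_counts.values())
    ranked = sorted(allele_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) >= 2:
        maf_allele, maf_count = ranked[1]
        maf = maf_count / total_alleles
    else:
        maf_allele, maf = None, 0.0

    # MGF: the frequency of the least common observed genotype.
    total_genotypes = sum(genotype_counts.values())
    if genotype_counts:
        mgf_genotype, mgf_count = min(
            genotype_counts.items(), key=lambda kv: (kv[1], kv[0])
        )
        mgf = mgf_count / total_genotypes
    else:
        mgf_genotype, mgf = None, 0.0

    return {
        "maf": maf,
        "maf_allele": maf_allele,
        "mgf": mgf,
        "mgf_genotype": mgf_genotype,
        "missing_alleles": missing_alleles,
        "missing_genotypes": missing_genotypes,
        "genotype_counts": dict(genotype_counts),
    }


stats = variant_statistics(["0/0", "0/1", "1/0", "0/0", "1/1", "./."])
print(stats)
```

In this example allele `0` appears six times and allele `1` four times among ten called alleles, so the MAF allele is `1`. The least common genotype is `1/1`, observed once among five called genotypes.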

Source statistics

Source statistics aggregate information across variants and samples within a single source. A variant source is typically an analysis, which is roughly equivalent to a single VCF file. The statistics supported are:

  • Number of samples
  • Number of variants
  • Number of indels
  • Average quality of the variants
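
A similar sketch for the source statistics is shown below. It is an illustration under assumptions rather than the pipeline's code: variants are plain dictionaries with hypothetical `ref`, `alt` and `qual` fields, a variant counts as an indel when its reference and alternate alleles differ in length, and variants without a quality value are left out of the average. The function name `source_statistics` is also hypothetical.

```python
def source_statistics(sample_names, variants):
    """Aggregate simple statistics over all variants of one source."""
    # Assumption: variants with no quality value do not affect the mean.
    qualities = [v["qual"] for v in variants if v.get("qual") is not None]
    # Assumption: an indel is any variant whose alleles differ in length.
    num_indels = sum(1 for v in variants if len(v["ref"]) != len(v["alt"]))
    return {
        "num_samples": len(sample_names),
        "num_variants": len(variants),
        "num_indels": num_indels,
        "mean_quality": sum(qualities) / len(qualities) if qualities else None,
    }


# Toy data standing in for one analysis (roughly one VCF file).
samples = ["S1", "S2", "S3"]
variants = [
    {"ref": "A", "alt": "G", "qual": 30.0},
    {"ref": "AT", "alt": "A", "qual": 50.0},   # deletion
    {"ref": "C", "alt": "CGG", "qual": 40.0},  # insertion
    {"ref": "G", "alt": "T", "qual": None},    # no quality recorded
]

source_stats = source_statistics(samples, variants)
print(source_stats)
```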