Shared flows

This section illustrates sequences of steps that are shared by multiple jobs.

The following flows are currently defined:

Generate VEP annotation
Calculate population statistics

Please note this section is a work in progress and more details about the structure of each flow will be added in the future.

Generate VEP annotation

Variant annotations are generated using Ensembl VEP, a binary that is completely independent of the EVA pipeline. As a result, each study could be annotated with a different version of VEP.

To annotate variants that have been previously loaded, the database is traversed to find those lacking an annotation. The result is written to a tab-separated file following the format described here.
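As an illustration only, the selection step could be sketched as follows. The variant records are plain dictionaries with hypothetical field names (`chr`, `start`, `end`, `ref`, `alt`, `annotation`), not the actual EVA database schema, and the column layout written here (chromosome, start, end, alleles, strand) is a placeholder for the format linked above.

```python
import csv
import io


def write_unannotated(variants, out):
    """Write one tab-separated row per variant that lacks an annotation.

    `variants` is an iterable of dicts with hypothetical field names;
    `out` is any writable text stream. Returns the number of rows written.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    written = 0
    for variant in variants:
        if variant.get("annotation"):
            continue  # already annotated, nothing to do
        alleles = f'{variant["ref"]}/{variant["alt"]}'
        # Placeholder column layout: chromosome, start, end, alleles, strand
        writer.writerow(
            [variant["chr"], variant["start"], variant["end"], alleles, "+"]
        )
        written += 1
    return written


if __name__ == "__main__":
    example = [
        {"chr": "1", "start": 100, "end": 100, "ref": "A", "alt": "G",
         "annotation": None},
        {"chr": "2", "start": 5, "end": 5, "ref": "C", "alt": "T",
         "annotation": {"consequence": "missense_variant"}},
    ]
    buffer = io.StringIO()
    write_unannotated(example, buffer)
    print(buffer.getvalue(), end="")
```

In a real run the records would come from a database query rather than an in-memory list; the filtering logic stays the same.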

In addition to this tab-separated file, the following are also necessary to run VEP:

A FASTA file containing the sequence matched by the VCF
A VEP cache containing transcript locations, regulatory regions, SIFT/PolyPhen scores, etc.

FASTA files and VEP caches ready to be used together can be found on the Ensembl FTP site, here and here.

VEP creates a plain text file with the annotations, which is then read and loaded into the database along with the variants.
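A minimal sketch of reading such a file is shown below. It assumes VEP's default tab-delimited output, where metadata lines start with `##` and the column header line starts with a single `#`. The function name `parse_vep_output` and the sample rows are hypothetical, and this is not the loader used by the pipeline.

```python
def parse_vep_output(lines):
    """Yield one dict per annotation row of a tab-delimited VEP output.

    Column names are taken from the header line, so the sketch does not
    depend on which optional columns a given VEP version emits.
    """
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or metadata
        if line.startswith("#"):
            columns = line[1:].split("\t")  # header, leading '#' dropped
            continue
        if columns is None:
            raise ValueError("data row found before the column header")
        yield dict(zip(columns, line.split("\t")))


if __name__ == "__main__":
    # Hypothetical sample, reduced to four columns for brevity
    sample = [
        "## ENSEMBL VARIANT EFFECT PREDICTOR\n",
        "#Uploaded_variation\tLocation\tAllele\tConsequence\n",
        "1_100_A/G\t1:100\tG\tmissense_variant\n",
    ]
    for row in parse_vep_output(sample):
        print(row["Uploaded_variation"], row["Consequence"])
```

Each yielded dictionary can then be matched back to its variant (for example through the `Uploaded_variation` column) before being written to the database.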


Calculate population statistics

The process for calculating statistics starts by selecting a study and an analysis, after which the list of relevant variants is queried. For each variant, genotypes and frequencies are accumulated, along with general variant counts. These accumulated values are stored in local files and, in a subsequent step, the files are loaded into the MongoDB collections.


At the moment, only basic variant and source (file) statistics are calculated:

Variant statistics

Variant statistics aggregate information across samples about a single variant, coming from a single source (where source is typically an analysis, roughly a VCF file). The statistics supported are:

Minor Allele Frequency (MAF, defined as "second highest frequency")
Which allele has the MAF
Minimum Genotype Frequency (MGF, defined as the frequency of the genotype with the fewest appearances)
Which genotype has the MGF
Number of missing alleles
Number of missing genotypes
Genotype counts (the number of appearances of each genotype)
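The statistics listed above can be sketched for a single variant as follows. This is an illustrative sketch, not the pipeline's implementation, and it makes several assumptions: genotypes are VCF-style strings such as `0/1` or `./.`, a genotype counts as missing when any of its alleles is missing, phasing is ignored, and the MGF is taken over observed genotypes only. The function name `variant_stats` is hypothetical.

```python
import re
from collections import Counter


def variant_stats(alleles, genotypes):
    """Compute per-variant statistics from VCF-style genotype strings.

    `alleles` lists the reference allele followed by the alternates;
    `genotypes` holds one string per sample, e.g. "0/1", "1|1" or "./.".
    """
    allele_counts = {index: 0 for index in range(len(alleles))}
    genotype_counts = Counter()
    missing_alleles = 0
    missing_genotypes = 0

    for genotype in genotypes:
        calls = re.split(r"[/|]", genotype)
        missing = calls.count(".")
        missing_alleles += missing
        if missing:
            missing_genotypes += 1  # assumption: any missing allele
        for call in calls:
            if call != ".":
                allele_counts[int(call)] += 1
        if not missing:
            # Ignore phasing and allele order: "1|0" is counted as "0/1"
            key = "/".join(str(i) for i in sorted(int(c) for c in calls))
            genotype_counts[key] += 1

    stats = {
        "maf": None, "maf_allele": None,
        "mgf": None, "mgf_genotype": None,
        "missing_alleles": missing_alleles,
        "missing_genotypes": missing_genotypes,
        "genotype_counts": dict(genotype_counts),
    }

    total_alleles = sum(allele_counts.values())
    if total_alleles and len(alleles) > 1:
        # MAF is the second highest allele frequency
        ranked = sorted(allele_counts, key=lambda i: -allele_counts[i])
        second = ranked[1]
        stats["maf"] = allele_counts[second] / total_alleles
        stats["maf_allele"] = alleles[second]

    total_genotypes = sum(genotype_counts.values())
    if total_genotypes:
        # MGF is the frequency of the least frequent observed genotype
        rarest, count = min(genotype_counts.items(), key=lambda kv: kv[1])
        stats["mgf"] = count / total_genotypes
        stats["mgf_genotype"] = rarest

    return stats


if __name__ == "__main__":
    print(variant_stats(["A", "G"], ["0/0", "0/1", "1/1", "0/0", "./.", "0|1"]))
```

In the example call, six of the ten called alleles are `A` and four are `G`, so `G` is the MAF allele; the genotype `1/1` appears once among five called genotypes, so it is the MGF genotype.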

Source statistics

Source statistics aggregate information across variants and samples within a single source. A variant source is typically an analysis, which is roughly equivalent to a single VCF file. The statistics supported are:

Number of samples
Number of variants
Number of indels
Average quality of the variants
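A short sketch of how these four values could be accumulated in one pass over a source is given below. The record layout (`ref`, `alt`, `qual`) and the function name `source_stats` are hypothetical. The sketch also assumes that an indel is any variant whose reference and alternate alleles differ in length, and that the average quality skips variants without a quality value; the pipeline may define both differently.

```python
def source_stats(sample_names, variants):
    """Accumulate source-level statistics in a single pass.

    `sample_names` is the list of samples in the source; `variants` is an
    iterable of dicts with hypothetical keys "ref", "alt" and "qual"
    (a float, or None when the quality is unknown).
    """
    variant_count = 0
    indel_count = 0
    quality_sum = 0.0
    quality_count = 0

    for variant in variants:
        variant_count += 1
        # Assumption: differing allele lengths mark an insertion or deletion
        if len(variant["ref"]) != len(variant["alt"]):
            indel_count += 1
        quality = variant.get("qual")
        if quality is not None:
            quality_sum += quality
            quality_count += 1

    mean_quality = quality_sum / quality_count if quality_count else None
    return {
        "samples": len(sample_names),
        "variants": variant_count,
        "indels": indel_count,
        "mean_quality": mean_quality,
    }


if __name__ == "__main__":
    example = [
        {"ref": "A", "alt": "G", "qual": 30.0},
        {"ref": "AT", "alt": "A", "qual": 50.0},
        {"ref": "C", "alt": "CTT", "qual": None},
    ]
    print(source_stats(["S1", "S2"], example))
```

Because only running totals are kept, the variants can be streamed from a file or a database cursor without holding the whole source in memory.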
