Skip to content

Day 4 Metagenomics

Merritt-Brian edited this page Apr 25, 2024 · 16 revisions

Day 4 Metagenomics for agnostic classifications

FIRST STEP Open Ubuntu or WSL and type + hit enter:

cd $HOME/omics_workshop && conda activate omics_workshop

Defining our command using Kraken2

Type:

First, let's stage everything by decompressing our database tarball with tar and also make a directory to place our output files

mkdir -p metagenomics
tar -k -C databases -xvzf databases/test_metagenome.tar.gz

IF THE COMMAND ERRORS HERE IT ISNT A PROBLEM AND YOU CAN MOVE ON TO THE NEXT STEP

Remember these parameters for tar. They aren't important to kraken2 but are useful when working with decompressing a tarball

  • tar a method of decompressing or compressing a directory into a single file. In this example, we are decompressing a tarball (.tar) that is also compressed (.gz)
  • -C Change to directory to perform decompressing. This is where the files will live after decompression has taken place
  • -k Keep the original tarball. Omit this in most cases if you dont want to store as much as soon as you've decompressed.
  • -x Extract - used for decompressing
  • -v verbose output. otherwise, it is quiet on the stdout/stderr
  • -z The tar is also compressed because it ends with .gz
  • -f Specify the filename of the tarball (compressed or otherwise)
kraken2 --report metagenomics/miseq.k2.report  --paired --out metagenomics/miseq.k2.out --db databases/test_metagenome fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz
  • --report the primary report file (see contents below) that contains the hierarchical distribution of organism abundance in our sample
  • --out the read-by-read classification file. This can be very large depending on the amount of reads you classify
  • --db the kraken2 database that contains the k2d index files necessary to perform classification.
  • --paired indicating that we are classifying a set of paired-end read files
  • fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz The positional read names (can be compressed or not) for our example. This example, we use paired-end reads.

You will see output like

======================================================================

(DO NOT TYPE THIS)


Loading database information... done.
78014 sequences (21.72 Mbp) processed in 1.766s (2650.3 Kseq/m, 737.94 Mbp/m).
  77724 sequences classified (99.63%)
  290 sequences unclassified (0.37%)

======================================================================

This will make a kraken2 report and outfile in the metagenomics directory. Let's view the report file with cat like so

cat metagenomics/miseq.k2.report 

Which will have output like

======================================================================

(DO NOT TYPE THIS)

 6.46  5040    5040    U       0       unclassified
 93.54  72974   0       R       1       root
 93.34  72822   0       R1      131567    cellular organisms
 93.34  72822   2       D       2           Bacteria
 64.00  49926   0       P       1224          Pseudomonadota
 53.67  41871   1       C       1236            Gammaproteobacteria
 53.36  41631   0       O       72274             Pseudomonadales
 53.36  41631   0       F       135621              Pseudomonadaceae
 53.36  41631   0       G       286                   Pseudomonas
 53.36  41631   0       G1      136841                  Pseudomonas aeruginosa group
 53.36  41631   0       S       287                       Pseudomonas aeruginosa
 53.36  41631   41631   S1      1279007                     Pseudomonas aeruginosa PA1
...
 14.34  11185   11185   S1      93061                       Staphylococcus aureus subsp. aureus NCTC 8325
...
  0.15  114     114     S3      99287                         Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
...
  0.12  95      95      S2      511145                      Escherichia coli str. K-12 substr. MG1655

...
 12.38  9655    9655    S2      224308                          Bacillus subtilis subsp. subtilis str. 168
...
...
  0.01  7       7       S       1613                      Limosilactobacillus fermentum
 ...
  0.19  152     152     S2      12814                           Respiratory syncytial virus

======================================================================

Notice the columns which are unanmed. I've truncated much of the output for readability. The columns are:

  1. Percentage of fragments covered by the clade rooted at this taxon
  2. Number of fragments covered by the clade rooted at this taxon
  3. Number of fragments assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
  5. NCBI taxonomic ID number
  6. Indented scientific name

Let's now do the same for our ONT data

kraken2 --report metagenomics/ont.k2.report --out metagenomics/ont.k2.out --db databases/test_metagenome fastq/ont_reads.fastq.gz

Be aware that we've removed --paired to indicate that we are classifying single-end reads.

See docs for more detailed information on kraken2

Lastly, do you notice any organisms missing from the alignment coverage stats?

Testing out a different Database

What if, instead, we wanted to use a set of indices (a database) that is tailored for another set of targets? For example, what if we wanted to only classify viruses present in the sample?

We can run:

kraken2 \
   --db databases/k2_viral \
   --report metagenomics/miseq.k2.viral.report \
   --out metagenomics/miseq.k2.viral.out \
   --paired fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz

Which will have output like this if we read it with

cat metagenomics/miseq.k2.viral.report

======================================================================

(DO NOT TYPE THIS)

39007 sequences (21.72 Mbp) processed in 1.811s (1292.4 Kseq/m, 719.68 Mbp/m).
  807 sequences classified (2.07%)
  38200 sequences unclassified (97.93%)

And the file will look like:

97.93  38200   38200   U       0       unclassified
  2.07  807     0       R       1       root
  2.07  807     0       D       10239     Viruses
  1.78  694     0       D1      2731341     Duplodnaviria
  1.78  694     0       K       2731360       Heunggongvirae
  1.78  694     0       P       2731618         Uroviricota
  1.78  694     62      C       2731619           Caudoviricetes
  0.45  174     0       G       680116              Spbetavirus
  0.45  174     0       S       66797                 Spbetavirus SPbeta
  0.45  174     174     S1      2932878                 Bacillus phage SPBc2

Notice any interesting characteristics of these results, relative to the other database we used?

======================================================================

Make a Krona Plot

Make sure you've updated the taxonomy before you do this. This should've occurred when you installed everything with install.bat and install.sh. If you haven't done so, make sure to run within WSL2:

Making the Krona Plot

Next, we need to use the NCBI taxonomy mapping for parent-child relationships and make the report. Let's make it in the metagenomics folder like so. This requires a taxonomy.tab file to be present. You can get it here or by installing it from the command line like:

Illumina Reads

ktImportTaxonomy -t 5 -m 3 -o metagenomics/miseq.krona.html metagenomics/miseq.k2.report 
  • ktImportTaxonomy
  • -t The taxonomic ID column
  • -m The abundance (magnitude column number in our k2 report file)
  • -o The output html file
  • metagenomics/miseq.k2.report The positional input filename from the kraken2 results we made earlier (report file)

Remember, the content of the k2 report looks like

======================================================================

(DO NOT TYPE THIS)

Abundance classified_at_clade classified_at_taxon_only rank  taxid organismName
14.34  11185   11185   S1      93061                       Staphylococcus aureus subsp. aureus NCTC 8325

======================================================================

So we want to set the magnitude to -m 3 since Krona wants to get the size for each level rather than the aggregate across it AND all children to make the hierarchy we will sow in a moment.

Now, open up the metagenomics/miseq.krona.html by double-clicking it. It should take you to a web-browser with the interactive plot.

ONT Reads

ktImportTaxonomy -t 5 -m 3 -o metagenomics/ont.krona.html metagenomics/ont.k2.report 

If you don't have the taxonomy.tab file, you can get it here or installing with the command:

======================================================================

(DO NOT TYPE THIS UNLESS YOU HAVE ISSUES WITH THIS STEP)

mkdir -p  ~/omics_workshop/downloads/ \
    && wget https://github.com/jhuapl-bio/datasets/raw/main/databases/ncbi/taxonomy.tab.gz \
    -O ~/omics_workshop/downloads/taxonomy.tab.gz \
    && gzip -f -d ~/omics_workshop/downloads/taxonomy.tab.gz  && \
    mv ~/omics_workshop/downloads/taxonomy.tab $(dirname $(which ktImportTaxonomy))/../opt/krona/taxonomy/

======================================================================

Using a larger database for your own use case

  1. Head here to select the database you want
  2. Select the .tar.gz to download
  3. Decompress the file with tar -xvzf $database.tar.gz. You will need to know where it is located in your downloads folder. you can also copy the url of the .tar.gz and run wget $url then run tar -xvzf database.tar.gz
  4. Change the --db paramater to your database of choice

For instance: kraken2 --report metagenomics/miseq.k2.report --paired --out metagenomics/miseq.k2.out --db $databasefolder fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz

where $database is the database of your choice. Double check the name with a ls in the folder

If you ran the install.sh script you should already have it in the correct location

Day 6