-
Notifications
You must be signed in to change notification settings - Fork 0
Day 4 Metagenomics
FIRST STEP Open Ubuntu
or WSL
and type + hit enter:
cd $HOME/omics_workshop && conda activate omics_workshop
Type:
First, let's stage everything by decompressing our database tarball with tar and also make a directory to place our output files
mkdir -p metagenomics
tar -k -C databases -xvzf databases/test_metagenome.tar.gz
IF THE COMMAND ERRORS HERE IT ISNT A PROBLEM AND YOU CAN MOVE ON TO THE NEXT STEP
Remember these parameters for tar. They aren't important to kraken2 but are useful when working with decompressing a tarball
-
tar
a method of decompressing or compressing a directory into a single file. In this example, we are decompressing a tarball (.tar
) that is also compressed (.gz
) -
-C
Change to directory to perform decompressing. This is where the files will live after decompression has taken place -
-k
Keep the original tarball. Omit this in most cases if you dont want to store as much as soon as you've decompressed. -
-x
Extract - used for decompressing -
-v
verbose output. otherwise, it is quiet on the stdout/stderr -
-z
The tar is also compressed because it ends with.gz
-
-f
Specify the filename of the tarball (compressed or otherwise)
kraken2 --report metagenomics/miseq.k2.report --paired --out metagenomics/miseq.k2.out --db databases/test_metagenome fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz
-
--report
the primary report file (see contents below) that contains the hierarchical distribution of organism abundance in our sample -
--out
the read-by-read classification file. This can be very large depending on the amount of reads you classify -
--db
the kraken2 database that contains thek2d
index files necessary to perform classification. -
--paired
indicating that we are classifying a set of paired-end read files -
fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz
The positional read names (can be compressed or not) for our example. This example, we use paired-end reads.
You will see output like
======================================================================
(DO NOT TYPE THIS)
Loading database information... done.
78014 sequences (21.72 Mbp) processed in 1.766s (2650.3 Kseq/m, 737.94 Mbp/m).
77724 sequences classified (99.63%)
290 sequences unclassified (0.37%)
======================================================================
This will make a kraken2 report and outfile in the metagenomics
directory. Let's view the report file with cat
like so
cat metagenomics/miseq.k2.report
Which will have output like
======================================================================
(DO NOT TYPE THIS)
6.46 5040 5040 U 0 unclassified
93.54 72974 0 R 1 root
93.34 72822 0 R1 131567 cellular organisms
93.34 72822 2 D 2 Bacteria
64.00 49926 0 P 1224 Pseudomonadota
53.67 41871 1 C 1236 Gammaproteobacteria
53.36 41631 0 O 72274 Pseudomonadales
53.36 41631 0 F 135621 Pseudomonadaceae
53.36 41631 0 G 286 Pseudomonas
53.36 41631 0 G1 136841 Pseudomonas aeruginosa group
53.36 41631 0 S 287 Pseudomonas aeruginosa
53.36 41631 41631 S1 1279007 Pseudomonas aeruginosa PA1
...
14.34 11185 11185 S1 93061 Staphylococcus aureus subsp. aureus NCTC 8325
...
0.15 114 114 S3 99287 Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
...
0.12 95 95 S2 511145 Escherichia coli str. K-12 substr. MG1655
...
12.38 9655 9655 S2 224308 Bacillus subtilis subsp. subtilis str. 168
...
...
0.01 7 7 S 1613 Limosilactobacillus fermentum
...
0.19 152 152 S2 12814 Respiratory syncytial virus
======================================================================
Notice the columns which are unanmed. I've truncated much of the output for readability. The columns are:
- Percentage of fragments covered by the clade rooted at this taxon
- Number of fragments covered by the clade rooted at this taxon
- Number of fragments assigned directly to this taxon
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
- NCBI taxonomic ID number
- Indented scientific name
Let's now do the same for our ONT data
kraken2 --report metagenomics/ont.k2.report --out metagenomics/ont.k2.out --db databases/test_metagenome fastq/ont_reads.fastq.gz
Be aware that we've removed --paired
to indicate that we are classifying single-end reads.
See docs for more detailed information on kraken2
Lastly, do you notice any organisms missing from the alignment coverage stats?
What if, instead, we wanted to use a set of indices (a database) that is tailored for another set of targets? For example, what if we wanted to only classify viruses present in the sample?
We can run:
kraken2 \
--db databases/k2_viral \
--report metagenomics/miseq.k2.viral.report \
--out metagenomics/miseq.k2.viral.out \
--paired fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz
Which will have output like this if we read it with
cat metagenomics/miseq.k2.viral.report
======================================================================
(DO NOT TYPE THIS)
39007 sequences (21.72 Mbp) processed in 1.811s (1292.4 Kseq/m, 719.68 Mbp/m).
807 sequences classified (2.07%)
38200 sequences unclassified (97.93%)
And the file will look like:
97.93 38200 38200 U 0 unclassified
2.07 807 0 R 1 root
2.07 807 0 D 10239 Viruses
1.78 694 0 D1 2731341 Duplodnaviria
1.78 694 0 K 2731360 Heunggongvirae
1.78 694 0 P 2731618 Uroviricota
1.78 694 62 C 2731619 Caudoviricetes
0.45 174 0 G 680116 Spbetavirus
0.45 174 0 S 66797 Spbetavirus SPbeta
0.45 174 174 S1 2932878 Bacillus phage SPBc2
Notice any interesting characteristics of these results, relative to the other database we used?
======================================================================
Make sure you've updated the taxonomy before you do this. This should've occurred when you installed everything with install.bat
and install.sh
. If you haven't done so, make sure to run within WSL2:
Next, we need to use the NCBI taxonomy mapping for parent-child relationships and make the report. Let's make it in the metagenomics
folder like so. This requires a taxonomy.tab file to be present. You can get it here or by installing it from the command line like:
ktImportTaxonomy -t 5 -m 3 -o metagenomics/miseq.krona.html metagenomics/miseq.k2.report
ktImportTaxonomy
-
-t
The taxonomic ID column -
-m
The abundance (magnitude column number in our k2 report file) -
-o
The output html file -
metagenomics/miseq.k2.report
The positional input filename from the kraken2 results we made earlier (report file)
Remember, the content of the k2 report looks like
======================================================================
(DO NOT TYPE THIS)
Abundance classified_at_clade classified_at_taxon_only rank taxid organismName
14.34 11185 11185 S1 93061 Staphylococcus aureus subsp. aureus NCTC 8325
======================================================================
So we want to set the magnitude to -m 3
since Krona wants to get the size for each level rather than the aggregate across it AND all children to make the hierarchy we will sow in a moment.
Now, open up the metagenomics/miseq.krona.html
by double-clicking it. It should take you to a web-browser with the interactive plot.
ktImportTaxonomy -t 5 -m 3 -o metagenomics/ont.krona.html metagenomics/ont.k2.report
If you don't have the taxonomy.tab file, you can get it here or installing with the command:
======================================================================
(DO NOT TYPE THIS UNLESS YOU HAVE ISSUES WITH THIS STEP)
mkdir -p ~/omics_workshop/downloads/ \
&& wget https://github.com/jhuapl-bio/datasets/raw/main/databases/ncbi/taxonomy.tab.gz \
-O ~/omics_workshop/downloads/taxonomy.tab.gz \
&& gzip -f -d ~/omics_workshop/downloads/taxonomy.tab.gz && \
mv ~/omics_workshop/downloads/taxonomy.tab $(dirname $(which ktImportTaxonomy))/../opt/krona/taxonomy/
======================================================================
- Head here to select the database you want
- Select the .tar.gz to download
- Decompress the file with
tar -xvzf $database.tar.gz
. You will need to know where it is located in your downloads folder. you can also copy the url of the .tar.gz and runwget $url
then runtar -xvzf database.tar.gz
- Change the
--db
paramater to your database of choice
For instance: kraken2 --report metagenomics/miseq.k2.report --paired --out metagenomics/miseq.k2.out --db $databasefolder fastq/miseq_reads_R1.fastq.gz fastq/miseq_reads_R2.fastq.gz
where $database is the database of your choice. Double check the name with a ls
in the folder
If you ran the install.sh
script you should already have it in the correct location