Create v1.md

jhuapl-bio · Aug 9, 2024 · 74e05b8
1 parent df42d93
commit 74e05b8
Showing 1 changed file with 167 additions and 0 deletions.
diff --git a/v1.md b/v1.md
@@ -0,0 +1,167 @@
+
+## Original Mytax v1.0
+
+
+This is the repository for mytax, a tool for building custom taxonomies, which can aid nucleotide sequence classification.
+
+# Installation
+
+Clone this repo with:
+
+`git clone https://github.com/jhuapl-bio/mytax`
+
+From the repository root, symbolically link all shell scripts into a directory on your `PATH`, for example with:
+
+`find v1 -name "*.sh" | while read -r fn; do sudo ln -s "$PWD/$fn" /usr/local/bin; done`
+
+# Dependencies
+
+ - jellyfish (version 1) - https://www.cbcb.umd.edu/software/jellyfish/
+ - kraken (version 1) - https://ccb.jhu.edu/software/kraken/
+ - gawk - https://www.gnu.org/software/gawk/manual/html_node/Installation.html
+ - perl
+ - GNU CoreUtils
+
+ - 16 GB of RAM is needed to build the provided influenza kraken database
+
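Before building, it can help to confirm that the tools listed above are on the `PATH`. The following is a minimal sketch, not part of mytax: `check_deps` is a hypothetical helper name, and it only tests that a command can be found, not which version is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with mytax): report every command that
# cannot be found on PATH. Returns 0 if all were found, 1 otherwise.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_deps jellyfish kraken kraken-build gawk perl` prints one `missing:` line per tool that is not installed.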
+# Usage
+
+## Building example
+
+This pipeline is built from a central set of scripts located in the `v1` directory.
+
+Build the flu-kraken example with:
+
+`build_flukraken.sh -k flukraken-$(date +"%F")`
+
+The single script `build_flukraken.sh` functions as an outer wrapper for the influenza classification example using the Kraken classifier published in the mytax paper.
+
+
+`build_flukraken.sh` can also be used as a model to build modified pipelines as desired.  It is built from four main sub-modules:
+```
+	download_IVR.sh -> download references and taxonomy from IVR
+
+	build_IVR_metadata.sh -> build tab-delimited metadata table in format for mytax
+
+	build_taxonomy.sh -> build custom taxonomy from tab-delimited table
+
+	build_krakendb.sh -> add new taxonomic IDs to reference FASTA, build kraken database, post-process database for visualization pipeline
+```
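When adapting the wrapper into a modified pipeline, the four sub-modules above run one after another, and a failed step should stop the chain. The sketch below illustrates that control flow only. It is an assumption about structure, not the actual contents of `build_flukraken.sh`; `run_steps` is a hypothetical name, and the real scripts take options that this section does not document, so none are passed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run each named step in order and stop at the first
# one that fails. Steps are called without arguments purely for illustration.
run_steps() {
    local step
    for step in "$@"; do
        echo "running: $step"
        "$step" || { echo "failed: $step" >&2; return 1; }
    done
}
```

With the scripts linked into the `PATH`, a call would look like `run_steps download_IVR.sh build_IVR_metadata.sh build_taxonomy.sh build_krakendb.sh`, with each script's real options added as needed.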
+
+`build_krakendb.sh` currently references three helper scripts, which also need to be in the PATH:
+```
+	fix_references.sh -> adds new taxonomic IDs to reference FASTA
+
+	kraken-build -> builds kraken database
+
+	process_krakendb.sh -> post-processes database for visualization pipeline (not included in this repo yet)
+```
+
+
+
+## Running process script on kraken/kraken2 report and outfiles
+
+### If running from Docker
+
+`docker build . -t jhuaplbio/mytax`
+
+Unix
+
+`docker container run -it --rm -v $PWD:/data jhuaplbio/mytax bash`
+
+Windows PowerShell
+
+`docker container run -it --rm -v $pwd:/data jhuaplbio/mytax bash`
+
+
+
+## Run the installation script
+
+
+Activate the env. It contains the kraken2 and centrifuge scripts needed to build the database, as well as the kraken2 and centrifuge dependencies:
+
+`conda activate mytax`
+
+## Let's make a sample.fastq from test-data
+
+### First, download ncbi taxdump
+
+```
+python3 src/generate_hierarchy.py -o $PWD/taxdump --report test-data/sample.report   -download 
+rm taxdump.tar.gz
+```
+
+
+### DEPRECATED Kraken1 
+
+```
+mkdir -p databases/minikraken1
+wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz -O databases/minikraken1.tgz
+tar -xvzf databases/minikraken1.tgz --directory databases/
+
+export kraken1db=databases/minikraken_20171013_4GB && \
+kraken --db $kraken1db --output test-data/sample.out test-data/sample.fastq && \
+kraken-report --db $kraken1db  test-data/sample.out | tee  test-data/sample.report
+```
+
+
+### Kraken2
+
+### If you've built flukraken2 in tmp
+
+`export KRAKEN2_DEFAULT_DB="tmp/flukraken2"`
+
+### Or, if you have a pre-made minikraken/other kraken db ready
+
+```
+kraken2 --report output/sample_metagenome.first.report --output output/sample_metagenome.first.out --memory-mapping --db ~/Desktop/mytax/minikraken2 example-data/sample_metagenome.first.fastq 
+```
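The two cases above can be folded into one step by picking whichever database directory actually exists. This is a minimal sketch under that assumption; `first_existing_dir` is a hypothetical helper, not part of mytax or kraken2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first argument that is an existing
# directory. Returns 1 (and prints nothing) if none of them exist.
first_existing_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

For example, `export KRAKEN2_DEFAULT_DB="$(first_existing_dir tmp/flukraken2 ~/Desktop/mytax/minikraken2)"` prefers the flukraken2 build and falls back to the pre-made database; the variable is left empty if neither directory is present.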
+
+### Download minikraken2
+
+```
+mkdir -p databases/
+wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz -O databases/minikraken2.tgz
+tar -xvzf databases/minikraken2.tgz --directory databases/ 
+```
+
+### Centrifuge 
+
+#### Install 
+
+`bash install.sh`
+
+#### Set up centrifuge env
+
+```
+mkdir -p databases/centrifuge
+wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed%2Bh%2Bv.tar.gz -O databases/centrifuge.tgz
+tar -xvzf databases/centrifuge.tgz --directory databases/centrifuge/
+```
+
+
+
+#### Run Centrifuge classify 
+
+```
+## If you need to make a new database, see here: $CONDA_PREFIX/lib/centrifuge/centrifuge-build --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp  sample.fastq sample
+
+$CONDA_PREFIX/lib/centrifuge/centrifuge -f -x databases/centrifuge/p_compressed+h+v  -q test-data/sample.fastq  --report test-data/sample.centrifuge.report > test-data/sample.out
+$CONDA_PREFIX/lib/centrifuge/centrifuge-kreport  -x databases/centrifuge/p_compressed+h+v test-data/sample.centrifuge.report > test-data/sample.report
+```
+
+#### Next, generate the hierarchy JSON file
+
+```
+python3 server/src/generate_hierarchy.py \
+-o output/sample_metagenome.first.fullstring \
+--report output/sample_metagenome.first.report \
+-taxdump taxonomy/nodes.dmp
+```
+
+#### Get the JSON for the mytax sunburst plot
+```
+bash server/src/krakenreport2json.sh -i output/sample_metagenome.first.fullstring -o output/sample_metagenome.first.json
+```
+
+The resulting file can then be imported into the sunburst plot at `server/src/sunburst/index.html`, served with a simple HTTP server such as `python3 -m http.server 8080`.