Skip to content

Commit

Permalink
Create v1.md
Browse files Browse the repository at this point in the history
  • Loading branch information
Merritt-Brian authored Aug 9, 2024
1 parent df42d93 commit 74e05b8
Showing 1 changed file with 167 additions and 0 deletions.
167 changes: 167 additions & 0 deletions v1.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@

## Original Mytax v1.0


This is the repository for mytax, a tool for building custom taxonomies, which can aid nucleotide sequence classification.

# Installation

Clone this repo with:

`git clone https://github.com/jhuapl-bio/mytax`

Symbolically link all shell scripts into your path, for example with:

`find v1 -name "*.sh" | while read fn; do sudo ln -s $PWD/$fn /usr/local/bin; done`

# Dependencies

- jellyfish (version 1) - https://www.cbcb.umd.edu/software/jellyfish/
- kraken (version 1) - https://ccb.jhu.edu/software/kraken/
- gawk - https://www.gnu.org/software/gawk/manual/html_node/Installation.html
- perl
- GNU CoreUtils

- 16 GB of RAM is needed to build the provided influenza kraken database

# Usage

## Building example

This pipeline is built from a central set of scripts located in the `v1` directory

Build flu-kraken example with:

`build_flukraken.sh -k flukraken-$(date +"%F")`

The single script `build_flukraken.sh` functions as an outer wrapper for the influenza classification example using the Kraken classifier published in the mytax paper.


`build_flukraken.sh` can also be used as a model to build modified pipelines as desired. It is built from four main sub-modules:
```
download_IVR.sh -> download references and taxonomy from IVR
build_IVR_metadata.sh -> build tab-delimited metadata table in format for mytax
build_taxonomy.sh -> build custom taxonomy from tab-delimited table
build_krakendb.sh -> add new taxonomic IDs to reference FASTA, build kraken database, post-process database for visualization pipeline
```

`build_krakendb.sh` currently references three helper scripts, which also need to be in the PATH:
```
fix_references.sh -> adds new taxonomic IDs to reference FASTA
kraken-build -> builds kraken database
process_krakendb.sh -> post-processes database for visualization pipeline (not included in this repo yet)
```



## Running process script on kraken/kraken2 report and outfiles

### If running from Docker

docker build . -t jhuaplbio/mytax

Unix

`docker container run -it --rm -v $PWD:/data jhuaplbio/mytax bash`

Windows Powershell

`docker container run -it --rm -v $pwd:/data jhuaplbio/mytax bash`



## Run the installation script


# Activate the env, this will contain kraken2 and centrifuge scripts to build the database if needed as well as kraken2 and centrifuge dependencies

`conda activate mytax`

## Lets make a sample.fastq from test-data

### First, download ncbi taxdump

```
python3 src/generate_hierarchy.py -o $PWD/taxdump --report test-data/sample.report -download
rm taxdump.tar.gz
```


### DEPRECATED Kraken1

```
mkdir -p databases/minikraken1
wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz -O databases/minikraken1.tgz
tar -xvzf databases/minikraken1.tgz --directory databases/
export kraken1db=databases/minikraken_20171013_4GB && \
kraken --db $kraken1db --output test-data/sample.out test-data/sample.fastq && \
kraken-report --db $kraken1db test-data/sample.out | tee test-data/sample.report
```


### Kraken2

### IF you've made flukraken2 in tmp or....

`export KRAKEN2_DEFAULT_DB="tmp/flukraken2`

### IF you have a pre-made minikraken/other kraken db ready

```
kraken2 --report output/sample_metagenome.first.report --output output/sample_metagenome.first.out --memory-mapping --db ~/Desktop/mytax/minikraken2 example-data/sample_metagenome.first.fastq
```

### Download minikraken2

```
mkdir -p databases/
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz -O databases/minikraken2.tgz
tar -xvzf databases/minikraken2.tgz --directory databases/
```

### Centrifuge

#### Install

`bash install.sh`

#### Set up centrifuge env

```
mkdir -p databases/centrifuge
wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed%2Bh%2Bv.tar.gz -O databases/centrifuge.tgz
tar -xvzf databases/centrifuge.tgz --directory databases/centrifuge/
```



#### Run Centrifuge classify

```
## If you need to make a new database, see here: $CONDA_PREFIX/lib/centrifuge/centrifuge-build --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp sample.fastq sample
$CONDA_PREFIX/lib/centrifuge/centrifuge -f -x databases/centrifuge/p_compressed+h+v -q test-data/sample.fastq --report test-data/sample.centrifuge.report > test-data/sample.out
$CONDA_PREFIX/lib/centrifuge/centrifuge-kreport -x databases/centrifuge/p_compressed+h+v test-data/sample.centrifuge.report > test-data/sample.report
```

#### Next, generate the hierarchy json file

```
python3 server/src/generate_hierarchy.py \
-o output/sample_metagenome.first.fullstring \
--report output/sample_metagenome.first.report \
-taxdump taxonomy/nodes.dmp
```

#### Get the json for mytax sunburst plot
```
bash server/src/krakenreport2json.sh -i output/sample_metagenome.first.fullstring -o output/sample_metagenome.first.json
```

The resulting file can then imported into the sunburst plot at `server/src/sunburst/index.html` rendered with a simple `http.server` protocol like `python3 -m http.server 8080`

0 comments on commit 74e05b8

Please sign in to comment.