Create spike-in dataset with Eukaryotic and viral DNA #48

Open
brooksph opened this issue Feb 8, 2018 · 3 comments

@brooksph
Contributor

brooksph commented Feb 8, 2018

Expected behavior

These datasets were identified by Nicolette (SigSci). Updates from Nicolette: https://docs.google.com/document/d/1uYqs939MU55D_3La8RUc8NJIYVrjLmVD-q_Tw5_Bjbg/edit

Viral DNA spike-ins (non-assembled datasets):
Organism: Human betaherpesvirus 5 (dsDNA)
  Technology: Illumina Genome Analyzer II https://www.ncbi.nlm.nih.gov/sra/ERX004415[accn]
  Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/ERX2083171[accn]
Organism: Human gammaherpesvirus 4 (dsDNA) (alternate name: Epstein-Barr virus)
  Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/ERX218636[accn]
Organism: Vaccinia virus (dsDNA)
  Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX2421177[accn]
Organism: Cowpox virus (dsDNA)
  Technology: Illumina MiSeq https://www.ncbi.nlm.nih.gov/sra/SRX3106169[accn]
Organism: Torque teno virus (ssDNA)
  Technology: Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX1762570[accn]
Organism: Adeno-associated virus (ssDNA)
  Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/SRX1960902[accn]
Organism: Human bocavirus 1 (ssDNA)
  Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/ERX1470610[accn]
Organism: Enterobacteria phage T7
  Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/SRX2365806[accn]
Organism: Enterobacteria phage T3
  Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX209596[accn]
Organism: Bacillus phage BC01
  Technology: Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX3214803[accn]

Eukaryotic microbe spike-ins:
Organism: Saccharomyces cerevisiae Y12
  Technology: Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX2487940[accn]
  Technology: PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX2485790[accn]
Organism: Schizosaccharomyces kambucha strain:SZY13
  Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX521792[accn]
  Technology: PacBio RS https://www.ncbi.nlm.nih.gov/sra/SRX521793[accn]
Organism: Colletotrichum higginsianum IMI 349063
  Technology: Illumina HiSeq 1500 https://www.ncbi.nlm.nih.gov/sra/SRX2765599[accn]
  Technology: PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX1567884[accn]
Organism: [Candida] auris
  Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/SRX1939498[accn]
  Technology: PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX1939493[accn]
Organism: Fusarium poae isolate 2516
  Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX1977327[accn]
  Technology: PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX1977328[accn]
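
In case it helps with scripting the downloads, here is a rough Python sketch that turns the list above into organism/platform/accession records. It assumes the list has been saved as plain text in the layout shown here; `parse_spike_in_list` is a hypothetical helper name, not something that exists in this project, and the `Technology:` prefix is treated as optional.

```python
import re

# Matches lines such as
#   Technology: Illumina MiSeq https://www.ncbi.nlm.nih.gov/sra/SRX3106169[accn]
# The "Technology:" prefix is optional (assumption: both layouts may occur).
LINK = re.compile(
    r"^(?:Technology:\s*)?(.+?)\s+"
    r"https://www\.ncbi\.nlm\.nih\.gov/sra/([A-Z]+\d+)\[accn\]$"
)


def parse_spike_in_list(text):
    """Return one dict per SRA link, tagged with the most recent organism."""
    records = []
    organism = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Organism:"):
            organism = line[len("Organism:"):].strip()
            continue
        match = LINK.match(line)
        if match and organism is not None:
            records.append(
                {
                    "organism": organism,
                    "platform": match.group(1),
                    "accession": match.group(2),
                }
            )
    return records


# Two entries copied from the list above.
example = """Organism: Cowpox virus (dsDNA)
Technology: Illumina MiSeq https://www.ncbi.nlm.nih.gov/sra/SRX3106169[accn]
Organism: Saccharomyces cerevisiae Y12
Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX2487940[accn]
PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX2485790[accn]
"""

records = parse_spike_in_list(example)
for rec in records:
    print(rec["accession"], rec["platform"], rec["organism"], sep="\t")
```

Organisms that end up with both an Illumina and a PacBio record are the candidates for a hybrid dataset; the rest are short-read only.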

Actual behavior

Steps to reproduce the behavior

@brooksph
Contributor Author

brooksph commented Feb 8, 2018

One exclusively short-read dataset and one hybrid (short- plus long-read) dataset.

@kternus
Collaborator

kternus commented Apr 5, 2018

If it's helpful, the simulated "frankengenome" dataset has a crazy mix of short bacterial, archaeal, viral, and eukaryotic reads:
https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/UnAmbiguouslyMapped_ds.frankengenome.fq.gz

I attached a truth file for it that follows the same format as the other unambiguously mapped datasets in the McIntyre et al. 2017 study:
UnAmbiguouslyMapped_ds_frankengenome_TRUTH.txt

Column 1 = NCBI Taxonomy ID
Column 2 = Number of reads simulated from that organism
Column 3 = Abundance of that organism in the dataset
Column 4 = Rank
Column 5 = Species name
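
A small Python sketch for loading a truth file with these five columns, in case it saves someone a few minutes. It assumes the file is tab-delimited with no header row, which is not confirmed here, so adjust `delimiter` if needed. `TruthRow` and `read_truth` are hypothetical names, and the read counts and abundances in the sample are invented for illustration only.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TruthRow:
    taxid: int        # Column 1: NCBI Taxonomy ID
    read_count: int   # Column 2: number of reads simulated from the organism
    abundance: float  # Column 3: abundance of the organism in the dataset
    rank: str         # Column 4: rank
    species: str      # Column 5: species name


def read_truth(handle, delimiter="\t"):
    """Parse a five-column truth file (assumed tab-delimited, no header)."""
    rows = []
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        rows.append(
            TruthRow(
                taxid=int(fields[0]),
                read_count=int(fields[1]),
                abundance=float(fields[2]),
                rank=fields[3],
                species=fields[4],
            )
        )
    return rows


# Invented sample values, NOT taken from the real truth file.
sample = (
    "562\t750\t0.75\tspecies\tEscherichia coli\n"
    "4932\t250\t0.25\tspecies\tSaccharomyces cerevisiae\n"
)

rows = read_truth(io.StringIO(sample))
total_reads = sum(row.read_count for row in rows)
print(total_reads)
```

For the real file, replace the `io.StringIO(sample)` handle with `open("UnAmbiguouslyMapped_ds_frankengenome_TRUTH.txt", newline="")`.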

@kternus
Collaborator

kternus commented Apr 5, 2018

The frankengenome won't be a good resource for generating spike-in datasets, but it's an additional dataset option if you'd like to do more testing with reads from viruses and eukaryotes.
