# Data Sets

## Fusion Caller Output Files

### Set 1. chimeraviz example data

This data set contains example fusion output from the [chimeraviz](https://bioconductor.org/packages/release/bioc/html/chimeraviz.html) Bioconductor package.

Root directory: `/home/projects/hackseq17_3/datasets/chimeraviz_examples/`

Relevant files:

- `FusionMap_01_TestDataset_InputFastq.FusionReport.txt` - FusionMap output
- `PRADA.acc.fusion.fq.TAF.tsv` - PRADA output
- `defuse_833ke_results.filtered.tsv` - deFuse output
- `ericscript_SRR1657556.results.total.tsv` - EricScript output
- `fusioncatcher_833ke_final-list-candidate-fusion-genes.txt` - FusionCatcher output
- `infusion_fusions.txt` - InFusion output
- `jaffa_results.csv` - JAFFA output
- `soapfuse_833ke_final.Fusion.specific.for.genes` - SOAPfuse output
- `star-fusion.fusion_candidates.final.abridged.txt` - STAR-Fusion output

### Set 2. FusionCatcher example data

This data set contains a small pair of FASTQ files with reads supporting the fusions described [on the FusionCatcher GitHub page](https://github.com/ndaniel/fusioncatcher/tree/master/test).

Root directory: `/home/projects/hackseq17_3/datasets/fusioncatcher_examples/`

Relevant files:

- `final-list_candidate-fusion-genes.txt` - FusionCatcher output
- `readme.txt` - README describing the detected fusions
- `reads_1.fq.gz` - Read 1 file
- `reads_2.fq.gz` - Read 2 file

### Set 3. AML Cell Line Data

This data set contains fusion results for three technical replicates from an AML cell line.

Root directory: `/home/projects/hackseq17_3/datasets/aml_cell_line_examples/`

Under the root directory, you'll find directories named by tool and library. There are results for fusioncatcher, defuse, ericscript, STAR-fusion (with Oncofuse annotations), and PAVfinder.
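As a quick sanity check on the Set 1 files, the sketch below maps each caller to its file name and reports how many columns and rows each table has. The file names and root directory come from the listing above; the helper names (`guess_delimiter`, `read_fusion_table`, `summarize`) are hypothetical, and the delimiter rule (comma for `.csv`, tab for everything else) is an assumption that may not hold for every caller's format.

```python
import csv
from pathlib import Path

# Root directory and file names are copied from the Set 1 listing.
CHIMERAVIZ_ROOT = Path("/home/projects/hackseq17_3/datasets/chimeraviz_examples")

CALLER_FILES = {
    "FusionMap": "FusionMap_01_TestDataset_InputFastq.FusionReport.txt",
    "PRADA": "PRADA.acc.fusion.fq.TAF.tsv",
    "deFuse": "defuse_833ke_results.filtered.tsv",
    "EricScript": "ericscript_SRR1657556.results.total.tsv",
    "FusionCatcher": "fusioncatcher_833ke_final-list-candidate-fusion-genes.txt",
    "InFusion": "infusion_fusions.txt",
    "JAFFA": "jaffa_results.csv",
    "SOAPfuse": "soapfuse_833ke_final.Fusion.specific.for.genes",
    "STAR-Fusion": "star-fusion.fusion_candidates.final.abridged.txt",
}


def guess_delimiter(path):
    """Assumption: .csv files are comma-separated, all others tab-separated."""
    return "," if Path(path).suffix.lower() == ".csv" else "\t"


def read_fusion_table(path):
    """Return (header, rows); rows are dicts keyed by the first line's fields."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=guess_delimiter(path))
        rows = list(reader)
        return reader.fieldnames, rows


def summarize(root=CHIMERAVIZ_ROOT):
    """Map each caller to (n_columns, n_rows), or None if its file is missing."""
    summary = {}
    for caller, filename in CALLER_FILES.items():
        path = Path(root) / filename
        if not path.exists():
            summary[caller] = None
            continue
        header, rows = read_fusion_table(path)
        summary[caller] = (len(header or []), len(rows))
    return summary


if __name__ == "__main__":
    for caller, shape in summarize().items():
        print(caller, shape)
```

Treating the first line as the header is also an assumption; any caller that writes comment lines before its header would need extra handling.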
## Annotation Data Sources

### Tumor Fusion Gene Data Portal

- Located online at [PanCanFusV2](http://54.84.12.177/PanCanFusV2/)
- Downloaded to ORCA at `/home/projects/hackseq17_3/annotation_sources/tumour_fusion_gene_data_portal/`
- Contains 17,754 observations of 27 variables, in a format that is amenable to conversion to BEDPE
- Annotations include recurrence in TCGA tumour types, as well as additional manual and automated curations

### Database of Genomic Variants

- Located online at [DGV](http://dgv.tcag.ca/dgv/app/home)
- Latest release of the GRCh37 dataset is on ORCA at `/home/projects/hackseq17_3/annotation_sources/dgv/`
- Contains 392,583 observations of 20 variables. These are mainly CNVs, insertions, and deletions, so this source is likely to be less relevant here
- [Issue #14](https://github.com/rdocking/fusebench/issues/14)

### FusionCatcher Data Sources

- [FusionCatcher](https://github.com/ndaniel/fusioncatcher) bundles many annotation resources
- These are described in the [FusionCatcher manual](https://github.com/ndaniel/fusioncatcher/blob/master/doc/manual.md#23---genomic-databases)
- They consist mainly of lists of Ensembl gene IDs
- Downloaded on ORCA at `/home/projects/hackseq17_3/tools/fusioncatcher_install/fusioncatcher/data/human_v89/`
- [Issue #21](https://github.com/rdocking/fusebench/issues/21)

### Atlas of Genetics and Cytogenetics in Oncology and Haematology

- [Atlas of Genetics and Cytogenetics in Oncology and Haematology](http://atlasgeneticsoncology.org/)
- There appears to be no API or download access
- This database contains a lot of well-curated information, but it may only be possible to query it through the web interface
- Depending on the eventual review interface, it may be possible to link to this resource, but it doesn't look possible to include it in an automated way
### CIViC

- [CIViC](https://civic.genome.wustl.edu/home) contains clinical interpretations of variants in cancer
- There is both API and bulk download access
- The September 2017 release has been downloaded on ORCA at `/home/projects/hackseq17_3/annotation_sources/civic/`
- The contents of the individual files are described in [#13](https://github.com/rdocking/fusebench/issues/13)
- There appear to be ~85 annotated fusions

### ChimerDB

- [ChimerDB](http://203.255.191.229:8080/chimerdbv31/mindex.cdb)
- Contains three kinds of data:
  - "ChimerKB represents a knowledgebase including 1,066 fusion genes with manual curation that were compiled from public resources of fusion genes with experimental evidences."
  - "ChimerPub includes 2,767 fusion genes obtained from text mining of PubMed abstracts."
  - "ChimerSeq module is designed to archive the fusion candidates from deep sequencing data."
- These data files are available as MySQL and Excel-formatted dump files
- Not downloaded to ORCA yet

### TICdb

- [TICdb](http://www.unav.es/genetica/TICdb/)
- 1,374 annotated fusions, with annotations of the gene partners and the actual fusion sequence, and links to PubMed or GenBank
- Not downloaded to ORCA yet

### ChiTaRS

- [ChiTaRS](http://chitars.bioinfo.cnio.es/)
- 20,754 annotations for humans, downloaded from `http://chitars.bioinfo.cnio.es/downloads.html`
- Downloaded to ORCA at: `/home/projects/hackseq17_3/annotation_sources/chitars/all_human_ChiTaRS_coord.csv`
- Note that this data source does not appear to have been updated since late 2014

### COSMIC

- [Catalogue of Somatic Mutations in Cancer](http://cancer.sanger.ac.uk/cosmic)
- [Cell Lines Project](http://cancer.sanger.ac.uk/cell_lines)
- There is an API for querying COSMIC, and bulk downloads are available at [downloads](http://cancer.sanger.ac.uk/cosmic/download)
- The downloads are via SFTP; `@rdocking` has credentials but hasn't downloaded anything to ORCA yet
- See also [#29](https://github.com/rdocking/fusebench/issues/29)
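The Tumor Fusion Gene Data Portal table above is described as amenable to conversion to BEDPE. A minimal sketch of what that conversion could look like follows, using the ten standard BEDPE columns (chrom1, start1, end1, chrom2, start2, end2, name, score, strand1, strand2). The input column names (`chrom_a`, `pos_a`, `gene_a`, and so on) are hypothetical placeholders, not the portal's real headers, and the sketch assumes 1-based breakpoint positions, which would need checking against the actual file.

```python
import csv
import io


def breakpoint_to_interval(position):
    """Convert a 1-based breakpoint to a 0-based, half-open (start, end) pair."""
    pos = int(position)
    return pos - 1, pos


def fusion_to_bedpe(record):
    """Build one BEDPE row from a dict describing a fusion.

    All keys read from `record` are hypothetical column names chosen for
    this sketch; map them to the real headers of the source table.
    """
    start1, end1 = breakpoint_to_interval(record["pos_a"])
    start2, end2 = breakpoint_to_interval(record["pos_b"])
    return [
        record["chrom_a"], start1, end1,
        record["chrom_b"], start2, end2,
        f'{record["gene_a"]}--{record["gene_b"]}',  # name column
        ".",                                         # score: unknown
        record.get("strand_a", "."),                 # "." when strand is unknown
        record.get("strand_b", "."),
    ]


def write_bedpe(records, handle):
    """Write records as tab-separated BEDPE lines to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for record in records:
        writer.writerow(fusion_to_bedpe(record))


if __name__ == "__main__":
    # Made-up record for illustration only; not taken from any data source.
    example = {
        "chrom_a": "chr1", "pos_a": "1000", "gene_a": "GENE_A",
        "chrom_b": "chr2", "pos_b": "5000", "gene_b": "GENE_B",
        "strand_a": "+",
    }
    buffer = io.StringIO()
    write_bedpe([example], buffer)
    print(buffer.getvalue(), end="")
```

Representing each breakpoint as a one-base interval keeps the output usable with interval-based tools; a source that reports breakpoint ranges rather than single positions would need a different `breakpoint_to_interval`.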