data_sets

Rod Docking edited this page Oct 20, 2017 · 11 revisions

Data Sets

Fusion Caller Output Files

Set 1. chimeraviz example data

This data set contains example fusion-caller output from the chimeraviz Bioconductor package.

Root directory: /home/projects/hackseq17_3/datasets/chimeraviz_examples/

Relevant files:

  • FusionMap_01_TestDataset_InputFastq.FusionReport.txt - FusionMap output
  • PRADA.acc.fusion.fq.TAF.tsv - PRADA output
  • defuse_833ke_results.filtered.tsv - deFuse output
  • ericscript_SRR1657556.results.total.tsv - EricScript output
  • fusioncatcher_833ke_final-list-candidate-fusion-genes.txt - FusionCatcher output
  • infusion_fusions.txt - InFusion output
  • jaffa_results.csv - JAFFA output
  • soapfuse_833ke_final.Fusion.specific.for.genes - SOAPfuse output
  • star-fusion.fusion_candidates.final.abridged.txt - STAR-Fusion output
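
The file list above can be captured as a small lookup table, for example to confirm that every expected file is present before pointing a parser at it. This is a minimal Python sketch, not part of any existing tool: the names `CHIMERAVIZ_FILES` and `missing_files` are made up for illustration, and only the paths and file names come from this page.

```python
from pathlib import Path

# Root directory as listed on this page.
CHIMERAVIZ_ROOT = Path("/home/projects/hackseq17_3/datasets/chimeraviz_examples")

# File name -> fusion caller that produced it (copied from the list above).
CHIMERAVIZ_FILES = {
    "FusionMap_01_TestDataset_InputFastq.FusionReport.txt": "FusionMap",
    "PRADA.acc.fusion.fq.TAF.tsv": "PRADA",
    "defuse_833ke_results.filtered.tsv": "deFuse",
    "ericscript_SRR1657556.results.total.tsv": "EricScript",
    "fusioncatcher_833ke_final-list-candidate-fusion-genes.txt": "FusionCatcher",
    "infusion_fusions.txt": "InFusion",
    "jaffa_results.csv": "JAFFA",
    "soapfuse_833ke_final.Fusion.specific.for.genes": "SOAPfuse",
    "star-fusion.fusion_candidates.final.abridged.txt": "STAR-Fusion",
}


def missing_files(root=CHIMERAVIZ_ROOT, expected=CHIMERAVIZ_FILES):
    """Return the expected file names not found directly under root, sorted."""
    root = Path(root)
    return sorted(name for name in expected if not (root / name).is_file())


# Hypothetical usage on ORCA:
#   for name in missing_files():
#       print(f"missing: {name} ({CHIMERAVIZ_FILES[name]})")
```

Keeping the tool name next to each file name makes it straightforward to dispatch to a caller-specific parser later.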

Set 2. FusionCatcher example data

This data set contains a small subset of FASTQ reads that support the fusions described on the FusionCatcher GitHub page.

Root directory: /home/projects/hackseq17_3/datasets/fusioncatcher_examples/

Relevant files:

  • final-list_candidate-fusion-genes.txt - FusionCatcher output
  • readme.txt - README describing the detected fusions
  • reads_1.fq.gz - Read 1 file
  • reads_2.fq.gz - Read 2 file
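
Before running a caller on the paired files above, it can be useful to check that the two FASTQ files are in step. The following is a rough sketch using only the Python standard library. The function name `count_read_pairs` is illustrative, and the code assumes plain four-line FASTQ records (no wrapped sequence lines), which has not been verified for these files.

```python
import gzip
from itertools import zip_longest


def count_read_pairs(r1_path, r2_path):
    """Count read pairs in two gzipped FASTQ files.

    Assumes four lines per record. Raises ValueError if the files have
    different numbers of lines, if a record is truncated, or if a header
    line does not start with '@'.
    """
    pairs = 0
    n_lines = 0
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
        for i, (line1, line2) in enumerate(zip_longest(r1, r2)):
            if line1 is None or line2 is None:
                raise ValueError("files have different numbers of lines")
            if i % 4 == 0:
                # First line of each record is the read header.
                if not (line1.startswith("@") and line2.startswith("@")):
                    raise ValueError(f"line {i + 1} is not a FASTQ header")
                pairs += 1
            n_lines = i + 1
    if n_lines % 4 != 0:
        raise ValueError("last record is truncated")
    return pairs


# Hypothetical usage on ORCA:
#   root = "/home/projects/hackseq17_3/datasets/fusioncatcher_examples/"
#   print(count_read_pairs(root + "reads_1.fq.gz", root + "reads_2.fq.gz"))
```

This only checks that the files line up record for record. It does not compare read names between the two files.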

Set 3. AML Cell Line Data

This data set contains fusion results for three technical replicates from an AML cell line.

Root directory: /home/projects/hackseq17_3/datasets/aml_cell_line_examples/

Under the root directory, you'll find directories named by tool and library. There are results for FusionCatcher, deFuse, EricScript, STAR-Fusion (with Oncofuse annotations), and PAVfinder.

Annotation Data Sources

Tumor Fusion Gene Data Portal

  • Located online at PanCanFusV2
  • Downloaded to ORCA at /home/projects/hackseq17_3/annotation_sources/tumour_fusion_gene_data_portal/
  • Contains 17,754 observations of 27 variables, in a format that is amenable to conversion to BEDPE
  • Annotations include recurrence in TCGA tumour types, as well as additional manual and automated curations
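
As a rough illustration of the BEDPE conversion mentioned above, the sketch below turns one table row into a 10-column BEDPE line. The column names (`gene_a`, `chrom_a`, `pos_a`, `strand_a`, and the `_b` equivalents) are hypothetical placeholders, not the portal's real headers, and the code assumes that breakpoint positions are 1-based. Both assumptions would need checking against the downloaded file.

```python
def to_bedpe(row):
    """Convert one fusion record (a dict) to a tab-separated BEDPE line.

    Assumptions (not confirmed against the portal data):
      * the dict keys below are placeholder column names
      * breakpoint positions are 1-based single coordinates
    BEDPE uses 0-based, half-open intervals, so a 1-based position p
    becomes the interval [p - 1, p).
    """
    pos_a = int(row["pos_a"])
    pos_b = int(row["pos_b"])
    fields = [
        row["chrom_a"], pos_a - 1, pos_a,      # chrom1, start1, end1
        row["chrom_b"], pos_b - 1, pos_b,      # chrom2, start2, end2
        f'{row["gene_a"]}--{row["gene_b"]}',   # name
        ".",                                   # score (unknown)
        row["strand_a"], row["strand_b"],      # strand1, strand2
    ]
    return "\t".join(str(field) for field in fields)


# Made-up record, for illustration only (not real portal data).
example = {
    "gene_a": "GENE1", "chrom_a": "chr1", "pos_a": "1000", "strand_a": "+",
    "gene_b": "GENE2", "chrom_b": "chr2", "pos_b": "2000", "strand_b": "-",
}
print(to_bedpe(example))
```

The remaining annotation columns could be appended after the tenth field, since BEDPE permits additional user-defined columns.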

Database of Genomic Variants

  • Located online at DGV
  • The latest release of the GRCh37 dataset is on ORCA at /home/projects/hackseq17_3/annotation_sources/dgv/
  • Contains 392,583 observations of 20 variables. These are mainly CNVs, insertions, and deletions, so this source is likely to be less relevant here
  • Issue #14

FusionCatcher Data Sources

  • FusionCatcher bundles a large number of annotation resources
  • These are described in the FusionCatcher manual
  • Most of them are simply lists of Ensembl gene IDs
  • Downloaded on ORCA at /home/projects/hackseq17_3/tools/fusioncatcher_install/fusioncatcher/data/human_v89/
  • Issue #21

Atlas of Genetics and Cytogenetics in Oncology and Haematology

  • Atlas of Genetics and Cytogenetics in Oncology and Haematology
  • There appears to be no API or bulk download access
  • This database contains a lot of well-curated information, but it may only be possible to query it through the web interface
  • Depending on the eventual review interface, it may be possible to link out to this resource, but including it in an automated way does not look possible

CIViC

  • CIViC contains clinical interpretations of variants in cancer
  • There is both API and bulk download access
  • The September 2017 release has been downloaded to ORCA at /home/projects/hackseq17_3/annotation_sources/civic/
  • The contents of the individual files are described in Issue #13
  • It looks like there are ~85 annotated fusions

ChimerDB

  • ChimerDB
  • Contains three kinds of data:
    • "ChimerKB represents a knowledgebase including 1,066 fusion genes with manual curation that were compiled from public resources of fusion genes with experimental evidences."
    • "ChimerPub includes 2,767 fusion genes obtained from text mining of PubMed abstracts."
    • "ChimerSeq module is designed to archive the fusion candidates from deep sequencing data."
  • These data files are available as MySQL and Excel-formatted dump files
  • These aren't downloaded to ORCA yet

TICdb

  • TICdb
  • 1,374 annotated fusions, with annotations of the gene partners and the actual fusion sequence, and links to PubMed or GenBank
  • Not downloaded to ORCA yet

ChiTaRS

  • ChiTaRS
  • 20,754 annotations for humans, downloaded from http://chitars.bioinfo.cnio.es/downloads.html
  • Downloaded to ORCA at: /home/projects/hackseq17_3/annotation_sources/chitars/all_human_ChiTaRS_coord.csv
  • Note that this data source does not appear to have been updated since late 2014

COSMIC