Releases: FelixKrueger/SNPsplit

MGP v8 annotations and GRCm39

07 Jan 11:47

v0.6.0 - GRCm39 genome build and new Docs

SNPsplit

  • Added an option --single_end to skip the paired-end auto-detection entirely (the auto-detection failed for some files, e.g. alignments produced with STAR; see here)

SNPsplit_genome_preparation

  • Changed the chromosome detection regex to a non-greedy match, so that it only uses the NAME entry of ID=NAME (everything up to, but not including, the first comma)
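The difference between a greedy and a non-greedy match can be illustrated with a small Python sketch. The header line and both patterns below are assumptions made for illustration; they are not the regex actually used in SNPsplit_genome_preparation.

```python
import re

# Hypothetical VCF contig header line (illustrative only).
header = "##contig=<ID=chr1,length=195471971,assembly=GRCm39>"

# Greedy: ".+" runs up to the LAST comma, so extra fields leak into the name.
greedy = re.search(r"ID=(.+),", header).group(1)

# Non-greedy: ".+?" stops at the FIRST comma, leaving only the name.
non_greedy = re.search(r"ID=(.+?),", header).group(1)

print(greedy)      # chr1,length=195471971
print(non_greedy)  # chr1
```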

v7-Genome Preparation

26 Jul 07:02

v0.5.0

SNPsplit_genome_preparation

  • Added option --v7_MGP; the script now also accepts the v7 file (mgp_REL2005_snps_indels.vcf.gz) of the Mouse Genomes Project, which may be downloaded here: ftp://ftp-mouse.sanger.ac.uk/REL-2004-v7-SNPs_Indels/mgp_REL2005_snps_indels.vcf.gz. INDEL variants are skipped (this is noted in the report). This new version adds a number of additional strains to choose from, bringing the total to 51 strains:

Available genomes to choose from are:

SEA_GnJ
SM_J
ST_bJ
CAST_EiJ
BALB_cByJ
NON_LtJ
FVB_NJ
RIIIS_J
CE_J
NZO_HlLtJ
C58_J
BTBR_T+_Itpr3tf_J
MOLF_EiJ
BUB_BnJ
C57L_J
CZECHII_EiJ
C57BL_10J
B10.RIII
AKR_J
C3H_HeJ
LP_J
DBA_2J
QSi3
ZALENDE_EiJ
A_J
PL_J
129S1_SvImJ
NZW_LacJ
PWK_PhJ
C57BL_10SnJ
C57BR_cdJ
QSi5
C57BL_6NJ
SWR_J
MA_MyJ
C3H_HeH
SPRET_EiJ
LEWES_EiJ
WSB_EiJ
129P2_OlaHsd
CBA_J
SJL_J
BALB_cJ
KK_HiJ
JF1_MsJ
NZB_B1NJ
I_LnJ
DBA_1J
129S5SvEvBrd
NOD_ShiLtJ
RF_J

If the file mgp_REL2005_snps_indels.vcf.gz is given, --v7 is set automatically.

  • Now attempts to extract the fields FORMAT and INFO from the VCF file automatically, to get access to the required information GT (genotype) and FI (filter). See more here.
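As a rough illustration of locating GT and FI by name rather than by fixed position, here is a minimal Python sketch. The function name, the header line and the example record are hypothetical; the real script may extract these fields differently.

```python
def genotype_and_filter(header_line, data_line, strain):
    """Return (GT, FI) for one strain from a single tab-delimited VCF record."""
    columns = header_line.lstrip("#").rstrip("\r\n").split("\t")
    fields = data_line.rstrip("\r\n").split("\t")

    # Look columns up by name instead of assuming fixed positions.
    format_keys = fields[columns.index("FORMAT")].split(":")
    sample_values = fields[columns.index(strain)].split(":")

    # Pair each FORMAT key with the strain's value at the same index.
    sample = dict(zip(format_keys, sample_values))
    return sample.get("GT"), sample.get("FI")


# Hypothetical header and record, for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tCAST_EiJ"
record = "1\t3000185\t.\tG\tT\t228\tPASS\t.\tGT:GQ:FI\t1/1:99:1"

print(genotype_and_filter(header, record, "CAST_EiJ"))  # ('1/1', '1')
```

Because the keys are resolved per record, the sketch keeps working if the order of the FORMAT keys changes between VCF versions.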

0.4.0 - Soft-clipping, YAML and more

29 Sep 09:51
36fd82a
  • SNPsplit now supports soft-clipping of reads (CIGAR operation S).

  • SNPsplit now writes important statistics out in YAML format to enable easier integration into MultiQC. If tag2sort is called via SNPsplit itself, the ...sort.yaml file will be integrated into the main ...SNPsplit_report.yaml file (and deleted afterwards)

  • Added option --skip_tag2sort to allow the separation of the allele-tagging and allele-sorting (tag2sort) processes. This might be desired in order to add a de-duplication step such as markduplicates or deduplicate_bismark between the two, e.g. in Nextflow pipelines.

  • For genomes that consist of chromosomes for which SNPs are recorded, and scaffolds for which there are no SNPs, now all chromosomes and scaffolds are printed to both the N-masked and full sequence genomes (see here).

  • Added auto-detection of single-end or paired-end files. This avoids accidentally processing paired-end files in single-end mode (see here).

  • Now uses the variable genome_build instead of always assuming GRCm38.
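To show why soft-clipping matters when SNP positions are looked up in a read, the following Python sketch maps read offsets to reference positions from a CIGAR string. It follows the SAM convention that S consumes read bases but no reference bases; the function name is hypothetical and this is not SNPsplit's own code.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_to_reference(pos, cigar):
    """Map 0-based read offsets to 1-based reference positions.

    pos is the SAM POS field: the reference position of the first
    aligned (i.e. non-clipped) base.
    """
    mapping = {}
    read_i, ref_i = 0, pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":    # consumes read and reference
            for k in range(length):
                mapping[read_i + k] = ref_i + k
            read_i += length
            ref_i += length
        elif op in "IS":   # consumes read only; soft-clipped bases stay in SEQ
            read_i += length
        elif op in "DN":   # consumes reference only
            ref_i += length
        # H and P consume neither read nor reference
    return mapping


print(read_to_reference(100, "2S3M"))  # {2: 100, 3: 101, 4: 102}
```

The two soft-clipped bases at the start of the read receive no reference position, so they cannot be used for allele assignment.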

v0.3.4 - Added SNPsplit to Bioconda

28 May 13:58
5df9762
  • Changed /usr/bin/perl to /usr/bin/env perl, which was required for adding SNPsplit to bioconda. Thanks to @vivekbhr for these changes.

  • Fixed output-path handling for paired-end and Hi-C mode (was only working for single-end files).

tag2sort


  • Added option -o/--output_dir to specify an output directory.

Fixed allele-assignment for certain SNPs in --bisulfite mode

15 Jun 14:13

v0.3.3

SNPsplit


  • Changed FindBin qw($Bin) to FindBin qw($RealBin) so that symlinks to tag2sort are resolved properly.

  • In certain cases, specific SNPs were only used for the allele assignment if they were methylated. In more detail: In cases where the SNP was either C/G (REF/ALT) or G/C (REF/ALT), and the read was on the opposing strand, only the methylated form of the C on the reverse strand had previously been allowed as a valid expected base. This has now been changed so that both G and A are considered valid for the strain containing a G at the SNP position (see also this issue).

  • Changed the way in which C>T SNPs are handled in the allele-tagging report (note that this was merely a reporting/interpretation issue and did not have any effect on the actual results). Previously, reads without a call for genome 1 or genome 2 had been listed as:
    reads did not contain one of the expected bases at known SNP positions.
    In a bisulfite setting this also included C>T SNPs however, and hence the number could have been rather high (>10%). I have now changed this so that reads which had at least one C>T SNP and were unassignable at the same time are scored differently:
    reads that were unassignable contained C>T SNPs preventing the assignment

  • Changed all instances of zcat to gunzip -c in SNPsplit and SNPsplit_genome_preparation to prevent errors on certain OSX platforms

v0.3.2 - Much improved SNP genome preparation

29 Mar 10:15

SNPsplit


  • Changed the samtools command throughout SNPsplit to now correctly use the path supplied by the user with --samtools_path. Thanks to Kenzo Hillion for spotting this (see here).

  • Option --genome_build [NAME] should now work as intended (used to be --build only).

SNPsplit_genome_preparation

  • Relaxed SNP filtering criteria to now support multiple homozygous variants for the same position in the genome. This step should increase the number of usable SNPs slightly (but noticeably). See here.

  • Changed the SNP filtering for --dual_hybrid mode to only include positions where both strains had a high confidence call (irrespective of the nature of the call). This step should greatly reduce the number of false positive allele calls. See here for more details.

  • Added a check to SNPsplit_genome_preparation that produces a [FATAL ERROR] if the stored chromosome names are not the same as the ones in the VCF file (which is a rather common mistake when people use the Ensembl VCF file but get the genome from UCSC. This should change soon if and when Ensembl adopts the same standard used by NCBI/UCSC).

  • Added a new version of the genome preparation script that can deal with the latest version of the VCF file for the old NCBIM37 genome build ("mgp.v2.snps.annot.reformat.vcf.gz"). The script is called "SNPsplit_genome_preparation_v2VCF" and may be found in the folder "outdated_VCF_versions" on Github. Please note that this does not include the changes we made to the current version (see above).
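The chromosome-name check mentioned in the list above can be sketched roughly as follows. The function name and the exact comparison rule (VCF names must all exist in the genome, while extra genome scaffolds are tolerated) are assumptions, not the script's documented behaviour.

```python
def check_chromosome_names(genome_chroms, vcf_chroms):
    """Raise if the VCF refers to chromosomes that the genome does not contain."""
    missing = sorted(set(vcf_chroms) - set(genome_chroms))
    if missing:
        raise ValueError(
            "[FATAL ERROR] chromosome names in the VCF file were not found "
            "in the genome: " + ", ".join(missing)
        )


# Ensembl-style names on both sides: passes silently.
check_chromosome_names(["1", "2", "MT"], ["1", "2", "MT"])

# Ensembl-style VCF against a UCSC-style genome: fails.
try:
    check_chromosome_names(["chr1", "chr2", "chrM"], ["1", "2", "MT"])
except ValueError as err:
    print(err)
```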

Automated genome preparation for single or dual hybrid strains

18 May 16:08

SNPsplit


  • Changed sorting command for BAM files to also work with Samtools versions 1.3+
  • The sorting report for single-end files is now also written to the report files.
  • Added the # of SNPs used for the allele-discrimination to the report file to make it easier to spot errors
  • Now removing CR and LF line endings when reading in the SNP file. For SNP annotation files copied from a Windows machine, we saw cases with no allele-specific reads for genome 2 at all; this was caused by the invisible \r character attached to the SNP call
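The line-ending problem is easy to reproduce in a few lines of Python. The five-column, tab-delimited layout used here (ID, chromosome, position, strand, Ref/SNP) is an assumption for illustration, and parse_snp_line is a hypothetical helper.

```python
def parse_snp_line(line):
    """Parse one SNP annotation line, tolerating Windows (CRLF) line endings."""
    # Strip both "\r" and "\n"; removing only "\n" would leave "\r" on the SNP base.
    fields = line.rstrip("\r\n").split("\t")
    ref, snp = fields[4].split("/")
    return fields[1], int(fields[2]), ref, snp


windows_line = "id1\t1\t3000185\t1\tG/T\r\n"

# Removing only the newline leaves an invisible "\r" attached to the SNP base,
# so "T\r" would never equal the "T" observed in a read.
print(repr(windows_line.rstrip("\n").split("\t")[4]))  # 'G/T\r'

print(parse_snp_line(windows_line))  # ('1', 3000185, 'G', 'T')
```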

SNPsplit_genome_preparation


Added whole new functionality to construct single- or dual-hybrid genomes starting from VCF files, which are obtainable from the Mouse Genomes Project (http://www.sanger.ac.uk/science/data/mouse-genomes-project). Here is a brief description of what it does:

SNPsplit_genome_preparation is designed to read in variant call files from the Mouse Genomes Project (e.g. this latest file: ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz) and generate new genome versions where the strain SNPs are either incorporated into the new genome (full sequence) or masked by the ambiguity nucleobase 'N' (N-masking).

SNPsplit_genome_preparation may be run in two different modes:

Single strain mode:

  1. The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
  2. The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated either as N-masking (default) or full sequence (option --full_sequence)
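The two ways of incorporating SNPs can be sketched in Python as follows. This is a simplified illustration on a single in-memory sequence; the function name and the dictionary layout are hypothetical, and the actual script works on whole genomes.

```python
def incorporate_snps(sequence, snps, full_sequence=False):
    """Return a copy of sequence with SNP positions N-masked or replaced.

    snps maps a 1-based position to a (ref_base, alt_base) tuple.
    """
    bases = list(sequence)
    for pos, (ref, alt) in snps.items():
        # Sanity check: the annotated reference base must match the genome.
        if bases[pos - 1].upper() != ref:
            raise ValueError(f"reference base mismatch at position {pos}")
        bases[pos - 1] = alt if full_sequence else "N"
    return "".join(bases)


snps = {3: ("G", "T")}
print(incorporate_snps("ACGTACGT", snps))                      # ACNTACGT
print(incorporate_snps("ACGTACGT", snps, full_sequence=True))  # ACTTACGT
```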

Dual strain mode:

  1. The VCF file is read and filtered for high-confidence SNPs in the strain specified with --strain <name>
  2. The reference genome (given with --reference_genome <genome>) is read into memory, and the filtered high-confidence SNP positions are incorporated as full sequence and optionally as N-masking
  3. The VCF file is read one more time and filtered for high-confidence SNPs in strain 2 specified with --strain2 <name>
  4. The filtered high-confidence SNP positions of strain 2 are incorporated as full sequence and optionally as N-masking
  5. The SNP information of strain and strain 2 relative to the reference genome build are compared, and a new Ref/SNP annotation is constructed whereby the new Ref/SNP information will be Strain/Strain2 (and no longer the standard reference genome strain Black6 (C57BL/6J))
  6. The full genome sequence given with --strain <name> is read into memory, and the high-confidence SNP positions between Strain and Strain2 are incorporated as full sequence and optionally as N-masking
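Step 5 of the dual strain mode, re-expressing the SNPs as Strain/Strain2, might look roughly like the sketch below. It makes a simplifying assumption that a strain carries the reference base wherever it has no recorded SNP; the real filtering may be stricter and may skip such positions. The function name and data layout are hypothetical.

```python
def strain_vs_strain2(snps1, snps2):
    """Build a Strain/Strain2 annotation from two Ref/SNP annotations.

    Each input maps a position to (ref_base, alt_base) relative to the
    reference genome. The result maps a position to (strain_base, strain2_base).
    """
    annotation = {}
    for pos in sorted(set(snps1) | set(snps2)):
        ref = (snps1.get(pos) or snps2[pos])[0]
        # Assumption: no recorded SNP means the strain has the reference base.
        base1 = snps1[pos][1] if pos in snps1 else ref
        base2 = snps2[pos][1] if pos in snps2 else ref
        if base1 != base2:  # keep only positions where the two strains differ
            annotation[pos] = (base1, base2)
    return annotation


strain = {10: ("A", "G"), 20: ("C", "T")}
strain2 = {20: ("C", "T"), 30: ("G", "A")}
print(strain_vs_strain2(strain, strain2))  # {10: ('G', 'A'), 30: ('G', 'A')}
```

Position 20 drops out because both strains carry the same base there, so it cannot distinguish the two alleles.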

The resulting .fa files are ready to be indexed with your favourite aligner. Tried and tested aligners include Bowtie2, Tophat, STAR, Hisat2, HiCUP and Bismark. Please note that STAR and Hisat2 may require you to disable soft-clipping; please see the SNPsplit manual for more details.

Both the SNP filtering and the genome preparation write out little report files for record keeping.