Merge pull request #5 from FelixKrueger/dev

Update to work with GRCm39 genome and v8 annotations
FelixKrueger · Apr 12, 2023 · 5688543
2 parents 03bc071 + ba0b416
commit 5688543

Showing 6 changed files with 132 additions and 77 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,14 @@
 # reStrainingOrder Changelog
 
+## Changelog for version 0.4.0 [release on 12 April 2023]
+
+Updated documentation to reflect the changes made by switching over to the `GRCm39` mouse genome, as well as the v8 annotation files. The v5 and v7 versions are probably no longer available for download, but we have left the option `--v7` in there for backward compatibility for the time being.
+
+### reStraining
+
+- Updated the genome preparation to now work with the latest (v8) genome annotation file (mgp_REL2021_snps.vcf.gz) from the [Mouse Genomes Project](https://www.mousegenomes.org/). The mgp_v5 version for the now outdated GRCm38 genome is no longer supported (since this is primarily a screening tool anyway...).
+
+
 ## Changelog for version 0.3.0 [release on 27 March 2022]
 
 ### reStraining

diff --git a/Docs/README.md b/Docs/README.md
@@ -4,7 +4,7 @@
 
 # reStrainingOrder - Mouse Strain Identification
 
-## User Guide - v0.1.0
+## User Guide - v0.4.0
 
 This User Guide outlines how reStrainingOrder works and gives details for each step.
 
@@ -50,7 +50,7 @@ While the genome preparation is not very resource hungry, the alignment and scor
 
 #### Feedback
 
-We would like to hear your comments or suggestions! Please e-mail [me here](mailto:[email protected])!
+We would like to hear your comments or suggestions! Please e-mail [me here](mailto:[email protected])!
 
 
 # The reStrainingOrder workflow in more detail
@@ -61,18 +61,18 @@ We would like to hear your comments or suggestions! Please e-mail [me here](mail
 
 This is a one-off process.
 
-`reStraining` is designed to read in a variant call file from the Mouse Genomes Project (download e.g. from this location: ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz (FTP links are not rendered nicely in Github markdown)) and generate a new genome version where all positions found as a SNP in any of the strains (currently 35 different ones) are masked by the ambiguity nucleobase `N` (**N-masking**). The entire process of filtering through ~80 million SNP positions and preparing the N-masked genome typically takes four hours on our server and requires some 6GB of memory.
+`reStraining` is designed to read in a variant call file from the Mouse Genomes Project (download e.g. from [The Mouse Genomes Project](https://www.mousegenomes.org/)). It now assumes the GRCm39 mouse genome build by default, and uses the latest SNP annotation file (v8: [mgp_REL2021_snps.vcf.gz](https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-2112-v8-SNPs_Indels/mgp_REL2021_snps.vcf.gz)); the previous version v5 (`mgp.v5.merged.snps_all.dbSNP142.vcf.gz`) is no longer supported. `reStraining` generates a new genome version where all positions found as a SNP in any of the strains (currently >50 different ones) are masked by the ambiguity nucleobase `N` (**N-masking**). The entire process of filtering through ~80 million SNP positions and preparing the N-masked genome typically takes a few hours and requires some 6 GB of memory.
 
 If you don't have the mouse genome files already, you can download them from Ensembl, e.g. with a command like this:
 
 ```
-wget ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/dna/*dna.chromosome.*
+wget ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/*dna.chromosome.*
 ```
 
 Here is a sample command for the genome preparation step:
 
 ```
-reStraining --vcf mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference /bi/scratch/Genomes/Mouse/GRCm38/
+reStraining --vcf mgp_REL2021_snps.vcf.gz --reference Genomes/Mouse/GRCm39/
 ```
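
Review note (not part of the diff): the N-masking idea described in this hunk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation; the function name `n_mask` and the toy sequence are hypothetical, and positions are assumed to be 1-based as in VCF files.

```python
from typing import Iterable


def n_mask(sequence: str, snp_positions: Iterable[int]) -> str:
    """Return a copy of `sequence` with every listed 1-based position set to 'N'.

    Hypothetical helper for illustration; positions outside the sequence are ignored.
    """
    bases = list(sequence)
    for pos in snp_positions:
        if 1 <= pos <= len(bases):
            bases[pos - 1] = "N"  # convert 1-based position to 0-based index
    return "".join(bases)


# Toy example: mask positions 2 and 5 of an 8 bp sequence
masked = n_mask("ACGTACGT", {2, 5})
print(masked)  # ANGTNCGT
```

Because a position is masked as soon as any strain carries a SNP there, reads from every strain align to the masked genome without a reference bias at those sites.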
 
 
@@ -89,33 +89,33 @@ This folder (called `MGP_strains_N-masked`) and its FastA contents are vital for
 
 **Chromosome 1 matrix file**
 
-The genome preparation command writes out a matrix file for chromosome 1 only (called `MGPv5_SNP_matrix_chr1.txt.gz`), which is in the following format:
+The genome preparation command writes out a matrix file for chromosome 1 only (called `MGPv8_SNP_matrix_chr1.txt.gz`), which is in the following format:
 
 ```
-Chromosome	Position	REF	ALT	129P2_OlaHsd	129S1_SvImJ	129S5SvEvBrd	AKR_J	A_J	BALB_cJ	BTBR_T+_Itpr3tf_J	BUB_BnJ	C3H_HeH	C3H_HeJ	C57BL_10J	C57BL_6NJ	C57BR_cdJ	C57L_J	C58_J	CAST_EiJ	CBA_J	DBA_1J	DBA_2J	FVB_NJ	I_LnJ	KK_HiJ	LEWES_EiJ	LP_J	MOLF_EiJ	NOD_ShiLtJ	NZB_B1NJ	NZO_HlLtJ	NZW_LacJ	PWK_PhJ	RF_SEA_GnJ	SPRET_EiJ	ST_bJ	WSB_EiJ	ZALENDE_EiJ
-1	3000023	C	A	1	1	0	0	0	0	1	1	0	1	0	0	1	0	0	0	1	0	0	0	0	0	0	0
-1	3000126	G	T	1	1	0	0	0	1	1	1	0	1	1	1	1	1	1	0	1	1	0	0	1	1	0	1
-1	3000185	G	T	1	1	1	1	0	0	1	1	0	0	0	0	1	1	0	1	0	1	1	1	1	0	0	1
-1	3000234	G	A	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
+Chromosome	Position	REF	ALT	129P2_OlaHsd	129S1_SvImJ	129S5SvEvBrd	A_J	AKR_J	B10.RIII	BALB_cByJ	BALB_cJ	BTBR_T+_Itpr3tf_J	BUB_BnJ	C3H_HeH	C3H_HeJ	C57BL_10J	C57BL_10SnJ	C57BL_6NJ	C57BR_cdJ	C57L_J	C58_J	CAST_EiJ	CBA_J	CE_J	CZECHII_EiJ	DBA_1J	DBA_2J	FVB_NJ	I_LnJ	JF1_MsJ	KK_HiJ	LEWES_EiJ	LG_J	LP_J	MAMy_J	MOLF_EiJ	NOD_ShiLtJ	NON_LtJ	NZB_B1NJ	NZO_HlLtJ	NZW_LacJ	PL_J	PWK_PhJ	QSi3	QSi5	RF_J	RIIIS_J	SEA_GnJ	SJL_J	SM_J	SPRET_EiJ	ST_bJ	SWR_J	WSB_EiJ	ZALENDE_EiJ
+1	3050050	C	G	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0 0 0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
+1	3050069	C	T	1	1	1	0	1	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0 1	1	0	0	0	0	1	1	0	1	1	0	0	0	1	0	1	1	0	0	0	1	0	0	1	1	0	1
+1	3050115	G	A	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0 0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
+1	3050118	G	A	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	1	0	0	1	1 0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0
 ```
 A score of 0 for a strain indicates that the strain has the `REF` base at this position; a score of 1 means that it contains the `ALT` base with high confidence. This matrix file is used as input for the SNP scoring process (reStrainingOrder, [see below](#step-III---scoring-snps)).
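
Review note (not part of the diff): a minimal sketch of how a matrix in this format could be read, assuming a tab-separated layout with four leading columns followed by one 0/1 column per strain. The function `read_matrix` is hypothetical, and the embedded excerpt is cut down to three strain columns with illustrative values.

```python
import csv
import io

# Cut-down excerpt with only three strain columns; values are illustrative
SAMPLE = (
    "Chromosome\tPosition\tREF\tALT\t129P2_OlaHsd\tCAST_EiJ\tSPRET_EiJ\n"
    "1\t3050050\tC\tG\t0\t0\t1\n"
    "1\t3050069\tC\tT\t1\t0\t0\n"
)


def read_matrix(handle):
    """Parse the matrix into {(chromosome, position): (REF, ALT, {strain: 0 or 1})}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    strains = header[4:]  # everything after Chromosome, Position, REF, ALT
    matrix = {}
    for row in reader:
        chrom, pos, ref, alt = row[:4]
        calls = dict(zip(strains, (int(value) for value in row[4:])))
        matrix[(chrom, int(pos))] = (ref, alt, calls)
    return strains, matrix


strains, matrix = read_matrix(io.StringIO(SAMPLE))
ref, alt, calls = matrix[("1", 3050050)]
alt_strains = [strain for strain in strains if calls[strain] == 1]
print(alt_strains)  # ['SPRET_EiJ']
```

For the real gzipped file, the same function should work on a handle opened with `gzip.open(path, "rt")` from the Python standard library.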
 
 The matrix is written out for a single chromosome only to use less memory in the scoring process. In theory one could use any other chromosome as well (or even the whole genome, but with 70M positions this would be challenging...!). This is the SNP filtering summary:
 
 ```
-SNP position summary for all MGP strains (based on mouse genome build GRCm38)
+SNP position summary for all MGP strains (based on mouse genome build GRCm39)
 ===========================================================================
 
-Positions read in total:	78,772,544
-Positions skipped because the REF/ALT bases were not well defined:	960,167
-Positions discarded as no strain had a high confidence call:	7,936,128
+Positions read in total:	6,449,162
+Positions skipped because the REF/ALT bases were not well defined:	84,827
+Positions discarded as no strain had a high confidence call:	698,557
 
-Positions printed to THE CHR1 MATRIX in total:	5,506,653
+Positions printed to THE MATRIX in total:	5,665,777
 ```
 
 **Please note:** only positions that have a single `REF/ALT` genotype were considered, i.e. positions with several ALT bases for different strains (e.g. `REF: A`, `ALT: C,T`) were skipped for simplicity. Also, positions where the reference sequence did not have a DNA base, or positions with no high confidence SNP call in any of the strains, were skipped entirely.
 
-In total, the chr1 matrix file contains ~5.5 million positions that were of high quality in one or more strains.
+In total, the chr1 matrix file contains ~5.7 million positions that were of high quality in one or more strains.
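
Review note (not part of the diff): the filtering rules in the note above can be sketched as a small classifier. This is a hedged illustration only. How `reStraining` decides that a call is "high confidence" is not shown in this section, so the per-strain booleans below are a stand-in. Mapping multi-allelic `ALT` entries onto the "not well defined" counter is likewise an assumption, and `classify_position` is a hypothetical name.

```python
from collections import Counter

DNA_BASES = {"A", "C", "G", "T"}


def classify_position(ref, alt, high_conf_calls):
    """Classify one VCF position as 'skipped', 'discarded' or 'kept'.

    high_conf_calls: one boolean per strain, True if that strain has a
    high-confidence ALT call (stand-in for the tool's real criterion).
    """
    # Multi-allelic ALT (e.g. "C,T") or a non-DNA REF base: not well defined
    if ref not in DNA_BASES or alt not in DNA_BASES:
        return "skipped"
    # No strain with a high-confidence call: nothing to discriminate on
    if not any(high_conf_calls):
        return "discarded"
    return "kept"


# Four invented records: (REF, ALT, per-strain high-confidence flags)
records = [
    ("A", "C", [True, False]),
    ("A", "C,T", [True, True]),
    ("N", "T", [False, True]),
    ("G", "A", [False, False]),
]
summary = Counter(classify_position(*record) for record in records)
print(summary["kept"], summary["skipped"], summary["discarded"])  # 1 2 1
```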
 
 **SNP folder**
 
@@ -212,7 +212,7 @@ The number of SNP positions that have been skipped because of this bisulfite amb
 
 This step carries out the following tasks:
 
-- read and store matrix of high confidence SNP positions on chromosome 1 (`MGPv5_SNP_matrix_chr1.txt.gz`)
+- read and store matrix of high confidence SNP positions on chromosome 1 (`MGPv8_SNP_matrix_chr1.txt.gz`)
 
 - read BAM file, identify reads overlapping genomic N-masked positions, and store frequency of detected bases at N-masked positions (`REF`/`ALT`/`OTHER`). This step discriminates between standard genomic or bisulfite converted reads (C/T positions cannot be used under certain conditions, see above)
 
@@ -223,7 +223,7 @@ Once the BAM file has finished processing, `reStrainingOrder` calculates the fol
 
 A **sample command** could look like this:
 
-`reStrainingOrder --snp MGPv5_SNP_matrix_chr1.txt.gz Spretus_10M_bowtie2.bam`
+`reStrainingOrder --snp MGPv8_SNP_matrix_chr1.txt.gz Spretus_10M_bowtie2.bam`
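
Review note (not part of the diff): the per-position tallying of `REF`/`ALT`/`OTHER` bases described in the task list above can be sketched as follows. This is a simplified illustration and makes several assumptions: it does not read a BAM file, it ignores the bisulfite C/T special case, and the names `tally_base`, `snps` and `observations` as well as the observation data are hypothetical.

```python
from collections import Counter


def tally_base(observed, ref, alt):
    """Label a base seen at an N-masked position as 'REF', 'ALT' or 'OTHER'."""
    observed = observed.upper()
    if observed == ref:
        return "REF"
    if observed == alt:
        return "ALT"
    return "OTHER"


# Known SNP positions on chromosome 1: position -> (REF, ALT)
snps = {3050050: ("C", "G"), 3050069: ("C", "T")}

# Invented (position, base seen in a read) pairs, as if extracted from alignments
observations = [
    (3050050, "G"),
    (3050050, "G"),
    (3050050, "C"),
    (3050069, "A"),
    (3050100, "T"),  # not a known SNP position, so it is ignored
]

counts = {pos: Counter() for pos in snps}
for pos, base in observations:
    if pos in snps:
        ref, alt = snps[pos]
        counts[pos][tally_base(base, ref, alt)] += 1

print(dict(counts[3050050]))  # {'ALT': 2, 'REF': 1}
```

Keeping a separate `OTHER` count makes sequencing errors and unexpected alleles visible instead of silently folding them into `REF` or `ALT`.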
 
 ### Output: 
 
@@ -290,6 +290,6 @@ This file is generated by `reStrainingReport`. It is called automatically at the
 
 
 # Credits
-reStrainingOrder was written by Felix Krueger at the [Babraham Bioinformatics Group](http://www.bioinformatics.babraham.ac.uk/).
+reStrainingOrder was written by Felix Krueger at the [Babraham Bioinformatics Group](http://www.bioinformatics.babraham.ac.uk/), now maintained at [Altos Bioinformatics](https://altoslabs.com/).
 
 ![Babraham Bioinformatics](Images/bioinformatics_logo.png)
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@
 # reStrainingOrder - why do we need one?
 reStrainingOrder is intended as a QC tool that attempts to identify the genotype of pure strain or hybrid mouse samples. It can be used to check public data as well as provide useful insight into mouse strains commonly used in your own lab.
 
-To do this, reStrainingOrder harnesses single-nucleotide polymorphism (SNP) information collected by the Mouse Genomes Project (MGP, http://www.sanger.ac.uk/science/data/mouse-genomes-project), and constructs a fully N-masked genome similar to the approach of [SNPsplit](https://github.com/FelixKrueger/SNPsplit/blob/master/SNPsplit_User_Guide.md).
+To do this, reStrainingOrder harnesses single-nucleotide polymorphism (SNP) information collected by the [Mouse Genomes Project](https://www.mousegenomes.org/), and constructs a fully N-masked genome similar to the approach of [SNPsplit](http://felixkrueger.github.io/SNPsplit/). The project has been updated to work with the latest release of the Mouse Genomes Project, which means that it will now assume the GRCm39 mouse genome build by default, and use the latest SNP annotation file (v8: [mgp_REL2021_snps.vcf.gz](https://ftp.ebi.ac.uk/pub/databases/mousegenomes/REL-2112-v8-SNPs_Indels/mgp_REL2021_snps.vcf.gz)).
 
 reStrainingOrder is intended to work with most common types of Illumina sequencing - including `RNA-seq`, `ChIP-seq`, `ATAC-seq` or any kind of `Bisulfite-seq`. Supported aligners include [`Bowtie2`](https://github.com/BenLangmead/bowtie2), [`HISAT2`](https://ccb.jhu.edu/software/hisat2/index.shtml), [`STAR`](https://github.com/alexdobin/STAR), and [`Bismark`](https://github.com/FelixKrueger/Bismark) (Oxford comma, anyone?).
 
@@ -29,15 +29,10 @@ reStrainingOrder requires the following tools installed and ideally available in
 ## Documentation
 The reStrainingOrder documentation can be found here: [reStrainingOrder User Guide](./Docs/README.md)
 
-
-## Links
-
-This project was started as part of the 2018 Cambridge area bioinformatics hackathon (https://www.cambiohack.uk/).
-
 ## Licences
 
 reStrainingOrder itself is free software; `reStrainingReport` produces HTML graphs powered by [Plot.ly](https://plot.ly/javascript/), which are also free to use and look at!
 
 ## Credits
-reStrainingOrder was written by Felix Krueger, part of the [Babraham Bioinformatics](https://www.bioinformatics.babraham.ac.uk) group.
+This project was started as part of the 2018 Cambridge area bioinformatics hackathon; reStrainingOrder was written by Felix Krueger, part of [Babraham Bioinformatics](https://www.bioinformatics.babraham.ac.uk), now part of Altos Bioinformatics.
 <p align="center"> <img title="Babraham Bioinformatics" id="logo_img" src="Docs/Images/bioinformatics_logo.png"></p>