Hg19 and hg38 analysis from same workflow branch #8

Status: Open. 53 commits to be merged into base: master.

Commits (53):
ba8b0f5
Removing ExAC and EVS from annotation and no-control filtering
NagaComBio Aug 19, 2019
8584941
Removing DUKE, DAC and HiSeqDepth based confidence annotaions
NagaComBio Oct 11, 2019
7e49b5d
Removing getRefGenomeAndChrPrefixFromHeader functions
NagaComBio Oct 11, 2019
ce97964
Updating annotation file paths to hg38 directory
NagaComBio Oct 11, 2019
c514297
Removing the hg38 encode blacklist filter
NagaComBio Jan 21, 2020
3bebd1f
Variant calling from CRAM files
NagaComBio Jan 7, 2021
b0ac261
Checking for nonREFnonALT in BoolCounter class
NagaComBio Feb 11, 2021
674900f
updating pysam to 0.16.0.1
NagaComBio Feb 11, 2021
128c1c0
Moving files to ngs_share
NagaComBio Feb 11, 2021
2de0727
Updating BoolCounter class
NagaComBio Mar 11, 2021
d8f3cf1
REF name via BAM header
NagaComBio Mar 11, 2021
b99be8f
Reverting generic xml to hg19
NagaComBio Mar 11, 2021
98bb76e
New xml for GRCh38 files
NagaComBio Mar 11, 2021
90b9978
chrLength file with 'chr' prefix
NagaComBio Mar 11, 2021
58b77cd
Removing hard-coded header parsing
NagaComBio Mar 16, 2021
1af4b68
Liftover local control for hg38 WES and WGS
NagaComBio Mar 16, 2021
baabd82
hg19 specific annotations
NagaComBio Mar 16, 2021
5a764ef
User defined threshold-based 'SNP_support_germline' annotations
NagaComBio Mar 16, 2021
da88940
Removed extra spaces
NagaComBio Mar 16, 2021
ad3f650
Merge branch 'master' into hg38
NagaComBio Mar 16, 2021
aef5111
Uncommenting reference detection
NagaComBio Apr 26, 2021
22a55ec
updating refgenome help
NagaComBio Apr 30, 2021
71900b3
Raise error: unknown alignment suffix
NagaComBio Apr 30, 2021
dbd835d
Reformatting in_dbSNPcounter.pl
NagaComBio Apr 30, 2021
df184d9
Reformatting IMD R file
NagaComBio Apr 30, 2021
f93779c
Increasing the mem requirement
NagaComBio Sep 29, 2021
7beae93
safer IO for in_dbSNPcounter.pl
NagaComBio Oct 18, 2021
d270358
Add variant calling in HLA/ALT contigs
NagaComBio May 4, 2022
b683df2
Update ngs_share path
NagaComBio May 4, 2022
1c0bc26
Remove RE/MAP from hg38 penalties
NagaComBio May 30, 2022
6d47982
Add m2e2,HLA/ALT mappability files
NagaComBio Jul 8, 2022
5fcc9e8
Merge branch 'mappability branch' into hg38
NagaComBio Jul 8, 2022
01567ac
Fix the diagnostic plots for GRCh38
NagaComBio Sep 16, 2022
6321e0a
Remove the quote for RAW_SNV_FILTER_OPTIONS
NagaComBio Sep 16, 2022
0adf1e4
Upgrade to gencodev39 for hg38
NagaComBio Oct 6, 2022
72a4135
Merge branch 'master' into hg38
NagaComBio Oct 27, 2022
c4fb48b
Bug fix with dbSNP counter
NagaComBio Nov 7, 2022
dc2ca36
Update WGS local control
NagaComBio Nov 7, 2022
973ae9c
hg38: Add local control and gnomAD based confidence annotation
NagaComBio Nov 7, 2022
56aa726
Update README
NagaComBio Nov 7, 2022
da45d2a
Exempt classification with FREQ
NagaComBio Nov 7, 2022
45fa141
Reverting SNP based confidence scoring
NagaComBio Mar 20, 2023
5479faf
Add exception for 'N' in createErrorPlots.py
NagaComBio Mar 20, 2023
f633856
Remove NA values in quantile calculation
NagaComBio Mar 20, 2023
ea7154b
Update reference
NagaComBio Mar 20, 2023
3f8419c
Update raw_filter_punishment in accordance with RAW_SNV_FILTER_OPTIONS
NagaComBio Mar 20, 2023
13bdc5f
Move python env
NagaComBio Mar 23, 2023
e2f7db9
Update virtual env path
NagaComBio Jun 12, 2023
44614b6
XML comments to description
NagaComBio Jun 12, 2023
dfff910
Merge branch 'master' into hg38
NagaComBio Jun 12, 2023
3a47349
Minor update
NagaComBio Jun 12, 2023
b8d4664
Update readme with hg38 calls
NagaComBio Jun 12, 2023
4dca636
Fixing the missed conflict
NagaComBio Jun 27, 2023
34 changes: 33 additions & 1 deletion README.md
@@ -1,6 +1,6 @@
# DKFZ SNVCalling Workflow

-An SNV calling workflow developed in the Applied Bioinformatics and Theoretical Bioinformatics groups at the DKFZ. An earlier version (pre Github) of this workflow was used in the [Pancancer](https://dockstore.org/containers/quay.io/pancancer/pcawg-dkfz-workflow) project. The workflow is only suited for human data (hg37/hg19; some later versions also hg38), because of the big role annotations play in this workflow.
+An SNV calling workflow developed in the Applied Bioinformatics and Theoretical Bioinformatics groups at the DKFZ. An earlier version (pre Github) of this workflow was used in the [Pancancer](https://dockstore.org/containers/quay.io/pancancer/pcawg-dkfz-workflow) project. The workflow is only suited for human data (hg37/GRCh37/hg19; and also for hg38/GRCh38), because of the big role annotations play in this workflow.

> <table><tr><td><a href="https://www.denbi.de/"><img src="docs/images/denbi.png" alt="de.NBI logo" width="300" align="left"></a></td><td><strong>Your opinion matters!</strong> The development of this workflow is supported by the <a href="https://www.denbi.de/">German Network for Bioinformatic Infrastructure (de.NBI)</a>. By completing <a href="https://www.surveymonkey.de/r/denbi-service?sc=hd-hub&tool=SNVCallingWorkflow">this very short (30-60 seconds) survey</a> you support our efforts to improve this tool.</td></tr></table>

@@ -114,6 +114,8 @@ Please have a look at the `resources/configurationFiles/analysisSNVCalling.xml`

## Example Call

For hg19,

```bash
roddy.sh run projectConfigurationName@analysisName patientId \
--useconfig=/path/to/your/applicationProperties.ini \
@@ -123,6 +125,22 @@ roddy.sh run projectConfigurationName@analysisName patientId \
--cvalues="bamfile_list:/path/to/your/control.bam;/path/to/your/tumor.bam,sample_list:normal;tumor,possibleTumorSampleNamePrefixes:tumor,possibleControlSampleNamePrefixes:normal,REFERENCE_GENOME:/reference/data/hs37d5_PhiX.fa,CHROMOSOME_LENGTH_FILE:/reference/data/hs37d5_PhiX.chromSizes,extractSamplesFromOutputFiles:false"
```

For hg38, add the following workflow configs to the `projectConfigs`

```xml
<availableAnalyses>
<analysis id="snvCalling" configuration="snvCallingAnalysisGRCh38" useplugin="SNVCallingWorkflow:3.0.0"/>
</availableAnalyses>
```

```bash
roddy.sh run projectConfigurationName@analysisName patientId \
--useconfig=/path/to/your/applicationProperties.ini \
--configurationDirectories=/path/to/your/projectConfigs \
--useiodir=/input/directory,/output/directory/snv \
--cvalues="bamfile_list:/path/to/your/control.bam;/path/to/your/tumor.bam,sample_list:normal;tumor,possibleTumorSampleNamePrefixes:tumor,possibleControlSampleNamePrefixes:normal,REFERENCE_GENOME:/reference/data/GRCh38_decoy_ALT_HLA_PhiX.fa,CHROMOSOME_LENGTH_FILE:/reference/data/GRCh38_decoy_ALT_HLA_PhiX.chromSizes,extractSamplesFromOutputFiles:false"
```
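
The two example calls differ only in the reference FASTA and the chromosome length file. As a rough illustration (the helper `build_cvalues` and the `REFERENCES` table are hypothetical and not part of the workflow or of Roddy), the `--cvalues` string could be assembled like this:

```python
# Hypothetical helper: assembles the comma-separated key:value string that the
# example calls pass to --cvalues. It is not part of the workflow itself.
def build_cvalues(control_bam, tumor_bam, reference_genome, chrom_length_file):
    values = [
        ("bamfile_list", f"{control_bam};{tumor_bam}"),
        ("sample_list", "normal;tumor"),
        ("possibleTumorSampleNamePrefixes", "tumor"),
        ("possibleControlSampleNamePrefixes", "normal"),
        ("REFERENCE_GENOME", reference_genome),
        ("CHROMOSOME_LENGTH_FILE", chrom_length_file),
        ("extractSamplesFromOutputFiles", "false"),
    ]
    return ",".join(f"{key}:{value}" for key, value in values)


# Reference paths copied from the README example calls.
REFERENCES = {
    "hg19": (
        "/reference/data/hs37d5_PhiX.fa",
        "/reference/data/hs37d5_PhiX.chromSizes",
    ),
    "hg38": (
        "/reference/data/GRCh38_decoy_ALT_HLA_PhiX.fa",
        "/reference/data/GRCh38_decoy_ALT_HLA_PhiX.chromSizes",
    ),
}

hg19_cvalues = build_cvalues(
    "/path/to/your/control.bam", "/path/to/your/tumor.bam", *REFERENCES["hg19"]
)
hg38_cvalues = build_cvalues(
    "/path/to/your/control.bam", "/path/to/your/tumor.bam", *REFERENCES["hg38"]
)
print(f'--cvalues="{hg38_cvalues}"')
```

Keeping the genome-specific paths in one table makes it harder to mix an hg19 FASTA with an hg38 chromosome length file.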

### No Control

* Set the parameter `isNoControlWorkflow` to `true`.
@@ -157,6 +175,20 @@ The optional configuration JSON file defaults to the `convertToStdVCF.json` resi

* patch: Remove all code related to PyPy and hts-python (including `copysam.py` and `PYPY_OR_PYTHON_BINARY`)


* 3.0.0

* Major
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs.

Review comment (Member):
Are there any new mandatory configuration values? If so, at least list them here; the details should be in the XML.
Were the semantics of any configuration values changed?

* Minor
* For hg38: Removing mappability and repeat elements' annotations from penalty calculations.

Review comment (Member):
Are there any new optional configuration values? If so, at least list them.

* skipREMAP: Option to remove repeat elements and mappability from confidence annotations in hg19.
* Removing EVS And ExAC AF from the annotations and no-control workflow filtering
* Support for variant calling from CRAM files
* Bug fix: Removing "quote" around the raw filter option `<RAW_SNV_FILTER_OPTIONS>`

Review comment (Member):
Right. But this is a technical/code description. What was the effect of the bug on the user? (Probably that multiple RAW_SNV_FILTER_OPTIONS could not be used.)

* Update `COWorkflowsBasePlugin` to `1.4.2`


* 2.2.0

* minor: Update virtualenv
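
On the reviewer's guess about the quoted `RAW_SNV_FILTER_OPTIONS`: the sketch below only illustrates general shell word splitting and is not taken from the workflow. The command name `tool` and the flags `--optA`/`--optB` are placeholders. It uses Python's `shlex.split`, which follows POSIX shell quoting rules, to show how a quoted option string arrives as one argument while the unquoted form arrives as separate arguments.

```python
import shlex

# Placeholder option string; these flag names are NOT real workflow options.
options = "--optA 1 --optB 2"

# Quoted: the whole string reaches the tool as a single argument.
quoted_args = shlex.split(f'tool "{options}"')

# Unquoted: the string is split into one argument per word.
unquoted_args = shlex.split(f"tool {options}")

print(quoted_args)
print(unquoted_args)
```

If the filter script parses its arguments individually, the quoted form would presumably not be recognised as several options, which would be consistent with the effect the reviewer suspects.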
2 changes: 1 addition & 1 deletion buildinfo.txt
@@ -1,4 +1,4 @@
-dependson=COWorkflowsBasePlugin:1.3.0
+dependson=COWorkflowsBasePlugin:1.4.2
JDKVersion=1.8
GroovyVersion=2.4
RoddyAPIVersion=3.4
2 changes: 1 addition & 1 deletion buildversion.txt
@@ -1,2 +1,2 @@
-2.1
+3.0
0
4 changes: 4 additions & 0 deletions resources/analysisTools/snvPipeline/PurityReloaded.py

Review comment (Member):
This passes over the VCF once. I am not sure how much memory it uses, but it may be worthwhile to limit the memory with this approach.

@@ -251,6 +251,10 @@ def parseVcf(file,num):
    while (l!= ""):
        t=l.split('\t')
        if (t[0][0] != "#") and isValid(t):
            # Skipping the non-primary assembly variants from purity calculations
            if t[0].startswith('HLA') or t[0].endswith('_alt'):

Review comment (Member):
Rename: t -> fields, l -> line.

BTW (2 lines up): line[0] != "#" is much clearer than fields[0][0] != "#" (not to speak of t[0][0]).

                l=vcf.readline()
                continue

Review comment (Member):
What is i (next line)? Maybe mapped_chromosome?

            i = chromMap[t[0]]
            if (t[12]=="germline"):
                #DP5=string.split(string.split(t[11],";")[1],",")
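
The review suggestions for this hunk (descriptive names, testing `line[0]` for the header marker) can be sketched as follows. This is a minimal illustration, not the code in `PurityReloaded.py`: the function names are hypothetical, the `isValid` check from the original loop is left out, and the contig names in the sample data are only examples of the HLA and `_alt` naming patterns.

```python
import io


def is_non_primary_contig(chromosome):
    # Same test as in the diff: HLA contigs and ALT contigs (suffix "_alt").
    return chromosome.startswith("HLA") or chromosome.endswith("_alt")


def primary_assembly_records(vcf_handle):
    """Yield the tab-split fields of each data line on a primary-assembly contig."""
    for line in vcf_handle:
        if not line.strip() or line[0] == "#":
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if is_non_primary_contig(fields[0]):
            continue  # skip HLA/ALT contigs, as the diff does for purity
        yield fields


# Tiny in-memory VCF-like input, for illustration only.
example_vcf = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t100\t.\tA\tG\n"
    "HLA-A*01:01:01:01\t200\t.\tC\tT\n"
    "chr6_GL000250v2_alt\t300\t.\tG\tA\n"
    "chr2\t400\t.\tT\tC\n"
)
kept = [fields[0] for fields in primary_assembly_records(example_vcf)]
print(kept)
```

Iterating over the file handle reads one line at a time and removes the need to call `readline()` before every `continue`.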
256 changes: 143 additions & 113 deletions resources/analysisTools/snvPipeline/confidenceAnnotation_SNVs.py

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion resources/analysisTools/snvPipeline/createErrorPlots.py
@@ -31,7 +31,7 @@ def calculateLogSize(size, max_val, base):
     return math.log(((size/max_val)*(base-1.))+1., base)

 def calculateRootedSize(size, max_val):
-    if(float(size) != 0.0):
+    if(float(size) != 0.0 and float(max_val) != 0.0):
         return np.sqrt(size/max_val)
     else:
         return 0.0
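
The effect of the extra condition can be checked in isolation. The sketch below restates the patched function under a hypothetical snake_case name and uses `math.sqrt` instead of `np.sqrt` so that it runs without NumPy; otherwise the logic follows the diff.

```python
import math


def calculate_rooted_size(size, max_val):
    # Return 0.0 when either value is zero, so size / max_val is never
    # evaluated with max_val == 0.
    if float(size) != 0.0 and float(max_val) != 0.0:
        return math.sqrt(size / max_val)
    return 0.0


print(calculate_rooted_size(4, 16))
print(calculate_rooted_size(4, 0))
```

With the old condition, a non-zero `size` combined with `max_val == 0` would still reach the division.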
@@ -199,6 +199,10 @@ def calculateErrorMatrix(vcfFilename, referenceFilename, errorType):
# 23.05.2016 JB: Excluded multiallelic SNVs
if ',' in split_line[header.index("ALT")]: continue

# 21.02.2023 NP: Excluded SNVs with 'N' before or after "," in context

Review comment (Member):
Could you add a short note on why you exclude these? I guess it is not obvious, because this was added to the SNV workflow only after years of operation.

if {'N,', ',N'}.intersection(split_line[header.index("SEQUENCE_CONTEXT")]):
continue

chrom = split_line[header.index("CHROM")]
pos = int(split_line[header.index("POS")])
context = ""
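
A note on the added condition, offered as an observation rather than a confirmed bug: Python's `set.intersection` accepts any iterable, and a string is iterated one character at a time, so the two-character markers `'N,'` and `',N'` can never equal a single character. Assuming `SEQUENCE_CONTEXT` holds a plain string with the bases before and after the variant separated by a comma (an assumption; the column format is not shown in this diff), a substring test would express the intent stated in the code comment. The helper name below is hypothetical.

```python
def has_n_adjacent_to_variant(sequence_context):
    # Substring test: is there an "N" directly before or after the comma?
    return "N," in sequence_context or ",N" in sequence_context


# set.intersection iterates the string character by character, so the
# two-character markers are never found this way.
intersection_result = {"N,", ",N"}.intersection("ACGTN,ACGT")
print(intersection_result)

print(has_n_adjacent_to_variant("ACGTN,ACGT"))
print(has_n_adjacent_to_variant("ACGTA,CCGT"))
```

If the column turns out to be a list of tokens rather than a string, the original intersection could behave differently, so this is worth verifying against a real input line.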