Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure for chromosomes with "chr" prefix #4

Open
vinjana opened this issue Nov 27, 2023 · 3 comments · May be fixed by #6
Open

Failure for chromosomes with "chr" prefix #4

vinjana opened this issue Nov 27, 2023 · 3 comments · May be fixed by #6
Assignees

Comments

@vinjana
Copy link
Member

vinjana commented Nov 27, 2023

Running sophia with input that contains "chr" prefixes for chromosomes produces a segmentation fault.

@vinjana
Copy link
Member Author

vinjana commented Nov 27, 2023

Looking at the code, there are hard-coded mappings in src/ChrConverter.cpp. The indices are also used in the code, e.g. in include/ChrConverter.cpp. I hope there are no other places. Making that more flexible could be done in different ways:

  1. Reading the mapping table from an external file(s).
  2. Making two instances (not classes) of the ChrConverter, initialized with tables for hg37 and hg38.

I think solution 1 is certainly too complex. C++ is not magic, but writing good robust parsers is not that easy and will take time. My C++ is also rusty, and Sophia is also implemented in a much advanced version of the language, that I haven't yet learned. Furthermore, given that we have to invest so much time to obtait the parameters for sophia, I think, it really does not make sense to provide Sophia for anything but hg37 and hg38.

So my plan would be as follows:

  1. Make the mapping tables more dynamic (not const) and allow to initialize the ChrConverter with different tables.
  2. Create multiple versions of the tables, for hg37, hg38, and their variants (e.g. concerning conventions for prefixing or not prefixing, maybe also different prefixes). This would happen externally to the ChrConverter, to allow us to use the static tables to initialize the ChrConverter dependent on CLI parameters.

The tables with prefixes could still at runtime get derived from static tables, e.g. by prefixing all/certain chromosomes with a user-provided value, but they could also be hard-coded as static const.

We'd also have to add parameters for selecting the mapping table by its name (hg37, hg38, chr_hg37, chr_hg38) and/or the prefixes ("chr", "hsa_", etc.).

This solution would already provide some flexibility. There is also the issue, that in the hg38, we prefixed some, but not all chromosomes. Therefore, I think, we should go with the multiple static tables approach, but also with the initialization of the ChrConverter class to use any of these. This would open the path to a complete dynamic implementation with a mapping table file provided by the user.

Basically, I propose to switch from from the static const array<...> implementation in ChrConverter to a initialized class, and taking the initialization from static tables.

@vinjana
Copy link
Member Author

vinjana commented Nov 27, 2023

The first thing to do, anyway, would be to ensure no segfault happens if the input file contains unknown chromosomes, but a helpful error message :)

@NagaComBio
Copy link
Member

NagaComBio commented Nov 27, 2023

Hi @vinjana,

I have generated this example from the public SEQC2 data for Sophia/35 module in the cluster. I have tried to change the chr prefix on the go so that Sophia will recognise the reads. And then the segmentation fault happens.

samtools view -F 0x600 -f 0x001 /omics/odcf/project/public_data/seqc2/sequencing/whole_genome_sequencing/view-by-pid/SEQC2_IL2/tumor0
1/paired/merged-alignment/tumor01_SEQC2_IL2_merged.mdup.bam | \
sed 's/chr//g' | \
sophia --medianisize 395.0 --stdisizepercentage 21.0 --properpairpercentage 92.02 --defaultreadlength 101 --
clipsize 10 --basequality 23 --basequalitylow 12 --lowqualclipsize 5 --isizesigma 5 --bpsupport 3

And here is the error message:

[[clipped]]
1       31788   31789   0,0,0,1,0,0,36,6,5,0,0,0        36,36   .       .       .
1       33528   33529   0,0,1,0,0,0,44,41,3,0,1,0       44,44   .       .       .
1       33546   33547   0,0,2,0,0,0,54,26,2,13,2,0      50,54   .       .       .
1       33547   33548   0,0,4,0,0,0,55,17,1,11,0,0      54,55   |4:25808805_INV(0,1,4?/8)       .       .
1       34045   34046   5,2,17,0,0,0,96,1,0,2,0,0       92,101  1:35362_INV|(4,1,!/0);12:31020091-31020162_INV|(5,1,7?/14);X:68605275-68605276_INV|(0,1,2?/4);9:1862015-1862016_INV|(0,2,4?
/8)     1:205884_INV|(4,0,1?/2);19:75792-75844|(4,0,2?/4);2:113579039-113579040_INV|(4,0,1?/2)  >120_1:GGCCTCAAGCGATCCTCCCACCTTAGCCTCCCAAAGTGTTGGGATTATAGGCATGAGCCACTGCACCTGGCT|(4)
1       35362   35363   3,2,8,0,0,0,83,3,0,1,1,0        84,86   19:75654_INV|(3,1,2?/4);1:34045_INV|(3,2,!/0);2:113579421|(0,2,3?/6)    6:99450064-99450065|(3,0,1?/2);1:205883-205884|(3,0
,1?/2);X:68605275-68605276|(3,0,1?/2)   >124_1:GTGGGCATTTAGTGCTATAAATTTCTCTTTACACACTGCTTTAAATGCGTCCC|(3)
1       39262   39263   0,0,0,1,0,0,187,23,0,0,0,0      187,188 .       .       .
1       50482   50483   0,0,0,2,0,21,48,2,0,2,0,0       44,50   .       .       .
1       50605   50606   0,0,0,0,0,0,80,18,0,41,1,0      79,80   #       #       #
1       51611   51612   0,0,1,0,0,0,14,10,0,1,1,0       14,14   .       .       .
1       54778   54779   0,0,0,0,1,0,5,20,1,0,0,0        7,7     .       .       .
1       61352   61353   0,0,0,0,0,0,0,36,2,0,1,0        0,0     #       #       #
Segmentation fault

I assume the error is because of different columns in the SAM file in this new hg38 alignments compared to the hg19 ones. I am trying to figure this out.

@vinjana vinjana linked a pull request Dec 13, 2023 that will close this issue
@vinjana vinjana self-assigned this Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants