Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hg38 debug #23

Open
wants to merge 8 commits into
base: hg38
Choose a base branch
from
Open

Hg38 debug #23

wants to merge 8 commits into from

Conversation

vinjana
Copy link
Member

@vinjana vinjana commented Jun 6, 2024

Changes collected during debugging, why there are differences to the 35 version even for the classic HG37ChrConverter.

These differences indicate the refactoring did destroy something.

As soon as they are fixed, we can be sure, the Hg37ChrConverter is a correct refactoring of the SOPHIA 35 code, and then continue to test the GenericChrConverter with hg37.

NOTE: The non-static, non-develop build is broken. Also the Conda-version of Boost lacks the backtrace library. Always use a Boost, with static compilation and build the static+develop build of SOPHIA.

Summary by CodeRabbit

  • Documentation

    • Updated CONTRIBUTORS.md with detailed attribution for original and current developers.
    • Enhanced README.md with corrected capitalization, updated URLs, tool functionalities, and added support for generic assembly.
  • Refactor

    • Renamed methods for clarity and consistency across various files.
    • Introduced new methods for validity checks and handling chromosome indices.
  • New Features

    • Added new method parseSamLine for parsing SAM lines in the Alignment class.
  • Bug Fixes

    • Adjusted logic for supplementary alignments and mate information handling in Alignment.cpp.
  • Tests

    • Added and modified test cases in GenericChrConverter_test.cpp for new and existing parsing functions.

@vinjana vinjana self-assigned this Jun 6, 2024
@vinjana vinjana changed the base branch from master to hg38 June 6, 2024 13:20
Copy link

coderabbitai bot commented Jun 6, 2024

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The recent changes enhance the SOPHIA bioinformatics tool, improving functionality and code readability. Key updates include renaming methods for clarity, adding new methods for chromosome validation, updating build instructions, and refining alignment processing. The changes also address dependencies, enhance error handling, and streamline logic flow in various components, ensuring better performance and maintainability.

Changes

File Summary
CONTRIBUTORS.md Attributed original implementation to Umut Toprak and updated development credits.
Makefile Added testRunner to the binaries target.
Overview.puml Introduced a PlantUML diagram detailing tumor and control data processing.
README.md Updated tool descriptions, URLs, dependencies, and build instructions.
include/Alignment.h Added parseSamLine method to Alignment class.
include/ChrConverter.h Added new virtual functions isValid and getInvalid; renamed isInBlockedRegion.
include/GenericChrConverter.h Added isValid and getInvalid methods; renamed parseChrAndReturnIndex.
include/Hg37ChrConverter.h Renamed methods for clarity; added isValid and getInvalid methods.
include/SamSegmentMapper.h Updated parseSamStream function signature.
src/Alignment.cpp Modified Alignment class initialization and parsing logic.
src/Breakpoint.cpp Removed branch variable and simplified control flow.
src/ChrConverter.cpp Renamed isInBlockedRegion to isInIgnoredRegion.
src/GenericChrConverter.cpp Added isValid and getInvalid methods; improved error handling.
src/Hg37ChrConverter.cpp Refactored Hg37ChrConverter class for better encapsulation and clarity.
src/MasterRefProcessor.cpp Updated function call from parseChrAndReturnIndex to parseChrReturnIndex.
src/MrefEntry.cpp Added a comment about condition for backwards compatibility.
src/SamSegmentMapper.cpp Modified parseSamStream method and alignment processing logic.
src/SuppAlignment.cpp Renamed parseChrAndReturnIndex to parseChrReturnIndex.
src/SuppAlignmentAnno.cpp Renamed parseChrAndReturnIndex to parseChrReturnIndex.
src/SvEvent.cpp Adjusted conditions for setting the distant flag.
src/sophia.cpp Updated parseSamStream method to take std::cin argument.
src/sophiaAnnotate.cpp Renamed parseChrAndReturnIndex to parseChrReturnIndex.
tests/GenericChrConverter_test.cpp Added and modified test cases for GenericChrConverter class.

Poem

In the realm of code, where data flows,
SOPHIA now with clarity glows.
Alignments parse with newfound grace,
Chromosomes validated, errors displaced.
With each refined and polished line,
Bioinformatics shines, divine.
🎉✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Outside diff range and nitpick comments (4)
README.md (4)

Line range hint 211-216: Correct the indentation of unordered lists to improve markdown formatting.

-  * Boost 1.82.0
-  * libbacktrace 20220708
-  * gtest 1.14.0
-  * gmock 1.14.0
-  * rapidcsv 8.0.0
-  * strtk 0.6.0
+* Boost 1.82.0
+* libbacktrace 20220708
+* gtest 1.14.0
+* gmock 1.14.0
+* rapidcsv 8.0.0
+* strtk 0.6.0

Also applies to: 329-352

Tools
Markdownlint

211-211: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


212-212: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


168-168: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


169-169: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


161-161: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


162-162: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


201-201: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


Line range hint 3-3: Remove trailing spaces to clean up the markdown file.

- SOPHIA is a Structural Variant(SV) detection algorithm based on the supplementary alignment(SA) concept of the aligner BWA-MEM, combined with filters based on expert-knowledge to increase specificity. 
+ SOPHIA is a Structural Variant(SV) detection algorithm based on the supplementary alignment(SA) concept of the aligner BWA-MEM, combined with filters based on expert-knowledge to increase specificity.

Also applies to: 11-11, 15-15, 30-30, 76-76, 90-90, 168-168, 169-169, 246-246, 247-247, 249-249, 318-318, 333-333

Tools
Markdownlint

21-21: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time


15-15: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


20-20: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


Line range hint 330-330: Ensure there are no spaces inside code span elements to adhere to markdown best practices.

- `true`, ` yes`, ` t`, ` y`, or ` 1`: The chromosome is part of the compressed master-ref set of chromosomes.
+ `true`, `yes`, `t`, `y`, or `1`: The chromosome is part of the compressed master-ref set of chromosomes.
Tools
Markdownlint

21-21: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time


15-15: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


20-20: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


Line range hint 54-54: Convert fenced code blocks to indented style for consistency across the document.

- ```bash
- samtools view -F 0x600 -f 0x001 /yourPositionSorted.bam \
-   | sophia --assemblyname hg38 \
-            --medianisizes 323.0 \
-            --stdisizepercentage 21.0 \
-            --properpairpercentage 94.32 \
-            --defaultreadlength 101 \
-            --clipsize 10 \
-            --basequality 23 \
-            --basequalitylow 12 \
-            --lowqualclipsize 5 \
-            --isizesigma 5 \
-            --bpsupport 3 \
-   | gzip > breakpoints.bed.gz
- ```
+     samtools view -F 0x600 -f 0x001 /yourPositionSorted.bam \
+       | sophia --assemblyname hg38 \
+                --medianisizes 323.0 \
+                --stdisizepercentage 21.0 \
+                --properpairpercentage 94.32 \
+                --defaultreadlength 101 \
+                --clipsize 10 \
+                --basequality 23 \
+                --basequalitylow 12 \
+                --lowqualclipsize 5 \
+                --isizesigma 5 \
+                --bpsupport 3 \
+       | gzip > breakpoints.bed.gz

Also applies to: 138-138, 201-201, 223-223, 234-234, 257-257, 273-273, 285-285, 295-295

Tools
Markdownlint

21-21: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time


15-15: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


20-20: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3c14f17 and 0b4d5b2.

Files ignored due to path filters (1)
  • Overview.svg is excluded by !**/*.svg
Files selected for processing (23)
  • CONTRIBUTORS.md (1 hunks)
  • Makefile (1 hunks)
  • Overview.puml (1 hunks)
  • README.md (13 hunks)
  • include/Alignment.h (2 hunks)
  • include/ChrConverter.h (3 hunks)
  • include/GenericChrConverter.h (2 hunks)
  • include/Hg37ChrConverter.h (4 hunks)
  • include/SamSegmentMapper.h (1 hunks)
  • src/Alignment.cpp (6 hunks)
  • src/Breakpoint.cpp (6 hunks)
  • src/ChrConverter.cpp (1 hunks)
  • src/GenericChrConverter.cpp (6 hunks)
  • src/Hg37ChrConverter.cpp (9 hunks)
  • src/MasterRefProcessor.cpp (1 hunks)
  • src/MrefEntry.cpp (1 hunks)
  • src/SamSegmentMapper.cpp (7 hunks)
  • src/SuppAlignment.cpp (2 hunks)
  • src/SuppAlignmentAnno.cpp (1 hunks)
  • src/SvEvent.cpp (1 hunks)
  • src/sophia.cpp (1 hunks)
  • src/sophiaAnnotate.cpp (1 hunks)
  • tests/GenericChrConverter_test.cpp (3 hunks)
Files skipped from review due to trivial changes (3)
  • CONTRIBUTORS.md
  • Overview.puml
  • src/MrefEntry.cpp
Additional context used
Learnings (5)
include/ChrConverter.h (1)
User: vinjana
PR: DKFZ-ODCF/sophia#6
File: include/ChrConverter.h:19-20
Timestamp: 2023-12-22T13:09:49.772Z
Learning: The user has updated the header guards in `ChrConverter.h` as suggested, changing from `_CHRCONVERTER_H_` to `CHRCONVERTER_H_` to avoid potential conflicts with reserved identifiers.
include/Hg37ChrConverter.h (1)
User: vinjana
PR: DKFZ-ODCF/sophia#6
File: src/Hg37ChrConverter.cpp:0-0
Timestamp: 2023-12-22T12:57:25.832Z
Learning: The user has added comments to the `Hg37ChrConverter` constructor to indicate the need for handling exceptions as suggested.
src/GenericChrConverter.cpp (1)
User: vinjana
PR: DKFZ-ODCF/sophia#6
File: src/Hg37ChrConverter.cpp:0-0
Timestamp: 2023-12-22T12:57:25.832Z
Learning: The user has added comments to the `Hg37ChrConverter` constructor to indicate the need for handling exceptions as suggested.
src/SuppAlignment.cpp (1)
User: vinjana
PR: DKFZ-ODCF/sophia#6
File: src/SuppAlignment.cpp:0-0
Timestamp: 2024-02-01T14:13:51.950Z
Learning: The user clarified that the changes to the `saCloseness` method in `SuppAlignment.cpp` were only related to code layout, not logic.
src/Hg37ChrConverter.cpp (2)
User: vinjana
PR: DKFZ-ODCF/sophia#6
File: src/Hg37ChrConverter.cpp:0-0
Timestamp: 2023-12-22T12:57:25.832Z
Learning: The user has added comments to the `Hg37ChrConverter` constructor to indicate the need for handling exceptions as suggested.
User: vinjana
PR: DKFZ-ODCF/sophia#6
File: src/Hg37ChrConverter.cpp:0-0
Timestamp: 2023-12-22T13:00:49.431Z
Learning: User plans to consider moving hardcoded values to a configuration file or database in the future.
Markdownlint
README.md

21-21: Expected: h2; Actual: h3 (MD001, heading-increment)
Heading levels should only increment by one level at a time


211-211: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


212-212: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


213-213: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


214-214: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


215-215: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


216-216: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


329-329: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


330-330: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


331-331: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


332-332: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


333-333: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


334-334: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


338-338: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


339-339: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


340-340: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


341-341: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


342-342: Expected: 0; Actual: 2 (MD007, ul-indent)
Unordered list indentation


343-343: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


344-344: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


345-345: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


346-346: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


347-347: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


348-348: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


349-349: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


350-350: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


351-351: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


352-352: Expected: 2; Actual: 4 (MD007, ul-indent)
Unordered list indentation


3-3: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


11-11: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


15-15: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


30-30: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


76-76: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


90-90: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


168-168: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


169-169: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


246-246: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


247-247: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


249-249: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


318-318: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


333-333: Expected: 0 or 2; Actual: 1 (MD009, no-trailing-spaces)
Trailing spaces


20-20: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


161-161: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


162-162: Expected: 1; Actual: 3 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


231-231: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


304-304: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines


329-329: null (MD032, blanks-around-lists)
Lists should be surrounded by blank lines


330-330: null (MD038, no-space-in-code)
Spaces inside code span elements


39-39: null (MD045, no-alt-text)
Images should have alternate text (alt text)


54-54: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


138-138: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


201-201: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


223-223: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


234-234: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


257-257: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


273-273: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


285-285: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


295-295: Expected: indented; Actual: fenced (MD046, code-block-style)
Code block style


389-389: null (MD047, single-trailing-newline)
Files should end with a single newline character

Additional comments not posted (40)
src/ChrConverter.cpp (1)

35-35: Renaming isInBlockedRegion to isInIgnoredRegion improves clarity.

Verification successful

The renaming of the method from isInBlockedRegion to isInIgnoredRegion has been successfully propagated across the codebase.

  • src/SamSegmentMapper.cpp
  • src/Hg37ChrConverter.cpp
  • include/ChrConverter.h
  • src/ChrConverter.cpp
  • include/Hg37ChrConverter.h
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify all references to `isInIgnoredRegion` have been updated.

# Test: Search for the old function name. Expect: No occurrences.
rg --type cpp $'isInBlockedRegion'

Length of output: 33



Script:

#!/bin/bash
# Description: Verify all references to `isInIgnoredRegion` have been updated.

# Test: Search for the new function name. Expect: Multiple occurrences.
rg --type cpp $'isInIgnoredRegion'

Length of output: 686

include/SamSegmentMapper.h (1)

47-47: Adding std::istream parameter to parseSamStream enhances flexibility.

Verification successful

The call to parseSamStream in src/sophia.cpp has been correctly updated to use std::cin, which is an std::istream. No other calls to parseSamStream without the new parameter were found.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify all calls to `parseSamStream` have been updated.

# Test: Search for the function usage without the new parameter. Expect: No occurrences.
rg --type cpp $'parseSamStream\\('

Length of output: 274

include/Alignment.h (2)

38-38: Including <iostream> suggests debugging or output enhancements.


48-48: Adding parseSamLine method aligns with input handling refactoring.

Makefile (1)

173-173: Including testRunner in the binaries target promotes a testing culture.

Verification successful

The testRunner is correctly defined and integrated into the Makefile. It is linked properly and included in the binaries target.

  • testRunner target is defined with the necessary linking flags.
  • testRunner is included in the binaries target.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the integration of `testRunner` in the build process.

# Test: Check the build logs for successful compilation and linking of `testRunner`. Expect: Successful build.
cat build.log | grep 'testRunner'

Length of output: 75



Script:

#!/bin/bash
# Description: Verify the integration of `testRunner` in the Makefile.

# Test: Check the Makefile for the definition and usage of `testRunner`.
grep -E 'testRunner' Makefile

Length of output: 305

include/ChrConverter.h (3)

58-59: Added virtual methods isValid and getInvalid enhance the flexibility of chromosome validation across different implementations.


126-126: Renaming isInBlockedRegion to isInIgnoredRegion clarifies the intent and usage of the method, aligning better with its functionality.


154-157: Modification of the parseChrReturnIndex method signature to include stopCharExt parameter enhances its flexibility in parsing chromosome names from various formats.

include/Hg37ChrConverter.h (4)

56-62: Renaming of validation methods to include a leading underscore (_isValid, _assertValid) indicates their intended private or internal use, which is a good practice for clarity in method visibility and usage.


125-126: Addition of public methods isValid and getInvalid provides a clear interface for checking chromosome validity, which is crucial for robust chromosome handling in genomic applications.


179-179: Renaming isInBlockedRegion to isInIgnoredRegion in Hg37ChrConverter maintains consistency with changes in the base class and clarifies the method's purpose.


208-211: The modification in parseChrReturnIndex to ignore stopCharExt and maintain legacy behavior is a critical change. Ensure that this behavior is well-documented and understood by users, as it affects how chromosome names are parsed.

include/GenericChrConverter.h (1)

122-123: The addition of isValid and getInvalid methods in GenericChrConverter aligns with similar changes in other parts of the project, ensuring a consistent approach to chromosome validity checking across different converter implementations.

src/sophia.cpp (1)

186-186: The update to parseSamStream to take std::cin directly as an argument simplifies the function's interface and usage, making it more straightforward to integrate with standard input streams.

tests/GenericChrConverter_test.cpp (2)

122-129: LGTM! The test GenericChrConverterTest_ParseSimpleStrings effectively validates the parsing of simple chromosome strings.


170-194: LGTM! The test GenericChrConverterTest_ParseToIndex effectively validates the parsing of chromosome strings to their respective indices.

src/sophiaAnnotate.cpp (1)

203-203: LGTM! The error handling in the parseChrReturnIndex call is robust, ensuring that any domain errors are caught and appropriately logged.

src/MasterRefProcessor.cpp (1)

213-213: LGTM! The error handling in the parseChrReturnIndex call within processFile is robust, ensuring that any domain errors are caught and appropriately logged.

src/SuppAlignmentAnno.cpp (1)

84-84: LGTM! The implementation of parseChrReturnIndex in the SuppAlignmentAnno constructor is robust, ensuring accurate parsing of chromosome data from supplementary alignment strings.

src/GenericChrConverter.cpp (2)

49-49: LGTM! Good use of static_cast for type safety.

Also applies to: 59-59


114-120: LGTM! Proper use of std::numeric_limits for handling invalid indices.

src/SamSegmentMapper.cpp (3)

303-313: Check for potential performance issues with frequent container operations.
[PERFORMANCE]
The method incrementCoverages performs multiple container operations within a loop, which could lead to performance issues, especially with large data sets. Consider optimizing these operations, possibly by reducing the frequency of container resizing or by using more efficient data structures.


Line range hint 244-271: Validate the logic for handling mate chromosome indices.

The conditions for handling mate chromosome indices and ignored regions seem complex and are critical for the correctness of the program. It would be beneficial to add explicit unit tests to ensure that these conditions are handled correctly, especially since they involve multiple logical operators and method calls.


58-76: Ensure proper handling of invalid alignments.

The logic for handling invalid alignments (i.e., those with a chromosome index greater than 1000) has been changed. It's crucial to ensure that this new logic correctly filters out invalid alignments as intended. Consider adding unit tests to verify this behavior.

README.md (1)

18-18: Update the description to reflect support for both hg37 and hg38 assemblies.

- Currently, SOPHIA only is optimized for the hg37 assembly of the human genome.
+ Currently, SOPHIA is optimized for both the hg37 and hg38 assemblies of the human genome.

Likely invalid or redundant comment.

src/Alignment.cpp (6)

60-60: Initialization of validLine to false in the constructor.

This change ensures that validLine is explicitly set to false, which is a good practice to avoid uninitialized variables that might lead to undefined behavior.


Line range hint 70-93: Addition of parseSamLine method to handle SAM line parsing.

This method efficiently parses SAM lines and handles potential domain errors by rethrowing exceptions with additional error information. This is a robust way to ensure that any issues during parsing are caught and reported.


Line range hint 83-93: Handling of unaligned reads in SAM format.

The method correctly checks for unaligned reads with chromosome name '*'. This is crucial for handling SAM files correctly and avoiding errors in downstream processing.


213-215: Updated method call to parseChrReturnIndex for parsing mate chromosome index.

The update reflects the renamed method parseChrReturnIndex, ensuring consistency across the codebase. This change is part of the broader refactoring to improve clarity and maintainability.


466-475: Handling of 'SA' tag in SAM format.

The method correctly identifies and processes the 'SA' tag, which is used for secondary alignments in SAM files. This is important for accurate representation and processing of alignment data.


700-708: Validation of chromosome indices in supplementary alignments.

The method isValid and isTechnical are used to ensure that only valid and non-technical chromosome indices are processed. This is a good practice to prevent errors and ensure data integrity.

src/Hg37ChrConverter.cpp (8)

302-302: Update to use std::numeric_limits for defining NA is a good practice for type safety and clarity.


362-364: The method getInvalid() correctly returns the INVALID constant. This encapsulation enhances maintainability.


366-368: The method _isValid(ChrIndex) correctly checks for non-invalid and non-zero indices, aligning with the domain logic.


370-372: The public isValid() method correctly delegates to the private _isValid() method, maintaining encapsulation.


402-426: The method _buildCompressedMrefIndexToIndex effectively maps compressed mref indices to global indices and validates them. The use of throw_with_trace for error handling is consistent with other parts of the code.


605-606: The method isInIgnoredRegion uses a clear and concise logic to determine if a position is within a specific ignored region on chromosome 2. This is a good example of domain-specific optimizations.


374-376: The use of throw_with_trace in _assertValid methods enhances error reporting by providing stack trace information. Ensure that this pattern is consistently used across the project for error handling.

Also applies to: 391-391

Verification successful

The use of throw_with_trace for error handling is consistent across the project.

  • Instances of throw_with_trace are found in multiple files, including Hg37ChrConverter.cpp, Sdust.cpp, SvEvent.cpp, SuppAlignmentAnno.cpp, MasterRefProcessor.cpp, IndexRange.cpp, GlobalAppConfig.cpp, GenericChrConverter.cpp, ChrInfoTable.cpp, ChrInfo.cpp, ChrCategory.cpp, Alignment.cpp, and include/global.h.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify consistent use of `throw_with_trace` for error handling across the project.

# Test: Search for error handling patterns. Expect: Consistent use of `throw_with_trace`.
rg --type cpp 'throw_with_trace'

Length of output: 3620


Line range hint 637-689: The method parseChrReturnIndex has been refactored to handle various chromosome identifiers more robustly. The use of structured exception handling and detailed error messages will aid in debugging and maintenance. However, ensure that the handling of the G case (GL chromosomes) correctly parses the numbers following 'GL'.

src/Breakpoint.cpp (1)

1202-1202: Ensure proper error handling for parseChrReturnIndex.

Verification successful

Error handling for parseChrReturnIndex is properly implemented across the codebase. Each usage is wrapped in a try-catch block handling DomainError.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the error handling for `parseChrReturnIndex` is implemented correctly across the codebase.

# Test: Search for the usage of `parseChrReturnIndex` and check for error handling.
rg --type cpp 'parseChrReturnIndex' --context 5

Length of output: 12799

Comment on lines +359 to +360
if (distant &&
chrIndex1 == chrIndex2 &&
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider simplifying the nested conditionals for clarity.

- if (distant && chrIndex1 == chrIndex2 && (selectedSa1.isFuzzy() || selectedSa1.isStrictFuzzyCandidate())) {
-     if (eventSize < 5000) {
-         distant = false;
-     }
- }
+ if (distant && chrIndex1 == chrIndex2 && (selectedSa1.isFuzzy() || selectedSa1.isStrictFuzzyCandidate()) && eventSize < 5000) {
+     distant = false;
+ }

This change combines the nested if statements into a single conditional, improving readability and reducing the complexity of the code.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
if (distant &&
chrIndex1 == chrIndex2 &&
if (distant && chrIndex1 == chrIndex2 && (selectedSa1.isFuzzy() || selectedSa1.isStrictFuzzyCandidate()) && eventSize < 5000) {
distant = false;
}

Comment on lines +51 to +83
SamSegmentMapper::parseSamStream(std::istream &inputStream) {
const ChrConverter &chrConverter = GlobalAppConfig::getInstance().getChrConverter();
unsigned long count = 0;
while (true) {
auto alignment = std::make_shared<Alignment>();

// Used to be `alignment->getChrIndex() > 1000`, i.e. excluding MT and phiX and INVALID.
// This is the same as `!chrConverter.isCompressedMref(alignment->getChrIndex())`.
if (!chrConverter.isCompressedMref(alignment->getChrIndex())) {
continue;
}
std::shared_ptr<Alignment> alignment = std::make_shared<Alignment>(Alignment());
alignment->parseSamLine(inputStream);

// There used to be a `continue` statement here, if `alignment->getChrIndex() > 1000`.
// The condition is equivalent to the new `!chrConverter.isCompressedMref(alignment->getChrIndex())`.
// Instead, we negated the condition and only process the data, *after* the check, whether
// the line is valid.
if (alignment->isValidLine()) {
if (alignment->getChrIndex() != chrIndexCurrent) {
switchChromosome(*alignment);
#ifdef DEBUG
++count;
if (count % 1000000 == 0) {
std::cerr << "Read " << count << " lines ..." << std::endl;
}
#endif
if (chrConverter.isCompressedMref(alignment->getChrIndex())) {
if (alignment->getChrIndex() != chrIndexCurrent) {
switchChromosome(*alignment);
}
alignment->continueConstruction();
printBps(alignment->getStartPos());
incrementCoverages(*alignment);
assignBps(alignment);
}
alignment->continueConstruction();
printBps(alignment->getStartPos());
incrementCoverages(*alignment);
assignBps(alignment);
} else {
break;
}
}
// EOF event for the samtools pipe. printing the end of the very last
// chromosome,
std::cerr << "SamSegmentMapper(" << getpid() << "): Processed " << count << " lines." << std::endl;
// EOF event for the samtools pipe. printing the end of the very last chromosome.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor the parseSamStream method to improve readability and maintainability.

The parseSamStream method is quite lengthy and handles multiple responsibilities. Consider breaking it down into smaller, more manageable functions. For instance, the loop body could be extracted into a separate method that processes a single Alignment object. This would make the code easier to understand and maintain.

Comment on lines +307 to +325
ChrIndex GenericChrConverter::parseChrReturnIndex(std::string::const_iterator startIt,
std::string::const_iterator endIt,
char stopChar,
const std::string &stopCharsExt) const {
ChrName chrName = parseChr(startIt, endIt, stopChar, stopCharsExt);

// Map to ChrIndex and return it, of if the chromosome is not registered, give a helpful
// error message, that shows the parsed name and from what input it was parsed.
try {
return allChromosomeLookup.at(chrName);
} catch (std::out_of_range& e) {
throw_with_trace(DomainError(
"Chromosome name '" + chrName + "' not found for assembly '" +
getAssemblyName() + "'."));
if (chrName == "*") {
// When reading SAM format, we may encounter '*' as chromosome for unaligned reads.
return getInvalid();
} else {
// Map to ChrIndex and return it, of if the chromosome is not registered, give a helpful
// error message, that shows the parsed name and from what input it was parsed.
try {
return allChromosomeLookup.at(chrName);
} catch (std::out_of_range& e) {
throw_with_trace(DomainError(
"Chromosome name '" + chrName + "' not found for assembly '" +
getAssemblyName() + "'."));
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The updated error handling in parseChrReturnIndex enhances robustness.

Would you like me to help create a unit test for this method to ensure its robustness?

@@ -168,7 +168,7 @@ namespace sophia {
// Update `chrIndex` field.
const ChrConverter &chrConverter = GlobalAppConfig::getInstance().getChrConverter();
try {
result.chrIndex = chrConverter.parseChrAndReturnIndex(
result.chrIndex = chrConverter.parseChrReturnIndex(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Correct handling of chromosome parsing in parseSamSaTag.

Consider refactoring the parsing logic to improve readability and maintainability. Here's a potential simplification:

// Simplified parsing logic
void SuppAlignment::parseSamSaTag(...) {
    // Existing setup code...
    auto parseField = [&](auto begin, auto end, auto delimiter) {
        std::vector<std::string::const_iterator> parts;
        std::copy_if(begin, end, std::back_inserter(parts), [&](auto c) { return c == delimiter; });
        return parts;
    };

    auto fields = parseField(saCbegin, saCend, ',');
    // Use fields to update chrIndex, pos, etc.
}

Comment on lines +82 to +83
std::cerr << "SamSegmentMapper(" << getpid() << "): Processed " << count << " lines." << std::endl;
// EOF event for the samtools pipe. printing the end of the very last chromosome.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve logging for better traceability.

The logging statement at the end of parseSamStream could include more context about the process, such as the total number of valid and invalid lines processed. This would provide better traceability and debugging capabilities.

- std::cerr << "SamSegmentMapper(" << getpid() << "): Processed " << count << " lines." << std::endl;
+ std::cerr << "SamSegmentMapper(" << getpid() << "): Processed " << count << " valid lines and " << invalidCount << " invalid lines." << std::endl;

Committable suggestion was skipped due to low confidence.

> | sophiaAnnotate | TBD | TBD | TBD | 4 GB |
>
> Note that this is data from single runs on a dedicated compute node at *different* times, but using the same input data and parameters, except for the assembly name.

* 35 (9e3b6ed)
* Forked from [bitbucket](https://bitbucket.org/compbio_charite/sophia/src/master/)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure the file ends with a single newline character.

- 
+ \n

Committable suggestion was skipped due to low confidence.

Tools
Markdownlint

389-389: null (MD047, single-trailing-newline)
Files should end with a single newline character

@@ -15,7 +15,7 @@ For a detailed description of the algorithm, please refer to [Umut Topraks's dis
SOPHIA is a very fast and resource-light algorithm.
It uses 2GB RAM, 2 CPU cores and runs in ~3.5 hours for 50x coverage WGS, and can detect variants with a single pass of the input BAM files. No local assembly is done.

Sophia is included in the [SophiaWorkflow](https://github.com/DKFZ-ODCF/SophiaWorkflow) that uses the [Roddy Workflow Management Systems](https://github.com/TheRoddyWMS/Roddy).
SOPHIA is included in the [SophiaWorkflow](https://github.com/DKFZ-ODCF/SophiaWorkflow) that uses the [Roddy Workflow Management Systems](https://github.com/TheRoddyWMS/Roddy).


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure there are no multiple consecutive blank lines to maintain a clean and professional appearance of the markdown file.

-
-
+ 

Also applies to: 161-162, 231-231, 304-304

Committable suggestion was skipped due to low confidence.

Tools
Markdownlint

20-20: Expected: 1; Actual: 2 (MD012, no-multiple-blanks)
Multiple consecutive blank lines

@@ -36,12 +36,18 @@ The code here is a fork of that repository. If you use the newer versions here,

For instructions and short explanations of on commandline parameters, invoke each tool with `--help`.

![](Overview.svg)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add alt text to the image for accessibility.

- ![](Overview.svg)
+ ![Overview of SOPHIA Workflow](Overview.svg)
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
![](Overview.svg)
![Overview of SOPHIA Workflow](Overview.svg)
Tools
Markdownlint

39-39: null (MD045, no-alt-text)
Images should have alternate text (alt text)

…file defaults. Also added unused annotation (for non-devel build).

Note: The non-develop, non-static build is broken.

Signed-off-by: Philip R. Kensche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant