How to run "generate_sequence_features_single" with UNSORTED bam #169

Open
GuoYang-qd opened this issue Jul 31, 2024 · 2 comments

Comments

@GuoYang-qd

Thank you for developing SemiBin2. It performs exceptionally well and generates a large number of high-quality MAGs.

We would like to apply SemiBin2 to our large datasets. Because analysing large datasets is usually very time-consuming, we hope to streamline the pipeline as much as possible.

Sorting BAM files often consumes significant computational and storage resources (e.g., in our case the temporary files produced while sorting are usually hundreds of GB per BAM). However, SemiBin2 does not seem to support unsorted BAM input: running the "generate_sequence_features_single" module fails with the following error:

Input error: Chromosome k127_4971567 found in non-sequential lines. This suggests that the input file is not sorted correctly.

Are there any alternative tools or ways to generate "data.csv" and "data_split.csv" from unsorted BAM files? Or would it be possible to make simple modifications to the "generate_sequence_features_single" module so that it accepts unsorted BAM?
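
For readers hitting the same message: it appears to be triggered when records for one contig show up in more than one separate run of lines. The sketch below is only an illustration of that condition, not the check that bedtools or SemiBin2 actually runs. The function name and the contig names are hypothetical.

```python
from typing import Iterable, Optional


def first_non_sequential(contigs: Iterable[str]) -> Optional[str]:
    """Return the first contig name that reappears after a different contig,
    or None if every contig occupies one contiguous run of records."""
    finished = set()  # contigs whose run of records has already ended
    current = None
    for name in contigs:
        if name == current:
            continue  # still inside the same run
        if name in finished:
            return name  # this contig was seen earlier, then interrupted
        if current is not None:
            finished.add(current)
        current = name
    return None


# Hypothetical contig names, in the order their records appear in a file.
grouped = ["k127_1", "k127_1", "k127_2", "k127_3"]
scattered = ["k127_1", "k127_2", "k127_1"]
print(first_non_sequential(grouped))
print(first_non_sequential(scattered))
```

Only the contig column is needed, so the check can be fed from any stream of record names without loading the file into memory.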

@luispedro
Member

Unfortunately, it's not trivial to use non-sorted files. It's conceptually possible (we do so in NGLess), but not in a way that fits SemiBin.
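
As a rough illustration of why this is conceptually possible, per-contig depth can be accumulated from alignments in any order if a counter array is kept in memory for every contig. The sketch below is an assumption-laden toy, not how NGLess or SemiBin computes abundance: the function names are hypothetical, alignments are reduced to (contig, start, end) tuples with 0-based, end-exclusive coordinates, and every end is assumed to be at most the contig length.

```python
from itertools import accumulate
from typing import Dict, Iterable, List, Tuple

Alignment = Tuple[str, int, int]  # (contig, start, end), 0-based, end-exclusive


def depth_from_unsorted(
    alignments: Iterable[Alignment], contig_lengths: Dict[str, int]
) -> Dict[str, List[int]]:
    # One difference array per contig: +1 where a read starts, -1 where it ends.
    diffs = {name: [0] * (length + 1) for name, length in contig_lengths.items()}
    for contig, start, end in alignments:
        diffs[contig][start] += 1
        diffs[contig][end] -= 1
    # A prefix sum turns the differences into per-base depth;
    # the extra trailing slot is dropped.
    return {name: list(accumulate(d))[:-1] for name, d in diffs.items()}


def mean_depth(depth: List[int]) -> float:
    return sum(depth) / len(depth) if depth else 0.0


# Hypothetical contigs and alignments, deliberately interleaved (unsorted).
contig_lengths = {"k127_1": 10, "k127_2": 5}
alignments = [("k127_1", 0, 4), ("k127_2", 1, 5), ("k127_1", 2, 10)]
depth = depth_from_unsorted(alignments, contig_lengths)
for name, d in depth.items():
    print(name, mean_depth(d))
```

The trade-off is memory: the arrays grow with the total length of all contigs, whereas a sorted input lets a tool handle one contig at a time.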

@GuoYang-qd
Author

Thanks for the reply. Currently, I can generate the tetramer frequencies in "data.csv". The abundance calculated by NGLess seems to follow a similar trend to the abundance generated by bedtools in SemiBin. So, can the NGLess abundance replace the bedtools abundance?

Additionally, I noticed that "data_split.csv" appears to sample contigs from "data.csv" and then split each one's abundance and tetramer frequencies into two values (the average of these two values seems to be the number in "data.csv"). How is this done? Could you briefly explain the logic behind it?

Thanks!
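
To make the observation about "data_split.csv" concrete, here is a minimal sketch that assumes the split simply cuts a contig at its midpoint and computes features for each half. That assumption is not confirmed anywhere in this thread, and the function names, sequence and depth values are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmer_freqs(seq: str, k: int = 4) -> Dict[str, float]:
    # Relative frequency of every k-mer observed in seq.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def split_features(
    seq: str, depth: List[int], k: int = 4
) -> List[Tuple[Dict[str, float], float]]:
    # Cut the contig at its midpoint (an assumption) and describe each half
    # by its k-mer frequencies and its mean depth.
    mid = len(seq) // 2
    halves = [(seq[:mid], depth[:mid]), (seq[mid:], depth[mid:])]
    return [(kmer_freqs(s, k), sum(d) / len(d)) for s, d in halves]


# Hypothetical contig with per-base depth.
seq = "ACGTACGTTTTTGGGG"
depth = [2] * 8 + [4] * 8
(freqs_a, depth_a), (freqs_b, depth_b) = split_features(seq, depth)
print(depth_a, depth_b, sum(depth) / len(depth))
```

With equal-length halves, the mean of the two half depths equals the whole-contig mean, which matches the averaging pattern described above. The k-mer frequencies only average approximately, because k-mers spanning the cut point belong to neither half.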
