computational efficiency of pggb #370
Comments
It seems that the cited paper misunderstood how variation graph building methods are currently used in the HPRC. PGGB and minigraph-cactus are both run on each chromosome individually, which allows for high parallelism in graph building. Simply running all the data from all human chromosomes in the HPRC on a single node is likely to take a very long time and to produce a result that may be hard to interpret. Improving the partitioning process is critical to enabling this kind of use. To minimize bias, we propose a community detection method that partitions the graph building process into pieces, each of which can be processed independently on a cluster. Refining this is the main area of ongoing work on PGGB, as it will lead to automatic and unbiased graph building in any context, not just those with a clear partitioning by chromosome (or, in humans, by most chromosomes, the sex chromosomes, and the acrocentrics).
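To make the partitioning idea concrete, here is a minimal sketch in Python. It is not PGGB's actual method: it uses plain connected components over a thresholded mapping graph as a simpler stand-in for community detection. The function name `partition_sequences`, the `(query, target, weight)` tuple layout, and the `min_weight` threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def partition_sequences(mappings, min_weight=0.0):
    """Group sequence names into connected components of a mapping graph.

    mappings: iterable of (query, target, weight) tuples. This layout is an
    assumption for illustration: weight is some summary of how strongly two
    sequences map to each other (for example, a fraction of aligned bases).
    Edges with weight below min_weight are ignored, but both sequences are
    still registered so that unlinked sequences form their own groups.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target, weight in mappings:
        root_q, root_t = find(query), find(target)
        if weight >= min_weight and root_q != root_t:
            parent[root_q] = root_t  # merge the two groups

    groups = defaultdict(list)
    for name in list(parent):
        groups[find(name)].append(name)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    toy = [
        ("sampleA#chr1", "sampleB#chr1", 0.90),
        ("sampleA#chr2", "sampleB#chr2", 0.80),
        ("sampleA#chr1", "sampleB#chr2", 0.05),  # weak cross-chromosome hit
    ]
    for group in partition_sequences(toy, min_weight=0.5):
        print(group)
```

Each resulting group could then be passed to a separate graph building job. A real community detection method would weigh edges rather than cut them at a fixed threshold, which matters when sequences from different chromosomes share substantial homology, as with the acrocentrics.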
Also, the pggb version used in the paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03098-2 was
Thank you for the reply! I have another pggb-related question, posted here: ekg/seqwish#121. Do you have any thoughts on it? Thanks!
Hi,
A recent paper, "Comparing methods for constructing and representing human pangenome graphs", reports that pggb cannot construct graphs from 104 human haplotypes because of low computational efficiency. This seems to contradict the results in the paper "A draft human pangenome reference", where pggb is used to construct graphs from around 90 haplotypes, a number very close to 104. I am therefore wondering about the computational efficiency of pggb: can it handle hundreds or even thousands of haplotypes? If not, what is the key bottleneck?
Thanks!