GCSA on smoothxg GFAs #5

Open
fbemm opened this issue Sep 11, 2020 · 2 comments

fbemm commented Sep 11, 2020

Dear Erik,

I am trying to build a GCSA index for a graph produced by edyeet -> seqwish -> smoothxg -> vg view -> vg prune -r.

The input is 3 small genomes (<100 Mb), each with around 40 "contigs". The graph has 190k segments and 240k edges.

Running vg index to generate the GCSA index produces very large tmp files (>2 TB) and, in practice, never finishes.

I am not sure where to start digging at the moment. I am trying to index the seqwish output now directly.

Bests,
F

Collaborator

ekg commented Sep 13, 2020

Yes, that makes sense. I think you're probably running into a lot of bubbles during the kmer generation. This is the basic flaw of the GCSA2 indexing strategy, at least as it's currently implemented. (We might simplify things for ourselves by just indexing the actual paths directly rather than the graph and its implied recombinations.)
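To make the bubble problem concrete, here is a toy sketch in Python. It is not how vg or GCSA2 enumerate kmers; the graph layout and the helper names `make_bubble_chain` and `kmers_from` are made up for illustration. It only shows that a chain of n two-allele bubbles implies 2^n distinct kmers once the kmer is long enough to span all of them.

```python
from typing import Dict, List, Set, Tuple


def make_bubble_chain(n_bubbles: int) -> Tuple[Dict[int, str], Dict[int, List[int]]]:
    """Build a toy graph: anchor 'A', then n bubbles of ('C' | 'G') followed by 'A'."""
    seqs: Dict[int, str] = {0: "A"}
    edges: Dict[int, List[int]] = {}
    for i in range(n_bubbles):
        anchor, ref, alt, nxt = 3 * i, 3 * i + 1, 3 * i + 2, 3 * (i + 1)
        seqs[ref], seqs[alt], seqs[nxt] = "C", "G", "A"
        edges[anchor] = [ref, alt]   # the bubble opens here
        edges[ref] = [nxt]           # both alleles rejoin at the next anchor
        edges[alt] = [nxt]
    return seqs, edges


def kmers_from(seqs: Dict[int, str], edges: Dict[int, List[int]],
               start: int, k: int) -> Set[str]:
    """Collect every distinct string of length k spelled by a walk from `start`."""
    found: Set[str] = set()

    def extend(node: int, prefix: str) -> None:
        prefix += seqs[node][: k - len(prefix)]
        if len(prefix) == k:
            found.add(prefix)
            return
        for nxt in edges.get(node, []):
            extend(nxt, prefix)

    extend(start, "")
    return found


if __name__ == "__main__":
    for n in (1, 5, 10):
        seqs, edges = make_bubble_chain(n)
        # k = 2n + 1 spans the whole chain, so every allele combination appears.
        print(n, len(kmers_from(seqs, edges, 0, 2 * n + 1)))
    # prints: 1 2, then 5 32, then 10 1024
```

Most of these combinations are recombinations that no input genome actually contains, which is why pruning or a shorter kmer keeps the kmer set manageable.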

It's worth trying to get this to work though. By decreasing the graph complexity (with pruning) and/or reducing the GCSA2 index kmer size, you can usually get the index to build.

I think you may need to use vg prune -u to "unfold" the reference paths in bubbles and decrease the overall complexity of the graph. @jltsiren would know.

I would also just try to index with a much smaller kmer size for the GCSA2 index. For instance:

vg index -x g.xg -g g.gcsa -k 11 -X 2

Here -k sets the initial kmer size and -X the number of doubling steps, so this would result in an 11 * 2^2 = 44-mer index. This should be faster to build. It'll be slightly worse for mapping, but not by much. Remember that these are just seeds for the mapping.
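The arithmetic above can be checked with a few lines of Python. This is only a sketch of the length calculation (initial kmer size doubled once per doubling step); the function name `effective_kmer_length` is made up here and is not part of vg.

```python
def effective_kmer_length(k: int, doubling_steps: int) -> int:
    """Return k * 2**doubling_steps, the kmer length after prefix doubling."""
    if k <= 0 or doubling_steps < 0:
        raise ValueError("k must be positive and doubling_steps non-negative")
    return k * 2 ** doubling_steps


if __name__ == "__main__":
    print(effective_kmer_length(11, 2))  # 44, matching -k 11 -X 2 above
    print(effective_kmer_length(11, 3))  # 88, one more doubling step
```

Each extra doubling step doubles the indexed kmer length, so lowering -X shrinks the index more sharply than lowering -k by a few bases.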


jltsiren commented Sep 13, 2020

This looks like the "small graph with many paths" scenario in the wiki. vg prune -u should work here.

The vg prune -r approach is only appropriate for reference+VCF graphs. It first removes complex graph regions and then restores all nodes and edges used by paths. If the graph is based on a multiple sequence alignment, every node and edge already lies on some path, so vg prune -r probably won't do anything.
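A toy model in Python of the "remove, then restore path edges" idea shows why this has no effect on an alignment-based graph. It is not vg's implementation: `prune_and_restore` is a made-up name, and `complex_edges` is a hypothetical stand-in for whatever the pruning step decides to remove.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def prune_and_restore(edges: Iterable[Edge], complex_edges: Iterable[Edge],
                      paths: Iterable[List[int]]) -> Set[Edge]:
    """Drop `complex_edges`, then add back every edge that some path traverses."""
    all_edges = set(edges)
    pruned = all_edges - set(complex_edges)
    path_edges = {(a, b) for path in paths for a, b in zip(path, path[1:])}
    return pruned | (path_edges & all_edges)


if __name__ == "__main__":
    # One bubble: 1 -> {2, 3} -> 4. Assume pruning flags all four edges as complex.
    edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

    # Reference+VCF style: only the reference path is embedded, so the
    # alternate allele (node 3) stays pruned.
    print(sorted(prune_and_restore(edges, edges, [[1, 2, 4]])))
    # prints: [(1, 2), (2, 4)]

    # Alignment style: every genome is a path, so every edge is restored.
    print(prune_and_restore(edges, edges, [[1, 2, 4], [1, 3, 4]]) == edges)
    # prints: True
```

In the second case the output equals the input graph, so the complexity that makes kmer generation expensive is all still there.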
