GCSA on smoothxg GFAs #5

fbemm · 2020-09-11T12:05:19Z

Dear Erik,

I am trying to build a GCSA index for a graph produced by edyeet -> seqwish -> smoothxg -> vg view -> vg prune -r.

The input is 3 small genomes (<100 Mb), each with around 40 "contigs". The graph has 190k segments and 240k edges.

Running vg index to generate the GCSA index produces very large tmp files (>2 TB) and, in practice, never finishes.

I am not sure where to start digging at the moment. I am now trying to index the seqwish output directly.

Bests,
F


ekg · 2020-09-13T17:01:31Z

Yes, that makes sense. I think you're probably running into a lot of bubbles during kmer generation. This is the basic flaw of the GCSA2 indexing strategy, at least as it's currently implemented. (We might simplify things for ourselves by just indexing the actual paths directly rather than the graph and its implied recombinations.)

It's worth trying to get this to work though. Usually you can get the index to build by decreasing the graph complexity (with pruning) and/or reducing the GCSA2 index kmer size.

I think you may need to use vg prune -u to "unfold" the reference paths in bubbles and decrease the overall complexity of the graph. @jltsiren would know.

I would also just try to index with a much smaller kmer size for the GCSA2 index. For instance:

vg index -x g.xg -g g.gcsa -k 11 -X 2

This would result in an 11 * 2^2 = 44-mer index. This should be faster to build. It'll be slightly worse for mapping, but not by much. Remember that these are just seeds for the mapping.
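To make the arithmetic concrete, here is a minimal Python sketch. It assumes, as in the example above, that the final kmer length is the initial -k value doubled once per -X step (k * 2^X). The function name is made up for illustration and is not part of vg.

```python
def effective_kmer_length(initial_k: int, doubling_steps: int) -> int:
    """Return initial_k doubled once per step.

    Assumption taken from the comment above: the indexed kmer length
    is initial_k * 2**doubling_steps (e.g. -k 11 -X 2 gives 44).
    """
    if initial_k <= 0:
        raise ValueError("initial_k must be positive")
    if doubling_steps < 0:
        raise ValueError("doubling_steps must be non-negative")
    return initial_k * 2 ** doubling_steps


if __name__ == "__main__":
    # Compare a few hypothetical (-k, -X) combinations before committing
    # to a long indexing run.
    for k, x in [(11, 2), (11, 3), (16, 2)]:
        print(f"-k {k} -X {x} -> {effective_kmer_length(k, x)}-mer index")
```

This is only a quick way to see how far a given -k / -X pair shrinks the kmer length; it says nothing about how long the index build itself will take.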

jltsiren · 2020-09-13T17:16:10Z

This looks like the "small graph with many paths" scenario in the wiki. vg prune -u should work here.

The vg prune -r approach is only appropriate for reference+VCF graphs. It first removes complex graph regions and then restores all nodes and edges used by paths. If the graph is based on a multiple sequence alignment, vg prune -r probably won't do anything.
