Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancies between Edyeet and minimap2 PAF output #1

Open
brettChapman opened this issue Sep 1, 2020 · 7 comments
Open

Discrepancies between Edyeet and minimap2 PAF output #1

brettChapman opened this issue Sep 1, 2020 · 7 comments

Comments

@brettChapman
Copy link

Hi Erik

I've run Edyeet with 2 chromosomes of the same variety, but different assembly versions (related to the issue I posted for smoothxg: pangenome/smoothxg#2). I also ran with minimap2. However, the resulting size of the PAF file for both is hugely different.

Edyeet run gives me a 37Mb PAF file and also gives an error when running the resulting GFA with smoothxg:

singularity exec --bind /data/pangenome_edyeet:/data/pangenome_edyeet /data/edyeet_builds/edyeet.sif edyeet -s 10000 -p 85 -a 90 -n 10 -X /data/pangenome_edyeet/all_genomes.fasta /data/pangenome_edyeet/all_genomes.fasta > Morex_v1_5H_vs_Morex_v2_5H.paf

Minimap2 run gives me a 8.3GB PAF file (and it's still running and has been for the last 4 days, compared to the relatively quick completion time of edyeet):

singularity exec --bind /data/pangenome_minimap2:/data/pangenome_minimap2 /data/minimap2_builds/minimap2.sif minimap2 -t 16 -f100 -x asm20 -c -X /data/pangenome_minimap2/all_genomes.fasta /data/pangenome_minimap2/all_genomes.fasta > Morex_v1_5H_vs_Morex_v2_5H.paf

Are the discrepancies in size between the two, mainly based on the parameters, or is minimap2 much more sensitive? Or perhaps Edyeet finished prematurely, and hence why its resulting GFA file gives me an error with smoothxg? Do you generally expect smaller PAF files with Edyeet?

Thanks.

@ekg
Copy link
Owner

ekg commented Sep 1, 2020

edyeet and minimap2 are fundamentally different algorithms. Both produce semantically equivalent PAF, but they get there with different approaches.

minimap2 starts with minimizer chaining. Chains are built and filtered using heuristic criteria. Then, when -c is specified, each chain is aligned using adaptive banded semi-global alignment. The implementation in minimap2 uses affine gaps and should do a good job with large indels. If all chains are evaluated, down to a short length, then the result can be very large in size. The evaluation of the chaining DAG can also consume a lot of memory and time in some cases. In practice, I found it necessary to filter out short alignments from the output of minimap2 before giving the results to seqwish, or I would get a complex, collapsed graph in result.

edyeet starts with the mashmap2 approximate mapping algorithm, which scans segments of each query across the reference set, scoring each possible mapping location based on mash distance. A top number of mappings are kept (the number of secondaries given by -n plus 1) for each segment. The mappings must have an approximate identity greater than -p based on the jaccard metric of the minimizers. Then, each mapping is aligned using edlib, with an edit distance set based on -a. If an alignment is found within the given edit distance threshold, then it is emitted by the method. This does not have affine gaps and this could cause problems in some cases, and that's part of the rationale behind smoothxg.

The mapping and alignment phases in edyeet are relatively tightly bounded. Given a fixed segment length (you're using 10kb but experiments suggest that going even higher might be good for pangenomes), the mapping time for each segment across the reference set is approximately the same. Then, we get no more than -n+1 mappings for each segment. Finally, the alignment time for each segment is approximately the same--- it can be made faster by setting -a higher, or slower (and more sensitive to large variants) by setting -a lower, but there is little variance between segments.

These tight bounds do not appear to be the case for minimap2. The chaining process does not provide the same kinds of bounds, and it seems that sequences that are repetitive can use dramatically more time and memory to complete this step. The segment boundaries are not fixed, but determined by minimizer chaining heuristics. In result, we may end up aligning many small segments from some sequences and single long ones for others. This in turn can affect the runtime and (possibly) memory requirements of the alignment.

I suspect that minimap2 is more likely to give us good alignments. But, it can cost much more. MashMap2, and by consequence edyeet, provide a very fast method for seeing the high-level, large-scale relationships in a set of sequences. I think that this is a good place to start when building a pangenome graph, in that we can later go in and compress or homogenize the alignment using methods for multiple alignment that have affine gaps and such (smoothxg is the first prototype of this).

@ekg
Copy link
Owner

ekg commented Sep 1, 2020

Just a second note to keep things organized. I'm still learning how to use edyeet for this application, and I'm finding that a combination of large segment sizes and permissive identity bounds seems to give good results on a collection of yeast genomes.

Taking these genomes in yeast.pan.fa.gz, I found that this worked very well:

edyeet -t 16 -X -s 50000 -n 20 -p 75 -a 75 yeast.pan.fa.gz yeast.pan.fa.gz >yeast.pan.edyeet.s50000n20p75a75.paf

So if you continue with edyeet, I'd suggest trying larger segment sizes and lower identity thresholds.

@brettChapman
Copy link
Author

Thanks for the very detailed explanation. I'll try as you suggested with larger segment sizes and lower identity thresholds.

My barley genomes are very repetitive, so that would probably explain why it's taking longer, and why the files are much larger with minimap2 by comparison.

I agree, it may be more prudent starting at a higher level with edyeet + smoothxg, and then working down.

One aspect about a pangenome graph with VG that is attractive is being able to produce variant calls directly from the graph using vg deconstruct.

Between a graph built by minimap2 vs edyeet + smoothxg, would the resulting variants in the VCF file be significantly different? Would minimap2 identify more variants by comparison, or could edyeet, assisted with smoothxg, basically make the two graph and resulting variants (vg deconstruct) reasonably comparable?

Thanks.

@ekg
Copy link
Owner

ekg commented Sep 3, 2020

One aspect about a pangenome graph with VG that is attractive is being able to produce variant calls directly from the graph using vg deconstruct.

That's the hope. Please report if you have success with vg deconstruct. I find it to be difficult to get to work on arbitrary graphs, although the theory behind it is clean, it seems to be difficult to get things into VCF space.

Between a graph built by minimap2 vs edyeet + smoothxg, would the resulting variants in the VCF file be significantly different? Would minimap2 identify more variants by comparison, or could edyeet, assisted with smoothxg, basically make the two graph and resulting variants (vg deconstruct) reasonably comparable?

It's really not clear to me. It depends on the way that things are parameterized, and there are unfortunately a lot of parameters. I think you just have to try and see how it goes. At this point, it's all blue sky research. Hopefully in time we'll have standard approaches that work in general for most cases.

@brettChapman
Copy link
Author

Thanks.

I'll try first with the parameters you suggest with larger segment lengths, and then adjust as needed. I'll also still try with minimap2 (although I've found it takes around 7x longer to run). I'm also going to time the alignments between 2 chromosomes and then 4 chromosomes using edyeet and minimap2, to compare alignment times, and to see how they would likely scale to all 7 chromosomes and 20 genomes. I'm guessing the alignment times for the whole pangenome will likely turn out huge (up to several months I'm guessing) and I'll have no choice but to split up the alignments across nodes, even though I may miss secondary alignments. It'll be good to know the alignment times, pros/cons moving forward and to justify that choice. I've just upgraded my cluster with 10 more nodes (now 20 nodes) and 100TB of space, so running across many nodes shouldn't be a problem.

@ekg
Copy link
Owner

ekg commented Sep 13, 2020

You might be interested in the pangenome graph builder, pggb. It pulls together the steps for pangenome graph construction in one script.

@brettChapman
Copy link
Author

Hi Erik

Thanks. This looks really useful. In my case I'll need to run edyeet across multiple nodes and merge the paf files, but I could include some sbatch jobs in the bash script with dependencies for that. It's a good framework to build a good pangenome pipeline from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants