Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GFA output #2

Open
ekg opened this issue Sep 19, 2020 · 15 comments
Open

GFA output #2

ekg opened this issue Sep 19, 2020 · 15 comments

Comments

@ekg
Copy link
Contributor

ekg commented Sep 19, 2020

Have you considered adding GFA output to abPOA?

I am likely to make a patch to do this, but I wouldn't mind if you beat me to it. 🙂

@yangao07
Copy link
Owner

Thanks for the suggestion!
That should not be very hard to do, I will give it a try!

@ekg
Copy link
Contributor Author

ekg commented Sep 19, 2020 via email

@yangao07
Copy link
Owner

Hi Erik,

I just updated the abPOA repo, please try out the GFA format output with parameter '-r4'.
Let me know if this works for you!

Yan

@ekg
Copy link
Contributor Author

ekg commented Sep 22, 2020

Hey! This looks great. Here's one of the graphs I made with it.

abpoa ~/HLA-zoo/seqs/DRB1-3123.fa -r 4 >DRB1-3123.abpoa.gfa

In Bandage:

graph

This is the view of the graph given by odgi viz:

odgi build -g DRB1-3123.abpoa.gfa -o - | odgi unchop -i - -o - | odgi sort -i - -o - >DRB1-3123.abpoa.og
odgi viz -i DRB1-3123.abpoa.og -b -w 20 -y 500 -o DRB1-3123.abpoa.og.viz.png

DRB1-3123 abpoa og viz

And this will show if there are any inversions in the graph.

odgi viz -i DRB1-3123.abpoa.og -b -w 20 -y 500 -o DRB1-3123.abpoa.og.viz.inv.png -z

DRB1-3123 abpoa og viz inv

I was surprised to not see any, because one of the sequences in the set is inverted relative to the others.

edyeet -X -n 20 -N -n 10 DRB1-3123.fa DRB1-3123.fa
...
gi|345525392:5000-18402 13403   0       13403   -       gi|568815567:3779003-3792415    13413   2       13404   13391   13403   33      id:f:0.999552   ma:i:13391      mm:i:1  ni:i:0  nd:i:10ns:i:11 ed:i:22 al:i:13413      se:f:0.0016402  cg:Z:4894=10D6889=1X1608=11I

This test is not possible to complete using spoa, or I would show a direct comparison. You can see the orientation in the pggb output though:

pggb -i DRB1-3123.fa -s 3000 -p 70 -a 70 -n 10 -t 16 -v -l -k 8

DRB1-3123 fa pggb-s3000-p70-n10-a70-K16-k8-w10000-j5000-W0-e100 smooth og viz inv

So it would seem that when you write the sequence in the GFA, you'd want to indicate that it's inverting if it aligns in the reverse orientation. I patched spoa to support this as well, so it shouldn't be hard to fix here.

By the way, the direct abPOA output is much nicer than pggb here. In pggb, the smoothxg step can't handle very large graphs due to the memory requirements of the exact alignment calculation in spoa. Swapping in abPOA should resolve that to some extent.

This is the input if you'd like to test: https://github.com/ekg/HLA-zoo/blob/master/seqs/DRB1-3123.fa.

@yangao07
Copy link
Owner

yangao07 commented May 15, 2021

Hi @ekg ,

I added a minimizer-based seeding in the latest version of abPOA (v1.2.0).
It can reduce the memory usage for very long input sequence, e.g this DRB1.
Also, it should produce similar or even better alignment results.

Yan

@ekg
Copy link
Contributor Author

ekg commented May 15, 2021 via email

@yangao07
Copy link
Owner

The seeding step is actually very simple.
I collected all the minimizer hits between two sequences based on the input order.
Then, I applied a similar chaining strategy as used in minimap2 to find all the co-linear chains, but not allowing large gaps.
Lastly, all the non-overlapping co-linear chains are chained together to produce the final chaining result, which guides the POA in the next step.

@ekg
Copy link
Contributor Author

ekg commented May 16, 2021 via email

@yangao07
Copy link
Owner

Based on the input order, the i-th sequence is the reference for the (i+1)-th.

@ekg
Copy link
Contributor Author

ekg commented May 16, 2021 via email

@ekg
Copy link
Contributor Author

ekg commented May 16, 2021 via email

@yangao07
Copy link
Owner

The longest one is actually just this DRB1.

@ekg
Copy link
Contributor Author

ekg commented May 17, 2021 via email

@yangao07
Copy link
Owner

That will be great.
Curious what will happen if we feed abPOA with that long sequences.

@yangao07
Copy link
Owner

Just updated to v1.2.1.
Removes a redundant and very time-consuming sorting step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants