Homopolymer indels not consistently aligned #48

Open
rlorigro opened this issue May 31, 2023 · 6 comments

@rlorigro

Hi, I am trying to get a reasonable alignment in a region which has some tandem repeats, flanked by non-repetitive sequence. I can get good (enough) results in the tandem region using these parameters:

abpoa \
-n 10 \
--progressive \
--amb-strand \
-b 1000 \
-r 1 \

However, in the (mostly non-repetitive) flanking region there is a long homopolymer, where I get this result:

TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGTCTGGGCAACATAGTGAGACATTGTCTCTAC------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGTCTGGGCAACATAGTGAGACATTGTCTCTAC------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTACA-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTACA-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTACA-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------AAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------AAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA-------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------ACAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------ACAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------ACAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------AC-AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC------------------------AAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC----------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC------------------------AAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC

The aligner seems to arbitrarily assign different paths to the same AC prefix. Do you think this can be resolved with parameter choices, or is this an unavoidable aspect of POA?
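One way to quantify the inconsistency is to group the aligned rows by their ungapped sequence and report any read that appears with more than one gap layout. The following is a minimal sketch in plain Python; the function names and the toy rows are hypothetical and are not part of abPOA.

```python
from collections import defaultdict


def layouts_by_read(rows):
    """Map each ungapped read to the set of distinct aligned rows it produced."""
    layouts = defaultdict(set)
    for row in rows:
        layouts[row.replace("-", "")].add(row)
    return layouts


def inconsistent_reads(rows):
    """Return reads whose identical sequence was aligned in more than one way."""
    return {
        seq: sorted(aligned)
        for seq, aligned in layouts_by_read(rows).items()
        if len(aligned) > 1
    }


# Toy rows (hypothetical): the same read "CTACAAAA" is laid out two ways.
rows = ["CTAC--AAAA", "CT--ACAAAA", "CTAC--AAAA", "CTACAAAAAA"]
for seq, aligned in inconsistent_reads(rows).items():
    print(seq, aligned)
```

Reads that differ in homopolymer length are separate keys here, so this only flags the strictest case, where the same sequence ends up in different columns.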

Thanks

@yangao07
Owner

This is actually the optimal alignment in terms of the whole partial-order graph.
For some sequences, e.g.,

TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCT--------------------ACAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC

the "AC" was put after the gaps: ...CTCT--------------------ACAA..., instead of before the gaps.

This is because, with "AC" placed after the gaps, the number of gaps matches ...CTCT--------------------AAAAA... in other sequences, so the alignment costs only a mismatch. Placing "AC" before the gaps, as in "...CTAC----------------------AAAA...", would cost indels instead.

So if you set the mismatch penalty higher than the cost of two or more indels, you may get the result you want. However, this may cause other alignment issues, with many indels showing up elsewhere.
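The trade-off described here can be made concrete with a small cost comparison. This is a sketch under stated assumptions: it uses a simple affine gap model (a gap of length k costs open + k * extend), it assumes the alternative to one mismatch is two extra one-column gaps, and the penalty values are illustrative only. None of this is taken from abPOA's source, and its actual gap model and defaults may differ.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap of `length` columns under an affine model (assumed)."""
    return 0 if length == 0 else gap_open + length * gap_extend


def placement_costs(mismatch, gap_open, gap_extend, extra_gaps=2, gap_len=1):
    """Return (cost with one mismatch, cost with `extra_gaps` extra gaps)."""
    as_mismatch = mismatch
    as_indels = extra_gaps * affine_gap_cost(gap_len, gap_open, gap_extend)
    return as_mismatch, as_indels


def preferred(mismatch, gap_open, gap_extend):
    """Name the cheaper placement; a tie is reported as 'indels'."""
    as_mismatch, as_indels = placement_costs(mismatch, gap_open, gap_extend)
    return "mismatch" if as_mismatch < as_indels else "indels"


# Illustrative penalties (hypothetical values, not confirmed abPOA defaults).
print(preferred(mismatch=4, gap_open=4, gap_extend=2))   # one mismatch is cheaper
print(preferred(mismatch=13, gap_open=4, gap_extend=2))  # two indels are cheaper
```

Under these assumed values, the mismatch penalty has to reach the combined cost of both gaps before the aligner would switch to the indel placement, which is also why raising it can introduce extra indels elsewhere.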

@rlorigro
Author

rlorigro commented Jun 1, 2023

right... interesting.

I thought if I isolated these subsequences and reran progressive alignment using the same parameters, it could remove some order dependence. The result was slightly more interpretable, but still not ideal.

However, if I sort the reads by ascending length first and then rerun I get this output, which is closer to what I was hoping for:

>1
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-----------------------AAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>2
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-----------------------AAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>3
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>4
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>5
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>6
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>7
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>8
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>9
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>10
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>11
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>12
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>13
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>14
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>15
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------------AAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>16
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>17
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>18
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>19
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>20
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>21
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>22
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>23
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>24
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>25
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>26
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>27
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>28
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>29
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------------AAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>30
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAA-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
>31
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAA-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>32
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAA-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>33
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAA-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>34
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAA-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>35
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAA-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>36
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>37
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>38
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------------------AAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>39
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>40
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
>41
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
>42
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>43
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCAGGTGTGGTGGTGCC
>44
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>45
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>46
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>47
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTA------------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>48
TTCAAGACCAGTCTGGGCAACATAGTGAGACATTGTCTCTAC-----------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>49
TTCAAGACCAGTCTGGGCAACATAGTGAGACATTGTCTCTAC-----------------AAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>50
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------AAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>51
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---------------AAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>52
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>53
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>54
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>55
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>56
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>57
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>58
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>59
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>60
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>61
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>62
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>63
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC--------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>64
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>65
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>66
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>67
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>68
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>69
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>70
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>71
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>72
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>73
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>74
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>75
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC------AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>76
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC-----AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>77
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTAC---AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC
>78
TTCAAGACCAGCCTGGGCAACATAGTGAGACATTGTCTCTACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACACAAAATTAGTCGGGTGTGGTGGTGCC

Is the --progressive parameter being used in this context? I would have thought that the output would be independent of the input order when using progressive.

Regardless, this result makes me think that a combination of seeding/chaining and locally computed guide trees could be interesting. But perhaps it would turn out to be just another exercise in manual parameter tuning.

@glennhickey
Contributor

@rlorigro FWIW in cactus we pass in the sequences in descending order of length to abpoa. Sorting in this way really helped accuracy if I recall, even (counterintuitively?) when abPOA's progressive mode is enabled.

@rlorigro
Author

rlorigro commented Jun 1, 2023

Yeah, I tried both and got "better" results with ascending order this time. I think ascending order works in this case because we want to force a gap to be introduced early in the graph, which gives later sequences a lower-cost path that extends the gap successively.
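For anyone who wants to try both orderings, here is a minimal sketch of sorting a FASTA file's records by length before handing them to the aligner. It is plain Python with no dependencies; the helper names are hypothetical, and it assumes the reads fit in memory.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def sort_by_length(records, descending=False):
    """Sort records by sequence length; equal lengths keep their input order."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=descending)


def to_fasta(records):
    """Serialize (name, sequence) tuples back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


example = ">a\nACGT\n>b\nAC\n>c\nACG\n"
print(to_fasta(sort_by_length(read_fasta(example))))                   # ascending
print(to_fasta(sort_by_length(read_fasta(example), descending=True)))  # descending
```

Writing each ordering to its own file and running the same abpoa command on both makes it easy to compare the resulting alignments side by side.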

@rlorigro
Author

rlorigro commented Jun 1, 2023

It would be interesting to see each stage of the graph being built, to verify what is happening.

@yangao07
Owner

yangao07 commented Jun 1, 2023

@rlorigro The --progressive option in abPOA does not perform pairwise alignment between every pair of sequences; it only calculates an approximate similarity, to minimize the run time of this step.
In your case, all the sequences are highly similar except for those gaps, so --progressive may not be able to differentiate them, which means that the input order still makes a big difference.
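A small illustration of why an approximate similarity can fail to separate these reads: the sketch below uses k-mer Jaccard similarity as an assumed stand-in (the thread does not say which measure abPOA computes, and the value of k is arbitrary). Two reads that differ only in homopolymer length share exactly the same k-mer set whenever both runs are at least k long.

```python
def kmers(seq, k):
    """Set of all k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=5):
    """Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 1.0


# Flanks copied from the reads in this thread; homopolymer lengths are illustrative.
left, right = "GACATTGTCTCTAC", "CACAAAATTAGTC"
short_read = left + "A" * 20 + right
long_read = left + "A" * 30 + right

print(jaccard(short_read, long_read))  # the k-mer sets are identical
```

With a similarity of this kind, a guide tree has nothing to distinguish the two reads by, so their relative order falls back to whatever order they were given in.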
