Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/fix issue 36 #39

Merged
merged 7 commits into from
Feb 6, 2024
Merged

Feature/fix issue 36 #39

merged 7 commits into from
Feb 6, 2024

Conversation

migbro
Copy link
Contributor

@migbro migbro commented Feb 2, 2024

My colleague @dmiller15 had a much simpler suggestion to simply skip * entries as a start in the annotator tool. It'd likely tidy up our current concerns...expanding to skip all non ACGT alleles (like N or ACN) could be tackled, but not sure it needs to immediately.

src/commands/annotate_cmd.rs Outdated Show resolved Hide resolved
@migbro
Copy link
Contributor Author

migbro commented Feb 5, 2024

Hmmm, your suspicions were correct. Performance took an absolute beating. Before:

[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 54724
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 109576
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 136645
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820929
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820932
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820933
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820937
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820940
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820946
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820948
[echtvar] chr1:3856686 evaluated 10000 variants (20618 / second). wrote 10000 variants.
[echtvar] chr1:7278646 evaluated 20000 variants (25348 / second). wrote 20000 variants.
[echtvar] chr1:11711930 evaluated 30000 variants (24135 / second). wrote 30000 variants.
[echtvar] chr1:41528737 evaluated 100000 variants (26082 / second). wrote 100000 variants.
[echtvar] chr1:88677065 evaluated 200000 variants (26648 / second). wrote 200000 variants.
[echtvar] chr1:123927955 evaluated 300000 variants (28639 / second). wrote 300000 variants.
[echtvar] chr2:191385523 evaluated 1000000 variants (28066 / second). wrote 1000000 variants.
[echtvar] chr4:166499610 evaluated 2000000 variants (28035 / second). wrote 2000000 variants.
[echtvar] chr7:26761630 evaluated 3000000 variants (28011 / second). wrote 3000000 variants.
[echtvar] chr10:5013757 evaluated 4000000 variants (27790 / second). wrote 4000000 variants.
[echtvar] chr12:131568036 evaluated 5000000 variants (27859 / second). wrote 5000000 variants.
[echtvar] chr17:26955403 evaluated 6000000 variants (27892 / second). wrote 6000000 variants.
[echtvar] chrX:44338293 evaluated 7000000 variants (27958 / second). wrote 7000000 variants.
[echtvar] evaluated 7168554 variants (27870 / second). wrote 7168554 variants.

real	4m18.420s
user	5m44.228s
sys	0m11.149s

After:

contig chr1 pos 54724 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 109576 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 136645 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820929 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820932 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820933 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820937 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820940 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820946 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820948 alt has * value, skipping annotation, outputting entry as-is
[echtvar] chr1:3856686 evaluated 10000 variants (3594 / second). wrote 9932 variants.
[echtvar] chr1:7278646 evaluated 20000 variants (4405 / second). wrote 19871 variants.
[echtvar] chr1:11711930 evaluated 30000 variants (4131 / second). wrote 29808 variants.
[echtvar] chr1:41528737 evaluated 100000 variants (4478 / second). wrote 99431 variants.
[echtvar] chr1:88677065 evaluated 200000 variants (4530 / second). wrote 198983 variants.
[echtvar] chr1:123927955 evaluated 300000 variants (4948 / second). wrote 298617 variants.
[echtvar] chr2:191385523 evaluated 1000000 variants (4800 / second). wrote 995154 variants.
[echtvar] chr4:166499610 evaluated 2000000 variants (4830 / second). wrote 1990363 variants.
[echtvar] chr7:26761630 evaluated 3000000 variants (4885 / second). wrote 2985468 variants.

I didn't bother letting it finish...I'll see if we can think of something else. At that speed, it's no better than bcftools I don't think.

@migbro
Copy link
Contributor Author

migbro commented Feb 5, 2024

This was me testing on a trio germline dataset

@brentp
Copy link
Owner

brentp commented Feb 5, 2024

hmm. it shouldn't have that much difference. Are you sure it wasn't somehow cached on the disk? In first run? And did you compile in release mode both times?

@migbro
Copy link
Contributor Author

migbro commented Feb 5, 2024

Hmmm, the first time was from a docker image that I had built (which is release mode), the second time built like it is in the big.sh script with very few flags. I can see if that makes a difference, it'd be nice if that were the case.

@brentp
Copy link
Owner

brentp commented Feb 5, 2024

yeah, default is debug mode, so if you add --release, it should help a lot.

@dmiller15
Copy link
Contributor

I guess if we're hard set on avoiding additional vectors we could potentially move this into the update_expr_values function. We're already create the alleles vector there.

This eidx if/else https://github.com/brentp/echtvar/blob/main/src/lib/echtvar.rs#L360-L374 could get an additional check for alleles[1][0] being star and set Err(0).

Changes the behavior in that we're going to get the user defined missing values for these star records.

@migbro
Copy link
Contributor Author

migbro commented Feb 5, 2024

Ha, you weren't kidding. In release mode:

contig chr1 pos 54724 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 109576 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 136645 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820929 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820932 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820933 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820937 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820940 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820946 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820948 alt has * value, skipping annotation, outputting entry as-is
not reporting further warnings
[echtvar] chr1:3856686 evaluated 10000 variants (23809 / second). wrote 9932 variants.
[echtvar] chr1:7278646 evaluated 20000 variants (28735 / second). wrote 19871 variants.
[echtvar] chr1:11711930 evaluated 30000 variants (27700 / second). wrote 29808 variants.
[echtvar] chr1:41528737 evaluated 100000 variants (29895 / second). wrote 99431 variants.
[echtvar] chr1:88677065 evaluated 200000 variants (30280 / second). wrote 198983 variants.
[echtvar] chr1:123927955 evaluated 300000 variants (32761 / second). wrote 298617 variants.
[echtvar] chr2:191385523 evaluated 1000000 variants (32364 / second). wrote 995154 variants.
[echtvar] chr4:166499610 evaluated 2000000 variants (32402 / second). wrote 1990363 variants.
[echtvar] chr7:26761630 evaluated 3000000 variants (32299 / second). wrote 2985468 variants.
[echtvar] chr10:5013757 evaluated 4000000 variants (31676 / second). wrote 3980662 variants.
[echtvar] chr12:131568036 evaluated 5000000 variants (31154 / second). wrote 4975474 variants.
[echtvar] chr17:26955403 evaluated 6000000 variants (30839 / second). wrote 5970308 variants.
[echtvar] chrX:44338293 evaluated 7000000 variants (30891 / second). wrote 6964726 variants.
[echtvar] evaluated 7168554 variants (30847 / second). wrote 7132548 variants.

real	3m53.756s
user	5m33.645s
sys	0m9.607s

And to rule out disk caching, I grabbed another file:

...
[echtvar] chr1:3621740 evaluated 10000 variants (23866 / second). wrote 9923 variants.
[echtvar] chr1:6984629 evaluated 20000 variants (28694 / second). wrote 19863 variants.
[echtvar] chr1:11068271 evaluated 30000 variants (29013 / second). wrote 29818 variants.
[echtvar] chr1:36348307 evaluated 100000 variants (29682 / second). wrote 99474 variants.
[echtvar] chr1:80929174 evaluated 200000 variants (28731 / second). wrote 199011 variants.
[echtvar] chr1:122617572 evaluated 300000 variants (30117 / second). wrote 298533 variants.
[echtvar] chr2:161533605 evaluated 1000000 variants (30614 / second). wrote 995313 variants.
[echtvar] chr4:119815132 evaluated 2000000 variants (31118 / second). wrote 1990591 variants.
[echtvar] chr6:133804491 evaluated 3000000 variants (32149 / second). wrote 2985724 variants.
[echtvar] chr9:40358564 evaluated 4000000 variants (31931 / second). wrote 3980823 variants.
[echtvar] chr12:20835434 evaluated 5000000 variants (31591 / second). wrote 4975837 variants.
[echtvar] chr15:97834681 evaluated 6000000 variants (31408 / second). wrote 5970966 variants.
[echtvar] chr20:47014627 evaluated 7000000 variants (31129 / second). wrote 6965828 variants.
[echtvar] evaluated 7576734 variants (31222 / second). wrote 7539482 variants.
Skipped 37252 variants with * alt.

Lastly, cleared the disk cache with echo 3 | sudo tee /proc/sys/vm/drop_caches and reverted to pre-* skip:

[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 180951
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 631860
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 789516
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 789517
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 791549
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820931
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820940
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820946
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 904491
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 904493
[echtvar] chr1:3621740 evaluated 10000 variants (18761 / second). wrote 10000 variants.
[echtvar] chr1:6984629 evaluated 20000 variants (21482 / second). wrote 20000 variants.
[echtvar] chr1:11068271 evaluated 30000 variants (20242 / second). wrote 30000 variants.
[echtvar] chr1:36348307 evaluated 100000 variants (24260 / second). wrote 100000 variants.
[echtvar] chr1:80929174 evaluated 200000 variants (25169 / second). wrote 200000 variants.
[echtvar] chr1:122617572 evaluated 300000 variants (26221 / second). wrote 300000 variants.
[echtvar] chr2:161533605 evaluated 1000000 variants (27826 / second). wrote 1000000 variants.
[echtvar] chr4:119815132 evaluated 2000000 variants (27360 / second). wrote 2000000 variants.
[echtvar] chr6:133804491 evaluated 3000000 variants (28070 / second). wrote 3000000 variants.
[echtvar] chr9:40358564 evaluated 4000000 variants (28185 / second). wrote 4000000 variants.
[echtvar] chr12:20835434 evaluated 5000000 variants (28432 / second). wrote 5000000 variants.
[echtvar] chr15:97834681 evaluated 6000000 variants (28554 / second). wrote 6000000 variants.
[echtvar] chr20:47014627 evaluated 7000000 variants (28659 / second). wrote 7000000 variants.
[echtvar] evaluated 7576734 variants (28659 / second). wrote 7576734 variants.

real	4m25.610s
user	6m0.630s
sys	0m11.247s

Perhaps surprisingly or not, it's actually faster. My guess is cutting out needing to encode the * values, and as @dmiller15 pointed out to me, also the annotation of the records all together.

@brentp brentp merged commit 45b468e into brentp:main Feb 6, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants