Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite ascii check to allow compiler auto vectorization #133

Merged
merged 1 commit into from
Apr 26, 2024
Merged

Conversation

rhpvorderman
Copy link
Collaborator

@rhpvorderman rhpvorderman commented Apr 8, 2024

Rewriting it to use size_t will create the most optimal code for all platofrms. Also I did a small optimization that works well for small strings.

Say if the string is length 13. Previously the function would check 8 bytes at once, then check 5 bytes individually. What I do now is simply substract 8 from the length and check an 8 byte vector at the end. This is much more optimal for smaller strings.

Not that we need that, as we check 128kb chunks. But it eliminates a loop from the function, so might as well leave it in.

The 8byte loop is unrolled to do 4 OR operations simultaneously. This is automatically compiled to do two SSE2 OR operations with -O2 on GCC. As a result this code is not slower than the code that it replaces with had vectors inserted manually.

The resulting code is much more portable as a result, and should also compile to use vectors on ARM platforms.

EDIT: See also godbolt compiler explorer

@rhpvorderman
Copy link
Collaborator Author

rhpvorderman commented Apr 8, 2024

Damn! Clang does not recognize the optimization opportunity. But it does so for the last loop (that is not unrolled). However if I write it like that, GCC does not recognize the optimization opportunity. In any case, this code + GCC optimization results in the most optimal assembly. But it will be not usable for clang users.

So, best to not merge this, I guess?

EDIT: I did a bit of benchmarking. Due to the unrolling, the code is more optimal. Even in such a way that before and after this PR there is no significant difference with CLANG. With GCC however, it is even faster.

@rhpvorderman
Copy link
Collaborator Author

@marcelm, friendly reminder ping. I decided to merge this code into sequali because GCC is used on linux and Linux is the only production platform for bioinformatics. (I am not aware of any MacOS and Windows compute clusters out there). Furthermore it should be a lot faster on platforms that do not benefit from auto vectorization due to the loop unrolling. In addition to that, I think the code is nicer when it does not make have use of if __SSE2__ and other compile guards. But it is your call in this case.

@marcelm marcelm merged commit 2aaaba6 into main Apr 26, 2024
22 checks passed
@marcelm marcelm deleted the asciiagain branch April 26, 2024 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants