Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize find_end #4943

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

AlexGuteniev
Copy link
Contributor

📜 Overview

  • Similar to Vectorize std::search of 1 and 2 bytes elements with pcmpestri #4745 in the opposite direction
  • The algorithm is not very similar though. I have not attempted the one-index pcmpestri instruction and variable step, but it looked like multiple-indices pcmpestrm and fixed step would work better for this direction, because the former would return more partial false matches than in forward direction, as the match with higher index is more likely to be partial.
    • If that's not convincing enough, I can try the other approach
    • pcmpestrm approach could have been made more efficient if DevCom-10689455 is fixed, but honestly I don't hope for that much
  • Also restructured control flow a vit in __std_search_impl, the place where _Match_1st_16 is introduced. I basically implemented the same thing from scratch, and the new attempt gave clearer control flow implementation. They are not shared though, as _First1 > _Stop1 check is needed for variable step, but not needed for fixed step, so here's still some variation.
  • The types larger than 2 bytes were excluded from test and benchmark -- same way as we don't test replace with smaller elements.
  • The benchmark patterns list is expanded to test small and large needle closer to the other end. I deliberately left coverage for search + closer to start and find_end + closer to end, It is useful to test the case where startup cost contributes more.
  • Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved

🏁 Benchmark results

Benchmark main this
classic_find_endstd::uint8_t/0 102 ns 22.4 ns
classic_find_endstd::uint8_t/1 107 ns 23.8 ns
classic_find_endstd::uint8_t/2 1447 ns 298 ns
classic_find_endstd::uint8_t/3 1734 ns 384 ns
classic_find_endstd::uint8_t/4 5489 ns 4998 ns
classic_find_endstd::uint8_t/5 17246 ns 15661 ns
classic_find_endstd::uint16_t/0 103 ns 39.2 ns
classic_find_endstd::uint16_t/1 103 ns 44.7 ns
classic_find_endstd::uint16_t/2 1444 ns 630 ns
classic_find_endstd::uint16_t/3 1740 ns 822 ns
classic_find_endstd::uint16_t/4 6487 ns 9816 ns
classic_find_endstd::uint16_t/5 14878 ns 20222 ns
ranges_find_endstd::uint8_t/0 101 ns 22.9 ns
ranges_find_endstd::uint8_t/1 109 ns 24.9 ns
ranges_find_endstd::uint8_t/2 1451 ns 303 ns
ranges_find_endstd::uint8_t/3 1767 ns 390 ns
ranges_find_endstd::uint8_t/4 5363 ns 5087 ns
ranges_find_endstd::uint8_t/5 16254 ns 15942 ns
ranges_find_endstd::uint16_t/0 103 ns 39.4 ns
ranges_find_endstd::uint16_t/1 104 ns 46.1 ns
ranges_find_endstd::uint16_t/2 1457 ns 633 ns
ranges_find_endstd::uint16_t/3 1728 ns 811 ns
ranges_find_endstd::uint16_t/4 5343 ns 9832 ns
ranges_find_endstd::uint16_t/5 14883 ns 20222 ns
Re-testong of the search PR against then-main as there are more cases in the benchmark
Benchmark main this
c_strstr/0 190 ns 188 ns
c_strstr/1 219 ns 212 ns
c_strstr/2 14.2 ns 13.8 ns
c_strstr/3 10.7 ns 9.92 ns
c_strstr/4 1478 ns 1416 ns
c_strstr/5 17965 ns 16897 ns
classic_searchstd::uint8_t/0 2193 ns 275 ns
classic_searchstd::uint8_t/1 2455 ns 305 ns
classic_searchstd::uint8_t/2 144 ns 30.6 ns
classic_searchstd::uint8_t/3 66.9 ns 17.0 ns
classic_searchstd::uint8_t/4 5315 ns 1439 ns
classic_searchstd::uint8_t/5 15060 ns 11813 ns
classic_searchstd::uint16_t/0 1460 ns 519 ns
classic_searchstd::uint16_t/1 1606 ns 571 ns
classic_searchstd::uint16_t/2 130 ns 56.1 ns
classic_searchstd::uint16_t/3 56.9 ns 27.1 ns
classic_searchstd::uint16_t/4 5342 ns 7312 ns
classic_searchstd::uint16_t/5 28964 ns 20472 ns
ranges_searchstd::uint8_t/0 2102 ns 275 ns
ranges_searchstd::uint8_t/1 2384 ns 301 ns
ranges_searchstd::uint8_t/2 147 ns 30.5 ns
ranges_searchstd::uint8_t/3 76.2 ns 17.0 ns
ranges_searchstd::uint8_t/4 5325 ns 1438 ns
ranges_searchstd::uint8_t/5 15573 ns 11803 ns
ranges_searchstd::uint16_t/0 1482 ns 519 ns
ranges_searchstd::uint16_t/1 1634 ns 571 ns
ranges_searchstd::uint16_t/2 155 ns 55.4 ns
ranges_searchstd::uint16_t/3 70.0 ns 28.1 ns
ranges_searchstd::uint16_t/4 5339 ns 7338 ns
ranges_searchstd::uint16_t/5 22691 ns 20377 ns
search_default_searcherstd::uint8_t/0 1963 ns 273 ns
search_default_searcherstd::uint8_t/1 2182 ns 301 ns
search_default_searcherstd::uint8_t/2 147 ns 30.4 ns
search_default_searcherstd::uint8_t/3 60.4 ns 16.6 ns
search_default_searcherstd::uint8_t/4 5816 ns 1441 ns
search_default_searcherstd::uint8_t/5 20702 ns 11753 ns
search_default_searcherstd::uint16_t/0 2443 ns 519 ns
search_default_searcherstd::uint16_t/1 2671 ns 607 ns
search_default_searcherstd::uint16_t/2 204 ns 55.6 ns
search_default_searcherstd::uint16_t/3 92.1 ns 27.4 ns
search_default_searcherstd::uint16_t/4 5676 ns 7294 ns
search_default_searcherstd::uint16_t/5 30609 ns 20342 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner September 9, 2024 20:07
@StephanTLavavej StephanTLavavej added the performance Must go faster label Sep 9, 2024
@StephanTLavavej StephanTLavavej self-assigned this Sep 9, 2024
@AlexGuteniev
Copy link
Contributor Author

  • Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved

I can look into improving the "evil" case results by detecting it and switching strategy

Resolved adjacent add/edit conflict in stl/src/vector_algorithms.cpp.
@StephanTLavavej
Copy link
Member

I've pushed a merge with main, resolving a trivial adjacent add/edit conflict in vector_algorithms.cpp. No further clang-format regen was necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

2 participants