Use AVX/AVX2 masks in minmax_element and minmax vectorization #4917

Open
wants to merge 7 commits into
base: main

AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Aug 26, 2024

🧭 Overview

Use AVX2 masks to read the tails in the `minmax` / `minmax_element` vectorization, then reuse the same masks to fill the unused tail lanes with previous data and to exclude tail indices in the `_element` algorithms.
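For readers unfamiliar with the technique, here is a minimal sketch of a masked tail read for a 32-bit minimum. It is not the code in this PR: `masked_min_i32` and `mask_table` are hypothetical names, the choice to seed the accumulator with `data[0]` is an assumption made for illustration, and the sketch falls back to plain `std::min_element` when the translation unit is not compiled with AVX2 enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (illustration only): minimum of n >= 1 int32_t values.
inline std::int32_t masked_min_i32(const std::int32_t* data, std::size_t n) {
    assert(n >= 1);
#if defined(__AVX2__)
    // Loading 8 ints at offset (8 - tail) gives a mask whose first `tail` lanes are set.
    alignas(32) static const std::int32_t mask_table[16] = {
        -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0};

    // Seed every lane with a real element so no lane holds a foreign value.
    __m256i cur = _mm256_set1_epi32(data[0]);

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) { // full 8-lane blocks
        const __m256i block =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        cur = _mm256_min_epi32(cur, block);
    }

    const std::size_t tail = n - i; // 0..7 leftover elements
    if (tail != 0) {
        const __m256i mask = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(mask_table + (8 - tail)));
        // Masked-off lanes are not read from memory and come back as zero.
        const __m256i loaded =
            _mm256_maskload_epi32(reinterpret_cast<const int*>(data + i), mask);
        // Overwrite the zeroed lanes with previous data so they cannot win.
        const __m256i filled = _mm256_blendv_epi8(cur, loaded, mask);
        cur = _mm256_min_epi32(cur, filled);
    }

    // Horizontal reduction, kept scalar for brevity.
    alignas(32) std::int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), cur);
    return *std::min_element(lanes, lanes + 8);
#else
    // Scalar fallback when AVX2 is not enabled at compile time.
    return *std::min_element(data, data + n);
#endif
}
```

The blend step is what keeps the zero-filled lanes from affecting the result: for an input such as `{4, 2, 6}`, an unblended masked load would introduce zeros that are smaller than every real element. An `_element` variant would presumably apply the same mask to its comparison results so that tail lanes never contribute an index, but that part is not shown here.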

⏱️ Benchmark results

The 8-bit and 16-bit `Both_val` cases are expected to improve significantly too, but that improvement is currently hidden by #4913.

| Benchmark | main | this |
|---|---|---|
| `bm<uint8_t, Op::Min>/8021` | 173 ns | 168 ns |
| `bm<uint8_t, Op::Min>/63` | 21.3 ns | 9.64 ns |
| `bm<uint8_t, Op::Max>/8021` | 173 ns | 163 ns |
| `bm<uint8_t, Op::Max>/63` | 22.1 ns | 9.74 ns |
| `bm<uint8_t, Op::Both>/8021` | 287 ns | 277 ns |
| `bm<uint8_t, Op::Both>/63` | 40.2 ns | 20.4 ns |
| `bm<uint8_t, Op::Min_val>/8021` | 73.5 ns | 69.9 ns |
| `bm<uint8_t, Op::Min_val>/63` | 14.9 ns | 4.40 ns |
| `bm<uint8_t, Op::Max_val>/8021` | 75.7 ns | 67.1 ns |
| `bm<uint8_t, Op::Max_val>/63` | 13.9 ns | 4.29 ns |
| `bm<uint8_t, Op::Both_val>/8021` | 3250 ns | 3250 ns |
| `bm<uint8_t, Op::Both_val>/63` | 29.3 ns | 29.2 ns |
| `bm<uint16_t, Op::Min>/8021` | 314 ns | 318 ns |
| `bm<uint16_t, Op::Min>/31` | 13.2 ns | 8.51 ns |
| `bm<uint16_t, Op::Max>/8021` | 314 ns | 316 ns |
| `bm<uint16_t, Op::Max>/31` | 13.2 ns | 8.51 ns |
| `bm<uint16_t, Op::Both>/8021` | 526 ns | 538 ns |
| `bm<uint16_t, Op::Both>/31` | 27.1 ns | 18.4 ns |
| `bm<uint16_t, Op::Min_val>/8021` | 131 ns | 128 ns |
| `bm<uint16_t, Op::Min_val>/31` | 5.53 ns | 3.58 ns |
| `bm<uint16_t, Op::Max_val>/8021` | 136 ns | 127 ns |
| `bm<uint16_t, Op::Max_val>/31` | 5.33 ns | 3.59 ns |
| `bm<uint16_t, Op::Both_val>/8021` | 4541 ns | 4550 ns |
| `bm<uint16_t, Op::Both_val>/31` | 18.1 ns | 17.9 ns |
| `bm<uint32_t, Op::Min>/8021` | 627 ns | 605 ns |
| `bm<uint32_t, Op::Min>/15` | 8.79 ns | 7.11 ns |
| `bm<uint32_t, Op::Max>/8021` | 622 ns | 613 ns |
| `bm<uint32_t, Op::Max>/15` | 8.81 ns | 7.18 ns |
| `bm<uint32_t, Op::Both>/8021` | 1045 ns | 1052 ns |
| `bm<uint32_t, Op::Both>/15` | 19.5 ns | 17.0 ns |
| `bm<uint32_t, Op::Min_val>/8021` | 258 ns | 246 ns |
| `bm<uint32_t, Op::Min_val>/15` | 6.01 ns | 3.14 ns |
| `bm<uint32_t, Op::Max_val>/8021` | 258 ns | 253 ns |
| `bm<uint32_t, Op::Max_val>/15` | 3.63 ns | 3.13 ns |
| `bm<uint32_t, Op::Both_val>/8021` | 364 ns | 328 ns |
| `bm<uint32_t, Op::Both_val>/15` | 8.17 ns | 7.26 ns |
| `bm<uint64_t, Op::Min>/8021` | 3480 ns | 3565 ns |
| `bm<uint64_t, Op::Min>/7` | 8.92 ns | 9.99 ns |
| `bm<uint64_t, Op::Max>/8021` | 3552 ns | 3552 ns |
| `bm<uint64_t, Op::Max>/7` | 8.72 ns | 9.14 ns |
| `bm<uint64_t, Op::Both>/8021` | 4079 ns | 4089 ns |
| `bm<uint64_t, Op::Both>/7` | 18.7 ns | 17.9 ns |
| `bm<uint64_t, Op::Min_val>/8021` | 2861 ns | 2868 ns |
| `bm<uint64_t, Op::Min_val>/7` | 4.53 ns | 4.78 ns |
| `bm<uint64_t, Op::Max_val>/8021` | 2849 ns | 2871 ns |
| `bm<uint64_t, Op::Max_val>/7` | 4.54 ns | 4.77 ns |
| `bm<uint64_t, Op::Both_val>/8021` | 2898 ns | 2932 ns |
| `bm<uint64_t, Op::Both_val>/7` | 9.56 ns | 9.29 ns |
| `bm<int8_t, Op::Min>/8021` | 166 ns | 166 ns |
| `bm<int8_t, Op::Min>/63` | 20.7 ns | 12.6 ns |
| `bm<int8_t, Op::Max>/8021` | 171 ns | 165 ns |
| `bm<int8_t, Op::Max>/63` | 21.4 ns | 12.6 ns |
| `bm<int8_t, Op::Both>/8021` | 286 ns | 273 ns |
| `bm<int8_t, Op::Both>/63` | 33.1 ns | 19.4 ns |
| `bm<int8_t, Op::Min_val>/8021` | 77.9 ns | 64.2 ns |
| `bm<int8_t, Op::Min_val>/63` | 14.4 ns | 4.74 ns |
| `bm<int8_t, Op::Max_val>/8021` | 75.0 ns | 72.4 ns |
| `bm<int8_t, Op::Max_val>/63` | 16.7 ns | 4.54 ns |
| `bm<int8_t, Op::Both_val>/8021` | 3233 ns | 3226 ns |
| `bm<int8_t, Op::Both_val>/63` | 28.8 ns | 29.4 ns |
| `bm<int16_t, Op::Min>/8021` | 315 ns | 316 ns |
| `bm<int16_t, Op::Min>/31` | 14.0 ns | 11.3 ns |
| `bm<int16_t, Op::Max>/8021` | 314 ns | 321 ns |
| `bm<int16_t, Op::Max>/31` | 13.9 ns | 11.3 ns |
| `bm<int16_t, Op::Both>/8021` | 527 ns | 532 ns |
| `bm<int16_t, Op::Both>/31` | 23.2 ns | 18.0 ns |
| `bm<int16_t, Op::Min_val>/8021` | 135 ns | 130 ns |
| `bm<int16_t, Op::Min_val>/31` | 11.2 ns | 4.06 ns |
| `bm<int16_t, Op::Max_val>/8021` | 134 ns | 131 ns |
| `bm<int16_t, Op::Max_val>/31` | 11.4 ns | 4.08 ns |
| `bm<int16_t, Op::Both_val>/8021` | 4180 ns | 4249 ns |
| `bm<int16_t, Op::Both_val>/31` | 18.1 ns | 18.2 ns |
| `bm<int32_t, Op::Min>/8021` | 619 ns | 607 ns |
| `bm<int32_t, Op::Min>/15` | 9.40 ns | 10.1 ns |
| `bm<int32_t, Op::Max>/8021` | 622 ns | 608 ns |
| `bm<int32_t, Op::Max>/15` | 9.81 ns | 10.1 ns |
| `bm<int32_t, Op::Both>/8021` | 1059 ns | 1037 ns |
| `bm<int32_t, Op::Both>/15` | 19.0 ns | 16.6 ns |
| `bm<int32_t, Op::Min_val>/8021` | 255 ns | 251 ns |
| `bm<int32_t, Op::Min_val>/15` | 4.56 ns | 3.57 ns |
| `bm<int32_t, Op::Max_val>/8021` | 251 ns | 244 ns |
| `bm<int32_t, Op::Max_val>/15` | 4.57 ns | 3.59 ns |
| `bm<int32_t, Op::Both_val>/8021` | 362 ns | 336 ns |
| `bm<int32_t, Op::Both_val>/15` | 9.89 ns | 7.69 ns |
| `bm<int64_t, Op::Min>/8021` | 3473 ns | 3502 ns |
| `bm<int64_t, Op::Min>/7` | 13.3 ns | 14.9 ns |
| `bm<int64_t, Op::Max>/8021` | 3542 ns | 3464 ns |
| `bm<int64_t, Op::Max>/7` | 13.1 ns | 14.9 ns |
| `bm<int64_t, Op::Both>/8021` | 4084 ns | 4012 ns |
| `bm<int64_t, Op::Both>/7` | 18.6 ns | 17.9 ns |
| `bm<int64_t, Op::Min_val>/8021` | 2879 ns | 2851 ns |
| `bm<int64_t, Op::Min_val>/7` | 3.77 ns | 3.87 ns |
| `bm<int64_t, Op::Max_val>/8021` | 2846 ns | 2870 ns |
| `bm<int64_t, Op::Max_val>/7` | 3.62 ns | 3.88 ns |
| `bm<int64_t, Op::Both_val>/8021` | 3131 ns | 3183 ns |
| `bm<int64_t, Op::Both_val>/7` | 8.63 ns | 8.91 ns |
| `bm<float, Op::Min>/8021` | 1179 ns | 1173 ns |
| `bm<float, Op::Min>/15` | 9.28 ns | 7.06 ns |
| `bm<float, Op::Max>/8021` | 1182 ns | 1173 ns |
| `bm<float, Op::Max>/15` | 9.94 ns | 7.03 ns |
| `bm<float, Op::Both>/8021` | 1338 ns | 1345 ns |
| `bm<float, Op::Both>/15` | 15.7 ns | 16.4 ns |
| `bm<float, Op::Min_val>/8021` | 1176 ns | 1174 ns |
| `bm<float, Op::Min_val>/15` | 8.84 ns | 7.17 ns |
| `bm<float, Op::Max_val>/8021` | 1182 ns | 1170 ns |
| `bm<float, Op::Max_val>/15` | 9.83 ns | 7.19 ns |
| `bm<float, Op::Both_val>/8021` | 1341 ns | 1333 ns |
| `bm<float, Op::Both_val>/15` | 13.4 ns | 13.3 ns |
| `bm<double, Op::Min>/8021` | 2325 ns | 2354 ns |
| `bm<double, Op::Min>/7` | 8.79 ns | 7.39 ns |
| `bm<double, Op::Max>/8021` | 2330 ns | 2390 ns |
| `bm<double, Op::Max>/7` | 9.99 ns | 7.85 ns |
| `bm<double, Op::Both>/8021` | 2695 ns | 2724 ns |
| `bm<double, Op::Both>/7` | 15.9 ns | 16.4 ns |
| `bm<double, Op::Min_val>/8021` | 2321 ns | 2323 ns |
| `bm<double, Op::Min_val>/7` | 7.66 ns | 7.33 ns |
| `bm<double, Op::Max_val>/8021` | 2347 ns | 2367 ns |
| `bm<double, Op::Max_val>/7` | 9.79 ns | 7.43 ns |
| `bm<double, Op::Both_val>/8021` | 2688 ns | 2725 ns |
| `bm<double, Op::Both_val>/7` | 13.0 ns | 13.9 ns |

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner August 26, 2024 18:04
@CaseyCarter CaseyCarter added the performance Must go faster label Aug 26, 2024
@StephanTLavavej StephanTLavavej self-assigned this Aug 26, 2024
@StephanTLavavej StephanTLavavej changed the title Use AVX/AVX2 masks in minmax_element and minmax vectoization Use AVX/AVX2 masks in minmax_element and minmax vectorization Aug 27, 2024
@StephanTLavavej
Member

I've pushed a conflict-free merge with main to pick up the toolset update. No clang-format regen was necessary.

Labels
performance Must go faster
Projects
Status: Initial Review