Help the compiler vectorize `adjacent_difference` #4958

AlexGuteniev · 2024-09-14T10:33:39Z

📜 The approach

The following things prevented the original algorithm from vectorization:

Loop-carried dependency, the previous input is used as one of operands.
- This seems expected that the compiler doesn't transform such code to eliminate this automagically, too much of a transformation.
- This was addressed by transforming the code to read the input array twice per iteration instead of carrying the values through the loop.
Odd iterator pattern where the compiler cannot understand the iteration.
- This seemed to me a strange limitation, so it was reported as DevCom-10742868.
- This was addressed by using integer index.

✅ Checks for eligibility

The following checks were added:

No Aliasing (see below)
Iterators can be pointers
Source iterator is not volatile (read order is altered)
Trivially copyable (we skip copying where the standard asks for it)

There's no need in check for integral types or so, since the compiler makes the final decision anyway, and it may be able optimize even something that wouldn't pass a strict check.

⚠️ No Aliasing

Apparently there's no rule that the source and the destination ranges may not overlap.
We should handle aliasing.

Unlike the #4431 precedent, we can't yield to the compiler here. The compiler is able to insert overlaps check that prevents vectorization and go to the scalar fallback in case of checks failure, but:

We apply transformation that would change the meaning of the program in case of overlapping range, and the meaning would be changed no matter if vectorization happens
The checks that compiler inserts may be too loose, it may allow like equal source and destination pointer, as these are thc checks if the transformed algorithm would not change the meaning

So we do our own checks.

Then we tell the compiler with __restrict that we already checked, and it should not bother. This is done in a separate function, because the __restrict is not aliased within scope, so saying __restrict within the original algorithm would apparently be a lie.

The extra check by the compiler, if not prevented would slightly add run time and dead code size.

😾 Compiler warnings

We have a great feature called integral promotion, or whatever it is called. Smaller types are converted to integers, and there is a warning about converting them back. Local pragma does suppresses them in benchmark, but not in the test.

I don't know what to do with this.

⏱️ Benchmark results

Benchmark	main	this	this + AVX2
bm<uint8_t>/2255	745 ns	563 ns	562 ns
bm<uint16_t>/2255	799 ns	83.3 ns	75.1 ns
bm<uint32_t>/2255	731 ns	154 ns	141 ns
bm<uint64_t>/2255	805 ns	293 ns	272 ns
bm/2255	751 ns	154 ns	123 ns
bm/2255	753 ns	304 ns	233 ns

🥇 Results interpretation

Overall, we're good 😸
8-bit case failed to vectorize for no reason (didn't look up if it is known compiler issue, or to be reported)
Still 8-bit case is noticeably better. I didn't analyze that, but looks like this a consistent thing, not codegen gremlins. I think it is a side effect of eliminating loop-carried dependency, so the processor can parallelize and overlap iterations
AVX2 is only slightly faster. I did not analyze, but think that memory wall is being hit here 🧱

stl/inc/numeric

frederick-vs-ja · 2024-09-14T13:12:39Z

stl/inc/numeric

@@ -469,6 +487,30 @@ _CONSTEXPR20 _OutIt adjacent_difference(const _InIt _First, const _InIt _Last, _
    const auto _ULast = _STD _Get_unwrapped(_Last);
    auto _UDest       = _STD _Get_unwrapped_n(_Dest, _STD _Idl_distance<_InIt>(_UFirst, _ULast));
    if (_UFirst != _ULast) {
+        if constexpr (_Iterators_are_contiguous<_InIt, _OutIt> && !_Iterator_is_volatile<_InIt>
+                      && is_trivially_copyable_v<_Iter_value_t<_InIt>>) {


Is !_Iterator_is_volatile<_InIt> or is_trivially_copyable_v<_Iter_value_t<_InIt>> meaningful here, given we're not performing any trivial bitwise copy in this branch?

I'm not sure.

The Effects as in [adjacent.difference]/4 specifically requiring making a copy at certain time, which we do later. So the difference may be observable. But it is not observable, if the memory is not volatile and the copying is trivial.

AlexGuteniev · 2024-09-15T08:23:35Z

8-bit case failed to vectorize for no reason (didn't look up if it is known compiler issue, or to be reported)

Reported DevCom-10745948

CaseyCarter · 2024-09-15T15:20:40Z

8-bit case failed to vectorize for no reason (didn't look up if it is known compiler issue, or to be reported)

Interestingly it vectorizes if we use - directly instead of indirecting through std::minus, or if the output is a pointer to int. Something to do with narrowing the result of the promoted operation, maybe?

AlexGuteniev added 3 commits September 14, 2024 09:50

benchmark

85a67ec

test coverage

b0bab69

the optimization

eb0cf5a

AlexGuteniev requested a review from a team as a code owner September 14, 2024 10:33

AlexGuteniev changed the title ~~Help the compiler vectorize adjacent_differentce~~ Help the compiler vectorize adjacent_difference Sep 14, 2024

You shall pass!

eed039f

frederick-vs-ja reviewed Sep 14, 2024

View reviewed changes

stl/inc/numeric Outdated Show resolved Hide resolved

AlexGuteniev added 3 commits September 14, 2024 14:27

constexpr

bc92d54

ADL

fd51080

rvalue

3db07e7

frederick-vs-ja reviewed Sep 14, 2024

View reviewed changes

AlexGuteniev added 2 commits September 14, 2024 16:22

Review comments

53a60d4

types

51595a3

StephanTLavavej added the performance Must go faster label Sep 15, 2024

StephanTLavavej self-assigned this Sep 15, 2024

Pointer math

99e6058

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help the compiler vectorize `adjacent_difference` #4958

Help the compiler vectorize `adjacent_difference` #4958

AlexGuteniev commented Sep 14, 2024

frederick-vs-ja Sep 14, 2024 •

edited

Loading

AlexGuteniev Sep 14, 2024

AlexGuteniev commented Sep 15, 2024

CaseyCarter commented Sep 15, 2024 •

edited

Loading

Help the compiler vectorize adjacent_difference #4958

Are you sure you want to change the base?

Help the compiler vectorize adjacent_difference #4958

Conversation

AlexGuteniev commented Sep 14, 2024

📜 The approach

✅ Checks for eligibility

⚠️ No Aliasing

😾 Compiler warnings

⏱️ Benchmark results

🥇 Results interpretation

frederick-vs-ja Sep 14, 2024 • edited Loading

Choose a reason for hiding this comment

AlexGuteniev Sep 14, 2024

Choose a reason for hiding this comment

AlexGuteniev commented Sep 15, 2024

CaseyCarter commented Sep 15, 2024 • edited Loading

Help the compiler vectorize `adjacent_difference` #4958

Help the compiler vectorize `adjacent_difference` #4958

frederick-vs-ja Sep 14, 2024 •

edited

Loading

CaseyCarter commented Sep 15, 2024 •

edited

Loading