Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<xutility>: Use addition and multiplication overflow check MSVC intrinsics like _add_overflow_i8, _mul_overflow_i16, and _mul_full_overflow_i8 #4780

Open
AlexGuteniev opened this issue Jul 2, 2024 · 4 comments
Labels
decision needed We need to choose something before working on this performance Must go faster

Comments

@AlexGuteniev
Copy link
Contributor

Inspired by #4776

There are Clang intrinsics used in _Mul_overflow and _Add_overflow

MSVC has its own _mul_full_overflow_u8 and others.
Using them would certainly result in a better codegen than the generic implementation.

Caveats:

  • Only for x86 / x64, some are only for x64
  • Not a complete set of all operations possible
  • Not constexpr, unlike the Clang one
  • Need to include <intrin.h>

These might warrant the need of another internal header for _Mul_overflow and _Add_overflow, used by <ranges>, <mdspan>, and <numeric> after #4776.

@AlexGuteniev
Copy link
Contributor Author

Another caveat is maybe DevCom-10668881 -- need a good coverage

@StephanTLavavej StephanTLavavej changed the title <xutility>: Use addition and multiplication oveflow check MSVC intrinsics like _add_overflow_i8, _mul_overflow_i16, and _mul_full_overflow_i8 <xutility>: Use addition and multiplication overflow check MSVC intrinsics like _add_overflow_i8, _mul_overflow_i16, and _mul_full_overflow_i8 Jul 3, 2024
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jul 3, 2024
@StephanTLavavej
Copy link
Member

Yeah, these were implemented on 2023-04-17 by internal MSVC-PR-456118 "Add overflow detection functions for addition, subtraction and multiplication on x86 and x64".

We'd eventually want to move them to intrin0.inl.h for the STL's use, if we want to go in this direction. We'd need to work around the known x86 codegen bug.

I'd like to understand the potential performance impact of this change before deciding whether to invest the contributor/maintainer effort in making/reviewing such a change, but if it's significant then this could be a reasonable thing to do.

@AlexGuteniev
Copy link
Contributor Author

I'd like to understand the potential performance impact of this change before deciding whether to invest the contributor/maintainer effort in making/reviewing such a change, but if it's significant then this could be a reasonable thing to do.

The changes of _Add_overflow and _Mul_overflow themselves performance are expected to be significant.

For _Add_overflow the addition itself is one of the cheapest operation. The current implementation adds combination of six conditions, that would definitely make it order of magnitudes slower. I expect from the intrinsic to only expose overflow flag, but stay the same instruction.

For _Mul_overflow it is even more clear. The current implementation uses division, which order of magnitude slower than multiplication. Again, I expect only multiplication and flag or result checking from the instruction.

The usages of these are, as I mentioned <ranges>, <mdspan>, and the proposed usage in lcm.

All potentially-runtime <mdspan> usages guarded by #if _CONTAINER_DEBUG_LEVEL > 0, if I'm not missing anything. For <ranges>, some usages are guarded fully, but there is at least one usage where #if _ITERATOR_DEBUG_LEVEL != 0 only guards the overflow check, but not the operation. I don't know if that optimizes away or doesn't.

The proposed runtime lcm usage is guarded by _DEBUG.

So it seems that currently improving _Add_overflow and _Mul_overflow would benefit only debugging modes, except for the mentioned <ranges> occurrence..

We could simplify the implementation in lcm and possibly elsewhere by making codepath the same for debug and non-debug, and guarding only the check (like in that <ranges> occurence), but taking into account the complexity of the intrinsics usage, the total simplification is likely to be negative.

@StephanTLavavej StephanTLavavej added the decision needed We need to choose something before working on this label Aug 2, 2024
@AlexGuteniev
Copy link
Contributor Author

We'd need to work around the known x86 codegen bug.

After thinking about possible workaround I conclude that the best would be not to use the broken variant/platform combinations. Any workaround that doesn't too much interfere the optimization might be itself optimized away.

It is is only x86 and not x64, this isn't too much impact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
decision needed We need to choose something before working on this performance Must go faster
Projects
None yet
Development

No branches or pull requests

2 participants