Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize bitset from string construction #4839

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Jul 13, 2024

📜 The approach

We need to:

  • Compare against '0' and '1' (or other characters) and report if there's neither of them
  • Reverse the order of characters
  • Convert byte mask or wchar_t to bit mask
  • Handle bitset and string odd length tails
  • Handle too short and too long string properly

It is implemented as follows:

  • We compare against '1' vector to get byte or wchar_t mask. We xor the input with '0' vector, and check if anything not masked in the resulting byte mask is zero. This way we both validate, and have the mask.
  • If we have longer string than bitset, we don't do further steps than byte or wchar_t mask, as we only need to validate,
  • To both reverse the order, and convert wchar_t mask to byte mask we use (v)pshufb (..._shuffle_epi8). The AVX2 version is not cross-lane, and additional cross-lane instruction is somewhat expensive, so we fix up that with bits, rather than now.
  • (v)pmovmskb is the instruction that does byte mask to bit mask conversion, it is emitted by ..._movemask_epi8 intrinsic.
  • Now we do rotations to fix up the missing AVX2 cross-lane reordering
  • Then put that to bitset. We assume padding, even if bitset length is smaller.
  • Repeat until string ends
  • Write zeros to remaining portion of bitset, if there's something left

⚠️ Padding assumption

To avoid unnecessary complication, and have a bit better performance, an assumption was made that bitset has at least a multiple of 2 padding on SSE4.2 code path, and at lest a multiple of 4 padding on AVX2 code path.

For large bitset we have units of uint64_t, and it is not reasonable to have unit less than native integer size, and AVX2 code path has the threshold of 256, since it is not efficient on smaller sizes.

For small bitset we have units of uint32_t, but there's a vNext comment to consider smaller units. But we have a threshold on the whole vectorization as 16 bits, so 8 or less bits bitset that could potentially be less that 2 bytes, would not vectorize.

⚖️ Threshold selection

According to my benchmarks, the vectorized approach benefits for 16 bits, but doesn't for 8 bits. I tried 12, and it seems in favor of non-vectorization. Let it be round up to 16 then.

📏 String length

There are overloads with implicit and explicit length.

For implicit length the non-vectorized implementation calls strlen/wcslen (via character traits).

Hypothetically, vectorized implementation could determine length during conversion. We don't do that here, because:

Unfortunately, this causes implicit length overload to benefit much less from vectorization.

⏱️ Benchmark results

Benchmark main this
explicit length, 1-byte chars
BM_bitset_from_string<length_type::char_count, 15, char> 1813 ns 1778 ns
BM_bitset_from_string<length_type::char_count, 16, char> 1794 ns 921 ns
BM_bitset_from_string<length_type::char_count, 36, char> 2230 ns 651 ns
BM_bitset_from_string<length_type::char_count, 64, char> 2075 ns 274 ns
BM_bitset_from_string<length_type::char_count, 512, char> 2026 ns 78.1 ns
BM_bitset_from_string<length_type::char_count, 2048, char> 2072 ns 65.7 ns
explicit length, 2-byte chars
BM_bitset_from_string<length_type::char_count, 15, wchar_t> 1777 ns 1774 ns
BM_bitset_from_string<length_type::char_count, 16, wchar_t> 1786 ns 979 ns
BM_bitset_from_string<length_type::char_count, 36, wchar_t> 2230 ns 741 ns
BM_bitset_from_string<length_type::char_count, 64, wchar_t> 1988 ns 349 ns
BM_bitset_from_string<length_type::char_count, 512, wchar_t> 1970 ns 189 ns
BM_bitset_from_string<length_type::char_count, 2048, wchar_t> 1912 ns 143 ns
implicit length, 1-byte chars
BM_bitset_from_string<length_type::null_term, 15, char> 2825 ns 2777 ns
BM_bitset_from_string<length_type::null_term, 16, char> 2118 ns 1439 ns
BM_bitset_from_string<length_type::null_term, 36, char> 2571 ns 1173 ns
BM_bitset_from_string<length_type::null_term, 64, char> 2710 ns 1018 ns
BM_bitset_from_string<length_type::null_term, 512, char> 2482 ns 606 ns
BM_bitset_from_string<length_type::null_term, 2048, char> 2407 ns 566 ns
implicit length, 2-byte chars
BM_bitset_from_string<length_type::null_term, 15, wchar_t> 2098 ns 2102 ns
BM_bitset_from_string<length_type::null_term, 16, wchar_t> 2515 ns 1613 ns
BM_bitset_from_string<length_type::null_term, 36, wchar_t> 2128 ns 1246 ns
BM_bitset_from_string<length_type::null_term, 64, wchar_t> 2739 ns 1055 ns
BM_bitset_from_string<length_type::null_term, 512, wchar_t> 2491 ns 719 ns
BM_bitset_from_string<length_type::null_term, 2048, wchar_t> 2407 ns 639 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner July 13, 2024 13:59
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jul 17, 2024
@StephanTLavavej StephanTLavavej self-assigned this Jul 17, 2024
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

#if _USE_STD_VECTOR_ALGORITHMS
constexpr size_t _Bitset_from_string_vector_threshold = 16;
if constexpr (_Bits >= _Bitset_from_string_vector_threshold
&& _Is_specialization_v<_Traits, char_traits> && sizeof(_Elem) <= 2) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ 🦄🦄🦄 ⚠️

Maybe it would be best to wait for #4951 and replace _Is_specialization_v<_Traits, char_traits> with _Is_implementation_handled_char_traits<_Traits>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

2 participants