Speed up `pad_nulls` for `FixedLenByteArrayBuffer` #6297

etseidl · 2024-08-23T21:42:33Z

Which issue does this PR close?

Closes #6296.

Rationale for this change

See issue.

What changes are included in this PR?

Changes pad_nulls for FixedLenByteArrayBuffer as described in the issue. Benchmarking seems to indicate that using the new approach for byte_length > 4 is optimal. The loop is better for byte_length <= 4 because the compiler will eliminate the loop via unrolling.

This PR does include extensive changes to the arrow_reader bench to allow benchmarking these changes.

Are there any user-facing changes?

No

etseidl · 2024-08-23T21:49:01Z

Benchmark results for these changes (pad loop is as in master branch):

group                                                                                                      pad_loop                               pad_nulls_vec2
-----                                                                                                      --------                               --------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.45    558.3±4.21µs        ? ?/sec    1.00    385.9±3.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.67    422.5±4.04µs        ? ?/sec    1.00    253.3±4.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.03    119.9±0.74µs        ? ?/sec    1.00    116.5±1.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.04    105.7±1.24µs        ? ?/sec    1.00    101.2±1.09µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.63    440.0±1.65µs        ? ?/sec    1.00    269.5±1.76µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.17    297.8±3.19µs        ? ?/sec    1.00    137.1±3.25µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    108.4±4.14µs        ? ?/sec    1.00    101.2±1.03µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.00     86.8±1.53µs        ? ?/sec    1.00     86.8±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.01    148.4±1.92µs        ? ?/sec    1.00    146.9±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.02    117.0±0.40µs        ? ?/sec    1.00    115.2±1.27µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.33    241.4±2.93µs        ? ?/sec    1.00    181.5±2.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.49    174.5±3.44µs        ? ?/sec    1.00    116.9±1.26µs        ? ?/sec

etseidl · 2024-08-23T22:28:58Z

I must have hallucinated last night...turns out copy_from_slice actually is a bit faster for the long arrays. Master vs 1a1e967

Linux with newer xeon

group                                                                                                      pad_loop                               pad_nulls_slice
-----                                                                                                      --------                               ---------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.52    558.3±4.21µs        ? ?/sec    1.00    367.8±3.17µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.85    422.5±4.04µs        ? ?/sec    1.00    228.8±3.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.07    119.9±0.74µs        ? ?/sec    1.00    112.3±1.55µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.16    105.7±1.24µs        ? ?/sec    1.00     91.3±0.85µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.69    440.0±1.65µs        ? ?/sec    1.00    259.7±2.18µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.36    297.8±3.19µs        ? ?/sec    1.00    126.3±1.59µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.20    108.4±4.14µs        ? ?/sec    1.00     90.6±0.81µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.16     86.8±1.53µs        ? ?/sec    1.00     74.9±0.63µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.06    148.4±1.92µs        ? ?/sec    1.00    140.4±2.67µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.08    117.0±0.40µs        ? ?/sec    1.00    108.0±1.19µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.39    241.4±2.93µs        ? ?/sec    1.00    173.1±2.77µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.60    174.5±3.44µs        ? ?/sec    1.00    109.0±1.27µs        ? ?/sec

Hmm, this is pretty system specific.

Old mac laptop

group                                                                                                      pad_loop                               pad_slice                               pad_vec
-----                                                                                                      --------                               ---------                               -------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.15  1373.8±32.79µs        ? ?/sec    1.02  1218.4±282.47µs        ? ?/sec    1.00  1199.0±38.80µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.32  1035.7±27.45µs        ? ?/sec    1.00   785.5±16.07µs        ? ?/sec     1.02   804.6±25.53µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.00    308.7±6.87µs        ? ?/sec    1.00   309.2±14.52µs        ? ?/sec     1.04    320.1±9.10µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.07    266.8±8.11µs        ? ?/sec    1.00    249.0±6.98µs        ? ?/sec     1.12    278.7±8.53µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.32  1041.8±28.99µs        ? ?/sec    1.00   790.8±19.98µs        ? ?/sec     1.01   795.1±16.92µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               1.55   697.3±13.37µs        ? ?/sec    1.00   449.4±11.70µs        ? ?/sec     1.01    456.0±9.74µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    279.4±5.72µs        ? ?/sec    1.00   260.4±16.31µs        ? ?/sec     1.11    289.6±5.28µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.08    236.6±4.57µs        ? ?/sec    1.00    218.3±4.90µs        ? ?/sec     1.14    249.2±9.71µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.00    371.7±7.78µs        ? ?/sec    1.04    386.3±7.68µs        ? ?/sec     1.11   410.9±10.57µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.00    287.8±6.90µs        ? ?/sec    1.05    303.1±6.91µs        ? ?/sec     1.13    323.9±6.61µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.15   594.5±16.03µs        ? ?/sec    1.04   535.7±11.47µs        ? ?/sec     1.00   517.0±13.36µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.22   430.9±11.69µs        ? ?/sec    1.12   396.6±21.44µs        ? ?/sec     1.00   354.6±10.20µs        ? ?/sec

alamb · 2024-08-28T11:19:33Z

I must have hallucinated last night...turns out copy_from_slice actually is a bit faster for the long arrays.

Sorry I didn't follow all the nance of the results here. It seems like you have concluded that which approach is faster is a result of what the target architecture is?

If this is the case, I think my preference would be go with the simplest code (e.g. copy_from_slice?) otherwise we run the risk of over fitting / over optimizing for some specific conditions)

etseidl · 2024-08-28T14:14:42Z

I must have hallucinated last night...turns out copy_from_slice actually is a bit faster for the long arrays.

Sorry I didn't follow all the nance of the results here. It seems like you have concluded that which approach is faster is a result of what the target architecture is?

Sorry, I tend to ramble...I concluded that copy_from_slice is the way to go since it's better on modern CPUs. I only added the mac results to show I hadn't imagined the other approach being faster when I did my initial testing. 😅

If this is the case, I think my preference would be go with the simplest code (e.g. copy_from_slice?) otherwise we run the risk of over fitting / over optimizing for some specific conditions)

agreed!

alamb · 2024-09-18T20:13:40Z

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

wiedld · 2024-09-27T05:27:43Z

I'm starting to review this one. Just give me a little bit to dig into the decoding.

etseidl and others added 9 commits August 22, 2024 17:34

optimize pad_nulls for fixed_len_byte_array

f77b125

start refactor of benchmarks

da0bbd4

refactor float16

fe97120

add fixed_len_byte_array to float16 benches

1ca0222

replace copy_within with vectorizable copy

b715f80

update comment

1c5221e

move branch on byte_length outside of loop

90c21a8

Merge remote-tracking branch 'origin/master' into bss_pad_nulls

cea8ca3

reduce code duplication while preserving performance gains

4e2dc4f

github-actions bot added the parquet Changes to the parquet crate label Aug 23, 2024

etseidl added 3 commits August 23, 2024 14:51

formatting

63116aa

silence clippy

a13b96d

clippy won again

1a1e967

Merge remote-tracking branch 'origin/master' into bss_pad_nulls

764d4aa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `pad_nulls` for `FixedLenByteArrayBuffer` #6297

Speed up `pad_nulls` for `FixedLenByteArrayBuffer` #6297

etseidl commented Aug 23, 2024

etseidl commented Aug 23, 2024 •

edited

Loading

etseidl commented Aug 23, 2024 •

edited

Loading

alamb commented Aug 28, 2024

etseidl commented Aug 28, 2024

alamb commented Sep 18, 2024

wiedld commented Sep 27, 2024

Speed up pad_nulls for FixedLenByteArrayBuffer #6297

Are you sure you want to change the base?

Speed up pad_nulls for FixedLenByteArrayBuffer #6297

Conversation

etseidl commented Aug 23, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

etseidl commented Aug 23, 2024 • edited Loading

etseidl commented Aug 23, 2024 • edited Loading

alamb commented Aug 28, 2024

etseidl commented Aug 28, 2024

alamb commented Sep 18, 2024

wiedld commented Sep 27, 2024

Speed up `pad_nulls` for `FixedLenByteArrayBuffer` #6297

Speed up `pad_nulls` for `FixedLenByteArrayBuffer` #6297

etseidl commented Aug 23, 2024 •

edited

Loading

etseidl commented Aug 23, 2024 •

edited

Loading