Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up pad_nulls for FixedLenByteArrayBuffer #6297

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Aug 23, 2024

Which issue does this PR close?

Closes #6296.

Rationale for this change

See issue.

What changes are included in this PR?

Changes pad_nulls for FixedLenByteArrayBuffer as described in the issue. Benchmarking seems to indicate that using the new approach for byte_length > 4 is optimal. The loop is better for byte_length <= 4 because the compiler will eliminate the loop via unrolling.

This PR does include extensive changes to the arrow_reader bench to allow benchmarking these changes.

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 23, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Aug 23, 2024

Benchmark results for these changes (pad loop is as in master branch):

group                                                                                                      pad_loop                               pad_nulls_vec2
-----                                                                                                      --------                               --------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.45    558.3±4.21µs        ? ?/sec    1.00    385.9±3.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.67    422.5±4.04µs        ? ?/sec    1.00    253.3±4.98µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.03    119.9±0.74µs        ? ?/sec    1.00    116.5±1.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.04    105.7±1.24µs        ? ?/sec    1.00    101.2±1.09µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.63    440.0±1.65µs        ? ?/sec    1.00    269.5±1.76µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.17    297.8±3.19µs        ? ?/sec    1.00    137.1±3.25µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    108.4±4.14µs        ? ?/sec    1.00    101.2±1.03µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.00     86.8±1.53µs        ? ?/sec    1.00     86.8±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.01    148.4±1.92µs        ? ?/sec    1.00    146.9±1.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.02    117.0±0.40µs        ? ?/sec    1.00    115.2±1.27µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.33    241.4±2.93µs        ? ?/sec    1.00    181.5±2.62µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.49    174.5±3.44µs        ? ?/sec    1.00    116.9±1.26µs        ? ?/sec

@etseidl
Copy link
Contributor Author

etseidl commented Aug 23, 2024

I must have hallucinated last night...turns out copy_from_slice actually is a bit faster for the long arrays. Master vs 1a1e967

Linux with newer xeon
group                                                                                                      pad_loop                               pad_nulls_slice
-----                                                                                                      --------                               ---------------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.52    558.3±4.21µs        ? ?/sec    1.00    367.8±3.17µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.85    422.5±4.04µs        ? ?/sec    1.00    228.8±3.73µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.07    119.9±0.74µs        ? ?/sec    1.00    112.3±1.55µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.16    105.7±1.24µs        ? ?/sec    1.00     91.3±0.85µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.69    440.0±1.65µs        ? ?/sec    1.00    259.7±2.18µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               2.36    297.8±3.19µs        ? ?/sec    1.00    126.3±1.59µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.20    108.4±4.14µs        ? ?/sec    1.00     90.6±0.81µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.16     86.8±1.53µs        ? ?/sec    1.00     74.9±0.63µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.06    148.4±1.92µs        ? ?/sec    1.00    140.4±2.67µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.08    117.0±0.40µs        ? ?/sec    1.00    108.0±1.19µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.39    241.4±2.93µs        ? ?/sec    1.00    173.1±2.77µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.60    174.5±3.44µs        ? ?/sec    1.00    109.0±1.27µs        ? ?/sec

Hmm, this is pretty system specific.

Old mac laptop
group                                                                                                      pad_loop                               pad_slice                               pad_vec
-----                                                                                                      --------                               ---------                               -------
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.15  1373.8±32.79µs        ? ?/sec    1.02  1218.4±282.47µs        ? ?/sec    1.00  1199.0±38.80µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.32  1035.7±27.45µs        ? ?/sec    1.00   785.5±16.07µs        ? ?/sec     1.02   804.6±25.53µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.00    308.7±6.87µs        ? ?/sec    1.00   309.2±14.52µs        ? ?/sec     1.04    320.1±9.10µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.07    266.8±8.11µs        ? ?/sec    1.00    249.0±6.98µs        ? ?/sec     1.12    278.7±8.53µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.32  1041.8±28.99µs        ? ?/sec    1.00   790.8±19.98µs        ? ?/sec     1.01   795.1±16.92µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               1.55   697.3±13.37µs        ? ?/sec    1.00   449.4±11.70µs        ? ?/sec     1.01    456.0±9.74µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.07    279.4±5.72µs        ? ?/sec    1.00   260.4±16.31µs        ? ?/sec     1.11    289.6±5.28µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.08    236.6±4.57µs        ? ?/sec    1.00    218.3±4.90µs        ? ?/sec     1.14    249.2±9.71µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.00    371.7±7.78µs        ? ?/sec    1.04    386.3±7.68µs        ? ?/sec     1.11   410.9±10.57µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.00    287.8±6.90µs        ? ?/sec    1.05    303.1±6.91µs        ? ?/sec     1.13    323.9±6.61µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.15   594.5±16.03µs        ? ?/sec    1.04   535.7±11.47µs        ? ?/sec     1.00   517.0±13.36µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.22   430.9±11.69µs        ? ?/sec    1.12   396.6±21.44µs        ? ?/sec     1.00   354.6±10.20µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Aug 28, 2024

I must have hallucinated last night...turns out copy_from_slice actually is a bit faster for the long arrays.

Sorry I didn't follow all the nance of the results here. It seems like you have concluded that which approach is faster is a result of what the target architecture is?

If this is the case, I think my preference would be go with the simplest code (e.g. copy_from_slice?) otherwise we run the risk of over fitting / over optimizing for some specific conditions)

@etseidl
Copy link
Contributor Author

etseidl commented Aug 28, 2024

I must have hallucinated last night...turns out copy_from_slice actually is a bit faster for the long arrays.

Sorry I didn't follow all the nance of the results here. It seems like you have concluded that which approach is faster is a result of what the target architecture is?

Sorry, I tend to ramble...I concluded that copy_from_slice is the way to go since it's better on modern CPUs. I only added the mac results to show I hadn't imagined the other approach being faster when I did my initial testing. 😅

If this is the case, I think my preference would be go with the simplest code (e.g. copy_from_slice?) otherwise we run the risk of over fitting / over optimizing for some specific conditions)

agreed!

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@wiedld
Copy link
Contributor

wiedld commented Sep 27, 2024

I'm starting to review this one. Just give me a little bit to dig into the decoding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading FIXED_LEN_BYTE_ARRAY columns with nulls is inefficient
3 participants