-
Notifications
You must be signed in to change notification settings - Fork 481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1264: [C++] Add a writer option to align compression block with row group #2005
base: main
Are you sure you want to change the base?
Conversation
…ow group boundary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM.
* Get the current stripe position entries for the specified column. | ||
* @return the position entries for the specified column. | ||
*/ | ||
virtual std::vector<std::vector<int>> getCurrentStripePositionEntries(uint64_t columnId) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I was thinking if we should design a better data structure, something like:
struct RowGroupPositions {
uint64_t columnId;
std::vector<int32_t> positions;
}
Then it can be reused if we want to add std::vector<RowGroupPositions> getPositionEntries(int stripe, std::vector<int> columnIds)
in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll mark it in the next work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to do this all at once. If we don't change it in time, the public API will get released and we cannot change it any more.
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
What changes were proposed in this pull request?
Add support for the ORC writer to ensure that the compression block is aligned with the row group boundary。
Why are the changes needed?
To reduce unnecessary I/O and decompression when PPD is in effect, we can enforce the compression block to be aligned with the row group boundary. For more detail, see link
How was this patch tested?
Uts in TestWriter.cc can convert this patch.
Was this patch authored or co-authored using generative AI tooling?
NO