Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add union_extract kernel #6387

Merged
merged 4 commits into from
Sep 25, 2024
Merged

Add union_extract kernel #6387

merged 4 commits into from
Sep 25, 2024

Conversation

gstvg
Copy link
Contributor

@gstvg gstvg commented Sep 11, 2024

Which issue does this PR close?

Closes #6386

Rationale for this change

What changes are included in this PR?

union_extract kernel implementation, docs and tests

Are there any user-facing changes?

A new public kernel union_extract is added

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 11, 2024
@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

1 similar comment
@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@wiedld
Copy link
Contributor

wiedld commented Sep 19, 2024

I've started reviewing this one, but it's going to take me a bit (learning) to make it thru. Thank you for the documentation and testing.

Copy link
Contributor

@wiedld wiedld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the well documented code.

.iter()
.find(|field| field.1.name() == target)
.ok_or_else(|| {
ArrowError::InvalidArgumentError(format!("field {target} not found on union"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for discussion: this behavior matches the duckdb implementation of union_extract, whereas the postgres json/jsonb operators ->> and #> would return NULL.

let set_bits = set_bits - set_bits % 64;

let mut buffer =
MutableBuffer::new(bit_util::ceil(type_ids.len(), 64) * 8).with_bitset(set_bits / 8, val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this bit-math needed, since the MutableBuffer::new uses bit_util::round_upto_multiple_of_64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure relative to which part, will explain both:

Without set_bits = set_bits - set_bits % 64, if the type_ids len is already a multiple of 64 and aligned, and set_bits is not multiple of 64, buffer.extend below would push an additional partial packed mask, requiring a buffer grow. Even with EQ_SCALAR_CHUNK_SIZE = 512, set_bits can be a non multiple of 64 if the first bit flip happens in the last chunk, which is potentially smaller than 512

Relative to bit_util::ceil(type_ids.len(), 64) * 8, I didn't realized that in MutableBuffer::new/with_capacity, so yes, using 64 as divisor and then * 8 is redundant, removed it. Thanks!

I also noticed that buffer.truncate is unnecessary so I removed it too. BooleanBuffer::new uses the correct len from type_ids

Comment on lines 293 to 296
enum BoolValue {
Scalar(bool),
Buffer(BooleanBuffer),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document BoolValue? Something along of lines of Scalar == all same existence (e.g. all is, or all is not, the target type_id), and Buffer is target type_id mask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 360 to 365
// checks a common form of non sequential offsets, when sequential nulls reuses the same value,
// pointed by the same offset, while valid values offsets increases one by one
// this also checks if the last chunk/remainder is sequential relative to the first offset
if offsets[0] + offsets.len() as i32 - 1 != offsets[offsets.len() - 1] {
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite get the first explanation.

But the second reason (that it checks the remainder when we have a single remainder element past the last chunk boundary) makes sense -- and is covered in the tests. 👍🏼

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment to more clear and added an example

This is used on C++ and Go union builders
Our UnionBuilder doesn't have an equivalent append_nulls method

}

#[test]
fn test_is_sequential() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add assert!(!is_sequential(&[1, 2, 3, 5, 6, 7])); to demonstrate that it checks the incr at the chunk boundary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Added with a few more variations too.

}

#[cfg(test)]
mod tests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also confirmed branch test coverage with cov. 👍🏼

@gstvg
Copy link
Contributor Author

gstvg commented Sep 22, 2024

Thank you for the well documented code.

I pushed like dozens of faster ways of doing the same thing, felt obligated to document it 😅

Thanks @wiedld for your review

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gstvg and @wiedld for the review. This looks 💯 to me ❤️

@alamb alamb changed the title Add union_extract kernel Add union_extract kernel Sep 23, 2024
@alamb
Copy link
Contributor

alamb commented Sep 23, 2024

I'll plan to merge this tomorrow unless there are any other comments or anyone else wants to comment

@alamb
Copy link
Contributor

alamb commented Sep 23, 2024

Looks like the CI failure is not related -- https://github.com/apache/arrow-rs/actions/runs/10984024899/job/30494124939?pr=6387

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

I restarted the failed CI job https://github.com/apache/arrow-rs/actions/runs/10984024899/job/30494124939?pr=6387

I think it will pass now that #6437 is merged

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

I took the liberty of merging up from master to try and get a clean CI run for this PR

@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

This CI run looks entirely unrelated - I will close/reopen this PR to retrigger and see if it is still happening

@alamb alamb closed this Sep 24, 2024
@alamb alamb reopened this Sep 24, 2024
@alamb
Copy link
Contributor

alamb commented Sep 24, 2024

integration CI is failing due to #6448

@alamb alamb merged commit 922a1ff into apache:master Sep 25, 2024
49 of 51 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

#6448 is fixed, so merging it in

@alamb
Copy link
Contributor

alamb commented Sep 25, 2024

Thanks again @gstvg and @wiedld

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add union_extract kernel
3 participants