Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Full UnionArray validation #1444

Closed
wants to merge 11 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Mar 14, 2022

Draft (need to write tests and file a ticket)

Which issue does this PR close?

Closes #1486

Rationale for this change

While reviewing #1412, I remembered to check the validation code for UnionArray and it turns out it was not implemented yet.

Since I had the union layout already loaded in my head, I figured I would pound out this PR:

What changes are included in this PR?

Validate all Union array data in validate_full

Are there any user-facing changes?

More data validation. The APIs are all the same

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 14, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 14, 2022

Codecov Report

Merging #1444 (d56cf46) into master (2932da3) will increase coverage by 0.02%.
The diff coverage is 89.38%.

@@            Coverage Diff             @@
##           master    #1444      +/-   ##
==========================================
+ Coverage   82.71%   82.73%   +0.02%     
==========================================
  Files         187      187              
  Lines       54255    54365     +110     
==========================================
+ Hits        44877    44979     +102     
- Misses       9378     9386       +8     
Impacted Files Coverage Δ
arrow/src/array/data.rs 83.50% <89.38%> (+0.70%) ⬆️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️
arrow/src/datatypes/datatype.rs 66.80% <0.00%> (+0.39%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2932da3...d56cf46. Read the comment docs.

arrow/src/array/data.rs Outdated Show resolved Hide resolved
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me.

arrow/src/array/data.rs Outdated Show resolved Hide resolved
self.validate_offsets_full::<i8>(self.child_data.len())?;
}
UnionMode::Dense => {
self.validate_dense_union_full()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest that this should validate that the null bitmasks are disjoint, but this may not even be a requirement - the specification says "All “unselected” values are ignored and could be any semantically correct array value."

@@ -1117,6 +1119,44 @@ impl ArrayData {
)
}

/// Ensures that for each union element, the offset is correct for
/// the corresponding child array
fn validate_dense_union_full(&self) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think should also check that offsets are monotonic for a given array type, but that could definitely be left as a todo

// https://github.com/apache/arrow-rs/issues/85
//
// TODO file follow on ticket for full union validation
DataType::Union(_fields, mode) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, we should probably also add buffer length checks into ArrayData::validate as I don't think these are currently present anywhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean "buffer length checks"? Which buffers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this got lost in my various rewording of this, I mean child arrays 🤦‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length of the ArrayData::buffers are checked as part of the 'layout' in ArrayData::validate

https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1210

The child arrays ArrayData::child_data are validated recursively in ArrayData::validate_child_data
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L788

/// Ensures that for each union element, the offset is correct for
/// the corresponding child array
fn validate_dense_union_full(&self) -> Result<()> {
// safety justification is that the size of the buffers was validated in self.validate()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could potentially make validate also check that all child arrays have the length of the parent in the case of a dense representation

Copy link
Contributor Author

@alamb alamb Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think actually that all child arrays need to have the same length for a sparse representation (which is checked in cheap validate)

Which is covered by test_validate_sparse_union_different_child_len

@alamb alamb marked this pull request as ready for review March 24, 2022 20:53
child_offset, i))
})?;

if let Some(last_offset) = last_offset.as_ref() {
Copy link
Contributor

@tustvold tustvold Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect, the requirement is that for a given type the offsets must be increasing. But it is perfectly correct, in fact desirable, for the offsets array itself to not be increasing.

e.g. the following would be valid

types: [0, 1, 1, 0],
offsets: [0, 0, 1, 0],

Otherwise the dense representation is no different from the sparse representation

#[should_panic(
expected = "Offset invariant failure: position 2 invalid. 0 should be >= 1"
)]
fn test_validate_dense_union_non_increasing_offset() {
Copy link
Contributor

@tustvold tustvold Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test should not panic, it is valid as no single child-array has non-increasing offsets.

This is not valid

type_ids = [0, 0, 0],
offsets = [0, 1, 0],

But this is

type_ids = [0, 0, 1],
offsets = [0, 1, 0],

As is the example from https://arrow.apache.org/docs/format/Columnar.html#dense-union

type_ids = [0, 0, 0, 1]
offsets = [0, 1, 2, 0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2769573

@alamb
Copy link
Contributor Author

alamb commented Apr 6, 2022

I am thinking I may try and add a CI check and a feature that runs all the tests with full data checking enabled to double check this isn't introducing any issues

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2022

#1544 tracks checking validation in CI, #1546 is the PR

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2022

FWIW I tested this PR locally with full validation enabled and several tests failed.


failures:
    array::array_union::tests::test_dense_mixed_with_nulls
    array::array_union::tests::test_dense_mixed_with_nulls_and_offset
    array::equal::tests::test_union_equal_dense

Putting back to draft until is sort that out

@alamb alamb marked this pull request as draft April 12, 2022 14:46
@tustvold
Copy link
Contributor

I plan to implement this in #1799

@tustvold tustvold closed this Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement full validation for UnionArrays construction from ArrayData
4 participants