Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent Slicing of ArrayData for StructArray #1750

Closed
tustvold opened this issue May 26, 2022 · 18 comments
Closed

Inconsistent Slicing of ArrayData for StructArray #1750

tustvold opened this issue May 26, 2022 · 18 comments
Labels
bug question Further information is requested

Comments

@tustvold
Copy link
Contributor

Problem

ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.

The result is the offset has been pushed down to the child arrays, but is still present on the parent ArrayData. This makes it unclear if an offset should or should not be applied to the child_array.

Proposal

There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:

  • Remove the ArrayData::slice special-case
  • Slice child data within StructArray when constructing boxed_fields

This is consistent with how we handle offsets elsewhere, with ArrayData a dumb container, and the Array providing the logic to interpret the offset correctly.

Thoughts @nevi-me

@tustvold tustvold added question Further information is requested bug labels May 26, 2022
@HaoYang670
Copy link
Contributor

Hi @tustvold
Here are some thoughts of mine:

  1. (Agree with you) We have to preserve offset on parent ArrayData (the struct array) because of the validity buffer.
  2. Slice child data within StructArray when constructing boxed_fields
    If what you mean is to slice child arrays in StructArray::from:

    fn from(data: ArrayData) -> Self {
        let boxed_fields = data
            .child_data()
            .iter()
            .map(|cd| make_array(cd.slice(data.offset(), data.len())))
            .collect();

        Self { data, boxed_fields }
    }

I am afraid this cannot solve the problem. Because

  • Inconsistent offsets still exist. For example, given a struct array:
StructArray (offset = 0, length = 5, field0 = i32, field1 = bool, validity_buffer = ...)
-- Int32Array (offset = 1, length = 5)
-- BooleanArray (offset = 2, length = 5)

After calling array.slice(offset = 1, length = 3), we will get a struct array like this

StructArray (offset = 1, length = 3, field0 = i32, field1 = bool, validity_buffer = ...)
-- Int32Array (offset = 2, length = 3)
-- BooleanArray (offset = 3, length = 3)

We still have 3 offsets in parent array and child arrays. And it is not clear whether we have pushed down the new offset to children.
3. My suggestion (workaround)

  • Do not push down the offset to child arrays when slicing (aka, just update the offset and null_count of parent array)
  • Do not slicing child arrays when building boxed_fields. Because this will produce inconsistent offsets
  • Slice child arrays when you call array.column(idx). This is also the way of getting a buffer of array now:
    Example: getting the offsets of binary array.
    pub fn value_offsets(&self) -> &[OffsetSize] {
        unsafe {
            std::slice::from_raw_parts(
                self.value_offsets.as_ptr().add(self.data.offset()),
                self.len() + 1,
            )
        }
    }

Slice when getting the child array. (API change, because we own the ArrayRef now)

    pub fn column(&self, pos: usize) -> ArrayRef {
        self.boxed_fields[pos].slice(self.offset, self.length)
    }
  • Long term suggestion: push down offset and length to Buffer(Bitmap).

@HaoYang670
Copy link
Contributor

cc @nevi-me @alamb @viirya

@tustvold
Copy link
Contributor Author

tustvold commented Jun 2, 2022

And it is not clear whether we have pushed the new offset to children

The suggestion is never to push offsets down within ArrayData, and so you should never have this ambiguity?

When constructing an instance of StructArray you "hydrate" the offsets to boxed fields to avoid having to recompute slices on access, but this does not modify the underlying ArrayData?

@HaoYang670
Copy link
Contributor

The suggestion is never to push offsets down within ArrayData

Although we don't push down the offset within ArrayData::slice (Based on your 1st suggestion Remove the ArrayData::slice special-case), we do push it down within StructArray::from (Based on your 2nd suggestion Slice child data within StructArray when constructing boxed_fields) because we must slice on each child_data to construct the correct child array.

As a result, I guess your suggestions (slicing child arrays within StructArray::from()) and the current implementation (slicing the child data within ArrayData::slice()) are equivalent. I am not 100% sure.

@tustvold
Copy link
Contributor Author

tustvold commented Jun 2, 2022

Aah, I see the confusion. I'm suggesting storing slices of the child data as boxed_fields on StructArray. So in StructArray::from it would inspect the ArrayData, compute slices, and store those as members. I am not suggesting StructArray::from make any modifications to ArrayData.

@HaoYang670
Copy link
Contributor

HaoYang670 commented Jun 2, 2022

So in StructArray::from it would inspect the ArrayData, compute slices, and store those as members.

Implement as this:

    fn from(data: ArrayData) -> Self {
        let boxed_fields = data
            .child_data()
            .iter()
            .map(|cd| make_array(cd.slice(data.offset(), data.len())))
            .collect();

        Self { data, boxed_fields }
    }

?

@tustvold
Copy link
Contributor Author

tustvold commented Jun 2, 2022

I think that should work, but there could be some arcane edge case I've failed to consider

@HaoYang670
Copy link
Contributor

This will not modify the parent ArrayData, which is great.
But It will cause the inconsistency between bixed_fields and self.data.child_data, which means StructArray::column(self, 0) != make_array(self.data.child_data()[0])?

@HaoYang670
Copy link
Contributor

I don't find StructArray::slice in the c++ implementation (I just find the ArrayData::slice https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/data.cc#L94-L110)
And in Arrow2, everything is pushed down to Buffer and Bitmap

@tustvold
Copy link
Contributor Author

tustvold commented Jun 2, 2022

But It will cause the inconsistency between bixed_fields and self.data.child_data, which means StructArray::column(self, 0) != make_array(self.data.child_data()[0])?

We have the same "inconsistency" on other array types, e.g. GenericStringArray::value_offsets or PrimitiveArray::values. I personally don't think this is a problem. In my mental model Arrays are the user-facing abstraction that hides all the ArrayData shenanigans.

I don't find StructArray::slice in the c++ implementation

Yeah this is just a usability thing, the Rust implementation is just calling through to make_array(self.data_ref().slice(offset, length)).

everything is pushed down to Buffer and Bitmap

I would definitely support this change, and it would be the better long-term solution. It has been discussed at length, all that really remains is someone to volunteer to do it 😅. The only outstanding sticking point has been handling offsets for BooleanArrays, but I don't imagine there would be resistance to special casing these as a Bitmap in ArrayData...

@alamb
Copy link
Contributor

alamb commented Jun 2, 2022

cc @jhorstmann and @bjchambers who I think have interest in this area

@jhorstmann
Copy link
Contributor

Sorry for being late to this discussion, the proposed change makes sense to me. If I understand it correctly, this would be a silent behavior change of the column/columns/columns_ref methods on StructArray, which previously did not take the struct offset into account. We might want to deprecate those and instead add new methods like pub fn column_slice(&self, pos: usize) -> &ArrayRef (exact name up for discussion).

@tustvold
Copy link
Contributor Author

tustvold commented Jun 7, 2022

which previously did not take the struct offset into account.

Is this not a bug?

@jhorstmann
Copy link
Contributor

which previously did not take the struct offset into account.

Is this not a bug?

Difficult to say without a spec or comment describing the expected behavior. Someone might rely on the current behavior and acccess a StructArray using something like a.column(0).downcast().value(a.offset() + i). I didn't see any such usages in the arrow-rs codebase. Our codebase makes heavy use of slices, but we do not use StructArray. If we consider the current behavior a bug, then we should clearly hightlight the change in the release notes.

@nevi-me
Copy link
Contributor

nevi-me commented Jun 12, 2022

I've read this thread a few times, but I'm still hazy on what a good approach is, given how out of the loop I have been for so long. We had discussed with Jorge many moons ago that passing the offset and length to Buffer and Bitmap would be a good solution (as is done in arrow2 like you mention @tustvold).

I haven't written arrow code in very long, so I can't quite remember the details. However, what I recall was having a challenge figuring out what happens in the below scenario.

An array is of type struct[a]<struct[b]<struct[c]<struct[d]<int32[e]>>>> and we slice it, what happens when we select a.b.c?

The trouble was that if we don't pass down the offset and length to the ArrayData of a's children, we'd be bound to always knowing a's offset, which forces us to compute it each time we access a or any of its children.

So in principle I favoured pushing down the offset at the time. Which I suppose has led us here:

ArrayData::Slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.


There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:

  • Remove the ArrayData::slice special-case
  • Slice child data within StructArray when constructing boxed_fields

This makes sense to implement as a solution (interim?), but yea perhaps first-prize would be propagating offsets and value lengths to a redesigned Buffer and Bitmap

@tustvold
Copy link
Contributor Author

The trouble was that if we don't pass down the offset and length to the ArrayData of a's children, we'd be bound to always knowing a's offset, which forces us to compute it each time we access a or any of its children.

I think caching the sliced array on StructArray::boxed_fields would avoid this, and is consistent with what we do for PrimtiveArray / StringArray, etc... ?

but yea perhaps first-prize would be propagating offsets and value lengths to a redesigned Buffer and Bitmap

Agreed, and I think the first step of this is removing the remaining places that "leak" offsets into the Array-level interfaces 😄

@tustvold
Copy link
Contributor Author

Closing in favor of removing the offsets entirely (#1799)

@tustvold tustvold reopened this Mar 16, 2023
@tustvold
Copy link
Contributor Author

Closed by #4061

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants