Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strongly Typed ArrayData #1799

Closed
tustvold opened this issue Jun 6, 2022 · 9 comments · Fixed by #3894
Closed

Strongly Typed ArrayData #1799

tustvold opened this issue Jun 6, 2022 · 9 comments · Fixed by #3894
Assignees
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog question Further information is requested

Comments

@tustvold
Copy link
Contributor

tustvold commented Jun 6, 2022

TLDR

Make ArrayData layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept which has proven to be rather error prone.

This is also the enabling feature that will support easy and zero cost interoperability between arrow-rs and arrow2 -- see jorgecarleitao/arrow2#1429

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently ArrayData is defined as follows.

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,

    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}

This is simple, but has a couple of caveats:

  • It isn't clear what is present for specific layout types
  • There is no clear path to storing BooleanArray as BitMap vs Buffer, which would allow removing offset
  • Vec allocations for one or two elements (the C++ implementation inlines these)
  • There is potential for accidentally interpreting a buffer incorrectly

Describe the solution you'd like

Introduce a new ArrayDataLayout enumeration:

pub enum ArrayDataLayout {
  Boolean { values: Bitmap },
  Primitive{ values: Buffer },
  Offsets { offsets: Buffer, values: Buffer },
  Dictionary { keys: Buffer, values: ArrayData },
  List { offsets: Buffer, elements: ArrayData },
  Struct { children: Vec<ArrayData> },
  Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout
}

We could then progressively deprecate the methods that explicitly refer to buffers by index, etc...

Describe alternatives you've considered

We could not do this

Additional context

This could be seen as an evolution of @HaoYang670 's proposal in #1640

It also relates to @jhorstmann 's proposal on #1499 (comment)

It could also be seen as an interpretation of the arrow2 physical vs logical type separation.

@tustvold tustvold added question Further information is requested enhancement Any new improvement worthy of a entry in the changelog api-change Changes to the arrow API labels Jun 6, 2022
@alamb
Copy link
Contributor

alamb commented Jun 6, 2022

This sounds like a great way to evolve

@HaoYang670
Copy link
Contributor

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout
}

It seems like that ArrayData::data_type is redundant because ArrayData::layout can also tell you the type of the array?

@tustvold
Copy link
Contributor Author

tustvold commented Jun 6, 2022

In arrow-rs currently, as bitwise operations are related to Buffer but not BitMap, I guess Buffer is a better type for BooleanArray.

I specifically want to change this, as we can support zero-copy slicing of BitMap, but not Buffer (as you can't slice at the bit-level) - #1802

Curiously, why not directly declare each type of Array with Buffers, for example:

A couple of reasons:

  • There are operations that don't and shouldn't care about the logical type, e.g. IPC, FFI, nullif, etc... ArrayData provides this
  • Buffer is not strongly typed and so is not a drop-in replacement for RawPtrBox (which I also have a separate plan to tweak)
  • Reduces code churn, this could theoretically not even be a breaking change

Basically I see Array and ArrayData filling different roles:

  • Array, ArrayBuilder, etc... are user-facing and should prioritise providing a strongly-typed, idiomatic, safe API for users
  • ArrayData is the low level API, with a focus on interoperability with other arrow systems

It seems like that ArrayData::data_type is redundant because ArrayData::layout can also tell you the type of the array?

You still need the DataType to roundtrip the actual type, e.g. int32 vs uint32, the Field for nested types, etc...

@HaoYang670
Copy link
Contributor

The reason that I prefer removing ArrayData::data_type is that it introduces the possibility of the inconsistency between ArrayData::data_type and ArrayData::layout. And this could increase the workload of ArrayData::validate (lots of pattern matching ...).

You still need the DataType to roundtrip the actual type, e.g. int32 vs uint32, the Field for nested types, etc...

The first way I thought is that we could inject dataType into ArrayDataLayout. For example:

pub enum ArrayDataLayout {
  ...
  Primitive(type: PrimitiveType, values: Buffer },
  Binary (is_large: Boolean, values: Buffer ...},
  ...
}

pub enum PrimitiveType {
    Int32,
    Int64, 
    ...
}

But this cannot support nested types well.

My second thought is that we could refactor DataType like this:

enum DataType {
    Primitive(type: PrimitiveType)
    List(type: ListType)
    ...
}

enum PrimitiveType {
    Int32,
    Int64,
    ...
}

enum ListType {
    List(Box<Field>),
    FixedSizeList(Box<Field>, i32),
    LargeList(Box<Field>),
}

I guess this could decrease the workload of ArrayData::validate.

@tustvold
Copy link
Contributor Author

tustvold commented Jun 6, 2022

My second thought is that we could refactor DataType like this:

I think this would break the desired separation of logical vs physical types. We don't want callers to have to match on all the possible DataType variants in order to interpret the buffers. An IPC writer shouldn't care if it is a primitive array of floats, or int32, just that it is a buffer of n bytes.

introduces the possibility of the inconsistency between ArrayData::data_type and ArrayData::layout
I guess this could decrease the workload of ArrayData::validate.

I don't think this is something to optimise for, we can easily add a cheap sanity check on construction. Tbh I don't think it would be all that bad match data_type would just change to match (data_type, layout)? I would be extremely hesitant to make fundamental changes to DataType.

@jhorstmann
Copy link
Contributor

I like it! One additional idea would be to change the layout of Boolean to use a Bitmap instead of Buffer:

pub enum ArrayDataLayout {
  Boolean { values: Bitmap },
  ...

If the Bitmap itself then stores a bit offset, we could remove the offset from ArrayData. Slicing of Bitmap would then always be zero-copy and slicing an array would push down the slicing into the validity buffer and layout.

This could in theory lead to a situation where you have different bit offsets for validity and data in a BooleanArray. Probably not a problem for any compute kernels, but could require copying in the FFI layer.

@tustvold
Copy link
Contributor Author

tustvold commented Jun 7, 2022

That's actually a typo, I meant to do exactly that and store Booleans as Bitmap 😅 will update

@tustvold tustvold self-assigned this Feb 22, 2023
tustvold added a commit that referenced this issue Mar 2, 2023
* Return Buffers from ArrayData::buffers instead of slice (#1799)

* Clippy
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 7, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2023
tustvold added a commit that referenced this issue Mar 8, 2023
* Add RunEndBuffer (#1799)

* Fix test

* Revert rename

* Format

* Clippy

* Remove unnecessary check

* Fix

* Tweak docs

* Add docs
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2023
tustvold added a commit that referenced this issue Mar 9, 2023
* Add ArrayDataLayout (#1799)

* Fix ArrayData::buffer

* Don't export macros, yet

* Fix doc

* Review feedback

* Further review feedback
@tustvold
Copy link
Contributor Author

tustvold commented Mar 16, 2023

Having worked through an implementation of this, the additional branching on operations that used to be free, e.g. looking up the datatype or null buffer, causes some quite serious performance regressions. Whilst it is possible to eliminate these, it turns into a game of performance wack-a-mole. Whilst I still think the design as articulated here has some compelling advantages, pragmatically I don't have the time or the inclination to work through every implementation fixing such regressions.

Taking a step back, the enumerations are not strictly necessary to achieve the goal of a type-safe, low-level data abstraction that can form a common basis between arrow and arrow2, as articulated here.

Instead the modified plan is as follows:

  • ArrayData remains as is without modification
  • The ArrayData enumerations and associated trait plumbing are removed
  • The strongly typed ArrayData structures, (e.g. PrimitiveArrayData, BytesArrayData), are made public
  • Provide From conversions between ArrayData and *ArrayData, etc...
  • Provide From conversions between *Array and *ArrayData
  • Implement From conversions between ArrayData and *Array via *ArrayData

We can then slowly reduce the usage of ArrayData within the codebase, in favor of *Array and *ArrayData. This achieves the stated goals, without requiring everything to move at the same time.

Edit: On further reflection there may be an even simpler option

@tustvold tustvold changed the title ArrayData Layout Enumeration Strongly Typed ArrayData Mar 16, 2023
@tustvold
Copy link
Contributor Author

Closing in favour of #3880 and #3879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API enhancement Any new improvement worthy of a entry in the changelog question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants