Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GenericListViewArray for ListView & LargeListView #5723

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

Kikkon
Copy link
Contributor

@Kikkon Kikkon commented May 5, 2024

Which issue does this PR close?

A part of #5501

Rationale for this change

What changes are included in this PR?

Add GenericListViewArray implement

Are there any user-facing changes?

@Kikkon
Copy link
Contributor Author

Kikkon commented Jun 11, 2024

@tustvold If you have time, I would appreciate it if you could help me review this PR. Thanks!

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, been swamped and then sick, but this is looking almost there

@Kikkon
Copy link
Contributor Author

Kikkon commented Jun 16, 2024

Thank you for taking the time to review my code despite being busy and unwell.

@Kikkon Kikkon requested a review from tustvold June 16, 2024 14:20
@tustvold
Copy link
Contributor

tustvold commented Jun 26, 2024

For future reference, it is helpful for reviewers if you don't force push branches once reviews have been made, this makes it much easier to see what has changed. I'll try to find time to review this later today, but I now have to do a full review again which is quite time consuming on such a large PR

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 1, 2024

For future reference, it is helpful for reviewers if you don't force push branches once reviews have been made, this makes it much easier to see what has changed. I'll try to find time to review this later today, but I now have to do a full review again which is quite time consuming on such a large PR

@tustvold Thank you for your help. I will be more careful about the force push issue in my future commits. I am very sorry.

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 22, 2024

Hi @tustvold , perhaps when you're free this week, you could help me review this. This way, we can continue to move forward with this work. Thank you.

@tustvold
Copy link
Contributor

tustvold commented Jul 22, 2024

I'm afraid I don't have capacity to assist with this initiative further, but perhaps @alamb may do. I am reducing my involvement in arrow-rs as I no longer use it in my day job

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 24, 2024

@alamb Perhaps you can help advance this PR when you have time. 🥸

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions but overall this matches my understanding of list view arrays. Thanks for the work!

arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
Comment on lines 52 to 62
impl<OffsetSize: OffsetSizeTrait> Clone for GenericListViewArray<OffsetSize> {
fn clone(&self) -> Self {
Self {
data_type: self.data_type.clone(),
nulls: self.nulls.clone(),
values: self.values.clone(),
value_offsets: self.value_offsets.clone(),
value_sizes: self.value_sizes.clone(),
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you derive the Clone impl?

Comment on lines 111 to 114
for i in 0..offsets.len() {
let length = values.len();
let offset = offsets[i].as_usize();
let size = sizes[i].as_usize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit, is it faster to use offsets.iter().zip(sizes) (maybe the optimize down to the same thing?)

///
/// * `offsets.len() != sizes.len()`
/// * `offsets.len() != nulls.len()`
/// * `offsets.last() > values.len()`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies you only check the last offset but you validate all offsets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code again and realized that I had already validated all the offsets, so I changed it to offsets[i] > values.len().

length
)));
}
if offset.saturating_add(size) > values.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the above length comparison if you also have this comparison?

let mut sizes = Vec::with_capacity(iter.size_hint().0);
for size in iter {
offsets.push(OffsetSize::usize_as(acc));
acc = acc.checked_add(size).expect("usize overflow");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor corner case nit: Imagine we are working with i32 offsets and we have 1Gi - 1 fixed size lists of size 2. Will this panic even thought it shouldn't.

acc = acc.checked_add(size).expect("usize overflow");
sizes.push(OffsetSize::usize_as(size));
}
OffsetSize::from_usize(acc).expect("offset overflow");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this final check? It seems redundant.

arrow-array/src/array/list_view_array.rs Outdated Show resolved Hide resolved
Comment on lines 465 to 466
// SAFETY:
// ArrayData is valid, and verified type above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why SAFETY? We aren't doing anything unsafe here that I can see.

let sizes = list.value_sizes();
assert_eq!(sizes, &[3, 3, 3]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other tests might be:

  • Test where the lists overlap (e.g. sizes = [5, 5] and offsets = [0, 3] and values has length 8.
  • Test where offsets does not cover the entire range of values (e.g. values has 50 things and offsets only go up to 10)
  • List of empty lists (with an empty values)

@Kikkon
Copy link
Contributor Author

Kikkon commented Jul 31, 2024

hi @westonpace please help me review again

@Kikkon Kikkon requested a review from westonpace July 31, 2024 16:37
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few last small things.

for (offset, size) in offsets.iter().zip(sizes.iter()) {
let offset = offset.as_usize();
let size = size.as_usize();
if offset.saturating_add(size) > values.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a checked_add.

Comment on lines 396 to 400
for size in iter {
offsets.push(OffsetSize::usize_as(acc));
acc = acc.saturating_add(size);
sizes.push(OffsetSize::usize_as(size));
}
Copy link
Member

@westonpace westonpace Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did like that you were checking for overflow here, my only concern was that it was in the wrong spot. What if, before the for loop, you checked to ensure that value.len() is less than or equal to the max offset? Then you can keep using saturating_add here. Or even better, use the basic add since we've verified it will never overflow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reconsidered, and it seems that no additional validation is needed here. 🤔

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@Kikkon
Copy link
Contributor Author

Kikkon commented Sep 21, 2024

@westonpace 😭 Apologies for the delay in processing this pull request due to work reasons. I hope you can help review it when you have time.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor thing but I think this is good to go. Thanks for addressing the earlier reviews :)

Comment on lines 406 to 407
//check if the size is usize

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure I understand the meaning of this comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is no longer needed, so I've removed it.

self.value_sizes[i]
}

/// Returns the offset for value as index `i`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns the offset for value as index `i`.
/// Returns the offset for value at index `i`.

self.values.slice(offset, length)
}

/// Returns ith value of this list view array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should get a # Panics note?

Comment on lines +111 to +115
"Offset + size for {}ListViewArray must be within the bounds of the child array, got offset: {}, size: {}, child array length: {}",
OffsetSize::PREFIX,
offset,
size,
values.len()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Offset + size for {}ListViewArray must be within the bounds of the child array, got offset: {}, size: {}, child array length: {}",
OffsetSize::PREFIX,
offset,
size,
values.len()
"Offset + size for {}ListViewArray must be within the bounds of the child array, got offset: {offset}, size: {size}, child array length: {}",
OffsetSize::PREFIX,
values.len()

Comment on lines +95 to +96
"Length of offsets buffer and sizes buffer must be equal for {}ListViewArray, got {} and {}",
OffsetSize::PREFIX, len, sizes.len()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Length of offsets buffer and sizes buffer must be equal for {}ListViewArray, got {} and {}",
OffsetSize::PREFIX, len, sizes.len()
"Length of offsets buffer and sizes buffer must be equal for {}ListViewArray, got {len} and {}",
OffsetSize::PREFIX, sizes.len()

self.value_offsets[i]
}

/// constructs a new iterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// constructs a new iterator
/// Constructs a new iterator

Comment on lines +266 to +268
pub fn iter<'a>(&'a self) -> GenericListViewArrayIter<'a, OffsetSize> {
GenericListViewArrayIter::<'a, OffsetSize>::new(self)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use lifetime elision?

Suggested change
pub fn iter<'a>(&'a self) -> GenericListViewArrayIter<'a, OffsetSize> {
GenericListViewArrayIter::<'a, OffsetSize>::new(self)
}
pub fn iter(&self) -> GenericListViewArrayIter<'_, OffsetSize> {
GenericListViewArrayIter::<'_, OffsetSize>::new(self)
}

Comment on lines +445 to +448
"[Large]ListViewArray's child datatype {:?} does not \
correspond to the List's datatype {:?}",
values.data_type(),
child_data_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"[Large]ListViewArray's child datatype {:?} does not \
correspond to the List's datatype {:?}",
values.data_type(),
child_data_type
"{}ListViewArray's child datatype {:?} does not \
correspond to the List's datatype {:?}",
OffsetSize::PREFIX,
values.data_type(),
child_data_type

Comment on lines +453 to +454
"[Large]ListViewArray's datatype must be [Large]ListViewArray(). It is {:?}",
data.data_type()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"[Large]ListViewArray's datatype must be [Large]ListViewArray(). It is {:?}",
data.data_type()
"{}ListViewArray's datatype must be {}ListViewArray(). It is {:?}",
OffsetSize::PREFIX,
OffsetSize::PREFIX,
data.data_type()


/// A [`GenericListViewArray`] of variable size lists, storing offsets as `i32`.
///
// See [`ListViewBuilder`](crate::builder::ListViewBuilder) for how to construct a [`ListViewArray`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// See [`ListViewBuilder`](crate::builder::ListViewBuilder) for how to construct a [`ListViewArray`]
/// See [`ListViewBuilder`](crate::builder::ListViewBuilder) for how to construct a [`ListViewArray`]


/// A [`GenericListViewArray`] of variable size lists, storing offsets as `i64`.
///
// See [`LargeListViewBuilder`](crate::builder::LargeListViewBuilder) for how to construct a [`LargeListViewArray`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// See [`LargeListViewBuilder`](crate::builder::LargeListViewBuilder) for how to construct a [`LargeListViewArray`]
/// See [`LargeListViewBuilder`](crate::builder::LargeListViewBuilder) for how to construct a [`LargeListViewArray`]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants