Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add union_extract scalar function #12116

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft

Conversation

gstvg
Copy link
Contributor

@gstvg gstvg commented Aug 22, 2024

Which issue does this PR close?

Closes #11081

What changes are included in this PR?

union_extract implementation and docs
Add tables with union column on the session context of sqllogictests

Are these changes tested?

Yes. There's sqllogictests for one sparse and one dense union, and for each error case
Since it's not possible yet to create scalar unions in SQL, those cases are tested as unit tests

Are there any user-facing changes?

A new union_extract scalar function, and it's docs

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions labels Aug 22, 2024
}

const EQ_SCALAR_CHUNK_SIZE: usize = 512;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those pub items with #[doc(hidden)] are there to be benchmarked, in case any reviewer wants to, but maybe it can be removed before merge. Just let me know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I think we could remove them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed on arrow-rs PR


fn criterion_benchmark(c: &mut Criterion) {
let union_extract = UnionExtractFun::new();

Copy link
Contributor Author

@gstvg gstvg Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although most part of these instantiations are very repetitive, the differences between them are so subtle that I personally prefer to be as explicit as possible. Just ask for simplification if that's preferable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure do we even need this 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you aren't sure, I'm more than happy to not even need to include them on the arrow-rs PR 😀

ctx.register_batch(table_name, batch).unwrap();
}

fn sparse_1_1_single_field(ctx: &SessionContext) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gone through the implementation line by line, but overall this looks pretty good to me.

impl UnionExtractFun {
pub fn new() -> Self {
Self {
signature: Signature::any(2, Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific than this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, I will take a look at Signature::UserDefnied and ScalarUDFImpl::coerce_types

Copy link
Contributor Author

@gstvg gstvg Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed using ScalarUDFImpl::coerce_types in case you wanna take a look
It changed the error messages, from this to this, but I'm not sure if it's better

);
}

let fields = if let DataType::Union(fields, _) = &arg_types[0] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicer to write this like

let DataType::Union(fields, _) = &arg.types[0] else {
    return exec_err!(...
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, this syntax wasn't stable when I started rust, didn't know about it. Thanks!
Done

);
};

let field_name = if let Expr::Literal(ScalarValue::Utf8(Some(field_name))) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gstvg gstvg marked this pull request as draft August 23, 2024 18:20

// This is like MutableBuffer::collect_bool(type_ids.len(), |i| type_ids[i] == target) with fast paths for all true or all false values.
#[doc(hidden)]
pub fn eq_scalar_generic<const N: usize>(type_ids: &[i8], target: i8) -> BoolValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we make them private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made them private on arrow-rs

@@ -0,0 +1,255 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

@jayzhan211 jayzhan211 Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe name with union_function.slt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better
Done

@@ -361,3 +369,893 @@ fn create_example_udf() -> ScalarUDF {
adder,
)
}

fn register_union_tables(ctx: &SessionContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we create table in slt file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until #10206 is fixed, I don't think so. Anyway, with the core of this PR moved to arrow-rs, all these tests are unit tests on arrow-rs now 🙏

.build_unchecked()
};

Ok(ColumnarValue::Array(make_array(data)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here seems equivalent to

new_null_array(target.data_type(), target.len());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantically yes, but new_null_array also allocates data buffers

@gstvg
Copy link
Contributor Author

gstvg commented Sep 11, 2024

Thanks @samuelcolvin and @jayzhan211 for your reviews. I moved part of this PR to apache/arrow-rs#6387. I made all your requests that were applicable there, and then I will make the requests that still apply here.
Sorry for any review work hours that became unused after this 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add union_extract function
3 participants