Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Correct results for grouping sets when columns contain nulls #12571

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Sep 21, 2024

Which issue does this PR close?

Closes #12570

Rationale for this change

Currently we produce incorrect results when combining grouping sets and columns containing null values.

What changes are included in this PR?

The bug is fixed by introducing an internal column grouping_id when using grouping sets. This extra column makes sure that we create different groups for the nulls from the grouping sets and the data.

This approach is based on how it is implemented in Spark and has previously been proposed here: #5749 Note that this change is smaller in scope and limit the existence of the grouping_id to the physical plan. This is done so we end up with a smaller PR that is easier to review. But we might want to follow up and extend it to the logical plan an use it to implement the grouping function (#5647) in a similar manner to what is done in Spark.

Are these changes tested?

Existing and new sqllogictests.

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 21, 2024
@eejbyfeldt eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 5 times, most recently from f4a220b to 9c840b0 Compare September 22, 2024 10:27
@github-actions github-actions bot added the optimizer Optimizer rules label Sep 22, 2024
@eejbyfeldt eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 2 times, most recently from 8d01437 to fdce177 Compare September 22, 2024 11:19
@eejbyfeldt eejbyfeldt marked this pull request as ready for review September 23, 2024 17:08
@eejbyfeldt eejbyfeldt changed the title fix: Grouping sets when columns contain nulls fix: Correct results for grouping sets when columns contain nulls Sep 23, 2024
@alamb
Copy link
Contributor

alamb commented Sep 23, 2024

Thank you @eejbyfeldt

cc @thinkharderdev as I think you / your team implemented the GROUPING SETS implementation originally

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Had a few comments and questions :)

4 4
5 5
NULL 1
NULL NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC then without the fix in this PR the final NULL NULL row would be omitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, without the fix both of the NULL keys becomes a single group.

@@ -108,6 +110,8 @@ impl AggregateMode {
}
}

const INTERNAL_GROUPING_ID: &str = "grouping_id";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this conflicts with a user-defined field? E.g. if I had a query like:

SELECT grouping_id, count(1) FROM table GROUP BY CUBE(grouping_id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to just work (which surprised me).

}
}

/// Returns the data type of the grouping id.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a small comment on what the value we use as the grouping ID. Took me a moment to understand the logic below.

Suggested change
/// Returns the data type of the grouping id.
/// Returns the data type of the grouping id.
/// The grouping ID value is a bitmask where each set bit
/// indicates that the corresponding grouping expression is
/// null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added your comment.

The only reason to implement as a bitmask is if we plan to follow up by implementing the grouping function (#5647) on top of that column. Otherwise it might be better to make the id just be a sequential number. (It would actually be better in some ways it would also fix #5672)

// The number of internal expressions that are used to implement grouping
// sets. These output are removed from the final output and not in `expr`
// as they are generated based on the value in `groups`
num_internal_exprs: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a scenario in which this is something other than 0 or 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently not. My thinking was that in the future we might add more internal exprs to also solve (#5672) and or support more than 64 grouping columns (if someone ever needs that). As I think was done in #5749

@eejbyfeldt eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from fee8bdf to 230e4ef Compare September 27, 2024 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect results when using grouping sets with data containing nulls
3 participants