-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Correct results for grouping sets when columns contain nulls #12571
base: main
Are you sure you want to change the base?
fix: Correct results for grouping sets when columns contain nulls #12571
Conversation
f4a220b
to
9c840b0
Compare
8d01437
to
fdce177
Compare
Thank you @eejbyfeldt cc @thinkharderdev as I think you / your team implemented the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Had a few comments and questions :)
4 4 | ||
5 5 | ||
NULL 1 | ||
NULL NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC then without the fix in this PR the final NULL NULL
row would be omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, without the fix both of the NULL
keys becomes a single group.
@@ -108,6 +110,8 @@ impl AggregateMode { | |||
} | |||
} | |||
|
|||
const INTERNAL_GROUPING_ID: &str = "grouping_id"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if this conflicts with a user-defined field? E.g. if I had a query like:
SELECT grouping_id, count(1) FROM table GROUP BY CUBE(grouping_id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to just work (which surprised me).
} | ||
} | ||
|
||
/// Returns the data type of the grouping id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a small comment on what the value we use as the grouping ID. Took me a moment to understand the logic below.
/// Returns the data type of the grouping id. | |
/// Returns the data type of the grouping id. | |
/// The grouping ID value is a bitmask where each set bit | |
/// indicates that the corresponding grouping expression is | |
/// null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The number of internal expressions that are used to implement grouping | ||
// sets. These output are removed from the final output and not in `expr` | ||
// as they are generated based on the value in `groups` | ||
num_internal_exprs: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a scenario in which this is something other than 0
or 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fdce177
to
fee8bdf
Compare
fee8bdf
to
230e4ef
Compare
Which issue does this PR close?
Closes #12570
Rationale for this change
Currently we produce incorrect results when combining grouping sets and columns containing null values.
What changes are included in this PR?
The bug is fixed by introducing an internal column
grouping_id
when using grouping sets. This extra column makes sure that we create different groups for the nulls from the grouping sets and the data.This approach is based on how it is implemented in Spark and has previously been proposed here: #5749 Note that this change is smaller in scope and limit the existence of the
grouping_id
to the physical plan. This is done so we end up with a smaller PR that is easier to review. But we might want to follow up and extend it to the logical plan an use it to implement the grouping function (#5647) in a similar manner to what is done in Spark.Are these changes tested?
Existing and new sqllogictests.
Are there any user-facing changes?