fix: Correct results for grouping sets when columns contain nulls #12571

eejbyfeldt · 2024-09-21T09:30:38Z

Which issue does this PR close?

Rationale for this change

Currently we produce incorrect results when combining grouping sets and columns containing null values.

What changes are included in this PR?

The bug is fixed by introducing an internal column grouping_id when using grouping sets. This extra column makes sure that we create different groups for nulls introduced by the grouping sets and nulls present in the data.
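To illustrate the idea, here is a minimal sketch in plain Rust that does not use DataFusion's actual types. The `Row` alias, the two helper functions, and the sample data are all hypothetical; they only model how a key made of the value alone merges NULLs from different sources, while a key that also carries a grouping id keeps them apart.

```rust
use std::collections::HashMap;

/// One row after grouping-set expansion: (grouping id, key value).
/// `None` models SQL NULL. This layout is an assumption for illustration.
type Row = (u8, Option<i32>);

/// Group on the key value only: NULLs from the data and NULLs produced by
/// a grouping set that omits the column end up in the same group.
fn count_by_key(rows: &[Row]) -> HashMap<Option<i32>, usize> {
    let mut groups = HashMap::new();
    for (_, key) in rows {
        *groups.entry(*key).or_insert(0) += 1;
    }
    groups
}

/// Group on (grouping id, key value): the two kinds of NULL stay separate.
fn count_by_id_and_key(rows: &[Row]) -> HashMap<Row, usize> {
    let mut groups = HashMap::new();
    for row in rows {
        *groups.entry(*row).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // Models GROUP BY GROUPING SETS ((a), ()) over a column a = [1, NULL].
    let rows: Vec<Row> = vec![
        (0, Some(1)), // grouping set (a), a = 1
        (0, None),    // grouping set (a), a is NULL in the data
        (1, None),    // grouping set (), a is masked out (first input row)
        (1, None),    // grouping set (), a is masked out (second input row)
    ];

    println!("groups without grouping id: {}", count_by_key(&rows).len());
    println!("groups with grouping id:    {}", count_by_id_and_key(&rows).len());
}
```

With the grouping id in the key, the sample data yields three groups instead of two, because the NULL from the data no longer merges with the NULL of the empty grouping set.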

This approach is based on how it is implemented in Spark and was previously proposed in #5749. Note that this change is smaller in scope and limits the existence of the grouping_id to the physical plan, so that we end up with a smaller PR that is easier to review. We might want to follow up by extending it to the logical plan and using it to implement the grouping function (#5647), in a similar manner to what is done in Spark.

Are these changes tested?

Existing and new sqllogictests.

Are there any user-facing changes?

alamb · 2024-09-23T17:15:07Z

Thank you @eejbyfeldt

cc @thinkharderdev as I think you / your team originally implemented GROUPING SETS

thinkharderdev

Nice work! Had a few comments and questions :)

thinkharderdev · 2024-09-24T19:40:53Z

datafusion/sqllogictest/test_files/aggregate.slt

+4 4
+5 5
+NULL 1
+NULL NULL


IIUC then without the fix in this PR the final NULL NULL row would be omitted?

Yes, without the fix both of the NULL keys become a single group.

thinkharderdev · 2024-09-24T20:56:18Z

datafusion/physical-plan/src/aggregates/mod.rs

@@ -108,6 +110,8 @@ impl AggregateMode {
    }
 }

+const INTERNAL_GROUPING_ID: &str = "grouping_id";


What happens if this conflicts with a user-defined field? E.g. if I had a query like:

SELECT grouping_id, count(1) FROM table GROUP BY CUBE(grouping_id)

Seems to just work (which surprised me).

thinkharderdev · 2024-09-24T20:59:37Z

datafusion/physical-plan/src/aggregates/mod.rs

+        }
+    }
+
+    /// Returns the data type of the grouping id.


Maybe add a small comment on what value we use as the grouping ID. It took me a moment to understand the logic below.

Suggested change

-    /// Returns the data type of the grouping id.
+    /// Returns the data type of the grouping id.
+    /// The grouping ID value is a bitmask where each set bit
+    /// indicates that the corresponding grouping expression is
+    /// null

Added your comment.

The only reason to implement it as a bitmask is if we plan to follow up by implementing the grouping function (#5647) on top of that column. Otherwise it might be better to make the id a plain sequential number. (That would actually be better in some ways, since it would also fix #5672.)
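The two id schemes discussed here can be compared with a small sketch. The function names, the `masked` representation, and the bit ordering (first grouping expression in the most significant position) are assumptions made for illustration, not necessarily what the PR does.

```rust
/// Bitmask id for one grouping set. `masked[i]` is true when grouping
/// expression `i` is replaced by NULL in this set. The first expression
/// ends up in the most significant used bit (an assumed ordering).
/// Returns `None` when the mask would not fit in a `u64`.
fn bitmask_id(masked: &[bool]) -> Option<u64> {
    if masked.len() > 64 {
        return None;
    }
    Some(masked.iter().fold(0u64, |acc, &m| (acc << 1) | m as u64))
}

/// Sequential ids: each grouping set simply gets its position in the list.
fn sequential_ids(sets: &[Vec<bool>]) -> Vec<u64> {
    (0..sets.len() as u64).collect()
}

fn main() {
    // Models CUBE(a, b): the sets (a, b), (a), (b), ().
    let sets = vec![
        vec![false, false], // (a, b): nothing masked
        vec![false, true],  // (a):    b masked
        vec![true, false],  // (b):    a masked
        vec![true, true],   // ():     a and b masked
    ];

    for (set, seq) in sets.iter().zip(sequential_ids(&sets)) {
        println!("masked={:?} bitmask={:?} sequential={}", set, bitmask_id(set), seq);
    }
}
```

One observable difference in this sketch: two identical grouping sets produce the same bitmask but distinct sequential ids, and the bitmask variant is limited by the integer width, while only the bitmask records which expressions were masked.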

thinkharderdev · 2024-09-24T21:00:31Z

datafusion/physical-plan/src/aggregates/mod.rs

+    // The number of internal expressions that are used to implement grouping
+    // sets. These output are removed from the final output and not in `expr`
+    // as they are generated based on the value in `groups`
+    num_internal_exprs: usize,


Is there a scenario in which this is something other than 0 or 1?

Currently not. My thinking was that in the future we might add more internal exprs to also solve #5672 and/or to support more than 64 grouping columns (if someone ever needs that), as I think was done in #5749.

github-actions bot added physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 21, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 5 times, most recently from f4a220b to 9c840b0 Compare September 22, 2024 10:27

github-actions bot added the optimizer Optimizer rules label Sep 22, 2024

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch 2 times, most recently from 8d01437 to fdce177 Compare September 22, 2024 11:19

eejbyfeldt marked this pull request as ready for review September 23, 2024 17:08

eejbyfeldt changed the title ~~fix: Grouping sets when columns contain nulls~~ fix: Correct results for grouping sets when columns contain nulls Sep 23, 2024

thinkharderdev reviewed Sep 24, 2024


eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from fdce177 to fee8bdf Compare September 25, 2024 18:09

eejbyfeldt mentioned this pull request Sep 25, 2024

Implement GROUPING aggregate function (following Postgres behavior.) #12565

Open

eejbyfeldt added 3 commits September 27, 2024 15:42

Fix grouping sets behavior when data contains nulls

68d01f7

PR suggestion comment

429af70

Update new test case

230e4ef

eejbyfeldt force-pushed the fix-grouping-sets-with-null-values branch from fee8bdf to 230e4ef Compare September 27, 2024 13:47
