Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: panic and incorrect results in LogFunc::output_ordering() #11571

Merged
merged 2 commits into from
Jul 24, 2024

Conversation

jonahgao
Copy link
Member

Which issue does this PR close?

Closes #11549.

Rationale for this change

  1. The current implementation assumes log has two parameters, but this does not hold for log(100), which is equivalent to log(10, 100). This will cause panic due to index out of bounds.
  2. It mistakenly treats the second parameter as base, causing incorrect ordering results.

Ref: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#log

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 20, 2024
@@ -547,32 +554,32 @@ physical_plan
04)------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c11], output_ordering=[c11@0 ASC NULLS LAST], has_header=true

query TT
EXPLAIN SELECT LOG(c11, c12) as log_c11_base_c12
EXPLAIN SELECT LOG(c12, c11) as log_c11_base_c12
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base should be placed in the first parameter.

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this rapid fix!

It's not the problem for this PR, but I found this output_ordering calculation is extremely hard to get all cases correct, I think the correct way to implement it is to do some hard coded exact match only for a small number of obvious cases. For the rest cases we could blindly return unordered.

If the example I gave is not mistaken, we can open an issue to address it and fix later.

match (input[0].sort_properties, input[1].sort_properties) {
let (base_sort_properties, num_sort_properties) = if input.len() == 1 {
// log(x) defaults to log(10, x)
(SortProperties::Singleton, input[0].sort_properties)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me 👍🏼

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

} else {
(input[0].sort_properties, input[1].sort_properties)
};
match (num_sort_properties, base_sort_properties) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the following logic is not completely correct 🤔
e.g.
base is DESC NULLS FIRST
num is ASC NULLS LAST
This implementation will return log(base, num) is ASC NULLS LAST
But the actual output might look like

NULL
NULL
1
2
NULL

Which is unordered.

Copy link
Member Author

@jonahgao jonahgao Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. I noticed that other logic also does not consider nulls_first, for example

pub fn add(&self, rhs: &Self) -> Self {
match (self, rhs) {
(Self::Singleton, _) => *rhs,
(_, Self::Singleton) => *self,
(Self::Ordered(lhs), Self::Ordered(rhs))
if lhs.descending == rhs.descending =>
{
Self::Ordered(SortOptions {
descending: lhs.descending,
nulls_first: lhs.nulls_first || rhs.nulls_first,
})
}
_ => Self::Unordered,
}
}

Maybe @berkaysynnada can help confirm it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. We also need to check the placement of nulls in output_ordering() (and possibly in other implementations of it for multi-input functions). As @jonahgao noticed, impl SortProperties {} needs similar handling as well, but that can be addressed in a separate PR. I can create a ticket for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the nulls_first problem for log by 3feeaa8. We can create separate PRs for other fixes.

cc @2010YOUY01 @berkaysynnada

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me. Thanks, @jonahgao. Could you also address the missing cases of nulls order checks, or do we report a ticket?

match (input[0].sort_properties, input[1].sort_properties) {
let (base_sort_properties, num_sort_properties) = if input.len() == 1 {
// log(x) defaults to log(10, x)
(SortProperties::Singleton, input[0].sort_properties)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

} else {
(input[0].sort_properties, input[1].sort_properties)
};
match (num_sort_properties, base_sort_properties) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. We also need to check the placement of nulls in output_ordering() (and possibly in other implementations of it for multi-input functions). As @jonahgao noticed, impl SortProperties {} needs similar handling as well, but that can be addressed in a separate PR. I can create a ticket for it.

SortProperties::Singleton,
];
assert_eq!(results, expected);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests. output_ordering() API of scalar functions are experimental and needs unit tests (#10595). This is a good example for the rest of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output_ordering() API of scalar functions are experimental

What do you mean by "experimental"? If that is the case perhaps we can update the documentation to explain what that means -- I don't see any mention of it

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.output_ordering

@@ -512,7 +519,7 @@ CREATE EXTERNAL TABLE aggregate_test_100 (
)
STORED AS CSV
WITH ORDER(c11)
WITH ORDER(c12 DESC)
WITH ORDER(c12 DESC NULLS LAST)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make c11 and c12 have the same nulls_first.

@jonahgao
Copy link
Member Author

The fix looks good to me. Thanks, @jonahgao. Could you also address the missing cases of nulls order checks, or do we report a ticket?

Filed #11596

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jonahgao @berkaysynnada and @2010YOUY01 -- I think this PR is an improvement over what is on main,

As you have pointed out, filing some tickets to track the remaining work would be a good idea.

SortProperties::Singleton,
];
assert_eq!(results, expected);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output_ordering() API of scalar functions are experimental

What do you mean by "experimental"? If that is the case perhaps we can update the documentation to explain what that means -- I don't see any mention of it

https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.output_ordering

@jonahgao
Copy link
Member Author

Thanks all. The remaining issues are being tracked in ticket #11596

@jonahgao jonahgao merged commit c8ef545 into apache:main Jul 24, 2024
24 checks passed
@jonahgao jonahgao deleted the log_panic branch July 24, 2024 01:47
@berkaysynnada
Copy link
Contributor

What do you mean by "experimental"

@alamb I guess I used a wrong terminology. I meant that output_ordering() methods of ScalarUDF's are not well-tested by slt or unit tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash bug when log() is used in order by clause (SQLancer)
4 participants