Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The Devil of MergeGroup #148

Open
AveryQi115 opened this issue Apr 5, 2024 · 5 comments
Open

The Devil of MergeGroup #148

AveryQi115 opened this issue Apr 5, 2024 · 5 comments

Comments

@AveryQi115
Copy link
Contributor

Why we have MergeGroup in the first place?

For both testing the optd cascades core and shrinking the search space, we were trying to add heuristic rules in cascades. However, some simple rules like eliminateFilter which eliminate filter node when the filter predicates are always true lead to the situation that we have to add a MergeGroup in optd core.

// memo before eliminateFilter
Group(n): Filter(true) -> child(Group(n+1))
Group(n+1): expr
// memo after eliminateFilter
Group(n): Filter(true) -> child(Group(n+1)), placeholder(Group(n+1))
Group(n+1): expr
// The result returned by eliminateFilter means that Group(n) and Group(n+1) are actually logically equivalent. So we add MergeGroup in optd to merge these two groups.

The devil of MergeGroup

There's no MergeGroup definition in any cascades paper nor columbia paper and corresponding implementation. The adding of mergeGroup is actually pretty experimental. Besides, in real world optimizer, there are always rewrite phase/prepare phase before the cascades framework to make the expressions normalized. There's actually no need to do such heuristic rules in cascades.

The introduction of MergeGroup leads to two problems: loops in operator tree.

loops in operator tree

Take Select * from t1 where true as an example. The initial groups are:

// memo
G1: filter(true) -> child(G2)
G2: Scan (t1)

After eliminateFilter in cascades rule, it becomes:

// memo
G1: filter(true)->child(G2), Scan(t1)
G2: merged to G1, reduceGroupId=G1

It is ok when we execute the query as it only cares the winner. But when we wants to traverse the operator tree somehow, it goes into loops for expressions like filter(true)->child(G2) as they are expressions points to themselves as a child.

It makes tricky bugs for some logic other than execution, for example, join order enumeration, tree traversal, debug printing and so on.

A tricky case is that will it make firing tasks out of order?

In optd, the order of the tasks fired are carefully designed following the columbia paper.

We first fire OptimizeGroup task for the root/top/first group, it then traverse all the expressions in the task and fire OptimizeExpr task for all the logical exprs in the group, and OptimizeInput task for all the physical exprs in the group.

OptimizeExpr first call ExploreGroup tasks for the expr's children and then traverse the rules and fire ApplyRule task for the matched rules for expr.

Newly generated logical exprs are fired with new OptimizeExpr as they are not originally in the group when optimizeGroup or ExploreGroup is called.

OptimizeInput is responsible to calculate the costs for the expr and update winner for the group, it fires optimizeGroup for the children group when the children group haven't have a winner. But except for the root/top/first group, other groups have to be the children node of the physical expression in OptimizeInput can it be called with optimizeGroup.

In abstract, the cascades framework expect all the groups and their children are fully explored (apply all matching rules) from top group to the leaf group, and the costs and winners are updated then from the top group's physical expression to the children group.

If we merge two groups, these groups exist before merge (as rules can not create new group themselves, they can only return existing group). They are either being called with exploreGroup/OptimizeGroup or not. If they are called with exploreGroup before, we can not explore the group again.(exploreGroup skip groups following the groupId, though they are merged, the group Id remains the same). And optimizeGroup can only be called for root group or children group of physical expressions in parent group.

So if the merge destination group is explored before, the merge source group is not explored and both groups are not being called with optimizeGroup(it is either not the children group of physical expressions or pruned if we adds pruning), it will miss newly added expressions from merge source group. 🤯🤯🤯 It is a very trivial case and hard to be found.

Future task

MergeGroup will lead to debugging nightmare and tricky problems. We may need to remove it. Besides, if we don't add heuristic rules into cascades logic, we really don't need it.

@jurplel
Copy link
Member

jurplel commented Sep 5, 2024

After further discussion I question why we need merging at all, even for the cases outlined where we remove nodes.

Could we not simply add the alternative to the group—duplicating nodes at one level?

@averyqi-db
Copy link

I tried removing merge before, and it leads to further problem. It might go into infinite loop and same optimization might be executed continuously. Just take switch join(a,b) -> join(b,a) as an example.

@jurplel
Copy link
Member

jurplel commented Sep 5, 2024

Ah, I need to dig into how we handle that case now then. Looking more into Cascades examples I think we may handle this is a non standard way and we might want to change the current handling

@averyqi-db
Copy link

Sure, please let me know if you have ideas on how to change this. Quite curious about standard way to do it~ Thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants