bug: Query Error when one query pod is shutdown in cluster #16481

inviscid · 2024-09-19T20:15:04Z

Search before asking

I have searched the issues and found no similar issues.

Version

v1.2.635-nightly

What's Wrong?

This is probably a combination of a bug and a feature request. The problem is that losing a query pod/machine while it is running a query causes an error to be raised to the SQL client. This is particularly prevalent when a cluster is auto-scaled to handle a load and then begins scaling back in once the load subsides. As pods are shut down, any client running a query on one of those pods receives a broken connection error.

Since one of the important reasons to use a cluster rather than larger machines is resilience to the loss of compute resources, it seems the query cluster should be able to handle the loss of a cluster member by routing the query to another available member, transparently to the external client. I think the expectation is that the previous transaction would be rolled back and a new transaction started.

How to Reproduce?

1. Scale the query cluster to more than one member.
2. Begin running queries so that both members are answering query requests.
3. Terminate one of the query members without a graceful shutdown.
4. Observe the errors returned to the SQL client for queries executing on the terminated member.

Are you willing to submit PR?

Yes I am willing to submit a PR!

dosubot · 2024-09-19T20:16:51Z

When a query pod is terminated unexpectedly, the SQL client receives specific error messages such as:

{"error":{"code":400,"message":"query id QID canceled"}}
{ "code": 1043, "message": "canceled by client" }
{ "code": 2006, "message": "MySQL server has gone away" }
{ "code": 2013, "message": "Lost connection to MySQL server during query" } ^[1]^[2].

The query cluster detects and removes a lost query pod by using a heartbeat mechanism and by dynamically discovering and managing nodes. This keeps cluster membership current, but it does not by itself shield a client whose query was running on the lost pod. The ClusterDiscovery struct and its methods play a crucial role in this process. Here is how it works:

Heartbeat Mechanism: The ClusterHeartbeat struct sends periodic heartbeats to ensure that the node is still active. If a node fails to send a heartbeat, it can be considered lost and removed from the cluster.
Dynamic Node Discovery: The ClusterDiscovery struct has methods like discover and drop_invalid_nodes to dynamically discover active nodes and remove invalid ones, ensuring the cluster is always aware of the current state of its nodes.
Graceful Shutdown: The unregister_to_metastore method ensures that a node is properly unregistered from the metastore during shutdown, preventing stale entries ^[3].
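
To make the three points above concrete, here is a minimal sketch of heartbeat-based membership in Python. It is not Databend's code (which is written in Rust) and makes no claim about how ClusterDiscovery is actually implemented. The `NodeRegistry` class, the `ttl_seconds` parameter, and the in-memory dictionary standing in for the metastore are all assumptions for illustration; only the method names `discover` and `drop_invalid_nodes` are borrowed from the description above.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeRegistry:
    """Hypothetical in-memory stand-in for the metastore's node list."""

    ttl_seconds: float  # assumed: how long a node may stay silent before it counts as lost
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node_id: str, now: Optional[float] = None) -> None:
        # Record that the node was alive at time `now` (registers it on first call).
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def unregister(self, node_id: str) -> None:
        # Graceful shutdown: remove the entry immediately so nothing stale is left behind.
        self.last_seen.pop(node_id, None)

    def drop_invalid_nodes(self, now: Optional[float] = None) -> List[str]:
        # Remove every node whose last heartbeat is older than the TTL.
        current = time.monotonic() if now is None else now
        expired = [n for n, seen in self.last_seen.items()
                   if current - seen > self.ttl_seconds]
        for node_id in expired:
            del self.last_seen[node_id]
        return expired

    def discover(self, now: Optional[float] = None) -> List[str]:
        # Return the nodes currently considered alive, pruning lost ones first.
        self.drop_invalid_nodes(now)
        return sorted(self.last_seen)


registry = NodeRegistry(ttl_seconds=10.0)
registry.heartbeat("query-0", now=0.0)
registry.heartbeat("query-1", now=0.0)
registry.heartbeat("query-0", now=8.0)   # query-1 was killed and stays silent
alive = registry.discover(now=12.0)      # query-1 exceeded the TTL and is dropped
```

Note that in this sketch a killed node is only noticed after its TTL expires, and nothing here touches queries that were already running on it.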

However, there is no specific mechanism mentioned for rolling back and retrying transactions when a node is lost. This might be an area for improvement or a feature request ^[3].

Additionally, there are configuration options available to control the behavior of the cluster during auto-scaling events, such as auto_compaction_imperfect_blocks_threshold and auto_compaction_segments_limit ^[4]^[5].


wubx · 2024-09-20T04:00:39Z

This case was anticipated. We kindly request that the application handle the retry process.
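
A minimal sketch of what application-side retry could look like, in Python. Everything here is an assumption for illustration: `QueryError`, `run_with_retry`, and the `execute` callable are hypothetical names and not part of any Databend driver. The retryable codes 2006 and 2013 are the connection-loss codes quoted earlier in this thread; whether the other quoted errors should also be retried is left open. Retrying like this is only safe when re-running the statement, or the whole transaction, from the start is acceptable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryError(Exception):
    """Hypothetical error type carrying the numeric code returned to the client."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


# Assumption: only the connection-loss codes quoted in this thread are retried.
RETRYABLE_CODES = {2006, 2013}


def run_with_retry(
    execute: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `execute`, retrying with exponential backoff on connection loss.

    `execute` is expected to open a fresh connection on each call, so that a
    retry can land on a surviving cluster member.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except QueryError as err:
            # Give up on non-retryable errors or once the attempts are used up.
            if err.code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so that the backoff can be replaced in tests; an application would normally leave it at its default.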

rad-pat · 2024-09-20T08:34:04Z

Is it possible to handle this at the driver level - bendSQL?

inviscid added the C-bug Category: something isn't working label Sep 19, 2024

dosubot bot added the A-query Area: databend query label Sep 19, 2024
