Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Query Error when one query pod is shutdown in cluster #16481

Open
1 of 2 tasks
inviscid opened this issue Sep 19, 2024 · 3 comments
Open
1 of 2 tasks

bug: Query Error when one query pod is shutdown in cluster #16481

inviscid opened this issue Sep 19, 2024 · 3 comments
Labels
A-query Area: databend query C-bug Category: something isn't working

Comments

@inviscid
Copy link

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.635-nightly

What's Wrong?

This is probably a combination of bug and feature request. The problem is that a loss of a query pod/machine while it is running a query causes an error to be raised to the SQL client. This is particularly prevalent when a cluster is auto-scaled to handle a load then begins scaling back in once the load subsides. As pods are shutdown, any client running a query on the pod receives a broken connection error.

Since one of the important reasons to use a cluster rather than larger machines is resilience to compute resource loss, this seems like the query cluster should be able to handle a cluster member loss by routing the query to another available member transparently to the external client. I think the expectation is that the previous transaction would be rolled back and a new transaction started.

How to Reproduce?

  1. Scale the query cluster to more than one member
  2. Begin running queries so that both members are answering query requests
  3. Terminate one of the query members without a graceful shutdown
  4. Observe the errors returned to SQL client for queries executing on the terminated member

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@inviscid inviscid added the C-bug Category: something isn't working label Sep 19, 2024
Copy link

dosubot bot commented Sep 19, 2024

When a query pod is terminated unexpectedly, the SQL client receives specific error messages such as:

  1. {"error":{"code":400,"message":"query id QID canceled"}}
  2. { "code": 1043, "message": "canceled by client" }
  3. { "code": 2006, "message": "MySQL server has gone away" }
  4. { "code": 2013, "message": "Lost connection to MySQL server during query" } [1][2].

The query cluster handles the loss of a query pod without causing errors to the SQL client by using a heartbeat mechanism and dynamically discovering and managing nodes. The ClusterDiscovery struct and its methods play a crucial role in this process. Here is how it works:

  1. Heartbeat Mechanism: The ClusterHeartbeat struct sends periodic heartbeats to ensure that the node is still active. If a node fails to send a heartbeat, it can be considered lost and removed from the cluster.

  2. Dynamic Node Discovery: The ClusterDiscovery struct has methods like discover and drop_invalid_nodes to dynamically discover active nodes and remove invalid ones, ensuring the cluster is always aware of the current state of its nodes.

  3. Graceful Shutdown: The unregister_to_metastore method ensures that a node is properly unregistered from the metastore during shutdown, preventing stale entries [3].

However, there is no specific mechanism mentioned for rolling back and retrying transactions when a node is lost. This might be an area for improvement or a feature request [3].

Additionally, there are configuration options available to control the behavior of the cluster during auto-scaling events, such as auto_compaction_imperfect_blocks_threshold and auto_compaction_segments_limit [4][5].

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added the A-query Area: databend query label Sep 19, 2024
@wubx
Copy link
Member

wubx commented Sep 20, 2024

This case was anticipated. We kindly request that the application handle the retry process.

@rad-pat
Copy link

rad-pat commented Sep 20, 2024

Is it possible to handle this at the driver level - bendSQL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-query Area: databend query C-bug Category: something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants