Log the last time we spontaneously disconnected from the cluster when forked. #1874

Merged
merged 6 commits into from
Sep 23, 2024

Conversation

tylerkaraszewski
Contributor

@tylerkaraszewski tylerkaraszewski commented Sep 18, 2024

Details

This adds extra logging to the Hash mismatch log lines to indicate if we've recently lost quorum due to a disconnection.

It is more-or-less expected that losing quorum while leading will result in a forked DB node. There's no way for the node to anticipate that it is about to lose quorum, so it will continue committing parallel transactions until it notices the disconnect, at which point it's too late. Another node will begin leading, and this node will have commits that it was unable to send.
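
As an illustration of the idea only (this is not the code from this PR), a node could remember the moment it noticed it lost quorum while leading, so that a later fork report can say whether the two events were close together. The class name `LostQuorumTracker`, its methods, and the microsecond timestamp convention are all hypothetical:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper, not Bedrock's actual code: remembers when a leading
// node last noticed it lost quorum, so a later fork report can mention it.
class LostQuorumTracker {
  public:
    // Call when the node notices it dropped below quorum while leading.
    void recordLostQuorum(uint64_t nowMicros) {
        _lastLostQuorumMicros.store(nowMicros);
    }

    // True if quorum was lost within `windowMicros` before `nowMicros`.
    bool lostQuorumRecently(uint64_t nowMicros, uint64_t windowMicros) const {
        const uint64_t last = _lastLostQuorumMicros.load();
        return last != 0 && nowMicros >= last && nowMicros - last <= windowMicros;
    }

    // Raw timestamp of the last recorded loss; 0 means "never".
    uint64_t lastLostQuorumMicros() const {
        return _lastLostQuorumMicros.load();
    }

  private:
    std::atomic<uint64_t> _lastLostQuorumMicros{0};
};
```

The atomic is used here on the assumption that the timestamp may be written by one thread and read by another; a plain member would do if both happen on the same thread.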

Fixed Issues

Fixes https://github.com/Expensify/Expensify/issues/384477 https://github.com/Expensify/Expensify/issues/422697

Tests

Artificially setting the timestamp logs:

2024-09-18T15:30:53.388360-05:00 expensidev2004 bedrock10004: xxxxxx (SQLiteNode.cpp:1604) _onMESSAGE [sync] [eror] {cluster_node_0/SYNCHRONIZING} Hash mismatch. I have forked from over half the cluster. This is unrecoverable. Lost Quorum at: 2024-09-18 20:30:43.388 (10.000014 seconds ago).
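
For readers who want to see how a suffix shaped like the one in this log line could be built, here is a minimal sketch. It is not the body of `_getLostQuorumLogMessage()` from this PR; the free function `lostQuorumLogMessage`, its parameters, and the use of microsecond UNIX timestamps are assumptions made for illustration:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <time.h> // gmtime_r is POSIX, so the C header is used here

// Hypothetical sketch: builds a " Lost Quorum at: ..." suffix from microsecond
// UNIX timestamps. Returns an empty string if quorum was never lost.
std::string lostQuorumLogMessage(uint64_t lostAtMicros, uint64_t nowMicros) {
    if (lostAtMicros == 0 || nowMicros < lostAtMicros) {
        return "";
    }

    // Split the timestamp into whole seconds and leftover milliseconds.
    const time_t seconds = static_cast<time_t>(lostAtMicros / 1000000);
    const unsigned millis = static_cast<unsigned>((lostAtMicros / 1000) % 1000);

    // Format the date in UTC.
    struct tm utc {};
    gmtime_r(&seconds, &utc);
    char date[32];
    strftime(date, sizeof(date), "%Y-%m-%d %H:%M:%S", &utc);

    // Elapsed time since quorum was lost, in fractional seconds.
    const double secondsAgo = static_cast<double>(nowMicros - lostAtMicros) / 1000000.0;

    char buffer[128];
    std::snprintf(buffer, sizeof(buffer), " Lost Quorum at: %s.%03u (%f seconds ago).",
                  date, millis, secondsAgo);
    return buffer;
}
```

Calling `lostQuorumLogMessage(1726691443388000, 1726691453388014)` reproduces the suffix shown in the test log above. Returning an empty string when no loss was recorded lets the caller append the result to the hash mismatch message unconditionally.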

Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

@tylerkaraszewski tylerkaraszewski self-assigned this Sep 18, 2024
@tylerkaraszewski tylerkaraszewski changed the title from "[WIP] Log the last time we spontaneously disconnected from the cluster when forked." to "Log the last time we spontaneously disconnected from the cluster when forked." Sep 18, 2024
@melvin-bot melvin-bot bot requested review from grgia and removed request for a team September 18, 2024 20:47
chiragsalian
chiragsalian previously approved these changes Sep 18, 2024
Contributor

@chiragsalian chiragsalian left a comment

Code LGTM, left a question and a NAB

sqlitecluster/SQLiteNode.cpp
sqlitecluster/SQLiteNode.h (outdated)
@@ -2785,3 +2790,13 @@ void SQLiteNode::kill() {
peer->reset();
}
}

string SQLiteNode::_getLostQuorumLogMessage() const {
string lostQuormMessage;
Contributor

Suggested change
- string lostQuormMessage;
+ string lostQuorumMessage;

@@ -1591,12 +1591,14 @@ void SQLiteNode::_onMESSAGE(SQLitePeer* peer, const SData& message) {
uint64_t commitNum = SToUInt64(message["hashMismatchNumber"]);
_db.getCommits(commitNum, commitNum, result);
_forkedFrom.insert(peer->name);

Contributor

This isn't removing the blank line, just the spaces in the blank line

Suggested change

@rafecolton rafecolton self-requested a review September 20, 2024 16:51
Co-authored-by: Chirag Chandrakant Salian <[email protected]>
Contributor

@chiragsalian chiragsalian left a comment

LGTM

@tylerkaraszewski tylerkaraszewski merged commit cedf037 into main Sep 23, 2024
1 check passed
@tylerkaraszewski tylerkaraszewski deleted the tyler-update-fork-logs branch September 23, 2024 18:19
3 participants