Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing conflict check when the last conflict happened on a journal #1882

Merged
merged 15 commits into from
Oct 1, 2024

Conversation

danieldoglas
Copy link
Contributor

@danieldoglas danieldoglas commented Sep 30, 2024

Details

The normal flow for a write command is to try to commit it. If that fails because of a conflict, we will retry the commits three times before sending them to the blocking queue.

When we have a conflict on a page that belongs to a table, that means that some other command is also changing that same page. We'll store the page that conflicted:

_conflictPage = atol(offset);

And then we'll wait until that page is not locked anymore before we try to execute the command again:

Bedrock/BedrockServer.cpp

Lines 910 to 913 in 67b55ff

PageLockGuard pageLock(lastConflictPage);
if (lastConflictPage) {
SINFO("Waited " << (STimeNow() - conflictLockStartTime) << "us for lock on db page " << lastConflictPage << ".");
}

When that happens for any tables in the system, it is not a problem. But when we have a conflict that happened on a journal table, then we're blocking the next commit from happening without reason. The journals are chosen when the commit is being executed, and the chances that we select the same journal on the next commit are low. that considered, we don't need to do a PageLockGuard if the previous conflict was on a journal table.

Fixed Issues

Fixes https://github.com/Expensify/Expensify/issues/432114

Tests

Ran the ConflictSpam test and confirmed that the changed log line is also logging the tables:

2024-09-30T18:35:02.974407+00:00 expensidev2004 bedrock10019: xxxxxx (SQLite.cpp:757) commit [replicate517] [info] part of last conflict page: 31, conflict table: test

then I applied these changes to make the cluster use page conflicts

diff --git a/BedrockServer.cpp b/BedrockServer.cpp
index 544d3e5b..59469959 100644
--- a/BedrockServer.cpp
+++ b/BedrockServer.cpp
@@ -1029,6 +1029,9 @@ void BedrockServer::runCommand(unique_ptr<BedrockCommand>&& _command, bool isBlo
                                 // don't need to lock our next commit on this page conflict.
                                 if (!SStartsWith(lastConflictTable, "journal")) {
                                     lastConflictPage = db.getLastConflictPage();
+                                    SINFO(format("Not skipping lock on journal table {}", lastConflictTable));
+                                } else {
+                                    SINFO(format("Skipping lock on journal table {}", lastConflictTable));
                                 }
                             }
                         }
diff --git a/test/clustertest/BedrockClusterTester.h b/test/clustertest/BedrockClusterTester.h
index d8d11bc4..fb8dbe54 100644
--- a/test/clustertest/BedrockClusterTester.h
+++ b/test/clustertest/BedrockClusterTester.h
@@ -129,6 +129,7 @@ ClusterTester<T>::ClusterTester(ClusterSize size,
             {"-peerList", peerString},
             {"-plugins", pluginsToLoad},
             {"-overrideProcessName", "bedrock" + to_string(nodePort)},
+            {"-enableConflictPageLocks", "true"},
         };
 
         // Now, if any args were supplied, we'll use those instead of these ones (note that this overwrites our

And confirmed that the logs were successfully doing what we wanted them to do:

2024-09-30T19:04:37.866701+00:00 expensidev2004 bedrock10013: xGhV1t (BedrockServer.cpp:1032) runCommand [socket27] [info] Not skipping lock on journal table test
2024-09-30T19:04:37.866977+00:00 expensidev2004 bedrock10013: QZeN1R (BedrockServer.cpp:1034) runCommand [socket32] [info] Skipping lock on journal table journal0004
2024-09-30T19:04:37.867260+00:00 expensidev2004 bedrock10013: DHoH5U (BedrockServer.cpp:1032) runCommand [socket5] [info] Not skipping lock on journal table test

Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

@danieldoglas danieldoglas self-assigned this Sep 30, 2024
@danieldoglas danieldoglas changed the title Removing conflicts checks on journals Removing conflict check when the last conflict happened on a journal Sep 30, 2024
Comment on lines 290 to 301
const char* tableOffset = strstr(zMsg, "part of db table") + 17;

// Check if the tableOffset exists since not all conflicts are on tables
if (tableOffset) {
// Based on the SQLite log line, we should always have ';' after the table name,
// so let's use it to finish this loop.
int i = 0;
while (tableOffset[i] != ';') {
_conflictTable += tableOffset[i];
i++;
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've confirmed this works as expected:

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BedrockServer.cpp Show resolved Hide resolved
@@ -284,6 +285,20 @@ void SQLite::_sqliteLogCallback(void* pArg, int iErrCode, const char* zMsg) {
// 17 is the length of "conflict at page" and the following space.
const char* offset = strstr(zMsg, "conflict at page") + 17;
_conflictPage = atol(offset);

// 17 is the length of "part of db table" and the following space.
const char* tableOffset = strstr(zMsg, "part of db table") + 17;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can start from offset, we already know tablerOffset will be after that.

Also, renaming offset to pageOffset might be clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

sqlitecluster/SQLite.cpp Show resolved Hide resolved

// Let's add this check in case the SQLite log changes and we don't notice it
// since this would generate a runtime error.
if (semicolonOffset) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this IF since the string initializer would generate a runtime error in case semicollongOffset was null:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. You need the if, yes, because otherwise semicolonOffset - tableOffset will be negative.

The error you posted looks like you were setting msg by hand, and that's the error you're seeing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below that that msg error we have a terminate called after throwing an instance of std::length-error`, I just wanted to show that it will fail if we don't have the if

Copy link
Contributor

@tylerkaraszewski tylerkaraszewski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more suggested change.

sqlitecluster/SQLite.cpp Show resolved Hide resolved

// Let's add this check in case the SQLite log changes and we don't notice it
// since this would generate a runtime error.
if (semicolonOffset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. You need the if, yes, because otherwise semicolonOffset - tableOffset will be negative.

The error you posted looks like you were setting msg by hand, and that's the error you're seeing.

BedrockServer.cpp Outdated Show resolved Hide resolved
@@ -778,6 +778,7 @@ void BedrockServer::runCommand(unique_ptr<BedrockCommand>&& _command, bool isBlo
}

int64_t lastConflictPage = 0;
string lastConflictTable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this defined so far up?

@flodnv flodnv merged commit 294538c into main Oct 1, 2024
1 check passed
@flodnv flodnv deleted the dsilva_removingConflictsChecksOnJournals branch October 1, 2024 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants