BN, add timeout for sync workers which are waiting in queue. #5831

cheatfate · 2024-01-27T17:57:47Z

Add a timeout of 1.seconds for sync workers which are waiting in the Q state. This means that the first worker waiting in the queue will have a timeout of 12.seconds, the second worker will have 24.seconds, and so on.
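
The scaling described above can be sketched as follows. This is only an illustration, not the code from this PR: `queueTimeout` is a hypothetical helper, it uses `std/times` instead of chronos, and the figure of 12 blocks per request is an assumption picked only so that a 1-second per-block allowance reproduces the 12 and 24 second values mentioned above.

```nim
import std/times

# Hypothetical helper: wait budget for the worker at `position` in the queue
# (1 = first waiting worker), assuming every request covers `blocksPerRequest`
# blocks and each block is allowed `perBlock` of waiting time.
proc queueTimeout(position, blocksPerRequest: int, perBlock: Duration): Duration =
  perBlock * int64(position * blocksPerRequest)

when isMainModule:
  let perBlock = initDuration(seconds = 1)
  for position in 1 .. 3:
    # 12 blocks per request is an assumed value, see the note above.
    echo "worker ", position, ": ",
      queueTimeout(position, 12, perBlock).inSeconds, "s"
```

The budget grows linearly with the queue position, so workers further back are given more time before they give up waiting.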

Add tests.

github-actions · 2024-01-27T20:31:53Z

Unit Test Results

     9 files   ±0 |  1 115 suites ±0 | 29m 51s ⏱️ -21s
 4 246 tests   +2 |  3 899 ✔️ +2 | 347 💤 ±0 | 0 ❌ ±0
16 932 runs    +6 | 16 534 ✔️ +6 | 398 💤 ±0 | 0 ❌ ±0

Results for commit d02376a. ± Comparison against base commit 9ad8ea0.

♻️ This comment has been updated with latest results.

cheatfate · 2024-01-28T19:01:34Z

This should fix #5794

etan-status

The fix indeed should fix the low-peer scenario, and is relatively tiny.

It's probably okay to push this after the Deneb Mainnet release, just in case a regression gets introduced, as there is no possibility to downgrade from the Deneb release back to a Capella release (Capella releases cannot connect to Deneb networks).

etan-status · 2024-02-05T09:15:07Z

beacon_chain/nimbus_beacon_node.nim

@@ -390,14 +390,15 @@ proc initFullNode(
      dag.cfg.DENEB_FORK_EPOCH, dag.cfg.MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS,
      SyncQueueKind.Forward, getLocalHeadSlot,
      getLocalWallSlot, getFirstSlotAtFinalizedEpoch, getBackfillSlot,
-      getFrontfillSlot, dag.tail.slot, blockVerifier)
+      getFrontfillSlot, dag.tail.slot, blockVerifier,
+      workerBlockWaitTimeout = chronos.seconds(1))


1s intuitively feels a bit short; I could see false positives if a block is stuck in P for a while (for example, the occasional multi-second state replay). The underlying problem only occurs in a low-peer scenario with bad peers failing to provide data. I think it's okay if it takes a bit longer to recover in that edge case, if it means that the happy case is a bit more reliable.

etan-status · 2024-02-05T09:17:53Z

tests/test_sync_manager.nim

+      r23.slot == r13.slot
+      r23.count == r13.count
+      r24.slot == r14.slot
+      r24.count == r14.count


and, also, if r11 fails (incomplete download, failed validation and so on), and p1 goes away due to the corresponding descore for failing to provide correct data, a different peer will eventually pick up r11.

otherwise, r12/r13/r14 would just get stuck again and again (as before).

cheatfate · 2024-02-05

Yep, in the old version r12, r13 and r14 would get stuck waiting for a peer to appear that could provide r11 again.

etan-status · 2024-02-05T09:28:04Z

beacon_chain/sync/sync_queue.nim

+  nanoseconds(
+    int64(sq.chunkSize * sq.chunksCount(sr)) *
+      sq.pendingWorkerBlockWaitTime.nanoseconds)


this is the value if every single request takes the maximum time to complete a single time.

reality could be longer, e.g., when there are retries of earlier sync requests

reality could be faster, e.g., if the time is computed while there are many prior requests, but then those prior requests complete quickly

I wonder if a simpler mechanism with a static timeout of, e.g., 30 seconds could also mitigate the risk of getting stuck. It would take a bit longer to get unstuck than the current solution, but is simpler to reason about.

Alternatively, to get it fully correct may involve having to re-schedule the timeouts whenever a prior request completes.
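
To make the two options concrete, here is a minimal sketch that contrasts a budget scaled by the number of outstanding earlier requests with a static one. All names (`TimeoutPolicy`, `waitBudget`, `requestsAhead`) are hypothetical, the 12-second per-request allowance is an assumed value, and `std/times` stands in for chronos. Calling the helper again after each earlier request completes is one simple way to approximate the re-scheduling mentioned above.

```nim
import std/times

type
  TimeoutPolicy = enum
    tpScaled  ## budget grows with the number of requests still ahead
    tpFixed   ## one static budget regardless of queue position

# Hypothetical helper: how long a queued worker may keep waiting.
# `requestsAhead` counts earlier requests that have not completed yet.
proc waitBudget(policy: TimeoutPolicy, requestsAhead: int,
                perRequest, fixedBudget: Duration): Duration =
  case policy
  of tpScaled: perRequest * int64(requestsAhead)
  of tpFixed: fixedBudget

when isMainModule:
  let perRequest = initDuration(seconds = 12)   # assumed per-request allowance
  let fixedBudget = initDuration(seconds = 30)  # the static value suggested above
  # Re-evaluate the budget each time an earlier request completes.
  for ahead in countdown(3, 1):
    echo ahead, " ahead: scaled=",
      waitBudget(tpScaled, ahead, perRequest, fixedBudget).inSeconds,
      "s fixed=",
      waitBudget(tpFixed, ahead, perRequest, fixedBudget).inSeconds, "s"
```

The static variant never changes, which is what makes it easier to reason about; the scaled variant shrinks as earlier requests finish, but only if something recomputes it.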

etan-status · 2024-03-25T16:47:23Z

On Goerli, a similar situation actually comes up even though the number of available peers is very high.

Sync manager only considers peers viable that report a higher slot progress than the local head. However, because Goerli is partitioned into split views and proposals are infrequent, there are long stretches where the local head may be higher than the peers' branches. This leads to a situation where < 10 peers are actually viable for sync manager at a time, and I have observed the situation where the Q status workers temporarily could not proceed for minutes because all the other workers were stuck in U/R stage.
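
The viability rule described above can be illustrated with a small filter. This is a sketch under stated assumptions, not the sync manager's actual code: the `Peer` type, its fields, and `viablePeers` are hypothetical names introduced only for this example.

```nim
import std/sequtils

type
  Peer = object
    id: string        # hypothetical identifier
    headSlot: uint64  # head slot the peer reports

# A peer counts as viable only if it reports a head slot above the local head.
proc viablePeers(peers: seq[Peer], localHeadSlot: uint64): seq[Peer] =
  peers.filterIt(it.headSlot > localHeadSlot)

when isMainModule:
  let peers = @[
    Peer(id: "a", headSlot: 100'u64),
    Peer(id: "b", headSlot: 90'u64),
    Peer(id: "c", headSlot: 105'u64)]
  echo viablePeers(peers, 100'u64).mapIt(it.id)
```

With a local head at slot 100, only peer `c` passes the filter. A peer that is level with or behind the local head is ignored even though it is connected, which is how a large peer set can shrink to a handful of usable peers.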

Regarding timings, the D/P stages can take quite long, and are the only way to eventually unstick Q workers. If we just want to specify a single timeout number, I think Q should wait for at least 30-90 seconds based on manual observations of sync progress. Alternatively, a shorter timeout, e.g., 5-15 seconds, may be suitable if it is only applied while no other worker is in D/P stage. While other workers are in D/P, I don't think a timeout is needed, as the situation will resolve itself eventually, but if a single timeout value makes the implementation easier, 30-90 seconds should not do too much harm.
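
The conditional variant suggested above could look roughly like this. It is a sketch only: `WorkerStage`, `queueDeadline`, and the 10-second value are assumptions made for illustration, and the enum labels simply mirror the stage letters used in this thread.

```nim
import std/[options, times]

type
  WorkerStage = enum
    wsQ, wsU, wsR, wsD, wsP  # labels mirror the stage letters used above

# Hypothetical policy: a queued worker only gets a deadline while no other
# worker is in the D or P stage; otherwise it keeps waiting without a timeout.
proc queueDeadline(others: openArray[WorkerStage],
                   idleTimeout: Duration): Option[Duration] =
  for stage in others:
    if stage in {wsD, wsP}:
      return none(Duration)
  some(idleTimeout)

when isMainModule:
  let idleTimeout = initDuration(seconds = 10)  # assumed, within the 5-15 s range
  # Nobody is in D/P, so the short timeout applies.
  echo queueDeadline([wsU, wsR], idleTimeout).isSome
  # Another worker is in D, so the queued worker keeps waiting.
  echo queueDeadline([wsU, wsD], idleTimeout).isSome
```

Returning an `Option` keeps "no timeout" distinct from a zero-length one; the single-value alternative would simply return a fixed 30-90 second duration in every case.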

cheatfate added 3 commits January 27, 2024 18:48

Add timeouts to sync queue push operation.

bb3269e

Add tests.

Use 1.seconds timeout.

662c365

Update AllTests.md.

642be87

cheatfate requested review from arnetheduck and etan-status February 4, 2024 00:50

etan-status approved these changes Feb 5, 2024

etan-status added 2 commits February 9, 2024 23:37

Merge branch 'unstable' into sync-queue-timeouts

423a8bc

Merge branch 'unstable' into sync-queue-timeouts

727b3f6

etan-status and others added 6 commits March 25, 2024 18:00

Merge branch 'unstable' into sync-queue-timeouts

f2b3c70

Merge branch 'unstable' into sync-queue-timeouts

d02376a

Merge branch 'stable' into sync-queue-timeouts

7843d8f

version v24.7.0

0a4d3ac

add missing colon at end of changelog line

99f657e

Merge branch 'stable' into sync-queue-timeouts

79dd6c0
