Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Add cancellation support in RemoteStoreReplicationSource #9234

Closed
wants to merge 7 commits into from

Conversation

dreamer-89
Copy link
Member

@dreamer-89 dreamer-89 commented Aug 10, 2023

Description

Adds cancellation to RemoteStoreReplicationSource and checks for event cancellation in getCheckpointMetadata and getSegmentFiles method calls. This is useful as it allows:

  1. Fail fast
  2. Fail gracefully compared to existing shard/store/engine AlreadyClosedExceptions.

Related Issues

Resolves #8089

Testing

Screenshot 2023-08-15 at 10 11 44 AM

One test testDropRandomNodeDuringReplication failure due to assertion trip which seems unrelated. Ran test against main and I am still seeing this failure. So, this is un-related to this change.

java.lang.AssertionError: shard [test-idx-1][1] is not locked
	at __randomizedtesting.SeedInfo.seed([2857D29B5BA0C1F5]:0)
	at org.opensearch.env.NodeEnvironment.deleteShardDirectoryUnderLock(NodeEnvironment.java:576)
	at org.opensearch.indices.IndicesService.deleteShardStore(IndicesService.java:1143)
	at org.opensearch.index.IndexService.onShardClose(IndexService.java:659)
	at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:782)
	at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:769)
	at org.opensearch.index.store.Store.closeInternal(Store.java:550)
	at org.opensearch.index.store.Store$1.closeInternal(Store.java:190)
	at org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:78)
	at org.opensearch.index.store.Store.decRef(Store.java:523)
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1775)
	at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1751)
	at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4325)
	at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:995)
	at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1128)
	at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:



> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/reporting.git]
Compatible components: [https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

BUILD SUCCESSFUL in 27m 12s

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member Author

Gradle Check (Jenkins) Run Completed with:

org.opensearch.remotestore.SegmentReplicationWithRemoteStorePressureIT.testAddReplicaWhileWritesBlocked

Issue is not repro'able locally.

 ./gradlew ':server:internalClusterTest' --tests "org.opensearch.remotestore.SegmentReplicationWithRemoteStorePressureIT.testAddReplicaWhileWritesBlocked" -Dtests.seed=5393342D5852BC15 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=it-IT -Dtests.timezone=Singapore -Druntime.java=17
...
java.lang.AssertionError: timed out waiting for green state
	at __randomizedtesting.SeedInfo.seed([5393342D5852BC15:2AFFA3BDC6EDDB7B]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:1013)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:944)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:933)
	at org.opensearch.index.SegmentReplicationPressureIT.testAddReplicaWhileWritesBlocked(SegmentReplicationPressureIT.java:178)

Copy link
Member

@mch2 mch2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we run

public class SegmentReplicationSuiteIT extends SegmentReplicationBaseIT {
with remote store enabled to ensure resources are cleaned up on close?

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov
Copy link

codecov bot commented Aug 10, 2023

Codecov Report

Merging #9234 (4fc4af6) into main (61c5f17) will increase coverage by 0.11%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #9234      +/-   ##
============================================
+ Coverage     71.02%   71.13%   +0.11%     
- Complexity    57365    57432      +67     
============================================
  Files          4776     4776              
  Lines        270809   270806       -3     
  Branches      39582    39582              
============================================
+ Hits         192331   192647     +316     
+ Misses        62271    62026     -245     
+ Partials      16207    16133      -74     
Files Changed Coverage Δ
.../indices/replication/SegmentReplicationTarget.java 90.72% <100.00%> (+1.72%) ⬆️

... and 499 files with indirect coverage changes

@github-actions

This comment was marked as outdated.

@opensearch-trigger-bot

This comment was marked as outdated.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh

@github-actions

This comment was marked as outdated.

@opensearch-trigger-bot

This comment was marked as outdated.

@opensearch-trigger-bot

This comment was marked as outdated.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions

This comment was marked as outdated.

@opensearch-trigger-bot

This comment was marked as outdated.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member Author

Gradle Check (Jenkins) Run Completed with:

Screenshot 2023-08-16 at 9 01 25 PM

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -159,13 +159,15 @@ public void startReplication(ActionListener<Void> listener) {
// Get list of files to copy from this checkpoint.
state.setStage(SegmentReplicationState.Stage.GET_CHECKPOINT_INFO);
cancellableThreads.checkForCancel();
source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener);
cancellableThreads.execute(() -> source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener));

checkpointInfoListener.whenComplete(checkpointInfo -> {
final List<StoreFileMetadata> filesToFetch = getFiles(checkpointInfo);
state.setStage(SegmentReplicationState.Stage.GET_FILES);
cancellableThreads.checkForCancel();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - if we are moving to a sync cancel here and wrapping calls to the replication source I don't think this class needs the checkForCancel checks anymore before invoking the source methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove checkForCancel call before/after ReplicationSource call. The remaining checkForCancel would help in fail-fast when cancellation happens after source calls e.g. cancellation just after getSegmentFiles method call but before finalizeReplication inside SegmentReplicationTarget.

Copy link
Member Author

@dreamer-89 dreamer-89 Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see remaining checkForCancel() calls are right after the ReplicationSource method calls (first statement inside SegmentReplicationTarget.getFiles, SegmentReplicationTarget.finalizeReplication), which means we can remove these calls as well. BUT, it is still possible that cancellation happens outside of replication source method calls, in which case resulting in non-graceful cancellations. I think we should also wrap getFiles and finalizeReplication method calls inside CancellationThreads.execute(). This is required as target contains heavy-weight operations such as finalizeReplication (reads commit from disk) and there are chances that cancellation happens during method execution.

…ionSource

Signed-off-by: Suraj Singh <[email protected]>

Self review

Signed-off-by: Suraj Singh <[email protected]>

Address review comments

Signed-off-by: Suraj Singh <[email protected]>

Address review comments

Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Suraj Singh <[email protected]>

class TestRSReplicationSource extends RemoteStoreReplicationSource {

private final Thread beforeCheckpoint;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Created threads vs Runnable to more closely mimic the actual scenario where cancellations happen in separate threads. Sample trace showing segrep event & cancellation invoke in separate threads (note Thread %tid in logs below).

[2023-08-21T19:42:05,088][INFO ][o.o.i.r.SegmentReplicationTarget] [org.opensearch.index.shard.RemoteIndexShardTests] [test][0] [Thread 42] Starting Replication Target: Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource] with thread opensearch[org.opensearch.index.shard.RemoteIndexShardTests][generic][T#3]
[2023-08-21T19:42:05,088][INFO ][o.o.i.r.SegmentReplicationTarget] [[Thread-5]] [test][0] [Thread 46] Cancelling replication for target Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource]
[2023-08-21T19:42:05,090][ERROR][o.o.i.r.SegmentReplicationTargetService] [org.opensearch.index.shard.RemoteIndexShardTests] [Thread 42] Error during segment replication, Id:[3] Checkpoint [ReplicationCheckpoint{shardId=[test][0], primaryTerm=88, segmentsGen=3, version=7, size=0, codec=Lucene95}] Shard:[[test][0]] Source:[TestReplicationSource]
...

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member

mch2 commented Aug 23, 2023

@dreamer-89 FYI #9499 - Not sure if your changes here will fix this or not but would like to first determine the source of flakiness here.

@dreamer-89
Copy link
Member Author

Discussed separately with @mch2, there is no data point around whether this change is needed for 2.10.0 release provided existing tests are stable and we are close to release. We may need to revisit this change, identify use-cases where it helps. I will keep the PR open in case we identify it is needed.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Oct 2, 2023
@mch2
Copy link
Member

mch2 commented Nov 14, 2023

Cancellation support was added as part of #10725, closing this stalled PR.

@mch2 mch2 closed this Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip-changelog stalled Issues that have stalled
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gracefully cancel the replication when the replica shard is closed
2 participants