RFS refactoring into separate component programs + new work coordination mechanisms #739

gregschohn · 2024-06-19T17:40:26Z

Description

[Describe what this change achieves]

Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation)
Why these changes are required?
What is the old behavior before changes and new behavior after changes?

Issues Resolved

[List any issues this PR will resolve]

Is this a backport? If so, please add backport PR # and/or commits #

Testing

[Please provide details of testing done: unit testing, integration testing and manual testing]

Check List

New functionality includes testing
- All tests pass, including unit test, integration test and doctest
New functionality has been documented
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Greg Schohn <[email protected]>

…o contention that we need to mitigate. If there is contention, there will be multiple snapshot requests and multiple processes will poll and proceed once the snapshot has been taken. Signed-off-by: Greg Schohn <[email protected]>

Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # RFS/src/main/java/com/rfs/RunRfsWorker.java # RFS/src/main/java/com/rfs/worker/Runner.java # RFS/src/test/java/com/rfs/worker/SnapshotRunnerTest.java

Signed-off-by: Greg Schohn <[email protected]>

… quality of the code. Signed-off-by: Greg Schohn <[email protected]>

…uction ready interface that can be used to coordinate work. The code doesn't compile, the requests aren't fully featured, and the test is in the process of being shelled and rewritten (let alone clocked up to stress the interfaces). Signed-off-by: Greg Schohn <[email protected]>

…se works Signed-off-by: Greg Schohn <[email protected]>

Amp the unit test for WorkCoordinator up to stress this more. Signed-off-by: Greg Schohn <[email protected]>

Signed-off-by: Greg Schohn <[email protected]>

…e leases expire or exceptions happen. The code compiles, but I haven't written tests for a lot of those pieces yet. Signed-off-by: Greg Schohn <[email protected]>

…ut any Coordinated Lease Locking. Leases aren't required for the first phases of snapshot migration since they're run for a short period of time and are expected to be orchestrated externally. If that decision were to reverse, it should be straightforward to wrap the methods in a ScopedWorkCoordinatorHelper. Tests that mocked internal details were deleted because those details have changed. A future commit will replace those with tests further up the stack. Signed-off-by: Greg Schohn <[email protected]>

…kCoordinator class. Signed-off-by: Greg Schohn <[email protected]>

Some files came into being or went out of being across the merge, so some of the changes may be greater than I would have liked for a merge. Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # RFS/src/main/java/com/rfs/RunRfsWorker.java # RFS/src/main/java/com/rfs/worker/DocumentsRunner.java # RFS/src/main/java/com/rfs/worker/DocumentsStep.java # RFS/src/main/java/com/rfs/worker/IndexStep.java # RFS/src/test/java/com/rfs/worker/DocumentsRunnerTest.java # RFS/src/test/java/com/rfs/worker/DocumentsStepTest.java

Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java # DocumentsFromSnapshotMigration/src/main/java/com/rfs/RfsMigrateDocuments.java # MetadataMigration/src/main/java/com/rfs/MetadataMigration.java

…RFS JUnit test. Signed-off-by: Greg Schohn <[email protected]>

Signed-off-by: Greg Schohn <[email protected]>

…d have the unpacker provide the filepath to it. Comment out some tests/parts of tests that are currently problematic/flaky. Signed-off-by: Greg Schohn <[email protected]>

…ove it to DocumentFromSnapshotMigration Signed-off-by: Greg Schohn <[email protected]>

…d have the unpacker provide the filepath to it. Comment out some tests/parts of tests that are currently problematic/flaky. Signed-off-by: Greg Schohn <[email protected]>

peternied

Whew! Done with my initial pass, need to sleep on this change and revisit tomorrow, lots of changes here - great work

peternied · 2024-06-19T19:25:24Z

RFS/src/test/resources/log4j2.xml

@@ -9,6 +9,12 @@
        <Root level="debug">
            <AppenderRef ref="Console"/>
        </Root>
+        <Logger name="org.apache.hc.client5.http.wire" level="debug" additivity="false">


Do we need this level of logging? Could we have it only enabled on specific tests where this would be useful?

RFS/src/main/java/com/rfs/cms/AbstractedHttpClient.java

peternied · 2024-06-19T20:00:36Z

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

+                "}\n";
+
+        var response = httpClient.makeJsonRequest(PUT_METHOD, INDEX_NAME, null, body);
+        if ((response.getStatusCode() / 100) != 2) {


This is cute - but hard to understand on first glance. If we are going through the trouble of making our own http client abstraction it could at the very least have a helper method for functionality like this

E.g. response.wasNotSuccessful()

peternied · 2024-06-19T20:07:23Z

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

+        if (resultFromUpdate == DocumentModificationResult.CREATED) {
+            return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
+        } else {
+            final var httpResponse = httpClient.makeJsonRequest(GET_METHOD, INDEX_NAME + "/_doc/" + workItemId,


Why do we need a second request? I feel like I'm missing something

peternied · 2024-06-19T20:08:28Z

RFS/src/main/java/com/rfs/cms/OpenSearchWorkCoordinator.java

+
+        if (resultFromUpdate == DocumentModificationResult.CREATED) {
+            return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
+        } else {


Since you are return in the if statement it would be easier to read if you got rid of the else block so all DocumentModificationResult checks were on the same vertical indent

RFS/src/test/java/com/rfs/cms/WorkCoordinatorTest.java

peternied · 2024-06-19T20:46:24Z

RFS/src/test/java/com/rfs/cms/WorkCoordinatorTest.java

+            log.error("Next work item picked=" + nextWorkItem);
+            Assertions.assertNotNull(nextWorkItem);
+            Assertions.assertNotNull(nextWorkItem.workItemId);
+            Assertions.assertTrue(nextWorkItem.leaseExpirationTime.isAfter(Instant.now()));


Doesn't seem to make sense to compare 'local' time against the clusters timestamp, that seems like a source of random failure in the future

Right now, the clients (Document migration workers) handle leases by the 'honor system'. If the clock on the client is way off, we'll have multiple parties stepping on each others toes. This is a simple sanity check to build some protection against things from completely falling apart.

I'd like to maintain the client time for the kill-switch based off of the cluster time, as it's returned in the http headers of each target request. The responses may be for different nodes and there could be skew there, so that could be problematic too.

There is the general problem of a single source of truth for time I think we don't need to verify that scenario in this case of these tests. What would we do about a failure at this line - it seems like unactionable failure to me?

I'd rather remove the validation of this concept from these tests and bake them into separate more focused test cases that assure the time management/coordination is working as expected - this seems like a good candidate for a follow up (and I don't think is a code requirement for the moment). What do you think?

DocumentsFromSnapshotMigration/src/test/java/com/rfs/FullTest.java

peternied · 2024-06-19T20:54:01Z

DocumentsFromSnapshotMigration/src/test/java/com/rfs/FullTest.java

+
+    @BeforeAll
+    static void setupSourceContainer() {
+        esSourceContainer.start();


Think you need to put some things in the source container to expect to be migrated

The source container is the elasticsearch_rfs_source container, which is being built automatically by gradle (every time still :( ). For now, that includes our 4 OpenSearch Benchmark test runs of 1000 docs / index. I'd like to improve our cacheable image generation and to include different variants (7.17, 6.8; many more shards; fewer shards), etc.

That will probably be in future PRs though.

Added methods to the work coordinator to show if there are work items remaining and another for the count. FullTest is beginning to make some use of those to confirm that scaling down measures can work. That test still doesn't work though as more features are being added to it. Signed-off-by: Greg Schohn <[email protected]>

… all of our acquisition calls, etc. Signed-off-by: Greg Schohn <[email protected]>

Make sure that the indices match between target and source. Swallow exceptions that might have come out of the migration run and let the run start over again, just as orchestrated containers would do. Signed-off-by: Greg Schohn <[email protected]>

…CoordinationWork Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # DocumentsFromSnapshotMigration/build.gradle # DocumentsFromSnapshotMigration/src/test/java/com/rfs/FullTest.java # RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java

Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java # RFS/build.gradle

Signed-off-by: Greg Schohn <[email protected]>

…pull the right image for the slowTest target. buildDockerImage_elasticsearchRFSSource is run completely every time. The DockerBuildImage stuff for dockerSolution makes good use of inputs/outputs to do more minimal, dependency-driven builds. This pattern uses multiple containers, so inputs and outputs will need to be managed externally. Signed-off-by: Greg Schohn <[email protected]>

…ion" for CDK & README Signed-off-by: Greg Schohn <[email protected]>

peternied · 2024-06-21T15:10:46Z

DocumentsFromSnapshotMigration/build.gradle

+
+task slowTest(type: Test) {
+    useJUnitPlatform()
+    dependsOn buildDockerImage_elasticsearchRFSSource


This slows down the slow tests even more. Why can't we use test containers and populate the documents during the test - this decoupling of test setup and test validation will slow us down over time sustainably IMO.

Right now, I think that the time is roughly the same. It takes about 2 minutes to load the test data that I'm using. The hope with the prebuilt image is that if I'm running repeatedly, I don't need to keep regenerating my source data. I'll work on making the prebuilt stuff cacheable (it isn't now, which admittedly sucks), but the additional tax is just a couple extra docker operations (and an extra startup cycle of ES).

peternied · 2024-06-21T15:21:03Z

RFS/src/main/java/com/rfs/worker/SnapshotRunner.java

+            log.info("Snapshot not finished yet; sleeping for 1 seconds...");
+            Thread.sleep(1000);


This or something like it to prevent the log from going out of sync

Suggested change

log.info("Snapshot not finished yet; sleeping for 1 seconds...");

Thread.sleep(1000);

var waitPeriod = Duration.ofSeconds(1);

log.info("Snapshot not finished yet; sleeping for " + waitPeriod.toMillis() + "ms...");

Thread.sleep(waitPeriod.toMillis());

whoops - I had copied that code in. Values are fixed now

peternied · 2024-06-21T15:22:59Z

RFS/src/test/java/com/rfs/cms/WorkCoordinatorTest.java

-            seenWorkerItems.put(nextWorkItem.workItemId, nextWorkItem.workItemId);
-            return null;
+            return workCoordinator.acquireNextWorkItem(expirationWindow).visit(
+                    new IWorkCoordinator.WorkAcquisitionOutcomeVisitor<>() {


Nit: this would be a good use case for a mock so you only need to handle the onAcquireWork method

peternied · 2024-06-21T15:30:17Z

RFS/src/main/java/com/rfs/cms/IWorkCoordinator.java

+    class NoAvailableWorkToBeDone implements WorkAcquisitionOutcome {
+        @Override
+        public <T> T visit(WorkAcquisitionOutcomeVisitor<T> v) throws IOException {
+            return null;


This looks like a bug, shouldn't this be v.onNoAvailableWorkToBeDone?

peternied · 2024-06-21T15:33:03Z

RFS/src/main/java/com/rfs/cms/IWorkCoordinator.java

+import java.time.Duration;
+import java.time.Instant;
+
+public interface IWorkCoordinator extends AutoCloseable {


So there is the problem of allocating work, then there is the problem of defining work to the allocated, having a single interface in one place makes this harder to understand, seems like this would be better suited to be in two interfaces - in the case where there is a dual consumer and producer, it could just implement both interfaces. What do you think?

peternied · 2024-06-21T15:34:53Z

RFS/src/main/java/com/rfs/cms/ScopedWorkCoordinatorHelper.java

+    }
+
+    public interface WorkItemGetter {
+        @NonNull IWorkCoordinator.WorkAcquisitionOutcome apply(IWorkCoordinator wc);


Or maybe tryAcquire(...) is a better fit?

peternied

I know we are pushing towards a deadline - and I think we've made huge strides with this PR. What I would advise to keep moving fast would be to merge as is and then rather than create follow up items - just keep working on the outstanding comments on this PR until they are all resolved. What do you think?

I've just done a pass of them and there are far fewer that we need to put this set of changes to 'rest'. Making github issues or jira tasks seems like low-value overhead when we've got the context in this PR.

Signed-off-by: Greg Schohn <[email protected]>

…rdinationPRFeedback

codecov · 2024-06-21T16:40:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.37%. Comparing base (57007c0) to head (787fd26).
Report is 9 commits behind head on main.

❗ Current head 787fd26 differs from pull request most recent head 8089d8b

Please upload reports for the commit 8089d8b to get more accurate results.

Additional details and impacted files

@@              Coverage Diff              @@
##               main     #739       +/-   ##
=============================================
+ Coverage     69.82%   89.37%   +19.55%     
=============================================
  Files           269       46      -223     
  Lines         11815     2702     -9113     
  Branches        772        0      -772     
=============================================
- Hits           8250     2415     -5835     
+ Misses         3158      287     -2871     
+ Partials        407        0      -407

Flag	Coverage Δ
gradle-test	`?`
python-test	`89.37% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Greg Schohn <[email protected]>

gregschohn added 17 commits June 10, 2024 18:29

POC/unit test code to use painless scripts to do lease acquisition.

dd4b1d1

Signed-off-by: Greg Schohn <[email protected]>

Merge branch 'main' into RfsCoordinationWork

6a457ff

Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # RFS/src/main/java/com/rfs/RunRfsWorker.java # RFS/src/main/java/com/rfs/worker/Runner.java # RFS/src/test/java/com/rfs/worker/SnapshotRunnerTest.java

Bugfix in update logic of the CMS unit test POC.

dd360ac

Signed-off-by: Greg Schohn <[email protected]>

Checkpoint on the coordination POC test harness. No guarantees on the…

3d5a5a2

… quality of the code. Signed-off-by: Greg Schohn <[email protected]>

Bugfixes and test improvement to show doling out work in the happy ca…

7746f5f

…se works Signed-off-by: Greg Schohn <[email protected]>

Make some improvements to mitigate contention for acquiring work.

f805701

Amp the unit test for WorkCoordinator up to stress this more. Signed-off-by: Greg Schohn <[email protected]>

Start converting the document migration phase to use the new primitives.

b7923f6

Signed-off-by: Greg Schohn <[email protected]>

Starting to change how process management (death) works in cases wher…

611140f

…e leases expire or exceptions happen. The code compiles, but I haven't written tests for a lot of those pieces yet. Signed-off-by: Greg Schohn <[email protected]>

Remove "CMS" implementations as they're being replaced by the new Wor…

3683c44

…kCoordinator class. Signed-off-by: Greg Schohn <[email protected]>

WIP, doesn't compile. A lot of in-progress changes to support a full …

8703c20

…RFS JUnit test. Signed-off-by: Greg Schohn <[email protected]>

A lot of bugfixes to get the FullTest moving further along

0275dd5

Signed-off-by: Greg Schohn <[email protected]>

Update the DocumentsReader API to not take indexname/shard and instea…

8733c01

…d have the unpacker provide the filepath to it. Comment out some tests/parts of tests that are currently problematic/flaky. Signed-off-by: Greg Schohn <[email protected]>

gregschohn requested review from AndreKurait, chelma, lewijacn, mikaylathompson, okhasawn, peternied and sumobrian as code owners June 19, 2024 17:40

gregschohn force-pushed the BuildReindexFromSnapshotViaNewDocumentProject branch from dbb93a4 to cede889 Compare June 19, 2024 18:12

Extract some of the build logic to build a docker image for RFS and m…

a9b6709

…ove it to DocumentFromSnapshotMigration Signed-off-by: Greg Schohn <[email protected]>

gregschohn force-pushed the BuildReindexFromSnapshotViaNewDocumentProject branch from cede889 to a9b6709 Compare June 19, 2024 18:27

Update the DocumentsReader API to not take indexname/shard and instea…

d2514dc

…d have the unpacker provide the filepath to it. Comment out some tests/parts of tests that are currently problematic/flaky. Signed-off-by: Greg Schohn <[email protected]>

gregschohn mentioned this pull request Jun 19, 2024

Collection of RFS tweaks and fixes #743

Closed

4 tasks

peternied reviewed Jun 19, 2024

View reviewed changes

peternied mentioned this pull request Jun 20, 2024

New RFS work coordination #715

Closed

4 tasks

gregschohn added 7 commits June 20, 2024 19:01

Fix some lingering reporting issues to try to get better certainty in…

69d7d58

… all of our acquisition calls, etc. Signed-off-by: Greg Schohn <[email protected]>

Merge branch 'main' into RfsCoordinationWork

ee6fbc8

Signed-off-by: Greg Schohn <[email protected]> # Conflicts: # CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java # RFS/build.gradle

Fix documentation linting errors

7905f4c

Signed-off-by: Greg Schohn <[email protected]>

Path updates to move "RFS" references to "DocumentsFromSnapshotMigrat…

6f50219

…ion" for CDK & README Signed-off-by: Greg Schohn <[email protected]>

peternied reviewed Jun 21, 2024

View reviewed changes

peternied approved these changes Jun 21, 2024

View reviewed changes

Minor bugfixes and refactoring improvements as per PR feedback.

9aead15

Signed-off-by: Greg Schohn <[email protected]>

gregschohn force-pushed the BuildReindexFromSnapshotViaNewDocumentProject branch from 9aead15 to 6f50219 Compare June 21, 2024 16:36

gregschohn added 2 commits June 21, 2024 12:38

Merge branch 'main' into BuildReindexFromSnapshotViaNewDocumentProject

787fd26

Merge branch 'BuildReindexFromSnapshotViaNewDocumentProject' into Coo…

8ba51fc

…rdinationPRFeedback

Test bugfix for when not EVERY worker throws an exception.

8089d8b

Signed-off-by: Greg Schohn <[email protected]>

gregschohn merged commit 1383562 into opensearch-project:main Jun 21, 2024
10 checks passed

gregschohn deleted the BuildReindexFromSnapshotViaNewDocumentProject branch June 21, 2024 18:04

gregschohn changed the title ~~Build reindex from snapshot via new document project~~ RFS refactoring into separate component programs + new work coordination mechanisms Jun 26, 2024

peternied mentioned this pull request Jul 12, 2024

CmsClient should have a different type object for optimistic locking #697

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFS refactoring into separate component programs + new work coordination mechanisms #739

RFS refactoring into separate component programs + new work coordination mechanisms #739

gregschohn commented Jun 19, 2024

peternied left a comment

peternied Jun 19, 2024

peternied Jun 19, 2024

peternied Jun 19, 2024

peternied Jun 19, 2024

peternied Jun 19, 2024

peternied Jun 19, 2024

gregschohn Jun 21, 2024

peternied Jun 21, 2024

peternied Jun 19, 2024

gregschohn Jun 21, 2024

peternied Jun 21, 2024

gregschohn Jun 21, 2024

peternied Jun 21, 2024

gregschohn Jun 21, 2024

peternied Jun 21, 2024

peternied Jun 21, 2024

gregschohn Jun 21, 2024

peternied Jun 21, 2024

peternied Jun 21, 2024

peternied left a comment

codecov bot commented Jun 21, 2024 •

edited

Loading

		log.info("Snapshot not finished yet; sleeping for 1 seconds...");
		Thread.sleep(1000);

-            log.info("Snapshot not finished yet; sleeping for 1 seconds...");
-            Thread.sleep(1000);
+            var waitPeriod = Duration.ofSeconds(1);
+            log.info("Snapshot not finished yet; sleeping for " + waitPeriod.toMillis() + "ms...");
+            Thread.sleep(waitPeriod.toMillis());

RFS refactoring into separate component programs + new work coordination mechanisms #739

RFS refactoring into separate component programs + new work coordination mechanisms #739

Conversation

gregschohn commented Jun 19, 2024

Description

Issues Resolved

Testing

Check List

peternied left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peternied left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 21, 2024 • edited Loading

Codecov Report

codecov bot commented Jun 21, 2024 •

edited

Loading