Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFS refactoring into separate component programs + new work coordination mechanisms #739

Conversation

gregschohn
Copy link
Collaborator

Description

[Describe what this change achieves]

  • Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation)
  • Why these changes are required?
  • What is the old behavior before changes and new behavior after changes?

Issues Resolved

[List any issues this PR will resolve]

Is this a backport? If so, please add backport PR # and/or commits #

Testing

[Please provide details of testing done: unit testing, integration testing and manual testing]

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…o contention that we need to mitigate.

If there is contention, there will be multiple snapshot requests and multiple processes will poll and proceed once the snapshot has been taken.

Signed-off-by: Greg Schohn <[email protected]>
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	RFS/src/main/java/com/rfs/RunRfsWorker.java
#	RFS/src/main/java/com/rfs/worker/Runner.java
#	RFS/src/test/java/com/rfs/worker/SnapshotRunnerTest.java
…uction ready interface that can be used to coordinate work.

The code doesn't compile, the requests aren't fully featured, and the test is in the process of being shelled and rewritten (let alone clocked up to stress the interfaces).

Signed-off-by: Greg Schohn <[email protected]>
Amp the unit test for WorkCoordinator up to stress this more.

Signed-off-by: Greg Schohn <[email protected]>
…e leases expire or exceptions happen.

The code compiles, but I haven't written tests for a lot of those pieces yet.

Signed-off-by: Greg Schohn <[email protected]>
…ut any Coordinated Lease Locking.

Leases aren't required for the first phases of snapshot migration since they're run for a short period of time and are expected to be orchestrated externally.  If that decision were to reverse, it should be straightforward to wrap the methods in a ScopedWorkCoordinatorHelper.
Tests that mocked internal details were deleted because those details have changed.  A future commit will replace those with tests further up the stack.

Signed-off-by: Greg Schohn <[email protected]>
Some files came into being or went out of being across the merge, so some of the changes may be greater than I would have liked for a merge.

Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	RFS/src/main/java/com/rfs/RunRfsWorker.java
#	RFS/src/main/java/com/rfs/worker/DocumentsRunner.java
#	RFS/src/main/java/com/rfs/worker/DocumentsStep.java
#	RFS/src/main/java/com/rfs/worker/IndexStep.java
#	RFS/src/test/java/com/rfs/worker/DocumentsRunnerTest.java
#	RFS/src/test/java/com/rfs/worker/DocumentsStepTest.java
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java
#	DocumentsFromSnapshotMigration/src/main/java/com/rfs/RfsMigrateDocuments.java
#	MetadataMigration/src/main/java/com/rfs/MetadataMigration.java
…d have the unpacker provide the filepath to it.

Comment out some tests/parts of tests that are currently problematic/flaky.

Signed-off-by: Greg Schohn <[email protected]>
…ove it to DocumentFromSnapshotMigration

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn force-pushed the BuildReindexFromSnapshotViaNewDocumentProject branch from cede889 to a9b6709 Compare June 19, 2024 18:27
…d have the unpacker provide the filepath to it.

Comment out some tests/parts of tests that are currently problematic/flaky.

Signed-off-by: Greg Schohn <[email protected]>
Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew! Done with my initial pass, need to sleep on this change and revisit tomorrow, lots of changes here - great work

@@ -9,6 +9,12 @@
<Root level="debug">
<AppenderRef ref="Console"/>
</Root>
<Logger name="org.apache.hc.client5.http.wire" level="debug" additivity="false">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this level of logging? Could we have it only enabled on specific tests where this would be useful?

"}\n";

var response = httpClient.makeJsonRequest(PUT_METHOD, INDEX_NAME, null, body);
if ((response.getStatusCode() / 100) != 2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cute - but hard to understand on first glance. If we are going through the trouble of making our own http client abstraction it could at the very least have a helper method for functionality like this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. response.wasNotSuccessful()

if (resultFromUpdate == DocumentModificationResult.CREATED) {
return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
} else {
final var httpResponse = httpClient.makeJsonRequest(GET_METHOD, INDEX_NAME + "/_doc/" + workItemId,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a second request? I feel like I'm missing something


if (resultFromUpdate == DocumentModificationResult.CREATED) {
return new WorkItemAndDuration(workItemId, startTime.plus(leaseDuration));
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are return in the if statement it would be easier to read if you got rid of the else block so all DocumentModificationResult checks were on the same vertical indent

RFS/src/test/java/com/rfs/cms/WorkCoordinatorTest.java Outdated Show resolved Hide resolved
log.error("Next work item picked=" + nextWorkItem);
Assertions.assertNotNull(nextWorkItem);
Assertions.assertNotNull(nextWorkItem.workItemId);
Assertions.assertTrue(nextWorkItem.leaseExpirationTime.isAfter(Instant.now()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem to make sense to compare 'local' time against the clusters timestamp, that seems like a source of random failure in the future

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the clients (Document migration workers) handle leases by the 'honor system'. If the clock on the client is way off, we'll have multiple parties stepping on each others toes. This is a simple sanity check to build some protection against things from completely falling apart.

I'd like to maintain the client time for the kill-switch based off of the cluster time, as it's returned in the http headers of each target request. The responses may be for different nodes and there could be skew there, so that could be problematic too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the general problem of a single source of truth for time I think we don't need to verify that scenario in this case of these tests. What would we do about a failure at this line - it seems like unactionable failure to me?

I'd rather remove the validation of this concept from these tests and bake them into separate more focused test cases that assure the time management/coordination is working as expected - this seems like a good candidate for a follow up (and I don't think is a code requirement for the moment). What do you think?


@BeforeAll
static void setupSourceContainer() {
esSourceContainer.start();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you need to put some things in the source container to expect to be migrated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source container is the elasticsearch_rfs_source container, which is being built automatically by gradle (every time still :( ). For now, that includes our 4 OpenSearch Benchmark test runs of 1000 docs / index. I'd like to improve our cacheable image generation and to include different variants (7.17, 6.8; many more shards; fewer shards), etc.

That will probably be in future PRs though.

Added methods to the work coordinator to show if there are work items remaining and another for the count.
FullTest is beginning to make some use of those to confirm that scaling down measures can work.  That test still doesn't work though as more features are being added to it.

Signed-off-by: Greg Schohn <[email protected]>
@peternied peternied mentioned this pull request Jun 20, 2024
4 tasks
… all of our acquisition calls, etc.

Signed-off-by: Greg Schohn <[email protected]>
Make sure that the indices match between target and source.  Swallow exceptions that might have come out of the migration run and let the run start over again, just as orchestrated containers would do.

Signed-off-by: Greg Schohn <[email protected]>
…CoordinationWork

Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	DocumentsFromSnapshotMigration/build.gradle
#	DocumentsFromSnapshotMigration/src/test/java/com/rfs/FullTest.java
#	RFS/src/main/java/com/rfs/common/LuceneDocumentsReader.java
Signed-off-by: Greg Schohn <[email protected]>

# Conflicts:
#	CreateSnapshot/src/main/java/com/rfs/CreateSnapshot.java
#	RFS/build.gradle
…pull the right image for the slowTest target.

buildDockerImage_elasticsearchRFSSource is run completely every time.  The DockerBuildImage stuff for dockerSolution makes good use of inputs/outputs to do more minimal, dependency-driven builds.
This pattern uses multiple containers, so inputs and outputs will need to be managed externally.

Signed-off-by: Greg Schohn <[email protected]>

task slowTest(type: Test) {
useJUnitPlatform()
dependsOn buildDockerImage_elasticsearchRFSSource
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slows down the slow tests even more. Why can't we use test containers and populate the documents during the test - this decoupling of test setup and test validation will slow us down over time sustainably IMO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, I think that the time is roughly the same. It takes about 2 minutes to load the test data that I'm using. The hope with the prebuilt image is that if I'm running repeatedly, I don't need to keep regenerating my source data. I'll work on making the prebuilt stuff cacheable (it isn't now, which admittedly sucks), but the additional tax is just a couple extra docker operations (and an extra startup cycle of ES).

Comment on lines 13 to 14
log.info("Snapshot not finished yet; sleeping for 1 seconds...");
Thread.sleep(1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This or something like it to prevent the log from going out of sync

Suggested change
log.info("Snapshot not finished yet; sleeping for 1 seconds...");
Thread.sleep(1000);
var waitPeriod = Duration.ofSeconds(1);
log.info("Snapshot not finished yet; sleeping for " + waitPeriod.toMillis() + "ms...");
Thread.sleep(waitPeriod.toMillis());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops - I had copied that code in. Values are fixed now

seenWorkerItems.put(nextWorkItem.workItemId, nextWorkItem.workItemId);
return null;
return workCoordinator.acquireNextWorkItem(expirationWindow).visit(
new IWorkCoordinator.WorkAcquisitionOutcomeVisitor<>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this would be a good use case for a mock so you only need to handle the onAcquireWork method

class NoAvailableWorkToBeDone implements WorkAcquisitionOutcome {
@Override
public <T> T visit(WorkAcquisitionOutcomeVisitor<T> v) throws IOException {
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a bug, shouldn't this be v.onNoAvailableWorkToBeDone?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

import java.time.Duration;
import java.time.Instant;

public interface IWorkCoordinator extends AutoCloseable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is the problem of allocating work, then there is the problem of defining work to the allocated, having a single interface in one place makes this harder to understand, seems like this would be better suited to be in two interfaces - in the case where there is a dual consumer and producer, it could just implement both interfaces. What do you think?

}

public interface WorkItemGetter {
@NonNull IWorkCoordinator.WorkAcquisitionOutcome apply(IWorkCoordinator wc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe tryAcquire(...) is a better fit?

Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we are pushing towards a deadline - and I think we've made huge strides with this PR. What I would advise to keep moving fast would be to merge as is and then rather than create follow up items - just keep working on the outstanding comments on this PR until they are all resolved. What do you think?

I've just done a pass of them and there are far fewer that we need to put this set of changes to 'rest'. Making github issues or jira tasks seems like low-value overhead when we've got the context in this PR.

@gregschohn gregschohn force-pushed the BuildReindexFromSnapshotViaNewDocumentProject branch from 9aead15 to 6f50219 Compare June 21, 2024 16:36
Copy link

codecov bot commented Jun 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.37%. Comparing base (57007c0) to head (787fd26).
Report is 9 commits behind head on main.

Current head 787fd26 differs from pull request most recent head 8089d8b

Please upload reports for the commit 8089d8b to get more accurate results.

Additional details and impacted files
@@              Coverage Diff              @@
##               main     #739       +/-   ##
=============================================
+ Coverage     69.82%   89.37%   +19.55%     
=============================================
  Files           269       46      -223     
  Lines         11815     2702     -9113     
  Branches        772        0      -772     
=============================================
- Hits           8250     2415     -5835     
+ Misses         3158      287     -2871     
+ Partials        407        0      -407     
Flag Coverage Δ
gradle-test ?
python-test 89.37% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gregschohn gregschohn merged commit 1383562 into opensearch-project:main Jun 21, 2024
10 checks passed
@gregschohn gregschohn deleted the BuildReindexFromSnapshotViaNewDocumentProject branch June 21, 2024 18:04
@gregschohn gregschohn changed the title Build reindex from snapshot via new document project RFS refactoring into separate component programs + new work coordination mechanisms Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants