0dt upgrades: environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time #29199
IMO failing the version check should be treated as a panic and use the same exit code as all other panics. I'll look into why it's exiting with 134 and see if I can update it to use exit code 0.
Another tricky thing is testing the fix. It looks like it's version 0.114.0 that's stuck in a crash loop, so no changes we make will take effect until they're included in a release.
That's why I made #29201, which allows us to test it immediately by creating a new version with the next minor. It seems to work: https://buildkite.com/materialize/nightly/builds/9232
The issue seems to be that if the error bubbles up to main, then it will unconditionally panic instead of potentially halting on certain errors (see materialize/src/environmentd/src/bin/environmentd/main.rs, lines 603 to 605, at commit 62d8ae7).
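A minimal sketch of the distinction being discussed, with hypothetical type and function names (these are not Materialize's actual error types): a fence-out by a newer generation maps to a clean exit code 0, while other catalog errors still fail hard.

```rust
// Hypothetical sketch only: `CatalogError` and `exit_code_for` are
// illustrative names, not types from the Materialize codebase.
#[derive(Debug)]
enum CatalogError {
    // A newer deploy generation took over; expected during 0dt upgrades.
    FencedByNewerGeneration,
    // The catalog was written by an incompatible binary version.
    IncompatiblePersistVersion { found: String, current: String },
}

fn exit_code_for(err: &CatalogError) -> i32 {
    match err {
        // Graceful exit: let the orchestration layer tear us down.
        CatalogError::FencedByNewerGeneration => 0,
        // Genuine failure: keep a nonzero exit code.
        CatalogError::IncompatiblePersistVersion { .. } => 1,
    }
}
```

The point is that the decision must be made before the error reaches main's unconditional panic path.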
I had a discussion with @benesch and this is my understanding of the issue. Before #29035, 0dt worked like this:
After #29035, we see the same first four steps, and continue like this:
So we end up in an infinite crash loop. My proposal to fix this is the following:
If an environmentd process discovers that it's been fenced out by an environmentd process with a newer generation, it should exit with code zero (0). The entrypoint.sh script in the environmentd Docker image will then sleep until the orchestration layer gets around to tearing down the container. Previously in this situation, the old environmentd would halt with exit code two (2), then restart and panic upon discovering that the catalog had been written to by a newer version. It would then remain in a panic loop until the orchestration layer got around to tearing down the container. This did not cause incorrectness, but led to log spam in tests and noisy alerts during release windows in staging and production. Fix MaterializeInc#29199.
Proposal LGTM. I put up #29214 to tackle the last bullet; hoping you can fill in the first two bullets next week, Joe!
Previously, the durable catalog could be fenced in one of the following ways:

- It tried to listen to the durable catalog and saw a higher epoch.
- It tried to listen to the durable catalog and saw a higher binary version.
- It tried to write to the durable catalog and had an upper mismatch.
- It tried to write to the catalog migration shard and had an upper mismatch.

These would either result in a non-descriptive `DurableCatalogError::Fence` error or a `DurableCatalogError::IncompatiblePersistVersion` error. Additionally, if multiple of these conditions happened at the same time, it was non-deterministic which error would be returned.

This commit modifies the conditions under which the durable catalog can be fenced to the following:

- It tried to listen to the durable catalog and saw a higher deploy generation.
- It tried to listen to the durable catalog and saw a higher epoch.
- It tried to write to the durable catalog shard and had an upper mismatch.
- It tried to write to the catalog migration shard and had an upper mismatch.

When initializing the catalog, we still check that the current binary version is not less than the catalog binary version, but we no longer check during every update. After the catalog has been initialized, another catalog can only increase the binary version by also fencing the current catalog via another method.

This commit also creates a new error enum to explicitly codify these different types of fences. When multiple of these conditions happen at the same time, we deterministically return a fence error according to the priority in which the conditions are ordered above. The context of the fence is useful to higher layers in determining what action to take. For example, if the deploy generation has increased, then we'd like to gracefully exit with exit code 0 and let the new instance take over.

Works towards resolving MaterializeInc#29199
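The deterministic priority could be sketched like this; the enum name, variants, and helper below are assumptions for illustration, not the actual Materialize types.

```rust
// Illustrative sketch: a derived `Ord` on an enum orders variants by
// declaration order, so the first variant is the "smallest" and thus
// the highest priority here.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FenceCause {
    DeployGeneration,       // highest priority
    Epoch,
    CatalogUpperMismatch,
    MigrationUpperMismatch, // lowest priority
}

/// When multiple fence conditions fire at once, deterministically
/// report the highest-priority one.
fn strongest(causes: Vec<FenceCause>) -> Option<FenceCause> {
    causes.into_iter().min()
}
```

Deriving the ordering from variant declaration order keeps the priority list in one place rather than scattering comparisons across the fencing logic.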
Previously, the catalog used both the epoch and the deploy generation as fencing tokens. However, they were both written to the catalog as separate entries and at different times. This complicated the fencing logic and often resulted in fencing errors with less information than we wanted. This commit combines the epoch and deploy generation into a single fencing token that is always written to the catalog atomically. Works towards resolving MaterializeInc#29199
Previously, the catalog used both the epoch and the deploy generation as fencing tokens. However, they were both written to the catalog as separate entries and at different times. This complicated the fencing logic and often resulted in fencing errors with less information than we wanted.

This commit combines the epoch and deploy generation into a single fencing token that is always written to the catalog atomically. It utilizes the `FenceToken` update kind, which was added in a previous commit. Since the epoch and fence token must be available before the catalog is opened (i.e., before migrations are run), this commit needs to handle both epochs and fence tokens. In the next release we can fully remove the `Epoch` update kind.

Additionally, fence errors can now indicate with certainty whether the fence is due to the deploy generation or the epoch. Therefore, we can remove the halt variant of `unwrap_or_terminate`, which was used when we weren't sure whether the deploy generation had increased.

Works towards resolving #29199

Co-authored-by: Dennis Felsing <[email protected]>
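A sketch of what such a combined token might look like; the struct name, field names, and the derived lexicographic ordering (deploy generation compared first, epoch as tie-breaker) are assumptions based on the commit message, not the actual implementation.

```rust
// Hypothetical combined fencing token. Derived `Ord` on a struct
// compares fields in declaration order, so a higher deploy generation
// always wins, and the epoch only breaks ties within a generation.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct FenceToken {
    deploy_generation: u64, // compared first
    epoch: u64,             // tie-breaker within a generation
}

impl FenceToken {
    /// A token fences ours out if it is strictly newer, and because the
    /// token is written atomically, both components are read together.
    fn fences_out(self, ours: FenceToken) -> bool {
        self > ours
    }
}
```

Writing both components as one atomic entry is what removes the ambiguity the commit message describes: a fence error can then say definitively which component changed.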
Reopening since #29269 was reverted.
What version of Materialize are you using?
75a54f5
What is the issue?
I believe what is happening is that #29025 and #29035 don't work together:
#29025 tries to exit with 0 when a deployment has been fenced out, but #29035 again causes us to panic (exit code 134), so environmentd is rebooted again and again. This causes the upgrade test failures in https://buildkite.com/materialize/nightly/builds/9225
I'm not sure what the cleanest way to fix this in code is; maybe the catalog version check should use `exit!(0, ...` instead of panicking. Or should the `exit!(0, "this deployment has been fenced out");` happen earlier?

The tests still seem to run fine; the old environmentd just keeps restarting and panicking, causing the logs to fill up with garbage. So for now, on the testing side, I'll silence the panics until we have a better solution.

This started occurring now because the release version was bumped, and thus the older version we pick also contains #29035. This is a general problem: we only notice problems with the catalog versioning when we actually bump the release version. Maybe we should have a test that creates a new build with the next version and tries upgrading to it, so that we can catch those issues earlier. I'm not that worried about test issues, but product issues could also hide like that. I'll give it a try.