
0dt upgrades: environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time #29199

Closed
def- opened this issue Aug 23, 2024 · 7 comments · Fixed by #29214 or #29593
Labels: C-bug (Category: something is broken) · T-platform-v2 (Theme: Platform v2)

Comments

def- (Contributor) commented Aug 23, 2024

What version of Materialize are you using?

75a54f5

What is the issue?

I believe what is happening is that #29025 and #29035 don't work together: #29025 tries to exit with code 0 when a deployment has been fenced out, but #29035 causes us to panic again (exit code 134), so environmentd is rebooted over and over. This causes the upgrade test failures in https://buildkite.com/materialize/nightly/builds/9225:

platform-checks-mz_2-1              | environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time
platform-checks-mz_2-1              | thread 'main' panicked at src/environmentd/src/bin/environmentd/main.rs:604:9: environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time

I'm not sure what the cleanest way to fix this in code is. Maybe the catalog version check should use exit!(0, ... instead of panicking, or the exit!(0, "this deployment has been fenced out"); should happen earlier. The tests still seem to run fine; the old environmentd just keeps restarting and panicking, filling the logs with garbage. So for now, on the testing side, I'll silence the panics until we have a better solution.

This started occurring now because the release version was bumped, and thus the older version we pick also contains #29035. This is a general problem: we only notice problems with catalog versioning when we actually bump the release version. Maybe we should have a test that creates a new build with the next version and tries upgrading to it, so that we can catch these issues earlier. I'm not that worried about test issues, but product issues could also hide like that. I'll give it a try.

jkosh44 (Contributor) commented Aug 23, 2024

IMO failing the version check should be treated as a panic and use the same exit code as all other panics. I'll look into why it's exiting with 134 and see if I can update it to use exit code 0.

jkosh44 (Contributor) commented Aug 23, 2024

Another tricky thing is testing the fix. It looks like it's version 0.114.0 that's stuck in a crash loop, so no changes we make will take effect until they're included in a release.

def- (Contributor, Author) commented Aug 23, 2024

That's why I made #29201, which allows us to test it immediately by creating a new build with the next minor version. It seems to work: https://buildkite.com/materialize/nightly/builds/9232

jkosh44 (Contributor) commented Aug 23, 2024

The issue seems to be that if the error bubbles up to main, then it will unconditionally panic instead of potentially halting on certain errors:

if let Err(err) = run(args) {
    panic!("environmentd: fatal: {}", err.display_with_causes());
}
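
A minimal, self-contained sketch of one way around this, where main inspects the error before deciding how to die. The FatalError type and the is_deploy_generation_fence helper are assumptions for illustration, not the actual fix:

use std::process;

// Stub error type standing in for environmentd's real error.
struct FatalError {
    deploy_generation_fence: bool,
    message: String,
}

impl FatalError {
    fn is_deploy_generation_fence(&self) -> bool {
        self.deploy_generation_fence
    }
}

fn run() -> Result<(), FatalError> {
    // Simulate being fenced out by a newer deploy generation.
    Err(FatalError {
        deploy_generation_fence: true,
        message: "fenced out by a newer deploy generation".into(),
    })
}

fn main() {
    if let Err(err) = run() {
        if err.is_deploy_generation_fence() {
            // Exit cleanly so the orchestrator does not restart the process.
            eprintln!("environmentd: {}", err.message);
            process::exit(0);
        }
        panic!("environmentd: fatal: {}", err.message);
    }
}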

jkosh44 (Contributor) commented Aug 23, 2024

I had a discussion with @benesch and this is my understanding of the issue. Before #29035, 0dt worked like this:

  1. Envd version n is running.
  2. We start envd version n + 1, which eventually increments the epoch and binary version in the catalog.
  3. Envd version n sees a higher epoch in the catalog and halts, causing the process to exit with a non-zero exit code.
  4. Envd version n is restarted because it exited with a non-zero exit code.
  5. Envd version n initializes the catalog.
  6. Envd version n enters the preflight checks, looks at the most recent deploy generation in the catalog, sees that it's larger than the deploy version it was started with, and exits with a zero exit code.
  7. Envd version n is NOT restarted because it exited with a zero exit code.

After #29035, we see the same first four steps, but then continue like this:

  5. Envd version n tries to initialize the catalog, but fails because the catalog has been opened with a higher binary version, and exits with a non-zero exit code.
  6. Go back to step (4).

So we end up in an infinite crash loop.
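
To make the two exit paths concrete, here is a small illustrative model; all names are hypothetical, and 134 is the abort exit code observed in the logs above:

#[derive(Debug)]
enum FenceReason {
    NewerDeployGeneration, // step 6: graceful handover to the new deployment
    NewerBinaryVersion,    // post-#29035: hit during catalog init, panics
}

// The orchestrator restarts the process on any non-zero exit code, so only
// the deploy-generation path breaks the restart loop.
fn exit_code(reason: &FenceReason) -> i32 {
    match reason {
        FenceReason::NewerDeployGeneration => 0,
        FenceReason::NewerBinaryVersion => 134, // SIGABRT from the panic
    }
}

fn main() {
    for reason in [FenceReason::NewerDeployGeneration, FenceReason::NewerBinaryVersion] {
        println!("{:?} -> exit {}", reason, exit_code(&reason));
    }
}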

My proposal to fix this is the following:

  • The durable catalog will store in memory the deploy generation it was started with.
  • In the durable catalog, instead of immediately returning a fence error, always drain all updates from the persist shard. If there is a deploy generation greater than the deploy generation stored in memory, return a DeployGenerationFence error; otherwise, return the original fence error.
  • Whenever the adapter/Coordinator/envd sees a DeployGenerationFence error, exit with exit code 0.
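
A minimal sketch of the proposed drain-and-classify logic, assuming hypothetical CatalogUpdate and FenceError types:

#[derive(Debug)]
enum FenceError {
    DeployGenerationFence,
    Other(String),
}

struct CatalogUpdate {
    deploy_generation: Option<u64>,
}

// Drain all updates from the shard before deciding which fence error to
// return, as the proposal describes.
fn classify_fence(
    started_generation: u64, // deploy generation stored in memory at startup
    updates: impl IntoIterator<Item = CatalogUpdate>,
    original: FenceError,
) -> FenceError {
    let newest = updates
        .into_iter()
        .filter_map(|u| u.deploy_generation)
        .max();
    match newest {
        // A newer deployment has taken over; callers should exit with code 0.
        Some(g) if g > started_generation => FenceError::DeployGenerationFence,
        _ => original,
    }
}

fn main() {
    let updates = vec![CatalogUpdate { deploy_generation: Some(5) }];
    let err = classify_fence(4, updates, FenceError::Other("upper mismatch".into()));
    println!("{:?}", err); // prints DeployGenerationFence
}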

benesch added a commit to benesch/materialize that referenced this issue Aug 24, 2024
If an environmentd process discovers that it's been fenced out by an
environmentd process with a newer generation, it should exit with code
zero (0). The entrypoint.sh script in the environmentd Docker image will
then sleep until the orchestration layer gets around to tearing down the
container.

Previously in this situation, the old environmentd would halt with exit
code two (2), then restart and panic upon discovering that the catalog
had been written to by a newer version. It would then remain in a panic
loop until the orchestration layer got around to tearing down the
container. This did not cause incorrectness, but led to log spam in
tests and noisy alerts during release windows in staging and production.

Fix MaterializeInc#29199.
benesch (Member) commented Aug 24, 2024

My proposal to fix this is the following:

  • The durable catalog will store in memory the deploy generation it was started with.
  • In the durable catalog, instead of immediately returning a fence error, always drain all updates from the persist shard. If there is a deploy generation greater than the deploy generation stored in memory, return a DeployGenerationFence error; otherwise, return the original fence error.
  • Whenever the adapter/Coordinator/envd sees a DeployGenerationFence error, exit with exit code 0.

Proposal LGTM. I put up #29214 to tackle the last bullet—hoping you can fill in the first two bullets next week, Joe!

jkosh44 mentioned this issue Aug 27, 2024
jkosh44 added a commit to jkosh44/materialize that referenced this issue Aug 27, 2024
Previously, the durable catalog could be fenced in one of the
following ways:

  - It tried to listen to the durable catalog and saw a higher epoch.
  - It tried to listen to the durable catalog and saw a higher binary
    version.
  - It tried to write to the durable catalog and had an upper
    mismatch.
  - It tried to write to the catalog migration shard and had an upper
    mismatch.

These would either result in a non-descriptive
`DurableCatalogError::Fence` error or a
`DurableCatalogError::IncompatiblePersistVersion` error. Additionally,
if multiple of these conditions happened at the same time, it was
non-deterministic which error would be returned.

This commit changes the conditions under which the durable catalog can
be fenced to the following:

  - It tried to listen to the durable catalog and saw a higher deploy
    generation.
  - It tried to listen to the durable catalog and saw a higher epoch.
  - It tried to write to the durable catalog shard and had an upper
    mismatch.
  - It tried to write to the catalog migration shard and had an upper
    mismatch.

When initializing the catalog, we still check that the current binary
version is not less than the catalog binary version, but we no longer
check during every update. After the catalog has been initialized,
another catalog can only increase the binary version by also fencing
the current catalog by another method.

This commit also creates a new error enum to explicitly codify these
different types of fences. When multiple of these conditions happen at
the same time, we will deterministically return a fence error according
to the priority in which the conditions are ordered above. The context
of the fence is useful to higher layers in determining what action to
take. For example, if the deploy generation has increased, then we'd
like to gracefully exit with exit code 0 and let the new instance take
over.

Works towards resolving MaterializeInc#29199
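
A hypothetical sketch of such an enum with the deterministic priority the commit describes; the variant and method names are assumptions, not the actual code:

#[derive(Debug)]
enum FenceError {
    DeployGeneration { current: u64, fence: u64 },
    Epoch { current: u64, fence: u64 },
    CatalogUpperMismatch,
    MigrationUpperMismatch,
}

impl FenceError {
    // Lower value = higher priority when several conditions hold at once.
    fn priority(&self) -> u8 {
        match self {
            FenceError::DeployGeneration { .. } => 0,
            FenceError::Epoch { .. } => 1,
            FenceError::CatalogUpperMismatch => 2,
            FenceError::MigrationUpperMismatch => 3,
        }
    }
}

fn main() {
    let mut found = vec![
        FenceError::MigrationUpperMismatch,
        FenceError::DeployGeneration { current: 3, fence: 4 },
    ];
    // Deterministically report the deploy-generation fence first.
    found.sort_by_key(|e| e.priority());
    println!("report: {:?}", found[0]);
}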
jkosh44 added a commit to jkosh44/materialize that referenced this issue Aug 29, 2024
Previously, the catalog used both the epoch and the deploy generation
as fencing tokens. However, they were both written to the catalog as
separate entries and at different times. This complicated the fencing
logic and often resulted in fencing errors with less information than
we wanted.

This commit combines the epoch and deploy generation into a single
fencing token that is always written to the catalog atomically.

Works towards resolving MaterializeInc#29199
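
A hypothetical sketch of such a combined token, written atomically as a single catalog entry. The derived lexicographic ordering compares the deploy generation first, then the epoch; field names are assumptions based on the commit message:

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct FenceToken {
    deploy_generation: u64, // compared first: a newer deployment always fences
    epoch: u64,             // tie-breaker within a single deploy generation
}

fn main() {
    let current = FenceToken { deploy_generation: 3, epoch: 7 };
    let incoming = FenceToken { deploy_generation: 4, epoch: 1 };
    // A higher deploy generation fences regardless of the epoch.
    assert!(incoming > current);
    println!("{:?} fences out {:?}", incoming, current);
}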
jkosh44 added a commit that referenced this issue Sep 6, 2024
Previously, the catalog used both the epoch and the deploy generation
as fencing tokens. However, they were both written to the catalog as
separate entries and at different times. This complicated the fencing
logic and often resulted in fencing errors with less information than
we wanted.

This commit combines the epoch and deploy generation into a single
fencing token that is always written to the catalog atomically. It
utilizes the `FenceToken` update kind which was added in a previous
commit. Since the epoch and fence token must be available before the
catalog is opened (i.e. before migrations are run), this commit needs
to handle both epochs and fence tokens. In the next release we can
fully remove the `Epoch` update kind.

Additionally, fence errors can now indicate with certainty whether the
fence is due to the deploy generation or the epoch. Therefore, we can
remove the halt variant of `unwrap_or_terminate`, which was used when
we weren't sure whether the deploy generation had increased.

Works towards resolving #29199

---------

Co-authored-by: Dennis Felsing <[email protected]>
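
A hypothetical sketch of the transitional handling described above, where a fence may still arrive as a legacy `Epoch` entry or as the new `FenceToken` kind; everything beyond those two names is an assumption:

#[derive(Debug)]
enum FencingValue {
    // Legacy `Epoch` update kind; can be fully removed in the next release.
    Epoch(u64),
    // New `FenceToken` update kind: deploy generation plus epoch, written
    // atomically as a single entry.
    FenceToken { deploy_generation: u64, epoch: u64 },
}

fn main() {
    let legacy = FencingValue::Epoch(7);
    let modern = FencingValue::FenceToken { deploy_generation: 4, epoch: 1 };
    // During the transition, fencing logic must accept both representations.
    println!("{:?} vs {:?}", legacy, modern);
}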
benesch reopened this Sep 14, 2024
benesch (Member) commented Sep 14, 2024

Reopening since #29269 was reverted.
