
0dt upgrades: environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time #29199

Closed
def- opened this issue Aug 23, 2024 · 7 comments · Fixed by #29214 or #29593
Labels: C-bug (Category: something is broken) · T-platform-v2 (Theme: Platform v2)

Comments

def- (Contributor) commented Aug 23, 2024

What version of Materialize are you using?

75a54f5

What is the issue?

I believe what is happening is that #29025 and #29035 don't work together: #29025 tries to exit with code 0 when a deployment has been fenced out, but #29035 causes us to panic again (exit code 134), so environmentd is rebooted over and over. This causes the upgrade test failures in https://buildkite.com/materialize/nightly/builds/9225:

platform-checks-mz_2-1              | environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time
platform-checks-mz_2-1              | thread 'main' panicked at src/environmentd/src/bin/environmentd/main.rs:604:9: environmentd: fatal: incompatible persist version 0.115.0-dev.0, current: 0.114.0, make sure to upgrade the catalog one version forward at a time

I'm not sure what the cleanest way to fix this in code is. Maybe the catalog version check should use exit!(0, ... instead of panicking, or the exit!(0, "this deployment has been fenced out"); should happen earlier. The tests still seem to run fine; the old environmentd just keeps restarting and panicking, filling the logs with garbage. So for now, on the testing side, I'll silence the panics until we have a better solution.

This started occurring now because the release version was bumped, and thus the older version we pick also contains #29035. This is a general problem: we only notice problems with catalog versioning when we actually bump the release version. Maybe we should have a test that creates a new build with the next version and tries upgrading to it, so that we can catch these issues earlier. I'm not that worried about test issues, but product issues could also hide like that. I'll give it a try.

jkosh44 (Contributor) commented Aug 23, 2024

IMO failing the version check should be treated as a panic and use the same exit code as all other panics. I'll look into why it's exiting with 134 and see if I can update it to use exit code 0.

jkosh44 (Contributor) commented Aug 23, 2024

Another tricky thing is testing the fix. It looks like it's version 0.114.0 that's stuck in a crash loop, so no changes we make will take effect until they're included in a release.

def- (Contributor, Author) commented Aug 23, 2024

That's why I made #29201, which allows us to test it immediately by creating a new build with the next minor version. It seems to work: https://buildkite.com/materialize/nightly/builds/9232

jkosh44 (Contributor) commented Aug 23, 2024

The issue seems to be that if the error bubbles up to main, then it will unconditionally panic instead of potentially halting on certain errors:

if let Err(err) = run(args) {
    panic!("environmentd: fatal: {}", err.display_with_causes());
}
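
A minimal, self-contained sketch of one way around this, where main inspects the error before deciding how to die. The FatalError type and the is_deploy_generation_fence helper are assumptions for illustration, not the actual fix:

use std::process;

// Stub error type standing in for environmentd's real error.
struct FatalError {
    deploy_generation_fence: bool,
    message: String,
}

impl FatalError {
    fn is_deploy_generation_fence(&self) -> bool {
        self.deploy_generation_fence
    }
}

fn run() -> Result<(), FatalError> {
    // Simulate being fenced out by a newer deploy generation.
    Err(FatalError {
        deploy_generation_fence: true,
        message: "fenced out by a newer deploy generation".into(),
    })
}

fn main() {
    if let Err(err) = run() {
        if err.is_deploy_generation_fence() {
            // Exit cleanly so the orchestrator does not restart the process.
            eprintln!("environmentd: {}", err.message);
            process::exit(0);
        }
        panic!("environmentd: fatal: {}", err.message);
    }
}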

jkosh44 (Contributor) commented Aug 23, 2024

I had a discussion with @benesch and this is my understanding of the issue. Before #29035, 0dt worked like this:

  1. Envd version n is running.
  2. We start envd version n + 1, which eventually increments the epoch and binary version in the catalog.
  3. Envd version n sees a higher epoch in the catalog and halts, causing the process to exit with a non-zero exit code.
  4. Envd version n is restarted because it exited with a non-zero exit code.
  5. Envd version n initializes the catalog.
  6. Envd version n enters the preflight checks, looks at the most recent deploy generation in the catalog, sees that it's larger than the deploy version it was started with, and exits with a zero exit code.
  7. Envd version n is NOT restarted because it exited with a zero exit code.

After #29035, we see the same first four steps, but then continue like this:

  5. Envd version n tries to initialize the catalog, but fails because the catalog has been opened with a higher binary version, and exits with a non-zero exit code.
  6. Go back to step (4).

So we end up in an infinite crash loop.
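
To make the two exit paths concrete, here is a small illustrative model; all names are hypothetical, and 134 is the abort exit code observed in the logs above:

#[derive(Debug)]
enum FenceReason {
    NewerDeployGeneration, // step 6: graceful handover to the new deployment
    NewerBinaryVersion,    // post-#29035: hit during catalog init, panics
}

// The orchestrator restarts the process on any non-zero exit code, so only
// the deploy-generation path breaks the restart loop.
fn exit_code(reason: &FenceReason) -> i32 {
    match reason {
        FenceReason::NewerDeployGeneration => 0,
        FenceReason::NewerBinaryVersion => 134, // SIGABRT from the panic
    }
}

fn main() {
    for reason in [FenceReason::NewerDeployGeneration, FenceReason::NewerBinaryVersion] {
        println!("{:?} -> exit {}", reason, exit_code(&reason));
    }
}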

My proposal to fix this is the following:

  • The durable catalog will store in memory the deploy generation it was started with.
  • In the durable catalog, instead of immediately returning a fence error, always drain all updates from the persist shard. If there is a deploy generation greater than the deploy generation stored in memory, return a DeployGenerationFence error; otherwise, return the original fence error.
  • Whenever the adapter/Coordinator/envd sees a DeployGenerationFence error, exit with exit code 0.
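
A minimal sketch of the proposed drain-and-classify logic, assuming hypothetical CatalogUpdate and FenceError types:

#[derive(Debug)]
enum FenceError {
    DeployGenerationFence,
    Other(String),
}

struct CatalogUpdate {
    deploy_generation: Option<u64>,
}

// Drain all updates from the shard before deciding which fence error to
// return, as the proposal describes.
fn classify_fence(
    started_generation: u64, // deploy generation stored in memory at startup
    updates: impl IntoIterator<Item = CatalogUpdate>,
    original: FenceError,
) -> FenceError {
    let newest = updates
        .into_iter()
        .filter_map(|u| u.deploy_generation)
        .max();
    match newest {
        // A newer deployment has taken over; callers should exit with code 0.
        Some(g) if g > started_generation => FenceError::DeployGenerationFence,
        _ => original,
    }
}

fn main() {
    let updates = vec![CatalogUpdate { deploy_generation: Some(5) }];
    let err = classify_fence(4, updates, FenceError::Other("upper mismatch".into()));
    println!("{:?}", err); // prints DeployGenerationFence
}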

benesch added a commit to benesch/materialize that referenced this issue Aug 24, 2024
If an environmentd process discovers that it's been fenced out by an
environmentd process with a newer generation, it should exit with code
zero (0). The entrypoint.sh script in the environmentd Docker image will
then sleep until the orchestration layer gets around to tearing down the
container.

Previously in this situation, the old environmentd would halt with exit
code two (2), then restart and panic upon discovering that the catalog
had been written to by a newer version. It would then remain in a panic
loop until the orchestration layer got around to tearing down the
container. This did not cause incorrectness, but led to log spam in
tests and noisy alerts during release windows in staging and production.

Fix MaterializeInc#29199.
benesch (Member) commented Aug 24, 2024

My proposal to fix this is the following:

  • The durable catalog will store in memory the deploy generation it was started with.
  • In the durable catalog, instead of immediately returning a fence error, always drain all updates from the persist shard. If there is a deploy generation greater than the deploy generation stored in memory, return a DeployGenerationFence error; otherwise, return the original fence error.
  • Whenever the adapter/Coordinator/envd sees a DeployGenerationFence error, exit with exit code 0.

Proposal LGTM. I put up #29214 to tackle the last bullet—hoping you can fill in the first two bullets next week, Joe!

jkosh44 mentioned this issue Aug 27, 2024
jkosh44 added a commit to jkosh44/materialize that referenced this issue Aug 27, 2024
Previously, the durable catalog could be fenced in one of the
following ways:

  - It tried to listen to the durable catalog and saw a higher epoch.
  - It tried to listen to the durable catalog and saw a higher binary
    version.
  - It tried to write to the durable catalog and had an upper
    mismatch.
  - It tried to write to the catalog migration shard and had an upper
    mismatch.

These would either result in a non-descriptive
`DurableCatalogError::Fence` error or a
`DurableCatalogError::IncompatiblePersistVersion` error. Additionally,
if multiple of these conditions happened at the same time, it was
non-deterministic which error would be returned.

This commit changes the conditions under which the durable catalog can
be fenced to the following:

  - It tried to listen to the durable catalog and saw a higher deploy
    generation.
  - It tried to listen to the durable catalog and saw a higher epoch.
  - It tried to write to the durable catalog shard and had an upper
    mismatch.
  - It tried to write to the catalog migration shard and had an upper
    mismatch.

When initializing the catalog, we still check that the current binary
version is not less than the catalog binary version, but we no longer
check during every update. After the catalog has been initialized,
another catalog can only increase the binary version by also fencing
the current catalog by another method.

This commit also creates a new error enum to explicitly codify these
different types of fences. When multiple of these conditions happen at
the same time, we will deterministically return a fence error according
to the priority in which the conditions are ordered above. The context
of the fence is useful to higher layers in determining what action to
take. For example, if the deploy generation has increased, then we'd
like to gracefully exit with exit code 0 and let the new instance take
over.

Works towards resolving MaterializeInc#29199
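
A hypothetical sketch of such an enum with the deterministic priority the commit describes; the variant and method names are assumptions, not the actual code:

#[derive(Debug)]
enum FenceError {
    DeployGeneration { current: u64, fence: u64 },
    Epoch { current: u64, fence: u64 },
    CatalogUpperMismatch,
    MigrationUpperMismatch,
}

impl FenceError {
    // Lower value = higher priority when several conditions hold at once.
    fn priority(&self) -> u8 {
        match self {
            FenceError::DeployGeneration { .. } => 0,
            FenceError::Epoch { .. } => 1,
            FenceError::CatalogUpperMismatch => 2,
            FenceError::MigrationUpperMismatch => 3,
        }
    }
}

fn main() {
    let mut found = vec![
        FenceError::MigrationUpperMismatch,
        FenceError::DeployGeneration { current: 3, fence: 4 },
    ];
    // Deterministically report the deploy-generation fence first.
    found.sort_by_key(|e| e.priority());
    println!("report: {:?}", found[0]);
}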
jkosh44 added a commit to jkosh44/materialize that referenced this issue Aug 29, 2024
Previously, the catalog used both the epoch and the deploy generation
as fencing tokens. However, they were both written to the catalog as
separate entries and at different times. This complicated the fencing
logic and often resulted in fencing errors with less information than
we wanted.

This commit combines the epoch and deploy generation into a single
fencing token that is always written to the catalog atomically.

Works towards resolving MaterializeInc#29199
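
A hypothetical sketch of such a combined token, written atomically as a single catalog entry. The derived lexicographic ordering compares the deploy generation first, then the epoch; field names are assumptions based on the commit message:

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct FenceToken {
    deploy_generation: u64, // compared first: a newer deployment always fences
    epoch: u64,             // tie-breaker within a single deploy generation
}

fn main() {
    let current = FenceToken { deploy_generation: 3, epoch: 7 };
    let incoming = FenceToken { deploy_generation: 4, epoch: 1 };
    // A higher deploy generation fences regardless of the epoch.
    assert!(incoming > current);
    println!("{:?} fences out {:?}", incoming, current);
}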
jkosh44 added a commit that referenced this issue Sep 6, 2024
Previously, the catalog used both the epoch and the deploy generation
as fencing tokens. However, they were both written to the catalog as
separate entries and at different times. This complicated the fencing
logic and often resulted in fencing errors with less information than
we wanted.

This commit combines the epoch and deploy generation into a single
fencing token that is always written to the catalog atomically. It
utilizes the `FenceToken` update kind which was added in a previous
commit. Since the epoch and fence token must be available before the
catalog is opened (i.e. before migrations are run), this commit needs
to handle both epochs and fence tokens. In the next release we can
fully remove the `Epoch` update kind.

Additionally, fence errors can now indicate with certainty whether the
fence is due to the deploy generation or the epoch. Therefore, we can
remove the halt variant of `unwrap_or_terminate`, which was used when
we weren't sure whether the deploy generation had increased.

Works towards resolving #29199

---------

Co-authored-by: Dennis Felsing <[email protected]>
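
A hypothetical sketch of the transitional handling described above, where a fence may still arrive as a legacy `Epoch` entry or as the new `FenceToken` kind; everything beyond those two names is an assumption:

#[derive(Debug)]
enum FencingValue {
    // Legacy `Epoch` update kind; can be fully removed in the next release.
    Epoch(u64),
    // New `FenceToken` update kind: deploy generation plus epoch, written
    // atomically as a single entry.
    FenceToken { deploy_generation: u64, epoch: u64 },
}

fn main() {
    let legacy = FencingValue::Epoch(7);
    let modern = FencingValue::FenceToken { deploy_generation: 4, epoch: 1 };
    // During the transition, fencing logic must accept both representations.
    println!("{:?} vs {:?}", legacy, modern);
}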
benesch reopened this Sep 14, 2024
benesch (Member) commented Sep 14, 2024

Reopening since #29269 was reverted.
