Finalizing staged deployments broken on /boot automount #2543

Closed
dbnicholson opened this issue Feb 16, 2022 · 5 comments · Fixed by #2544

Comments

@dbnicholson
Member

Recently we changed our updater to use staged deployments in endlessm/eos-updater#298. That worked fine on systems where /boot is a persistent mount point, but it fails on systems that use systemd-boot where /boot is the automounted EFI system partition. There are 2 problems with this:

  1. If the /boot automount expires, the ostree-finalize-staged.service unit runs immediately since it has RequiresMountsFor=/boot. With nothing keeping the automount from expiring, this can happen at any point prior to shutdown and ruin the feature. This actually deadlocks in systemd, but it would be bad even without the automounting bugs.
  2. If RequiresMountsFor=/boot is removed and instead just After=boot.mount is used, then the service is only triggered on shutdown but the ordering remains. However, if the automount has expired, systemd will ignore the request to remount it since the automount is scheduled to be stopped.

See systemd/systemd#22528 for details. Maybe the solution here is that staged deployments are not supported on /boot automounts, but I wanted to open this for discussion.

@dbnicholson
Member Author

The dirty idea I had was to change ostree admin finalize-staged so it opens /boot (and /sysroot) when starting and then blocks waiting for SIGTERM via sigwait. This would effectively turn /boot into a persistent mount if a deployment was scheduled for finalizing, since the automount would never expire.
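
For illustration only, here is a minimal sketch of that idea in Python rather than ostree's C. The function name `hold_then_finalize` and the `finalize` callback are hypothetical and not part of ostree. It assumes that an open directory descriptor is enough to keep the automount from expiring.

```python
import os
import signal


def hold_then_finalize(paths, finalize):
    """Hold directories open until SIGTERM arrives, then run finalize()."""
    # Block SIGTERM first so it stays pending for sigwait() instead of
    # terminating the process through the default disposition.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})

    # An open descriptor on each directory keeps the mount busy; the
    # assumption here is that this stops the automount from expiring.
    fds = [os.open(p, os.O_RDONLY | os.O_DIRECTORY) for p in paths]
    try:
        # Sleep until the service manager asks the process to stop.
        signal.sigwait({signal.SIGTERM})
        finalize()  # hypothetical stand-in for the real finalization
    finally:
        for fd in fds:
            os.close(fd)
```

A service would call something like `hold_then_finalize(["/boot", "/sysroot"], do_finalize)` from its start command, where `do_finalize` is again a placeholder.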

dbnicholson added a commit to endlessm/eos-updater that referenced this issue Feb 16, 2022
This reverts commit 12d263b. On systems
such as PAYG where `/boot` is an automount,
`ostree-finalize-staged.service` fails to work correctly if the
automount expires before shutdown. Until a solution to that issue is
found, go back to the non-staged deployments we've used for years.

https://phabricator.endlessm.com/T5658
systemd/systemd#22528
ostreedev/ostree#2543
@cgwalters
Member

I think that approach isn't dirty at all - it makes sense to me. The code is already heavily oriented towards using directory file descriptors, so we already have a natural mechanism to hold open the mounts.

@cgwalters
Member

And that would actually change us from using ExecStop= to ExecStart= which is definitely more natural.

@dbnicholson
Member Author

Alright, I'll put something together. One question I had is, what if someone unstages the deployment? Should it watch /run/ostree/staged-deployment for deletion and stop itself without finalizing? I think finalizing is pretty much idempotent, so probably it could be allowed to run even if someone unstages.

However, you would need to change the builtin to not initially lock the sysroot since that would prevent doing anything else with it until the unit was stopped. So, I think you'd want to block on the signal, receive the signal, lock the sysroot, load it again (so the state of the deployments is up to date), and then finalize.
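
The ordering described above (wait, then lock, then reload, then finalize) could look roughly like the following Python sketch. Everything here is an assumption made for illustration: `finalize_on_sigterm`, the `reload_sysroot` and `finalize` callbacks, and the use of `flock` on a caller-supplied lock file are stand-ins, not ostree's actual API or lock path.

```python
import fcntl
import os
import signal


def finalize_on_sigterm(lock_path, reload_sysroot, finalize):
    """Wait for SIGTERM first; only then lock, reload, and finalize."""
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    # No lock is held while waiting, so other tools can keep using
    # the sysroot until shutdown.
    signal.sigwait({signal.SIGTERM})

    lock_fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(lock_fd, fcntl.LOCK_EX)  # exclusive lock, taken late
        reload_sysroot()  # pick up deployments changed while waiting
        finalize()
    finally:
        os.close(lock_fd)  # closing the descriptor releases the lock
```

Taking the lock only after the signal keeps the long idle period lock-free, at the cost of having to reload state that may have changed in the meantime.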

@cgwalters
Member

> Should it watch /run/ostree/staged-deployment for deletion and stop itself without finalizing? I think finalizing is pretty much idempotent, so probably it could be allowed to run even if someone unstages.

If you want to handle this case, that sounds good to me, but it doesn't seem at all required to me. It's a real corner case, and we aren't going to be holding open much resident memory. And we can fix it later if someone actually does complain, so probably keep it simple to start.

Plus, people should be applying kernel updates and rebooting anyways 😄

dbnicholson added a commit to dbnicholson/ostree that referenced this issue Feb 16, 2022
If `/boot` or `/sysroot` are automounts, then the unit will be stopped
as soon as the automounts expire. That would defeat the purpose of
using systemd to delay finalizing the deployment until shutdown. This is
not uncommon as `systemd-gpt-auto-generator` will create an automount
unit for `/boot` when it's the EFI System Partition and there's no fstab
entry.

Instead of relying on systemd to run the command via `ExecStop` at the
appropriate time, have `finalize-staged` open `/boot` and `/sysroot` and
then block on `SIGTERM`. Having the directories open will prevent the
automounts from expiring, and then we presume that systemd will send
`SIGTERM` when it's time for the service to stop. Finalizing the
deployment still happens when the service is stopped. The difference is
that the process is already running.

In order to keep from blocking legitimate sysroot activity prior to
shutdown, the sysroot lock is only taken after the signal has been
received. Similarly, the sysroot is reloaded to ensure the state of the
deployments is current.

Fixes: ostreedev#2543
dbnicholson added a commit to dbnicholson/ostree that referenced this issue Feb 17, 2022
dbnicholson added a commit to dbnicholson/ostree that referenced this issue Feb 17, 2022
If `/boot` is an automount, then the unit will be stopped as soon as the
automount expires. That would defeat the purpose of using systemd to
delay finalizing the deployment until shutdown. This is not uncommon as
`systemd-gpt-auto-generator` will create an automount unit for `/boot`
when it's the EFI System Partition and there's no fstab entry.

Instead of relying on systemd to run the command via `ExecStop` at the
appropriate time, have `finalize-staged` open `/boot` and then block on
`SIGTERM`. Having the directory open will prevent the automount from
expiring, and then we presume that systemd will send `SIGTERM` when it's
time for the service to stop. Finalizing the deployment still happens
when the service is stopped. The difference is that the process is
already running.

In order to keep from blocking legitimate sysroot activity prior to
shutdown, the sysroot lock is only taken after the signal has been
received. Similarly, the sysroot is reloaded to ensure the state of the
deployments is current.

Fixes: ostreedev#2543
dbnicholson added a commit to dbnicholson/ostree that referenced this issue Feb 17, 2022
dbnicholson added a commit to endlessm/eos-updater that referenced this issue Mar 3, 2022
The ostree staged deployment process works by waiting until shutdown to
swap the `/boot` symlinks to make the new deployment the default.
However, when `/boot` is the EFI System Partition and there's no `fstab`
entry, `systemd-gpt-auto-generator` sets up an automount so that the
VFAT filesystem is only exposed when needed.

Unfortunately, there are 2 bugs that make this process very fragile:

* Once a systemd automount unit is scheduled to be stopped, it ignores
  notifications from autofs that the target filesystem should be
  mounted. Therefore, if `/boot` isn't mounted when shutdown begins,
  `ostree admin finalize-staged` will fail. See
  systemd/systemd#22528.

* autofs is not mount namespace aware, so it will begin the expiration
  timer for a mount unit unless a process in the root namespace is
  keeping it active. Since `ostree admin finalize-staged` is run from a
  mount namespace (either via systemd or its own to ensure `/sysroot`
  and `/boot` are mounted read-write), the automount daemon (systemd)
  will try to unmount the filesystem if it expires during this process.
  See https://bugzilla.redhat.com/show_bug.cgi?id=2056090.

Therefore, if `/boot` is an autofs filesystem, use a full deployment
instead of a staged deployment. Since systems with an automounted
`/boot` are not common, we want to retain the benefit of staged
deployments for more normal systems. See
ostreedev/ostree#2543 for potential future
fixes in ostree.

https://phabricator.endlessm.com/T33136
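
As a rough illustration of the check described in this commit message, the Python sketch below decides between a staged and a full deployment by looking for an autofs entry at `/boot` in mountinfo-formatted text. This is not eos-updater's actual code: the function names are hypothetical, and parsing `/proc/self/mountinfo` is only one assumed way to detect an automount.

```python
def is_autofs_mount(mount_point, mountinfo_text):
    """Return True if mountinfo lists an autofs filesystem at mount_point."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue  # not a well-formed mountinfo line
        sep = fields.index("-")
        if sep < 6 or len(fields) <= sep + 1:
            continue  # too few fields before or after the separator
        # The fifth field is the mount point; the filesystem type is the
        # first field after the "-" separator.
        if fields[4] == mount_point and fields[sep + 1] == "autofs":
            return True
    return False


def choose_deployment_mode(mountinfo_text):
    """Fall back to a full deployment when /boot is an automount."""
    if is_autofs_mount("/boot", mountinfo_text):
        return "full"
    return "staged"
```

On a running system the input could come from reading `/proc/self/mountinfo`. An automounted `/boot` that is currently mounted has both an autofs entry and a vfat entry at the same mount point, so the sketch looks for any autofs entry rather than only the most recent one.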
dbnicholson added a commit to endlessm/eos-updater that referenced this issue Mar 9, 2022
dbnicholson added a commit to endlessm/eos-updater that referenced this issue Mar 9, 2022
dbnicholson added a commit to endlessm/eos-updater that referenced this issue Mar 11, 2022
(cherry picked from commit a19821a)
dbnicholson added a commit to dbnicholson/ostree that referenced this issue Aug 29, 2022
If `/boot` is an automount, then the unit will be stopped as soon as the
automount expires. That would defeat the purpose of using systemd to
delay finalizing the deployment until shutdown. This is not uncommon as
`systemd-gpt-auto-generator` will create an automount unit for `/boot`
when it's the EFI System Partition and there's no fstab entry.

To ensure that systemd doesn't stop the service early when the `/boot`
automount expires, introduce a new unit that holds `/boot` open until
it's sent `SIGTERM`. This uses a new `--hold` option for
`finalize-staged` that loads but doesn't lock the sysroot. A separate
unit is used since we want the process to remain active throughout the
finalization run in `ExecStop`. That wouldn't work if it was specified
in `ExecStart` in the same unit since it would be killed before the
`ExecStop` action was run.

Fixes: ostreedev#2543
dbnicholson added a commit to dbnicholson/ostree that referenced this issue Aug 30, 2022
dbnicholson added a commit to dbnicholson/ostree that referenced this issue Aug 30, 2022

2 participants