Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controller shutdown doesn't wait for reconciles to finish #491

Open
rsafonseca opened this issue Jan 12, 2024 · 3 comments
Open

Controller shutdown doesn't wait for reconciles to finish #491

rsafonseca opened this issue Jan 12, 2024 · 3 comments
Labels
area/ux In pursuit of a delightful user experience enhancement New feature or request

Comments

@rsafonseca
Copy link

Every time the image-reflector-controller pod goes down for any reason, it isn't waiting for running reconciliations to finish, and propagates an immediate context cancellation to the running contexts.

This causes running reconciliations to fail and return an error, which in turn is propagated to all channels defined in the notification controller, generating a lot of noise.

Ideally, the running reconciliations should be allowed to finish, or at the worst these expected context cancellation shutdown errors should not be propagated to the notification channels.

Here's an example log of the issue:

{"level":"info","ts":"2024-01-11T11:41:01.305Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2024-01-11T11:41:01.312Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2024-01-11T11:41:01.324Z","msg":"All workers finished","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy"}
{"level":"info","ts":"2024-01-11T11:41:01.312Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy"}
{"level":"info","ts":"2024-01-11T11:41:01.312Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository"}
{"level":"error","ts":"2024-01-11T11:41:01.332Z","msg":"failed to configure authentication options: operation error ECR: GetAuthorizationToken, get identity: get credentials: request canceled, context canceled","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"64c8c4f2-e200-4466-bb1a-0059375fc0d6","error":"AuthenticationFailed"}
{"level":"info","ts":"2024-01-11T11:41:01.338Z","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"9f627893-53d1-4bdc-acac-a0b3099216e5"}
{"level":"error","ts":"2024-01-11T11:41:01.339Z","msg":"Reconciler error","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"9f627893-53d1-4bdc-acac-a0b3099216e5","error":"[Patch \"https://10.16.0.1:443/apis/image.toolkit.fluxcd.io/v1beta2/namespaces/**REDACTED**/imagerepositories/**REDACTED**/status?fieldManager=image-reflector-controller\": context canceled, context canceled]","errorCauses":[{"error":"[Patch \"https://10.16.0.1:443/apis/image.toolkit.fluxcd.io/v1beta2/namespaces/**REDACTED**/imagerepositories/**REDACTED**/status?fieldManager=image-reflector-controller\": context canceled, context canceled]","errorCauses":[{"error":"Patch \"https://10.16.0.1:443/apis/image.toolkit.fluxcd.io/v1beta2/namespaces/**REDACTED**/imagerepositories/**REDACTED**/status?fieldManager=image-reflector-controller\": context canceled"},{"error":"context canceled"}]}]}
{"level":"error","ts":"2024-01-11T11:41:01.373Z","msg":"Reconciler error","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"**REDACTED**","namespace":"**REDACTED**"},"namespace":"**REDACTED**","name":"**REDACTED**","reconcileID":"64c8c4f2-e200-4466-bb1a-0059375fc0d6","error":"[failed to configure authentication options: operation error ECR: GetAuthorizationToken, get identity: get credentials: request canceled, context canceled, context canceled]","errorCauses":[{"error":"failed to configure authentication options: operation error ECR: GetAuthorizationToken, get identity: get credentials: request canceled, context canceled"},{"error":"context canceled","errorCauses":[{"error":"context canceled"},{"error":"context canceled"}]}]}
{"level":"info","ts":"2024-01-11T11:41:01.375Z","msg":"All workers finished","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository"}
{"level":"info","ts":"2024-01-11T11:41:01.375Z","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2024-01-11T11:41:01.384Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2024-01-11T11:41:01.384Z","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2024-01-11T11:41:01.384Z","msg":"shutting down server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2024-01-11T11:41:01.389Z","logger":"controller-runtime.metrics","msg":"Shutting down metrics server with timeout of 1 minute"}
{"level":"info","ts":"2024-01-11T11:41:01.396Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2024-01-11T11:41:01.408Z","msg":"error received after stop sequence was engaged","error":"leader election lost"}

@darkowlzz
Copy link
Contributor

Hi, we have a single reconciliation context that's propagated throughout the reconciliation. If the context is cancelled after the reconciliation has started, any new operation that's part of the reconciliation will not succeed. I think it's better to terminate instead of trying to finish the operations by creating a new, independent, context. As far as I know, that's how most of the go programs behave in terms of context, unless there's a good reason to do otherwise.

I would have expected the notifications to also not work because it sends an http request to notification-controller. But looking at the code, the notification event emitter doesn't take any context, hence it is able to send the notification request successfully, see the event code in https://github.com/fluxcd/pkg/blob/ce1f99f03d6756cde86b2fa9494fe13804296bac/runtime/events/recorder.go#L119. This doesn't take a context because it implements the upstream kubernetes event recorder interface, see https://github.com/kubernetes/client-go/blob/v12.0.0/tools/record/event.go#L109, which also doesn't take a context. Our event recorder emits kubernetes event and flux event which is sent to the notification controller. If we decide to not emit events at all during context cancellation, it'll also mean we don't emit kubernetes event, which may be useful for some users. So, although we can check if the given context has been cancelled in

func notify(ctx context.Context, r kuberecorder.EventRecorder, oldObj, newObj conditions.Setter, nextScanMsg string) {
and skip any notifications, something like

if context.Cause(ctx) == context.Canceled {
  return
}

I'm afraid that'll have the side-effect of also skipping logging and emitting kubernetes events. One can argue that these logs and events would not be useful at all, but I'd like to hear more feedback before we do such things. At present I think it's okay to do that, but I would like to hear other's opinions too on this.

For now, since the issue is the spam failure notifications through notification-controller, one way to get around it would be to exclude them at the notification-controller level by setting event exclusion on the respective alert, see https://fluxcd.io/flux/components/notification/alerts/#event-exclusion.

@darkowlzz
Copy link
Contributor

darkowlzz commented Feb 21, 2024

I spent some more time thinking about it and also tried to see this behavior in other flux controllers. It seems others too behave the same if the controller pod is terminated while in the middle of a reconciliation. For example, for source-controller, I got the following alert:

failed to checkout and determine revision: unable to list remote for 'http://example.com/test-group/test-repo.git': Get "http://example.com/test-group/test-repo.git/info/refs?service=git-upload-pack": context canceled

When the controller starts again, if nothing new happens (no change in the result of reconciliation), no new notification is sent. We have a failure recovery notification mechanism in place which works based on the objects being marked as failed and when the object is reconciled the next time, a new notification about the success is sent after comparing the objects current state and the previously known state. This doesn't work for this particular case because of the cancelled context, the object being marked as failed doesn't get persisted in kubernetes. Requests to kubernetes-apiserver fail when sent with a cancelled context. When the controller starts again and gets the object, it has the last ready state, not failed. No recovery happened.
This leaves a failed notification and no subsequent recovery notification. I think this is a good reason to skip notifications when context gets cancelled. I believe context cancellation will only happen when the controller-manager is signaled to stop. This is different from context timeout which will still continue to send notification and behave as expected.

I'm in favor of implementing this across all the controllers.

Thanks @rsafonseca for highlighting this issue.

@darkowlzz darkowlzz added enhancement New feature or request area/ux In pursuit of a delightful user experience labels Feb 21, 2024
@stefanprodan
Copy link
Member

I think this is a good reason to skip notifications when context gets cancelled.

Yes I fully suport this, these type of events are just noise as all controllers will resume work after restart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ux In pursuit of a delightful user experience enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants