Image-reflector-controller restarts due to OOM Killed #285

Open

Andrea-Gallicchio opened this issue Jul 13, 2022 · 5 comments

@Andrea-Gallicchio

Describe the bug

I run Flux on AWS EKS 1.21.5. I've noticed that since the last Flux update, the image-reflector-controller pod is occasionally restarted because it gets OOMKilled, even though its CPU and memory requests/limits are relatively high. The number of Helm releases is between 30 and 40.

  • CPU Request: 0.05
  • CPU Limit: 0.1
  • CPU Average Usage: 0.006
  • Memory Request: 384 MB
  • Memory Limit: 640 MB
  • Memory Average Usage: 187 MB
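
For context, in a bootstrap setup these limits are typically set with a kustomize patch in the flux-system kustomization.yaml; a minimal sketch could look like this (the 1Gi value is only an illustration, not a recommendation, and the file path assumes the standard bootstrap layout; the deployment and container names match the describe output further down):

patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: image-reflector-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  limits:
                    memory: 1Gi   # example value only
    target:
      kind: Deployment
      name: image-reflector-controller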

Steps to reproduce

N/A

Expected behavior

I expect the image-reflector-controller not to restart due to being OOMKilled.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v0.31.3

Flux check

► checking prerequisites
✔ Kubernetes 1.21.12-eks-a64ea69 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.21.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.22.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.18.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.25.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.5
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.24.4
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta1
✔ buckets.source.toolkit.fluxcd.io/v1beta1
✔ gitrepositories.source.toolkit.fluxcd.io/v1beta1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta1
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta1
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta1
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta1
✔ providers.notification.toolkit.fluxcd.io/v1beta1
✔ receivers.notification.toolkit.fluxcd.io/v1beta1
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

@stefanprodan
Member

The image-reflector-controller has nothing to do with Helm. Can you please post here the kubectl describe deployment output for the controller that runs into OOM?
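
For reference, that output can be collected with a command along these lines (assuming Flux is installed in the default flux-system namespace):

kubectl -n flux-system describe deployment image-reflector-controller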

@Andrea-Gallicchio
Author

Name:                   image-reflector-controller
Namespace:              flux-system
CreationTimestamp:      Thu, 23 Dec 2021 11:29:24 +0100
Labels:                 app.kubernetes.io/instance=flux-system
                        app.kubernetes.io/part-of=flux
                        app.kubernetes.io/version=v0.30.2
                        control-plane=controller
                        kustomize.toolkit.fluxcd.io/name=flux-system
                        kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:            deployment.kubernetes.io/revision: 6
Selector:               app=image-reflector-controller
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=image-reflector-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  image-reflector-controller
  Containers:
   manager:
    Image:       ghcr.io/fluxcd/image-reflector-controller:v0.18.0
    Ports:       8080/TCP, 9440/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    Limits:
      cpu:     100m
      memory:  640Mi
    Requests:
      cpu:      50m
      memory:   384Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:   (v1:metadata.namespace)
    Mounts:
      /data from data (rw)
      /tmp from temp (rw)
  Volumes:
   temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
   data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   image-reflector-controller-db97c765d (1/1 replicas created)
Events:          <none>

@stefanprodan transferred this issue from fluxcd/flux2 on Jul 13, 2022
@pjbgf
Member

pjbgf commented Aug 8, 2022

@Andrea-Gallicchio can you confirm whether there was anything abnormal in the logs just before the OOM occurred?
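
If the pod has already been restarted, the logs of the killed container instance can usually still be retrieved from the previous run, for example like this (namespace and label taken from the describe output above):

POD=$(kubectl -n flux-system get pod -l app=image-reflector-controller -o jsonpath='{.items[0].metadata.name}')
kubectl -n flux-system logs "$POD" --previous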

@Nitive

Nitive commented Dec 26, 2023

We reproduce this problem regularly.

Before the OOM kill there is nothing unusual in the logs, just the regular scanning for new tags:

2023-12-26T06:45:47+04:00	{"level":"info","ts":"2023-12-26T02:45:47.803Z","msg":"Latest image tag for 'public.ecr.aws/gravitational/teleport-distroless' resolved to 14.2.4","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"teleport","namespace":"flux-system"},"namespace":"flux-system","name":"teleport","reconcileID":"4f4771ff-7dd2-4b8e-9803-075f0a2460c4"}
2023-12-26T06:45:41+04:00	{"level":"info","ts":"2023-12-26T02:45:41.332Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"7a100653-c4f0-45c6-aac6-a15f09f01de6"}
2023-12-26T06:45:41+04:00	{"level":"info","ts":"2023-12-26T02:45:41.312Z","msg":"Latest image tag for 'grafana/promtail' resolved to 2.9.3","controller":"imagepolicy","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImagePolicy","ImagePolicy":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"f0a9b715-e8ed-46fd-972a-e4852b2746a2"}
2023-12-26T06:45:41+04:00	{"level":"info","ts":"2023-12-26T02:45:41.296Z","msg":"no new tags found, next scan in 5m0s","controller":"imagerepository","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageRepository","ImageRepository":{"name":"promtail","namespace":"flux-system"},"namespace":"flux-system","name":"promtail","reconcileID":"9c4644d4-bed0-4c0a-ab90-d74fa197f61c"}

The main problem is that after the OOM kill the container can't recover and enters the CrashLoopBackOff state.

Here are the logs from the container starting up after the OOM kill:

2023-12-26T06:48:44+04:00	{"level":"info","ts":"2023-12-26T02:48:44.414Z","logger":"runtime","msg":"attempting to acquire leader lease flux-system/image-reflector-controller-leader-election...\n"}
2023-12-26T06:48:44+04:00	{"level":"info","ts":"2023-12-26T02:48:44.413Z","msg":"starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
2023-12-26T06:48:44+04:00	{"level":"info","ts":"2023-12-26T02:48:44.309Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
2023-12-26T06:48:44+04:00	{"level":"info","ts":"2023-12-26T02:48:44.308Z","logger":"setup","msg":"starting manager"}
2023-12-26T06:48:44+04:00	{"level":"info","ts":"2023-12-26T02:48:44.302Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
2023-12-26T06:48:44+04:00	badger 2023/12/26 02:48:44 INFO: Deleting empty file: /data/000004.vlog
2023-12-26T06:48:44+04:00	badger 2023/12/26 02:48:44 INFO: Set nextTxnTs to 1657
2023-12-26T06:48:44+04:00	badger 2023/12/26 02:48:44 INFO: Discard stats nextEmptySlot: 0
2023-12-26T06:48:44+04:00	badger 2023/12/26 02:48:44 INFO: All 0 tables opened in 0s

@mikkoc

mikkoc commented May 7, 2024

We are seeing the same issue, despite increasing memory requests and limits: currently at 512M/1G.

There is nothing in the logs just before the OOMKilled event (which happened on Mon, 06 May 2024 22:17:39 +0100).

(screenshot attached)
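
To see how close the container actually gets to its limit over time, per-container usage can be sampled with metrics-server (assuming it is installed in the cluster):

kubectl -n flux-system top pod -l app=image-reflector-controller --containers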
