
[operator] Collector fails with featureGate errors when Upgrading the Operator to chart version 0.68.1 #1320

Open
jlcrow opened this issue Aug 28, 2024 · 25 comments
Labels
chart:operator Issue related to opentelemetry-operator helm chart

Comments

@jlcrow

jlcrow commented Aug 28, 2024

Performed a routine helm upgrade from chart version 0.65.1 to 0.68.1. After the upgrade, the OpenTelemetry collector created by the operator will not start. There are no errors in the operator; the collector errors out and crash-loops:

otel-prometheus-collector-0                        0/1     CrashLoopBackOff   7 (4m20s ago)   15m
 
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/08/28 19:23:44 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled

Collector config

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring
spec:
  mode: statefulset
  podAnnotations:
     sidecar.istio.io/inject: "false"
  targetAllocator:
    serviceAccount: opentelemetry-targetallocator-sa
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  priorityClassName: highest-priority
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  config: |
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90    
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      memory_ballast: {}
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ '0.0.0.0:8888' ]         
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1         
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics" 
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      prometheusremotewrite:
        endpoint: https://<internal mimir endpoint>
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
          metrics:
            address: "0.0.0.0:8888"
            level: basic
          logs:
            level: "warn"  
      extensions:
      - health_check
      - memory_ballast
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - prometheusremotewrite

dosubot bot commented Aug 28, 2024

Hey there, @jlcrow! 👋 I'm here to assist you with any bugs, questions, or contributions you have while waiting for a human maintainer.

The issue you're encountering with the OpenTelemetry collector failing with featureGate errors and crashing in a loop after upgrading to chart version 0.68.1 is due to changes in the OpenTelemetryCollector CRD. Specifically, the confmap.unifyEnvVarExpansion feature gate is now stable and cannot be disabled.

To resolve this issue, you need to ensure that your CRDs are properly annotated and labeled before upgrading. Here are the steps:

  1. Annotate and label the CRDs:

    RELEASE_NAME=my-opentelemetry-operator
    RELEASE_NAMESPACE=opentelemetry-operator-system
    kubectl annotate crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io \
      meta.helm.sh/release-name=${RELEASE_NAME} \
      meta.helm.sh/release-namespace=${RELEASE_NAMESPACE}
    kubectl label crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io app.kubernetes.io/managed-by=Helm
  2. Follow the TLS certificate requirements and upgrade instructions provided in the OpenTelemetry Operator Helm Chart documentation.

Additionally, if you are upgrading from version 0.67 to 0.68, note that the preset.kubernetesEvents preset now excludes DELETED watch types. If you want to keep ingesting DELETED updates for Kubernetes Events, you will need to configure the k8sobjectsreceiver manually [1][2][3].
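
If you do need those DELETED events, a minimal hand-written receiver sketch (this is my reading of the k8sobjectsreceiver options; double-check field names against your collector version, and remember to wire the receiver into a logs pipeline):

receivers:
  k8sobjects:
    objects:
      - name: events
        mode: watch
        # the 0.68 preset adds an exclude_watch_type filter for DELETED;
        # omitting that filter here keeps DELETED watch updates flowing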

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added the chart:operator Issue related to opentelemetry-operator helm chart label Aug 28, 2024
@jaronoff97
Contributor

@jlcrow this is a known bug and will be fixed by this PR in the operator: open-telemetry/opentelemetry-operator#3074

@jaronoff97
Contributor

separately, I would recommend upgrading the collector CR's apiVersion to v1beta1 when you get a chance :)
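
Concretely, that is mostly a matter of bumping apiVersion and turning the config: | string into structured YAML; a rough sketch (heavily trimmed, not the full config from this issue):

apiVersion: opentelemetry.io/v1beta1   # previously opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
spec:
  mode: statefulset
  config:                              # v1beta1 takes structured YAML here, not a string block
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [debug]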

@jaronoff97
Contributor

solved by open-telemetry/opentelemetry-operator#3074

this will be fixed in the next operator helm release. Thank you for your patience :)

@jaronoff97
Contributor

@jlcrow can you upgrade to latest and let me know if that fixes things?
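
For example (release name and namespace are placeholders; use whatever your operator install actually uses):

helm repo update open-telemetry
helm upgrade my-opentelemetry-operator open-telemetry/opentelemetry-operator \
  -n opentelemetry-operator-system --reuse-values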

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97
Just did a helm repo update open-telemetry and tried upgrading to 0.69.0:

open-telemetry/opentelemetry-operator  	0.69.0       	0.108.0    	OpenTelemetry Operator Helm chart for Kubernetes

Still seeing errors when the collector comes up

otel-prometheus-collector-0                       0/1     Error       1 (5s ago)    11s
otel-prometheus-targetallocator-7bb6d4d7b-bq8q7   1/1     Running     0             12s
➜  cluster-management git: klon monitoring-system otel-prometheus-collector-0                                                                                                                               
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/09/10 18:14:02 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled

@jaronoff97
Contributor

hmm any logs from the operator?

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 Nothing on the operator but info logs for the manager container

{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}

@jaronoff97
Contributor

one note, I tried running your config and you should know that the memory_ballast extension has been removed. testing this locally now though!
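
Roughly, that means dropping it from both places it appears in the config above (your existing memory_limiter processor is unaffected), e.g.:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  # memory_ballast: {}    <- remove; the extension no longer exists in newer collectors
service:
  extensions:
  - health_check
  # - memory_ballast      <- remove here as well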

@jaronoff97
Contributor

jaronoff97 commented Sep 10, 2024

I saw this message from the otel operator:

{"level":"INFO","timestamp":"2024-09-10T18:41:10Z","logger":"collector-upgrade","message":"instance upgraded","name":"otel-prometheus","namespace":"default","version":"0.108.0"}

and this is working now:

⫸ k logs otel-prometheus-collector-0
2024-09-10T18:41:15.297Z	warn	[email protected]/warning.go:42	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.	{"kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-09-10T18:41:15.302Z	warn	[email protected]/warning.go:42	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.	{"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}

Note: the target allocator is failing to startup because it's missing permissions on its service account, but otherwise things worked fully as expected.

@jaronoff97
Contributor

before:

  Containers:
   otc-container:
    Image:       otel/opentelemetry-collector-k8s:0.104.0
    Ports:       8888/TCP, 4317/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost

After:

  Containers:
   otc-container:
    Image:       otel/opentelemetry-collector-k8s:0.108.0
    Ports:       8888/TCP, 4317/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-component.UseLocalHostAsDefaultHost

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 Should have provided my latest config:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring-system
spec:
  mode: statefulset
  podAnnotations:
     sidecar.istio.io/inject: "false"
  targetAllocator:
    serviceAccount: opentelemetry-targetallocator-sa
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  priorityClassName: highest-priority
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP          
  config:
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90    
    extensions:
      health_check:
        endpoint: ${K8S_POD_IP}:13133
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ "${K8S_POD_IP}:8888" ]         
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1         
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics" 
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: ${K8S_POD_IP}:4317
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir/api/v1/push
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
          metrics:
            address: "${K8S_POD_IP}:8888"
            level: basic
          logs:
            level: "warn"  
      extensions:
      - health_check
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - prometheusremotewrite

@jaronoff97
Contributor

also note, I needed to get rid of the priority class name and the service account name, which weren't provided. but thanks for updating, giving it a try...

@jaronoff97
Contributor

yeah I tested going from 0.65.0 -> 0.69.0, which was fully successful with this configuration:

Config
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
spec:
  mode: statefulset
  podAnnotations:
    sidecar.istio.io/inject: "false"
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  config:
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90
    extensions:
      health_check:
        endpoint: ${K8S_POD_IP}:13133
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: "otel-collector"
            scrape_interval: 10s
            static_configs:
            - targets: ["${K8S_POD_IP}:8888"]
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics"
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: ${K8S_POD_IP}:4317
    exporters:
      debug: {}
    service:
      telemetry:
        metrics:
          address: "${K8S_POD_IP}:8888"
          level: basic
        logs:
          level: "warn"
      extensions:
      - health_check
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - debug
---
# Source: opentelemetry-kube-stack/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-collector
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - nodes
  - nodes/proxy
  - nodes/metrics
  - nodes/stats
  - services
  - endpoints
  - pods
  - events
  - secrets
  verbs: ["get", "list", "watch"]
- apiGroups: ["monitoring.coreos.com"]
  resources:
  - servicemonitors
  - podmonitors
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- apiGroups: ["discovery.k8s.io"]
  resources:
  - endpointslices
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
  verbs: ["get"]
- apiGroups:
  - ""
  resources:
  - events
  - namespaces
  - namespaces/status
  - nodes
  - nodes/spec
  - pods
  - pods/status
  - replicationcontrollers
  - replicationcontrollers/status
  - resourcequotas
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
- apiGroups: ["events.k8s.io"]
  resources: ["events"]
  verbs: ["watch", "list"]
---
# Source: opentelemetry-kube-stack/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-daemon
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-collector
subjects:
# quirk of the Operator
- kind: ServiceAccount
  name: "otel-prometheus-collector"
  namespace: default
- kind: ServiceAccount
  name: otel-prometheus-targetallocator
  namespace: default

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 idk man, the feature gates seem to be sticking around for me when the operator is deploying the collector. I'm running on GKE; I don't think that should matter, though.

  otc-container:
    Container ID:  containerd://724dfd2080e9b46afac3fde71cb9e56747d8c6d352cd7c82b9baf272ed40a301
    Image:         otel/opentelemetry-collector-contrib:0.106.1
    Image ID:      docker.io/otel/opentelemetry-collector-contrib@sha256:12a6cab81088666668e312f1e814698f14f205d879181ec5f770301ab17692c2
    Ports:         8888/TCP, 4317/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
  otc-container:
    Container ID:  containerd://1cf06d1b6368d070ceb3a9f9448351b1638140a459ee9dbb2b9dbf7e3b173610
    Image:         otel/opentelemetry-collector-contrib:0.108.0
    Image ID:      docker.io/otel/opentelemetry-collector-contrib@sha256:923eb1cfae32fe09676cfd74762b2b237349f2273888529594f6c6ffe1fb3d7e
    Ports:         8888/TCP, 4317/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost

@jaronoff97
Contributor

what was the version before? I thought it was 0.65.1, but want to confirm. And did you install the operator helm chart with upgrades disabled or any other flags? If I can get a local repro, I can try to get a fix out ASAP; otherwise it would be helpful to enable debug logging on the operator.
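
If it helps, my understanding is that the chart exposes manager.extraArgs and the operator accepts the controller-runtime zap flags, so something like this in your operator values should turn on debug logs (worth double-checking against your chart version):

manager:
  extraArgs:
    - --zap-log-level=debug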

@jlcrow
Author

jlcrow commented Sep 10, 2024

I was able to make it to 0.67.0; any version later breaks the same way

@jaronoff97
Contributor

yeah i just did this exact process:

  • install 0.67.0
  • install config above
  • check success ✅
  • upgrade to 0.69.0
  • check success ✅

One notable difference is that your 0.67.0 operator collector install has the -confmap.unifyEnvVarExpansion featuregate on it whereas mine does not. If you delete and recreate the otelcol object, is it still present? Another option would be to upgrade to operator 0.69.0 and then delete and recreate the otelcol object, at which point it should be gone... If that doesn't work or isn't possible, let me know and we can sort out some other options.
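
For example, something along these lines (assuming the CR name/namespace from this thread and a local copy of the manifest):

kubectl -n monitoring-system delete opentelemetrycollector otel-prometheus
kubectl -n monitoring-system apply -f otel-prometheus-collector.yaml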

@jaronoff97
Contributor

another user who reported a similar issue resolved it by doing a clean install of the operator: #1339 (comment)

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97

Looks like after a full uninstall and reinstall, the flag is no longer present and the collector comes up successfully

@jaronoff97
Contributor

okay, that's good, but I'm not satisfied with it. I'm going to keep investigating here and try to get a repro... I'm thinking maybe going from an older version to one that adds the flag, back to the previous version, and then up to latest may cause it.

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 I spoke too soon. Somewhere along the line the targetallocator stopped picking up my monitors and I lost almost all of my metrics. I just went back to the alpha spec and 0.67 to get things working again.

@jaronoff97
Contributor

that's probably due to the permissions change I alluded to here. This was the error message I saw:

{"level":"error","ts":"2024-09-10T18:41:53Z","logger":"setup.prometheus-cr-watcher","msg":"Failed to create namespace informer in promOperator CRD watcher","error":"missing list/watch permissions on the 'namespaces' resource: missing \"list\" permission on resource \"namespaces\" (group: \"\") for all namespaces: missing \"watch\" permission on resource \"namespaces\" (group: \"\") for all namespaces","stacktrace":"github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator/watcher.NewPrometheusCRWatcher\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/watcher/promOperator.go:115\nmain.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/main.go:119\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.6/x64/src/runtime/proc.go:271"}

@jaronoff97
Contributor

apiGroups: [""]
resources:
- namespaces
verbs: ["get", "list", "watch"]

this block should do the trick, but I'm on mobile rn so sorry if it's not exactly right 😅
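
Spelled out as standalone objects, a sketch could look like this (names are illustrative; the binding subject should be whichever service account the target allocator pod actually runs as, which in this thread has been both opentelemetry-targetallocator-sa and the generated otel-prometheus-targetallocator):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-targetallocator-namespaces
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-targetallocator-namespaces
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-targetallocator-namespaces
subjects:
- kind: ServiceAccount
  name: otel-prometheus-targetallocator
  namespace: monitoring-system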

@jlcrow
Author

jlcrow commented Sep 24, 2024

@jaronoff97 I spoke too soon. Somewhere along the line the targetallocator stopped picking up my monitors and I lost almost all of my metrics. I just went back to the alpha spec and 0.67 to get things working again.

I'm still having weird issues with the targetallocator on one of my clusters - it consistently fails to pick up any ServiceMonitor or PodMonitor CRs. I tried a number of things, including a full uninstall and reinstall, working with version 0.69 of the chart and 0.108 of the collector. I checked the RBAC for the service account and the permissions appear to be there.

kubectl auth can-i get podmonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator                                                             
yes

kubectl auth can-i get servicemonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator                                                   
yes

In the end, on a whim, I reverted the API back to v1alpha1, and when I deployed the spec the targetallocator /scrape_configs endpoint started showing all the podmonitors and servicemonitors instead of only the default prometheus config that's in the chart. I actually don't understand why this isn't working correctly, as I have another operator on another GKE cluster with the same config that doesn't seem to have an issue with the beta API.
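
A related check, given the namespace-informer error quoted earlier in the thread, is the namespaces permission the allocator also needs (same service account as above, purely illustrative):

kubectl auth can-i list namespaces --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
kubectl auth can-i watch namespaces --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator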
