The operator exits due to an error retrieving the resource lock during ingress update #97

Open
eranco74 opened this issue Sep 11, 2023 · 0 comments

Comments

@eranco74
Contributor

It seems that the ingress update disrupts the operator's lease renewal: the API server becomes unreachable, the operator cannot renew its leader lease, and it exits, which delays the CR reconciliation.

The operator exits while it's waiting for the ingress to get updated:

2023-09-11T07:56:23Z	INFO	Waiting for ingress to update	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "0d68e259-8df3-4af2-a8cb-3cc0015b9c64"}
2023-09-11T07:56:33Z	ERROR	Reconciler error	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "0d68e259-8df3-4af2-a8cb-3cc0015b9c64", "error": "dial tcp 192.168.127.10:443: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
2023-09-11T07:56:33Z	INFO	validation succeeded	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "c2eb83a7-1de7-4d10-b0a7-22c3919ea01d"}
2023-09-11T07:56:33Z	INFO	TLS cert already exists for Ingresses	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "c2eb83a7-1de7-4d10-b0a7-22c3919ea01d"}
2023-09-11T07:56:33Z	INFO	Using user provided API certificate	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "c2eb83a7-1de7-4d10-b0a7-22c3919ea01d", "namespace": "relocation", "name": "new-api-certs"}
2023-09-11T07:56:33Z	ERROR	Reconciler error	{"controller": "clusterrelocation", "controllerGroup": "rhsyseng.github.io", "controllerKind": "ClusterRelocation", "ClusterRelocation": {"name":"cluster"}, "namespace": "", "name": "cluster", "reconcileID": "c2eb83a7-1de7-4d10-b0a7-22c3919ea01d", "error": "dial tcp 192.168.127.10:443: connect: connection refused"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235
E0911 07:56:52.138017       1 leaderelection.go:330] error retrieving resource lock openshift-operators/f4de3632.rhsyseng.github.io: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-operators/leases/f4de3632.rhsyseng.github.io": dial tcp 172.30.0.1:443: connect: connection refused
E0911 07:57:02.139324       1 leaderelection.go:330] error retrieving resource lock openshift-operators/f4de3632.rhsyseng.github.io: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-operators/leases/f4de3632.rhsyseng.github.io": dial tcp 172.30.0.1:443: connect: connection refused
E0911 07:58:23.958624       1 leaderelection.go:330] error retrieving resource lock openshift-operators/f4de3632.rhsyseng.github.io: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-operators/leases/f4de3632.rhsyseng.github.io": dial tcp 172.30.0.1:443: connect: connection refused
E0911 07:58:33.960122       1 leaderelection.go:330] error retrieving resource lock openshift-operators/f4de3632.rhsyseng.github.io: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-operators/leases/f4de3632.rhsyseng.github.io": dial tcp 172.30.0.1:443: connect: connection refused

Once the new instance starts, it waits for over a minute (07:58:56 to 08:00:05 in the log below) while trying to acquire the lease:

2023-09-11T07:58:56Z	INFO	setup	starting manager
I0911 07:58:56.215805       1 leaderelection.go:248] attempting to acquire leader lease openshift-operators/f4de3632.rhsyseng.github.io...
2023-09-11T07:58:56Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
2023-09-11T07:58:56Z	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0911 08:00:05.290010       1 leaderelection.go:258] successfully acquired lease openshift-operators/f4de3632.rhsyseng.github.io
2023-09-11T08:00:05Z	DEBUG	events	cluster-relocation-operator-controller-manager-75666d5c5-tmn66_aca225e6-ef45-4574-b015-8132b0091818 became leader	

Expected Behavior

Current Behavior

Possible Solution

  1. Don't use a leader lease if there's a single instance of the operator
  2. Use a longer lease duration

Steps to Reproduce (for bugs)

  1. I noticed this when I applied the CR with a new domain and without an ingress cert (so the operator generated a self-signed one).

Context

This issue delays the ClusterRelocation CR reconciliation.
I applied the CR on a stable cluster that had been installed hours earlier.

Regression

Unsure

Your Environment

  • Version used (cluster-relocation-operator):
    latest operator from operator HUB
  • Environment name and version (e.g. OCP v1.12.20):
    4.10