
[redis] Failure to elect new leader #353

Open
glrf opened this issue Sep 9, 2021 · 0 comments · May be fixed by #352
Labels
bug Something isn't working

glrf commented Sep 9, 2021

Describe the bug

When the leader pod is restarted, there is a chance that the remaining nodes are unable to agree on a new leader.

Additional context

This should be fixed by bitnami/charts#7278 and further improved by bitnami/charts#7333.

In production this seems to be largely mitigated by setting downAfterMilliseconds and failoverTimeout, but the failure can still occur.
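
For reference, a rough sketch of such an override with the Bitnami chart. The release name, namespace, and timeout values are illustrative, and the exact value names should be checked against the chart's values.yaml:

  # Illustrative helm override (names and values are examples, not chart defaults)
  helm upgrade test bitnami/redis \
    --namespace redis-test \
    --set sentinel.enabled=true \
    --set sentinel.downAfterMilliseconds=10000 \
    --set sentinel.failoverTimeout=10000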

To Reproduce

Steps to reproduce the behavior:

  1. Use the defaults for downAfterMilliseconds and failoverTimeout
  2. Restart the leader Pod, or trigger a rolling update (see the example commands after this list)
  3. Be (un-)lucky
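
For example, using the release and namespace from the logs below (test in redis-test); the StatefulSet name test-redis-node is assumed from the pod names:

  # Restart the current leader pod (test-redis-node-0 was the master here)
  kubectl -n redis-test delete pod test-redis-node-0
  # ...or trigger a rolling update of the whole StatefulSet
  kubectl -n redis-test rollout restart statefulset test-redis-node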

Logs

Pods after restarting the master:

NAME                READY   STATUS             RESTARTS   AGE    IP            NODE                      NOMINATED NODE   READINESS GATES
test-redis-node-2   2/2     Running            0          119s   10.42.0.166   k3d-projectsyn-server-0   <none>           <none>
test-redis-node-1   2/2     Running            0          69s    10.42.0.167   k3d-projectsyn-server-0   <none>           <none>
test-redis-node-0   0/2     CrashLoopBackOff   2          29s    10.42.0.168   k3d-projectsyn-server-0   <none>           <none>

Log of former leader test-redis-node-0

 12:40:20.33 INFO  ==> test-redis-headless.redis-test.svc.cluster.local has my IP: 10.42.0.168
 12:40:20.34 INFO  ==> Cleaning sentinels in sentinel node: 10.42.0.167
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1
 12:40:25.34 INFO  ==> Cleaning sentinels in sentinel node: 10.42.0.166
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1
 12:40:30.35 INFO  ==> Sentinels clean up done
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at test-redis.redis-test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

Log of test-redis-node-1

1:X 09 Sep 2021 12:38:47.557 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 09 Sep 2021 12:38:47.557 # Redis version=6.2.1, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 09 Sep 2021 12:38:47.557 # Configuration loaded
1:X 09 Sep 2021 12:38:47.558 * monotonic clock: POSIX clock_gettime
1:X 09 Sep 2021 12:38:47.558 * Running mode=sentinel, port=26379.
1:X 09 Sep 2021 12:38:47.559 # Sentinel ID is 93d594182506a64e9c0fb3e893ec67dbd7d3255d
1:X 09 Sep 2021 12:38:47.559 # +monitor master mymaster 10.42.0.163 6379 quorum 2
1:X 09 Sep 2021 12:39:22.360 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:24.228 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:30.849 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:32.460 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:44.147 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:44.743 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:04.228 # +sdown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:09.289 # +new-epoch 1
1:X 09 Sep 2021 12:40:09.291 # +vote-for-leader 362c939b89efbabc09ba1d11a50146bccd5614d9 1
1:X 09 Sep 2021 12:40:09.520 # +odown master mymaster 10.42.0.163 6379 #quorum 2/2
1:X 09 Sep 2021 12:40:09.520 # Next failover delay: I will not start a failover before Thu Sep  9 12:40:46 2021
1:X 09 Sep 2021 12:40:20.346 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:21.207 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:40.375 # +sdown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:45.507 # +new-epoch 2
1:X 09 Sep 2021 12:40:45.511 # +vote-for-leader 362c939b89efbabc09ba1d11a50146bccd5614d9 2
1:X 09 Sep 2021 12:40:45.652 # +odown master mymaster 10.42.0.163 6379 #quorum 2/2
1:X 09 Sep 2021 12:40:45.652 # Next failover delay: I will not start a failover before Thu Sep  9 12:41:22 2021
1:X 09 Sep 2021 12:41:15.777 # -odown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:19.754 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:20.002 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.795 # +sdown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.857 # +odown master mymaster 10.42.0.163 6379 #quorum 2/2
1:X 09 Sep 2021 12:41:39.857 # +new-epoch 3
1:X 09 Sep 2021 12:41:39.857 # +try-failover master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.872 # +vote-for-leader 93d594182506a64e9c0fb3e893ec67dbd7d3255d 3
1:X 09 Sep 2021 12:41:39.879 # 362c939b89efbabc09ba1d11a50146bccd5614d9 voted for 93d594182506a64e9c0fb3e893ec67dbd7d3255d 3
1:X 09 Sep 2021 12:41:39.943 # +elected-leader master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.943 # +failover-state-select-slave master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:40.005 # -failover-abort-no-good-slave master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:40.076 # Next failover delay: I will not start a failover before Thu Sep  9 12:42:16 2021
1:X 09 Sep 2021 12:42:16.041 # +new-epoch 4
1:X 09 Sep 2021 12:42:16.046 # +vote-for-leader 362c939b89efbabc09ba1d11a50146bccd5614d9 4
1:X 09 Sep 2021 12:42:16.089 # Next failover delay: I will not start a failover before Thu Sep  9 12:42:52 2021

The remaining nodes are unable to elect a new leader and keep trying to connect to the non-existent former leader (10.42.0.163).
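
To inspect the stuck state, the Sentinel view can be queried from one of the surviving sentinel containers, e.g. (a sketch; add -a <password> or set REDISCLI_AUTH if auth is enabled):

  # Which master does this Sentinel currently track, and is quorum reachable?
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
  redis-cli -p 26379 sentinel ckquorum mymaster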

Expected behavior

The remaining nodes elect a new leader among themselves and the deployment recovers without manual intervention.

Environment (please complete the following information):

  • Chart: latest
  • Helm: v3
  • Kubernetes API: v1.21
  • Distribution (Openshift, Rancher, etc.): k3s