
[redis] Failure to elect new leader #353

Open
glrf opened this issue Sep 9, 2021 · 0 comments · May be fixed by #352
Labels
bug Something isn't working

glrf commented Sep 9, 2021

Describe the bug

When the leader pod is restarted, there is a chance that the remaining nodes are unable to agree on a new leader.

Additional context

This should be fixed by bitnami/charts#7278 and further improved by bitnami/charts#7333.

In production this seems to be largely mitigated by setting downAfterMilliseconds and failoverTimeout, but the failure can still occur.
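
For reference, a rough sketch of such an override with the Bitnami chart. The release name, namespace, and timeout values are illustrative, and the exact value names should be checked against the chart's values.yaml:

  # Illustrative helm override (names and values are examples, not chart defaults)
  helm upgrade test bitnami/redis \
    --namespace redis-test \
    --set sentinel.enabled=true \
    --set sentinel.downAfterMilliseconds=10000 \
    --set sentinel.failoverTimeout=10000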

To Reproduce

Steps to reproduce the behavior:

  1. Use the defaults for downAfterMilliseconds and failoverTimeout
  2. Restart the leader Pod, or trigger a rolling update (see the example commands after this list)
  3. Be (un-)lucky
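
For example, using the release and namespace from the logs below (test in redis-test); the StatefulSet name test-redis-node is assumed from the pod names:

  # Restart the current leader pod (test-redis-node-0 was the master here)
  kubectl -n redis-test delete pod test-redis-node-0
  # ...or trigger a rolling update of the whole StatefulSet
  kubectl -n redis-test rollout restart statefulset test-redis-node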

Logs

Pods after restarting the master:

NAME                READY   STATUS             RESTARTS   AGE    IP            NODE                      NOMINATED NODE   READINESS GATES
test-redis-node-2   2/2     Running            0          119s   10.42.0.166   k3d-projectsyn-server-0   <none>           <none>
test-redis-node-1   2/2     Running            0          69s    10.42.0.167   k3d-projectsyn-server-0   <none>           <none>
test-redis-node-0   0/2     CrashLoopBackOff   2          29s    10.42.0.168   k3d-projectsyn-server-0   <none>           <none>

Log of former leader test-redis-node-0

 12:40:20.33 INFO  ==> test-redis-headless.redis-test.svc.cluster.local has my IP: 10.42.0.168
 12:40:20.34 INFO  ==> Cleaning sentinels in sentinel node: 10.42.0.167
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1
 12:40:25.34 INFO  ==> Cleaning sentinels in sentinel node: 10.42.0.166
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1
 12:40:30.35 INFO  ==> Sentinels clean up done
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at test-redis.redis-test.svc.cluster.local:26379: Connection refused
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at -p:6379: Name or service not known

Log of test-redis-node-1

1:X 09 Sep 2021 12:38:47.557 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 09 Sep 2021 12:38:47.557 # Redis version=6.2.1, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 09 Sep 2021 12:38:47.557 # Configuration loaded
1:X 09 Sep 2021 12:38:47.558 * monotonic clock: POSIX clock_gettime
1:X 09 Sep 2021 12:38:47.558 * Running mode=sentinel, port=26379.
1:X 09 Sep 2021 12:38:47.559 # Sentinel ID is 93d594182506a64e9c0fb3e893ec67dbd7d3255d
1:X 09 Sep 2021 12:38:47.559 # +monitor master mymaster 10.42.0.163 6379 quorum 2
1:X 09 Sep 2021 12:39:22.360 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:24.228 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:30.849 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:32.460 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:44.147 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:39:44.743 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:04.228 # +sdown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:09.289 # +new-epoch 1
1:X 09 Sep 2021 12:40:09.291 # +vote-for-leader 362c939b89efbabc09ba1d11a50146bccd5614d9 1
1:X 09 Sep 2021 12:40:09.520 # +odown master mymaster 10.42.0.163 6379 #quorum 2/2
1:X 09 Sep 2021 12:40:09.520 # Next failover delay: I will not start a failover before Thu Sep  9 12:40:46 2021
1:X 09 Sep 2021 12:40:20.346 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:21.207 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:40.375 # +sdown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:40:45.507 # +new-epoch 2
1:X 09 Sep 2021 12:40:45.511 # +vote-for-leader 362c939b89efbabc09ba1d11a50146bccd5614d9 2
1:X 09 Sep 2021 12:40:45.652 # +odown master mymaster 10.42.0.163 6379 #quorum 2/2
1:X 09 Sep 2021 12:40:45.652 # Next failover delay: I will not start a failover before Thu Sep  9 12:41:22 2021
1:X 09 Sep 2021 12:41:15.777 # -odown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:19.754 # +reset-master master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:20.002 * +sentinel sentinel 362c939b89efbabc09ba1d11a50146bccd5614d9 10.42.0.166 26379 @ mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.795 # +sdown master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.857 # +odown master mymaster 10.42.0.163 6379 #quorum 2/2
1:X 09 Sep 2021 12:41:39.857 # +new-epoch 3
1:X 09 Sep 2021 12:41:39.857 # +try-failover master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.872 # +vote-for-leader 93d594182506a64e9c0fb3e893ec67dbd7d3255d 3
1:X 09 Sep 2021 12:41:39.879 # 362c939b89efbabc09ba1d11a50146bccd5614d9 voted for 93d594182506a64e9c0fb3e893ec67dbd7d3255d 3
1:X 09 Sep 2021 12:41:39.943 # +elected-leader master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:39.943 # +failover-state-select-slave master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:40.005 # -failover-abort-no-good-slave master mymaster 10.42.0.163 6379
1:X 09 Sep 2021 12:41:40.076 # Next failover delay: I will not start a failover before Thu Sep  9 12:42:16 2021
1:X 09 Sep 2021 12:42:16.041 # +new-epoch 4
1:X 09 Sep 2021 12:42:16.046 # +vote-for-leader 362c939b89efbabc09ba1d11a50146bccd5614d9 4
1:X 09 Sep 2021 12:42:16.089 # Next failover delay: I will not start a failover before Thu Sep  9 12:42:52 2021

The remaining nodes are unable to elect a new leader and keep trying to connect to the non-existent former leader (10.42.0.163).
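
To inspect the stuck state, the Sentinel view can be queried from one of the surviving sentinel containers, e.g. (a sketch; add -a <password> or set REDISCLI_AUTH if auth is enabled):

  # Which master does this Sentinel currently track, and is quorum reachable?
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
  redis-cli -p 26379 sentinel ckquorum mymaster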

Expected behavior

The remaining nodes elect a new leader among themselves and the deployment recovers without manual intervention.

Environment (please complete the following information):

  • Chart: latest
  • Helm: v3
  • Kubernetes API: v1.21
  • Distribution (Openshift, Rancher, etc.): k3s