
Inaccurate RKE2ControlPlane status when maxSurge is set to 0 #356

Open
zioc opened this issue Jul 3, 2024 · 0 comments

Labels: kind/bug, priority/important-soon, triage/accepted

Comments

zioc commented Jul 3, 2024

What happened:

This issue was observed while searching for a workaround for the following issue in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1412

It is somewhat related to #355.

When the control plane is upgraded with the following strategy:

    rolloutStrategy:
      rollingUpdate:
        maxSurge: 0
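
For context, this strategy lives under spec.rolloutStrategy on the RKE2ControlPlane object. A minimal sketch of the relevant manifest is shown below; the name and replica count are illustrative. With maxSurge set to 0, the controller deletes an old machine before creating its replacement, which is why one machine is in the Deleting phase in the listing that follows.

    apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
    kind: RKE2ControlPlane
    metadata:
      name: example-control-plane   # illustrative name
    spec:
      replicas: 3                   # illustrative count
      rolloutStrategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 0               # scale down first, then create the replacement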

The RKE2ControlPlane Ready condition is set to True even though the last control plane machine is still being rolled out:

NAME                                                  CLUSTER                           NODENAME                                          PROVIDERID                                                                                                                      PHASE      AGE     VERSION
mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-2   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-2/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv       Deleting   81m     v1.28.8
mgmt-1353958806-rke2-capm3-virt-control-plane-shl4r   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-0   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-0/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-nqls5       Running    8m56s   v1.28.8
mgmt-1353958806-rke2-capm3-virt-control-plane-w4ldw   mgmt-1353958806-rke2-capm3-virt   mgmt-1353958806-rke2-capm3-virt-management-cp-1   metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-1/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-422g7       Running    31m     v1.28.8
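
For reference, a listing like the one above comes from the standard Cluster API Machine printer columns, e.g.:

    kubectl get machines -n sylva-system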

We can indeed see that even though the mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4 machine is being deleted, it still has a Ready=True condition:

- apiVersion: cluster.x-k8s.io/v1beta1
  kind: Machine
  metadata:
    annotations:
      controlplane.cluster.x-k8s.io/rke2-server-configuration: '{"tlsSan":["172.18.0.2"],"disableComponents":{"pluginComponents":["rke2-ingress-nginx"]},"cni":"calico","etcd":{"backupConfig":{},"customConfig":{"extraArgs":["auto-compaction-mode=periodic","auto-compaction-retention=12h","quota-backend-bytes=4294967296"]}}}'
    creationTimestamp: "2024-06-29T21:20:17Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-06-29T22:40:07Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generation: 2
    labels:
      cluster.x-k8s.io/cluster-name: mgmt-1353958806-rke2-capm3-virt
      cluster.x-k8s.io/control-plane: ""
      name: mgmt-1353958806-rke2-capm3-virt-control-plane-kw7q4
    namespace: sylva-system
    ownerReferences:
    - apiVersion: controlplane.cluster.x-k8s.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: RKE2ControlPlane
      name: mgmt-1353958806-rke2-capm3-virt-control-plane
      uid: 2603a3dc-5112-4203-a86e-39d1fe67f365
    resourceVersion: "313188"
    uid: 8eea3ee0-f40d-48fb-8244-6bef3b3491bd
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha1
        kind: RKE2Config
        name: mgmt-1353958806-rke2-capm3-virt-control-plane-khrlc
        namespace: sylva-system
        uid: 9428a36d-d5b1-4cc2-a52f-60e4b9d47b0f
      dataSecretName: mgmt-1353958806-rke2-capm3-virt-control-plane-khrlc
    clusterName: mgmt-1353958806-rke2-capm3-virt
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: Metal3Machine
      name: mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv
      namespace: sylva-system
      uid: 768f1aa0-502a-41bb-8904-ecdedbc1c553
    nodeDeletionTimeout: 10s
    nodeDrainTimeout: 5m0s
    providerID: metal3://sylva-system/mgmt-1353958806-rke2-capm3-virt-management-cp-2/mgmt-1353958806-rke2-capm3-virt-cp-2d747566b4-n9sgv
    version: v1.28.8
  status:
    addresses:
    - address: 192.168.100.12
      type: InternalIP
    - address: fe80::be48:57d4:3796:1ceb%ens5
      type: InternalIP
    - address: 192.168.10.12
      type: InternalIP
    - address: fe80::e84c:af6d:e97d:f1e9%ens4
      type: InternalIP
    - address: localhost.localdomain
      type: Hostname
    - address: localhost.localdomain
      type: InternalDNS
    bootstrapReady: true
    conditions:
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: Ready
    - lastTransitionTime: "2024-06-29T22:32:15Z"
      status: "True"
      type: AgentHealthy
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: BootstrapReady
    - lastTransitionTime: "2024-06-29T22:40:07Z"
      message: Draining the node before deletion
      reason: Draining
      severity: Info
      status: "False"
      type: DrainingSucceeded
    - lastTransitionTime: "2024-06-29T22:40:08Z"
      reason: Deleting
      severity: Info
      status: "False"
      type: EtcdMemberHealthy
    - lastTransitionTime: "2024-06-29T21:20:22Z"
      status: "True"
      type: InfrastructureReady
    - lastTransitionTime: "2024-06-29T22:32:16Z"
      status: "True"
      type: NodeHealthy
    - lastTransitionTime: "2024-06-29T21:20:23Z"
      status: "True"
      type: NodeMetadataUpToDate
    - lastTransitionTime: "2024-06-29T22:40:35Z"
      status: "True"
      type: PreDrainDeleteHookSucceeded

Consequently, the controller sets the RKE2ControlPlane as ready here, since we have len(readyMachines) == replicas.

Shouldn't it instead check that the number of machines that are both Ready and UpToDate matches spec.replicas?
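
A minimal sketch of that proposed check, written in Go against the Cluster API types. The helper name and signature are hypothetical, not the provider's actual code, and the desired version is used here as a simple proxy for "up to date" (in practice, up-to-dateness also covers bootstrap and infrastructure template changes):

    package controlplane

    import (
        clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
        "sigs.k8s.io/cluster-api/util/conditions"
    )

    // readyAndUpToDateMachines is a hypothetical helper: a machine only counts
    // toward control plane readiness if it is not being deleted, its Ready
    // condition is True, and it already runs the desired version.
    func readyAndUpToDateMachines(machines []*clusterv1.Machine, desiredVersion string) int32 {
        var count int32
        for _, m := range machines {
            // Machines with a deletionTimestamp may still carry a stale
            // Ready=True condition (as in the dump above); skip them.
            if !m.DeletionTimestamp.IsZero() {
                continue
            }
            if !conditions.IsTrue(m, clusterv1.ReadyCondition) {
                continue
            }
            if m.Spec.Version == nil || *m.Spec.Version != desiredVersion {
                continue
            }
            count++
        }
        return count
    }

With a check like this, the control plane would only report Ready once the count equals spec.replicas, so a rollout with maxSurge: 0 would stay not-ready until the last replacement machine is up.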

@zioc added labels kind/bug, needs-priority, and needs-triage on Jul 3, 2024.
@alexander-demicev added labels priority/important-soon and triage/accepted, and removed labels needs-priority and needs-triage, on Aug 27, 2024.