Node sync errors when using a custom (domain-name empty) AWS DHCP Option Set for the VPC #384

Open
damdo opened this issue May 19, 2022 · 11 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@damdo
Member

damdo commented May 19, 2022

What happened:
When a custom AWS DHCP Option Set, with empty domain-name, is assigned to the cluster VPC, and a node joins the cluster shortly after, the node syncing in the cloud provider's node-controller fails with the following:

E0518 11:37:33.302686       1 node_controller.go:213] error syncing 'ip-10-0-144-157': failed to get provider ID for node ip-10-0-144-157 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0518 11:37:33.197748       1 aws.go:5212] Unable to convert node name "ip-10-0-144-157" to aws instanceID, fall back to findInstanceByNodeName: node has no providerID

And the node, after briefly appearing

$ kubectl get nodes -w
...
ip-10-0-144-157                              NotReady                   worker   0s    v1.23.3+69213f8
ip-10-0-144-157                              NotReady                   worker   0s    v1.23.3+69213f8
ip-10-0-144-157                              NotReady                   worker   0s    v1.23.3+69213f8
ip-10-0-144-157                              NotReady                   worker   0s    v1.23.3+69213f8

is then deleted shortly after.

What you expected to happen:
The node syncing should succeed, as the instance backing the node should be found by the node-controller.

How to reproduce it (as minimally and precisely as possible):

  • Get a working K8s cluster with the external cloud-provider-aws set up
  • Create a custom DHCP Option Set with an empty/missing domain-name: aws ec2 create-dhcp-options --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]'
  • Update the K8s VPC to use the custom DHCP Option Set just created
  • Make a new node join the cluster
  • Watch the nodes with kubectl get nodes -w to catch the brief node appearance and disappearance
  • Check the AWS cloud provider logs (a command sketch follows this list)
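
For convenience, a rough end-to-end sketch of the reproduction with the AWS CLI (DHCP_ID is captured from the create call; VPC_ID is a placeholder for the cluster VPC):

$ # Create a DHCP Option Set with no domain-name (only DNS servers)
$ DHCP_ID=$(aws ec2 create-dhcp-options \
    --dhcp-configurations '[{"Key":"domain-name-servers","Values":["AmazonProvidedDNS"]}]' \
    --query 'DhcpOptions.DhcpOptionsId' --output text)
$ # Associate it with the cluster VPC
$ aws ec2 associate-dhcp-options --dhcp-options-id "$DHCP_ID" --vpc-id "$VPC_ID"
$ # Add a new node (e.g. scale up a worker group), then watch it appear and disappear
$ kubectl get nodes -w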

Anything else we need to know?:
After a bit of digging, it turns out this boils down to how the nodeName is computed in the kubelet vs. the assumptions we make in the cloud provider.

The kubelet computes the nodeName by invoking getNodeName(), which behaves differently depending on whether an in-tree or an external provider is used. More specifically, when --cloud-provider=external is set on the kubelet, cloud will be nil and the hostname will be used as the value for nodeName.
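
In other words, with --cloud-provider=external the node name is simply whatever the OS reports as its hostname. On an instance launched into a VPC using the empty domain-name Option Set above, that is the short form (values illustrative):

$ hostname
ip-10-0-144-157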

The AWS cloud provider, when syncing the Node in the node-controller, tries to find the instance backing the node by describing all instances and selecting the one whose private-dns-name matches the nodeName (which in this case is the hostname).
This works when the hostname has the same value as the private-dns-name, but doesn't in cases where they differ.

For example, when a node is created with the custom DHCP Option Set described above, the hostname will be of the form ip-10-0-144-157, as opposed to its private-dns-name, which will be of the form ip-10-0-144-157.ec2.internal.
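
The mismatch is easy to see from the CLI; the lookup the cloud provider performs is roughly equivalent to the private-dns-name filter below, so the short hostname matches nothing (instance ID and values illustrative):

$ # What the instance actually reports (note the .ec2.internal suffix)
$ aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
    --query 'Reservations[].Instances[].PrivateDnsName' --output text
ip-10-0-144-157.ec2.internal
$ # What the node-controller effectively asks for, using the node name (the short hostname): no match
$ aws ec2 describe-instances --filters Name=private-dns-name,Values=ip-10-0-144-157 \
    --query 'Reservations[].Instances[].InstanceId' --output text
(empty output)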

Environment:

  • Kubernetes version (use kubectl version): v1.23.3+69213f8
  • Cloud provider or hardware configuration: v1.23.0
  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 411.85.202205101201-0 (Ootpa)
  • Kernel (e.g. uname -a): 4.18.0-348.23.1.el8_5.x86_64

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 19, 2022
@olemarkus
Member

We lack some documentation in this area, but if you enable Resource-based naming on your instances, custom DHCP options will work. If you use IP-based naming, CCM will expect the FQDN of IP-based hostnames.
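
For reference, Resource-based naming can be enabled per subnet (for new launches) or per existing instance; the flags below are from memory, so double-check them against the AWS CLI docs:

$ # New instances launched into this subnet get resource-based hostnames (i-<id>...)
$ aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 \
    --private-dns-hostname-type-on-launch resource-name
$ # Or switch an existing instance (it may need to be stopped first)
$ aws ec2 modify-private-dns-name-options --instance-id i-0123456789abcdef0 \
    --private-dns-hostname-type resource-name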

@JoelSpeed
Contributor

> If you use IP-based naming, CCM will expect the FQDN of IP-based hostnames.

IP-based naming is the default and most commonly used across AWS end users. I feel like we should try to fix this rather than providing a workaround that some may call a breaking change. Certainly in some managed software offerings based on Kubernetes, I would expect that workaround to be considered an unacceptable change, as node naming must be consistent for a node during its own lifetime, but also across the other nodes within the cluster.

Do we know if there has been any previous discussion about this bug that led to a won't-fix decision, or can we open the floor for ideas on how to fix this?

@olemarkus
Member

CCM has always required the FQDN for IP-based node names, with the domain being the regional default. No change there.

But for RBN, we have decided to relax this. Note that node names will remain consistent for the lifetime of the individual node. Having different conventions while transitioning is entirely graceful; kOps does this in periodically running e2e tests.

What the default is depends on the installer. Not that many installers use the external CCM yet, but kOps does, and it also transitions to RBN as part of that.

@nckturner
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 25, 2022
damdo added a commit to damdo/machine-config-operator that referenced this issue May 25, 2022
- What I did
I added an AWS-specific systemd unit (aws-kubelet-providerid.service) and file (/usr/local/bin/aws-kubelet-providerid) for generating the AWS instance provider-id (then stored in the KUBELET_PROVIDERID env var), in order to pass it as the --provider-id argument to the kubelet service binary.
We needed to add this flag, and make it non-empty only on AWS, so that node syncing (specifically, detection of the backing instance) works via provider-id lookup, covering cases where the node hostname doesn't match the expected private-dns-name (e.g. when a custom DHCP Option Set with empty domain-name is used). A rough sketch of such a helper follows the verification steps below.

Should fix: https://bugzilla.redhat.com/show_bug.cgi?id=2084450
Reference to an upstream issue with context: kubernetes/cloud-provider-aws#384

- How to verify it
Try the reproduction steps available at: https://bugzilla.redhat.com/show_bug.cgi?id=2084450#c0 while launching a cluster with this MCO PR included.
Verify that the issue is no longer reproducible.
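
Not the actual MCO script, but a minimal sketch of what such a provider-id helper could look like, using IMDSv2 (the providerID format aws:///<az>/<instance-id> is what the AWS cloud provider expects; the env file path is illustrative and error handling is omitted):

#!/bin/bash
# Sketch: build the AWS providerID from instance metadata and expose it as
# KUBELET_PROVIDERID for the kubelet's --provider-id flag.
set -euo pipefail
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
AZ=$(curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/availability-zone)
INSTANCE_ID=$(curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
# Written to an env file consumed by the kubelet systemd unit (path illustrative)
echo "KUBELET_PROVIDERID=aws:///${AZ}/${INSTANCE_ID}" > /etc/kubernetes/kubelet-providerid.env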
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2022
@vpnachev

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 13, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 20, 2024
@damdo
Member Author

damdo commented Jan 21, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 21, 2024
@ialidzhikov
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 22, 2024