Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add network topology information to nodes #1025

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wwvela
Copy link

@wwvela wwvela commented Sep 19, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR is to Add network topology information to nodes for Issue #998.

Which issue(s) this PR fixes:
Fixes #998

Does this PR introduce a user-facing change?:
NONE

@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 19, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kishorj for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Sep 19, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 19, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @wwvela. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 19, 2024
topology, err := c.ec2.DescribeInstanceTopology(topologyRequest)
if err != nil || topology == nil {
klog.Infof("topology api not supported for instance type: %s or region: %s", instanceType, region)
return additionalLabels, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't return here, you should wrap the following block in an else

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we will use the supportedInstanceType instead of instance prefix now, the error throw our there should not because not supported issue. I throw the error directly here in new version

if c.topologySupportedInstance(instanceType) && slices.Contains(c.cfg.Global.TopologySupportedRegions, region) {
topologyRequest := &ec2.DescribeInstanceTopologyInput{InstanceIds: []*string{&instanceID}}
topology, err := c.ec2.DescribeInstanceTopology(topologyRequest)
if err != nil || topology == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to check what the err is before we say that the instance isn't supported

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised the topologySupportedInstance to check the supported instance types instead of the instance prefix, so the DescribeInstanceTopology api should supported here. I will throw error directly

return additionalLabels, nil
}

// TopologySupportedInstancePrefixes is help to filter instance types supported by topology API
func (c *Cloud) topologySupportedInstance(instanceType string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (c *Cloud) topologySupportedInstance(instanceType string) bool {
func (c *Cloud) isTopologySupportedInstance(instanceType string) bool {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revised

// TopologySupportedInstancePrefixes is help to filter instance types supported by topology API
func (c *Cloud) topologySupportedInstance(instanceType string) bool {
for _, prefix := range c.cfg.Global.TopologySupportedInstancePrefixes {
if strings.HasPrefix(instanceType, prefix) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably just want to work with a list of fully-formed instance type names, prefix matching has some ambiguity and configuration footguns ("p4" would match all p4dn instances, for example). The list isn't going to change very often and we don't need the benefit of shorthand IMO

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to be updating the cloud-provider each time there's a new AWS instance type, though?

Copy link
Contributor

@ndbaker1 ndbaker1 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be a bad idea to do some dynamic discover based on an DescribeInstanceTopology result? then cache the instance types' support with a TTL? from what i can tell the api just returns an empty list when the instance-id's type is not supported

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with Carter before. We don't want to trigger this api for all nodes, because some cx might have hundreds or thousands of nodes in their cluster, if we call this api for all nodes, it very likely trigger throttle issue.

And DescribeInstanceTopology only supported for very limited instances types which lots of cx might not use. check the supported instance type and regions here. https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to be updating the cloud-provider each time there's a new AWS instance type, though?

We don't need to update cloud-provider, TopologySupportedInstanceTypes and TopologySupportedRegions is added in cloud-config where will accept the input in KCP CCM manifest. We only need to update the manifest with the cloud-config input.

Comment on lines 89 to 92
for index, networkNode := range topology.NetworkNodes {
label := LabelNetworkNode + strconv.Itoa(index)
additionalLabels[label] = *networkNode
}
Copy link
Contributor

@cartermckinnon cartermckinnon Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear how you would use this. How would you "query" the labels using a pod's nodeSelector, or a telemetry system, if the index of the networkNode appears in the label key?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after reading through the instance topology docs i think i have an idea how this works?

Assuming the goal is to run workloads that share the same topology, you'd need to know the network nodes that your capacity sits on first, then use that in a nodeSelector for the corresponding layer index. AFAIK there's no way to apply a nodeSelector based on some heuristic like "schedule pods to nodes where the label topology.k8s.aws/network-node-layer-1 matches", so this logic would have to be handled by a high level system. the only guarentee is that there are always at least 3 layers.

However, this scheme breaks down if network nodes can be part of different layers, which i couldn't immediately grok from the docs but i assumed wasn't possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this scheme breaks down if network nodes can be part of different layers

yep, if trees with different roots can overlap (share network nodes).

The long form doc is helpful, the API doc doesn’t even say that order is meaningful in the networkNodeSet. 😄

I’m wondering if the “layer” number needs to be in the labels. Is it enough to just say “schedule this pod within this networkNode” instead of “within this networkNode which is the layer 2 networkNode for the instance”?

—-

A primary motivation for this feature was observability, being able to correlate issues with something more granular than AZ. If you want to slice and dice your instance metrics using network nodes, should you care about the layer of the network node?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we inverted this, so that the label key contains the networkNode and the value is the layer number, that would allow you to use the Exists operator in a nodeAffinity to place the pod within a networkNode, without needing the layer to also match: https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/node-selector-requirement/#NodeSelectorRequirement

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, if trees with different roots can overlap (share network nodes).

We can assume that's not the case for now.

A primary motivation for this feature was observability, being able to correlate issues with something more granular than AZ

Do you mean the feature to apply arbitrary labels from within the cloud provider? I think observability at the zone-id level is where we started, but I'd like to think we can help with scheduling here too. The next thing I'm thinking is a label that describes whether an instance has a GPU or not, so we don't need to schedule GPU pods (Device Plugin, EFA etc) on nodes that don't need it etc.

Assuming the goal is to run workloads that share the same topology, you'd need to know the network nodes that your capacity sits on first, then use that in a nodeSelector for the corresponding layer index.

Right that's what I was expecting you'd need to do. You don't know the topology until you launch the instance first, so it forces this ordering of sorts where you launch the instance first, you realize what your network nodes are, and you should ideally see a common network node across your instances. I think you could maybe define weighted node affinities across the network node layers so you pack it as close as possible?

@cartermckinnon
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 19, 2024
@cartermckinnon
Copy link
Contributor

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 19, 2024
Comment on lines 37 to 40
} else if len(resp.Instances) == 0 {
return nil, nil
}
return resp.Instances[0], err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wrapper's logic seems strange, shouldn't we return the list of []*ec2.InstanceTopology instead of just the first item?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DescribeInstanceTopology can accept a list of instance IDs, if we pass only 1 instance ID, it will return 1 according topology info something like below.

"Instances": [
        {
            "InstanceId": "i-xxx",
            "InstanceType": "p5e.48xlarge",
            "NetworkNodes": [
                "nn-xxxxx",
                "nn-xxxxx",
                "nn-xxxxx"
            ],
            "AvailabilityZone": "us-west-2c",
            "ZoneId": "usw2-az3"
        }
    ]

But yeah, it seems more reasonable to return all instances in api and get the first instance in where trigger it and get result. Will revised in next

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, because we still allow the full ec2.DescribeInstanceTopologyInput payload which could specify more than one instance id

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revised

@@ -20,4 +20,5 @@ const (
// LabelZoneID is a topology label that can be applied to any resource
// but will be initially applied to nodes.
LabelZoneID = "topology.k8s.aws/zone-id"
LabelNetworkNode = "topology.k8s.aws/network-node-"
Copy link
Contributor

@ndbaker1 ndbaker1 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LabelNetworkNode = "topology.k8s.aws/network-node-"
LabelNetworkNode = "topology.k8s.aws/network-node-layer-"

IIUC, the info you're embedding into the key is about the layer of the network node in the node set

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revised in next version

Comment on lines 89 to 92
for index, networkNode := range topology.NetworkNodes {
label := LabelNetworkNode + strconv.Itoa(index)
additionalLabels[label] = *networkNode
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after reading through the instance topology docs i think i have an idea how this works?

Assuming the goal is to run workloads that share the same topology, you'd need to know the network nodes that your capacity sits on first, then use that in a nodeSelector for the corresponding layer index. AFAIK there's no way to apply a nodeSelector based on some heuristic like "schedule pods to nodes where the label topology.k8s.aws/network-node-layer-1 matches", so this logic would have to be handled by a high level system. the only guarentee is that there are always at least 3 layers.

However, this scheme breaks down if network nodes can be part of different layers, which i couldn't immediately grok from the docs but i assumed wasn't possible.

@k8s-ci-robot
Copy link
Contributor

@wwvela: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cloud-provider-aws-check 9e72a88 link true /test pull-cloud-provider-aws-check
pull-cloud-provider-aws-test 9e72a88 link true /test pull-cloud-provider-aws-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add network topology information to nodes
5 participants