
Multi-NUMA systems for testing on Kubernetes test infrastructure #28211

Closed
swatisehgal opened this issue Dec 8, 2022 · 27 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one.

Comments

@swatisehgal
Contributor

swatisehgal commented Dec 8, 2022

The SIG Node community is trying to avoid perma-beta status of features (as per discussion in the SIG Node meeting on 2022/12/06), and Topology Manager and Memory Manager are both candidates for GA graduation. Currently, node e2e tests for both these components are skipped (examples: Topology Manager Node e2e test and Memory Manager Node e2e test) as we don't have multi-NUMA hardware in the test infrastructure.

In the SIG Node CI meeting on 2022/12/07, it was highlighted that:

The two cheapest options with NUMA on GCP:

1. n2-standard-32: $908.47/month

skanzhelev@n2-standard-32:~$ grep NUMA=y /boot/config-`uname -r`
CONFIG_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ACPI_NUMA=y
skanzhelev@n2-standard-32:~$ lscpu | grep -i numa
NUMA node(s):                 2
NUMA node0 CPU(s):            0-7,16-23
NUMA node1 CPU(s):            8-15,24-31

2. n2d-standard-32: $790.49/month

skanzhelev@n2d-standard-32:~$ grep NUMA=y /boot/config-`uname -r`
CONFIG_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_ACPI_NUMA=y
skanzhelev@n2d-standard-32:~$ lscpu | grep -i numa
NUMA node(s):                 2
NUMA node0 CPU(s):            0-7,16-23
NUMA node1 CPU(s):            8-15,24-31

Is it possible to add either n2-standard-32 or n2d-standard-32 to the test infrastructure? Skipping these e2e tests is not ideal and could be a potential blocker for GA graduation of these features.

@k8s-ci-robot
Contributor

@swatisehgal: There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 8, 2022
@swatisehgal
Contributor Author

/sig-node

@ameukam
Member

ameukam commented Dec 8, 2022

@swatisehgal Are you adding new tests or do you want to modify existing tests?

@ameukam
Member

ameukam commented Dec 8, 2022

cc @dims

@BenTheElder
Member

e2e tests don't have permanent machines so there's nothing to add.

You'll need to run a dedicated CI suite with these more expensive machines; we should not jump to machines this large in the standard e2e suites. I forget exactly what we use, but it's something like 1-8 core machines.

This sort of thing should already be configurable in the e2e scripts.

@BenTheElder
Member

Alternatively we have some cluster e2e CI on AWS if that helps.

@swatisehgal
Contributor Author

swatisehgal commented Dec 8, 2022

@swatisehgal Are you adding new tests or do you want to modify existing tests?

We want to make sure that the existing tests and the ones we add in the future are actually executed and we reap the benefit of signals from CI. There is currently no way for us to identify regressions.

The main problem is that the existing node-e2e tests related to Topology Manager and Memory Manager, in spite of being configured (e.g. here), are skipped when they don't run on multi-NUMA machines (e.g. here), which happens 100% of the time as we don't have such nodes in the test-infra.

@swatisehgal
Contributor Author

swatisehgal commented Dec 8, 2022

e2e tests don't have permanent machines so there's nothing to add.

You'll need to run a dedicated CI suite with these more expensive machines; we should not jump to machines this large in the standard e2e suites. I forget exactly what we use, but it's something like 1-8 core machines.

This sort of thing should already be configurable in the e2e scripts.

This is useful information, thanks!
Does that mean we can create a dedicated lane for testing features that require special hardware capabilities like NUMA and point it to run on n2d-standard-32 (given that it is the cheaper option), without needing these VMs to be pre-provisioned?

Other than creating a PR in the test-infra repo to propose a separate test suite that runs on a dedicated lane with these machines, and getting it reviewed/approved, do I need further approval to use these more expensive machines?

@swatisehgal
Contributor Author

swatisehgal commented Dec 8, 2022

Alternatively we have some cluster e2e CI on AWS if that helps.

After a bit of digging I found m5n.24xlarge, a machine type provided by AWS that has multi-NUMA. It is a beefy system with 96 vCPUs / 192 GB RAM and is way more expensive than the GCP options mentioned above. I will do my research to figure out if there are cheaper options. I wonder if someone from the AWS team can help out here?

@tokt

tokt commented Dec 9, 2022

https://instances.vantage.sh/aws/ec2/c5.24xlarge - this seems to be a bit cheaper?

AWS Docs aren't very helpful unfortunately (or maybe it's just my search skills).

@swatisehgal
Contributor Author

swatisehgal commented Dec 9, 2022

https://instances.vantage.sh/aws/ec2/c5.24xlarge - this seems to be a bit cheaper?

AWS Docs aren't very helpful unfortunately (or maybe it's just my search skills).

I intended to point to c5.24xlarge in my previous comment; I linked to it correctly but used the wrong machine name 🤦‍♀️
Anyway, it appears to be expensive with a monthly cost of almost $3k compared to $790.49 for GCP. I am guessing that a spot instance would not be a suitable option for us given that it can be taken away with very short notice.

[screenshot: c5.24xlarge pricing]

@ameukam
Member

ameukam commented Dec 9, 2022

@swatisehgal Are you adding new tests or do you want to modify existing tests?

We want to make sure that the existing tests and the ones we add in the future are actually executed and we reap the benefit of signals from CI. There is currently no way for us to identify regressions.

The main problem is that the existing node-e2e tests related to Topology Manager and Memory Manager, in spite of being configured (e.g. here), are skipped when they don't run on multi-NUMA machines (e.g. here), which happens 100% of the time as we don't have such nodes in the test-infra.

Sorry, I should have been clearer about what I meant by tests. My question is about prowjobs. Is there an intent to introduce new prowjobs to run e2e tests for topology/memory manager?

@swatisehgal
Contributor Author

We already have a few prow jobs for resource managers:

  1. ci-kubernetes-node-kubelet-containerd-resource-managers
  2. ci-crio-cgroupv1-node-e2e-resource-managers
  3. pull-kubernetes-node-kubelet-serial-topology-manager

These jobs refer to image-config files (e.g. image-config-serial-resource-managers.yaml and image-config-serial-cpu-manager.yaml) where n1-standard-4 is specified as the machine. The change itself is trivial: we change the machines specified in the config files to machines with multi-NUMA (n2-standard-32 or n2d-standard-32), but this can have implications for our test infrastructure cost, and that is what I wanted to discuss here.
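As a rough sketch (assuming the current layout of these image-config files, where only the machine field needs to change; the CPU layout in the comment is taken from the lscpu output above):

machine: n2d-standard-32   # was n1-standard-4; exposes 2 NUMA nodes (CPUs 0-7,16-23 and 8-15,24-31)

The same one-line change would apply to image-config-serial-cpu-manager.yaml.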

@swatisehgal
Contributor Author

You'll need to run a dedicated CI suite with these more expensive machines; we should not jump to machines this large in the standard e2e suites. I forget exactly what we use, but it's something like 1-8 core machines.

@BenTheElder Could you please elaborate on what you mean by a dedicated test suite? We currently have node e2e tests for
cpu_manager, memory_manager and topology_manager under Node E2E tests. Do we need to move them somewhere else?

@SergeyKanzhelev
Member

I think @BenTheElder is referring to a separate job definition for topology manager, with the image config specifying the desired machine type, like here:

machine: n1-standard-2 # These tests need a lot of memory

Let's try to create one and see how fast it runs. Once we have numbers, it will be an easier conversation.
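A minimal sketch of such a dedicated periodic job (the job name, interval, and image tag below are illustrative, and the runner args are deliberately elided; the intent is to mirror the existing ci-kubernetes-node-kubelet-containerd-resource-managers job, pointing it at an image config that sets a multi-NUMA machine type):

periodics:
- name: ci-kubernetes-node-kubelet-containerd-topology-manager-multi-numa  # hypothetical name
  interval: 6h                     # cadence to tune once we have runtime/cost numbers
  decorate: true
  extra_refs:
  - org: kubernetes
    repo: kubernetes
    base_ref: master
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master  # illustrative; match the existing job
      command:
      - runner.sh
      # args: reuse the args of the existing resource-managers job, swapping the
      # image-config file for one that sets machine: n2d-standard-32

Once such a job exists, its runtime and machine-hours per run should give us the numbers to discuss.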

@BenTheElder
Member

Anyway, it appears to be expensive with a monthly cost of almost $3k compared to $790.49 for GCP. I am guessing that a spot instance would not be a suitable option for us given that it can be taken away with very short notice.

  1. Presumably these machines should not exist for the entire month, as they're temporary for the duration of an e2e test, unless I'm deeply mistaken. We dispose of the entire GCP project contents or AWS sub-account after the completion of e2e tests, generally speaking. We rent projects/accounts, not machines, from https://github.com/kubernetes-sigs/boskos
  2. These are not paid for with money. They're paid for with GCP / AWS credits provided to the project; currently the GCP account is paying for nearly everything in the project and we've exceeded our ($3M/year!) budget there, so the tradeoff is more complex (see: https://www.youtube.com/watch?v=mJAC4asDCOw). But it sounds like the AWS machines are much more expensive and not an efficient approach to testing this in a minimum-viable way.

I think @BenTheElder is referring to a separate job definition for topology manager, with the image config specifying the desired machine type, like here:

test-infra/jobs/e2e_node/image-config-serial.yaml

Line 12 in bd6746a

machine: n1-standard-2 # These tests need a lot of memory
Let's try to create one and see how fast it runs. Once we have numbers, it will be an easier conversation.

yes

@swatisehgal
Contributor Author

Anyway, it appears to be expensive with a monthly cost of almost $3k compared to $790.49 for GCP. I am guessing that a spot instance would not be a suitable option for us given that it can be taken away with very short notice.

  1. Presumably these machines should not exist for the entire month, as they're temporary for the duration of an e2e test, unless I'm deeply mistaken. We dispose of the entire GCP project contents or AWS sub-account after the completion of e2e tests, generally speaking. We rent projects/accounts, not machines, from https://github.com/kubernetes-sigs/boskos
  2. These are not paid for with money. They're paid for with GCP / AWS credits provided to the project; currently the GCP account is paying for nearly everything in the project and we've exceeded our ($3M/year!) budget there, so the tradeoff is more complex (see: https://www.youtube.com/watch?v=mJAC4asDCOw). But it sounds like the AWS machines are much more expensive and not an efficient approach to testing this in a minimum-viable way.

Thanks @BenTheElder, this is extremely insightful. It explains the logistics of how we handle VMs for testing and the constraints we are dealing with.

I think @BenTheElder is referring to a separate job definition for topology manager, with the image config specifying the desired machine type, like here:
test-infra/jobs/e2e_node/image-config-serial.yaml
Line 12 in bd6746a
machine: n1-standard-2 # These tests need a lot of memory
Let's try to create one and see how fast it runs. Once we have numbers, it will be an easier conversation.

yes

Perfect, I will work on a PR and share it for a review. Thanks!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 19, 2023
@SergeyKanzhelev
Member

/remove-lifecycle stale

@swatisehgal is there anything left to do here?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 18, 2023
@swatisehgal
Contributor Author

#28369 and #29717 were created to run jobs on multi-NUMA test infra and enable periodic jobs on multi-NUMA systems. This issue can be closed now.

/close

@k8s-ci-robot
Contributor

@swatisehgal: Closing this issue.

In response to this:

#28369 and #29717 were created to run jobs on multi-NUMA test infra and enable periodic jobs on multi-NUMA systems. This issue can be closed now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kannon92
Contributor

kubernetes/enhancements#3545 (comment)

This seems to have regressed. Looking at our tests, I am seeing that the multi-NUMA tests are being skipped.

@kannon92
Contributor

Once we get these working again, I think maybe we should consider failing the tests if the NUMA alignment constraint is not satisfied. That way we can be notified that there was an issue.

@ffromani
Contributor

Once we get these working again, I think maybe we should consider failing the tests if the NUMA alignment constraint is not satisfied. That way we can be notified that there was an issue.

I'm worried this can lead to permared and/or possibly the test being disabled (!!). But I have zero data supporting my (irrational) fear, so I won't push back too hard if we decide to go this direction

@BenTheElder
Member

I'm worried this can lead to permared and/or possibly the test being disabled (!!). But I have zero data supporting my (irrational) fear, so I won't push back too hard if we decide to go this direction

Tests that require specific hardware should:

  1. be labeled with a feature label (cc @pohly @aojea for the current state of that bit), so they are only run when accepting this label (most jobs won't run them and that's fine/expected)
  • fail if the hardware is not available, so we don't wind up with silent skipping and no failure alerting if the jobs intended to run them are not set up correctly

Tests that require specific hardware / non-standard integrations should not be in the default suite and that's fine.

@BenTheElder
Member

cross linking kubernetes/k8s.io#7339
