
RFE: make sure the kubelet reservation are multi-NUMA aware #160

Open

ffromani opened this issue Apr 1, 2020 · 2 comments


ffromani commented Apr 1, 2020

While writing and verifying the e2e tests for the k8s/okd topology manager, I stumbled upon a scenario which may be relevant and interesting for the performance-addon-operator.

Let's consider a cluster whose workers are multi-NUMA with, say, 2 NUMA nodes each and, say, 72 CPUs (but this also works with 80 CPUs, 64 CPUs, ...):

numa node 0 cpus: 0,2,4,6,8,10...70
numa node 1 cpus: 1,3,5,7,9,11...71

NOTE: I need to check if and how HyperThreading affects this picture

- oftentimes the PCI devices are connected to NUMA node #0
- oftentimes the kubelet reserves CPUs 0-3 (i.e. 0,1,2,3) for system purposes
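
For reference, this layout can be checked straight from sysfs on a worker; a minimal, purely illustrative sketch (plain Linux sysfs paths, nothing operator-specific):

```python
# Minimal sketch: read the NUMA CPU layout and the PCI device affinity
# from standard Linux sysfs paths. Illustrative only, not operator code.
import glob
import os


def parse_cpulist(text):
    """Expand a kernel cpulist such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def numa_cpus():
    """Map NUMA node id -> set of CPU ids."""
    layout = {}
    for path in glob.glob("/sys/devices/system/node/node*/cpulist"):
        node = int(os.path.basename(os.path.dirname(path))[len("node"):])
        with open(path) as f:
            layout[node] = parse_cpulist(f.read())
    return layout


def pci_numa_nodes():
    """NUMA node ids that have at least one PCI device attached."""
    nodes = set()
    for path in glob.glob("/sys/bus/pci/devices/*/numa_node"):
        with open(path) as f:
            node = int(f.read().strip())
        if node >= 0:  # the kernel reports -1 when affinity is unknown
            nodes.add(node)
    return nodes
```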

In this scenario, if we want to start a workload which requires one or more devices and uses all the cores of a NUMA node, the only suitable node is #0.

But because of the default configuration:

  1. You cannot allocate a full NUMA node to the workload. This is an accident caused by a side effect of the default settings, which we could avoid with a smarter configuration (see the sketch after this list).
  2. Even if you allocate the remaining cores, and even if you isolate them, you will still get some interference, because mixed workloads (system + user) run on the same NUMA node.
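
To make point 1 concrete, here is a tiny worked example (purely illustrative, assuming the 72-CPU even/odd layout above and the default reserved set 0-3):

```python
# With CPUs interleaved across two NUMA nodes and 0-3 reserved, each node
# loses two CPUs, so neither node can be allocated in full.
node0 = set(range(0, 72, 2))   # 0,2,4,...,70
node1 = set(range(1, 72, 2))   # 1,3,5,...,71
reserved = {0, 1, 2, 3}        # default-style kubelet reservation

for name, cpus in (("node0", node0), ("node1", node1)):
    print(name, len(cpus - reserved), "of", len(cpus), "CPUs allocatable")
# -> node0 34 of 36 CPUs allocatable
# -> node1 34 of 36 CPUs allocatable
```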

A possible fix: the operator should offer the option to try to reserve system resources on NUMA nodes which don't have PCI devices attached. For CPU-only workloads (i.e. workloads which don't need PCI devices at all) this makes no difference, and it frees up resources for PCI-requiring workloads.
If all the NUMA nodes have PCI devices attached to them, the operator can happily do nothing and trust the cluster admin.
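
As a strawman, the heuristic could look something like the sketch below (reusing the numa_cpus() and pci_numa_nodes() helpers sketched earlier; the reservation size of 4 CPUs is just an example, not an operator default):

```python
def suggest_reserved(n_reserved=4):
    """Pick reserved CPUs from NUMA nodes that have no PCI devices attached."""
    layout = numa_cpus()             # node id -> set of CPU ids
    device_nodes = pci_numa_nodes()  # nodes with PCI devices attached
    device_free = [n for n in sorted(layout) if n not in device_nodes]

    if not device_free:
        # Every NUMA node has PCI devices attached: do nothing and trust
        # the cluster admin's explicit reservation.
        return None

    # Reserve CPUs on the device-free nodes, keeping the device-attached
    # nodes fully available for device-requiring workloads.
    candidates = sorted(c for n in device_free for c in layout[n])
    return candidates[:n_reserved]
```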

ffromani changed the title from "make sure the kubelet reservation are multi-NUMA aware" to "RFE: make sure the kubelet reservation are multi-NUMA aware" on Apr 1, 2020

MarSik commented Apr 1, 2020

The operator expects the list of CPUs to reserve, so the user can set reserved to 1,3,5,7 in a situation like this. There is no automation; we rely on the sysadmin's knowledge of the hardware in question.

Automated CPU discovery was in the initial plan, but it was too hard to implement in the first phase. Also, every user might have different preferences (some want one housekeeping core per NUMA node, some want a totally isolated NUMA node for the workload, ...).


ffromani commented Apr 1, 2020

> The operator expects the list of CPUs to reserve, so the user can set reserved to 1,3,5,7 in a situation like this. There is no automation; we rely on the sysadmin's knowledge of the hardware in question.
>
> Automated CPU discovery was in the initial plan, but it was too hard to implement in the first phase. Also, every user might have different preferences (some want one housekeeping core per NUMA node, some want a totally isolated NUMA node for the workload, ...).

Makes sense. In a future release we can perhaps offer the option for the operator to figure out a smart reservation, or at least the smartest one it can figure out on its own. There are cases which are safe and relatively easy to detect. The cluster admin must always have the option to override the reservation.
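
For illustration, the override semantics could be as simple as the sketch below (suggest_reserved refers to the earlier sketch; profile_reserved is a hypothetical stand-in for whatever the admin set explicitly):

```python
def effective_reserved(profile_reserved=None):
    """An explicit admin reservation always wins; the heuristic is a fallback."""
    if profile_reserved is not None:
        return profile_reserved      # cluster admin override, used verbatim
    return suggest_reserved()        # best-effort automatic suggestion
```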
