
KEP-4800: Split UnCoreCache awareness #4810

Open
wants to merge 14 commits into master

Conversation

@ajcaldelas

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 27, 2024
@k8s-ci-robot
Contributor

Welcome @ajcaldelas!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2024
@k8s-ci-robot
Contributor

Hi @ajcaldelas. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 27, 2024
@kannon92
Contributor

/cc

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ajcaldelas
Once this PR has been reviewed and has the lgtm label, please assign jpbetz, mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


###### How can this feature be enabled / disabled in a live cluster?

Cannot be dynamically enabled/disabled because of the CPU Manager state
Contributor

Is this true? So the only way to enable this feature is to create a node with these settings?

Could I not set this policy and restart kubelet?

Contributor

I agree, @kannon92's phrasing seems more correct and easier to follow. We can totally enable the feature, but this requires a kubelet restart, and possibly the deletion of the CPU manager state file.

Author

Would something like this clear it up?

This feature depends on the CPU Manager state file found in /var/lib/kubelet/cpu_manager_state; because of this, enabling the feature requires deleting that file and restarting the kubelet with the feature enabled.

This is basically the reasoning; maybe @wongchar could help me out here with the phrasing and more detail on why this happens.

@wongchar

That's correct. You need to drain the node, enable/disable the feature, remove the cpu_manager_state file, reload daemons, and restart the kubelet.

Contributor

I'd write
"disabling the feature requires a kubelet config change and a kubelet restart" or so.

The state file deletion is a side effect inherited from how the cpumanager works. Nothing specific to this feature implies or requires that a deletion is warranted.

Actually, deletions of state files should be avoided entirely and they are a smell, but that's another story for another day.

@ajcaldelas (Author) commented Sep 5, 2024

How does this sound?
From our testing this is a requirement: going from none to static cannot be done dynamically because of the cpu_manager_state file. The node needs to be drained and the policy checkpoint file needs to be removed before restarting the kubelet. This requirement stems from the CPUManager and has nothing to do with this feature specifically.
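
To make the procedure above concrete, here is a minimal sketch of the configuration side, assuming the feature lands as a CPU Manager static policy option behind the CPUManagerPolicyAlphaOptions feature gate; the option name prefer-align-cpus-by-uncorecache is an illustrative assumption, not something fixed by this thread.

# Hypothetical KubeletConfiguration fragment. Per the discussion above, the
# change cannot be applied dynamically: drain the node, remove
# /var/lib/kubelet/cpu_manager_state, then restart the kubelet.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyAlphaOptions: true   # new static-policy options start as alpha
cpuManagerPolicy: static               # the option only applies to the static policy
cpuManagerPolicyOptions:
  prefer-align-cpus-by-uncorecache: "true"   # assumed option name, for illustration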

@wongchar

/lead-opted-in

@ffromani
Contributor

/lead-opted-in

this is something a lead can add :)
@haircommander for awareness ^^^

@haircommander
Contributor

this has already been opted into :) #4800 (comment)

@ffromani (Contributor) left a comment

getting there, mostly suggestions and clarification proposals

- "@ffromani"
- "@kannon92"
approvers:
- "@mrunalp"
Contributor

please replace with @klueska per sig-node KEP planning (cc @mrunalp), or perhaps just leave @sig-node-leads? I'd be fine with either.

Author

For now I'll replace it with @klueska.

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 4 review threads (outdated, resolved)
Comment on lines +643 to +655

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
Contributor

since the feature is best effort, the known failure mode is failing to allocate optimally; the proposed metric and the podresources API should offer enough insight.

Author

In the scenario where we can't optimally schedule, would it actually be considered a failure mode? It's more like a warning, right?

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 2 review threads (outdated, resolved)
@soltysh (Contributor) left a comment

There are some comments from Francesco for alpha PRR, and a few nits from me. But mostly looks good, just waiting for sig-node approval.

- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Contributor

Nit: make sure to check the appropriate boxes above.

Author

I checked off whatever I knew for sure, but there were some checks I wasn't sure about.

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 3 review threads (outdated, resolved)
- Feature is enabled by a static policy option flag.
- It does not change behavior of non-static policy
- It preserves the behavior of other static options.
- Failure will occur if there is a mismatch between options, and the Pod will not be scheduled.
Contributor

This entire section is hard to read. Can you write it so that there is a clear structure:

- risk A: mitigation A
- risk B: mitigation B
- etc.

#### Alpha

- Feature implemented behind a feature gate flag option
- Initial e2e tests completed and enabled
Contributor

Let's add it here explicitly, so we don't forget.


###### What specific metrics should inform a rollback?

N/A
Contributor

I'd suggest adding increased pod startup latency as a rollback signal, at minimum.
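
As a sketch of what such a rollback signal could look like, assuming the cluster scrapes the kubelet's kubelet_pod_start_duration_seconds histogram; the metric name, threshold, and rule names here are illustrative assumptions, not part of the KEP.

# Hypothetical Prometheus alerting rule: fire when p99 pod start latency
# stays high after enabling the option. The 45s threshold is arbitrary.
groups:
  - name: uncorecache-rollback-signals
    rules:
      - alert: PodStartLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(kubelet_pod_start_duration_seconds_bucket[10m])) by (le, instance))
          > 45
        for: 15m
        annotations:
          summary: "p99 pod start latency above 45s; consider rolling back the policy option"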

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 1 review thread (outdated, resolved)
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- Metrics
- Pod deployment time (seconds)
Contributor

Can you point out a specific metric here?

Author

We don't have a specific known metric to measure this; adding a new metric is part of our proposal. Should we mention that here?

Contributor

For this and for #4810 (comment) we can use topology_manager_admission_duration_ms: since GA, the topology manager is always engaged even if no NUMA alignment is requested, so this is a good approximation. I'd be open to adding more metrics in the future, but this can be a nice start and perhaps enough for alpha.
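
As a sketch (assuming topology_manager_admission_duration_ms is exposed by the kubelet as a Prometheus histogram, as the _ms naming convention suggests), this SLI could be captured with a recording rule like the following; the rule and series names are assumptions for illustration.

# Hypothetical recording rule turning the suggested kubelet histogram into
# a p99 admission-latency SLI series.
groups:
  - name: uncorecache-sli
    rules:
      - record: sli:topology_manager_admission_duration_ms:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(topology_manager_admission_duration_ms_bucket[5m])) by (le, instance))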

disable-supported: true

# The following PRR answers are required at beta release
metrics: []
Contributor

This should be filled in with the metrics you've mentioned in a few places; not a blocker for alpha, though.

Author

If they don't actually exist yet, is that okay? We are still trying to come up with metrics to implement.
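
As an illustration of how the kep.yaml block above might eventually be filled in, here is a sketch using the existing kubelet metric suggested earlier in the review; any KEP-specific metrics would replace or extend this list once they exist.

# Sketch: kep.yaml PRR block filled in with the metric suggested in review.
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - topology_manager_admission_duration_ms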

Labels
- cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
- kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory)
- ok-to-test (indicates a non-member PR verified by an org member that is safe to test)
- sig/node (categorizes an issue or PR as relevant to SIG Node)
- size/XL (denotes a PR that changes 500-999 lines, ignoring generated files)
Projects
Status: Waiting on Author