KEP-4800: Split UnCoreCache awareness #4810
Conversation
Welcome @ajcaldelas!
Hi @ajcaldelas. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc /ok-to-test
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ajcaldelas. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
###### How can this feature be enabled / disabled in a live cluster?

Cannot be dynamically enabled/disabled because of the CPU Manager state
Is this true? So the only way to enable this feature is to create a node with these settings?
Could I not set this policy and restart kubelet?
I agree, @kannon92's phrasing seems more correct and easier to follow. We can totally enable the feature, but this requires a kubelet restart and possibly the deletion of the cpu manager state file.
Would something like this clear it up?
"This feature depends on the CPU Manager state file found in /var/lib/kubelet/cpu_manager_state. Because of this, enabling the feature requires deleting this file and restarting the kubelet with the feature enabled."
This is basically the reasoning; maybe @wongchar could help me out here with the phrasing and more reasoning as to why this happens.
That's correct. Need to drain the node, enable/disable the feature, remove the cpu_manager_state file, reload daemons, and restart kubelet
I'd write
"disabling the feature requires a kubelet config change and a kubelet restart" or so.
The state file deletion is a side effect inherited from how cpumanager works. Nothing specific in this feature seems to imply or require that a deletion is warranted.
Actually, deletions of state files should be avoided entirely and they are a smell, but that's another story for another day.
How does this sound?
"From our testing, going from the `none` policy to the `static` policy cannot be done dynamically because of the `cpu_manager_state` file. The node needs to be drained and the policy checkpoint file needs to be removed before restarting the kubelet. This requirement stems from the CPUManager and has nothing to do with this feature specifically."
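To make the sequence discussed above concrete, here is a minimal sketch of the drain / remove-state-file / restart flow; the node name, config path, and service name are illustrative assumptions for a typical systemd-managed kubelet:

```sh
# Sketch only: node name, config path, and service management are assumptions;
# adjust for your distribution.

# 1. Drain the node so running workloads are rescheduled elsewhere.
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# 2. Flip the policy / policy option in the kubelet config
#    (e.g. /var/lib/kubelet/config.yaml), then remove the CPUManager
#    checkpoint file discussed in this thread.
rm /var/lib/kubelet/cpu_manager_state

# 3. Reload unit files and restart the kubelet so it rebuilds its state.
systemctl daemon-reload
systemctl restart kubelet

# 4. Return the node to service.
kubectl uncordon worker-1
```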
/lead-opted-in
this is something a lead can add :)
this has already been opted into :) #4800 (comment)
Force-pushed from f40f716 to 62e3cb6 (compare)
getting there, mostly suggestions and clarification proposals
- "@ffromani" | ||
- "@kannon92" | ||
approvers: | ||
- "@mrunalp" |
For now I'll replace it with @klueska
<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
  - Detection: How can it be detected via metrics? Stated another way:
    how can an operator troubleshoot without logging into a master or worker node?
  - Mitigations: What can be done to stop the bleeding, especially for already
    running user workloads?
  - Diagnostics: What are the useful log messages and their required logging
    levels that could help debug the issue?
    Not required until feature graduated to beta.
  - Testing: Are there any tests for failure mode? If not, describe why.
-->
since the feature is best effort, the known failure mode is failing to allocate optimally; the proposed metric and the podresources API should offer enough insight.
In the scenario where we can't optimally schedule, would it actually be considered a failure mode? It's more like a warning, right?
There are some comments from Francesco for alpha PRR, and a few nits from me. But mostly looks good, just waiting for sig-node approval.
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Nit: make sure to check the appropriate boxes above.
I checked off whatever I knew for sure, but there were some checks I wasn't sure about.
- Feature is enabled by a static policy option flag.
- It does not change behavior of non-static policy
- It preserves the behavior of other static options.
- Failure will occur if mismatch between options occurs and the Pod will not be scheduled.
This entire section is hard to read. Can you write it such that there's a clear:
- risk a:
  - mitigation a
- risk b:
  - mitigation b
etc.
#### Alpha

- Feature implemented behind a feature gate flag option
- Initial e2e tests completed and enabled
Let's add it here explicitly, so we don't forget.
###### What specific metrics should inform a rollback?

N/A
I'd suggest adding increased pod startup latency as an indicator for a rollback, at minimum.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- Metrics
  - Pod deployment time (seconds)
Can you point out a specific metric here?
We don't have a specific known metric to measure this; adding a new metric is part of our proposal. Would we mention that here?
For this and for #4810 (comment) we can use `topology_manager_admission_duration_ms` - since GA, topology manager will always be engaged even if no NUMA alignment is requested, so this is a good approximation. I'd be open to adding more metrics in the future, but this can be a nice start and perhaps enough for alpha.
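As a quick way to eyeball that metric, a hedged example of pulling it from a node's kubelet metrics endpoint through the API server proxy; the node name is illustrative and access depends on cluster RBAC:

```sh
# Fetch the kubelet /metrics endpoint via the API server node proxy and
# filter for the topology manager admission duration histogram.
kubectl get --raw "/api/v1/nodes/worker-1/proxy/metrics" \
  | grep topology_manager_admission_duration_ms
```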
disable-supported: true

# The following PRR answers are required at beta release
metrics: []
This should be filled in with the metrics you've mentioned in a few places; not a blocker for alpha, though.
If they don't actually exist yet, is that okay? We are still trying to come up with metrics to implement.
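For illustration only, the stanza might eventually take a shape like the sketch below, using the metric suggested earlier in this thread; the scratch path is hypothetical and the final metric list is still open:

```sh
# Sketch: write an example of the filled-in PRR stanza to a scratch file.
# The real change would replace `metrics: []` in the KEP's kep.yaml;
# the metric name comes from the review discussion above.
cat > /tmp/kep-metrics-example.yaml <<'EOF'
# The following PRR answers are required at beta release
metrics:
  - topology_manager_admission_duration_ms
EOF
```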
Co-authored-by: Maciej Szulik <[email protected]>
One-line PR description: Introduce a new CPU Manager Static policy option for L3 Cache awareness
Issue link: Enhancement Split L3 Cache Topology Awareness in CPU Manager #4800