
KEP-4800: Split UnCoreCache awareness #4810

Open
wants to merge 14 commits into master

Conversation

@ajcaldelas

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 27, 2024
@k8s-ci-robot
Contributor

Welcome @ajcaldelas!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2024
@k8s-ci-robot
Contributor

Hi @ajcaldelas. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 27, 2024
@kannon92
Contributor

/cc

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ajcaldelas
Once this PR has been reviewed and has the lgtm label, please assign jpbetz, mrunalp for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


###### How can this feature be enabled / disabled in a live cluster?

Cannot be dynamically enabled/disabled because of the CPU Manager state
Contributor

Is this true? So the only way to enable this feature is to create a node with these settings?

Could I not set this policy and restart kubelet?

Contributor

I agree, @kannon92's phrasing seems more correct and easier to follow. We can totally enable the feature, but this requires a kubelet restart, and possibly the deletion of the CPU manager state file.

Author

Would something like this clear it up?

This feature depends on the CPU Manager state file found in /var/lib/kubelet/cpu_manager_state; because of this, enabling the feature requires deleting that file and restarting the kubelet with the feature enabled.

This is basically the reasoning; maybe @wongchar could help me out here with the phrasing and more detail on why this happens.

@wongchar

That's correct. You need to drain the node, enable/disable the feature, remove the cpu_manager_state file, reload daemons, and restart the kubelet.

Contributor

I'd write
"disabling the feature requires a kubelet config change and a kubelet restart" or so.

The state file deletion is a side effect inherited from how the cpumanager works. Nothing specific to this feature implies or requires that a deletion is warranted.

Actually, deletions of state files should be avoided entirely and they are a smell, but that's another story for another day.

@ajcaldelas (Author) commented Sep 5, 2024

How does this sound?
From our testing this is a requirement: going from none to static cannot be done dynamically because of the cpu_manager_state file. The node needs to be drained and the policy checkpoint file needs to be removed before restarting the kubelet. This requirement stems from the CPUManager and has nothing to do with this feature specifically.
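
To make the procedure above concrete, here is a minimal sketch of the configuration side, assuming the feature lands as a CPU Manager static policy option behind the CPUManagerPolicyAlphaOptions feature gate; the option name prefer-align-cpus-by-uncorecache is an illustrative assumption, not something fixed by this thread.

# Hypothetical KubeletConfiguration fragment. Per the discussion above, the
# change cannot be applied dynamically: drain the node, remove
# /var/lib/kubelet/cpu_manager_state, then restart the kubelet.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyAlphaOptions: true   # new static-policy options start as alpha
cpuManagerPolicy: static               # the option only applies to the static policy
cpuManagerPolicyOptions:
  prefer-align-cpus-by-uncorecache: "true"   # assumed option name, for illustration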

@wongchar

/lead-opted-in

@ffromani
Contributor

/lead-opted-in

this is something a lead can add :)
@haircommander for awareness ^^^

@haircommander
Contributor

this has already been opted into :) #4800 (comment)

@ffromani (Contributor) left a comment

getting there, mostly suggestions and clarification proposals

- "@ffromani"
- "@kannon92"
approvers:
- "@mrunalp"
Contributor

please replace with @klueska per sig-node KEP planning (cc @mrunalp), or perhaps just leave @sig-node-leads? I'd be fine with either.

Author

For now I'll replace it with @klueska.

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 4 review threads (outdated, resolved)
Comment on lines +643 to +655

<!--
For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
- Detection: How can it be detected via metrics? Stated another way:
how can an operator troubleshoot without logging into a master or worker node?
- Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
- Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
Contributor

since the feature is best effort, the known failure mode is failing to allocate optimally; the proposed metric and the podresources API should offer enough insight.

Author

In the scenario where we can't optimally schedule, would it actually be considered a failure mode? It's more like a warning, right?

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 2 review threads (outdated, resolved)
@soltysh (Contributor) left a comment

There are some comments from Francesco for alpha PRR, and a few nits from me. But mostly looks good, just waiting for sig-node approval.

- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Contributor

Nit: make sure to check the appropriate boxes above.

Author

I checked off whatever I knew for sure, but there were some checks I wasn't sure about.

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 3 review threads (outdated, resolved)
- Feature is enabled by a static policy option flag.
- It does not change behavior of non-static policy
- It preserves the behavior of other static options.
- Failure will occur if there is a mismatch between options, and the Pod will not be scheduled.
Contributor

This entire section is hard to read. Can you write it so that there is a clear structure:

- risk A: mitigation A
- risk B: mitigation B
- etc.

#### Alpha

- Feature implemented behind a feature gate flag option
- Initial e2e tests completed and enabled
Contributor

Let's add it here explicitly, so we don't forget.


###### What specific metrics should inform a rollback?

N/A
Contributor

I'd suggest adding increased pod startup latency as a rollback signal, at minimum.
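
As a sketch of what such a rollback signal could look like, assuming the cluster scrapes the kubelet's kubelet_pod_start_duration_seconds histogram; the metric name, threshold, and rule names here are illustrative assumptions, not part of the KEP.

# Hypothetical Prometheus alerting rule: fire when p99 pod start latency
# stays high after enabling the option. The 45s threshold is arbitrary.
groups:
  - name: uncorecache-rollback-signals
    rules:
      - alert: PodStartLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(kubelet_pod_start_duration_seconds_bucket[10m])) by (le, instance))
          > 45
        for: 15m
        annotations:
          summary: "p99 pod start latency above 45s; consider rolling back the policy option"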

keps/sig-node/4800-cpumanager-split-uncorecache/README.md: 1 review thread (outdated, resolved)
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- Metrics
- Pod deployment time (seconds)
Contributor

Can you point out a specific metric here?

Author

We don't have a specific known metric to measure this; adding a new metric is part of our proposal. Should we mention that here?

Contributor

For this and for #4810 (comment) we can use topology_manager_admission_duration_ms: since GA, the topology manager is always engaged even if no NUMA alignment is requested, so this is a good approximation. I'd be open to adding more metrics in the future, but this can be a nice start and perhaps enough for alpha.
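
As a sketch (assuming topology_manager_admission_duration_ms is exposed by the kubelet as a Prometheus histogram, as the _ms naming convention suggests), this SLI could be captured with a recording rule like the following; the rule and series names are assumptions for illustration.

# Hypothetical recording rule turning the suggested kubelet histogram into
# a p99 admission-latency SLI series.
groups:
  - name: uncorecache-sli
    rules:
      - record: sli:topology_manager_admission_duration_ms:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(topology_manager_admission_duration_ms_bucket[5m])) by (le, instance))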

disable-supported: true

# The following PRR answers are required at beta release
metrics: []
Contributor

This should be filled in with the metrics you've mentioned in a few places; not a blocker for alpha, though.

Author

If they don't actually exist yet, is that okay? We are still trying to come up with metrics to implement.
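
As an illustration of how the kep.yaml block above might eventually be filled in, here is a sketch using the existing kubelet metric suggested earlier in the review; any KEP-specific metrics would replace or extend this list once they exist.

# Sketch: kep.yaml PRR block filled in with the metric suggested in review.
disable-supported: true

# The following PRR answers are required at beta release
metrics:
  - topology_manager_admission_duration_ms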

Labels
- cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
- kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory)
- ok-to-test (indicates a non-member PR verified by an org member that is safe to test)
- sig/node (categorizes an issue or PR as relevant to SIG Node)
- size/XL (denotes a PR that changes 500-999 lines, ignoring generated files)
Projects
Status: Waiting on Author