[release-4.14] resync 20240529 #202

Merged: 8 commits, Jun 4, 2024

ffromani
Member

resync to consume fixes to ephemeral storage

add API fixes because the cherry-picked commits were made against a much more modern codebase (and k8s libs)

Bump NRT API package to v0.1.2; there is no API change,
but we now have a better replacement for the internal
`getID` helper, which can therefore be removed.

Signed-off-by: Francesco Romani <[email protected]>
(cherry picked from commit 8d9a4cd)
"host-level" resources are resources which are not
expected to have NUMA affinity. This means
that the absence of these resources from per-NUMA
resource counters should not, by itself, prevent scheduling
on a given node.

Signed-off-by: Francesco Romani <[email protected]>
(cherry picked from commit f7057da)
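
To illustrate the rule described in this commit message, here is a minimal Go sketch. It is not the code from the commit: the helper names `isHostLevel` and `fitsZone`, the use of plain strings for resource names, and the choice of `ephemeral-storage` and `storage` as the host-level set are assumptions made for the example. Treating any other missing resource as blocking is also a simplification of this sketch.

```go
package main

import "fmt"

// isHostLevel is a hypothetical helper (not taken from this PR): it reports
// whether a resource is assumed to have no NUMA affinity.
func isHostLevel(name string) bool {
	switch name {
	case "ephemeral-storage", "storage":
		return true
	}
	return false
}

// fitsZone reports whether one NUMA zone can satisfy the requests.
// zoneAvailable maps resource names to the quantity available in that zone.
func fitsZone(zoneAvailable, requests map[string]int64) bool {
	for name, qty := range requests {
		avail, reported := zoneAvailable[name]
		if !reported {
			if isHostLevel(name) {
				// Not tracked per NUMA zone: this alone must not block scheduling.
				continue
			}
			// Assumption of this sketch: any other missing resource blocks.
			return false
		}
		if avail < qty {
			return false
		}
	}
	return true
}

func main() {
	zone := map[string]int64{"cpu": 4, "memory": 8 << 30}
	pod := map[string]int64{"cpu": 2, "ephemeral-storage": 1 << 30}
	fmt.Println("fits:", fitsZone(zone, pod))
}
```

The point of the sketch is only the `continue` branch: a host-level resource that is absent from the zone counters is skipped rather than counted as a failure.
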
We call "NUMA-affine" resources compute resources like
CPU and memory/hugepages, which we know
expose NUMA affinity.

This is another attempt to factor this logic into a central place.

Signed-off-by: Francesco Romani <[email protected]>
(cherry picked from commit c48f462)
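
A central predicate for this definition could look roughly like the Go sketch below. The function name `isNUMAAffine` and the string-based resource names are assumptions for illustration, not the helper introduced by the commit; the only fact relied on is that Kubernetes hugepages resources are named with the `hugepages-` prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// hugepagesPrefix is the prefix Kubernetes uses for hugepages resources,
// for example "hugepages-2Mi".
const hugepagesPrefix = "hugepages-"

// isNUMAAffine is a hypothetical helper: it reports whether a resource is one
// of the compute resources that are known to expose NUMA affinity.
func isNUMAAffine(name string) bool {
	switch name {
	case "cpu", "memory":
		return true
	}
	return strings.HasPrefix(name, hugepagesPrefix)
}

func main() {
	names := []string{"cpu", "memory", "hugepages-2Mi", "ephemeral-storage", "example.com/device"}
	for _, name := range names {
		fmt.Printf("%-20s NUMA-affine=%v\n", name, isNUMAAffine(name))
	}
}
```

Keeping the check in one function means every caller agrees on which resources count as NUMA-affine.
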
Rewrite the accounting of NUMA-local resources when
using scope=container. The previous code was too lenient
and worked mostly by side effects when dealing with
non-NUMA-affine resources.

A non-NUMA-affine resource (a.k.a. a host-level resource)
is a resource which is not guaranteed to always have
a NUMA affinity. CPU and memory (incl. hugepages) always do,
but devices may or may not; both options are legal for
device plugins.

Similarly, ephemeral storage is a prominent example of a resource
which should never have a NUMA affinity.
The accounting in this case was wrong because the
resource was previously considered NUMA-affine.

Note: it's legal to configure topology updaters (e.g. NFD)
to not advertise CPU and memory in NRT objects.
Thus it is best to treat their absence as a warning, not
as a blocking error.

However, if the NUMA-affine counters go negative,
this is definitely an error condition we need to detect
and be very loud about.

Signed-off-by: Francesco Romani <[email protected]>
(cherry picked from commit e9b8aa4)
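
The accounting rules in this commit message can be sketched in Go as follows. This is an illustration under stated assumptions, not the rewritten code itself: `resourceList`, `isNUMAAffine` and `subtractFromZone` are hypothetical names, quantities are plain `int64` values, and the sketch works on a copy of the zone counters so a failed subtraction leaves the input untouched.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// resourceList maps a resource name to a quantity (hypothetical type).
type resourceList map[string]int64

// isNUMAAffine reports whether a resource always has NUMA affinity:
// CPU, memory and hugepages (hypothetical helper).
func isNUMAAffine(name string) bool {
	return name == "cpu" || name == "memory" || strings.HasPrefix(name, "hugepages-")
}

// subtractFromZone returns a copy of the zone counters with the container
// requests deducted. Resources the zone does not report are skipped; a
// counter that would go negative is reported as an error.
func subtractFromZone(zone, requests resourceList) (resourceList, error) {
	updated := make(resourceList, len(zone))
	for name, qty := range zone {
		updated[name] = qty
	}
	for name, qty := range requests {
		avail, reported := updated[name]
		if !reported {
			if isNUMAAffine(name) {
				// Updaters may legally omit CPU and memory: warn, do not fail.
				log.Printf("warning: zone does not report %q; skipping", name)
			}
			// Host-level resources (e.g. ephemeral storage) are simply skipped.
			continue
		}
		if avail < qty {
			return nil, fmt.Errorf("counter for %q would go negative: available=%d requested=%d",
				name, avail, qty)
		}
		updated[name] = avail - qty
	}
	return updated, nil
}

func main() {
	zone := resourceList{"cpu": 4, "memory": 8 << 30}
	container := resourceList{"cpu": 2, "ephemeral-storage": 1 << 30}
	updated, err := subtractFromZone(zone, container)
	if err != nil {
		log.Fatalf("accounting error: %v", err)
	}
	fmt.Println("cpu left in zone:", updated["cpu"])
}
```

A device resource that the zone does report is deducted like any other counter in this sketch, which matches the note that devices may or may not have NUMA affinity.
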
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) May 29, 2024
@openshift-ci openshift-ci bot requested review from swatisehgal and yanirq May 29, 2024 14:12

openshift-ci bot commented May 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 29, 2024
The ephemeral storage resource is not a deciding factor
for noderesourcetopology filtering, but it was incorrectly
accounted for, causing bad scheduling decisions.
First, we add some integration test coverage to catch
these issues.

Signed-off-by: Francesco Romani <[email protected]>
(cherry picked from commit e3388b9)
Signed-off-by: Francesco Romani <[email protected]>
add compatibility fixes to deal with the older codebase,
non-backported patches, and older k8s libs.

Signed-off-by: Francesco Romani <[email protected]>
record targeted cherry picks on top of latest rebase

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani changed the title from "WIP: [release-4.14] resync 20240529" to "[release-4.14] resync 20240529" May 30, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) May 30, 2024
@ffromani
Member Author

/hold

we need to verify more recent backports first

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 30, 2024
@ffromani
Member Author

ffromani commented Jun 4, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Jun 4, 2024
@ffromani ffromani merged commit 92c3bc7 into release-4.14 Jun 4, 2024
6 of 7 checks passed
@ffromani ffromani deleted the resync-20240529-4.14 branch June 4, 2024 07:41