Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-standard driver installs #666

Merged
merged 4 commits into from
Jun 17, 2024
Merged

Conversation

elezar
Copy link
Member

@elezar elezar commented Apr 19, 2024

Thes changes add support to the NVIDIA GPU Device Plugin for driver
installations that have the following properties:

  • driver files are installed at a hostDriverRoot, say: /host/nvidia/driver
  • device nodes are created at /dev

In addition to the mounts of driver libraries (e.g. libnvidia-ml.so.VERSION)
and device nodes that are handled by the NVIDIA Container Runtime or CDI to
allow the device plugin to detect and enumerate devices, the hostDriverRoot
is mounted into the device plugin at /driver-root. This allows the detection
of driver files that are not required for the device plugin to fuction, but may
be required by specific workloads.

From the perspective of the CDI spec generation being run in the device plugin
container, all driver files are rooted at /driver-root. Furthermore, since
no device nodes are detected at /driver-root/dev we assume that these are
at /dev in the container (and on the host). The generated CDI specifications
need to be transformed so that they are valid for container workloads started
on the host. This means that occurrences of /driver-root need to be
transformed to hostDriverRoot or /host/nvidia/driver as per our example.

infolib := info.New(info.WithRoot(c.driverRoot))
hasNVML, _ := infolib.HasNvml()
if !hasNVML {
klog.Warning("No valid resources detected, creating a null CDI handler")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we mention that NVML was not found in the warning message here?

@@ -66,6 +69,18 @@ func newHandler(opts ...Option) (Interface, error) {
if !c.deviceListStrategies.IsCDIEnabled() {
return &null{}, nil
}
if c.driverRoot == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see us setting a default value for devRoot if it is not explicitly set by the caller.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's probably better to do, but is redundant since we 1. set it in the caller, and 2. also set it in the CDI api. Let me clean it up regardless.

Comment on lines 178 to 187
// The transformation above may change the required spec version.
// TODO: This should be integrated into the transformer.
minVersion, err := cdiapi.MinimumRequiredVersion(spec.Raw())
if err != nil {
return fmt.Errorf("failed to get minimum required CDI spec version: %v", err)
}
spec.Raw().Version = minVersion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is this change required for this PR? This doesn't appear to be related to the devRoot changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this is required is that devices with different host paths to device paths require a v0.5.0 spec. This means that the transform changes the initial spec version. I will look at how we can handle this through transformer options instead.

api/config/v1/flags.go Outdated Show resolved Hide resolved
internal/plugin/server.go Outdated Show resolved Hide resolved
@@ -46,7 +51,8 @@ func NewPluginManager(config *spec.Config) (manager.Interface, error) {

cdiHandler, err := cdi.New(
cdi.WithDeviceListStrategies(deviceListStrategies),
cdi.WithDriverRoot(*config.Flags.Plugin.ContainerDriverRoot),
cdi.WithDriverRoot(string(driverRoot)),
cdi.WithDevRoot(driverRoot.getDevRoot()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- why are we not using the NVIDIA_DEVICE_ROOT option here that was added to the device plugin CLI? From what I gather, driverRoot.getDevRoot() will always resolve to either driverRoot or /. What if a custom dev root is configured that is neither of these paths?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The devRoot from the context of CDI in this case is the dev root in the container. This is determined from the driverRoot (in the container) by checking if {{ .driverRoot }}/dev exists and using that, otherwise using /dev.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Let's keep this as is for now.

One thing to consider in the context of the GPU Operator -- I currently set DEV_ROOT_CTR_PATH in the driver-ready status file, to indicate the container path of the device root. If it makes sense to do so, we could consider leveraging this here in the device-plugin (and other operands).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought that I had was that we could set this in the management CDI spec. Will have to check what that would mean though.

@elezar elezar force-pushed the gke-emulation branch 2 times, most recently from f5304a9 to c9721c1 Compare June 3, 2024 14:00
@elezar elezar marked this pull request as ready for review June 4, 2024 12:38
cdi.WithTargetDriverRoot(*config.Flags.NvidiaDriverRoot),
cdi.WithTargetDevRoot(*config.Flags.NvidiaDevRoot),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the TargetDevRoot is the dev root on the host. This is set to NvidiaDevRoot.

@elezar elezar force-pushed the gke-emulation branch 3 times, most recently from 012fa78 to 23eddda Compare June 5, 2024 14:03
Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

for k, v := range s {
if strings.HasPrefix(k, "cdi-") && v {
return true
}
}
return false
}

// IsOnlyCDIEnabled returns whether all strategies being used require CDI.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// IsOnlyCDIEnabled returns whether all strategies being used require CDI.
// AllCDIEnabled returns whether all strategies being used require CDI.

@@ -46,7 +51,8 @@ func NewPluginManager(config *spec.Config) (manager.Interface, error) {

cdiHandler, err := cdi.New(
cdi.WithDeviceListStrategies(deviceListStrategies),
cdi.WithDriverRoot(*config.Flags.Plugin.ContainerDriverRoot),
cdi.WithDriverRoot(string(driverRoot)),
cdi.WithDevRoot(driverRoot.getDevRoot()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Let's keep this as is for now.

One thing to consider in the context of the GPU Operator -- I currently set DEV_ROOT_CTR_PATH in the driver-ready status file, to indicate the container path of the device root. If it makes sense to do so, we could consider leveraging this here in the device-plugin (and other operands).

This change adds a nvidaDevRoot config option to allow for driver
installations where the device nodes are not rooted at the nvidiaDriverRoot.

This option can be controlled using the HOST_DEV_ROOT environment variable
or host-dev-root command line argument.

If this value is not specified, the value of nvidiaDriverRoot will be used.

Signed-off-by: Evan Lezar <[email protected]>
In the case where ONLY CDI-based device-list-strategies are
requested, the standard updates to the container allocate response
do not make sense.

This change skips these updates in this case.

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar enabled auto-merge June 17, 2024 11:20
@elezar elezar merged commit 092b88d into NVIDIA:main Jun 17, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants