-
Notifications
You must be signed in to change notification settings - Fork 614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support non-standard driver installs #666
Conversation
15d8edb
to
2a86769
Compare
59cc999
to
6c8f72b
Compare
internal/cdi/cdi.go
Outdated
infolib := info.New(info.WithRoot(c.driverRoot)) | ||
hasNVML, _ := infolib.HasNvml() | ||
if !hasNVML { | ||
klog.Warning("No valid resources detected, creating a null CDI handler") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we mention that NVML was not found in the warning message here?
internal/cdi/cdi.go
Outdated
@@ -66,6 +69,18 @@ func newHandler(opts ...Option) (Interface, error) { | |||
if !c.deviceListStrategies.IsCDIEnabled() { | |||
return &null{}, nil | |||
} | |||
if c.driverRoot == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see us setting a default value for devRoot
if it is not explicitly set by the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's probably better to do, but is redundant since we 1. set it in the caller, and 2. also set it in the CDI api. Let me clean it up regardless.
internal/cdi/cdi.go
Outdated
// The transformation above may change the required spec version. | ||
// TODO: This should be integrated into the transformer. | ||
minVersion, err := cdiapi.MinimumRequiredVersion(spec.Raw()) | ||
if err != nil { | ||
return fmt.Errorf("failed to get minimum required CDI spec version: %v", err) | ||
} | ||
spec.Raw().Version = minVersion | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is this change required for this PR? This doesn't appear to be related to the devRoot changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this is required is that devices with different host paths to device paths require a v0.5.0 spec. This means that the transform changes the initial spec version. I will look at how we can handle this through transformer options instead.
@@ -46,7 +51,8 @@ func NewPluginManager(config *spec.Config) (manager.Interface, error) { | |||
|
|||
cdiHandler, err := cdi.New( | |||
cdi.WithDeviceListStrategies(deviceListStrategies), | |||
cdi.WithDriverRoot(*config.Flags.Plugin.ContainerDriverRoot), | |||
cdi.WithDriverRoot(string(driverRoot)), | |||
cdi.WithDevRoot(driverRoot.getDevRoot()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- why are we not using the NVIDIA_DEVICE_ROOT
option here that was added to the device plugin CLI? From what I gather, driverRoot.getDevRoot()
will always resolve to either driverRoot
or /
. What if a custom dev root is configured that is neither of these paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The devRoot
from the context of CDI in this case is the dev root in the container. This is determined from the driverRoot
(in the container) by checking if {{ .driverRoot }}/dev
exists and using that, otherwise using /dev
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Let's keep this as is for now.
One thing to consider in the context of the GPU Operator -- I currently set DEV_ROOT_CTR_PATH
in the driver-ready
status file, to indicate the container path of the device root. If it makes sense to do so, we could consider leveraging this here in the device-plugin (and other operands).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought that I had was that we could set this in the management CDI spec. Will have to check what that would mean though.
f5304a9
to
c9721c1
Compare
cdi.WithTargetDriverRoot(*config.Flags.NvidiaDriverRoot), | ||
cdi.WithTargetDevRoot(*config.Flags.NvidiaDevRoot), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the TargetDevRoot
is the dev root on the host. This is set to NvidiaDevRoot
.
012fa78
to
23eddda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
api/config/v1/strategy.go
Outdated
for k, v := range s { | ||
if strings.HasPrefix(k, "cdi-") && v { | ||
return true | ||
} | ||
} | ||
return false | ||
} | ||
|
||
// IsOnlyCDIEnabled returns whether all strategies being used require CDI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// IsOnlyCDIEnabled returns whether all strategies being used require CDI. | |
// AllCDIEnabled returns whether all strategies being used require CDI. |
@@ -46,7 +51,8 @@ func NewPluginManager(config *spec.Config) (manager.Interface, error) { | |||
|
|||
cdiHandler, err := cdi.New( | |||
cdi.WithDeviceListStrategies(deviceListStrategies), | |||
cdi.WithDriverRoot(*config.Flags.Plugin.ContainerDriverRoot), | |||
cdi.WithDriverRoot(string(driverRoot)), | |||
cdi.WithDevRoot(driverRoot.getDevRoot()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Let's keep this as is for now.
One thing to consider in the context of the GPU Operator -- I currently set DEV_ROOT_CTR_PATH
in the driver-ready
status file, to indicate the container path of the device root. If it makes sense to do so, we could consider leveraging this here in the device-plugin (and other operands).
Signed-off-by: Evan Lezar <[email protected]>
This change adds a nvidaDevRoot config option to allow for driver installations where the device nodes are not rooted at the nvidiaDriverRoot. This option can be controlled using the HOST_DEV_ROOT environment variable or host-dev-root command line argument. If this value is not specified, the value of nvidiaDriverRoot will be used. Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
In the case where ONLY CDI-based device-list-strategies are requested, the standard updates to the container allocate response do not make sense. This change skips these updates in this case. Signed-off-by: Evan Lezar <[email protected]>
Thes changes add support to the NVIDIA GPU Device Plugin for driver
installations that have the following properties:
hostDriverRoot
, say:/host/nvidia/driver
/dev
In addition to the mounts of driver libraries (e.g.
libnvidia-ml.so.VERSION
)and device nodes that are handled by the NVIDIA Container Runtime or CDI to
allow the device plugin to detect and enumerate devices, the
hostDriverRoot
is mounted into the device plugin at
/driver-root
. This allows the detectionof driver files that are not required for the device plugin to fuction, but may
be required by specific workloads.
From the perspective of the CDI spec generation being run in the device plugin
container, all driver files are rooted at
/driver-root
. Furthermore, sinceno device nodes are detected at
/driver-root/dev
we assume that these areat
/dev
in the container (and on the host). The generated CDI specificationsneed to be transformed so that they are valid for container workloads started
on the host. This means that occurrences of
/driver-root
need to betransformed to
hostDriverRoot
or/host/nvidia/driver
as per our example.