
Create CDI spec error "libcuda.so.535.129.03 not found" in version "v0.15.0-rc.2" #621

Closed
Dragoncell opened this issue Apr 4, 2024 · 4 comments
Dragoncell commented Apr 4, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): COS
  • Kernel Version: Linux 6.1
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE

2. Issue or feature description

I am in the process of adding support for: NVIDIA/gpu-operator#659

With the change below:
NVIDIA/gpu-operator@master...Dragoncell:gpu-operator:master-gke

the device plugin works well with version v0.14.5:

helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator --set driver.enabled=false --set cdi.enabled=true --set cdi.default=true --set operator.runtimeClass=nvidia-cdi --set hostRoot=/ --set driverRoot=/home/kubernetes/bin/nvidia --set devRoot=/ --set operator.repository=gcr.io/jiamingxu-gke-dev --set operator.version=v0418 --set toolkit.installDir=/home/kubernetes/bin/nvidia --set toolkit.repository=gcr.io/jiamingxu-gke-dev --set toolkit.version=v4 --set validator.repository=gcr.io/jiamingxu-gke-dev --set validator.version=v0412_3 --set devicePlugin.version=v0.14.5

a) Pods are running:

$ kubectl get pods -n gpu-operator
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rr2x2                                1/1     Running     0          4h16m
gpu-operator-66575c8958-sslch                              1/1     Running     0          4h16m
noperator-node-feature-discovery-gc-6968c7c64-g7w7r        1/1     Running     0          4h16m
noperator-node-feature-discovery-master-749679f664-dvs48   1/1     Running     0          4h16m
noperator-node-feature-discovery-worker-glhxw              1/1     Running     0          4h16m
nvidia-container-toolkit-daemonset-wvpvx                   1/1     Running     0          4h16m
nvidia-cuda-validator-z84ks                                0/1     Completed   0          4h15m
nvidia-dcgm-exporter-9r87v                                 1/1     Running     0          4h16m
nvidia-device-plugin-daemonset-fp7hm                       1/1     Running     0          4h16m
nvidia-operator-validator-hstkb                            1/1     Running     0          4h16m

b) The nvidia-smi workload works well:

$ cat test-pod-smi.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args: 
    - |-
      export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits: 
        nvidia.com/gpu: "1"

$ kubectl logs  my-gpu-pod 
Thu Apr  4 23:16:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   36C    P0              16W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, after switching to v0.15.0-rc.2, the device plugin hits the error below and is in a crash loop:

E0404 18:39:43.479163       1 main.go:132] error starting plugins: error creating plugin manager: unable to create cdi spec file: failed to get CDI spec: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.129.03: pattern libcuda.so.535.129.03 not found
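
To reproduce outside of the plugin, the same CDI spec generation can be triggered manually on the node with the toolkit's nvidia-ctk binary. This is only a hedged debugging sketch using the paths from this setup; the exact flag names depend on the toolkit version, so check nvidia-ctk cdi generate --help first.

# Sketch: generate a CDI spec by hand with the nvidia-ctk binary installed by
# the toolkit container, pointing it at the same driver root the plugin uses.
$ /home/kubernetes/bin/nvidia/toolkit/nvidia-ctk cdi generate \
    --driver-root=/ \
    --output=/tmp/nvidia-cdi.yaml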

The same configuration is used for the device plugin, as shown below:

NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/home/kubernetes/bin/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/kubernetes/bin/nvidia/bin

NVIDIA_DRIVER_ROOT=/ is used to discover the devices, and PATH/LD_LIBRARY_PATH are used to discover the libraries (libcuda.so.535.129.03 is actually under /home/kubernetes/bin/nvidia/lib64).
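
For reference, here is a hedged sketch (paths taken from this report, not fresh command output) of how the library layout on the node can be confirmed. It illustrates why a search scoped to NVIDIA_DRIVER_ROOT=/ cannot match the pattern even though the file exists under the GKE driver install dir:

# libcuda is only present under the GKE driver install dir:
$ ls -l /home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03

# The standard lib dirs under the driver root "/" have no copy, and the ldcache
# at / does not list it, so a driver-root-scoped lookup fails:
$ ls /usr/lib64/libcuda.so* /usr/lib/x86_64-linux-gnu/libcuda.so* 2>/dev/null
$ ldconfig -p | grep libcuda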

Did something change in the new version that causes this error? Thanks

Dragoncell (Author) commented:

/cc @cdesiniotis @elezar @bobbypage

elezar self-assigned this Apr 10, 2024
elezar (Member) commented Apr 23, 2024

This should be addressed by #666.


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label Jul 23, 2024

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 22, 2024