
Create CDI spec error "libcuda.so.535.129.03 not found" in version "v0.15.0-rc.2" #621

Closed
Dragoncell opened this issue Apr 4, 2024 · 4 comments
Dragoncell commented Apr 4, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): COS
  • Kernel Version: Linux 6.1
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE

2. Issue or feature description

I am in the process of adding support for: NVIDIA/gpu-operator#659

With the change below:
NVIDIA/gpu-operator@master...Dragoncell:gpu-operator:master-gke

the device plugin works well with version v0.14.5:

helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator --set driver.enabled=false --set cdi.enabled=true --set cdi.default=true --set operator.runtimeClass=nvidia-cdi --set hostRoot=/ --set driverRoot=/home/kubernetes/bin/nvidia --set devRoot=/ --set operator.repository=gcr.io/jiamingxu-gke-dev --set operator.version=v0418 --set toolkit.installDir=/home/kubernetes/bin/nvidia --set toolkit.repository=gcr.io/jiamingxu-gke-dev --set toolkit.version=v4 --set validator.repository=gcr.io/jiamingxu-gke-dev --set validator.version=v0412_3 --set devicePlugin.version=v0.14.5

a) Pods are running:

$ kubectl get pods -n gpu-operator
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rr2x2                                1/1     Running     0          4h16m
gpu-operator-66575c8958-sslch                              1/1     Running     0          4h16m
noperator-node-feature-discovery-gc-6968c7c64-g7w7r        1/1     Running     0          4h16m
noperator-node-feature-discovery-master-749679f664-dvs48   1/1     Running     0          4h16m
noperator-node-feature-discovery-worker-glhxw              1/1     Running     0          4h16m
nvidia-container-toolkit-daemonset-wvpvx                   1/1     Running     0          4h16m
nvidia-cuda-validator-z84ks                                0/1     Completed   0          4h15m
nvidia-dcgm-exporter-9r87v                                 1/1     Running     0          4h16m
nvidia-device-plugin-daemonset-fp7hm                       1/1     Running     0          4h16m
nvidia-operator-validator-hstkb                            1/1     Running     0          4h16m

b) The nvidia-smi workload works well:

$ cat test-pod-smi.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args: 
    - |-
      export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits: 
        nvidia.com/gpu: "1"

$ kubectl logs  my-gpu-pod 
Thu Apr  4 23:16:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   36C    P0              16W /  72W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, after switching to v0.15.0-rc.2, the device plugin hits the error below and is in a crash loop:

E0404 18:39:43.479163       1 main.go:132] error starting plugins: error creating plugin manager: unable to create cdi spec file: failed to get CDI spec: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.129.03: pattern libcuda.so.535.129.03 not found
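
To reproduce outside of the plugin, the same CDI spec generation can be triggered manually on the node with the toolkit's nvidia-ctk binary. This is only a hedged debugging sketch using the paths from this setup; the exact flag names depend on the toolkit version, so check nvidia-ctk cdi generate --help first.

# Sketch: generate a CDI spec by hand with the nvidia-ctk binary installed by
# the toolkit container, pointing it at the same driver root the plugin uses.
$ /home/kubernetes/bin/nvidia/toolkit/nvidia-ctk cdi generate \
    --driver-root=/ \
    --output=/tmp/nvidia-cdi.yaml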

The same configuration is used for the device plugin, as shown below:

NVIDIA_DRIVER_ROOT=/
CONTAINER_DRIVER_ROOT=/host
NVIDIA_CTK_PATH=/home/kubernetes/bin/nvidia/toolkit/nvidia-ctk
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/home/kubernetes/bin/nvidia/lib64
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/kubernetes/bin/nvidia/bin

NVIDIA_DRIVER_ROOT=/ is used to discover the devices, and PATH/LD_LIBRARY_PATH are used to discover the libraries (libcuda.so.535.129.03 is actually under /home/kubernetes/bin/nvidia/lib64).
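
For reference, here is a hedged sketch (paths taken from this report, not fresh command output) of how the library layout on the node can be confirmed. It illustrates why a search scoped to NVIDIA_DRIVER_ROOT=/ cannot match the pattern even though the file exists under the GKE driver install dir:

# libcuda is only present under the GKE driver install dir:
$ ls -l /home/kubernetes/bin/nvidia/lib64/libcuda.so.535.129.03

# The standard lib dirs under the driver root "/" have no copy, and the ldcache
# at / does not list it, so a driver-root-scoped lookup fails:
$ ls /usr/lib64/libcuda.so* /usr/lib/x86_64-linux-gnu/libcuda.so* 2>/dev/null
$ ldconfig -p | grep libcuda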

Did something change in the new version that causes this error? Thanks

Dragoncell (Author) commented:

/cc @cdesiniotis @elezar @bobbypage

elezar self-assigned this Apr 10, 2024
elezar (Member) commented Apr 23, 2024

This should be addressed by #666.


This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label Jul 23, 2024

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 22, 2024