$ cat test-pod-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args:
    - |-
      export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits:
        nvidia.com/gpu: "1"
$ kubectl logs my-gpu-pod
Thu Apr 4 23:16:57 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 36C P0 16W / 72W | 4MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
1. Quick Debug Information

2. Issue or feature description

I am in the process of adding support in NVIDIA/gpu-operator#659, with the change below:
NVIDIA/gpu-operator@master...Dragoncell:gpu-operator:master-gke

With this change, the device plugin works well with version v0.14.5:
a) the Pod is running (see the manifest and nvidia-smi output above)
b) the nvidia-smi workload works well

However, after switching to v0.15.0-rc.2, the device plugin hits the error below and goes into a crash loop:

E0404 18:39:43.479163 1 main.go:132] error starting plugins: error creating plugin manager: unable to create cdi spec file: failed to get CDI spec: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.129.03: pattern libcuda.so.535.129.03 not found

The device plugin uses the same configuration as before: NVIDIA_DRIVER_ROOT=/ is used to discover the devices, and PATH/LD_LIBRARY_PATH are used to discover the libraries (libcuda.so.535.129.03 is actually under /home/kubernetes/bin/nvidia/lib64).

Did something change in the new version that causes this error? Thanks.