Skip to content

Commit

Permalink
Merge pull request #98 from mikemckiernan/nouveau
Browse files Browse the repository at this point in the history
Troubleshoot nouveau
  • Loading branch information
mikemckiernan committed Sep 11, 2024
2 parents 4a9602f + f5d223c commit 640a420
Showing 1 changed file with 57 additions and 1 deletion.
58 changes: 57 additions & 1 deletion gpu-operator/troubleshooting.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,62 @@
Troubleshooting the NVIDIA GPU Operator
#######################################

.. contents::
:depth: 2
:local:
:backlinks: none

****************************************************
GPU Operator Validator: Failed to Create Pod Sandbox
****************************************************

.. rubric:: Issue
:class: h4

On some occasions, the driver container is unable to unload the ``nouveau`` Linux kernel module.

.. rubric:: Observation
:class: h4

- Running ``kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator`` includes the following event:

.. code-block:: console
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 8s (x21 over 9m2s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
- Running one of the following commands on the node indicates that the ``nouveau`` Linux kernel module is loaded:

.. code-block:: console
$ lsmod | grep -i nouveau
$ dmesg | grep -i nouveau
$ journalctl -xb | grep -i nouveau
.. rubric:: Root Cause
:class: h4

The ``nouveau`` Linux kernel module is loaded and the driver container is unable to unload the module.
Because the ``nouveau`` module is loaded, the driver container cannot load the ``nvidia`` module.

.. rubric:: Action
:class: h4

On each node, run the following commands to prevent loading the ``nouveau`` Linux kernel module on boot:

.. code-block:: console
$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
&& sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
&& sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
$ sudo update-initramfs -u
$ sudo init 6
*************************************
GPU Operator Pods Stuck in Crash Loop
*************************************
Expand Down Expand Up @@ -131,7 +187,7 @@ The output from the driver validation container indicates that the infoROM is co
| N/A 42C P0 29W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
Expand Down

0 comments on commit 640a420

Please sign in to comment.