Support for COS #77

Merged: 3 commits, Jul 31, 2024
173 changes: 145 additions & 28 deletions gpu-operator/google-gke.rst
@@ -30,21 +30,40 @@ NVIDIA GPU Operator with Google GKE
About Using the Operator with Google GKE
****************************************


There are two ways to use NVIDIA GPU Operator with Google Kubernetes Engine (GKE).
You can use the Google driver installer to install and manage the NVIDIA GPU Driver on the nodes,
or you can use the Operator and NVIDIA Driver Manager to manage the driver and the other NVIDIA software components.

The choice depends on the operating system and whether you prefer to have the Operator manage all the software components.

.. list-table::
   :header-rows: 1
   :stub-columns: 1
   :widths: 1 2 5

   * -
     - Supported OS
     - Summary

   * - | Google
       | Driver
       | Installer
     - - Container-Optimized OS
       - Ubuntu with containerd
     - The Google driver installer manages the NVIDIA GPU Driver.
       NVIDIA GPU Operator manages the other software components.

   * - | NVIDIA
       | Driver
       | Manager
     - - Ubuntu with containerd
     - NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and the other NVIDIA software.

The preceding information applies to GKE Standard node pools.
The GPU Operator is not supported with Autopilot Pods; instead, refer to
`Deploy GPU workloads in Autopilot <https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus>`__.

*************
Prerequisites
@@ -67,11 +86,112 @@ Prerequisites
in the Google Cloud documentation.


*********
Procedure
*********
*********************************
Using the Google Driver Installer
*********************************

Perform the following steps to create a GKE cluster with the ``gcloud`` CLI and use the Google driver installer to manage the GPU driver.
You can create a node pool that uses a Container-Optimized OS node image or an Ubuntu node image.

#. Create the cluster.
Refer to `Running GPUs in GKE Standard clusters <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create>`__
in the GKE documentation.

When you create the cluster, specify the following additional ``gcloud`` command-line options:

- ``--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"``

The node label disables the GKE GPU device plugin daemon set on GPU nodes.

- ``--accelerator type=...,gpu-driver-version=disabled``

This argument prevents GKE from automatically installing the GPU driver on GPU nodes.

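Put together, a cluster-creation command with these two options might look like the following sketch.
The cluster name, zone, machine type, and accelerator type are placeholders, not values from this guide:

```shell
# Hypothetical example: create a GKE Standard cluster whose GPU nodes
# skip both the GKE device plugin and automatic driver installation.
gcloud container clusters create demo-cluster \
    --zone us-west1-a \
    --machine-type n1-standard-4 \
    --accelerator "type=nvidia-tesla-t4,count=1,gpu-driver-version=disabled" \
    --node-labels="gke-no-default-nvidia-gpu-device-plugin=true"
```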
#. Get the authentication credentials for the cluster:

.. code-block:: console

   $ USE_GKE_GCLOUD_AUTH_PLUGIN=True \
       gcloud container clusters get-credentials demo-cluster --zone us-west1-a

Review comment: I think we can drop the ``USE_GKE_GCLOUD_AUTH_PLUGIN=True`` and rely on the ``gcloud`` default; it shouldn't be needed.
cc @Dragoncell do you have some context on why we suggested this originally?

#. Optional: Verify that you can connect to the cluster:

.. code-block:: console

$ kubectl get nodes -o wide

#. Create the namespace for the NVIDIA GPU Operator:

.. code-block:: console

$ kubectl create ns gpu-operator

#. Create a file, such as ``gpu-operator-quota.yaml``, with contents like the following example:

.. literalinclude:: ./manifests/input/google-gke-gpu-operator-quota.yaml
:language: yaml
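The referenced manifest file is not shown in this diff.
A quota of the following shape is an assumption based on the ``pods: 0/100`` output later in this procedure; it admits only pods with critical priority classes, such as the Operator's operands, into the namespace:

```yaml
# Sketch of a ResourceQuota of the kind this step describes.
# The scope selector limits the quota to the critical priority classes,
# so ordinary pods cannot schedule into the gpu-operator namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - system-node-critical
          - system-cluster-critical
```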

#. Apply the resource quota:

.. code-block:: console

$ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml

#. Optional: View the resource quota:

.. code-block:: console

$ kubectl get -n gpu-operator resourcequota

*Example Output*

.. code-block:: output

NAME AGE REQUEST
gpu-operator-quota 38s pods: 0/100

#. Install the Google driver installer daemon set.

For Container-Optimized OS:

.. code-block:: console

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

For Ubuntu, the manifest to apply depends on the GPU model and the node version.
Refer to the **Ubuntu** tab at
`Manually install NVIDIA GPU drivers <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>`__
in the GKE documentation.

#. Install the Operator using Helm:

.. code-block:: console

$ helm install --wait --generate-name \
-n gpu-operator \
nvidia/gpu-operator \
--set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
--set toolkit.installDir=/home/kubernetes/bin/nvidia \
--set cdi.enabled=true \
--set cdi.default=true \
--set driver.enabled=false

Set the NVIDIA Container Toolkit and driver installation path to ``/home/kubernetes/bin/nvidia``.
On GKE node images, the ``/home`` directory is writable and is a stateful location for storing the NVIDIA runtime binaries.
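The same settings can be kept in a values file instead of repeated ``--set`` flags.
The following is a sketch; the keys mirror the flags in the command above:

```yaml
# values.yaml -- equivalent to the --set flags in the helm install command
hostPaths:
  driverInstallDir: /home/kubernetes/bin/nvidia
toolkit:
  installDir: /home/kubernetes/bin/nvidia
cdi:
  enabled: true
  default: true
driver:
  enabled: false
```

You can then install with ``helm install --wait --generate-name -n gpu-operator nvidia/gpu-operator -f values.yaml``.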

To configure MIG with NVIDIA MIG Manager, specify the following additional Helm command arguments:

.. code-block:: console

--set migManager.env[0].name=WITH_REBOOT \
--set-string migManager.env[0].value=true


***************************
Using NVIDIA Driver Manager
***************************

Perform the following steps to create a GKE cluster with the ``gcloud`` CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver.
This approach is supported only with Ubuntu node images.
Review comment (Contributor): As a reader, it is not entirely clear to me that this section, Using NVIDIA Driver Manager, is only applicable for Ubuntu node pools. It may be worth adding a sentence here stating that this approach is only supported on Ubuntu.

The steps create the cluster with a node pool that uses an Ubuntu with containerd node image.

#. Create the cluster by running a command that is similar to the following example:
@@ -94,12 +214,11 @@
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--enable-ip-alias \
--no-enable-intra-node-visibility \
--default-max-pods-per-node "110" \
--no-enable-master-authorized-networks \
--tags=nvidia-ingress-all

Creating the cluster requires several minutes.

#. Get the authentication credentials for the cluster:

@@ -146,12 +265,8 @@
gpu-operator-quota 38s pods: 0/100


#. Install the Operator.
Refer to :ref:`install the NVIDIA GPU Operator <install-gpu-operator>`.


*******************
Related Information
*******************

* If you have an existing GKE cluster, refer to
`Add and manage node pools <https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools>`_
in the GKE documentation.
* When you create new node pools, specify the ``--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"`` CLI argument
to disable the GKE GPU device plugin daemon set on GPU nodes.
13 changes: 13 additions & 0 deletions gpu-operator/release-notes.rst
@@ -34,6 +34,19 @@
See the :ref:`GPU Operator Component Matrix` for a list of components included i

----

.. _v24.6.0:

24.6.0
======

New Features
------------

* Added support for using the Operator with Container-Optimized OS on Google Kubernetes Engine (GKE).
The process uses the Google driver installer to manage the NVIDIA GPU Driver.
For Ubuntu on GKE, you can use the Google driver installer or continue to use the NVIDIA Driver Manager as with previous releases.
Refer to :doc:`google-gke` for more information.

.. _v24.3.0:

24.3.0