From 245cf68a4f4df10ec7508aa40d3e7950af8471cd Mon Sep 17 00:00:00 2001 From: Chip Zoller Date: Sun, 25 Aug 2024 07:36:07 -0400 Subject: [PATCH 01/19] fix gfd Signed-off-by: Chip Zoller --- docs/gpu-feature-discovery/README.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 57e63acb7..75df7851c 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -41,7 +41,7 @@ to ease the transition. The list of prerequisites for running the NVIDIA GPU Feature Discovery is described below: * nvidia-docker version > 2.0 (see how to [install](https://github.com/NVIDIA/nvidia-docker) -and it's [prerequisites](https://github.com/nvidia/nvidia-docker/wiki/Installation-\(version-2.0\)#prerequisites)) +and its [prerequisites](https://github.com/nvidia/nvidia-docker/wiki/Installation-\(version-2.0\)#prerequisites)) * docker configured with nvidia as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime). * Kubernetes version >= 1.10 * NVIDIA device plugin for Kubernetes (see how to [setup](https://github.com/NVIDIA/k8s-device-plugin)) @@ -115,14 +115,14 @@ $ kubectl apply -f gpu-feature-discovery-job.yaml ``` **Note:** This method should only be used for testing and not deployed in a -productions setting. +production setting. ### Verifying Everything Works With both NFD and GFD deployed and running, you should now be able to see GPU related labels appearing on any nodes that have GPUs installed on them. -``` +```shell $ kubectl get nodes -o yaml apiVersion: v1 items: @@ -147,13 +147,13 @@ items: nvidia.com/gpu.product: A100-SXM4-40GB ... ... - ``` ## The GFD Command line interface Available options: -``` + +```shell gpu-feature-discovery: Usage: gpu-feature-discovery [--fail-on-init-error=] [--mig-strategy=] [--oneshot | --sleep-interval=] [--no-timestamp] [--output-file= | -o ] @@ -173,7 +173,6 @@ Options: Arguments: : none | single | mixed - ``` You can also use environment variables: @@ -291,7 +290,7 @@ the `gpu-feature-discovery` component in standalone mode. 
The most basic installation command without any options is then: -``` +```shell $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ --version 0.15.0 \ --namespace gpu-feature-discovery \ @@ -340,7 +339,7 @@ Download the source code: git clone https://github.com/NVIDIA/k8s-device-plugin ``` -Get dependies: +Get dependencies: ```shell make vendor @@ -348,11 +347,12 @@ make vendor Build it: -``` +```shell make build ``` Run it: -``` + +```shell ./gpu-feature-discovery --output=$(pwd)/gfd ``` From 9e1c534cd305dd5caf13310ba92e67db2877f641 Mon Sep 17 00:00:00 2001 From: Chip Zoller Date: Sun, 25 Aug 2024 08:19:55 -0400 Subject: [PATCH 02/19] table Signed-off-by: Chip Zoller --- README.md | 17 ++++++++++++++++- docs/gpu-feature-discovery/README.md | 8 ++++---- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 05155b7e7..131946830 100644 --- a/README.md +++ b/README.md @@ -30,6 +30,7 @@ - [Deploying with gpu-feature-discovery for automatic node labels](#deploying-with-gpu-feature-discovery-for-automatic-node-labels) - [Deploying gpu-feature-discovery in standalone mode](#deploying-gpu-feature-discovery-in-standalone-mode) - [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package) +- [Catalog of Labels](#catalog-of-labels) - [Building and Running Locally](#building-and-running-locally) - [With Docker](#with-docker) - [Build](#build) @@ -570,6 +571,20 @@ total memory and compute resources of the GPU. **Note**: As of now, the only supported resource available for MPS are `nvidia.com/gpu` resources and only with full GPUs. +## Catalog of Labels + +The NVIDIA device plugin reads and writes a number of different labels which it uses as either +configuration elements or informational elements. The below table documents and describes each +along with their use. See the related table [here](/docs/gpu-feature-discovery/README.md#generated-labels) for GFD labels. + +| Label Name | Value Type | Meaning | Example | +| ----------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | +| nvidia.com/device-plugin.config | String | The name of the configuration to apply to the node. | my-mps-config | +| nvidia.com/gpu.sharing-strategy | String | The sharing strategy in use. Will be set to `none` by default. | time-slicing | +| nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false | +| nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false | +| nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false | + ## Deployment via `helm` The preferred method to deploy the device plugin is as a daemonset using `helm`. @@ -881,7 +896,7 @@ That is, the `SHARED` annotation ensures that a `nodeSelector` can be used to attract pods to nodes that have shared GPUs on them. Since having `renameByDefault=true` already encodes the fact that the resource is -shared on the resource name , there is no need to annotate the product +shared on the resource name, there is no need to annotate the product name with `SHARED`. Users can already find the shared resources they need by simply requesting it in their pod spec. 
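A minimal sketch of such a request, assuming `renameByDefault=true` so that the shared resource is advertised as `nvidia.com/gpu.shared` (the pod name is a placeholder; the image is the CUDA sample used elsewhere in this document):

```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu.shared: 1 # one replica of the shared GPU resource
EOF
```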
diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 75df7851c..cd5d6af62 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -190,8 +190,7 @@ Environment variables override the command line options if they conflict. ## Generated Labels -This is the list of the labels generated by NVIDIA GPU Feature Discovery and -their meaning: +Below is the list of the labels generated by NVIDIA GPU Feature Discovery and their meaning. For a similar list of labels generated or used by the device plugin, see [here](/README.md#catalog-of-labels). | Label Name | Value Type | Meaning | Example | | -------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | @@ -206,8 +205,9 @@ their meaning: | nvidia.com/gpu.count | Integer | Number of GPUs | 2 | | nvidia.com/gpu.family | String | Architecture family of the GPU | kepler | | nvidia.com/gpu.machine | String | Machine type | DGX-1 | -| nvidia.com/gpu.memory | Integer | Memory of the GPU in Mb | 2048 | -| nvidia.com/gpu.product | String | Model of the GPU | GeForce-GT-710 | +| nvidia.com/gpu.memory | Integer | Memory of the GPU in megabytes (MB) | 2048 | +| nvidia.com/gpu.product | String | Model of the GPU. May be modified by the device plugin if a sharing strategy is employed. | GeForce-GT-710 | +| nvidia.com/gpu.replicas | String | Number of GPU replicas available. Will be equal to the number of physical GPUs unless some sharing strategy is employed in which case the GPU count will be multiplied by replicas. | 4 | | nvidia.com/gpu.mode | String | Display or Compute Mode of the GPU. Details of the GPU modes can be found [here](https://docs.nvidia.com/grid/13.0/grid-gpumodeswitch-user-guide/index.html#compute-and-graphics-mode) | compute | Depending on the MIG strategy used, the following set of labels may also be From ba598c5a5c2d08f614ec4b4033275e1e047f8eb1 Mon Sep 17 00:00:00 2001 From: Chip Zoller Date: Sun, 25 Aug 2024 08:47:34 -0400 Subject: [PATCH 03/19] format Signed-off-by: Chip Zoller --- README.md | 11 +++++++---- docs/gpu-feature-discovery/README.md | 8 ++++++-- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 131946830..0db77dfd0 100644 --- a/README.md +++ b/README.md @@ -577,10 +577,13 @@ The NVIDIA device plugin reads and writes a number of different labels which it configuration elements or informational elements. The below table documents and describes each along with their use. See the related table [here](/docs/gpu-feature-discovery/README.md#generated-labels) for GFD labels. +> [!NOTE] +> Label values in Kubernetes are always of type string. The table's value type describes the type within string formatting. + | Label Name | Value Type | Meaning | Example | | ----------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | -| nvidia.com/device-plugin.config | String | The name of the configuration to apply to the node. | my-mps-config | -| nvidia.com/gpu.sharing-strategy | String | The sharing strategy in use. Will be set to `none` by default. 
| time-slicing | +| nvidia.com/device-plugin.config | String | The name of the configuration to apply to the node. Not automatically written by the device plugin; must be manually assigned by the user with the specified config. See [here](#updating-per-node-configuration-with-a-node-label) for details. | my-mps-config | +| nvidia.com/gpu.sharing-strategy | String | The sharing strategy in use. Will be set to `none` if not sharing a GPU. Additional values are `mps` and `time-slicing`. | time-slicing | | nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false | | nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false | | nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false | @@ -877,8 +880,8 @@ your cluster and do not wish for it to be pulled in by this installation, you can disable it with `nfd.enabled=false`. In addition to the standard node labels applied by GFD, the following label -will also be included when deploying the plugin with the time-slicing extensions -described [above](#shared-access-to-gpus-with-cuda-time-slicing). +will also be included when deploying the plugin with the time-slicing or MPS extensions +described [above](#shared-access-to-gpus). ``` nvidia.com/.replicas = diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index cd5d6af62..88d172448 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -190,7 +190,11 @@ Environment variables override the command line options if they conflict. ## Generated Labels -Below is the list of the labels generated by NVIDIA GPU Feature Discovery and their meaning. For a similar list of labels generated or used by the device plugin, see [here](/README.md#catalog-of-labels). +Below is the list of the labels generated by NVIDIA GPU Feature Discovery and their meaning. +For a similar list of labels generated or used by the device plugin, see [here](/README.md#catalog-of-labels). + +> [!NOTE] +> Label values in Kubernetes are always of type string. The table's value type describes the type within string formatting. | Label Name | Value Type | Meaning | Example | | -------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | @@ -206,7 +210,7 @@ Below is the list of the labels generated by NVIDIA GPU Feature Discovery and th | nvidia.com/gpu.family | String | Architecture family of the GPU | kepler | | nvidia.com/gpu.machine | String | Machine type | DGX-1 | | nvidia.com/gpu.memory | Integer | Memory of the GPU in megabytes (MB) | 2048 | -| nvidia.com/gpu.product | String | Model of the GPU. May be modified by the device plugin if a sharing strategy is employed. | GeForce-GT-710 | +| nvidia.com/gpu.product | String | Model of the GPU. May be modified by the device plugin if a sharing strategy is employed depending on the config. | GeForce-GT-710 | | nvidia.com/gpu.replicas | String | Number of GPU replicas available. Will be equal to the number of physical GPUs unless some sharing strategy is employed in which case the GPU count will be multiplied by replicas. | 4 | | nvidia.com/gpu.mode | String | Display or Compute Mode of the GPU. 
Details of the GPU modes can be found [here](https://docs.nvidia.com/grid/13.0/grid-gpumodeswitch-user-guide/index.html#compute-and-graphics-mode) | compute | From a9b041e72ed1b86b770c2d2e0d2cb71223dc7a5e Mon Sep 17 00:00:00 2001 From: Chip Zoller Date: Sun, 25 Aug 2024 08:57:21 -0400 Subject: [PATCH 04/19] update Signed-off-by: Chip Zoller --- docs/gpu-feature-discovery/README.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 88d172448..39bde79cd 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -198,11 +198,18 @@ For a similar list of labels generated or used by the device plugin, see [here]( | Label Name | Value Type | Meaning | Example | | -------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | -| nvidia.com/cuda.driver.major | Integer | Major of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver.minor | Integer | Minor of the version of NVIDIA driver | 30 | -| nvidia.com/cuda.driver.rev | Integer | Revision of the version of NVIDIA driver | 40 | -| nvidia.com/cuda.runtime.major | Integer | Major of the version of CUDA | 10 | -| nvidia.com/cuda.runtime.minor | Integer | Minor of the version of CUDA | 1 | +| nvidia.com/cuda.driver.major | Integer | (Deprecated) Major of the version of NVIDIA driver | 418 | +| nvidia.com/cuda.driver.minor | Integer | (Deprecated) Minor of the version of NVIDIA driver | 30 | +| nvidia.com/cuda.driver.rev | Integer | (Deprecated) Revision of the version of NVIDIA driver | 40 | +| nvidia.com/cuda.driver-version.major | Integer | Major of the version of NVIDIA driver | 418 | +| nvidia.com/cuda.driver-version.minor | Integer | Minor of the version of NVIDIA driver | 418 | +| nvidia.com/cuda.driver-version.revision | Integer | Revision of the version of NVIDIA driver | 418 | +| nvidia.com/cuda.driver-version.full | Integer | Full of the version of NVIDIA driver | 418 | +| nvidia.com/cuda.runtime.major | Integer | (Deprecated) Major of the version of CUDA | 10 | +| nvidia.com/cuda.runtime.minor | Integer | (Deprecated) Minor of the version of CUDA | 1 | +| nvidia.com/cuda.runtime-version.major | Integer | Major of the version of CUDA | 418 | +| nvidia.com/cuda.runtime-version.minor | Integer | Minor of the version of CUDA | 418 | +| nvidia.com/cuda.runtime-version.full | Integer | Full of the version of CUDA | 418 | | nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1555019244 | | nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 3 | | nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 3 | From c6461b320b9ac3b6a33a95f8aa06120e08ef79c5 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 17:29:32 -0400 Subject: [PATCH 05/19] add more vgpu Signed-off-by: chipzoller --- README.md | 2 ++ docs/gpu-feature-discovery/README.md | 4 ++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 0db77dfd0..f435a29ce 100644 --- a/README.md +++ b/README.md @@ -587,6 +587,8 @@ along with their use. See the related table [here](/docs/gpu-feature-discovery/R | nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. 
| false | | nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false | | nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false | +| nvidia.com/vgpu.host-driver-version | String | Version of the vGPU host driver in use on the underlying hypervisor. | 10.11.12 | +| nvidia.com/vgpu.host-driver-branch | String | Branch of the vGPU host driver in use on the underlying hypervisor. | main | ## Deployment via `helm` diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 39bde79cd..390765abe 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -204,12 +204,12 @@ For a similar list of labels generated or used by the device plugin, see [here]( | nvidia.com/cuda.driver-version.major | Integer | Major of the version of NVIDIA driver | 418 | | nvidia.com/cuda.driver-version.minor | Integer | Minor of the version of NVIDIA driver | 418 | | nvidia.com/cuda.driver-version.revision | Integer | Revision of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver-version.full | Integer | Full of the version of NVIDIA driver | 418 | +| nvidia.com/cuda.driver-version.full | Integer | Full version number of NVIDIA driver | 418 | | nvidia.com/cuda.runtime.major | Integer | (Deprecated) Major of the version of CUDA | 10 | | nvidia.com/cuda.runtime.minor | Integer | (Deprecated) Minor of the version of CUDA | 1 | | nvidia.com/cuda.runtime-version.major | Integer | Major of the version of CUDA | 418 | | nvidia.com/cuda.runtime-version.minor | Integer | Minor of the version of CUDA | 418 | -| nvidia.com/cuda.runtime-version.full | Integer | Full of the version of CUDA | 418 | +| nvidia.com/cuda.runtime-version.full | Integer | Full version number of CUDA | 418 | | nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1555019244 | | nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 3 | | nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 3 | From 82fb24c989c6545f77e195fef21749dde37569a2 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 17:42:08 -0400 Subject: [PATCH 06/19] correct GFD version statement Signed-off-by: chipzoller --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f435a29ce..c09057e68 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,7 @@ The NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automa - Run GPU enabled containers in your Kubernetes cluster. This repository contains NVIDIA's official implementation of the [Kubernetes device plugin](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). -As of v0.16.1 this repository also holds the implementation for GPU Feature Discovery labels, +As of v0.15.0 this repository also holds the implementation for GPU Feature Discovery labels, for further information on GPU Feature Discovery see [here](docs/gpu-feature-discovery/README.md). 
Please note that: From df6165b6bf03467d39047dde08b43292c0cbfadc Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 17:43:49 -0400 Subject: [PATCH 07/19] lint/fix headings Signed-off-by: chipzoller --- README.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index c09057e68..cd0cfb3e5 100644 --- a/README.md +++ b/README.md @@ -98,6 +98,7 @@ If the `nvidia` runtime should be set as the default runtime (required for `dock must also be included in the commands above. If this is not done, a RuntimeClass needs to be defined. ##### Notes on `CRI-O` configuration + When running `kubernetes` with `CRI-O`, add the config file to set the `nvidia-container-runtime` as the default low-level OCI runtime under `/etc/crio/crio.conf.d/99-nvidia.conf`. This will take priority over the default @@ -224,6 +225,7 @@ All options inside the `plugin` section are specific to the plugin. All options outside of this section are shared. ### Configuration Option Details + **`MIG_STRATEGY`**: the desired strategy for exposing MIG devices on GPUs that support it @@ -638,7 +640,7 @@ if desired. Both methods are discussed in more detail below. The full set of values that can be set are found here: [here](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.1/deployments/helm/nvidia-device-plugin/values.yaml). -#### Passing configuration to the plugin via a `ConfigMap`. +#### Passing configuration to the plugin via a `ConfigMap` In general, we provide a mechanism to pass _multiple_ configuration files to to the plugin's `helm` chart, with the ability to choose which configuration @@ -656,7 +658,8 @@ In both cases, the value `config.default` can be set to point to one of the named configs in the `ConfigMap` and provide a default configuration for nodes that have not been customized via a node label (more on this later). -##### Single Config File Example +##### Single Config File Example + As an example, create a valid config file on your local filesystem, such as the following: ```shell @@ -704,7 +707,7 @@ $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ --set config.name=nvidia-plugin-configs ``` -##### Multiple Config File Example +##### Multiple Config File Example For multiple config files, the procedure is similar. @@ -988,6 +991,7 @@ easily be modified to work with any available tag or branch. 
### With Docker #### Build + Option 1, pull the prebuilt image from [Docker Hub](https://hub.docker.com/r/nvidia/k8s-device-plugin): ```shell $ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.1 @@ -1012,6 +1016,7 @@ $ docker build \ ``` #### Run + Without compatibility for the `CPUManager` static policy: ```shell $ docker run \ @@ -1036,11 +1041,13 @@ $ docker run \ ### Without Docker #### Build + ```shell $ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build ``` #### Run + Without compatibility for the `CPUManager` static policy: ```shell $ ./k8s-device-plugin @@ -1056,6 +1063,7 @@ $ ./k8s-device-plugin --pass-device-specs See the [changelog](CHANGELOG.md) ## Issues and Contributing + [Checkout the Contributing document!](CONTRIBUTING.md) * You can report a bug by [filing a new issue](https://github.com/NVIDIA/k8s-device-plugin/issues/new) From 2d6a96f42d119d5fe63d7f5d3a6e635e87637907 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 17:44:14 -0400 Subject: [PATCH 08/19] 0.16.1 => 0.16.2 Signed-off-by: chipzoller --- README.md | 48 ++++++++++++++++++++++++------------------------ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index cd0cfb3e5..14e91d7a4 100644 --- a/README.md +++ b/README.md @@ -136,7 +136,7 @@ Once you have configured the options above on all the GPU nodes in your cluster, you can enable GPU support by deploying the following Daemonset: ```shell -$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml +$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -604,11 +604,11 @@ $ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin $ helm repo update ``` -Then verify that the latest release (`v0.16.1`) of the plugin is available: +Then verify that the latest release (`v0.16.2`) of the plugin is available: ``` $ helm search repo nvdp --devel NAME CHART VERSION APP VERSION DESCRIPTION -nvdp/nvidia-device-plugin 0.16.1 0.16.1 A Helm chart for ... +nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... ``` Once this repo is updated, you can begin installing packages from it to deploy @@ -619,7 +619,7 @@ The most basic installation command without any options is then: helm upgrade -i nvdp nvdp/nvidia-device-plugin \ --namespace nvidia-device-plugin \ --create-namespace \ - --version 0.16.1 + --version 0.16.2 ``` **Note:** You only need the to pass the `--devel` flag to `helm search repo` @@ -628,7 +628,7 @@ version (e.g. `-rc.1`). Full releases will be listed without this. ### Configuring the device plugin's `helm` chart -The `helm` chart for the latest release of the plugin (`v0.16.1`) includes +The `helm` chart for the latest release of the plugin (`v0.16.2`) includes a number of customizable values. Prior to `v0.12.0` the most commonly used values were those that had direct @@ -638,7 +638,7 @@ case of the original values is then to override an option from the `ConfigMap` if desired. Both methods are discussed in more detail below. The full set of values that can be set are found here: -[here](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.1/deployments/helm/nvidia-device-plugin/values.yaml). +[here](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.2/deployments/helm/nvidia-device-plugin/values.yaml). 
#### Passing configuration to the plugin via a `ConfigMap` @@ -679,7 +679,7 @@ EOF And deploy the device plugin via helm (pointing it at this config file and giving it a name): ``` $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set-file config.map.config=/tmp/dp-example-config0.yaml @@ -701,7 +701,7 @@ $ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ ``` ``` $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set config.name=nvidia-plugin-configs @@ -730,7 +730,7 @@ EOF And redeploy the device plugin via helm (pointing it at both configs with a specified default). ``` $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set config.default=config0 \ @@ -749,7 +749,7 @@ $ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ ``` ``` $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set config.default=config0 \ @@ -833,7 +833,7 @@ runtimeClassName: ``` Please take a look in the -[`values.yaml`](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.1/deployments/helm/nvidia-device-plugin/values.yaml) +[`values.yaml`](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.2/deployments/helm/nvidia-device-plugin/values.yaml) file to see the full set of overridable parameters for the device plugin. Examples of setting these options include: @@ -842,7 +842,7 @@ Enabling compatibility with the `CPUManager` and running with a request for 100ms of CPU time and a limit of 512MB of memory. ```shell $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set compatWithCPUManager=true \ @@ -853,7 +853,7 @@ $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ Enabling compatibility with the `CPUManager` and the `mixed` `migStrategy` ```shell $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set compatWithCPUManager=true \ @@ -872,7 +872,7 @@ Discovery to perform this labeling. To enable it, simply set `gfd.enabled=true` during helm install. 
```shell helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --namespace nvidia-device-plugin \ --create-namespace \ --set gfd.enabled=true @@ -917,7 +917,7 @@ nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED #### Deploying gpu-feature-discovery in standalone mode -As of v0.16.1, the device plugin's helm chart has integrated support to deploy +As of v0.16.2, the device plugin's helm chart has integrated support to deploy [`gpu-feature-discovery`](https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/tree/main) When gpu-feature-discovery in deploying standalone, begin by setting up the @@ -928,13 +928,13 @@ $ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin $ helm repo update ``` -Then verify that the latest release (`v0.16.1`) of the plugin is available +Then verify that the latest release (`v0.16.2`) of the plugin is available (Note that this includes the GFD chart): ```shell $ helm search repo nvdp --devel NAME CHART VERSION APP VERSION DESCRIPTION -nvdp/nvidia-device-plugin 0.16.1 0.16.1 A Helm chart for ... +nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... ``` Once this repo is updated, you can begin installing packages from it to deploy @@ -944,7 +944,7 @@ The most basic installation command without any options is then: ```shell $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version 0.16.1 \ + --version 0.16.2 \ --namespace gpu-feature-discovery \ --create-namespace \ --set devicePlugin.enabled=false @@ -955,7 +955,7 @@ the default namespace. ```shell $ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ + --version=0.16.2 \ --set allowDefaultNamespace=true \ --set nfd.enabled=false \ --set migStrategy=mixed \ @@ -978,14 +978,14 @@ Using the default values for the flags: $ helm upgrade -i nvdp \ --namespace nvidia-device-plugin \ --create-namespace \ - https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.1.tgz + https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.2.tgz ``` ## Building and Running Locally The next sections are focused on building the device plugin locally and running it. It is intended purely for development and testing, and not required by most users. -It assumes you are pinning to the latest release tag (i.e. `v0.16.1`), but can +It assumes you are pinning to the latest release tag (i.e. `v0.16.2`), but can easily be modified to work with any available tag or branch. ### With Docker @@ -994,8 +994,8 @@ easily be modified to work with any available tag or branch. 
Option 1, pull the prebuilt image from [Docker Hub](https://hub.docker.com/r/nvidia/k8s-device-plugin): ```shell -$ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.1 -$ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.16.1 nvcr.io/nvidia/k8s-device-plugin:devel +$ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.2 +$ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.16.2 nvcr.io/nvidia/k8s-device-plugin:devel ``` Option 2, build without cloning the repository: @@ -1003,7 +1003,7 @@ Option 2, build without cloning the repository: $ docker build \ -t nvcr.io/nvidia/k8s-device-plugin:devel \ -f deployments/container/Dockerfile.ubuntu \ - https://github.com/NVIDIA/k8s-device-plugin.git#v0.16.1 + https://github.com/NVIDIA/k8s-device-plugin.git#v0.16.2 ``` Option 3, if you want to modify the code: From 732d3962b0e5b4dc1e6072e5478dd285e712a6c0 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 17:46:58 -0400 Subject: [PATCH 09/19] update ToC Signed-off-by: chipzoller --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 14e91d7a4..4a32c8db2 100644 --- a/README.md +++ b/README.md @@ -20,9 +20,10 @@ - [Shared Access to GPUs](#shared-access-to-gpus) - [With CUDA Time-Slicing](#with-cuda-time-slicing) - [With CUDA MPS](#with-cuda-mps) +- [Catalog of Labels](#catalog-of-labels) - [Deployment via `helm`](#deployment-via-helm) - [Configuring the device plugin's `helm` chart](#configuring-the-device-plugins-helm-chart) - - [Passing configuration to the plugin via a `ConfigMap`.](#passing-configuration-to-the-plugin-via-a-configmap) + - [Passing configuration to the plugin via a `ConfigMap`](#passing-configuration-to-the-plugin-via-a-configmap) - [Single Config File Example](#single-config-file-example) - [Multiple Config File Example](#multiple-config-file-example) - [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label) @@ -30,7 +31,6 @@ - [Deploying with gpu-feature-discovery for automatic node labels](#deploying-with-gpu-feature-discovery-for-automatic-node-labels) - [Deploying gpu-feature-discovery in standalone mode](#deploying-gpu-feature-discovery-in-standalone-mode) - [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package) -- [Catalog of Labels](#catalog-of-labels) - [Building and Running Locally](#building-and-running-locally) - [With Docker](#with-docker) - [Build](#build) From 4812e4a83f4870b093c7b7c55d5c1edcf6ed1b38 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 19:25:03 -0400 Subject: [PATCH 10/19] linting Signed-off-by: chipzoller --- README.md | 298 +++++++++++++++++++++++++++++++----------------------- 1 file changed, 169 insertions(+), 129 deletions(-) diff --git a/README.md b/README.md index 4a32c8db2..a88c3e73c 100644 --- a/README.md +++ b/README.md @@ -46,6 +46,7 @@ ## About The NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically: + - Expose the number of GPUs on each nodes of your cluster - Keep track of the health of your GPUs - Run GPU enabled containers in your Kubernetes cluster. @@ -55,20 +56,22 @@ As of v0.15.0 this repository also holds the implementation for GPU Feature Disc for further information on GPU Feature Discovery see [here](docs/gpu-feature-discovery/README.md). Please note that: + - The NVIDIA device plugin API is beta as of Kubernetes v1.10. 
- The NVIDIA device plugin is currently lacking - - Comprehensive GPU health checking features - - GPU cleanup features + - Comprehensive GPU health checking features + - GPU cleanup features - Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin). ## Prerequisites The list of prerequisites for running the NVIDIA device plugin is described below: -* NVIDIA drivers ~= 384.81 -* nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems) -* nvidia-container-runtime configured as the default low-level runtime -* Kubernetes version >= 1.10 + +- NVIDIA drivers ~= 384.81 +- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems) +- nvidia-container-runtime configured as the default low-level runtime +- Kubernetes version >= 1.10 ## Quick Start @@ -86,11 +89,11 @@ Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/in For instructions on installing and getting started with the NVIDIA Container Toolkit, refer to the [installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installation-guide). - Also note the configuration instructions for: -* [`containerd`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd-for-kubernetes) -* [`CRI-O`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o) -* [`docker` (Deprecated)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) + +- [`containerd`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd-for-kubernetes) +- [`CRI-O`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o) +- [`docker` (Deprecated)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) Remembering to restart each runtime after applying the configuration changes. @@ -103,7 +106,8 @@ When running `kubernetes` with `CRI-O`, add the config file to set the `nvidia-container-runtime` as the default low-level OCI runtime under `/etc/crio/crio.conf.d/99-nvidia.conf`. 
This will take priority over the default `crun` config file at `/etc/crio/crio.conf.d/10-crun.conf`: -``` + +```conf [crio] [crio.runtime] @@ -115,19 +119,25 @@ When running `kubernetes` with `CRI-O`, add the config file to set the runtime_path = "/usr/bin/nvidia-container-runtime" runtime_type = "oci" ``` + As stated in the linked documentation, this file can automatically be generated with the nvidia-ctk command: + +```shell +sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf ``` -$ sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf -``` + `CRI-O` uses `crun` as default low-level OCI runtime so `crun` needs to be added to the runtimes of the `nvidia-container-runtime` in the config file at `/etc/nvidia-container-runtime/config.toml`: + ``` [nvidia-container-runtime] runtimes = ["crun", "docker-runc", "runc"] ``` + And then restart `CRI-O`: + ``` -$ sudo systemctl restart crio +sudo systemctl restart crio ``` ### Enabling GPU Support in Kubernetes @@ -136,7 +146,7 @@ Once you have configured the options above on all the GPU nodes in your cluster, you can enable GPU support by deploying the following Daemonset: ```shell -$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml +kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -170,7 +180,7 @@ spec: EOF ``` -``` +```shell $ kubectl logs gpu-pod [Vector addition of 50000 elements] Copy input data from the host memory to the CUDA device @@ -293,18 +303,18 @@ options outside of this section are shared. The `DEVICE_LIST_STRATEGY` flag allows one to choose which strategy the plugin will use to advertise the list of GPUs allocated to a container. Possible values are: - * `envvar` (default): the `NVIDIA_VISIBLE_DEVICES` environment variable + - `envvar` (default): the `NVIDIA_VISIBLE_DEVICES` environment variable as described [here](https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices) is used to select the devices that are to be injected by the NVIDIA Container Runtime. - * `volume-mounts`: the list of devices is passed as a set of volume mounts instead of as an environment variable + - `volume-mounts`: the list of devices is passed as a set of volume mounts instead of as an environment variable to instruct the NVIDIA Container Runtime to inject the devices. Details for the rationale behind this strategy can be found [here](https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#heading=h.b3ti65rojfy5). - * `cdi-annotations`: CDI annotations are used to select the devices that are to be injected. + - `cdi-annotations`: CDI annotations are used to select the devices that are to be injected. Note that this does not require the NVIDIA Container Runtime, but does required a CDI-enabled container engine. - * `cdi-cri`: the `CDIDevices` CRI field is used to select the CDI devices that are to be injected. + - `cdi-cri`: the `CDIDevices` CRI field is used to select the CDI devices that are to be injected. This requires support in Kubernetes to forward these requests in the CRI to a CDI-enabled container engine. 
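As an illustrative sketch only (the Helm value name `deviceListStrategy` is assumed from the option name above and may differ in your chart version), selecting the `volume-mounts` strategy at install time might look like:

```shell
# Sketch: assumes the chart exposes this option as the value `deviceListStrategy`
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set deviceListStrategy=volume-mounts
```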
**`DEVICE_ID_STRATEGY`**: @@ -399,7 +409,7 @@ sharing: If this configuration were applied to a node with 8 GPUs on it, the plugin would now advertise 80 `nvidia.com/gpu` resources to Kubernetes instead of 8. -``` +```shell $ kubectl describe node ... Capacity: @@ -422,7 +432,7 @@ sharing: ... ``` -``` +```shell $ kubectl describe node ... Capacity: @@ -438,7 +448,7 @@ configurations and a user requested more than one `nvidia.com/gpu` or `nvidia.com/gpu.shared` resource in their pod spec, then the container would fail with the resulting error: -``` +```shell $ kubectl describe pod gpu-pod ... Events: @@ -465,11 +475,13 @@ As of now, the only supported resource available for time-slicing are configuring a node with the mixed MIG strategy. For example, the full set of time-sliceable resources on a T4 card would be: + ``` nvidia.com/gpu ``` And the full set of time-sliceable resources on an A100 40GB card would be: + ``` nvidia.com/gpu nvidia.com/mig-1g.5gb @@ -479,6 +491,7 @@ nvidia.com/mig-7g.40gb ``` Likewise, on an A100 80GB card, they would be: + ``` nvidia.com/gpu nvidia.com/mig-1g.10gb @@ -535,7 +548,7 @@ sharing: If this configuration were applied to a node with 8 GPUs on it, the plugin would now advertise 80 `nvidia.com/gpu` resources to Kubernetes instead of 8. -``` +```shell $ kubectl describe node ... Capacity: @@ -558,7 +571,7 @@ sharing: ... ``` -``` +```shell $ kubectl describe node ... Capacity: @@ -599,13 +612,15 @@ Instructions for installing `helm` can be found [here](https://helm.sh/docs/intro/install/). Begin by setting up the plugin's `helm` repository and updating it at follows: + ```shell -$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin -$ helm repo update +helm repo add nvdp https://nvidia.github.io/k8s-device-plugin +helm repo update ``` Then verify that the latest release (`v0.16.2`) of the plugin is available: -``` + +```shell $ helm search repo nvdp --devel NAME CHART VERSION APP VERSION DESCRIPTION nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... @@ -615,7 +630,8 @@ Once this repo is updated, you can begin installing packages from it to deploy the `nvidia-device-plugin` helm chart. The most basic installation command without any options is then: -``` + +```shell helm upgrade -i nvdp nvdp/nvidia-device-plugin \ --namespace nvidia-device-plugin \ --create-namespace \ @@ -650,6 +666,7 @@ In this way, a single chart can be used to deploy each component, but custom configurations can be applied to different nodes throughout the cluster. There are two ways to provide a `ConfigMap` for use by the plugin: + 1. Via an external reference to a pre-defined `ConfigMap` 1. As a set of named config files to build an integrated `ConfigMap` associated with the chart @@ -677,12 +694,13 @@ EOF ``` And deploy the device plugin via helm (pointing it at this config file and giving it a name): -``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set-file config.map.config=/tmp/dp-example-config0.yaml + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set-file config.map.config=/tmp/dp-example-config0.yaml ``` Under the hood this will deploy a `ConfigMap` associated with the plugin and put @@ -692,19 +710,22 @@ applied when the plugin comes online. 
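To confirm that the chart-managed `ConfigMap` exists after the install (an illustrative check; the exact name is generated by the chart):

```shell
kubectl get configmaps -n nvidia-device-plugin
```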
If you don’t want the plugin’s helm chart to create the `ConfigMap` for you, you can also point it at a pre-created `ConfigMap` as follows: + +```shell +kubectl create ns nvidia-device-plugin ``` -$ kubectl create ns nvidia-device-plugin -``` -``` -$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ - --from-file=config=/tmp/dp-example-config0.yaml -``` + +```shell +kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ + --from-file=config=/tmp/dp-example-config0.yaml ``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set config.name=nvidia-plugin-configs + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set config.name=nvidia-plugin-configs ``` ##### Multiple Config File Example @@ -728,32 +749,36 @@ EOF ``` And redeploy the device plugin via helm (pointing it at both configs with a specified default). -``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set config.default=config0 \ - --set-file config.map.config0=/tmp/dp-example-config0.yaml \ - --set-file config.map.config1=/tmp/dp-example-config1.yaml + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set config.default=config0 \ + --set-file config.map.config0=/tmp/dp-example-config0.yaml \ + --set-file config.map.config1=/tmp/dp-example-config1.yaml ``` As before, this can also be done with a pre-created `ConfigMap` if desired: + +```shell +kubectl create ns nvidia-device-plugin ``` -$ kubectl create ns nvidia-device-plugin -``` -``` -$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ - --from-file=config0=/tmp/dp-example-config0.yaml \ - --from-file=config1=/tmp/dp-example-config1.yaml -``` + +```shell +kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ + --from-file=config0=/tmp/dp-example-config0.yaml \ + --from-file=config1=/tmp/dp-example-config1.yaml ``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set config.default=config0 \ - --set config.name=nvidia-plugin-configs + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set config.default=config0 \ + --set config.name=nvidia-plugin-configs ``` **Note:** If the `config.default` flag is not explicitly set, then a default @@ -767,18 +792,20 @@ provided, it will be chosen as the default because there is no other option. With this setup, plugins on all nodes will have `config0` configured for them by default. 
However, the following label can be set to change which configuration is applied: -``` + +```shell kubectl label nodes –-overwrite \ - nvidia.com/device-plugin.config= + nvidia.com/device-plugin.config= ``` For example, applying a custom config for all nodes that have T4 GPUs installed on them might be: -``` + +```shell kubectl label node \ - --overwrite \ - --selector=nvidia.com/gpu.product=TESLA-T4 \ - nvidia.com/device-plugin.config=t4-config + --overwrite \ + --selector=nvidia.com/gpu.product=TESLA-T4 \ + nvidia.com/device-plugin.config=t4-config ``` **Note:** This label can be applied either _before_ or _after_ the plugin is @@ -840,24 +867,26 @@ Examples of setting these options include: Enabling compatibility with the `CPUManager` and running with a request for 100ms of CPU time and a limit of 512MB of memory. + ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set compatWithCPUManager=true \ - --set resources.requests.cpu=100m \ - --set resources.limits.memory=512Mi +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set compatWithCPUManager=true \ + --set resources.requests.cpu=100m \ + --set resources.limits.memory=512Mi ``` -Enabling compatibility with the `CPUManager` and the `mixed` `migStrategy` +Enabling compatibility with the `CPUManager` and the `mixed` `migStrategy`. + ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set compatWithCPUManager=true \ - --set migStrategy=mixed +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set compatWithCPUManager=true \ + --set migStrategy=mixed ``` #### Deploying with gpu-feature-discovery for automatic node labels @@ -870,12 +899,13 @@ set of GPUs available on a node. Under the hood, it leverages Node Feature Discovery to perform this labeling. To enable it, simply set `gfd.enabled=true` during helm install. + ```shell helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.2 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set gfd.enabled=true + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set gfd.enabled=true ``` Under the hood this will also deploy @@ -894,6 +924,7 @@ nvidia.com/.replicas = Additionally, the `nvidia.com/.product` will be modified as follows if `renameByDefault=false`. + ``` nvidia.com/.product = -SHARED ``` @@ -911,6 +942,7 @@ simply requesting it in their pod spec. 
Note: When running with `renameByDefault=false` and `migStrategy=single` both the MIG profile name and the new `SHARED` annotation will be appended to the product name, e.g.: + ``` nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED ``` @@ -924,15 +956,15 @@ When gpu-feature-discovery in deploying standalone, begin by setting up the plugin's `helm` repository and updating it at follows: ```shell -$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin -$ helm repo update +helm repo add nvdp https://nvidia.github.io/k8s-device-plugin +helm repo update ``` Then verify that the latest release (`v0.16.2`) of the plugin is available (Note that this includes the GFD chart): ```shell -$ helm search repo nvdp --devel +helm search repo nvdp --devel NAME CHART VERSION APP VERSION DESCRIPTION nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... ``` @@ -943,7 +975,7 @@ the `gpu-feature-discovery` component in standalone mode. The most basic installation command without any options is then: ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ --version 0.16.2 \ --namespace gpu-feature-discovery \ --create-namespace \ @@ -954,7 +986,7 @@ Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in the default namespace. ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ --version=0.16.2 \ --set allowDefaultNamespace=true \ --set nfd.enabled=false \ @@ -974,11 +1006,12 @@ The example below installs the same chart as the method above, except that it uses a direct URL to the `helm` chart instead of via the `helm` repo. Using the default values for the flags: + ```shell -$ helm upgrade -i nvdp \ - --namespace nvidia-device-plugin \ - --create-namespace \ - https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.2.tgz +helm upgrade -i nvdp \ + --namespace nvidia-device-plugin \ + --create-namespace \ + https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.2.tgz ``` ## Building and Running Locally @@ -993,49 +1026,54 @@ easily be modified to work with any available tag or branch. #### Build Option 1, pull the prebuilt image from [Docker Hub](https://hub.docker.com/r/nvidia/k8s-device-plugin): + ```shell -$ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.2 -$ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.16.2 nvcr.io/nvidia/k8s-device-plugin:devel +docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.2 +docker tag nvcr.io/nvidia/k8s-device-plugin:v0.16.2 nvcr.io/nvidia/k8s-device-plugin:devel ``` Option 2, build without cloning the repository: + ```shell -$ docker build \ - -t nvcr.io/nvidia/k8s-device-plugin:devel \ - -f deployments/container/Dockerfile.ubuntu \ - https://github.com/NVIDIA/k8s-device-plugin.git#v0.16.2 +docker build \ + -t nvcr.io/nvidia/k8s-device-plugin:devel \ + -f deployments/container/Dockerfile.ubuntu \ + https://github.com/NVIDIA/k8s-device-plugin.git#v0.16.2 ``` Option 3, if you want to modify the code: + ```shell -$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin -$ docker build \ - -t nvcr.io/nvidia/k8s-device-plugin:devel \ - -f deployments/container/Dockerfile.ubuntu \ - . +git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin +docker build \ + -t nvcr.io/nvidia/k8s-device-plugin:devel \ + -f deployments/container/Dockerfile.ubuntu \ + . 
``` #### Run Without compatibility for the `CPUManager` static policy: + ```shell -$ docker run \ - -it \ - --security-opt=no-new-privileges \ - --cap-drop=ALL \ - --network=none \ - -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ - nvcr.io/nvidia/k8s-device-plugin:devel +docker run \ + -it \ + --security-opt=no-new-privileges \ + --cap-drop=ALL \ + --network=none \ + -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ + nvcr.io/nvidia/k8s-device-plugin:devel ``` With compatibility for the `CPUManager` static policy: + ```shell -$ docker run \ - -it \ - --privileged \ - --network=none \ - -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ - nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs +docker run \ + -it \ + --privileged \ + --network=none \ + -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ + nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs ``` ### Without Docker @@ -1043,19 +1081,21 @@ $ docker run \ #### Build ```shell -$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build +C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build ``` #### Run Without compatibility for the `CPUManager` static policy: + ```shell -$ ./k8s-device-plugin +./k8s-device-plugin ``` With compatibility for the `CPUManager` static policy: + ```shell -$ ./k8s-device-plugin --pass-device-specs +./k8s-device-plugin --pass-device-specs ``` ## Changelog @@ -1066,8 +1106,8 @@ See the [changelog](CHANGELOG.md) [Checkout the Contributing document!](CONTRIBUTING.md) -* You can report a bug by [filing a new issue](https://github.com/NVIDIA/k8s-device-plugin/issues/new) -* You can contribute by opening a [pull request](https://help.github.com/articles/using-pull-requests/) +- You can report a bug by [filing a new issue](https://github.com/NVIDIA/k8s-device-plugin/issues/new) +- You can contribute by opening a [pull request](https://help.github.com/articles/using-pull-requests/) ### Versioning From 8d67d0d7f96a643b15d8790ef59bc741d5c9bb7d Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 19:25:50 -0400 Subject: [PATCH 11/19] no conf support Signed-off-by: chipzoller --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a88c3e73c..85bfb5798 100644 --- a/README.md +++ b/README.md @@ -107,7 +107,7 @@ When running `kubernetes` with `CRI-O`, add the config file to set the `/etc/crio/crio.conf.d/99-nvidia.conf`. This will take priority over the default `crun` config file at `/etc/crio/crio.conf.d/10-crun.conf`: -```conf +``` [crio] [crio.runtime] From 119bedcfe96f75ed49672b9b3bed128d7e5086f2 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 19:29:40 -0400 Subject: [PATCH 12/19] fencing Signed-off-by: chipzoller --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 85bfb5798..a0e073624 100644 --- a/README.md +++ b/README.md @@ -107,7 +107,7 @@ When running `kubernetes` with `CRI-O`, add the config file to set the `/etc/crio/crio.conf.d/99-nvidia.conf`. 
This will take priority over the default `crun` config file at `/etc/crio/crio.conf.d/10-crun.conf`: -``` +```toml [crio] [crio.runtime] @@ -129,14 +129,14 @@ sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/ `CRI-O` uses `crun` as default low-level OCI runtime so `crun` needs to be added to the runtimes of the `nvidia-container-runtime` in the config file at `/etc/nvidia-container-runtime/config.toml`: -``` +```toml [nvidia-container-runtime] runtimes = ["crun", "docker-runc", "runc"] ``` And then restart `CRI-O`: -``` +```shell sudo systemctl restart crio ``` From 38a967344a2ec701ff1ea08fb30193ae9591b4f3 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 19:30:58 -0400 Subject: [PATCH 13/19] use latest cuda-sample tag Signed-off-by: chipzoller --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a0e073624..291aef7bc 100644 --- a/README.md +++ b/README.md @@ -169,7 +169,7 @@ spec: restartPolicy: Never containers: - name: cuda-container - image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 + image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 resources: limits: nvidia.com/gpu: 1 # requesting 1 GPU From 1f1ff78b9340b9fec51137663ae0ab98f3194a8a Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 20:06:55 -0400 Subject: [PATCH 14/19] nice link to NFD Signed-off-by: chipzoller --- README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/README.md b/README.md index 291aef7bc..d345a1ed9 100644 --- a/README.md +++ b/README.md @@ -895,8 +895,7 @@ As of `v0.12.0`, the device plugin's helm chart has integrated support to deploy [`gpu-feature-discovery`](https://github.com/NVIDIA/gpu-feature-discovery) (GFD). You can use GFD to automatically generate labels for the -set of GPUs available on a node. Under the hood, it leverages Node Feature -Discovery to perform this labeling. +set of GPUs available on a node. Under the hood, it leverages [Node Feature Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html) to perform this labeling. To enable it, simply set `gfd.enabled=true` during helm install. 
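A minimal sketch of doing so, followed by one illustrative way to confirm that the generated labels have appeared on the nodes (any label-listing approach works; `jq` is assumed to be installed):

```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true

# One way to check that GPU labels have been applied to the nodes
kubectl get nodes -o json | jq '.items[].metadata.labels' | grep "nvidia.com/gpu"
```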
From ea66faf8136cfac6f031c4863cd8518e23815022 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 20:23:22 -0400 Subject: [PATCH 15/19] save Signed-off-by: chipzoller --- README.md | 4 +- docs/gpu-feature-discovery/README.md | 74 ++++++++++++++-------------- 2 files changed, 39 insertions(+), 39 deletions(-) diff --git a/README.md b/README.md index d345a1ed9..26602c9dd 100644 --- a/README.md +++ b/README.md @@ -948,8 +948,8 @@ nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED #### Deploying gpu-feature-discovery in standalone mode -As of v0.16.2, the device plugin's helm chart has integrated support to deploy -[`gpu-feature-discovery`](https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/tree/main) +As of v0.15.0, the device plugin's helm chart has integrated support to deploy +[`gpu-feature-discovery`](/docs/gpu-feature-discovery/README.md#overview) When gpu-feature-discovery in deploying standalone, begin by setting up the plugin's `helm` repository and updating it at follows: diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 390765abe..8ea4cedec 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -26,8 +26,7 @@ NVIDIA GPU Feature Discovery for Kubernetes is a software component that allows you to automatically generate labels for the set of GPUs available on a node. -It leverages the [Node Feature -Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) +It leverages the [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) to perform this labeling. ## Beta Version @@ -40,14 +39,15 @@ to ease the transition. The list of prerequisites for running the NVIDIA GPU Feature Discovery is described below: -* nvidia-docker version > 2.0 (see how to [install](https://github.com/NVIDIA/nvidia-docker) + +- nvidia-docker version > 2.0 (see how to [install](https://github.com/NVIDIA/nvidia-docker) and its [prerequisites](https://github.com/nvidia/nvidia-docker/wiki/Installation-\(version-2.0\)#prerequisites)) -* docker configured with nvidia as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime). -* Kubernetes version >= 1.10 -* NVIDIA device plugin for Kubernetes (see how to [setup](https://github.com/NVIDIA/k8s-device-plugin)) -* NFD deployed on each node you want to label with the local source configured - * When deploying GPU feature discovery with helm (as described below) we provide a way to automatically deploy NFD for you - * To deploy NFD yourself, please see https://github.com/kubernetes-sigs/node-feature-discovery +- docker configured with nvidia as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime). +- Kubernetes version >= 1.10 +- NVIDIA device plugin for Kubernetes (see how to [setup](https://github.com/NVIDIA/k8s-device-plugin)) +- NFD deployed on each node you want to label with the local source configured + - When deploying GPU feature discovery with helm (as described below) we provide a way to automatically deploy NFD for you + - To deploy NFD yourself, please see https://github.com/kubernetes-sigs/node-feature-discovery ## Quick Start @@ -62,7 +62,7 @@ is running on every node you want to label. NVIDIA GPU Feature Discovery use the `local` source so be sure to mount volumes. See https://github.com/kubernetes-sigs/node-feature-discovery for more details. 
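As a rough sketch of what mounting the `local` source volumes can look like (the directory below is an assumption based on NFD's default local feature directory and should be verified against the NFD documentation for your release), the nfd-worker pod and the GFD pod typically share a `hostPath` volume so that the label file GFD writes is visible to nfd-worker:

```yaml
# Illustrative fragment of a pod spec, not a complete manifest.
# Assumption: NFD's "local" source reads feature files from this directory,
# and GFD writes its generated label file into the same directory on the host.
volumeMounts:
  - name: output-dir
    mountPath: /etc/kubernetes/node-feature-discovery/features.d
volumes:
  - name: output-dir
    hostPath:
      path: /etc/kubernetes/node-feature-discovery/features.d
```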
-You also need to configure the `Node Feature Discovery` to only expose vendor +You also need to configure the Node Feature Discovery to only expose vendor IDs in the PCI source. To do so, please refer to the Node Feature Discovery documentation. @@ -70,7 +70,7 @@ The following command will deploy NFD with the minimum required set of parameters to run `gpu-feature-discovery`. ```shell -kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nfd.yaml +kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nfd.yaml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -94,7 +94,7 @@ or as a Job. #### Daemonset ```shell -kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/gpu-feature-discovery-daemonset.yaml +kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/gpu-feature-discovery-daemonset.yaml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -108,10 +108,10 @@ You must change the `NODE_NAME` value in the template to match the name of the node you want to label: ```shell -$ export NODE_NAME= -$ curl https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/gpu-feature-discovery-job.yaml.template \ +export NODE_NAME= +curl https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/gpu-feature-discovery-job.yaml.template \ | sed "s/NODE_NAME/${NODE_NAME}/" > gpu-feature-discovery-job.yaml -$ kubectl apply -f gpu-feature-discovery-job.yaml +kubectl apply -f gpu-feature-discovery-job.yaml ``` **Note:** This method should only be used for testing and not deployed in a @@ -251,6 +251,7 @@ is partitioned into 7 equal sized MIG devices (56 total). With this strategy, a separate set of labels for each MIG device type is generated. The name of each MIG device type is defines as follows: + ``` MIG_TYPE=mig-g..gb e.g. MIG_TYPE=mig-3g.20gb @@ -272,28 +273,27 @@ e.g. MIG_TYPE=mig-3g.20gb ## Deployment via `helm` -The preferred method to deploy `gpu-feature-discovery` is as a daemonset using `helm`. +The preferred method to deploy GFD is as a daemonset using `helm`. Instructions for installing `helm` can be found [here](https://helm.sh/docs/intro/install/). -As of `v0.15.0`, the device plugin's helm chart has integrated support to deploy -[`gpu-feature-discovery`](https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/tree/main) +As of `v0.15.0`, the device plugin's helm chart has integrated support to deploy GFD. -When gpu-feature-discovery in deploying standalone, begin by setting up the +When GFD is deployed standalone, begin by setting up the plugin's `helm` repository and updating it at follows: ```shell -$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin -$ helm repo update +helm repo add nvdp https://nvidia.github.io/k8s-device-plugin +helm repo update ``` -Then verify that the latest release (`v0.15.0`) of the plugin is available -(Note that this includes the GFD chart): +Then verify that the latest release of the plugin is available +(Note that this includes GFD ): ```shell $ helm search repo nvdp --devel NAME CHART VERSION APP VERSION DESCRIPTION -nvdp/nvidia-device-plugin 0.15.0 0.15.0 A Helm chart for ... +nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... 
``` Once this repo is updated, you can begin installing packages from it to deploy @@ -302,8 +302,8 @@ the `gpu-feature-discovery` component in standalone mode. The most basic installation command without any options is then: ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version 0.15.0 \ +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version 0.16.2 \ --namespace gpu-feature-discovery \ --create-namespace \ --set devicePlugin.enabled=false @@ -313,12 +313,12 @@ Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in the default namespace. ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.15.0 \ - --set allowDefaultNamespace=true \ - --set nfd.enabled=false \ - --set migStrategy=mixed \ - --set devicePlugin.enabled=false +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --set allowDefaultNamespace=true \ + --set nfd.enabled=false \ + --set migStrategy=mixed \ + --set devicePlugin.enabled=false ``` **Note:** You only need the to pass the `--devel` flag to `helm search repo` @@ -335,11 +335,11 @@ it uses a direct URL to the `helm` chart instead of via the `helm` repo. Using the default values for the flags: ```shell -$ helm upgrade -i nvdp \ - --namespace gpu-feature-discovery \ - --set devicePlugin.enabled=false \ - --create-namespace \ - https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.15.0.tgz +helm upgrade -i nvdp \ + --namespace gpu-feature-discovery \ + --set devicePlugin.enabled=false \ + --create-namespace \ + https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.2.tgz ``` ## Building and running locally on your native machine From 945c6714dd4cd9e17be3059bef7b9982cfc90131 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 20:29:06 -0400 Subject: [PATCH 16/19] update table Signed-off-by: chipzoller --- docs/gpu-feature-discovery/README.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 8ea4cedec..028eeee55 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -198,18 +198,18 @@ For a similar list of labels generated or used by the device plugin, see [here]( | Label Name | Value Type | Meaning | Example | | -------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | -| nvidia.com/cuda.driver.major | Integer | (Deprecated) Major of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver.minor | Integer | (Deprecated) Minor of the version of NVIDIA driver | 30 | -| nvidia.com/cuda.driver.rev | Integer | (Deprecated) Revision of the version of NVIDIA driver | 40 | -| nvidia.com/cuda.driver-version.major | Integer | Major of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver-version.minor | Integer | Minor of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver-version.revision | Integer | Revision of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver-version.full | Integer | Full version number of NVIDIA driver | 418 | -| nvidia.com/cuda.runtime.major | Integer | (Deprecated) Major of the version of CUDA | 10 | -| nvidia.com/cuda.runtime.minor | Integer | (Deprecated) Minor of the version of CUDA | 1 | -| nvidia.com/cuda.runtime-version.major | 
Integer | Major of the version of CUDA | 418 | -| nvidia.com/cuda.runtime-version.minor | Integer | Minor of the version of CUDA | 418 | -| nvidia.com/cuda.runtime-version.full | Integer | Full version number of CUDA | 418 | +| nvidia.com/cuda.driver.major | Integer | (Deprecated) Major of the version of NVIDIA driver | 550 | +| nvidia.com/cuda.driver.minor | Integer | (Deprecated) Minor of the version of NVIDIA driver | 107 | +| nvidia.com/cuda.driver.rev | Integer | (Deprecated) Revision of the version of NVIDIA driver | 02 | +| nvidia.com/cuda.driver-version.major | Integer | Major of the version of NVIDIA driver | 550 | +| nvidia.com/cuda.driver-version.minor | Integer | Minor of the version of NVIDIA driver | 107 | +| nvidia.com/cuda.driver-version.revision | Integer | Revision of the version of NVIDIA driver | 02 | +| nvidia.com/cuda.driver-version.full | Integer | Full version number of NVIDIA driver | 550.107.02 | +| nvidia.com/cuda.runtime.major | Integer | (Deprecated) Major of the version of CUDA | 12 | +| nvidia.com/cuda.runtime.minor | Integer | (Deprecated) Minor of the version of CUDA | 5 | +| nvidia.com/cuda.runtime-version.major | Integer | Major of the version of CUDA | 12 | +| nvidia.com/cuda.runtime-version.minor | Integer | Minor of the version of CUDA | 5 | +| nvidia.com/cuda.runtime-version.full | Integer | Full version number of CUDA | 12.5 | | nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1555019244 | | nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 3 | | nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 3 | From 22704235a98aa7f8fadd7718e790617f12bf1857 Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 20:32:31 -0400 Subject: [PATCH 17/19] GFD tweaks Signed-off-by: chipzoller --- docs/gpu-feature-discovery/README.md | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 028eeee55..a7773c08e 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -237,7 +237,7 @@ is partitioned into 7 equal sized MIG devices (56 total). | nvidia.com/mig.strategy | String | MIG strategy in use | single | | nvidia.com/gpu.product (overridden) | String | Model of the GPU (with MIG info added) | A100-SXM4-40GB-MIG-1g.5gb | | nvidia.com/gpu.count (overridden) | Integer | Number of MIG devices | 56 | -| nvidia.com/gpu.memory (overridden) | Integer | Memory of each MIG device in Mb | 5120 | +| nvidia.com/gpu.memory (overridden) | Integer | Memory of each MIG device in megabytes (MB) | 5120 | | nvidia.com/gpu.multiprocessors | Integer | Number of Multiprocessors for MIG device | 14 | | nvidia.com/gpu.slices.gi | Integer | Number of GPU Instance slices | 1 | | nvidia.com/gpu.slices.ci | Integer | Number of Compute Instance slices | 1 | @@ -261,7 +261,7 @@ e.g. 
MIG_TYPE=mig-3g.20gb | ------------------------------------ | ---------- | ---------------------------------------- | -------------- | | nvidia.com/mig.strategy | String | MIG strategy in use | mixed | | nvidia.com/MIG\_TYPE.count | Integer | Number of MIG devices of this type | 2 | -| nvidia.com/MIG\_TYPE.memory | Integer | Memory of MIG device type in Mb | 10240 | +| nvidia.com/MIG\_TYPE.memory | Integer | Memory of MIG device type in megabytes (MB) | 10240 | | nvidia.com/MIG\_TYPE.multiprocessors | Integer | Number of Multiprocessors for MIG device | 14 | | nvidia.com/MIG\_TYPE.slices.ci | Integer | Number of GPU Instance slices | 1 | | nvidia.com/MIG\_TYPE.slices.gi | Integer | Number of Compute Instance slices | 1 | @@ -279,8 +279,7 @@ Instructions for installing `helm` can be found As of `v0.15.0`, the device plugin's helm chart has integrated support to deploy GFD. -When GFD is deployed standalone, begin by setting up the -plugin's `helm` repository and updating it at follows: +To deploy GFD standalone, begin by setting up the plugin's `helm` repository and updating it as follows: ```shell helm repo add nvdp https://nvidia.github.io/k8s-device-plugin @@ -296,8 +295,7 @@ NAME CHART VERSION APP VERSION DESCRIPTION nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... ``` -Once this repo is updated, you can begin installing packages from it to deploy -the `gpu-feature-discovery` component in standalone mode. +Once this repo is updated, you can begin installing packages from it to deploy GFD in standalone mode. The most basic installation command without any options is then: @@ -310,7 +308,7 @@ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ ``` Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in -the default namespace. +the default namespace: ```shell helm upgrade -i nvdp nvdp/nvidia-device-plugin \ @@ -327,10 +325,10 @@ version (e.g. `-rc.1`). Full releases will be listed without this. ### Deploying via `helm install` with a direct URL to the `helm` package -If you prefer not to install from the `nvidia-device-plugin` `helm` repo, you can -run `helm install` directly against the tarball of the plugin's `helm` package. +If you prefer not to install from the `nvidia-device-plugin` helm repo, you can +run `helm install` directly against the tarball of the plugin's helm package. The example below installs the same chart as the method above, except that -it uses a direct URL to the `helm` chart instead of via the `helm` repo. +it uses a direct URL to the helm chart instead of via the helm repo. 
Using the default values for the flags: From 6ca2ba32210c29e8bf7d782e8c9cc1c90a36d74c Mon Sep 17 00:00:00 2001 From: chipzoller Date: Sun, 25 Aug 2024 20:45:13 -0400 Subject: [PATCH 18/19] table value updates Signed-off-by: chipzoller --- docs/gpu-feature-discovery/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index a7773c08e..f7c524200 100644 --- a/docs/gpu-feature-discovery/README.md +++ b/docs/gpu-feature-discovery/README.md @@ -210,16 +210,16 @@ For a similar list of labels generated or used by the device plugin, see [here]( | nvidia.com/cuda.runtime-version.major | Integer | Major of the version of CUDA | 12 | | nvidia.com/cuda.runtime-version.minor | Integer | Minor of the version of CUDA | 5 | | nvidia.com/cuda.runtime-version.full | Integer | Full version number of CUDA | 12.5 | -| nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1555019244 | -| nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 3 | -| nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 3 | +| nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1724632719 | +| nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 7 | +| nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 5 | | nvidia.com/gpu.count | Integer | Number of GPUs | 2 | -| nvidia.com/gpu.family | String | Architecture family of the GPU | kepler | -| nvidia.com/gpu.machine | String | Machine type | DGX-1 | -| nvidia.com/gpu.memory | Integer | Memory of the GPU in megabytes (MB) | 2048 | -| nvidia.com/gpu.product | String | Model of the GPU. May be modified by the device plugin if a sharing strategy is employed depending on the config. | GeForce-GT-710 | +| nvidia.com/gpu.family | String | Architecture family of the GPU | turing | +| nvidia.com/gpu.machine | String | Machine type. If in a public cloud provider, value may be set to the instance type. | DGX-1 | +| nvidia.com/gpu.memory | Integer | Memory of the GPU in megabytes (MB) | 15360 | +| nvidia.com/gpu.product | String | Model of the GPU. May be modified by the device plugin if a sharing strategy is employed depending on the config. | Tesla-T4 | | nvidia.com/gpu.replicas | String | Number of GPU replicas available. Will be equal to the number of physical GPUs unless some sharing strategy is employed in which case the GPU count will be multiplied by replicas. | 4 | -| nvidia.com/gpu.mode | String | Display or Compute Mode of the GPU. Details of the GPU modes can be found [here](https://docs.nvidia.com/grid/13.0/grid-gpumodeswitch-user-guide/index.html#compute-and-graphics-mode) | compute | +| nvidia.com/gpu.mode | String | Mode of the GPU. Can be either "compute" or "display". 
Details of the GPU modes can be found [here](https://docs.nvidia.com/grid/13.0/grid-gpumodeswitch-user-guide/index.html#compute-and-graphics-mode) | compute | Depending on the MIG strategy used, the following set of labels may also be available (or override the default values for some of the labels listed above): From 63a2aef570e981a402817bced2e2aa9cb475267b Mon Sep 17 00:00:00 2001 From: chipzoller Date: Mon, 26 Aug 2024 07:57:30 -0400 Subject: [PATCH 19/19] fix Signed-off-by: chipzoller --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 26602c9dd..31b11d741 100644 --- a/README.md +++ b/README.md @@ -160,7 +160,7 @@ With the daemonset deployed, NVIDIA GPUs can now be requested by a container using the `nvidia.com/gpu` resource type: ```yaml -$ cat <