Merge pull request canonical#157 from jocado/cdi
NVIDIA support - improvements relating to runtime config and CDI
lucaskanashiro authored Apr 24, 2024
2 parents c75eca3 + 0e066af commit f6e8a7a
Showing 9 changed files with 190 additions and 118 deletions.
29 changes: 25 additions & 4 deletions README.md
@@ -58,19 +58,40 @@ Docker should function normally, with the following caveats:

## NVIDIA support on Ubuntu Core 22

If the system is found to have an nvidia graphics card available, the nvidia container toolkit will be set up and configured to enable use of the local GPU from docker. This can be used, for instance, to run CUDA workloads in a docker container.

This requires connection of the graphics-core22 content interface provided by the nvidia-core22 snap, which should be connected automatically once installed.

To enable proper use of the GPU within docker, the nvidia runtime must be used. By default, the nvidia runtime is configured to use [CDI](https://github.com/cncf-tags/container-device-interface) mode, and the appropriate nvidia CDI config will be created automatically for the system. You just need to specify the nvidia runtime when running a container.

Example usage:

```shell
docker run --rm --runtime nvidia {cuda-container-image-name}
```
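
For instance, to check GPU visibility from inside a container (the image name here is illustrative; any CUDA-enabled image should work):

```shell
docker run --rm --runtime nvidia nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```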

### Custom NVIDIA runtime config

If you need to adjust the automatically generated runtime config, you can use the `nvidia-support.runtime.config-override` snap config option to replace it entirely.

```shell
snap set docker nvidia-support.runtime.config-override="$(cat custom-nvidia-config.toml)"
```
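
For illustration, a minimal override might look like the following. This is a sketch only: `mode` and `debug` are standard nvidia-container-runtime config keys, but the log path is just an example, so start from the automatically generated config where possible.

```toml
# custom-nvidia-config.toml - hypothetical example override
[nvidia-container-runtime]
# Keep the snap's default CDI mode
mode = "cdi"
# Example only: enable runtime debug logging to a writable path
debug = "/var/snap/docker/common/nvidia-container-runtime.log"
```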

### CDI device naming strategy

By default, the CDI config is generated with the `index` `device-name-strategy`. Optionally, you can specify an alternative from the currently supported strategies:
* `index`
* `uuid`
* `type-index`

```shell
snap set docker nvidia-support.cdi.device-name-strategy=uuid
```
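
The strategy determines how GPUs are named in the generated CDI spec, roughly as follows (a sketch; actual names depend on your hardware):

```shell
# index:      nvidia.com/gpu=0
# uuid:       nvidia.com/gpu=GPU-<device-uuid>
# type-index: nvidia.com/gpu=gpu0
# With nvidia-ctk on your PATH, the generated names can be listed with:
nvidia-ctk cdi list
```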

### Disable NVIDIA support

Setting up nvidia support should be automatic if the hardware is present, but you may wish to explicitly disable it so that setup is not even attempted. You can do so via the following snap config:
```shell
snap set docker nvidia-support.disabled=true
```
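
To re-enable it later, set the option back to anything other than `true` (the hook scripts only treat the literal value `true` as disabled):

```shell
snap set docker nvidia-support.disabled=false
```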
83 changes: 0 additions & 83 deletions bin/nvidia-container-toolkit

This file was deleted.

94 changes: 94 additions & 0 deletions nvidia/lib
@@ -0,0 +1,94 @@
# nvidia toolkit related setup functions #

U_MACHINE="$(uname -m)"
U_OS="$(uname -o)"
U_KERNEL="${U_OS##*/}"
U_USERLAND="${U_OS%%/*}"

ARCH_TRIPLET="${U_MACHINE}-${U_KERNEL,,}-${U_USERLAND,,}"
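# e.g. U_MACHINE=x86_64 and U_OS=GNU/Linux give ARCH_TRIPLET=x86_64-linux-gnu #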

NVIDIA_SUPPORT_DISABLED="$(snapctl get nvidia-support.disabled)"

device_wait() {

    COUNT=0
    SLEEP=3
    TRIES=10

    echo "Waiting for device to become available: ${1}"

    while [ ${COUNT} -le ${TRIES} ] ; do
        echo "Checking device: ${COUNT}/${TRIES}"
        if test -c "${1}" ; then
            echo "Device found"
            return 0
        fi
        sleep ${SLEEP}
        COUNT=$((COUNT + 1))
    done

    echo "Device not found"
    return 1

}

# Check if hardware is present - just exit if not #
nvidia_hw_ensure() {
    lspci -d 10de: | grep -q 'NVIDIA Corporation' || exit 0
    echo "NVIDIA hardware detected: $(lspci -d 10de:)"
}

# Create any data dirs if missing #
ensure_nvidia_data_dirs() {
    mkdir -p "${SNAP_DATA}/etc/cdi"
    mkdir -p "${SNAP_DATA}/etc/nvidia-container-runtime"
}

# Generate the CDI config #
cdi_generate () {
    # Allow configured device-name-strategy, or default to index [ default in nvidia-ctk ] #
    CDI_DEVICE_NAME_STRATEGY="$(snapctl get nvidia-support.cdi.device-name-strategy)"
    CDI_DEVICE_NAME_STRATEGY="${CDI_DEVICE_NAME_STRATEGY:-index}"

    PATH="${PATH}:${SNAP}/graphics/bin" "${SNAP}/usr/bin/nvidia-ctk" cdi generate \
        --nvidia-ctk-path "${SNAP}/usr/bin/nvidia-ctk" \
        --library-search-path "${SNAP}/graphics/lib/${ARCH_TRIPLET}" \
        --device-name-strategy "${CDI_DEVICE_NAME_STRATEGY}" \
        --output "${SNAP_DATA}/etc/cdi/nvidia.yaml"
}

# Create the nvidia runtime config, either snap default or custom #
nvidia_runtime_config () {
    RUNTIME_CONFIG_OVERRIDE="$(snapctl get nvidia-support.runtime.config-override)"

    # Custom #
    if [ -n "${RUNTIME_CONFIG_OVERRIDE}" ] ; then
        echo "${RUNTIME_CONFIG_OVERRIDE}" > "${SNAP_DATA}/etc/nvidia-container-runtime/config.toml"
    # Default - opinionated, but most viable option for now #
    else
        rm -f "${SNAP_DATA}/etc/nvidia-container-runtime/config.toml"
        "${SNAP}/usr/bin/nvidia-ctk" config --in-place --set nvidia-container-runtime.mode=cdi
    fi
}

# Generate the dockerd runtime config #
docker_runtime_configure () {
    "${SNAP}/usr/bin/nvidia-ctk" runtime configure --runtime=docker \
        --runtime-path "${SNAP}/usr/bin/nvidia-container-runtime" \
        --config "${SNAP_DATA}/config/daemon.json"
}
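
# For reference, the fragment this writes into daemon.json is roughly [ sketch only ]: #
#   "runtimes": { "nvidia": { "path": ".../usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } #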

# Setup failure recovery #
setup_fail () {
    echo "WARNING: Container Toolkit setup failed with an error"

    # Remove nvidia runtime config, if it exists #
    jq -r 'del(.runtimes.nvidia)' "${SNAP_DATA}/config/daemon.json" > "${SNAP_DATA}/config/daemon.json.new"

    # If it was removed [ there was a change ], copy in the new config and remove the CDI config #
    if ! cmp "${SNAP_DATA}/config/daemon.json"{,.new} >/dev/null ; then
        mv "${SNAP_DATA}/config/daemon.json"{.new,}
        rm -f "${SNAP_DATA}/etc/cdi/nvidia.yaml"
        rm -f "${SNAP_DATA}/etc/nvidia-container-runtime/config.toml"
    fi
}

# Info #
setup_info () {
    echo "Container Toolkit setup complete"
}
32 changes: 32 additions & 0 deletions nvidia/nvidia-container-toolkit
@@ -0,0 +1,32 @@
#!/bin/bash

set -eu

. "${SNAP}/usr/share/nvidia-container-toolkit/lib"

# Just exit if NVIDIA support is disabled #
[ "${NVIDIA_SUPPORT_DISABLED}" != "true" ] || exit 0

# Ensure nvidia support is set up correctly, and only if hardware is present and correct #
if snapctl is-connected graphics-core22 ; then

    # Connection hooks are run early - copy the config file from $SNAP into $SNAP_DATA if it doesn't exist
    if [ ! -f "$SNAP_DATA/config/daemon.json" ]; then
        mkdir -p "$SNAP_DATA/config"
        cp "$SNAP/config/daemon.json" "$SNAP_DATA/config/daemon.json"
    fi

    # Ensure hardware present #
    nvidia_hw_ensure

    # As service ordering is not guaranteed outside of the snap - wait a bit for nvidia assemble to complete #
    device_wait /dev/nvidiactl || exit 0
    echo "NVIDIA ready"

    # Ensure the data dirs exist #
    ensure_nvidia_data_dirs

    # Setup nvidia support, but do not exit on failure #
    cdi_generate && nvidia_runtime_config && docker_runtime_configure && setup_info || setup_fail

fi
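
After a successful run, the result can be sanity-checked from the host. A hedged sketch (assuming the docker snap's `$SNAP_DATA` resolves to `/var/snap/docker/current`, per the usual snapd layout):

```shell
# Was the CDI spec generated?
test -f /var/snap/docker/current/etc/cdi/nvidia.yaml && echo "CDI spec present"
# Does dockerd now list the nvidia runtime?
docker info --format '{{json .Runtimes}}' | jq 'has("nvidia")'
```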
11 changes: 7 additions & 4 deletions snap/hooks/configure
@@ -5,13 +5,11 @@ set -eux
HOOK_LOG="${SNAP_COMMON}/hooks/${SNAP_REVISION}/$(basename "$0").log" && mkdir -p "${HOOK_LOG%/*}"
exec &> >(tee -a "${HOOK_LOG}")

. "${SNAP}/usr/share/nvidia-container-toolkit/lib"

# Flag to trigger service restart if any condition requires it #
SVC_RESTART=false


# Check if nvidia support is disabled #
NVIDIA_SUPPORT_DISABLED="$(snapctl get nvidia-support.disabled)"

if [ "${NVIDIA_SUPPORT_DISABLED}" == "true" ] ; then

# Remove nvidia runtime config, if it exists #
@@ -21,6 +19,7 @@ if [ "${NVIDIA_SUPPORT_DISABLED}" == "true" ] ; then
if ! cmp "${SNAP_DATA}/config/daemon.json"{,.new} >/dev/null ; then
mv "${SNAP_DATA}/config/daemon.json"{.new,}
rm -f "${SNAP_DATA}/etc/cdi/nvidia.yaml"
rm -f "${SNAP_DATA}/etc/nvidia-container-runtime/config.toml"
SVC_RESTART=true
fi

@@ -31,4 +30,8 @@ fi
if $SVC_RESTART ; then
snapctl stop "${SNAP_NAME}"
snapctl start "${SNAP_NAME}"
# Otherwise, just restart the nvidia-container-toolkit [ stop/start, as inactive oneshot services don't respond to restart ] #
else
snapctl stop "${SNAP_NAME}.nvidia-container-toolkit"
snapctl start "${SNAP_NAME}.nvidia-container-toolkit"
fi
20 changes: 13 additions & 7 deletions snap/hooks/connect-plug-graphics-core22
@@ -1,22 +1,28 @@
#!/bin/bash

set -eux

HOOK_LOG="${SNAP_COMMON}/hooks/${SNAP_REVISION}/$(basename "$0").log" && mkdir -p "${HOOK_LOG%/*}"
exec &> >(tee -a "${HOOK_LOG}")

. "${SNAP}/usr/share/nvidia-container-toolkit/lib"

# Just exit if NVIDIA support is disabled #
NVIDIA_SUPPORT_DISABLED="$(snapctl get nvidia-support.disabled)"
[ "${NVIDIA_SUPPORT_DISABLED}" != "true" ] || exit 0

# Ensure nvidia support is set up correctly #
if snapctl is-connected graphics-core22 ; then

# Only set this up if the system has an nvidia card available #
grep -Eq "^nvidia" /proc/modules || exit 0
test -c /dev/nvidiactl || exit 0
echo "NVIDIA detected"

# Ensure hardware present #
nvidia_hw_ensure

# Restart services to reflect any changes if required #
# Has to be called by name - nvidia-container-toolkit is inactive and won't be included with just the snap name #
snapctl restart "${SNAP_NAME}.nvidia-container-toolkit"
# If oneshot services are inactive they don't respond to restart, so stop/start #
snapctl stop "${SNAP_NAME}.nvidia-container-toolkit"
snapctl start "${SNAP_NAME}.nvidia-container-toolkit"
snapctl restart "${SNAP_NAME}.dockerd"

fi
18 changes: 11 additions & 7 deletions snap/hooks/install
@@ -1,11 +1,15 @@
#!/bin/bash

# copy the config file from $SNAP into $SNAP_COMMON if it doesn't exist
if [ ! -f "$SNAP_DATA/config/daemon.json" ]; then
mkdir -p "$SNAP_DATA/config"
cp "$SNAP/config/daemon.json" "$SNAP_DATA/config/daemon.json"
fi
set -eu

. "${SNAP}/usr/share/nvidia-container-toolkit/lib"

# ensure the layouts dir for /etc/{stuff} exists
mkdir -p "$SNAP_DATA/etc/docker"
mkdir -p "$SNAP_DATA/etc/nvidia-container-runtime"
mkdir -p "$SNAP_DATA/config"
ensure_nvidia_data_dirs

# copy the config file from $SNAP into $SNAP_DATA if it doesn't exist
if [ ! -f "$SNAP_DATA/config/daemon.json" ]; then
cp "$SNAP/config/daemon.json" "$SNAP_DATA/config/daemon.json"
fi
11 changes: 0 additions & 11 deletions snap/hooks/post-refresh

This file was deleted.

1 change: 1 addition & 0 deletions snap/hooks/post-refresh
