Merge branch 'NVIDIA:master' into master
stephandooper authored May 26, 2023
2 parents c4f7ef1 + 7fb8f2c commit d3271c5
Showing 37 changed files with 683 additions and 163 deletions.
41 changes: 41 additions & 0 deletions .github/workflows/codeql.yml
@@ -0,0 +1,41 @@
name: "CodeQL"

on:
  push:
    branches: [ "master", "release-20.02", "release-20.06", "release-20.08", "release-20.10", "release-20.11", "release-20.12", "release-21.03", "release-21.05", "release-21.06", "release-21.09", "release-21.12", "release-22.01", "release-22.04" ]
  pull_request:
    branches: [ "master" ]
  schedule:
    - cron: "38 16 * * 2"

jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      security-events: write

    strategy:
      fail-fast: false
      matrix:
        language: [ python ]

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v2
        with:
          languages: ${{ matrix.language }}
          queries: +security-and-quality

      - name: Autobuild
        uses: github/codeql-action/autobuild@v2

      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v2
        with:
          category: "/language:${{ matrix.language }}"
4 changes: 4 additions & 0 deletions config.example/group_vars/slurm-cluster.yml
@@ -17,6 +17,10 @@ slurm_db_password: AlsoReplaceWithASecurePasswordInTheVault
#slurm_max_job_timelimit: INFINITE
#slurm_default_job_timelimit:

# Auto-detect GPUs using NVML
# See docs/slurm-cluster/nvml.md for details.
slurm_autodetect_nvml: true

# Ensure hosts file generation only runs across slurm cluster
hosts_add_ansible_managed_hosts_groups: ["slurm-cluster"]

2 changes: 1 addition & 1 deletion config.example/helm/monitoring-no-persist.yml
@@ -20,7 +20,7 @@ prometheus:
- role: endpoints
namespaces:
names:
- gpu-operator-resources
- gpu-operator
- monitoring
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
2 changes: 1 addition & 1 deletion config.example/helm/monitoring.yml
@@ -23,7 +23,7 @@ prometheus:
- role: endpoints
namespaces:
names:
- gpu-operator-resources
- gpu-operator
- monitoring
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
87 changes: 85 additions & 2 deletions docs/airgap/mirror-apt-repos.md
@@ -149,6 +149,13 @@ sudo apt update
sudo apt install apt-mirror
```

The `apt-mirror` package in the distribution repository is no longer maintained and has some issues. Community-supported forks are available. To install an updated `apt-mirror` from one such fork:
```
git clone https://github.com/Stifler6996/apt-mirror
cd apt-mirror
sudo cp -f apt-mirror /usr/bin/
```

After installing `apt-mirror`, edit the `/etc/apt/mirror.list` file and make the following changes:

- Set the `base_path` to the desired download path for your mirror (here, `/var/repos`)
@@ -169,6 +176,57 @@ deb https://download.docker.com/linux/ubuntu bionic stable
deb https://nvidia.github.io/nvidia-docker/ubuntu20.04/amd64 /
```

The full `mirror.list` file for DeepOps:

```
############# config ##################
#
set base_path /var/repos
#
# set mirror_path $base_path/mirror
# set skel_path $base_path/skel
# set var_path $base_path/var
# set cleanscript $var_path/clean.sh
# set defaultarch <running host architecture>
# set postmirror_script $var_path/postmirror.sh
# set run_postmirror 0
set nthreads 20
set _tilde 0
#
############# end config ##############
deb http://archive.ubuntu.com/ubuntu focal main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-security main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-proposed main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
deb http://ppa.launchpad.net/maas/2.9/ubuntu focal main
deb http://archive.canonical.com/ubuntu focal partner
deb-src http://archive.ubuntu.com/ubuntu focal main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-security main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-proposed main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
deb https://download.docker.com/linux/ubuntu focal stable
deb https://nvidia.github.io/nvidia-docker/ubuntu20.04/amd64 /
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu20.04/amd64 /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu20.04/amd64 /
deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 /
deb http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal common dgx
deb http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
clean http://archive.ubuntu.com/ubuntu
clean https://download.docker.com
clean https://nvidia.github.io
clean https://developer.download.nvidia.com
clean http://repo.download.nvidia.com
clean http://ppa.launchpad.net
```

Then create the target directory and run `apt-mirror`:

```bash
@@ -217,7 +275,32 @@ sudo cp -r /var/repos/mirror/nvidia.github.com/nvidia-docker/ /var/www/html/repo
At this point, the downloaded package repositories should be available on your offline network via the package server.
You can then add these downloaded repos to the `/etc/apt/sources.list` configuration on the servers that need to install the packages, e.g.:

Lines added to `/etc/apt/sources.list`:

```
deb http://repo-server/ubuntu focal main restricted universe multiverse
deb http://repo-server/ubuntu focal-updates main restricted universe multiverse
deb http://repo-server/ubuntu focal-backports main restricted universe multiverse
deb http://repo-server/ubuntu focal-security main restricted universe multiverse
```

Lines added to `/etc/apt/sources.list.d/dgx.list`:

```
deb http://repo-server/baseos/ubuntu/focal/x86_64/ focal common dgx
deb http://repo-server/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
```

Lines added to `/etc/apt/sources.list.d/cuda-compute-repo.list`:

```
deb http://repo-server/cuda/repos/ubuntu2004/x86_64/ /
```

Lines added to `/etc/apt/sources.list.d/nvidia-docker.list`:

```
# Line added to /etc/apt/sources.list
deb http://repo-server/repos/nvidia-docker/ubuntu20.04/amd64 /
deb [trusted=yes] http://repo-server/libnvidia-container/stable/ubuntu20.04/amd64 /
deb [trusted=yes] http://repo-server/nvidia-container-runtime/stable/ubuntu20.04/amd64 /
deb [trusted=yes] http://repo-server/nvidia-docker/ubuntu20.04/amd64 /
```
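
The Ubuntu and DGX entries listed above follow a fixed pattern, so they can be generated for any mirror host instead of being typed by hand. The following is a minimal sketch, not part of the committed documentation: the function names `ubuntu_sources` and `dgx_sources` are hypothetical, and it assumes the mirror exposes the same `/ubuntu` and `/baseos/ubuntu/focal/x86_64/` paths shown in the examples.

```bash
#!/bin/sh
# Sketch only: print apt source lines for an offline mirror host.
# Function names are illustrative; the paths are copied from the examples above.

# Print the Ubuntu focal entries for the mirror host given as $1.
ubuntu_sources() {
    host=$1
    for suite in focal focal-updates focal-backports focal-security; do
        printf 'deb http://%s/ubuntu %s main restricted universe multiverse\n' \
            "$host" "$suite"
    done
}

# Print the DGX base OS entries for the mirror host given as $1.
dgx_sources() {
    host=$1
    for suite in focal focal-updates; do
        printf 'deb http://%s/baseos/ubuntu/focal/x86_64/ %s common dgx\n' \
            "$host" "$suite"
    done
}
```

For example, `dgx_sources repo-server | sudo tee /etc/apt/sources.list.d/dgx.list` would write the two DGX lines shown above; run `sudo apt update` afterwards to confirm that the mirror is reachable.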
12 changes: 9 additions & 3 deletions docs/airgap/mirror-docker-images.md
@@ -41,9 +41,10 @@ ansible-playbook playbooks/container/docker.yml
Then, for each image you want to download, you should pull the image from the remote registry and save it to a local file.
In this example, we're saving all our Docker images to `/tmp/images`:

```bash
docker pull nvidia/cuda:11.1-devel-ubuntu20.04
docker save -o /tmp/images/nvidia-cuda-11.1-devel-ubuntu20.04.tar nvidia/cuda:11.1-devel-ubuntu20.04
```

Additionally, you should download and save the [`registry` image](https://hub.docker.com/_/registry) so that you can deploy a local registry on the offline network.
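
When several images need to be transferred, the pull-and-save step can be wrapped in a loop. The sketch below is an illustration under stated assumptions rather than part of DeepOps: the helper names `image_to_tar_name` and `save_images` are hypothetical, and the archive naming rule (replace `/` and `:` with `-`) is inferred from the file names used in this document.

```bash
#!/bin/sh
# Sketch only: save a list of images as tar archives for offline transfer.
# Helper names are illustrative and do not exist in DeepOps.

# Derive an archive file name from an image reference by replacing
# "/" and ":" with "-" and appending ".tar".
image_to_tar_name() {
    printf '%s.tar\n' "$(printf '%s\n' "$1" | sed 's|[/:]|-|g')"
}

# Pull every image given after the output directory and save it there.
save_images() {
    out_dir=$1
    shift
    mkdir -p "$out_dir"
    for image in "$@"; do
        docker pull "$image"
        docker save -o "$out_dir/$(image_to_tar_name "$image")" "$image"
    done
}
```

A call such as `save_images /tmp/images nvidia/cuda:11.1-devel-ubuntu20.04 registry:2.7` would then cover both the CUDA image and the registry image; the `2.7` tag is an assumption based on the `registry-2.7.tar` file name used later in this guide.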
@@ -75,8 +76,13 @@ Additionally, we assume that the `registry` image was included when you transfer

Load the registry image into the Docker image cache of your container registry host:

```bash
docker load -i /tmp/images/registry-2.7.tar
```
Then create a Docker volume to store your container images:
2 changes: 2 additions & 0 deletions docs/slurm-cluster/README.md
@@ -57,6 +57,8 @@ Instructions for deploying a GPU cluster with Slurm
> `slurm_enable_ha: true` in `config/group_vars/slurm-cluster.yml`. For more information about HA Slurm deployments,
> see: https://slurm.schedmd.com/quickstart_admin.html#HA

4. If you intend to configure [Multi-Instance GPU](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/) (MIG) on the cluster, consult the [Slurm NVML documentation](./nvml.md).

4. Verify the configuration.

```bash