Commit b6ff68f

Merge remote-tracking branch 'upstream/master'

stephandooper committed Sep 20, 2023
2 parents 2e18821 + d248b65
Showing 47 changed files with 185 additions and 240 deletions.
27 changes: 3 additions & 24 deletions README.md
@@ -7,7 +7,6 @@ Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GP
 - [DeepOps](#deepops)
 - [Table of Contents](#table-of-contents)
 - [Overview](#overview)
-- [Releases Notes](#releases-notes)
 - [Deployment Requirements](#deployment-requirements)
 - [Provisioning System](#provisioning-system)
 - [Cluster System](#cluster-system)
@@ -29,27 +28,7 @@ The DeepOps project encapsulates best practices in the deployment of GPU server
 - An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm or Kubernetes
 - A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime
 
-## Releases Notes
-
-Latest release: [DeepOps 22.08 Release](https://github.com/NVIDIA/deepops/releases/tag/22.08)
-
-- Kubernetes Default Components:
-
-  - [kubernetes](https://github.com/kubernetes/kubernetes) v1.22.8
-  - [etcd](https://github.com/coreos/etcd) v3.5.0
-  - [docker](https://www.docker.com/) v20.10
-  - [containerd](https://containerd.io/) v1.5.8
-  - [cri-o](http://cri-o.io/) v1.22
-  - [calico](https://github.com/projectcalico/calico) v3.20.3
-  - [dashboard](https://github.com/kubernetes/dashboard/tree/master) v2.0.3
-  - [dashboard metrics scraper](https://github.com/kubernetes-sigs/dashboard-metrics-scraper/tree/master) v1.0.4
-  - [nvidia gpu operator](https://github.com/NVIDIA/gpu-operator/tree/master) 1.10.0
-
-- Slurm Default Components:
-
-  - [slurm](https://github.com/SchedMD/slurm/tree/master) 21.08.8-2
-  - [Singularity](https://github.com/apptainer/singularity/tree/master) 3.7.3
-  - [docker](https://www.docker.com/) v20.10
+Latest release: [DeepOps 23.08 Release](https://github.com/NVIDIA/deepops/releases/tag/23.08)
 
 It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally [functional](docs/deepops/testing.md) but may change significantly between releases.

@@ -60,15 +39,15 @@ It is recommended to use the latest release branch for stable code (linked above
 The provisioning system is used to orchestrate the running of all playbooks, and one will be needed when instantiating Kubernetes or Slurm clusters. Operating systems which are tested and supported include:
 
 - NVIDIA DGX OS 4, 5
-- Ubuntu 18.04 LTS, 20.04 LTS
+- Ubuntu 18.04 LTS, 20.04, 22.04 LTS
 - CentOS 7, 8
 
 ### Cluster System
 
 The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as a provisioning system, but it is not required.
 
 - NVIDIA DGX OS 4, 5
-- Ubuntu 18.04 LTS, 20.04 LTS
+- Ubuntu 18.04 LTS, 20.04, 22.04 LTS
 - CentOS 7, 8
 
 You may also install a supported operating system on all servers via a 3rd-party solution (i.e. [MAAS](https://maas.io/), [Foreman](https://www.theforeman.org/)) or utilize the provided [OS install container](docs/pxe/minimal-pxe-container.md).
2 changes: 1 addition & 1 deletion config.example/files/kubeflow/dex-config-map.yaml
@@ -37,6 +37,6 @@ data:
     staticClients:
     # https://github.com/dexidp/dex/pull/1664
     - idEnv: OIDC_CLIENT_ID
-      redirectURIs: ["/login/oidc"]
+      redirectURIs: ["/login/oidc", "/authservice/oidc/callback"]
       name: 'Dex Login Application'
      secretEnv: OIDC_CLIENT_SECRET
16 changes: 8 additions & 8 deletions config.example/group_vars/all.yml
@@ -122,7 +122,9 @@ sftp_chroot: false
 ################################################################################
 # NVIDIA GPU configuration
 # Playbook: nvidia-cuda
-cuda_version: cuda-toolkit-11-5
+# Install latest version by default,
+# if you want a specific version, use i.e. cuda-toolkit=12.2.0-1
+# cuda_version: cuda-toolkit
 
 # DGX-specific vars may be used to target specific models,
 # because available versions for DGX may differ from the generic repo
@@ -146,9 +148,9 @@ nvidia_driver_force_install: false
 # Docker configuration
 # Playbook: docker, nvidia-docker, k8s-cluster
 #
-# For supported Docker versions, see: kubespray/roles/container-engine/docker/vars/*
+# For supported Docker versions, see: submodules/kubespray/roles/container-engine/docker/vars/*
 docker_install: yes
-docker_version: '20.10'
+# docker_version: 'latest'
 docker_dns_servers_strict: no
 docker_storage_options: -s overlay2
 #docker_options: "--bip=192.168.99.1/24"
@@ -196,7 +198,9 @@ enroot_environ_config_files_dgx:
 # Singularity configuration
 # Playbook: singularity, slurm-cluster
 # Singularity target version
-singularity_version: "3.7.3"
+# set an alternate singularity version here;
+# see roles/singularity_wrapper/defaults/main.yml for default
+# singularity_version:
 singularity_conf_path: "/etc/singularity/singularity.conf"
 bind_paths: []
 # example:
@@ -275,10 +279,6 @@ deepops_dir: /opt/deepops
 # Roles: K8s GPU operator, GPU plugin, OpenShift/K8s
 deepops_venv: '{{ deepops_dir }}/venv'
 
-# OpenMPI
-# Playbook: openmpi
-openmpi_version: 4.0.3
-
 # Disable cloud-init
 deepops_disable_cloud_init: true

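With `cuda_version` now commented out, the playbook installs the latest CUDA toolkit by default. To pin a release instead, one might set the variable and rerun the CUDA playbook; a minimal sketch, assuming the standard DeepOps config layout and playbook path (the version string is illustrative):

```bash
# Sketch: pin a specific CUDA toolkit in config/group_vars/all.yml, then deploy.
echo 'cuda_version: cuda-toolkit=12.2.0-1' >> config/group_vars/all.yml
ansible-playbook -i config/inventory playbooks/nvidia-software/nvidia-cuda.yml
```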
17 changes: 12 additions & 5 deletions config.example/group_vars/k8s-cluster.yml
@@ -8,7 +8,7 @@ kube_kubeadm_apiserver_extra_args:
 kubectl_localhost: false
 kubeconfig_localhost: true
 helm_enabled: true
-tiller_node_selectors: "node-role.kubernetes.io/master=''"
+tiller_node_selectors: "node-role.kubernetes.io/control-plane=''"
 
 ## Container runtime
 ## docker for docker, crio for cri-o and containerd for containerd.
@@ -69,10 +69,17 @@ docker_insecure_registries: "{{ groups['kube-master']|map('regex_replace', '^(.*
 crio_insecure_registries: "{{ groups['kube-master']|map('regex_replace', '^(.*)$', '\\1:5000')|list + ['registry.local:31500']}}"
 docker_registry_mirrors: "{{ groups['kube-master'] | map('regex_replace', '^(.*)$', 'http://\\1:5000') | list }}"
 
-# TODO: Add support in containerd for automatically setting up registry
-# mirrors, not just the k8s-local registry
-containerd_insecure_registries:
-  "registry.local:31500": "http://registry.local:31500"
+# TODO: The presence of an insecure local containerd registry in K8s v1.24+ seems to be causing an issue, add support for this back when the issue is fixed
+# BUG: https://github.com/kubernetes-sigs/kubespray/issues/9956
+## TODO: Add support in containerd for automatically setting up registry
+## mirrors, not just the k8s-local registry
+#containerd_insecure_registries:
+#  "registry.local:31500": "http://registry.local:31500"
+
+# Workaround an issue where kubespray defaults are causing containerd failures
+# https://github.com/kubernetes-sigs/cri-tools/issues/436
+# https://github.com/kubernetes-sigs/cri-tools/issues/710
+containerd_snapshotter: "native"
 
 # Work-around for https://github.com/kubernetes-sigs/kubespray/issues/8529
 nerdctl_extra_flags: " --insecure-registry"
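Because `containerd_snapshotter: "native"` is a workaround rather than a tuned default, it can be worth confirming what the deployed containerd actually uses; a quick check sketch, assuming shell access to a cluster node:

```bash
# Inspect the snapshotter containerd is configured with (sketch).
sudo containerd config dump | grep snapshotter
# crictl can report similar runtime information on recent versions:
sudo crictl info | grep -i snapshotter
```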
14 changes: 7 additions & 7 deletions config.example/group_vars/slurm-cluster.yml
@@ -3,7 +3,7 @@
 ################################################################################
 # Slurm job scheduler configuration
 # Playbook: slurm, slurm-cluster, slurm-perf, slurm-perf-cluster, slurm-validation
-slurm_version: "22.05.2"
+# slurm_version: ""
 slurm_install_prefix: /usr/local
 pmix_install_prefix: /opt/deepops/pmix
 hwloc_install_prefix: /opt/deepops/hwloc
@@ -137,10 +137,10 @@ sm_install_host: "slurm-master[0]"
 slurm_install_hpcsdk: true
 
 # Select the version of HPC SDK to download
-hpcsdk_major_version: "22"
-hpcsdk_minor_version: "1"
-hpcsdk_file_cuda: "11.5"
-hpcsdk_arch: "x86_64"
+#hpcsdk_major_version: ""
+#hpcsdk_minor_version: ""
+#hpcsdk_file_cuda: ""
+#hpcsdk_arch: "x86_64"
 
 # In a Slurm cluster, default to setting up HPC SDK as modules rather than in
 # the default user environment
@@ -156,7 +156,7 @@ hpcsdk_install_in_path: false
 # this can help you get started.
 ################################################################################
 slurm_cluster_install_openmpi: false
-openmpi_version: 4.0.4
+#openmpi_version:
 openmpi_install_prefix: "/usr/local"
 openmpi_configure: "./configure --prefix={{ openmpi_install_prefix }} --disable-dependency-tracking --disable-getpwuid --with-pmix={{ pmix_install_prefix }} --with-hwloc={{ hwloc_install_prefix }} --with-pmi={{ slurm_install_prefix }} --with-slurm={{ slurm_install_prefix }} --with-libevent=/usr"

@@ -185,7 +185,7 @@ allow_user_set_gpu_clocks: no
 ################################################################################
 slurm_install_enroot: true
 slurm_install_pyxis: true
-slurm_pyxis_version: 0.11.1
+#slurm_pyxis_version:
 
 # /run is default partition of pyxis runtime_path
 resize_run_partition: false
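With `slurm_version` commented out, deployments now take the role's default Slurm release. To pin one explicitly, a sketch assuming the standard DeepOps inventory and playbook layout (the version shown is illustrative):

```bash
# Sketch: pin a Slurm release in the cluster config, then deploy Slurm.
echo 'slurm_version: "23.02.4"' >> config/group_vars/slurm-cluster.yml
ansible-playbook -i config/inventory playbooks/slurm-cluster.yml
```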
26 changes: 26 additions & 0 deletions config.example/helm/metallb-resources.yml
@@ -0,0 +1,26 @@
+# This was autogenerated by MetalLB's custom resource generator.
+apiVersion: metallb.io/v1beta1
+kind: IPAddressPool
+metadata:
+  creationTimestamp: null
+  name: default
+  namespace: deepops-loadbalancer
+# Default address range matches private network for the virtual cluster
+# defined in virtual/.
+# You should set this address range based on your site's infrastructure.
+spec:
+  addresses:
+  - 10.0.0.100-10.0.0.110
+status: {}
+---
+apiVersion: metallb.io/v1beta1
+kind: L2Advertisement
+metadata:
+  creationTimestamp: null
+  name: l2advertisement1
+  namespace: deepops-loadbalancer
+spec:
+  ipAddressPools:
+  - default
+status: {}
+---
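Since the address pool now lives in MetalLB custom resources rather than the chart's configInline values, it must be applied after the chart itself; a usage sketch, assuming the upstream chart repo and the file paths shown in this commit:

```bash
# Sketch: install/upgrade MetalLB, then apply the IPAddressPool and
# L2Advertisement resources defined above.
helm repo add metallb https://metallb.github.io/metallb
helm upgrade --install metallb metallb/metallb \
  --namespace deepops-loadbalancer --create-namespace \
  --values config/helm/metallb.yml
kubectl apply -f config/helm/metallb-resources.yml
```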
12 changes: 1 addition & 11 deletions config.example/helm/metallb.yml
@@ -1,14 +1,4 @@
 ---
-# Default address range matches private network for the virtual cluster
-# defined in virtual/.
-# You should set this address range based on your site's infrastructure.
-configInline:
-  address-pools:
-  - name: default
-    protocol: layer2
-    addresses:
-    - 10.0.0.100-10.0.0.110
-
 controller:
   nodeSelector:
-    node-role.kubernetes.io/master: ""
+    node-role.kubernetes.io/control-plane: ""
8 changes: 4 additions & 4 deletions config.example/helm/monitoring-no-persist.yml
@@ -1,6 +1,6 @@
 prometheusOperator:
   nodeSelector:
-    node-role.kubernetes.io/master: ""
+    node-role.kubernetes.io/control-plane: ""
 
 prometheus:
   ingress:
@@ -27,7 +27,7 @@ prometheus:
         action: replace
         target_label: kubernetes_node
     nodeSelector:
-      node-role.kubernetes.io/master: ""
+      node-role.kubernetes.io/control-plane: ""
   service:
     type: NodePort
     nodePort: 30500
@@ -54,7 +54,7 @@ alertmanager:
      nginx.ingress.kubernetes.io/rewrite-target: /
   alertmanagerSpec:
     nodeSelector:
-      node-role.kubernetes.io/master: ""
+      node-role.kubernetes.io/control-plane: ""
   service:
     type: NodePort
     nodePort: 30400
@@ -69,7 +69,7 @@ grafana:
     nginx.ingress.kubernetes.io/ssl-redirect: "false"
     nginx.ingress.kubernetes.io/rewrite-target: /
   nodeSelector:
-    node-role.kubernetes.io/master: ""
+    node-role.kubernetes.io/control-plane: ""
   service:
     type: NodePort
     nodePort: 30200
8 changes: 4 additions & 4 deletions config.example/helm/monitoring.yml
@@ -1,6 +1,6 @@
 prometheusOperator:
   nodeSelector:
-    node-role.kubernetes.io/master: ""
+    node-role.kubernetes.io/control-plane: ""
 
 prometheus:
   ingress:
@@ -37,7 +37,7 @@ prometheus:
       requests:
         storage: 10Gi
     nodeSelector:
-      node-role.kubernetes.io/master: ""
+      node-role.kubernetes.io/control-plane: ""
   service:
     type: NodePort
     nodePort: 30500
@@ -64,7 +64,7 @@ alertmanager:
      nginx.ingress.kubernetes.io/rewrite-target: /
   alertmanagerSpec:
     nodeSelector:
-      node-role.kubernetes.io/master: ""
+      node-role.kubernetes.io/control-plane: ""
   service:
     type: NodePort
     nodePort: 30400
@@ -79,7 +79,7 @@ grafana:
     nginx.ingress.kubernetes.io/ssl-redirect: "false"
     nginx.ingress.kubernetes.io/rewrite-target: /
   nodeSelector:
-    node-role.kubernetes.io/master: ""
+    node-role.kubernetes.io/control-plane: ""
   service:
     type: NodePort
     nodePort: 30200
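All of these monitoring components now schedule onto nodes labeled node-role.kubernetes.io/control-plane, which replaced the master label in Kubernetes 1.24. A quick pre-deploy check, assuming kubectl access and the monitoring namespace (the namespace name is an assumption here):

```bash
# Confirm at least one node carries the label the nodeSelectors target.
kubectl get nodes -l 'node-role.kubernetes.io/control-plane' -o name
# If nothing matches, the monitoring pods will sit in Pending:
kubectl get pods -n monitoring --field-selector=status.phase=Pending
```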
2 changes: 1 addition & 1 deletion docs/k8s-cluster/helm.md
@@ -21,5 +21,5 @@ If the value of `helm_enabled` was set to `false` in the `config/kube.yml` f
 ```bash
 kubectl create sa tiller --namespace kube-system
 kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
-helm init --service-account tiller --node-selectors node-role.kubernetes.io/master=true
+helm init --service-account tiller --node-selectors node-role.kubernetes.io/control-plane=true
 ```
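`helm init` here is the Helm v2 workflow, so Tiller must come up on a control-plane node before charts can install; a verification sketch, assuming the labels Helm v2 conventionally applies to the Tiller deployment:

```bash
# Check that the Tiller pod scheduled and is Running (Helm v2 only).
kubectl get pods -n kube-system -l app=helm,name=tiller -o wide
```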
5 changes: 3 additions & 2 deletions playbooks/container/docker.yml
@@ -6,11 +6,12 @@
     - docker
   vars_files:
     # include kubespray-defaults here so that we can set the facts using the
-    # kubespray 0040-set_facts.yml tasks
+    # kubespray 0020-set_facts.yml tasks
     - ../../submodules/kubespray/roles/kubespray-defaults/defaults/main.yaml
+    - ../../submodules/kubespray/roles/kubernetes/preinstall/defaults/main.yml
   tasks:
     - name: include kubespray task to set facts required for docker role
-      include: ../../submodules/kubespray/roles/kubernetes/preinstall/tasks/0040-set_facts.yml
+      include: ../../submodules/kubespray/roles/kubernetes/preinstall/tasks/0020-set_facts.yml
       when: docker_install | default('yes')
     - name: remove docker overrides, specifically to deal with conflicting options from DGX OS
       file:
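The include was renumbered because kubespray renamed its preinstall fact-gathering task file from 0040- to 0020-set_facts.yml. Running the playbook standalone is unchanged; a sketch, assuming the standard DeepOps inventory path:

```bash
# Sketch: apply the Docker playbook on its own.
ansible-playbook -i config/inventory playbooks/container/docker.yml
```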
2 changes: 1 addition & 1 deletion playbooks/k8s-cluster.yml
@@ -267,7 +267,7 @@
       command: "/usr/local/bin/helm repo add 'stable' 'https://charts.helm.sh/stable' --force-update"
       delegate_to: localhost
     - name: kubeadm | Remove taint for master with node role
-      command: "{{ artifacts_dir }}/kubectl --kubeconfig {{ artifacts_dir }}/admin.conf taint node {{ inventory_hostname }} node-role.kubernetes.io/master:NoSchedule-"
+      command: "{{ artifacts_dir }}/kubectl --kubeconfig {{ artifacts_dir }}/admin.conf taint node {{ inventory_hostname }} node-role.kubernetes.io/control-plane:NoSchedule-"
       delegate_to: localhost
       failed_when: false # Taint will not be present if kube-master also under kube-node
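The taint key changed along with the node label in Kubernetes 1.24+; after the play runs, control-plane nodes should no longer carry the NoSchedule taint. A verification sketch, assuming kubectl access:

```bash
# List remaining control-plane taints (empty output means they were removed).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' \
  | grep control-plane
```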
@@ -4,7 +4,7 @@ persistence:
 {% if container_registry_persistence_enabled %}size: "{{ container_registry_storage_size }}"{% endif %}
 
 nodeSelector:
-  node-role.kubernetes.io/master: ""
+  node-role.kubernetes.io/control-plane: ""
 service:
   type: "{{ container_registry_service_type }}"
 {% if container_registry_service_type == "NodePort" %}nodePort: "{{ container_registry_node_port }}"{% endif %}
2 changes: 1 addition & 1 deletion roles/nhc/defaults/main.yml
@@ -1,5 +1,5 @@
 ---
-nhc_version: "1.4.2"
+nhc_version: "1.4.3"
 nhc_src_url: "https://github.com/mej/nhc/releases/download/{{ nhc_version }}/lbnl-nhc-{{ nhc_version }}.tar.xz"
 nhc_install_dir: "/usr"
 nhc_config_dir: "/etc"
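Since nhc_src_url is templated from nhc_version, the bump to 1.4.3 changes which tarball gets downloaded; a quick check that the release asset resolves, with the template variables expanded by hand:

```bash
# Verify the 1.4.3 NHC release tarball exists before running the role (sketch).
curl -fsIL "https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3.tar.xz" | head -n 1
```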
6 changes: 6 additions & 0 deletions roles/nhc/molecule/default/molecule.yml
@@ -16,6 +16,12 @@ platforms:
       - /sys/fs/cgroup:/sys/fs/cgroup:ro
     privileged: true
     pre_build_image: true
+  - name: nhc-ubuntu-2204
+    image: geerlingguy/docker-ubuntu2204-ansible
+    volumes:
+      - /sys/fs/cgroup:/sys/fs/cgroup:ro
+    privileged: true
+    pre_build_image: true
   - name: nhc-centos-7
     image: geerlingguy/docker-centos7-ansible
     volumes:
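The new platform entry lets Molecule exercise the role against Ubuntu 22.04 alongside the existing images; a usage sketch, assuming molecule with the Docker driver is installed:

```bash
# Run the nhc role's Molecule scenario, or converge only the new platform.
cd roles/nhc
molecule test
molecule converge -- --limit nhc-ubuntu-2204
```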
5 changes: 5 additions & 0 deletions roles/nhc/vars/ubuntu-22.04.yml
@@ -0,0 +1,5 @@
+---
+nhc_build_deps:
+  - build-essential
+
+nhc_ssh_daemon: "sshd:"
4 changes: 2 additions & 2 deletions roles/nvidia-gpu-operator/defaults/main.yml
@@ -12,7 +12,7 @@ gpu_operator_nvaie_helm_repo: "https://helm.ngc.nvidia.com/nvaie"
 gpu_operator_nvaie_chart_name: "nvaie/gpu-operator"
 
 # NVAIE GPU Operator may require different version, check NGC enterprise collection.
-gpu_operator_chart_version: "v22.9.2"
+gpu_operator_chart_version: "v23.3.2"
 
 k8s_gpu_mig_strategy: "mixed"
@@ -33,7 +33,7 @@ gpu_operator_grid_config_dir: "{{ deepops_dir }}/gpu_operator"
 # Defaults from https://github.com/NVIDIA/gpu-operator/blob/master/deployments/gpu-operator/values.yaml
 gpu_operator_default_runtime: "containerd"
 gpu_operator_driver_registry: "nvcr.io/nvidia"
-gpu_operator_driver_version: "525.85.12"
+gpu_operator_driver_version: "525.105.17"
 
 # This enables/disables NVAIE
 gpu_operator_nvaie_enable: false
2 changes: 1 addition & 1 deletion roles/nvidia-k8s-gpu-device-plugin/defaults/main.yml
@@ -2,6 +2,6 @@
 k8s_gpu_plugin_helm_repo: "https://nvidia.github.io/k8s-device-plugin"
 k8s_gpu_plugin_chart_name: "nvdp/nvidia-device-plugin"
 k8s_gpu_plugin_release_name: "nvidia-device-plugin"
-k8s_gpu_plugin_chart_version: "0.13.0"
+k8s_gpu_plugin_chart_version: "0.14.0"
 k8s_gpu_plugin_init_error: "false"
 k8s_gpu_mig_strategy: "mixed"
2 changes: 1 addition & 1 deletion roles/nvidia-k8s-gpu-feature-discovery/defaults/main.yml
@@ -2,5 +2,5 @@
 k8s_gpu_feature_discovery_helm_repo: "https://nvidia.github.io/gpu-feature-discovery"
 k8s_gpu_feature_discovery_chart_name: "nvgfd/gpu-feature-discovery"
 k8s_gpu_feature_discovery_release_name: "gpu-feature-discovery"
-k8s_gpu_feature_discovery_chart_version: "0.7.0"
+k8s_gpu_feature_discovery_chart_version: "0.8.0"
 k8s_gpu_mig_strategy: "mixed"
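The chart version bumps above can be sanity-checked against the upstream Helm repositories before deploying; a sketch using the repo URLs from these defaults:

```bash
# Confirm the pinned chart versions exist upstream (sketch).
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update
helm search repo nvdp/nvidia-device-plugin --versions | grep 0.14.0
helm search repo nvgfd/gpu-feature-discovery --versions | grep 0.8.0
```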