Add GPU and jupyterhub monitoring #237

Merged Jun 26, 2023 (27 commits). Changes shown from all commits.

Commits
a9a11aa  Adding slurm-job-exporter (guilbaults, Oct 6, 2022)
a30eb60  Define profile::metrics::slurm_exporter (cmd-ntrf, May 10, 2023)
e967fbd  Add gpu monitoring (cmd-ntrf, May 10, 2023)
3bf1ea5  Fix arrow indentation in metrics (cmd-ntrf, May 10, 2023)
090562b  Fix profile::gpu::monitoring (cmd-ntrf, May 11, 2023)
fb7555b  Fix arrow in metrics.pp (cmd-ntrf, May 10, 2023)
08cd07b  Add python3-psutil as requirement for slurm-job-exporter (cmd-ntrf, May 10, 2023)
0b98fe9  Change provider of slurm-job-exporter to fix dep issue (cmd-ntrf, Jan 18, 2023)
1158b31  Define a variable for slurm-job-exporter version (cmd-ntrf, Jan 18, 2023)
b5b0f60  Install prometheus_client with yum instead of pip (cmd-ntrf, Jan 18, 2023)
bec1c1b  Add comments on metrics class (cmd-ntrf, Jan 18, 2023)
213548c  Move prometheus scrape config to hieradata (cmd-ntrf, Jan 19, 2023)
4590e3a  Update node_exporter version (cmd-ntrf, Jan 19, 2023)
7f43467  Bump puppet-prometheus (cmd-ntrf, Jan 19, 2023)
efda95a  Add scraping of jupyterhub prometheus exporter (cmd-ntrf, Jan 19, 2023)
611a758  Lint metrics.pp (cmd-ntrf, May 10, 2023)
1572f87  Add epel yumrepo to python3-prometheus_client install (cmd-ntrf, Jan 19, 2023)
ea5de3e  Change source of prometheus-slurm-exporter (cmd-ntrf, May 10, 2023)
faaf7d3  Add requirement of Wait_for['slurmctldhost_set'] to slurm-exporter (cmd-ntrf, Jan 23, 2023)
6ebf0eb  Fix prometheus-slurm-exporter path (cmd-ntrf, May 11, 2023)
aeb9be2  Add missing python3 require to nvidia-ml-py (cmd-ntrf, May 11, 2023)
39ee1f0  Use py3_version when install nvidia-ml-py (cmd-ntrf, May 15, 2023)
dbb6d7b  Add datacenter-gpu-manager when using gpu passthrough (cmd-ntrf, May 11, 2023)
436f052  Remove version of datacenter-gpu-manager (cmd-ntrf, May 11, 2023)
abbd89c  Move service nvidia-dcgm in profile::gpu (cmd-ntrf, May 11, 2023)
0698c7f  Create a parameter for nvidia-ml-py version (cmd-ntrf, Jun 26, 2023)
cf3c268  Bump puppet-prometheus to 12.5.0 (cmd-ntrf, Jun 26, 2023)

Puppetfile (2 changes: 1 addition & 1 deletion)

@@ -20,7 +20,7 @@ mod 'puppet-fail2ban', '3.3.0'
mod 'puppet-healthcheck', '1.0.1'
mod 'puppet-logrotate', '5.0.0'
mod 'puppet-nodejs', '8.1.0'
- mod 'puppet-prometheus', '10.2.0'
+ mod 'puppet-prometheus', '12.5.0'
mod 'puppet-selinux', '3.2.0'
mod 'puppet-squid', '3.0.0'
mod 'puppet-staging', '3.2.0'

bootstrap.sh (1 change: 1 addition & 0 deletions)

@@ -5,6 +5,7 @@ PATH=$PATH:/opt/puppetlabs/puppet/bin
PKCS7_KEY="/etc/puppetlabs/puppet/eyaml/boot_public_key.pkcs7.pem"
ENC_CMD="eyaml encrypt -o block --pkcs7-public-key=${PKCS7_KEY}"
(
+ $ENC_CMD -l 'jupyterhub::prometheus_token' -s $(uuidgen)
$ENC_CMD -l 'profile::consul::acl_api_token' -s $(uuidgen)
$ENC_CMD -l 'profile::slurm::base::munge_key' -s $(openssl rand 1024 | openssl enc -A -base64)
$ENC_CMD -l 'profile::slurm::accounting::password' -s $(openssl rand -base64 9)

data/common.yaml (109 changes: 106 additions & 3 deletions)

@@ -76,6 +76,112 @@ squid::extra_config_sections:
maximum_object_size: "131072 KB"

profile::base::version: 12.5.0
+ prometheus::alerts:
+   groups:
+     - name: 'recorder.rules'
+       rules:
+         - record: slurm_job:allocated_core:count
+           expr: count(slurm_job_core_usage_total)
+         - record: slurm_job:allocated_core:count_user_account
+           expr: count(slurm_job_core_usage_total) by (user,account)
+         - record: slurm_job:used_core:sum
+           expr: sum(rate(slurm_job_core_usage_total{}[2m]) / 1000000000)
+         - record: slurm_job:used_core:sum_user_account
+           expr: sum(rate(slurm_job_core_usage_total{}[2m]) / 1000000000) by (user,account)
+         - record: slurm_job:allocated_memory:sum
+           expr: sum(slurm_job_memory_limit{})
+         - record: slurm_job:allocated_memory:sum_user_account
+           expr: sum(slurm_job_memory_limit{}) by (user,account)
+         - record: slurm_job:rss_memory:sum
+           expr: sum(slurm_job_memory_rss)
+         - record: slurm_job:rss_memory:sum_user_account
+           expr: sum(slurm_job_memory_rss) by (user, account)
+         - record: slurm_job:max_memory:sum_user_account
+           expr: sum(slurm_job_memory_max) by (user, account)
+         - record: slurm_job:allocated_gpu:count
+           expr: count(slurm_job_utilization_gpu)
+         - record: slurm_job:allocated_gpu:count_user_account
+           expr: count(slurm_job_utilization_gpu) by (user, account)
+         - record: slurm_job:used_gpu:sum
+           expr: sum(slurm_job_utilization_gpu) / 100
+         - record: slurm_job:used_gpu:sum_user_account
+           expr: sum(slurm_job_utilization_gpu) by (user,account) / 100
+         - record: slurm_job:non_idle_gpu:sum_user_account
+           expr: count(slurm_job_utilization_gpu > 0) by (user,account)
+         - record: slurm_job:power_gpu:sum
+           expr: sum(slurm_job_power_gpu)
+         - record: slurm_job:power_gpu:sum_user_account
+           expr: sum(slurm_job_power_gpu) by (user,account)

+ prometheus::node_exporter::version: 1.5.0
+ prometheus::server::version: 2.39.0
+ prometheus::server::scrape_configs:
+   - job_name: node
+     scrape_interval: 10s
+     scrape_timeout: 10s
+     honor_labels: true
+     consul_sd_configs:
+       - server: 127.0.0.1:8500
+         token: "%{hiera('profile::consul::acl_api_token')}"
+     relabel_configs:
+       - source_labels:
+           - __meta_consul_tags
+         regex: '.*,node-exporter,.*'
+         action: keep
+       - source_labels:
+           - __meta_consul_node
+         target_label: instance
+   - job_name: slurm_job
+     scrape_interval: 10s
+     scrape_timeout: 10s
+     honor_labels: true
+     consul_sd_configs:
+       - server: 127.0.0.1:8500
+         token: "%{hiera('profile::consul::acl_api_token')}"
+     relabel_configs:
+       - source_labels:
+           - __meta_consul_tags
+         regex: '.*,slurm-job-exporter,.*'
+         action: keep
+       - source_labels:
+           - __meta_consul_node
+         target_label: instance
+   - job_name: prometheus-slurm-exporter
+     scrape_interval: 10s
+     scrape_timeout: 10s
+     honor_labels: true
+     consul_sd_configs:
+       - server: 127.0.0.1:8500
+         token: "%{hiera('profile::consul::acl_api_token')}"
+     relabel_configs:
+       - source_labels:
+           - __meta_consul_tags
+         regex: '.*,slurm-exporter,.*'
+         action: keep
+       - source_labels:
+           - __meta_consul_node
+         target_label: instance
+   - job_name: jupyterhub
+     scrape_interval: 10s
+     scrape_timeout: 10s
+     honor_labels: true
+     authorization:
+       type: Bearer
+       credentials: "%{hiera('jupyterhub::prometheus_token')}"
+     consul_sd_configs:
+       - server: 127.0.0.1:8500
+         token: "%{hiera('profile::consul::acl_api_token')}"
+     relabel_configs:
+       - source_labels:
+           - __meta_consul_tags
+         regex: '.*,jupyterhub,.*'
+         action: keep
+       - source_labels:
+           - __meta_consul_node
+         target_label: instance

+ prometheus::storage_retention: '48h'
+ prometheus::storage_retention_size: '5GB'

profile::squid::server::port: 3128
profile::squid::server::cache_size: 4096
@@ -102,9 +102,6 @@ profile::slurm::base::slurm_version: '21.08'
profile::slurm::base::os_reserved_memory: 512
profile::slurm::controller::autoscale_version: '0.2.3'

- prometheus::storage_retention: '48h'
- prometheus::storage_retention_size: '5GB'

profile::accounts::project_regex: '(ctb|def|rpp|rrg)-[a-z0-9_-]*'
profile::users::ldap::access_tags: ['login:sshd', 'node:sshd', 'proxy:jupyterhub-login']
profile::users::ldap::users:
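The recording rules above precompute per-user and per-account aggregates (cores, memory, GPUs) from the raw slurm-job-exporter series, so dashboards can query the short slurm_job:* names instead of re-evaluating rate() over raw counters on every refresh. The four scrape jobs all follow the same Consul service-discovery pattern: keep only the targets whose Consul tags contain the exporter's tag, then copy the Consul node name into the instance label. This change ships only record: rules, but the same prometheus::alerts key also accepts alerting rules; a hypothetical alert built on the recorded series (name and threshold are illustrative, not part of this change) could look like:

  prometheus::alerts:
    groups:
      - name: 'alert.rules'
        rules:
          - alert: MostlyIdleGPUs
            # Average utilization of the GPUs allocated to a user/account
            # stays below 5% for one hour.
            expr: slurm_job:used_gpu:sum_user_account / slurm_job:allocated_gpu:count_user_account < 0.05
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "GPUs allocated by {{ $labels.user }} ({{ $labels.account }}) are mostly idle"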

manifests/site.pp (5 changes: 4 additions & 1 deletion)

@@ -12,8 +12,8 @@

include profile::base
include profile::users::local
- include profile::metrics::exporter
include profile::sssd::client
+ include profile::metrics::node_exporter

if 'login' in $instance_tags {
include profile::fail2ban
@@ -26,6 +26,7 @@
include profile::freeipa::server

include profile::metrics::server
+ include profile::metrics::slurm_exporter
include profile::rsyslog::server
include profile::squid::server
include profile::slurm::controller
@@ -49,6 +50,8 @@
include profile::ssh::hostbased_auth::client
include profile::ssh::hostbased_auth::server

+ include profile::metrics::slurm_job_exporter

Class['profile::nfs::client'] -> Service['slurmd']
Class['profile::gpu'] -> Service['slurmd']
}
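Taken together, the three site.pp hunks place each exporter by instance role: every instance runs profile::metrics::node_exporter; the management hunk, which already includes profile::metrics::server, also gains profile::metrics::slurm_exporter next to the Slurm controller; and the compute-node hunk, which manages Service['slurmd'], gains profile::metrics::slurm_job_exporter.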

site/profile/files/metrics/prometheus-slurm-exporter.service (16 changes: 16 additions & 0 deletions, new file)

@@ -0,0 +1,16 @@
+ [Unit]
+ Description=Exporter for slurm stats
+ After=network.target

+ [Service]
+ User=slurm
+ Group=slurm
+ Type=simple
+ ExecStart=/usr/bin/prometheus-slurm-exporter --collector.partition --listen-address=":8081"
+ PIDFile=/var/run/prometheus-slurm-exporter/prometheus-slurm-exporter.pid
+ KillMode=process
+ Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/bin:/opt/software/slurm/bin:/root/bin
+ Restart=always

+ [Install]
+ WantedBy=multi-user.target

site/profile/manifests/gpu.pp (20 changes: 20 additions & 0 deletions)

@@ -6,6 +6,10 @@
ensure => 'running',
enable => true,
}
+ service { 'nvidia-dcgm':
+   ensure => 'running',
+   enable => true,
+ }
} else {
service { 'nvidia-gridd':
ensure => 'running',
@@ -101,6 +105,9 @@
notify => Exec['nvidia-symlink'],
}

+ # Used by slurm-job-exporter to export GPU metrics
+ -> package { 'datacenter-gpu-manager': }

-> file { '/run/nvidia-persistenced':
ensure => directory,
owner => 'nvidia-persistenced',
@@ -125,13 +132,26 @@

class profile::gpu::install::vgpu (
  Enum['rpm', 'bin', 'none'] $installer = 'none',
+ String $nvidia_ml_py_version = '11.515.75',
) {
  if $installer == 'rpm' {
    include profile::gpu::install::vgpu::rpm
  } elsif $installer == 'bin' {
    # install from binary installer
    include profile::gpu::install::vgpu::bin
  }

+ # Used by slurm-job-exporter to export GPU metrics.
+ # DCGM does not work with GRID vGPU; most of the stats are missing.
+ ensure_packages(['python3'], { ensure => 'present' })
+ $py3_version = lookup('os::redhat::python3::version')

+ exec { 'pip install nvidia-ml-py':
+   command => "/usr/bin/pip${py3_version} install --force-reinstall nvidia-ml-py==${nvidia_ml_py_version}",
+   creates => "/usr/local/lib/python${py3_version}/site-packages/pynvml.py",
+   before  => Service['slurm-job-exporter'],
+   require => Package['python3'],
+ }
}

class profile::gpu::install::vgpu::rpm (
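Both knobs of profile::gpu::install::vgpu are class parameters, so a site can select the vGPU installer and pin the nvidia-ml-py release from hieradata through Puppet's automatic class-parameter lookup. A hypothetical override ('bin' is one of the three accepted installer values; the version shown restates the default from gpu.pp, whose 11.515.x numbering pairs with the 515 driver branch):

  profile::gpu::install::vgpu::installer: 'bin'
  profile::gpu::install::vgpu::nvidia_ml_py_version: '11.515.75'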

site/profile/manifests/jupyterhub.pp (6 changes: 6 additions & 0 deletions)

@@ -15,6 +15,12 @@
),
}
include profile::slurm::submitter

+ consul::service { 'jupyterhub':
+   port  => 8081,
+   tags  => ['jupyterhub'],
+   token => lookup('profile::consul::acl_api_token'),
+ }
}

class profile::jupyterhub::node {
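With this service registered in Consul under the jupyterhub tag, the jupyterhub scrape job defined in data/common.yaml discovers the hub on port 8081 and authenticates with the bearer token generated by bootstrap.sh (jupyterhub::prometheus_token).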

site/profile/manifests/metrics.pp (113 changes: 84 additions & 29 deletions)

@@ -1,39 +1,94 @@
- class profile::metrics::exporter {
+ # Configure a Prometheus exporter that exports server usage metrics, for example:
+ # - CPU usage
+ # - memory usage
+ # It should run on every server of the cluster.
+ class profile::metrics::node_exporter {
    include prometheus::node_exporter
    consul::service { 'node-exporter':
      port  => 9100,
-     tags  => ['monitor'],
+     tags  => ['node-exporter'],
      token => lookup('profile::consul::acl_api_token'),
    }
  }

- class profile::metrics::server {
-   class { 'prometheus::server':
-     version        => '2.11.1',
-     scrape_configs => [
-       {
-         'job_name'          => 'consul',
-         'scrape_interval'   => '10s',
-         'scrape_timeout'    => '10s',
-         'honor_labels'      => true,
-         'consul_sd_configs' => [
-           {
-             'server' => '127.0.0.1:8500',
-             'token'  => lookup('profile::consul::acl_api_token')
-           },
-         ],
-         'relabel_configs'   => [
-           {
-             'source_labels' => ['__meta_consul_tags'],
-             'regex'         => '.*,monitor,.*',
-             'action'        => 'keep'
-           },
-           {
-             'source_labels' => ['__meta_consul_node'],
-             'target_label'  => 'instance'
-           }
-         ],
-       },
+ # Configure a Prometheus exporter that exports the Slurm compute node metrics, for example:
+ # - job memory usage
+ # - job memory max
+ # - job memory limit
+ # - job core usage total
+ # - job process count
+ # - job threads count
+ # - job power gpu
+ # This exporter needs to run on compute nodes.
+ # @param version The version of the slurm job exporter to install
+ class profile::metrics::slurm_job_exporter (String $version = '0.0.10') {
+   consul::service { 'slurm-job-exporter':
+     port  => 9798,
+     tags  => ['slurm-job-exporter'],
+     token => lookup('profile::consul::acl_api_token'),
+   }

+   $el = $facts['os']['release']['major']
+   package { 'python3-prometheus_client':
+     require => Yumrepo['epel'],
+   }
+   package { 'slurm-job-exporter':
+     source   => "https://github.com/guilbaults/slurm-job-exporter/releases/download/v${version}/slurm-job-exporter-${version}-1.el${el}.noarch.rpm",
+     provider => 'yum',
+   }

+   service { 'slurm-job-exporter':
+     ensure  => 'running',
+     enable  => true,
+     require => [
+       Package['slurm-job-exporter'],
+       Package['python3-prometheus_client'],
+     ],
+   }
+ }

+ # Configure a Prometheus exporter that exports the Slurm scheduling metrics, for example:
+ # - allocated nodes
+ # - allocated gpus
+ # - pending jobs
+ # - completed jobs
+ # This exporter typically runs on the Slurm controller server, but it can run on any server
+ # with a functional Slurm command-line installation.
+ class profile::metrics::slurm_exporter {
+   consul::service { 'slurm-exporter':
+     port  => 8081,
+     tags  => ['slurm-exporter'],
+     token => lookup('profile::consul::acl_api_token'),
+   }

+   $slurm_exporter_url = 'https://download.copr.fedorainfracloud.org/results/cmdntrf/prometheus-slurm-exporter/'
+   yumrepo { 'prometheus-slurm-exporter-copr-repo':
+     enabled             => true,
+     descr               => 'Copr repo for prometheus-slurm-exporter owned by cmdntrf',
+     baseurl             => "${slurm_exporter_url}/epel-\$releasever-\$basearch/",
+     skip_if_unavailable => true,
+     gpgcheck            => 1,
+     gpgkey              => "${slurm_exporter_url}/pubkey.gpg",
+     repo_gpgcheck       => 0,
+   }
+   -> package { 'prometheus-slurm-exporter': }

+   file { '/etc/systemd/system/prometheus-slurm-exporter.service':
+     source => 'puppet:///modules/profile/metrics/prometheus-slurm-exporter.service',
+     notify => Service['prometheus-slurm-exporter'],
+   }

+   service { 'prometheus-slurm-exporter':
+     ensure  => 'running',
+     enable  => true,
+     require => [
+       Package['prometheus-slurm-exporter'],
+       File['/etc/systemd/system/prometheus-slurm-exporter.service'],
+       Wait_for['slurmctldhost_set'],
+     ],
+   }
+ }

+ class profile::metrics::server {
+   include prometheus::server
+ }
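Since profile::metrics::slurm_job_exporter takes the exporter release as a class parameter and builds the RPM URL from it, a deployment can pin a different release from hieradata without editing the manifest, again relying on Puppet's automatic class-parameter lookup (hypothetical entry, shown with the default value):

  profile::metrics::slurm_job_exporter::version: '0.0.10'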