The `/metrics` end-point exposes metrics from the SSV node to Prometheus.
Prometheus should also hit the `/health` end-point in order to collect the health check metrics.
Even if Prometheus is not configured, the `/health` end-point can simply be polled by any HTTP client, as it doesn't contain metrics.
See below for the configuration of a local Prometheus service.
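A local Prometheus service can scrape both end-points with a configuration along these lines (the job names and target address are assumptions for this setup, with the node listening on port 15000):

```yaml
# prometheus.yml -- illustrative scrape config for a local SSV node
scrape_configs:
  - job_name: ssv
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:15000']
  - job_name: ssv_health
    metrics_path: /health
    static_configs:
      - targets: ['localhost:15000']
```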
`MetricsAPIPort` is used to enable Prometheus metrics collection.

Example:

```yaml
MetricsAPIPort: 15000
```

Or as an environment variable:

```
METRICS_API_PORT=15000
```
The following metrics are available:

| Metric | Description |
| --- | --- |
| `go_*` | Default Go runtime metrics exposed by the Prometheus client |
| `ssv:node_status` | Health check status of operator node |
| `ssv:eth1:node_status` | Health check status of eth1 node |
| `ssv:beacon:node_status` | Health check status of beacon node |
| `ssv:network:connected_peers{pubKey}` | Count connected peers for a validator |
| `ssv:network:ibft_decided_messages_outbound{topic}` | Count IBFT decided messages outbound |
| `ssv:network:ibft_messages_outbound{topic}` | Count IBFT messages outbound |
| `ssv:network:net_messages_inbound{topic}` | Count incoming network messages |
| `ssv:validator:ibft_highest_decided{lambda}` | The highest decided sequence number |
| `ssv:validator:ibft_round{lambda}` | IBFT round |
| `ssv:validator:ibft_stage{lambda}` | IBFT stage |
| `ssv:validator:ibft_current_slot{pubKey}` | Current running slot |
| `ssv:validator:running_ibfts_count{pubKey}` | Count running IBFTs by validator pub key |
| `ssv:validator:running_ibfts_count_all` | Count all running IBFTs |
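These metrics are served in the standard Prometheus text exposition format. As an illustration (the sample payload below is made up, not actual node output), a minimal Go sketch that extracts the `ssv:`-prefixed metric names from such a response body:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ssvMetricNames returns the names of ssv metrics found in a
// Prometheus text-format payload, skipping comments and non-ssv lines.
func ssvMetricNames(body string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // drop HELP/TYPE comments and blank lines
		}
		if !strings.HasPrefix(line, "ssv:") {
			continue // e.g. go_* runtime metrics
		}
		// the metric name ends at '{' (labels) or at the first space (value)
		end := strings.IndexAny(line, "{ ")
		if end == -1 {
			end = len(line)
		}
		names = append(names, line[:end])
	}
	return names
}

func main() {
	// Illustrative payload; real output comes from GET /metrics.
	sample := `# HELP ssv:node_status health check status of operator node
# TYPE ssv:node_status gauge
ssv:node_status 1
go_goroutines 42
ssv:network:connected_peers{pubKey="abc"} 5`
	fmt.Println(ssvMetricNames(sample))
}
```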
In order to set up a Grafana dashboard, do the following:

- Enable metrics (`MetricsAPIPort`)
- Set up Prometheus as mentioned in the beginning of this document and add it as a data source
- Job name is assumed to be 'ssv'
- Import dashboards to Grafana
- Align dashboard variables:
  - `instance` - container name, used in the 'instance' field for metrics coming from Prometheus. In the given dashboard, instance names are `ssv-node-v2-<i>`; make sure to change them according to your setup
  - `validator_dashboard_id` - exists only in the operator dashboard, points to the validator dashboard
Note: In order to show Process Health panels, the following K8S metrics should be exposed:

- `kubelet_volume_stats_used_bytes`
- `container_cpu_usage_seconds_total`
- `container_memory_working_set_bytes`
The health check route is available via `GET /health`.

If the node is healthy, it returns HTTP code `200` with an empty response:

```shell
$ curl http://localhost:15000/health
```

If the node is not healthy, the corresponding errors are returned with HTTP code `500`:

```shell
$ curl http://localhost:15000/health
{"errors": ["could not sync eth1 events"]}
```
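This contract can be sketched with a minimal handler; this is an illustration only, not the node's actual implementation, and `collectErrors` is a hypothetical stand-in for the node's internal checks:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// collectErrors is a hypothetical stand-in for the node's internal
// health checks (e.g. eth1 sync, beacon connectivity).
func collectErrors() []string {
	return nil // healthy
}

// healthResponse maps health check errors to the status code and body
// described above: 200 with an empty body when healthy, 500 with a
// JSON error list otherwise.
func healthResponse(errs []string) (int, string) {
	if len(errs) == 0 {
		return http.StatusOK, ""
	}
	body, _ := json.Marshal(map[string][]string{"errors": errs})
	return http.StatusInternalServerError, string(body)
}

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		code, body := healthResponse(collectErrors())
		if body != "" {
			w.Header().Set("Content-Type", "application/json")
		}
		w.WriteHeader(code)
		w.Write([]byte(body))
	})
	log.Fatal(http.ListenAndServe(":15000", nil))
}
```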
Profiling can be enabled via config:

```yaml
EnableProfile: true
```

All the default `pprof` routes are available via HTTP:

```shell
$ curl http://localhost:15000/debug/pprof/goroutine?minutes\=20 --output goroutines.tar.gz
```

Open with the Go CLI:

```shell
$ go tool pprof goroutines.tar.gz
```

Or with the web UI:

```shell
$ go tool pprof -web goroutines.tar.gz
```

Another option is to visualize results in the web UI directly:

```shell
$ go tool pprof -web http://localhost:15001/debug/pprof/heap?minutes=5
```