Troubleshooting Hybrid Slurm Deployments

Logging

Logs from VMs created by the hybrid configuration are written to /var/log/slurm/*.log. The most pertinent logs are described below:

  • slurmctld.log: Logs from the Slurm controller daemon. Any issues with the config or permissions on the controller will be logged here.
  • slurmd.log: Logs from the Slurm daemon on the compute nodes. Any issues with the config or permissions on a compute node can be found here. Note: Viewing these logs requires connecting to the compute node over SSH and reading them directly.
  • resume.log: Output from the resume.py script that hybrid partitions use to create the burst VM instances. Any issues creating new compute VM nodes will be logged here.

In addition, any startup failures can be tracked through the logs at /var/log/messages for CentOS/RHEL based images and /var/log/syslog for Debian/Ubuntu based images. Instructions for viewing these logs can be found in the Google Cloud documentation.
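
As a starting point for triage, the following is a minimal Python sketch that collects suspicious lines from the Slurm logs listed above and reports which system log exists on the machine. The function names `find_problem_lines` and `system_log_path` are hypothetical helpers, not part of the toolkit, and the sketch assumes that problem lines contain `error:` or `fatal:`, which may not cover every failure.

```python
import re
from pathlib import Path

# Assumption: lines worth reading first contain "error:" or "fatal:".
PROBLEM = re.compile(r"\b(error|fatal):", re.IGNORECASE)


def find_problem_lines(log_dir="/var/log/slurm", pattern="*.log", limit=20):
    """Return {log file name: up to `limit` most recent matching lines}."""
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        try:
            with open(path, errors="replace") as fh:
                hits = [line.rstrip("\n") for line in fh if PROBLEM.search(line)]
        except OSError as exc:  # for example, permission denied
            hits = [f"could not read: {exc}"]
        if hits:
            results[path.name] = hits[-limit:]
    return results


def system_log_path():
    """Return whichever OS-level log named above exists, or None."""
    for candidate in ("/var/log/messages", "/var/log/syslog"):
        if Path(candidate).exists():
            return candidate
    return None
```

Calling `find_problem_lines()` on the controller covers slurmctld.log and resume.log; for slurmd.log the same call would need to run on the compute node itself.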

Debug Settings

If the standard logging information is not sufficient, it is possible to increase the verbosity of the slurmctld and slurmd logs with the following variables in the slurm.conf file:

  • SlurmctldDebug
  • SlurmdDebug

For more information on which logging level to select, see the entries for these variables in the official SchedMD slurm.conf documentation.

In addition to the logging levels, debug flags relevant to the hybrid configuration can be set. In particular, the Power flag of the DebugFlags option provides useful information about the power saving functionality used to implement the hybrid partitions.
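
To make the settings above concrete, here is a hedged Python sketch that rewrites the text of a slurm.conf file so the debug options are set. It assumes the verbosity variables in question are `SlurmctldDebug` and `SlurmdDebug`; the `debug` level is only an illustrative value, and `apply_settings` is a hypothetical helper rather than anything shipped with Slurm or the toolkit.

```python
import re

# Illustrative values only; pick levels from the SchedMD slurm.conf documentation.
DEBUG_SETTINGS = {
    "SlurmctldDebug": "debug",
    "SlurmdDebug": "debug",
    "DebugFlags": "Power",
}


def apply_settings(conf_text, settings):
    """Return conf_text with each key set.

    The first uncommented line for a key is replaced; keys that are not
    present are appended. Note that replacing DebugFlags overwrites any
    flags that were already listed on that line.
    """
    lines = conf_text.splitlines()
    remaining = dict(settings)
    for i, line in enumerate(lines):
        for key in list(remaining):
            # Parameter names in slurm.conf are case insensitive.
            if re.match(rf"\s*{re.escape(key)}\s*=", line, re.IGNORECASE):
                lines[i] = f"{key}={remaining.pop(key)}"
                break
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"
```

A typical use would be to read slurm.conf, pass its contents through `apply_settings(text, DEBUG_SETTINGS)`, review the result, and write it back; the daemons generally need to re-read the file (for example with `scontrol reconfigure`) before the new levels apply.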