Install metrics exporter packages on your HyperPod cluster - Amazon SageMaker

The base configuration lifecycle scripts that the SageMaker HyperPod team provides also include installation of various metric exporter packages. To activate the installation step, set the parameter enable_observability=True in the config.py file. The lifecycle scripts are designed to bootstrap your cluster with the following open-source metric exporter packages.
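As a minimal sketch, assuming config.py defines a Config class (the lifecycle_script.py excerpt on this page reads Config.enable_observability), the relevant setting could look like the following. Any other settings in the real file are omitted here.

```python
# config.py (sketch; only the attribute referenced on this page is shown)
class Config:
    # Set to True so that lifecycle_script.py runs the exporter installation step.
    enable_observability = True


print(Config.enable_observability)
```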

| Name | Script deployment target node | Exporter description |
| --- | --- | --- |
| Slurm exporter for Prometheus | Head (controller) node | Exports Slurm accounting metrics. |
| Elastic Fabric Adapter (EFA) node exporter | Compute node | Exports metrics from cluster nodes and EFA. The package is a fork of the Prometheus node exporter. |
| NVIDIA Data Center GPU Management (DCGM) exporter | Compute node | Exports NVIDIA DCGM metrics about health and performance of NVIDIA GPUs. |

With enable_observability=True in the config.py file, the following installation step is activated in the lifecycle_script.py script.

# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()
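To see at a glance which installers run on which node, the branching in the excerpt above can be mirrored in a small dry-run helper. This is only a sketch: planned_scripts is a hypothetical function, the SlurmNodeType enum here is a stand-in whose values are assumptions, and the wait_for_scontrol() call on the head node is not modeled.

```python
from enum import Enum


class SlurmNodeType(Enum):
    # Stand-in for the enum used in lifecycle_script.py; the values are assumptions.
    HEAD_NODE = "controller"
    COMPUTE_NODE = "compute"


def planned_scripts(node_type, enable_observability):
    """Return the installer scripts the excerpt would run, in order (hypothetical helper)."""
    if not enable_observability:
        return []
    if node_type == SlurmNodeType.COMPUTE_NODE:
        return [
            "./utils/install_docker.sh",
            "./utils/install_dcgm_exporter.sh",
            "./utils/install_efa_node_exporter.sh",
        ]
    if node_type == SlurmNodeType.HEAD_NODE:
        return [
            "./utils/install_docker.sh",
            "./utils/install_slurm_exporter.sh",
            "./utils/install_prometheus.sh",
        ]
    return []


# Print the plan for each node type without executing anything.
for node_type in SlurmNodeType:
    print(node_type.name, planned_scripts(node_type, enable_observability=True))
```

Docker is installed first on both node types because the exporters run as containers.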

On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is a Prometheus exporter that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter gathers metrics related to the EFA network interface, which is essential for low-latency, high-bandwidth communication in HPC clusters.

On the head node, the script installs the Slurm exporter for Prometheus and the Prometheus open-source software. The Slurm exporter provides Prometheus with metrics related to Slurm jobs, partitions, and node states.
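Prometheus exporters such as these serve their metrics as plain text in the Prometheus exposition format. The sketch below shows one way to read such output, assuming samples carry no trailing timestamp. The parse_metrics helper is hypothetical, and the sample text is illustrative only, not captured from a real cluster.

```python
# Illustrative sample only; the values are made up for the example.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 12.5
"""


def parse_metrics(text):
    """Parse exposition-format text into (series, value) pairs (hypothetical helper).

    Assumes each sample line ends with its value and has no timestamp.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so that spaces inside label values are preserved.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


for series, value in parse_metrics(SAMPLE):
    print(series, value)
```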

Note that the lifecycle scripts install all the exporter packages as Docker containers, so Docker must also be installed on both the head and compute nodes. The scripts for these components are provided in the utils folder of the Awsome Distributed Training GitHub repository.

After you have set up your HyperPod cluster with the exporter packages installed, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.