

# Amazon SageMaker HyperPod Slurm metrics

Amazon SageMaker HyperPod provides a set of Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on your HyperPod clusters and are available in the `/aws/sagemaker/Clusters` CloudWatch namespace.

## Cluster-level metrics


The following cluster-level metrics are available for HyperPod. These metrics use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| cluster_node_count | Total number of nodes in the cluster | cluster_node_count | 
| cluster_idle_node_count | Number of idle nodes in the cluster | N/A | 
| cluster_failed_node_count | Number of failed nodes in the cluster | cluster_failed_node_count | 
| cluster_cpu_count | Total number of CPU cores in the cluster | node_cpu_limit | 
| cluster_idle_cpu_count | Number of idle CPU cores in the cluster | N/A | 
| cluster_gpu_count | Total number of GPUs in the cluster | node_gpu_limit | 
| cluster_idle_gpu_count | Number of idle GPUs in the cluster | N/A | 
| cluster_running_task_count | Number of running Slurm jobs in the cluster | N/A | 
| cluster_pending_task_count | Number of pending Slurm jobs in the cluster | N/A | 
| cluster_preempted_task_count | Number of preempted Slurm jobs in the cluster | N/A | 
| cluster_avg_task_wait_time | Average wait time of Slurm jobs in the cluster | N/A | 
| cluster_max_task_wait_time | Maximum wait time of Slurm jobs in the cluster | N/A | 

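As an illustration, you can retrieve any of the cluster-level metrics above through the CloudWatch `GetMetricData` API. The sketch below builds a query for `cluster_pending_task_count` in the `/aws/sagemaker/Clusters` namespace; the cluster ID, query `Id`, period, and statistic are illustrative choices, and the boto3 call itself is shown commented out because it requires AWS credentials.

```python
import datetime

def pending_task_query(cluster_id, period=300):
    """Build a GetMetricData query for the cluster_pending_task_count metric.

    cluster_id is a placeholder; substitute your HyperPod ClusterId.
    The period (300 s) and statistic (Average) are example values.
    """
    return {
        "Id": "pending_tasks",
        "MetricStat": {
            "Metric": {
                "Namespace": "/aws/sagemaker/Clusters",
                "MetricName": "cluster_pending_task_count",
                "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
            },
            "Period": period,
            "Stat": "Average",
        },
        "ReturnData": True,
    }

# To run the query against CloudWatch (requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# end = datetime.datetime.now(datetime.timezone.utc)
# start = end - datetime.timedelta(hours=1)
# resp = cw.get_metric_data(
#     MetricDataQueries=[pending_task_query("my-cluster-id")],
#     StartTime=start,
#     EndTime=end,
# )
```

The same query shape works for any metric in the table; only `MetricName` changes.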
## Instance-level metrics


The following instance-level metrics are available for HyperPod. These metrics also use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| node_gpu_utilization | Average GPU utilization across all instances | node_gpu_utilization | 
| node_gpu_memory_utilization | Average GPU memory utilization across all instances | node_gpu_memory_utilization | 
| node_cpu_utilization | Average CPU utilization across all instances | node_cpu_utilization | 
| node_memory_utilization | Average memory utilization across all instances | node_memory_utilization | 
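A common use of the instance-level metrics is alerting on underused hardware. The sketch below builds the parameters for a CloudWatch alarm on `node_gpu_utilization`; the alarm name, threshold, period, and evaluation settings are all illustrative choices, not prescribed values.

```python
def low_gpu_utilization_alarm(cluster_id, threshold=20.0):
    """Parameters for a CloudWatch alarm that fires when average GPU
    utilization across the cluster's instances stays below `threshold`
    percent for three consecutive 5-minute periods.

    cluster_id is a placeholder; the alarm name and numbers are examples.
    """
    return {
        "AlarmName": f"hyperpod-{cluster_id}-low-gpu-utilization",
        "Namespace": "/aws/sagemaker/Clusters",
        "MetricName": "node_gpu_utilization",
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# To create the alarm (requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# cw.put_metric_alarm(**low_gpu_utilization_alarm("my-cluster-id"))
```

`TreatMissingData="notBreaching"` keeps the alarm quiet while no data is being published, for example before the cluster's first GPU job runs.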