

# SageMaker HyperPod cluster resources monitoring

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, integrate the cluster with [Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.amazonaws.cn/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By leveraging these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

![An overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod-observability-architecture.png)


Figure: This architecture diagram shows an overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Proceed to the following topics to set up SageMaker HyperPod cluster observability.

**Topics**
+ [Prerequisites for SageMaker HyperPod cluster observability](sagemaker-hyperpod-cluster-observability-slurm-prerequisites.md)
+ [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md)
+ [Validating Prometheus setup on the head node of a HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup.md)
+ [Setting up an Amazon Managed Grafana workspace](sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws.md)
+ [Exported metrics reference](sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference.md)
+ [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md)

# Prerequisites for SageMaker HyperPod cluster observability

Before proceeding to [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md), ensure that the following prerequisites are met.

## Enable IAM Identity Center


To enable observability for your SageMaker HyperPod cluster, you must first enable IAM Identity Center. This is a prerequisite for deploying an Amazon CloudFormation stack that sets up the Amazon Managed Grafana workspace and Amazon Managed Service for Prometheus. Both services require IAM Identity Center for authentication and authorization, ensuring secure user access to and management of the monitoring infrastructure.

For detailed guidance on enabling IAM Identity Center, see the [Enabling IAM Identity Center](https://docs.amazonaws.cn/singlesignon/latest/userguide/get-set-up-for-idc.html) section in the *Amazon IAM Identity Center User Guide*. 

After successfully enabling IAM Identity Center, set up a user account that will serve as the administrative user throughout the following configuration procedures.

## Create and deploy an Amazon CloudFormation stack for SageMaker HyperPod observability


Create and deploy a CloudFormation stack for SageMaker HyperPod observability to monitor HyperPod cluster metrics in real time using Amazon Managed Service for Prometheus and Amazon Managed Grafana. Before you deploy the stack, make sure that you have enabled [IAM Identity Center](https://console.amazonaws.cn/singlesignon).

Use the sample CloudFormation template [cluster-observability.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml), which sets up the Amazon VPC subnets, Amazon FSx for Lustre file systems, Amazon S3 buckets, and IAM roles required to create a HyperPod cluster observability stack.

# Installing metrics exporter packages on your HyperPod cluster

The [base configuration lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that the SageMaker HyperPod team provides also include the installation of various metric exporter packages. To activate the installation step, set the parameter `enable_observability=True` in the [config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file. The lifecycle scripts are designed to bootstrap your cluster with the following open-source metric exporter packages.


| Name | Script deployment target node | Exporter description |
| --- | --- | --- |
| [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter) | Head (controller) node | Exports Slurm accounting metrics. |
| [Elastic Fabric Adapter (EFA) node exporter](https://github.com/aws-samples/awsome-distributed-training/tree/main/4.validation_and_observability/3.efa-node-exporter) | Compute node | Exports metrics from cluster nodes and EFA. The package is a fork of the [Prometheus node exporter](https://github.com/prometheus/node_exporter). |
| [NVIDIA Data Center GPU Management (DCGM) exporter](https://github.com/NVIDIA/dcgm-exporter) | Compute node | Exports NVIDIA DCGM metrics about the health and performance of NVIDIA GPUs. |
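
The exporters listen on fixed ports that the lifecycle scripts wire into the Prometheus scrape configuration: the Slurm exporter on port 8080 of the head node, and the DCGM and EFA node exporters on ports 9400 and 9100 of each compute node, as shown in the generated `prometheus.yml` later in this guide. As a rough sketch of how scrape targets map to these ports (the IP addresses below are hypothetical placeholders):

```python
# Sketch: map each exporter to the port the lifecycle scripts expose it on,
# and build Prometheus 'host:port' scrape targets. Ports come from the
# generated prometheus.yml; the host IPs are hypothetical placeholders.
EXPORTER_PORTS = {
    "slurm_exporter": 8080,      # runs on the head (controller) node
    "dcgm_exporter": 9400,       # runs on each compute node
    "efa_node_exporter": 9100,   # runs on each compute node
}

def scrape_targets(job_name, hosts):
    """Return 'host:port' target strings for one scrape job."""
    port = EXPORTER_PORTS[job_name]
    return [f"{host}:{port}" for host in hosts]

print(scrape_targets("dcgm_exporter", ["10.0.1.10", "10.0.1.11"]))
# ['10.0.1.10:9400', '10.0.1.11:9400']
```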

With `enable_observability=True` set in the [config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file, the following installation step is activated in the [lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script.

```
# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()
```

On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is an exporter for Prometheus that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter, on the other hand, gathers metrics related to the EFA network interface, which is essential for low-latency and high-bandwidth communication in HPC clusters.

On the head node, the script installs the Slurm exporter for Prometheus and the [Prometheus open-source software](https://prometheus.io/docs/introduction/overview/). The Slurm exporter provides Prometheus with metrics related to Slurm jobs, partitions, and node states.

Note that the lifecycle scripts install all of the exporter packages as Docker containers, so Docker must also be installed on both the head and compute nodes. The installation scripts for these components are provided in the [utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder of the *Awsome Distributed Training GitHub repository*.

After your HyperPod cluster is set up with the exporter packages installed, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.

# Validating Prometheus setup on the head node of a HyperPod cluster

After your HyperPod cluster is set up with the exporter packages installed, check that Prometheus is properly set up on the head node of the cluster.

1. Connect to the head node of your cluster. For instructions on accessing a node, see [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md).

1. Run the following command to verify that the Prometheus service created by the lifecycle script `install_prometheus.sh` is running on the controller node. The output should show the **Active** status as **active (running)**.

   ```
   $ sudo systemctl status prometheus
   ● prometheus.service - Prometheus Exporter
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset: disabled)
   Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; Ss ago
   Main PID: 12345 (prometheus)
   Tasks: 7 (limit: 9281)
   Memory: 35M
   CPU: 234ms
   CGroup: /system.slice/prometheus.service
           └─12345 /usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml
   ```

1. Validate the Prometheus configuration file as follows. The output should be similar to the following, with the three exporters configured with the correct compute node IP addresses.

   ```
   $ cat /etc/prometheus/prometheus.yml
   global:
     scrape_interval: 15s
     evaluation_interval: 15s
     scrape_timeout: 15s
   
   scrape_configs:
     - job_name: 'slurm_exporter'
       static_configs:
         - targets:
             - 'localhost:8080'
     - job_name: 'dcgm_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9400'
             - '<ComputeNodeIP>:9400'
     - job_name: 'efa_node_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9100'
             - '<ComputeNodeIP>:9100'
   
   remote_write:
     - url: <AMPRemoteWriteURL>
       queue_config:
         max_samples_per_send: 1000
         max_shards: 200
         capacity: 2500
       sigv4:
         region: <Region>
   ```

1. To test if Prometheus is exporting Slurm, DCGM, and EFA metrics properly, run the following `curl` command for Prometheus on port `:9090` on the head node.

   ```
   $ curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
   ```

   With the metrics exported to Amazon Managed Service for Prometheus Workspace through the Prometheus remote write configuration from the controller node, you can proceed to the next topic to set up Amazon Managed Grafana dashboards to display the metrics.
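
If you prefer to inspect the endpoint programmatically rather than with `curl` and `grep`, the same filtering can be sketched in Python. The sample payload below is illustrative, not real exporter output:

```python
# Sketch: filter a Prometheus /metrics text exposition for Slurm, DCGM,
# and EFA metric names, mirroring the `curl ... | grep` check above.
# The SAMPLE payload is hypothetical, not real exporter output.
import re

SAMPLE = """\
# HELP slurm_nodes_idle Number of idle nodes
slurm_nodes_idle 2
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
node_amazonefa_rx_pkts{device="rdmap0"} 1.2e+06
node_load1 0.42
"""

def filter_metrics(text, pattern=r"slurm|dcgm|efa"):
    """Return sample lines whose metric name matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    matched = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name = line.split("{")[0].split()[0]  # metric name before labels/value
        if rx.search(name):
            matched.append(line)
    return matched

for line in filter_metrics(SAMPLE):
    print(line)
```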

# Setting up an Amazon Managed Grafana workspace

Create a new Amazon Managed Grafana workspace or update an existing Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source.

**Topics**
+ [Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create)
+ [Open the Grafana workspace and finish setting up the data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source)
+ [Import open-source Grafana dashboards](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards)

## Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source


To visualize metrics from Amazon Managed Service for Prometheus, create an Amazon Managed Grafana workspace and set it up to use Amazon Managed Service for Prometheus as a data source.

1. To create a Grafana workspace, follow the instructions at [Creating a workspace](https://docs.amazonaws.cn/grafana/latest/userguide/AMG-create-workspace.html#creating-workspace) in the *Amazon Managed Grafana User Guide*.

   1. In Step 13, select Amazon Managed Service for Prometheus as the data source.

   1. In Step 17, you can add the admin user and also other users in your IAM Identity Center.

For more information, see also the following resources.
+ [Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/prometheus/latest/userguide/AMP-amg.html) in the *Amazon Managed Service for Prometheus User Guide*
+ [Use Amazon data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.amazonaws.cn/grafana/latest/userguide/AMP-adding-AWS-config.html) in the *Amazon Managed Grafana User Guide*

## Open the Grafana workspace and finish setting up the data source


After you have successfully created or updated an Amazon Managed Grafana workspace, select the workspace URL to open the workspace. You are prompted to enter the user name and password of a user that you set up in IAM Identity Center. Log in as the admin user to finish setting up the workspace.

1. In the workspace **Home** page, choose **Apps**, **Amazon Data Sources**, and **Data sources**.

1. On the **Data sources** page, choose the **Data sources** tab.

1. For **Service**, choose Amazon Managed Service for Prometheus.

1. In the **Browse and provision data sources** section, choose the Amazon Web Services Region where you provisioned your Amazon Managed Service for Prometheus workspace.

1. From the list of data sources in the selected Region, choose the one for Amazon Managed Service for Prometheus. Make sure that you check the resource ID and the resource alias of the Amazon Managed Service for Prometheus workspace that you set up for the HyperPod observability stack.

## Import open-source Grafana dashboards


After you've successfully set up your Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source, metrics begin flowing into Prometheus, and you should start seeing dashboards with charts and other information. The Grafana open-source community provides various dashboards, and you can import them into Amazon Managed Grafana.

**To import open-source Grafana dashboards to Amazon Managed Grafana**

1. In the **Home** page of your Amazon Managed Grafana workspace, choose **Dashboards**.

1. Choose the **New** drop-down menu, and select **Import**.

1. Paste the URL of the [Slurm Dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/).

   ```
   https://grafana.com/grafana/dashboards/4323-slurm-dashboard/
   ```

1. Select **Load**.

1. Repeat the previous steps to import the following dashboards.

   1. [Node Exporter Full Dashboard](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)

      ```
      https://grafana.com/grafana/dashboards/1860-node-exporter-full/
      ```

   1. [NVIDIA DCGM Exporter Dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/)

      ```
      https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
      ```

   1. [EFA Metrics Dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/)

      ```
      https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/
      ```

   1. [FSx for Lustre Metrics Dashboard](https://grafana.com/grafana/dashboards/20906-fsx-lustre/)

      ```
      https://grafana.com/grafana/dashboards/20906-fsx-lustre/
      ```

# Exported metrics reference

The following sections present comprehensive lists of metrics exported from SageMaker HyperPod to Amazon Managed Service for Prometheus upon the successful configuration of the Amazon CloudFormation stack for SageMaker HyperPod observability. You can start monitoring these metrics visualized in the Amazon Managed Grafana dashboards.

## Slurm exporter dashboard


Provides visualized information of Slurm clusters on SageMaker HyperPod.

**Types of metrics**
+ **Cluster Overview:** Displaying the total number of nodes, jobs, and their states.
+ **Job Metrics:** Visualizing job counts and states over time.
+ **Node Metrics:** Showing node states, allocation, and available resources.
+ **Partition Metrics:** Monitoring partition-specific metrics such as CPU, memory, and GPU utilization.
+ **Job Efficiency:** Calculating job efficiency based on resources utilized.

**List of metrics**


| Metric name | Description | 
| --- | --- | 
| slurm_job_count | Total number of jobs in the Slurm cluster |
| slurm_job_state_count | Count of jobs in each state (e.g., running, pending, completed) |
| slurm_node_count | Total number of nodes in the Slurm cluster |
| slurm_node_state_count | Count of nodes in each state (e.g., idle, alloc, mix) |
| slurm_partition_node_count | Count of nodes in each partition |
| slurm_partition_job_count | Count of jobs in each partition |
| slurm_partition_alloc_cpus | Total number of allocated CPUs in each partition |
| slurm_partition_free_cpus | Total number of available CPUs in each partition |
| slurm_partition_alloc_memory | Total allocated memory in each partition |
| slurm_partition_free_memory | Total available memory in each partition |
| slurm_partition_alloc_gpus | Total allocated GPUs in each partition |
| slurm_partition_free_gpus | Total available GPUs in each partition |
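
Dashboards typically combine the alloc/free gauges above into a utilization ratio (in PromQL, for example, `slurm_partition_alloc_gpus / (slurm_partition_alloc_gpus + slurm_partition_free_gpus)`). A minimal sketch of the same calculation, with hypothetical sample values:

```python
# Sketch: derive partition utilization from the alloc/free gauges above.
# The sample values are hypothetical.
def utilization(alloc, free):
    """Fraction of a resource currently allocated (0.0 to 1.0)."""
    total = alloc + free
    return alloc / total if total else 0.0

# e.g. slurm_partition_alloc_gpus = 24, slurm_partition_free_gpus = 8
alloc_gpus, free_gpus = 24, 8
print(f"GPU utilization: {utilization(alloc_gpus, free_gpus):.0%}")  # GPU utilization: 75%
```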

## Node exporter dashboard


Provides visualized information of system metrics collected by the [Prometheus node exporter](https://github.com/prometheus/node_exporter) from the HyperPod cluster nodes.

**Types of metrics**
+ **System overview:** Displaying CPU load averages and memory usage.
+ **Memory metrics:** Visualizing memory utilization including total memory, free memory, and swap space.
+ **Disk usage:** Monitoring disk space utilization and availability.
+ **Network traffic:** Showing network bytes received and transmitted over time.
+ **File system metrics:** Analyzing file system usage and availability.
+ **Disk I/O metrics:** Visualizing disk read and write activity.

**List of metrics**

For a complete list of exported metrics, see the [Node exporter](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default) and [procfs](https://github.com/prometheus/procfs?tab=readme-ov-file) GitHub repositories. The following table shows a subset of the metrics that provide insights into system resource utilization, such as CPU load, memory usage, disk space, and network activity.


| Metric name | Description | 
| --- | --- | 
| node_load1 | 1-minute load average |
| node_load5 | 5-minute load average |
| node_load15 | 15-minute load average |
| node_memory_MemTotal | Total system memory |
| node_memory_MemFree | Free system memory |
| node_memory_MemAvailable | Memory available for allocation to processes |
| node_memory_Buffers | Memory used by the kernel for buffering |
| node_memory_Cached | Memory used by the kernel for caching file system data |
| node_memory_SwapTotal | Total swap space available |
| node_memory_SwapFree | Free swap space |
| node_memory_SwapCached | Memory that was swapped out and back in, but still resides in swap |
| node_filesystem_avail_bytes | Available disk space in bytes |
| node_filesystem_size_bytes | Total disk space in bytes |
| node_filesystem_free_bytes | Free disk space in bytes |
| node_network_receive_bytes | Network bytes received |
| node_network_transmit_bytes | Network bytes transmitted |
| node_disk_read_bytes | Disk bytes read |
| node_disk_written_bytes | Disk bytes written |
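
A common panel built on these gauges is memory utilization, computed as `1 - node_memory_MemAvailable / node_memory_MemTotal` so that reclaimable cache counts as free. A minimal sketch with hypothetical byte values:

```python
# Sketch: memory utilization from node exporter gauges. The sample byte
# values are hypothetical.
def mem_utilization(mem_total, mem_available):
    """Fraction of memory in use, counting reclaimable cache as free."""
    return 1.0 - mem_available / mem_total

total = 64 * 1024**3      # node_memory_MemTotal (64 GiB)
available = 16 * 1024**3  # node_memory_MemAvailable (16 GiB)
print(f"{mem_utilization(total, available):.0%}")  # 75%
```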

## NVIDIA DCGM exporter dashboard


Provides visualized information of NVIDIA GPU metrics collected by the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter).

**Types of metrics**
+ **GPU Overview:** Displaying GPU utilization, temperatures, power usage, and memory usage. 
+ **Temperature Metrics:** Visualizing GPU temperatures over time. 
+ **Power Usage:** Monitoring GPU power draw and power usage trends. 
+ **Memory Utilization:** Analyzing GPU memory usage including used, free, and total memory. 
+ **Fan Speed:** Showing GPU fan speeds and variations. 
+ **ECC Errors:** Tracking GPU memory ECC errors and pending errors.

**List of metrics**

The following table shows a list of the metrics that provide insights into NVIDIA GPU health and performance, including clock frequencies, temperatures, power usage, memory utilization, fan speeds, and error metrics.


| Metric name | Description | 
| --- | --- | 
| DCGM_FI_DEV_SM_CLOCK | SM clock frequency (in MHz) |
| DCGM_FI_DEV_MEM_CLOCK | Memory clock frequency (in MHz) |
| DCGM_FI_DEV_MEMORY_TEMP | Memory temperature (in C) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (in C) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (in W) |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Total energy consumption since boot (in mJ) |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Total number of PCIe retries |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %) |
| DCGM_FI_DEV_ENC_UTIL | Encoder utilization (in %) |
| DCGM_FI_DEV_DEC_UTIL | Decoder utilization (in %) |
| DCGM_FI_DEV_XID_ERRORS | Value of the last XID error encountered |
| DCGM_FI_DEV_FB_FREE | Frame buffer memory free (in MiB) |
| DCGM_FI_DEV_FB_USED | Frame buffer memory used (in MiB) |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Total number of NVLink bandwidth counters for all lanes |
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | vGPU license status |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Number of remapped rows for uncorrectable errors |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Number of remapped rows for correctable errors |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Whether remapping of rows has failed |
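
Note that `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` is a cumulative counter in millijoules, so average power over an interval is the counter delta divided by elapsed time (in PromQL, roughly `rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m]) / 1000` for watts). A sketch with hypothetical counter readings:

```python
# Sketch: average GPU power from two readings of the millijoule energy
# counter. The sample readings and interval are hypothetical.
def avg_power_watts(energy_mj_start, energy_mj_end, seconds):
    """Average power in watts from two millijoule counter readings."""
    return (energy_mj_end - energy_mj_start) / 1000.0 / seconds

# Counter grew by 4,500,000 mJ over a 15 s scrape interval -> 300 W.
print(avg_power_watts(1_000_000, 5_500_000, 15.0))  # 300.0
```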

## EFA metrics dashboard


Provides visualized information of the metrics from [Amazon Elastic Fabric Adapter (EFA)](https://docs.amazonaws.cn/AWSEC2/latest/UserGuide/efa.html) equipped on P instances collected by the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md).

**Types of metrics**
+ **EFA error metrics:** Visualizing errors such as allocation errors, command errors, and memory map errors.
+ **EFA network traffic:** Monitoring received and transmitted bytes, packets, and work requests.
+ **EFA RDMA performance:** Analyzing RDMA read and write operations, including bytes transferred and error rates.
+ **EFA port lifespan:** Displaying the lifespan of EFA ports over time.
+ **EFA keep-alive packets:** Tracking the number of keep-alive packets received.

**List of metrics**

The following table shows a list of the metrics that provide insights into various aspects of EFA operation, including errors, completed commands, network traffic, and resource utilization.


| Metric name | Description | 
| --- | --- | 
| node_amazonefa_info | Non-numeric data from /sys/class/infiniband/; value is always 1 |
| node_amazonefa_lifespan | Lifespan of the port |
| node_amazonefa_rdma_read_bytes | Number of bytes read with RDMA |
| node_amazonefa_rdma_read_resp_bytes | Number of RDMA read response bytes |
| node_amazonefa_rdma_read_wr_err | Number of RDMA read work request errors |
| node_amazonefa_rdma_read_wrs | Number of RDMA read work requests |
| node_amazonefa_rdma_write_bytes | Number of bytes written with RDMA |
| node_amazonefa_rdma_write_recv_bytes | Number of RDMA write bytes received |
| node_amazonefa_rdma_write_wr_err | Number of RDMA write work request errors |
| node_amazonefa_rdma_write_wrs | Number of RDMA write work requests |
| node_amazonefa_recv_bytes | Number of bytes received |
| node_amazonefa_recv_wrs | Number of receive work requests |
| node_amazonefa_rx_bytes | Number of bytes received |
| node_amazonefa_rx_drops | Number of packets dropped |
| node_amazonefa_rx_pkts | Number of packets received |
| node_amazonefa_send_bytes | Number of bytes sent |
| node_amazonefa_send_wrs | Number of send work requests |
| node_amazonefa_tx_bytes | Number of bytes transmitted |
| node_amazonefa_tx_pkts | Number of packets transmitted |
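
Counters such as `node_amazonefa_rx_drops` and `node_amazonefa_rx_pkts` are cumulative, so a drop ratio should be computed from deltas between two scrapes (in PromQL, roughly `rate(node_amazonefa_rx_drops[5m]) / rate(node_amazonefa_rx_pkts[5m])`). A sketch with hypothetical deltas:

```python
# Sketch: packet drop ratio from EFA counter deltas between two scrapes.
# The sample deltas are hypothetical.
def rx_drop_ratio(rx_drops_delta, rx_pkts_delta):
    """Dropped packets relative to received packets over the interval."""
    return rx_drops_delta / rx_pkts_delta if rx_pkts_delta else 0.0

print(f"{rx_drop_ratio(12, 1_000_000):.6f}")  # 0.000012
```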

## FSx for Lustre metrics dashboard


Provides visualized information of the [metrics from Amazon FSx for Lustre](https://docs.amazonaws.cn/fsx/latest/LustreGuide/monitoring-cloudwatch.html) file system collected by [Amazon CloudWatch](https://docs.amazonaws.cn/fsx/latest/LustreGuide/monitoring-cloudwatch.html).

**Note**  
The Grafana FSx for Lustre dashboard utilizes Amazon CloudWatch as its data source, which differs from the other dashboards that you have configured to use Amazon Managed Service for Prometheus. To ensure accurate monitoring and visualization of metrics related to your FSx for Lustre file system, configure the FSx for Lustre dashboard to use Amazon CloudWatch as the data source, specifying the same Amazon Web Services Region where your FSx for Lustre file system is deployed.

**Types of metrics**
+ **DataReadBytes:** The number of bytes for file system read operations.
+ **DataWriteBytes:** The number of bytes for file system write operations.
+ **DataReadOperations:** The number of read operations.
+ **DataWriteOperations:** The number of write operations.
+ **MetadataOperations:** The number of metadata operations.
+ **FreeDataStorageCapacity:** The amount of available storage capacity.

# Amazon SageMaker HyperPod Slurm metrics

Amazon SageMaker HyperPod provides a set of Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on your HyperPod clusters and are available in the `/aws/sagemaker/Clusters` CloudWatch namespace.

## Cluster level metrics


The following cluster-level metrics are available for HyperPod. These metrics use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| cluster_node_count | Total number of nodes in the cluster | cluster_node_count |
| cluster_idle_node_count | Number of idle nodes in the cluster | N/A |
| cluster_failed_node_count | Number of failed nodes in the cluster | cluster_failed_node_count |
| cluster_cpu_count | Total CPU cores in the cluster | node_cpu_limit |
| cluster_idle_cpu_count | Number of idle CPU cores in the cluster | N/A |
| cluster_gpu_count | Total GPUs in the cluster | node_gpu_limit |
| cluster_idle_gpu_count | Number of idle GPUs in the cluster | N/A |
| cluster_running_task_count | Number of running Slurm jobs in the cluster | N/A |
| cluster_pending_task_count | Number of pending Slurm jobs in the cluster | N/A |
| cluster_preempted_task_count | Number of preempted Slurm jobs in the cluster | N/A |
| cluster_avg_task_wait_time | Average wait time for Slurm jobs in the cluster | N/A |
| cluster_max_task_wait_time | Maximum wait time for Slurm jobs in the cluster | N/A |
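
Because these metrics live in the `/aws/sagemaker/Clusters` CloudWatch namespace with a `ClusterId` dimension, you can query them with the CloudWatch `GetMetricData` API. A minimal sketch that only builds the query structure; the cluster ID is a hypothetical placeholder, and the actual `boto3` call is shown commented out so the snippet stays self-contained:

```python
# Sketch: build one GetMetricData query entry for a HyperPod cluster
# metric. The namespace and dimension name come from this page; the
# cluster ID is a hypothetical placeholder.
def cluster_metric_query(metric_name, cluster_id, period=300, stat="Average"):
    """One entry for the MetricDataQueries parameter of GetMetricData."""
    return {
        "Id": metric_name,  # query IDs must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": "/aws/sagemaker/Clusters",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
            },
            "Period": period,
            "Stat": stat,
        },
    }

query = cluster_metric_query("cluster_idle_gpu_count", "<ClusterId>")
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# resp = cloudwatch.get_metric_data(
#     MetricDataQueries=[query], StartTime=start, EndTime=end)
print(query["MetricStat"]["Metric"]["Namespace"])  # /aws/sagemaker/Clusters
```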

## Instance level metrics


The following instance-level metrics are available for HyperPod. These metrics also use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| node_gpu_utilization | Average GPU utilization across all instances | node_gpu_utilization |
| node_gpu_memory_utilization | Average GPU memory utilization across all instances | node_gpu_memory_utilization |
| node_cpu_utilization | Average CPU utilization across all instances | node_cpu_utilization |
| node_memory_utilization | Average memory utilization across all instances | node_memory_utilization |