

# Cluster and task observability


There are two options for monitoring SageMaker HyperPod clusters:

**The SageMaker HyperPod observability add-on**—SageMaker HyperPod provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

**Amazon CloudWatch Container Insights**—Amazon CloudWatch Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

**Topics**
+ [Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](sagemaker-hyperpod-observability-addon.md)
+ [Observability with Amazon CloudWatch](sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.md)

# Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus

Amazon SageMaker HyperPod (SageMaker HyperPod) provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

## Restricted Instance Group (RIG) support


The observability add-on also supports clusters that contain Restricted Instance Groups. In RIG clusters, the add-on automatically adapts its deployment strategy to comply with the network isolation and security constraints of restricted nodes. DaemonSet components (node exporter, DCGM exporter, EFA exporter, Neuron monitor, and node collector) run on both standard and restricted nodes. Deployment components (central collector, Kube State Metrics, and Training Metrics Agent) are scheduled with boundary-aware logic to respect network isolation between instance groups. Container log collection with Fluent Bit is not available on restricted nodes.

For information about setting up the add-on on clusters with Restricted Instance Groups, see [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md).

**Topics**
+ [Restricted Instance Group (RIG) support](#hyperpod-observability-addon-rig-support)
+ [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md)
+ [Amazon SageMaker HyperPod observability dashboards](hyperpod-observability-addon-viewing-dashboards.md)
+ [Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana](hyperpod-observability-addon-exploring-metrics.md)
+ [Customizing SageMaker HyperPod cluster metrics dashboards and alerts](hyperpod-observability-addon-customizing.md)
+ [Creating custom SageMaker HyperPod cluster metrics](hyperpod-observability-addon-custom-metrics.md)
+ [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md)
+ [Preconfigured alerts](hyperpod-observability-addon-alerts.md)
+ [Troubleshooting the Amazon SageMaker HyperPod observability add-on](hyperpod-observability-addon-troubleshooting.md)

# Setting up the SageMaker HyperPod observability add-on

The following list describes the prerequisites for setting up the observability add-on.

To have metrics for your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster sent to an Amazon Managed Service for Prometheus workspace and to optionally view them in Amazon Managed Grafana, first attach the following managed policies and permissions to your console role.
+ To use Amazon Managed Grafana, enable Amazon IAM Identity Center (IAM Identity Center) in an Amazon Web Services Region where Amazon Managed Grafana is available. For instructions, see [Getting started with IAM Identity Center](https://docs.amazonaws.cn/singlesignon/latest/userguide/getting-started.html) in the *Amazon IAM Identity Center User Guide*. For a list of Amazon Web Services Regions where Amazon Managed Grafana is available, see [Supported Regions](https://docs.amazonaws.cn/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html#AMG-supported-Regions) in the *Amazon Managed Grafana User Guide*.
+ Create at least one user in IAM Identity Center.
+ Ensure that the [Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-pod-id) add-on is installed in your Amazon EKS cluster. The Amazon EKS Pod Identity Agent add-on makes it possible for the SageMaker HyperPod observability add-on to get the credentials to interact with Amazon Managed Service for Prometheus and CloudWatch Logs. To check whether your Amazon EKS cluster has the add-on, go to the Amazon EKS console, and check your cluster's **Add-ons** tab. For information about how to install the add-on if it's not installed, see [Create add-on (Amazon Web Services Management Console)](https://docs.amazonaws.cn/eks/latest/userguide/creating-an-add-on.html#_create_add_on_console) in the *Amazon EKS User Guide*.
**Note**  
The Amazon EKS Pod Identity Agent is required for standard instance groups. For Restricted Instance Groups (RIG), the Pod Identity Agent is not available due to network isolation constraints. The cluster's instance group execution IAM role is used to interact with Amazon Managed Service for Prometheus. For information about how to configure that role, see [Additional prerequisites for Restricted Instance Groups](#hyperpod-observability-addon-rig-prerequisites).
+ Ensure that you have at least one node in your SageMaker HyperPod cluster before installing the SageMaker HyperPod observability add-on. The smallest Amazon EC2 instance size that works in this case is `4xlarge`. This minimum node size requirement ensures that the node can accommodate all the pods that the SageMaker HyperPod observability add-on creates alongside any other pods already running on the cluster.
+ Add the following policies and permissions to your role.
  + [Amazon managed policy: AmazonSageMakerHyperPodObservabilityAdminAccess](security-iam-awsmanpol-AmazonSageMakerHyperPodObservabilityAdminAccess.md)
  + [Amazon managed policy: AWSGrafanaWorkspacePermissionManagementV2](https://docs.amazonaws.cn/grafana/latest/userguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AWSGrafanaWorkspacePermissionManagementV2)
  + [Amazon managed policy: AmazonSageMakerFullAccess](https://docs.amazonaws.cn/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html)
  + Additional permissions to set up required IAM roles for Amazon Managed Grafana and Amazon Elastic Kubernetes Service add-on access:

    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateRoleAccess",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateRole",
                    "iam:CreatePolicy",
                    "iam:AttachRolePolicy",
                    "iam:ListRoles"
                ],
                "Resource": [
                    "arn:aws-cn:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityGrafanaAccess*",
                    "arn:aws-cn:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityAddonAccess*",
                    "arn:aws-cn:iam::*:policy/service-role/HyperPodObservabilityAddonPolicy*",
                    "arn:aws-cn:iam::*:policy/service-role/HyperPodObservabilityGrafanaPolicy*"
                ]
            }
        ]
    }
    ```

  + Additional permissions needed to manage IAM Identity Center users for Amazon Managed Grafana:


    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "SSOAccess",
                "Effect": "Allow",
                "Action": [
                    "sso:ListProfileAssociations",
                    "sso-directory:SearchUsers",
                    "sso-directory:SearchGroups",
                    "sso:AssociateProfile",
                    "sso:DisassociateProfile"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }
    ```


## Additional prerequisites for Restricted Instance Groups


If your cluster contains Restricted Instance Groups, the instance group execution role must have permissions to write metrics to Amazon Managed Service for Prometheus. When you use **Quick setup** to create your cluster with observability enabled, these permissions are added to the execution role automatically.

If you are using **Custom setup** or adding observability to an existing RIG cluster, ensure that the execution role for each Restricted Instance Group has the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PrometheusAccess",
            "Effect": "Allow",
            "Action": "aps:RemoteWrite",
            "Resource": "arn:aws:aps:us-east-1:account_id:workspace/workspace-ID"
        }
    ]
}
```

Replace *us-east-1*, *account_id*, and *workspace-ID* with your Amazon Web Services Region, account ID, and Amazon Managed Service for Prometheus workspace ID.
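
If you script your cluster setup, you can render this policy with your own values instead of editing the JSON by hand. The following Python sketch uses only the standard library; the helper name and the example Region, account ID, and workspace ID are illustrative placeholders, not part of any SageMaker HyperPod tooling.

```python
import json

def build_prometheus_write_policy(region, account_id, workspace_id):
    """Render the aps:RemoteWrite policy shown above.

    All three arguments are placeholders; substitute the values for your
    own account and Amazon Managed Service for Prometheus workspace.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PrometheusAccess",
                "Effect": "Allow",
                "Action": "aps:RemoteWrite",
                "Resource": f"arn:aws:aps:{region}:{account_id}:workspace/{workspace_id}",
            }
        ],
    }

# Example with clearly fake values
policy = build_prometheus_write_policy("us-east-1", "111122223333", "ws-EXAMPLE11111")
print(json.dumps(policy, indent=4))
```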

After you ensure that you have met the above prerequisites, you can install the observability add-on.

**To quickly install the observability add-on**

1. Open the Amazon SageMaker AI console at [https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Quick install**.

**To do a custom install of the observability add-on**

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Custom install**.

1. Specify the metrics categories that you want to see. For more information about these metrics categories, see [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md).

1. Specify whether you want to enable Amazon CloudWatch Logs.

1. Specify whether you want the service to create a new Amazon Managed Service for Prometheus workspace.

1. To be able to view the metrics in Amazon Managed Grafana dashboards, check the box labeled **Use an Amazon Managed Grafana workspace**. You can specify your own workspace or let the service create a new one for you. 
**Note**  
Amazon Managed Grafana isn't available in all Amazon Web Services Regions in which Amazon Managed Service for Prometheus is available. However, you can set up a Grafana workspace in any Amazon Web Services Region and configure it to get metrics data from a Prometheus workspace that resides in a different Amazon Web Services Region. For information, see [Use Amazon data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.amazonaws.cn/grafana/latest/userguide/AMP-adding-AWS-config.html) and [Connect to Amazon Managed Service for Prometheus and open-source Prometheus data sources](https://docs.amazonaws.cn/grafana/latest/userguide/prometheus-data-source.html). 

# Amazon SageMaker HyperPod observability dashboards

This topic describes how to view metrics dashboards for your Amazon SageMaker HyperPod (SageMaker HyperPod) clusters and how to add new users to a dashboard. The topic also describes the different types of dashboards.

## Accessing dashboards


To view your SageMaker HyperPod cluster's metrics in Amazon Managed Grafana, perform the following steps:

1. Open the Amazon SageMaker AI console at [https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the **HyperPod Observability** section, and choose **Open dashboard in Grafana**.

## Adding new users to an Amazon Managed Grafana workspace


For information about how to add users to an Amazon Managed Grafana workspace, see [Use Amazon IAM Identity Center with your Amazon Managed Grafana workspace](https://docs.amazonaws.cn/grafana/latest/userguide/authentication-in-AMG-SSO.html) in the *Amazon Managed Grafana User Guide*.

## Observability dashboards


The SageMaker HyperPod observability add-on provides seven interconnected dashboards in your default Amazon Managed Grafana workspace. Each dashboard provides in-depth insights about different resources and tasks in the clusters for various users such as data scientists, machine learning engineers, and administrators.

### Task dashboard


The Task dashboard provides comprehensive monitoring and visualization of resource utilization metrics for SageMaker HyperPod tasks. The main panel displays a detailed table grouping resource usage by parent tasks, showing CPU, GPU, and memory utilization across pods. Interactive time-series graphs track CPU usage, system memory consumption, GPU utilization percentages, and GPU memory usage for selected pods, allowing you to monitor performance trends over time. The dashboard features powerful filtering capabilities through variables like cluster name, namespace, task type, and specific pods, making it easy to drill down into specific workloads. This monitoring solution is essential for optimizing resource allocation and maintaining performance of machine learning workloads on SageMaker HyperPod.

### Training dashboard


The training dashboard provides comprehensive monitoring of training task health, reliability, and fault management metrics. The dashboard features key performance indicators including task creation counts, success rates, and uptime percentages, along with detailed tracking of both automatic and manual restart events. It offers detailed visualizations of fault patterns through pie charts and heatmaps that break down incidents by type and remediation latency, enabling you to identify recurring issues and optimize task reliability. The interface includes real-time monitoring of critical metrics like system recovery times and fault detection latencies, making it an essential tool for maintaining high availability of training workloads. Additionally, the dashboard's 24-hour trailing window provides historical context for analyzing trends and patterns in training task performance, helping teams proactively address potential issues before they impact production workloads.

### Inference dashboard


The inference dashboard provides comprehensive monitoring of model deployment performance and health metrics across multiple dimensions. It features a detailed overview of active deployments, real-time monitoring of request rates, success percentages, and latency metrics, enabling you to track model serving performance and identify potential bottlenecks. The dashboard includes specialized panels for both general inference metrics and token-specific metrics for language models, such as time to first token (TTFT) and token throughput, making it particularly valuable for monitoring large language model deployments. Additionally, it provides infrastructure insights through pod and node allocation tracking, while offering detailed error analysis capabilities to help maintain high availability and performance of inference workloads.

### Cluster dashboard


The cluster dashboard provides a comprehensive view of cluster health and performance, offering real-time visibility into compute, memory, network, and storage resources across your Amazon SageMaker HyperPod (SageMaker HyperPod) environment. At a glance, you can view critical metrics including total instances, GPU utilization, memory usage, and network performance through an intuitive interface that automatically updates data every few seconds. The dashboard is organized into logical sections, starting with a high-level cluster overview that displays key metrics such as healthy instance percentage and total resource counts, followed by detailed sections for GPU performance, memory utilization, network statistics, and storage metrics. Each section features interactive graphs and panels that allow you to drill down into specific metrics, with customizable time ranges and filtering options by cluster name, instance, or GPU ID.

### File system dashboard


The file-system dashboard provides comprehensive visibility into file system (Amazon FSx for Lustre) performance and health metrics. The dashboard displays critical storage metrics including free capacity, deduplication savings, CPU/memory utilization, disk IOPS, throughput, and client connections across multiple visualizations. It makes it possible for you to monitor both system-level performance indicators like CPU and memory usage, as well as storage-specific metrics such as read/write operations and disk utilization patterns. The interface includes alert monitoring capabilities and detailed time-series graphs for tracking performance trends over time, making it valuable for proactive maintenance and capacity planning. Additionally, through its comprehensive metrics coverage, the dashboard helps identify potential bottlenecks, optimize storage performance, and ensure reliable file system operations for SageMaker HyperPod workloads.

### GPU partition dashboard


To monitor GPU partition-specific metrics when using Multi-Instance GPU (MIG) configurations, you need to install or upgrade to the latest version of the SageMaker HyperPod observability add-on. The add-on provides comprehensive monitoring capabilities, including MIG-specific metrics such as partition count, memory usage, and compute utilization per GPU partition.

If you already have the SageMaker HyperPod observability add-on installed but need MIG metrics support, update the add-on to the latest version. This process is non-disruptive and maintains your existing monitoring configuration.

SageMaker HyperPod automatically exposes MIG-specific metrics, including:
+ `nvidia_mig_instance_count`: Number of MIG instances per profile
+ `nvidia_mig_memory_usage`: Memory utilization per MIG instance
+ `nvidia_mig_compute_utilization`: Compute utilization per MIG instance
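
As a rough illustration of how these metric names appear in practice, the following Python sketch picks the `nvidia_mig_*` series out of Prometheus exposition-format text. The sample scrape text, its labels, and its values are made up for the example; real exporter output also includes `# HELP` and `# TYPE` lines and many more series.

```python
# Illustrative scrape text only -- not real exporter output.
sample_scrape = """\
nvidia_mig_instance_count{profile="1g.10gb",gpu="0"} 4
nvidia_mig_memory_usage{profile="1g.10gb",gpu="0",instance_id="0"} 2048
nvidia_mig_compute_utilization{profile="1g.10gb",gpu="0",instance_id="0"} 37
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

def mig_samples(text):
    """Return (metric_name, value) pairs for the nvidia_mig_* series only."""
    samples = []
    for line in text.splitlines():
        if not line.startswith("nvidia_mig_"):
            continue
        name = line.split("{", 1)[0]          # metric name before the label set
        value = float(line.rsplit(" ", 1)[1]) # sample value after the last space
        samples.append((name, value))
    return samples

for name, value in mig_samples(sample_scrape):
    print(name, value)
```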

### Cluster Logs dashboard


The Cluster Logs dashboard provides a centralized view of CloudWatch Logs for your SageMaker HyperPod cluster. The dashboard queries the `/aws/sagemaker/Clusters/{cluster-name}/{cluster-id}` log group and displays log events with filtering capabilities by instance ID, log stream name, log level (ERROR, WARN, INFO, DEBUG), and free-text search. The dashboard includes an events timeline showing log event distribution over time, a total events counter, a searched events timeline for filtered results, and a detailed logs panel with full log messages, timestamps, and log stream metadata. This dashboard uses CloudWatch as its data source and is useful for debugging cluster issues, monitoring instance health events, and investigating training job failures.
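
As a minimal sketch of the behavior described above, the following Python snippet shows the log group naming pattern and a simplified version of the dashboard's log-level filtering. The cluster name, cluster ID, and sample events are hypothetical.

```python
def log_group_name(cluster_name, cluster_id):
    """Build the CloudWatch log group name the dashboard queries."""
    return f"/aws/sagemaker/Clusters/{cluster_name}/{cluster_id}"

def filter_by_level(events, level):
    """Keep events whose message contains the given log-level token,
    mirroring the dashboard's ERROR/WARN/INFO/DEBUG filter."""
    return [e for e in events if level in e["message"]]

# Hypothetical sample events
events = [
    {"stream": "i-0abc123/health", "message": "[ERROR] GPU 3 fell off the bus"},
    {"stream": "i-0abc123/health", "message": "[INFO] instance passed deep health check"},
]

print(log_group_name("my-cluster", "abcd1234"))
print(filter_by_level(events, "ERROR"))
```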

# Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana

After you connect Amazon Managed Grafana to your Amazon Managed Service for Prometheus workspace, you can use Grafana's query editor and visualization tools to explore your metrics data. Amazon Managed Grafana provides multiple ways to interact with Prometheus data, including a comprehensive query editor for building PromQL expressions, a metrics browser for discovering available metrics and labels, and templating capabilities for creating dynamic dashboards. You can perform range queries to visualize time series data over periods and instant queries to retrieve the latest values, with options to format results as time series graphs, tables, or heatmaps. For detailed information about configuring query settings, using the metrics browser, and leveraging templating features, see [Using the Prometheus data source](https://docs.amazonaws.cn/grafana/latest/userguide/using-prometheus-datasource.html).

# Customizing SageMaker HyperPod cluster metrics dashboards and alerts

Amazon Managed Grafana makes it possible for you to create comprehensive dashboards that visualize your data through panels containing queries connected to your data sources. You can build dashboards from scratch, import existing ones, or export your creations for sharing and backup purposes. Grafana dashboards support dynamic functionality through variables that replace hard-coded values in queries, making your visualizations more flexible and interactive. You can also enhance your dashboards with features like annotations, library panels for reusability, version history management, and custom links to create a complete monitoring and observability solution. For step-by-step guidance on creating, importing, configuring, and managing dashboards, see [Building dashboards](https://docs.amazonaws.cn/grafana/latest/userguide/v10-dash-building-dashboards.html).

# Creating custom SageMaker HyperPod cluster metrics

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on provides hundreds of health, performance, and efficiency metrics out-of-the-box. In addition to those metrics, you might need to monitor custom metrics specific to your applications or business needs that aren't captured by default metrics, such as model-specific performance indicators, data processing statistics, or application-specific measurements. To address this need, you can implement custom metrics collection using OpenTelemetry by integrating a Python code snippet into your application.

To create custom metrics, first run the following shell command to install the core OpenTelemetry components needed to instrument Python applications for observability. This installation makes it possible for Python applications that run on SageMaker HyperPod clusters to emit custom telemetry data. That data gets collected by the OpenTelemetry collector and forwarded to the observability infrastructure.

```
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```

The following example script configures an OpenTelemetry metrics pipeline that automatically tags metrics with pod and node information, ensuring proper attribution within your cluster, and sends these metrics to the SageMaker HyperPod built-in observability stack every second. The script establishes a connection to the SageMaker HyperPod metrics collector, sets up appropriate resource attributes for identification, and provides a meter interface through which you can create various types of metrics (counters, gauges, or histograms) to track any aspect of your application's performance. Custom metrics integrate with the SageMaker HyperPod monitoring dashboards alongside system metrics. This integration allows for comprehensive observability through a single interface where you can create custom alerts, visualizations, and reports to monitor your workload's complete performance profile.

```
import os
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Get hostname/pod name
hostname = os.uname()[1]
node_name = os.getenv('NODE_NAME', 'unknown')

collector_endpoint = "hyperpod-otel-collector.hyperpod-observability:4317"

# Configure the OTLP exporter
# Configure the OTLP exporter
exporter = OTLPMetricExporter(
    endpoint=collector_endpoint,
    insecure=True,
    timeout=5  # request timeout in seconds
)

reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=1000
)

resource = Resource.create({
    "service.name": "metric-test",
    "pod.name": hostname,
    "node.name": node_name
})

meter_provider = MeterProvider(
    metric_readers=[reader],
    resource=resource
)
metrics.set_meter_provider(meter_provider)

# Create a meter
meter = metrics.get_meter("test-meter")

# Create a counter
counter = meter.create_counter(
    name="test.counter",
    description="A test counter"
)

counter.add(1, {"pod": hostname, "node": node_name})

# Flush any pending metrics before the process exits
meter_provider.shutdown()
```

# SageMaker HyperPod cluster metrics

Amazon SageMaker HyperPod (SageMaker HyperPod) publishes various metrics across 9 distinct categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories have additional metrics that can be enabled for more granular cluster information, and where they appear in the Amazon Managed Grafana workspace.


| Metric category | Enabled by default? | Additional advanced metrics available? | Available under which Grafana dashboards? | 
| --- | --- | --- | --- | 
| Training metrics | Yes | Yes | Training | 
| Inference metrics | Yes | No | Inference | 
| Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Cluster metrics | Yes | Yes | Cluster | 
| Instance metrics | Yes | Yes | Cluster | 
| Accelerated compute metrics | Yes | Yes | Task, Cluster | 
| Network metrics | No | Yes | Cluster | 
| File system metrics | Yes | No | File system | 

The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.

## Metrics availability on Restricted Instance Groups


When your cluster contains Restricted Instance Groups, most metric categories are available on restricted nodes, with the exceptions and considerations noted in the following table. You can also set up alerting on any metric of your choice.


| Metric category | Available on RIG nodes? | Notes | 
| --- | --- | --- | 
| Training metrics | Yes | Kubeflow and Kubernetes pod metrics are collected. Advanced training KPI metrics (from Training Metrics Agent) are not available from the RIG nodes. | 
| Inference metrics | No | Inference workloads are not supported on Restricted Instance Groups. | 
| Task governance metrics | No | Kueue metrics are collected from the standard nodes only, if any. | 
| Scaling metrics | No | KEDA metrics are collected from the standard nodes only, if any. | 
| Cluster metrics | Yes | Kube State Metrics and API server metrics are available. Kube State Metrics is preferentially scheduled on standard nodes but can run on restricted nodes in RIG-only clusters. | 
| Instance metrics | Yes | Node Exporter and cAdvisor metrics are collected on all nodes including restricted nodes. | 
| Accelerated compute metrics | Yes | DCGM Exporter runs on GPU-enabled restricted nodes. Neuron Monitor runs on Neuron-enabled restricted nodes when advanced mode is enabled. | 
| Network metrics | Yes | EFA Exporter runs on EFA-enabled restricted nodes when advanced mode is enabled. | 
| File system metrics | Yes | FSx for Lustre cluster utilization metrics are supported on Restricted Instance Groups. | 

**Note**  
Container log collection with Fluent Bit is not deployed on restricted nodes. Cluster logs from restricted nodes are available through the SageMaker HyperPod platform independently of the observability add-on. You can view these logs in the Cluster Logs dashboard.

## Training metrics


Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kubeflow metrics | [https://github.com/kubeflow/trainer](https://github.com/kubeflow/trainer) | Yes | Kubeflow | 
| Kubernetes pod metrics | [https://github.com/kubernetes/kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) | Yes | Kubernetes | 
| training_uptime_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator | 
| training_manual_recovery_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator | 
| training_manual_downtime_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator | 
| training_auto_recovery_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator | 
| training_auto_recovery_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator | 
| training_fault_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator | 
| training_fault_type_count | Distribution of faults by type | No | SageMaker HyperPod training operator | 
| training_fault_recovery_time_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator | 
| training_time_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator | 
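
To make the uptime metric concrete, here is a small Python sketch. It assumes, from the descriptions above, that uptime percentage is training time as a share of the observation window (such as the Training dashboard's 24-hour trailing window); the exact formula the training operator uses may differ.

```python
# An illustrative sketch, not the operator's actual implementation:
# uptime percentage is modeled as training time over the window size.
def uptime_percentage(training_time_ms, window_ms):
    return 100.0 * training_time_ms / window_ms

window_ms = 24 * 60 * 60 * 1000           # 24-hour trailing window
training_time_ms = window_ms - 3_600_000  # one hour of cumulative downtime

print(round(uptime_percentage(training_time_ms, window_ms), 2))  # prints 95.83
```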

## Inference metrics


Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| model_invocations_total | Total number of invocation requests to the model | Yes | SageMaker HyperPod inference operator | 
| model_errors_total | Total number of errors during model invocation | Yes | SageMaker HyperPod inference operator | 
| model_concurrent_requests | Active concurrent model requests | Yes | SageMaker HyperPod inference operator | 
| model_latency_milliseconds | Model invocation latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| model_ttfb_milliseconds | Model time to first byte latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| TGI | Use these metrics to monitor the performance of TGI, auto-scale deployments, and help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 
| LMI | Use these metrics to monitor the performance of LMI and to help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 

## Task governance metrics


Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kueue | See [https://kueue.sigs.k8s.io/docs/reference/metrics/](https://kueue.sigs.k8s.io/docs/reference/metrics/). | No | Kueue | 
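As a sketch, once Kueue metrics are enabled you can query queue backlog in PromQL. The metric name below comes from the Kueue metrics reference linked above.

```
# Pending workloads per cluster queue (requires Kueue metrics to be enabled)
sum by (cluster_queue) (kueue_pending_workloads)
```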

## Scaling metrics


Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| KEDA Operator Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#operator](https://keda.sh/docs/2.17/integrations/prometheus/#operator). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Webhook Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks](https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Metrics server Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server](https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server). | No | Kubernetes Event-driven Autoscaler (KEDA) | 

## Cluster metrics


Use these metrics to monitor overall cluster health and resource allocation.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Cluster health | Kubernetes API server metrics. See [https://kubernetes.io/docs/reference/instrumentation/metrics/](https://kubernetes.io/docs/reference/instrumentation/metrics/). | Yes | Kubernetes | 
| Kubestate | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources). | Limited | Kubernetes | 
| KubeState Advanced | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources). | No | Kubernetes | 
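For example, the default kube-state-metrics resources include pod phase information, which you can aggregate in PromQL to spot scheduling problems:

```
# Count of pods stuck in the Pending phase, per namespace
sum by (namespace) (kube_pod_status_phase{phase="Pending"})
```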

## Instance metrics


Use these metrics to monitor individual instance performance and health.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Node Metrics | See [https://github.com/prometheus/node\_exporter?tab=readme-ov-file#enabled-by-default](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default). | Yes | Kubernetes | 
| Container Metrics | Container metrics exposed by cAdvisor. See [https://github.com/google/cadvisor](https://github.com/google/cadvisor). | Yes | Kubernetes | 
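As an example, a standard node exporter query derives CPU utilization from the idle-time counter:

```
# Average CPU utilization (%) per node over the last 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```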

## Accelerated compute metrics


Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.

**Note**  
When GPU partitioning with MIG (Multi-Instance GPU) is enabled on your cluster, DCGM metrics automatically provide partition-level granularity for monitoring individual MIG instances. Each MIG partition is exposed as a separate GPU device with its own metrics for temperature, power, memory utilization, and compute activity. This allows you to track resource usage and health for each GPU partition independently, enabling precise monitoring of workloads running on fractional GPU resources. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| NVIDIA GPU | DCGM metrics. See [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | Limited |  NVIDIA Data Center GPU Manager (DCGM)  | 
|  NVIDIA GPU (advanced)  | DCGM metrics that are commented out in the following CSV file: [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | No |  NVIDIA Data Center GPU Manager (DCGM)  | 
| Amazon Trainium | Neuron metrics. See [https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters). | No | Amazon Neuron Monitor | 
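For example, two commonly used DCGM series from the default set can be queried in PromQL as follows; the `Hostname` label is the same one used by the preconfigured alerts in this guide.

```
# Average GPU utilization (%) per node
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

# GPU framebuffer memory in use (MiB), per GPU
DCGM_FI_DEV_FB_USED
```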

## Network metrics


Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| EFA | See [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation\_and\_observability/3.efa-node-exporter/README.md](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). | No | Elastic Fabric Adapter | 

## File system metrics



| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| File system | Amazon FSx for Lustre metrics from Amazon CloudWatch. See [Monitoring with Amazon CloudWatch](https://docs.amazonaws.cn/fsx/latest/LustreGuide/monitoring-cloudwatch.html). | Yes | Amazon FSx for Lustre | 

# Preconfigured alerts


The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on enables default alerts for your cluster and workloads to notify you when the system detects common early indicators of cluster under-performance. These alerts are defined within the Amazon Managed Grafana built-in alerting system. For information about how to modify these pre-configured alerts or create new ones, see [Alerts in Grafana version 10](https://docs.amazonaws.cn/grafana/latest/userguide/v10-alerts.html) in the *Amazon Managed Grafana User Guide*. The following YAML shows the default alerts.

```
groups:
- name: sagemaker_hyperpod_alerts
  rules:
  # GPU_TEMP_ABOVE_80C
  - alert: GPUHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU Temperature Above 80C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_TEMP_ABOVE_85C  
  - alert: GPUCriticalTemperature  
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Above 85C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_MEMORY_ERROR
  # Any ECC double-bit errors indicate serious memory issues requiring immediate attention
  - alert: GPUMemoryErrorDetected
    expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0 or DCGM_FI_DEV_ECC_DBE_AGG_TOTAL > DCGM_FI_DEV_ECC_DBE_AGG_TOTAL offset 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU ECC Double-Bit Error Detected"
      description: "GPU {{ $labels.gpu }} has detected ECC double-bit errors."

  # GPU_POWER_WARNING
  # Sustained power limit violations can impact performance and stability
  - alert: GPUPowerViolation
    expr: DCGM_FI_DEV_POWER_VIOLATION > 100
    for: 5m
    labels:
      severity: warning  
    annotations:
      summary: "GPU Power Violation"
      description: "GPU {{ $labels.gpu }} has been operating at power limit for extended period."

  # GPU_NVLINK_ERROR
  # NVLink errors above threshold indicate interconnect stability issues
  - alert: NVLinkErrorsDetected
    expr: DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL > 0 or DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL > 10
    labels:
      severity: warning
    annotations:
      summary: "NVLink Errors Detected" 
      description: "GPU {{ $labels.gpu }} has detected NVLink errors."

  # GPU_THERMAL_VIOLATION  
  # Immediate alert on thermal violations to prevent hardware damage
  - alert: GPUThermalViolation
    expr: increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Thermal Violation Detected"
      description: "GPU {{ $labels.gpu }} has thermal violations on node {{ $labels.Hostname }}"

  # GPU_XID_ERROR
  # XID errors indicate driver or hardware level GPU issues requiring investigation
  - alert: GPUXidError
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "GPU XID Error Detected"
      description: "GPU {{ $labels.gpu }} experienced XID error {{ $value }} on node {{ $labels.Hostname }}"

  # MIG_CONFIG_FAILURE
  # MIG configuration failures indicate issues with GPU partitioning setup
  - alert: MIGConfigFailure
    expr: kubelet_node_name{nvidia_com_mig_config_state="failed"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MIG Configuration Failed"
      description: "MIG configuration failed on node {{ $labels.instance }}"

  # DISK_SPACE_WARNING
  # 90% threshold ensures time to respond before complete disk exhaustion
  - alert: NodeDiskSpaceWarning
    expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk Usage"
      description: "Node {{ $labels.instance }} disk usage is above 90%"

  # FSX_STORAGE_WARNING
  # 80% FSx utilization allows buffer for burst workloads
  - alert: FsxLustreStorageWarning
    expr: fsx_lustre_storage_used_bytes / fsx_lustre_storage_capacity_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High FSx Lustre Usage"
      description: "FSx Lustre storage usage is above 80% on file system {{ $labels.filesystem_id }}"
```
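As a sketch of how you might extend these defaults, the following hypothetical rule alerts when GPUs on a node stay completely idle, which can indicate a stalled training task. The metric and `Hostname` label match those used by the default alerts above; adjust the threshold and duration for your workloads.

```
  # GPU_IDLE (hypothetical custom alert)
  # Sustained zero utilization can indicate a stalled or hung training task
  - alert: GPUIdle
    expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) == 0
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPU Idle"
      description: "GPUs on node {{ $labels.Hostname }} have been idle for 30 minutes."
```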

# Troubleshooting the Amazon SageMaker HyperPod observability add-on
Troubleshooting

Use the following guidance to resolve common issues with the Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on.

## Troubleshooting missing metrics in Amazon Managed Grafana
Missing metrics

If metrics don't appear in your Amazon Managed Grafana dashboards, perform the following steps to identify and resolve the issue.

### Verify the Amazon Managed Service for Prometheus-Amazon Managed Grafana connection


1. Sign in to the Amazon Managed Grafana console.

1. In the left pane, choose **All workspaces**.

1. In the **Workspaces** table, choose your workspace.

1. In the details page of the workspace, choose the **Data sources** tab.

1. Verify that the Amazon Managed Service for Prometheus data source exists.

1. Check the connection settings:
   + Confirm that the endpoint URL is correct.
   + Verify that IAM authentication is properly configured.
   + Choose **Test connection**. Verify that the status is **Data source is working**.

### Verify the Amazon EKS add-on status


1. Open the Amazon EKS console at [https://console.amazonaws.cn/eks/home#/clusters](https://console.amazonaws.cn/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Verify that the SageMaker HyperPod observability add-on is listed and that its status is **ACTIVE**.

1. If the status isn't **ACTIVE**, see [Troubleshooting add-on installation failures](#troubleshooting-addon-installation-failures).

### Verify Pod Identity association


1. Open the Amazon EKS console at [https://console.amazonaws.cn/eks/home#/clusters](https://console.amazonaws.cn/eks/home#/clusters).

1. Select your cluster.

1. On the cluster details page, choose the **Access** tab.

1. In the **Pod Identity associations** table, choose the association that has the following property values:
   + **Namespace**: `hyperpod-observability`
   + **Service account**: `hyperpod-observability-operator-otel-collector`
   + **Add-on**: `amazon-sagemaker-hyperpod-observability`

1. Ensure that the IAM role that is attached to this association has the following permissions.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "PrometheusAccess",
               "Effect": "Allow",
               "Action": "aps:RemoteWrite",
               "Resource": "arn:aws-cn:aps:us-east-1:111122223333:workspace/workspace-ID"
           },
           {
               "Sid": "CloudwatchLogsAccess",
               "Effect": "Allow",
               "Action": [
                   "logs:CreateLogGroup",
                   "logs:CreateLogStream",
                   "logs:DescribeLogGroups",
                   "logs:DescribeLogStreams",
                   "logs:PutLogEvents",
                   "logs:GetLogEvents",
                   "logs:FilterLogEvents",
                   "logs:GetLogRecord",
                   "logs:StartQuery",
                   "logs:StopQuery",
                   "logs:GetQueryResults"
               ],
               "Resource": [
                   "arn:aws-cn:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*",
                   "arn:aws-cn:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*:log-stream:*"
               ]
           }
       ]
   }
   ```

------

1. Ensure that the IAM role that is attached to this association has the following trust policy. Verify that the source ARN and source account are correct.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ],
               "Condition": {
                   "StringEquals": {
                       "aws:SourceArn": "arn:aws-cn:eks:us-east-1:111122223333:cluster/cluster-name",
                       "aws:SourceAccount": "111122223333"
                   }
               }
           }
       ]
   }
   ```

------

### Check Amazon Managed Service for Prometheus throttling


1. Sign in to the Amazon Web Services Management Console and open the Service Quotas console at [https://console.amazonaws.cn/servicequotas/](https://console.amazonaws.cn/servicequotas/).

1. In the **Managed quotas** box, search for and select Amazon Managed Service for Prometheus.

1. Choose the **Active series per workspace** quota.

1. In the **Resource-level quotas** tab, select your Amazon Managed Service for Prometheus workspace.

1. Ensure that the utilization is less than your current quota.

1. If you've reached the quota limit, select your workspace by choosing the radio button to its left, and then choose **Request increase at resource level**.

### Verify KV caching and intelligent routing are enabled


If the `KVCache Metrics` dashboard is missing, the feature is either not enabled or the port isn't specified in the `modelMetrics` configuration. For more information on how to enable this, see steps 1 and 3 in [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route).

If the `Intelligent Router Metrics` dashboard is missing, enable the feature so that the dashboard appears. For more information on how to enable it, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route).

## Troubleshooting add-on installation failures
Add-on installation failures

If the observability add-on fails to install, use the following steps to diagnose and resolve the issue.

### Check health probe status


1. Open the Amazon EKS console at [https://console.amazonaws.cn/eks/home#/clusters](https://console.amazonaws.cn/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Choose the failed add-on.

1. Review the **Health issues** section.

1. If the health issue is related to credentials or pod identity, see [Verify Pod Identity association](#verify-pod-identity-association). Also ensure that the pod identity agent add-on is running in the cluster.

1. Check for errors in the manager logs. For instructions, see [Review manager logs](#review-manager-logs).

1. Contact Amazon Support with the issue details.

### Review manager logs


1. Get the logs from the add-on manager pod:

   ```
   kubectl logs -n hyperpod-observability -l control-plane=hyperpod-observability-controller-manager
   ```

1. For urgent issues, contact Amazon Web Services Support.

## Review all observability pods


All the pods that the SageMaker HyperPod observability add-on creates are in the `hyperpod-observability` namespace. To get the status of these pods, run the following command.

```
kubectl get pods -n hyperpod-observability
```

Look for pods whose status is `Pending` or `CrashLoopBackOff`. Run the following command to get the logs of these pods.

```
kubectl logs -n hyperpod-observability pod-name
```

If you don't find errors in the logs, run the following command to describe the pods and look for errors.

```
kubectl describe -n hyperpod-observability pod pod-name
```

To get more context, run the following two commands to describe the deployments and daemonsets for these pods.

```
kubectl describe -n hyperpod-observability deployment deployment-name
```

```
kubectl describe -n hyperpod-observability daemonset daemonset-name
```

## Troubleshooting pods that are stuck in the pending status
Pods stuck in pending

If pods are stuck in the `Pending` status, make sure that the node has enough capacity for all of the pods. To verify that it does, perform the following steps.

1. Open the Amazon EKS console at [https://console.amazonaws.cn/eks/home#/clusters](https://console.amazonaws.cn/eks/home#/clusters).

1. Choose your cluster.

1. Choose the cluster's **Compute** tab.

1. Choose the node with the smallest instance type.

1. In the capacity allocation section, look for available pods.

1. If there are no available pods, you need a larger instance type.

For urgent issues, contact Amazon Web Services Support.

## Troubleshooting observability on Restricted Instance Groups


Use the following guidance to resolve issues specific to clusters with Restricted Instance Groups.

### Observability pods not starting on restricted nodes


If observability pods are not starting on restricted nodes, check the pod status and events:

```
kubectl get pods -n hyperpod-observability -o wide
kubectl describe pod pod-name -n hyperpod-observability
```

Common causes include:
+ **Image pull failures:** The pod events may show image pull errors if the observability container images are not yet allowlisted on the restricted nodes. Ensure that you are running the latest version of the observability add-on. If the issue persists after upgrading, contact Amazon Web Services Support.
+ **Taint tolerations:** Verify that the pod spec includes the required toleration for restricted nodes. Starting with version `v1.0.5-eksbuild.1`, the add-on automatically adds this toleration when RIG support is enabled. If you are using an older version, upgrade to the latest version.

### Viewing logs for pods on restricted nodes


The `kubectl logs` command does not work for pods running on restricted nodes. This is an expected limitation because the communication path required for log streaming is not available on restricted nodes.

To view logs from restricted nodes, use the **Cluster Logs** dashboard in Amazon Managed Grafana, which queries CloudWatch Logs directly. You can filter by instance ID, log stream, log level, and free-text search to find relevant log entries.

### DNS resolution failures in clusters with both standard and restricted nodes


In hybrid clusters (clusters with both standard and restricted instance groups), pods on standard nodes may experience DNS resolution timeouts when trying to reach Amazon service endpoints such as Amazon Managed Service for Prometheus or CloudWatch.

**Cause:** The `kube-dns` service has endpoints from both standard CoreDNS pods and RIG CoreDNS pods. Standard node pods cannot reach RIG CoreDNS endpoints due to network isolation. When `kube-proxy` load-balances a DNS request from a standard node pod to a RIG CoreDNS endpoint, the request times out.

**Resolution:** Set `internalTrafficPolicy: Local` on the `kube-dns` service so that pods only reach CoreDNS on their local node:

```
kubectl patch svc kube-dns -n kube-system -p '{"spec":{"internalTrafficPolicy":"Local"}}'
```

After applying this patch, restart the affected observability pods:

```
kubectl delete pods -n hyperpod-observability -l app.kubernetes.io/name=hyperpod-node-collector
```

### Metrics from restricted nodes not reaching Amazon Managed Service for Prometheus


If metrics from restricted nodes are not appearing in your Amazon Managed Service for Prometheus workspace:

1. **Verify the execution role permissions.** Ensure that the execution role for the Restricted Instance Group has `aps:RemoteWrite` permission for your Prometheus workspace. For more information, see [Additional prerequisites for Restricted Instance Groups](hyperpod-observability-addon-setup.md#hyperpod-observability-addon-rig-prerequisites).

1. **Check the node collector pod status.** Run the following command and verify that node collector pods are running on restricted nodes:

   ```
   kubectl get pods -n hyperpod-observability | grep node-collector
   ```

1. **Check the central collector deployments.** In clusters with restricted nodes, the add-on deploys one central collector per network boundary. Verify that a central collector exists for each boundary:

   ```
   kubectl get deployments -n hyperpod-observability | grep central-collector
   ```

1. **Check pod events for errors.** Use `kubectl describe` on the collector pods to look for error events:

   ```
   kubectl describe pod collector-pod-name -n hyperpod-observability
   ```

If the issue persists after verifying the above, contact Amazon Web Services Support.

### Pod Identity verification does not apply to restricted instance group nodes


The [Verify Pod Identity association](#verify-pod-identity-association) troubleshooting steps apply only to standard nodes. On restricted nodes, the add-on uses the cluster instance group execution role for Amazon authentication instead of Amazon EKS Pod Identity. If metrics are missing from restricted nodes, verify the execution role permissions instead of the Pod Identity association.

### Fluent Bit not running on restricted nodes


This is expected behavior. Fluent Bit is intentionally not deployed on restricted nodes. Logs from restricted nodes are published to CloudWatch through the SageMaker HyperPod platform independently of the observability add-on. Use the **Cluster Logs** dashboard in Amazon Managed Grafana to view these logs.

# Observability with Amazon CloudWatch


Use [Amazon CloudWatch Container Insights](https://docs.amazonaws.cn/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize metrics and logs from the containerized applications and microservices running on the Amazon EKS cluster that is associated with your HyperPod cluster.

Amazon CloudWatch Insights collects metrics for compute resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

To find a complete list of metrics, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.amazonaws.cn/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html) in the *Amazon EKS User Guide*.

## Install CloudWatch Container Insights


Cluster admin users must set up CloudWatch Container Insights by following the instructions at [Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on or the Helm chart](https://docs.amazonaws.cn/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html) in the *CloudWatch User Guide*. For more information about the Amazon EKS add-on, see also [Install the Amazon CloudWatch Observability EKS add-on](https://docs.amazonaws.cn/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html) in the *Amazon EKS User Guide*.

After the installation completes, verify that the CloudWatch Observability add-on appears on the EKS cluster's **Add-ons** tab. It might take a couple of minutes for the dashboard to load.

**Note**  
SageMaker HyperPod requires CloudWatch Container Insights version v2.0.1-eksbuild.1 or later.

![\[CloudWatch Observability service card showing status, version, and IAM role information.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod-eks-CIaddon.png)


## Access CloudWatch Container Insights dashboard


1. Open the CloudWatch console at [https://console.amazonaws.cn/cloudwatch/](https://console.amazonaws.cn/cloudwatch/).

1. Choose **Insights**, and then choose **Container Insights**.

1. Select the EKS cluster set up with the HyperPod cluster you're using.

1. View the Pod/Cluster level metrics.

![\[Performance monitoring dashboard for EKS cluster showing node status, resource utilization, and pod metrics.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod-eks-CIdashboard.png)


## Access CloudWatch Container Insights logs


1. Open the CloudWatch console at [https://console.amazonaws.cn/cloudwatch/](https://console.amazonaws.cn/cloudwatch/).

1. Choose **Logs**, and then choose **Log groups**.

When you have HyperPod clusters integrated with Amazon CloudWatch Container Insights, you can access the relevant log groups in the following format: `/aws/containerinsights/<eks-cluster-name>/*`. Within these log groups, you can find and explore various types of logs, such as performance logs, host logs, application logs, and data plane logs.
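For example, you can run a CloudWatch Logs Insights query against the application log group to surface recent container errors. This is a sketch; the `kubernetes.pod_name` and `log` field names assume the default Fluent Bit log format that Container Insights uses for application logs.

```
fields @timestamp, kubernetes.pod_name, log
| filter log like /ERROR/
| sort @timestamp desc
| limit 20
```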