Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Troubleshooting the Amazon SageMaker HyperPod observability add-on

Use the following guidance to resolve common issues with the Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on.

Troubleshooting missing metrics in Amazon Managed Grafana

If metrics don't appear in your Amazon Managed Grafana dashboards, perform the following steps to identify and resolve the issue.

Verify the connection between Amazon Managed Service for Prometheus and Amazon Managed Grafana

  1. Sign in to the Amazon Managed Grafana console.

  2. In the left pane, choose All workspaces.

  3. In the Workspaces table, choose your workspace.

  4. In the details page of the workspace, choose the Data sources tab.

  5. Verify that the Amazon Managed Service for Prometheus data source exists.

  6. Check the connection settings:

    • Confirm that the endpoint URL is correct.

    • Verify that IAM authentication is properly configured.

    • Choose Test connection. Verify that the status message reads "Data source is working".
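To cross-check the endpoint URL from the command line, the following sketch prints the endpoint that Amazon Managed Service for Prometheus reports for a workspace, so you can compare it with the URL configured in the Grafana data source. It assumes the AWS CLI is installed and configured for the account and Region that host the workspace. The function name and the workspace ID argument are illustrative placeholders.

```shell
#!/usr/bin/env bash
# Print the Prometheus endpoint of an Amazon Managed Service for Prometheus
# workspace. The workspace ID (for example, ws-...) is passed as the first
# argument and is a placeholder for your own value.
amp_endpoint() {
  local workspace_id="$1"
  aws amp describe-workspace \
    --workspace-id "$workspace_id" \
    --query 'workspace.prometheusEndpoint' \
    --output text
}

# Example (hypothetical workspace ID):
#   amp_endpoint ws-example
```

If the printed endpoint differs from the URL in the data source settings, update the data source before you test the connection again.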

Verify the Amazon EKS add-on status

  1. Open the Amazon EKS console at https://console.amazonaws.cn/eks/home#/clusters.

  2. Select your cluster.

  3. Choose the Add-ons tab.

  4. Verify that the SageMaker HyperPod observability add-on is listed and that its status is ACTIVE.

  5. If the status isn't ACTIVE, copy the error message and contact Amazon Web Services Support.
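The same status check can be sketched with the AWS CLI, assuming the CLI is configured for the account and Region of the cluster. The add-on name is the one shown in the Pod Identity association properties in this guide. The function name and the cluster name argument are placeholders.

```shell
#!/usr/bin/env bash
# Print the status of the SageMaker HyperPod observability add-on
# on the given Amazon EKS cluster (first argument, a placeholder).
addon_status() {
  local cluster_name="$1"
  aws eks describe-addon \
    --cluster-name "$cluster_name" \
    --addon-name amazon-sagemaker-hyperpod-observability \
    --query 'addon.status' \
    --output text
}

# Example (hypothetical cluster name):
#   addon_status my-hyperpod-cluster
```

Any value other than ACTIVE indicates that the add-on needs attention.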

Verify Pod Identity association

  1. Open the Amazon EKS console at https://console.amazonaws.cn/eks/home#/clusters.

  2. Select your cluster.

  3. On the cluster details page, choose the Access tab.

  4. In the Pod Identity associations table, choose the association that has the following property values:

    • Namespace: hyperpod-observability

    • Service account: hyperpod-observability-operator-otel-collector

    • Add-on: amazon-sagemaker-hyperpod-observability

  5. Ensure that the IAM role that is attached to this association has the following permissions.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "PrometheusAccess",
          "Effect": "Allow",
          "Action": "aps:RemoteWrite",
          "Resource": "arn:aws:aps:Amazon Web Services Region:account-ID:workspace/workspace-ID"
        },
        {
          "Sid": "CloudwatchLogsAccess",
          "Effect": "Allow",
          "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
            "logs:PutLogEvents",
            "logs:GetLogEvents",
            "logs:FilterLogEvents",
            "logs:GetLogRecord",
            "logs:StartQuery",
            "logs:StopQuery",
            "logs:GetQueryResults"
          ],
          "Resource": [
            "arn:aws:logs:Amazon Web Services Region:account-ID:log-group:/aws/sagemaker/Clusters/*",
            "arn:aws:logs:Amazon Web Services Region:account-ID:log-group:/aws/sagemaker/Clusters/*:log-stream:*"
          ]
        }
      ]
    }
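To find the IAM role to inspect without using the console, the following sketch looks up the Pod Identity association by the namespace and service account listed above and prints the attached role ARN. It assumes the AWS CLI is configured for the cluster's account and Region. The function name and the cluster name argument are placeholders.

```shell
#!/usr/bin/env bash
# Print the ARN of the IAM role attached to the Pod Identity association
# used by the observability collector. The cluster name (first argument)
# is a placeholder for your own value.
pod_identity_role() {
  local cluster_name="$1"
  local association_id

  # Look up the association by the namespace and service account named in this guide.
  association_id="$(aws eks list-pod-identity-associations \
    --cluster-name "$cluster_name" \
    --namespace hyperpod-observability \
    --service-account hyperpod-observability-operator-otel-collector \
    --query 'associations[0].associationId' \
    --output text)"

  # With --output text, the AWS CLI prints "None" when the query matches nothing.
  if [ -z "$association_id" ] || [ "$association_id" = "None" ]; then
    echo "No Pod Identity association found" >&2
    return 1
  fi

  aws eks describe-pod-identity-association \
    --cluster-name "$cluster_name" \
    --association-id "$association_id" \
    --query 'association.roleArn' \
    --output text
}

# Example (hypothetical cluster name):
#   pod_identity_role my-hyperpod-cluster
```

You can then review the policies attached to the printed role in the IAM console and compare them with the permissions shown above.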

Check Amazon Managed Service for Prometheus throttling

  1. Sign in to the Amazon Web Services Management Console and open the Service Quotas console at https://console.amazonaws.cn/servicequotas/.

  2. In the Managed quotas box, search for and select Amazon Managed Service for Prometheus.

  3. Choose the Active series per workspace quota.

  4. In the Resource-level quotas tab, select your Amazon Managed Service for Prometheus workspace.

  5. Ensure that the utilization is less than your current quota.

  6. If you've reached the quota limit, select your workspace by choosing the radio button to its left, and then choose Request increase at resource level.
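As a rough command-line counterpart to the console steps, the following sketch reads the applied value of the Active series per workspace quota. It assumes that the Service Quotas service code for Amazon Managed Service for Prometheus is `aps` and that the quota name matches the console label exactly. Verify both in your account. The command returns the account-level applied value, which might differ from a resource-level value set for a specific workspace. The function name is a placeholder.

```shell
#!/usr/bin/env bash
# Print the applied value of the "Active series per workspace" quota.
# Assumptions: service code "aps" and the exact quota name shown below.
amp_active_series_quota() {
  aws service-quotas list-service-quotas \
    --service-code aps \
    --query "Quotas[?QuotaName=='Active series per workspace'].Value" \
    --output text
}

# Example:
#   amp_active_series_quota
```

Compare the printed value with the utilization shown in the Service Quotas console for your workspace.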

Troubleshooting add-on installation failures

If the observability add-on fails to install, use the following steps to diagnose and resolve the issue.

Check health probe status

  1. Open the Amazon EKS console at https://console.amazonaws.cn/eks/home#/clusters.

  2. Select your cluster.

  3. Choose the Add-ons tab.

  4. Choose the failed add-on.

  5. Review the Health issues section.

  6. Contact Amazon Web Services Support with the issue details.
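The health issues shown in the console can also be retrieved with the AWS CLI, which makes them easier to paste into a support case. This is a minimal sketch that assumes the CLI is configured for the cluster's account and Region. The function name and the cluster name argument are placeholders.

```shell
#!/usr/bin/env bash
# Print the health issues reported for the SageMaker HyperPod observability
# add-on as JSON. The cluster name (first argument) is a placeholder.
addon_health_issues() {
  local cluster_name="$1"
  aws eks describe-addon \
    --cluster-name "$cluster_name" \
    --addon-name amazon-sagemaker-hyperpod-observability \
    --query 'addon.health.issues' \
    --output json
}

# Example (hypothetical cluster name):
#   addon_health_issues my-hyperpod-cluster
```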

Review manager logs

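  1. Get the add-on manager pod:

    kubectl get pods -n hyperpod-observability | grep manager

  2. Check the logs. Replace addon-manager-pod-name with the pod name returned by the previous command, and use the same namespace in which the pod was found:

    kubectl logs -n hyperpod-observability addon-manager-pod-name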
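The two steps can be combined into one helper. The following sketch assumes that the manager pod runs in the hyperpod-observability namespace, where the first command looks for it, and that its name contains "manager". The function name and the 200-line tail length are illustrative choices.

```shell
#!/usr/bin/env bash
# Find the add-on manager pod and print its most recent log lines.
# Assumption: the pod lives in the hyperpod-observability namespace.
manager_logs() {
  local namespace="hyperpod-observability"
  local pod

  # List pod names only, then keep the first one whose name contains "manager".
  pod="$(kubectl get pods -n "$namespace" --no-headers \
    -o custom-columns=NAME:.metadata.name | grep manager | head -n 1)"

  if [ -z "$pod" ]; then
    echo "No manager pod found in namespace $namespace" >&2
    return 1
  fi

  kubectl logs -n "$namespace" "$pod" --tail=200
}

# Example:
#   manager_logs
```

If the function reports that no manager pod was found, list all pods in the namespace to confirm that the add-on created any workloads.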

For urgent issues, contact Amazon Web Services Support.