SageMaker HyperPod health-monitoring agent - Amazon SageMaker AI

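SageMaker HyperPod health-monitoring agent

The SageMaker HyperPod health-monitoring agent continuously monitors the health of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.

Health checks performed by the SageMaker HyperPod health-monitoring agent

The SageMaker HyperPod health-monitoring agent checks the following.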

NVIDIA GPU

Amazon Trainium

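Logs generated by the SageMaker HyperPod health-monitoring agent

The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. The agent publishes the health events it detects on GPU or Trn instances to CloudWatch, under the cluster log group /aws/sagemaker/Clusters/

Detection logs from the HyperPod health-monitoring agent are written to a separate log stream for each node, named SagemakerHealthMonitoringAgent. You can query the detection logs with CloudWatch Logs Insights as follows.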

fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/

The output should look similar to the following.

2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
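To work with these events programmatically, the same query can be submitted through a CloudWatch Logs client and each returned message parsed as JSON. The following Python sketch is one way to do that, not an official tool. The names `parse_detection_event`, `query_detection_events`, `QUERY`, and `XID_PATTERN` are hypothetical. The sketch assumes that messages have the shape shown in the sample output above (including the literal `"details: "` key) and that `logs_client` behaves like a boto3 CloudWatch Logs client exposing `start_query` and `get_query_results`.

```python
import json
import re
import time

# The Logs Insights query shown above, unchanged.
QUERY = "fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/"

# Matches the numeric Xid code in text such as "NVRM: Xid (PCI:0000:b9:00): 71, ..."
XID_PATTERN = re.compile(r"Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")


def parse_detection_event(message):
    """Extract a few fields from one detection-log message.

    Returns None if the text is not valid JSON or is not a detection event.
    """
    # A display timestamp may precede the JSON; keep only the JSON part.
    start = message.find("{")
    if start == -1:
        return None
    try:
        record = json.loads(message[start:])
    except json.JSONDecodeError:
        return None
    if "HealthMonitoringAgentDetectionEvent" not in record:
        return None
    # Assumption: the key is literally "details: ", as in the sample output.
    details = record.get("details: ", {})
    xid = XID_PATTERN.search(details.get("message", ""))
    return {
        "severity": details.get("severity"),
        "reason": details.get("reason"),
        "event_time": details.get("timestamp"),
        "xid": int(xid.group(1)) if xid else None,
    }


def query_detection_events(logs_client, log_group, start_epoch, end_epoch, poll_seconds=1.0):
    """Run QUERY against log_group and return the parsed detection events."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,  # epoch seconds
        endTime=end_epoch,      # epoch seconds
        queryString=QUERY,
    )["queryId"]

    # Logs Insights queries are asynchronous, so poll until the query finishes.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError("query ended with status " + response["status"])

    events = []
    for row in response["results"]:
        # Each row is a list of {"field": ..., "value": ...} cells.
        fields = {cell["field"]: cell["value"] for cell in row}
        parsed = parse_detection_event(fields.get("@message", ""))
        if parsed is not None:
            events.append(parsed)
    return events
```

With boto3 installed and credentials configured, a client created by `boto3.client("logs")` could be passed as `logs_client`, together with the name of your cluster's log group and a time range in epoch seconds. Passing the client in, rather than creating it inside the function, keeps the sketch easy to test with a stub.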