
# Using the training operator to run jobs
<a name="sagemaker-eks-operator-usage"></a>

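To run a job with kubectl, you must create a `job.yaml` file that specifies the job specification, and then run `kubectl apply -f job.yaml` to submit the job. In this YAML file, you can specify custom configurations in the `logMonitoringConfiguration` parameter to define automated monitoring rules. These rules analyze the log output of the distributed training job to detect problems and recover from them.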

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, consider the job hung. A non-decimal loss (e.g. nan) does not match the pattern, so it is treated as a hang after the same 5 minutes.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If the next checkpoint S3 upload doesn't happen within 10 minutes, mark the job as hung.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
```
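
Before you submit the job, it can help to check that each `logPattern` matches the lines your training script actually prints. The following is a minimal sketch of such a check. It assumes that the patterns behave like Python regular expressions and that the double-quoted YAML strings are unescaped once (so `\\d` becomes `\d`). The two sample log lines and the `extract_metric` helper are hypothetical and are not part of the operator.

```python
import re

# Patterns as a regex engine would receive them after YAML unescaping:
# inside a double-quoted YAML string, "\\d" is read as \d.
JOB_HANGING = r".*\[Epoch 0 Batch \d+.*'training_loss_step': (\d+(\.\d+)?).*"
LOW_THROUGHPUT = r".*\[Epoch 0 Batch \d+.*'samples\/sec': (\d+(\.\d+)?).*"

# Hypothetical log lines, invented for this sketch.
GOOD_LINE = "[Epoch 0 Batch 12] {'training_loss_step': 1.25, 'samples/sec': 76.5}"
NAN_LINE = "[Epoch 0 Batch 13] {'training_loss_step': nan, 'samples/sec': 76.5}"


def extract_metric(pattern, line):
    """Return the first capture group as a float, or None if the line does not match."""
    match = re.match(pattern, line)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    for pattern in (JOB_HANGING, LOW_THROUGHPUT):
        for line in (GOOD_LINE, NAN_LINE):
            print(extract_metric(pattern, line))
```

In this sketch, the line with a `nan` loss does not match the `JobHangingDetection` pattern at all, which is why a run of such lines would count as missing matches for that rule.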

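If you want to use the log monitoring options, make sure that your training script emits its logs to `sys.stdout`. The HyperPod elastic agent monitors the training logs in `sys.stdout`, which are saved in `/tmp/hyperpod/`. You can use the following command to emit training logs.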

```
import logging
import sys

logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
```
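
The following sketch shows one way to confirm that a message logged with this format still matches a monitoring pattern once the timestamp and level prefix are added. The logger name `train_sketch` and both helper functions are hypothetical. The in-memory buffer stands in for `sys.stdout` so that the emitted line can be inspected.

```python
import io
import logging
import re
import sys

LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"


def build_logger(stream=sys.stdout):
    """Create a logger that writes INFO and above to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger("train_sketch")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def emitted_line_matches(pattern, message):
    """Log a message to an in-memory stream and test the formatted line against a pattern."""
    buffer = io.StringIO()
    build_logger(buffer).info(message)
    return re.match(pattern, buffer.getvalue().strip()) is not None


if __name__ == "__main__":
    print(emitted_line_matches(r".*Train Epoch.*", "Train Epoch: 5 [0/60000 (0%)] loss=0.8470"))
```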

The following table describes all the possible log monitoring configurations:


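| Parameter | Usage | 
| --- | --- | 
| jobMaxRetryCount | Maximum number of restarts at the process level. | 
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level. | 
| restartPolicy: evalPeriodSeconds | The period, in seconds, over which the restart limit is evaluated. | 
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails. | 
| cleanPodPolicy | Specifies the pods that the operator should clean up. Accepted values are All, OnlyComplete, and None. | 
| logMonitoringConfiguration | The log monitoring rules for detecting slow and hanging jobs. | 
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches, after which the rule evaluates to HANGING. If not specified, there is no time constraint between consecutive LogPattern matches. | 
| expectedStartCutOffInSeconds | Time allowed until the first LogPattern match, after which the rule evaluates to HANGING. If not specified, there is no time constraint for the first LogPattern match. | 
| logPattern | Regular expression that identifies the log lines that the rule applies to while the rule is active. | 
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before the job is marked as SLOW. If not specified, the default is 1. | 
| metricThreshold | Threshold for the value that LogPattern extracts with a capturing group. If not specified, metric evaluation is not performed. | 
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq. | 
| stopPattern | Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule is always active. | 
| faultOnMatch | Indicates whether a LogPattern match should immediately trigger a job fault. If true, the job is marked as faulted as soon as LogPattern matches, regardless of the other rule parameters. If false or not specified, the rule evaluates to SLOW or HANGING based on the other parameters. | 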
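
To make the interaction between `metricThreshold`, `operator`, and `metricEvaluationDataPoints` concrete, here is a small sketch of the semantics as the table describes them. It is an illustration only, not the operator's implementation: the `is_slow` function is hypothetical, and it assumes that "consecutive" means an unbroken run of extracted values that satisfy the comparison.

```python
import operator as op

# Comparison names taken from the `operator` row of the table.
COMPARATORS = {"gt": op.gt, "gteq": op.ge, "lt": op.lt, "lteq": op.le, "eq": op.eq}


def is_slow(values, threshold, comparator, data_points=1):
    """Return True once `data_points` consecutive values satisfy the comparison."""
    compare = COMPARATORS[comparator]
    streak = 0
    for value in values:
        # A value that fails the comparison resets the run of consecutive hits.
        streak = streak + 1 if compare(value, threshold) else 0
        if streak >= data_points:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical throughput samples checked against "lteq 80" with 2 data points.
    print(is_slow([90, 79, 78, 85], threshold=80, comparator="lteq", data_points=2))
```

Under this reading, a single good data point between two bad ones resets the count, so brief dips do not trigger a restart.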

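To improve training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to continue running the job on nodes that were reserved in advance. Spare node configurations require Kueue, so if you submit a job with spare nodes but Kueue is not installed, the job fails. The following is a sample `job.yaml` file that contains a spare node configuration.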

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"
```

## Monitoring
<a name="sagemaker-eks-operator-usage-monitoring"></a>

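Amazon SageMaker HyperPod integrates [observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html), so you can set up monitoring to collect metrics and feed them into these observability tools.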

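Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without using managed observability. To do so, include the metrics that you want to monitor in your `job.yaml` file when you run jobs with `kubectl`.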

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s
```

The following are the events that the training operator emits. You can feed them into Amazon Managed Service for Prometheus to monitor your training jobs.


| Event | Description | 
| --- | --- | 
| hyperpod_training_operator_jobs_created_total | Total number of jobs that the training operator has run | 
| hyperpod_training_operator_jobs_restart_latency | Current job restart latency | 
| hyperpod_training_operator_jobs_fault_detection_latency | Fault detection latency | 
| hyperpod_training_operator_jobs_deleted_total | Total number of deleted jobs | 
| hyperpod_training_operator_jobs_successful_total | Total number of completed jobs | 
| hyperpod_training_operator_jobs_failed_total | Total number of failed jobs | 
| hyperpod_training_operator_jobs_restarted_total | Total number of automatically restarted jobs | 
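
If you want to inspect these values outside of a dashboard, the following sketch parses text in the Prometheus exposition format into a dictionary. It assumes that the `/metrics` endpoint returns the standard text format and that label values do not contain a `}` character. The sample payload and its numbers are invented for illustration, and `parse_samples` is a hypothetical helper.

```python
# Hypothetical scrape output; the values are invented for this sketch.
SAMPLE = """\
# TYPE hyperpod_training_operator_jobs_created_total counter
hyperpod_training_operator_jobs_created_total 12
hyperpod_training_operator_jobs_failed_total 3
hyperpod_training_operator_jobs_restarted_total{reason="hang"} 4
"""


def parse_samples(exposition_text):
    """Parse Prometheus text-format lines into {metric_name: value}, dropping labels."""
    samples = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            head, rest = line.split("}", 1)
            name = head.split("{", 1)[0]
            fields = rest.split()
        else:
            name, *fields = line.split()
        # The first field after the name (and labels) is the sample value;
        # an optional trailing timestamp is ignored.
        samples[name] = float(fields[0])
    return samples


if __name__ == "__main__":
    metrics = parse_samples(SAMPLE)
    print(metrics["hyperpod_training_operator_jobs_failed_total"])
```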

## Sample docker configuration
<a name="sagemaker-eks-operator-usage-docker"></a>

The following is a sample docker file that you can run with the `hyperpodrun` command.

```
export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}
```

## Sample log monitoring configurations
<a name="sagemaker-eks-operator-usage-log-monitoring"></a>

**Job hang detection**

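To detect hanging jobs, use the following configuration. It uses the following parameters:
+ expectedStartCutOffInSeconds: how long the monitor waits for the first matching log line
+ expectedRecurringFrequencyInSeconds: the time interval within which the next matching log line must appear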

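With these settings, the log monitor expects a log line that matches the regex pattern `.*Train Epoch.*` within 60 seconds after the training job starts. After the first match, the monitor expects a matching log line at least every 10 seconds. If the first log line doesn't appear within 60 seconds, or if subsequent log lines don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.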

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "JobStartGracePeriod"
      # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
      logPattern: ".*Train Epoch.*"
      expectedStartCutOffInSeconds: 60
    - name: "JobHangingDetection"
      logPattern: ".*Train Epoch.*"
      expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
```
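
The timing rules can be sketched as a small function. This is one reading of the two parameters rather than the elastic agent's actual logic: `is_hanging` is a hypothetical helper, timestamps are seconds since the job started, and the sketch combines both rules in one call even though the configuration declares them as two separate entries.

```python
def is_hanging(match_times, now, start_cutoff=None, recurring_frequency=None):
    """Evaluate the start cut-off and recurring-frequency rules against match timestamps."""
    times = sorted(match_times)

    # Start cut-off: the first match (or "now", if nothing matched yet) must not
    # be later than the cut-off.
    first_seen = times[0] if times else now
    if start_cutoff is not None and first_seen > start_cutoff:
        return True

    # Recurring frequency: only meaningful once at least one match was seen.
    if recurring_frequency is None or not times:
        return False
    checkpoints = times + [now]
    return any(
        later - earlier > recurring_frequency
        for earlier, later in zip(checkpoints, checkpoints[1:])
    )


if __name__ == "__main__":
    # Hypothetical match times at 5 s and 14 s, evaluated 30 s into the job.
    print(is_hanging([5, 14], now=30, start_cutoff=60, recurring_frequency=10))
```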

**Training loss spike**

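The following monitoring configuration applies to training logs that contain the pattern `xxx training_loss_step xx`. It uses the parameter `metricEvaluationDataPoints`, which lets you specify how many data points must cross the threshold before the operator restarts the job. If the training loss value is greater than 2.0 for 5 data points, the operator restarts the job.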

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
```

**Low TFLOPs detection**

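The following monitoring configuration expects training logs that contain the pattern `xx TFLOPs xx` every five seconds. If the TFLOPs value is less than 100 for 5 data points, the operator restarts the training job.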

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
```
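
Because this pattern captures with `(.+)`, it is worth checking exactly what text the capture group extracts from your own log lines. The following sketch does that with Python's `re` module. The sample log line is hypothetical, `captured` is an illustrative helper, and `TIGHTER` is only a suggested alternative pattern, not something the operator requires.

```python
import re

PATTERN = r".* (.+)TFLOPs.*"                     # pattern from the configuration above
TIGHTER = r".* (\d+(?:\.\d+)?)\s*TFLOPs.*"       # hypothetical alternative: digits only

# Hypothetical log line, modeled on the comment in the configuration.
SAMPLE_LINE = "Training model, speed: 95.3 TFLOPs..."


def captured(pattern, line):
    """Return the text of the first capture group, or None if the line does not match."""
    match = re.match(pattern, line)
    return match.group(1) if match else None


if __name__ == "__main__":
    # repr() makes any surrounding whitespace in the captured text visible.
    print(repr(captured(PATTERN, SAMPLE_LINE)))
    print(repr(captured(TIGHTER, SAMPLE_LINE)))
```

With a space between the number and `TFLOPs`, the original pattern captures the number together with the trailing space. Python's `float()` tolerates that, but whether the operator's own parser does is not stated here, so a digits-only capture group may be the safer choice.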

**Training script error log detection**

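The following monitoring configuration detects whether the pattern specified in `logPattern` is present in the training logs. As soon as the training operator encounters the error pattern, it treats the match as a fault and restarts the job.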

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "GPU Error"
      logPattern: ".*RuntimeError.*out of memory.*"
      faultOnMatch: true
```