

# Using the HyperPod training operator
<a name="sagemaker-eks-operator"></a>

 The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs. Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly. 

 The operator also works with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution and automatic monitoring of critical metrics like loss spikes and throughput degradation. You can define recovery policies through simple YAML configurations without code changes, allowing you to quickly respond to and recover from unrecoverable training states. These monitoring and recovery capabilities work together to maintain optimal training performance while minimizing operational overhead.

 While Kueue is not required for this training operator, your cluster administrator can install and configure it for enhanced job scheduling capabilities. For more information, see the [official documentation for Kueue](https://kueue.sigs.k8s.io/docs/overview/).

**Note**  
To use the training operator, you must use the latest [ HyperPod AMI release](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html). To upgrade, use the [ UpdateClusterSoftware](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API operation. If you use [ HyperPod task governance](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html), it must also be the latest version.

## Supported versions
<a name="sagemaker-eks-operator-supported-versions"></a>

 The HyperPod training operator works only with specific versions of Kubernetes, Kueue, and HyperPod. The following is the complete list of compatible versions. 
+ Supported Kubernetes versions – 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33
+ Suggested Kueue versions – [v0.12.2](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.2) and [v0.12.3](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.3)
+ The latest HyperPod AMI release. To upgrade to the latest AMI release, use the [ UpdateClusterSoftware](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API.
+ [PyTorch 2.4.0 – 2.7.1](https://github.com/pytorch/pytorch/releases)

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality.

# Installing the training operator
<a name="sagemaker-eks-operator-install"></a>

See the following sections to learn how to install the training operator.

## Prerequisites
<a name="sagemaker-eks-operator-prerequisites"></a>

 Before you use the HyperPod training operator, you must have completed the following prerequisites: 
+  [ Created a HyperPod cluster with Amazon EKS orchestration](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). 
+ Installed the latest AMI on your HyperPod cluster. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md).
+ [Installed cert-manager](https://cert-manager.io/docs/installation/).
+  [ Set up the EKS Pod Identity Agent using the console](https://docs.amazonaws.cn/eks/latest/userguide/pod-id-agent-setup.html). If you want to use the Amazon CLI, use the following command: 

  ```
  aws eks create-addon \
    --cluster-name my-eks-cluster \
    --addon-name eks-pod-identity-agent \
    --region Amazon Web Services Region
  ```
+ (Optional) If you run your HyperPod cluster nodes in a private VPC, you must set up PrivateLink VPC endpoints for the Amazon SageMaker AI API (`com.amazonaws.aws-region.sagemaker.api`) and the Amazon EKS Auth service (`com.amazonaws.aws-region.eks-auth`). You must also make sure that your cluster nodes run in subnets with a security group that allows traffic to route through the VPC endpoints to communicate with SageMaker AI and Amazon EKS. If these aren't properly set up, the add-on installation can fail. To learn more about setting up VPC endpoints, see [Create a VPC endpoint](https://docs.amazonaws.cn/vpc/latest/privatelink/create-interface-endpoint.html#create-interface-endpoint-aws).

## Installing the training operator
<a name="sagemaker-eks-operator-install-operator"></a>

 You can now install the HyperPod training operator through the SageMaker AI console, the Amazon EKS console, or with the Amazon CLI. The console methods offer simplified experiences that help you install the operator. The Amazon CLI offers a programmatic approach that lets you customize more of your installation.

Between the two console experiences, SageMaker AI provides a one-click installation that creates the IAM execution role, creates the pod identity association, and installs the operator. The Amazon EKS console installation is similar, but it doesn't automatically create the IAM execution role. During this process, you can choose to create a new IAM execution role with information that the console pre-populates. By default, these created roles only have access to the cluster that you're installing the operator in. If you remove and reinstall the operator, you must create a new role, unless you edit the role's permissions to include other clusters. 

------
#### [ SageMaker AI console (recommended) ]

1. Open the Amazon SageMaker AI console at [https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **Amazon SageMaker HyperPod training operator**, and choose **install**. During the installation process, SageMaker AI creates an IAM execution role with permissions similar to the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.amazonaws.cn/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and creates a pod identity association between your Amazon EKS cluster and your new execution role.

------
#### [ Amazon EKS console ]

**Note**  
If you install the add-on through the Amazon EKS console, first make sure that you've tagged your HyperPod cluster with the key-value pair `SageMaker:true`. Otherwise, the installation will fail.

1. Open the Amazon EKS console at [https://console.amazonaws.cn/eks/home#/clusters](https://console.amazonaws.cn/eks/home#/clusters).

1. Go to your EKS cluster, choose **Add-ons**, then choose **Get more add-ons**.

1. Choose Amazon SageMaker HyperPod training operator, then choose **Next**.

1. Under **Version**, the console defaults to the latest version, which we recommend that you use.

1. Under **Add-on access**, choose a pod identity IAM role to use with the training operator add-on. If you don't already have a role, choose **Create recommended role** to create one.

1. During this role creation process, the IAM console pre-populates all of the necessary information, such as the use case, the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.amazonaws.cn/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and other required permissions, the role name, and the description. As you go through the steps, review the information, and choose **Create role**.

1. In the EKS console, review your add-on's settings, and then choose **Create**.

------
#### [ CLI ]

1. Make sure that the IAM execution role for your HyperPod cluster has a trust relationship that allows EKS Pod Identity to assume the role, or [create a new IAM role](https://docs.amazonaws.cn/IAM/latest/UserGuide/id_roles_create.html) with the following trust policy. Alternatively, you can use the Amazon EKS console to install the add-on, which creates a recommended role.

------
#### [ JSON ]

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
         "Effect": "Allow",
         "Principal": {
           "Service": "pods.eks.amazonaws.com"
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession",
           "eks-auth:AssumeRoleForPodIdentity"
         ]
       }
     ]
   }
   ```

------

1.  Attach the [ AmazonSageMakerHyperPodTrainingOperatorAccess managed policy](https://docs.amazonaws.cn/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) to your created role. 

1.  Then [create a pod identity association](https://docs.amazonaws.cn/eks/latest/userguide/pod-identities.html) between your EKS cluster and your new IAM role.

   ```
   aws eks create-pod-identity-association \
     --cluster-name my-eks-cluster \
     --role-arn ARN of your execution role \
     --namespace aws-hyperpod \
     --service-account hp-training-operator-controller-manager \
     --region Amazon Web Services Region
   ```

1.  After you finish the process, you can use the ListPodIdentityAssociations operation to see the association you created. The following is a sample response of what it might look like. 

   ```
   aws eks list-pod-identity-associations --cluster-name my-eks-cluster
   {
       "associations": [{
           "clusterName": "my-eks-cluster",
           "namespace": "aws-hyperpod",
           "serviceAccount": "hp-training-operator-controller-manager",
           "associationArn": "arn:aws:eks:us-east-2:123456789012:podidentityassociation/my-hyperpod-cluster/a-1a2b3c4d5e6f7g8h9",
           "associationId": "a-1a2b3c4d5e6f7g8h9"
       }]
   }
   ```

1. To install the training operator, use the `create-addon` operation. The `--addon-version` parameter is optional. If you don’t provide one, the default is the latest version. To get the possible versions, use the [ DescribeAddonVersions](https://docs.amazonaws.cn/eks/latest/APIReference/API_DescribeAddonVersions.html) operation.

   ```
   aws eks create-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --resolve-conflicts OVERWRITE
   ```

------
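Regardless of which installation method you use, you can verify the result with the `describe-addon` operation. The following command is a sketch against a hypothetical cluster name; a healthy installation reports a status of `ACTIVE`.

```
aws eks describe-addon \
  --cluster-name my-eks-cluster \
  --addon-name amazon-sagemaker-hyperpod-training-operator \
  --query 'addon.status'
```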

If you already have the training operator installed on your HyperPod cluster, you can update the EKS add-on to the version that you want. If you want to use [ checkpointless training](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) or [ elastic training](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-elastic-training.html), consider the following:
+ Both checkpointless training and elastic training require the EKS add-on to be on version 1.2.0 or above.
+ The Amazon SageMaker HyperPod training operator maintains backwards compatibility for any EKS add-on version, so you can upgrade from any add-on version to 1.2.0 or above.
+ If you downgrade from versions 1.2.0 or above to a lower version, you must first delete the existing jobs before the downgrade and resubmit the jobs after the downgrade is complete.

------
#### [ Amazon EKS Console ]

1. Open the Amazon EKS console at [https://console.amazonaws.cn/eks/home#/clusters](https://console.amazonaws.cn/eks/home#/clusters).

1. Go to your EKS cluster, and choose **Add-ons**. Then, choose the Amazon SageMaker HyperPod training operator add-on and choose **Edit**.

1. In the **Version** menu, choose the version of the add-on that you want, then choose **Save changes**.

------
#### [ CLI ]

1. First get the list of the supported versions of the add-on for your cluster.

   ```
   aws eks describe-addon-versions \
     --kubernetes-version $(aws eks describe-cluster --name my-eks-cluster --query 'cluster.version' --output text) \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --query 'addons[0].addonVersions[].addonVersion' \
     --output table
   ```

1. Then update the add-on to the version that you want.

   ```
   aws eks update-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --addon-version target-version \
     --resolve-conflicts OVERWRITE
   ```

------

 The training operator comes with a number of options with default values that might fit your use case. We recommend that you try the training operator with the default values before changing them. The following table describes each parameter and its default value.


| Parameter | Description | Default | 
| --- | --- | --- | 
| hpTrainingControllerManager.manager.resources.requests.cpu | How many processors to allocate for the controller | 1 | 
| hpTrainingControllerManager.manager.resources.requests.memory | How much memory to allocate to the controller | 2Gi | 
| hpTrainingControllerManager.manager.resources.limits.cpu | The CPU limit for the controller | 2 | 
| hpTrainingControllerManager.manager.resources.limits.memory | The memory limit for the controller | 4Gi | 
| hpTrainingControllerManager.nodeSelector | Node selector for the controller pods | Default behavior is to select nodes with the label sagemaker.amazonaws.com/compute-type: "HyperPod" | 

## HyperPod elastic agent
<a name="sagemaker-eks-operator-elastic-agent"></a>

The HyperPod elastic agent is an extension of [PyTorch’s ElasticAgent](https://docs.pytorch.org/docs/stable/elastic/agent.html). It orchestrates the lifecycles of training workers in each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must first install the HyperPod elastic agent into your training image before you can submit and run jobs using the operator. The following is a sample Dockerfile that installs the elastic agent and uses `hyperpodrun` as the job launcher.

**Note**  
Both [ checkpointless training](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) and [ elastic training](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-elastic-training.html) require that you use HyperPod elastic agent version 1.1.0 or above.

```
RUN pip install hyperpod-elastic-agent

ENTRYPOINT ["entrypoint.sh"]
# entrypoint.sh
...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
            --rdzv-backend hyperpod \ # Optional
            --inprocess-restart \ # Optional (in-process fault recovery with checkpointless training)
            ... # Other torchrun args
            # pre-train arg_group
            --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
            # post-train arg_group
            --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
            training.py --script-args
```

You can now submit jobs with `kubectl`.

### HyperPod elastic agent arguments
<a name="sagemaker-eks-operator-elastic-agent-args"></a>

 The HyperPod elastic agent supports all of PyTorch Elastic Agent's original arguments and adds some of its own. The following table lists all of the arguments available in the HyperPod elastic agent. For more information about PyTorch's Elastic Agent, see the [official documentation](https://docs.pytorch.org/docs/stable/elastic/agent.html). 


| Argument | Description | Default Value | 
| --- | --- | --- | 
| --shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" | 
| --shutdown-timeout | Timeout in seconds between shutdown-signal and SIGKILL signals | 15 | 
| --server-host | Agent server address | "0.0.0.0" | 
| --server-port | Agent server port | 8080 | 
| --server-log-level | Agent server log level | "info" | 
| --server-shutdown-timeout | Server shutdown timeout in seconds | 300 | 
| --pre-train-script | Path to pre-training script | None | 
| --pre-train-args | Arguments for pre-training script | None | 
| --post-train-script | Path to post-training script | None | 
| --post-train-args | Arguments for post-training script | None | 
| --inprocess-restart | Flag specifying whether to use the in-process restart feature | FALSE | 
| --inprocess-timeout | Time in seconds that the agent waits for workers to reach a synchronization barrier before triggering a process-level restart. | None | 

## Task governance (optional)
<a name="sagemaker-eks-operator-task-governance"></a>

The training operator is integrated with [ HyperPod task governance](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance), a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. To set up HyperPod task governance, see [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md). 

**Note**  
When installing the HyperPod task governance add-on, you must use version v1.3.0-eksbuild.1 or higher.

When submitting a job, make sure that you include your queue name and priority class labels. For example, if you're using Kueue, your labels become the following:
+ kueue.x-k8s.io/queue-name: hyperpod-ns-*team-name*-localqueue
+ kueue.x-k8s.io/priority-class: *priority-class-name*-priority

The following is an example of what your configuration file might look like:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPytorchJob
metadata:
  name: hp-task-governance-sample
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: priority-class-priority
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 4
      spares: 2
      template:
        spec:
          containers:
            - name: ptjob
              image: XXXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: "2"
```

Then use the following kubectl command to apply the YAML file.

```
kubectl apply -f task-governance-job.yaml
```

## Kueue (optional)
<a name="sagemaker-eks-operator-kueue"></a>

While you can run jobs directly, your organization can also integrate the training operator with Kueue to allocate resources and schedule jobs. Follow the steps below to install Kueue into your HyperPod cluster.

1. Follow the installation guide in the [ official Kueue documentation](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version). When you reach the step of configuring `controller_manager_config.yaml`, add the following configuration:

   ```
   externalFrameworks:
   - "HyperPodPytorchJob.v1.sagemaker.amazonaws.com"
   ```

1. Follow the rest of the steps in the official installation guide. After you finish installing Kueue, you can create some sample queues with the `kubectl apply -f sample-queues.yaml` command. Use the following YAML file.

   ```
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: cluster-queue
   spec:
     namespaceSelector: {}
     preemption:
       withinClusterQueue: LowerPriority
     resourceGroups:
     - coveredResources:
       - cpu
       - nvidia.com/gpu
       - pods
       flavors:
       - name: default-flavor
         resources:
         - name: cpu
           nominalQuota: 16
         - name: nvidia.com/gpu
           nominalQuota: 16
         - name: pods
           nominalQuota: 16
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     name: user-queue
     namespace: default
   spec:
     clusterQueue: cluster-queue
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: default-flavor
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: High priority
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority-class
   value: 1000
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: Low Priority
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority-class
   value: 500
   ```
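After applying the file, you can confirm that the queues and priority classes exist with `kubectl`. For example:

```
kubectl get clusterqueue cluster-queue
kubectl get localqueue user-queue --namespace default
kubectl get workloadpriorityclass high-priority-class low-priority-class
```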

# Using the training operator to run jobs
<a name="sagemaker-eks-operator-usage"></a>

 To use kubectl to run a job, create a `job.yaml` file that specifies the job configuration, and run `kubectl apply -f job.yaml` to submit the job. In this YAML file, you can specify custom configurations in the `logMonitoringConfiguration` argument to define automated monitoring rules that analyze log outputs from the distributed training job to detect problems and recover. 

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If next batch is not printed within 5 minute, consider it hangs. Or if loss is not decimal (e.g. nan) for 2 minutes, mark it hang as well.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If next checkpoint s3 upload doesn't happen within 10 mins, mark it hang.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
```

If you want to use the log monitoring options, make sure that you’re emitting the training logs to `sys.stdout`. The HyperPod elastic agent monitors training logs in `sys.stdout`, which it saves at `/tmp/hyperpod/`. You can use the following logging configuration to emit training logs.

```
import logging
import sys

logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
```

 The following table describes all of the possible `runPolicy` and log monitoring configurations: 


| Parameter | Usage | 
| --- | --- | 
| jobMaxRetryCount | Maximum number of restarts at the process level. | 
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level. | 
| restartPolicy: evalPeriodSeconds | The period of evaluating the restart limit in seconds | 
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails. | 
| cleanPodPolicy | Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None. | 
| logMonitoringConfiguration | The log monitoring rules for slow and hanging job detection | 
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING. If not specified, no time constraint exists between consecutive LogPattern matches. | 
| expectedStartCutOffInSeconds | Time to first LogPattern match after which the rule evaluates to HANGING. If not specified, no time constraint exists for the first LogPattern match. | 
| logPattern | Regular expression that identifies log lines that the rule applies to when the rule is active | 
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before marking a job as SLOW. If not specified, the default is 1. | 
| metricThreshold | Threshold for value extracted by LogPattern with a capturing group. If not specified, metric evaluation is not performed. | 
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq. | 
| stopPattern | Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule will always be active. | 
| faultOnMatch | Indicates whether a match of LogPattern should immediately trigger a job fault. When true, the job will be marked as faulted as soon as the LogPattern is matched, regardless of other rule parameters. When false or not specified, the rule will evaluate to SLOW or HANGING based on other parameters. | 
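To see how a rule with a capturing group behaves, the following minimal Python sketch applies the `LowThroughputDetection` pattern from the earlier example to a hypothetical log line. This illustrates the mechanics only; it is not the operator's implementation, and the sample log line and variable names are invented for the example.

```
import re

# Illustrative sketch only (not the operator's source). The logPattern below
# is the LowThroughputDetection rule; its capturing group extracts the
# samples/sec value for comparison against metricThreshold.
log_pattern = r".*\[Epoch 0 Batch \d+.*'samples/sec': (\d+(\.\d+)?).*"
sample_line = "[Epoch 0 Batch 42] metrics: {'samples/sec': 72.5, 'loss': 0.81}"  # hypothetical log line

match = re.match(log_pattern, sample_line)
value = float(match.group(1))  # extracts 72.5

# operator: "lteq" with metricThreshold: 80 -> this data point counts as SLOW
is_slow = value <= 80
print(value, is_slow)
```

With `metricEvaluationDataPoints: 25`, 25 consecutive data points like this one would have to evaluate to SLOW before the operator acts.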

 For more training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to use nodes reserved in advance to continue running the job. Spare node configurations require Kueue, so if you try to submit a job with spare nodes but don’t have Kueue installed, the job will fail. The following example is a sample `job.yaml` file that contains spare node configurations.

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"
```

## Monitoring
<a name="sagemaker-eks-operator-usage-monitoring"></a>

Amazon SageMaker HyperPod is integrated with [ observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html), so you can set up monitoring to collect and feed metrics into these observability tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor into your `job.yaml` file when you run jobs with `kubectl`.

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s
```

The following are metrics that the training operator emits that you can feed into Amazon Managed Service for Prometheus to monitor your training jobs.


| Metric | Description | 
| --- | --- | 
| hyperpod_training_operator_jobs_created_total | Total number of jobs that the training operator has run | 
| hyperpod_training_operator_jobs_restart_latency | Current job restart latency | 
| hyperpod_training_operator_jobs_fault_detection_latency | Fault detection latency | 
| hyperpod_training_operator_jobs_deleted_total | Total number of deleted jobs | 
| hyperpod_training_operator_jobs_successful_total | Total number of completed jobs | 
| hyperpod_training_operator_jobs_failed_total | Total number of failed jobs | 
| hyperpod_training_operator_jobs_restarted_total | Total number of auto-restarted jobs | 
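Once these metrics are scraped, you can query them with standard PromQL. For example, the following query (a sketch, assuming Prometheus-style underscore metric names) charts the per-second job failure rate over a 5-minute window:

```
rate(hyperpod_training_operator_jobs_failed_total[5m])
```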

## Sample docker configuration
<a name="sagemaker-eks-operator-usage-docker"></a>

The following is a sample entrypoint script that you can use in your Docker image; it launches training with the `hyperpodrun` command.

```
export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}
```

## Sample log monitoring configurations
<a name="sagemaker-eks-operator-usage-log-monitoring"></a>

**Job hang detection**

To detect hanging jobs, use the following configuration. It uses the following parameters:
+ expectedStartCutOffInSeconds – how long the monitor should wait before expecting the first logs
+ expectedRecurringFrequencyInSeconds – the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line matching the regex pattern `.*Train Epoch.*` within 60 seconds after the training job starts. After the first appearance, the monitor expects to see matching log lines every 10 seconds. If the first logs don't appear within 60 seconds or subsequent logs don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.

```
runPolicy:
    jobMaxRetryCount: 10
    cleanPodPolicy: "None"
    logMonitoringConfiguration:
      - name: "JobStartGracePeriod"
        # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
        logPattern: ".*Train Epoch.*"  
        expectedStartCutOffInSeconds: 60 
      - name: "JobHangingDetection"
        logPattern: ".*Train Epoch.*"
        expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
```

**Training loss spike**

The following monitoring configuration watches training logs that match the pattern `xxx training_loss_step xx`. It uses the `metricEvaluationDataPoints` parameter, which lets you specify the number of consecutive data points that must cross the threshold before the operator restarts the job. If the training loss value is more than 2.0 for 5 consecutive data points, the operator restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
```
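The following Python sketch mimics how `metricEvaluationDataPoints` gates the restart decision for the loss-spike rule above. The counting logic and log lines are hypothetical illustrations of the documented behavior, not the operator's source.

```
import re

# Hypothetical sketch of the LossSpikeDetection rule: restart only after the
# threshold is exceeded for metricEvaluationDataPoints consecutive matches.
pattern = r".*training_loss_step (\d+(?:\.\d+)?).*"
threshold, needed = 2.0, 5

losses = [1.2, 2.5, 2.7, 3.1, 2.9, 2.6, 2.4]
lines = [f"step {i} training_loss_step {loss}" for i, loss in enumerate(losses)]

consecutive, tripped_at = 0, None
for i, line in enumerate(lines):
    m = re.match(pattern, line)
    if m and float(m.group(1)) > threshold:  # operator: "gt"
        consecutive += 1
        if consecutive >= needed and tripped_at is None:
            tripped_at = i  # the operator would restart the job here
    else:
        consecutive = 0

print(tripped_at)  # 5 -> fifth consecutive point above 2.0
```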

**Low TFLOPs detection**

The following monitoring configuration watches training logs that match the pattern `xx TFLOPs xx`, which is expected every five seconds. If TFLOPs is less than 100 for 5 consecutive data points, the operator restarts the training job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
```

**Training script error log detection**

The following monitoring configuration detects if the pattern specified in `logPattern` is present in the training logs. As soon as the training operator encounters the error pattern, the training operator treats it as a fault and restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "GPU Error"
      logPattern: ".*RuntimeError.*out of memory.*"
      faultOnMatch: true
```

# Troubleshooting
<a name="sagemaker-eks-operator-troubleshooting"></a>

See the following sections to learn how to troubleshoot errors when using the training operator.

## I can't install the training operator
<a name="sagemaker-eks-operator-troubleshooting-installation-error"></a>

If you can't install the training operator, make sure that you're using the [ supported versions of components](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-operator.html#sagemaker-eks-operator-supported-versions). For example, if you get an error that your HyperPod AMI release is incompatible with the training operator, [ update to the latest version](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html).

## Incompatible HyperPod task governance version
<a name="sagemaker-eks-operator-troubleshooting-task-governance-version"></a>

During installation, you might get an error message that the version of HyperPod task governance is incompatible. The training operator works only with version v1.3.0-eksbuild.1 or higher. Update your HyperPod task governance add-on and try again. 

## Missing permissions
<a name="sagemaker-eks-operator-troubleshooting-task-missing-permissions"></a>

 While you're setting up the training operator or running jobs, you might receive errors that you're not authorized to run certain operations, such as `DescribeClusterNode`. To resolve these errors, make sure you correctly set up IAM permissions while you're [setting up the Amazon EKS Pod Identity Agent](https://docs.amazonaws.cn/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-install-pod-identity).