

# Orchestrating SageMaker HyperPod clusters with Slurm
<a name="sagemaker-hyperpod-slurm"></a>

Slurm support in SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates FM development by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as Amazon Trainium and NVIDIA A100 and H100 graphics processing units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances and automatically detect and replace the faulty hardware on the fly, so that you can focus on running ML workloads. Additionally, with lifecycle configuration support in SageMaker HyperPod, you can customize your computing environment to best suit your needs and configure it with the Amazon SageMaker AI distributed training libraries to achieve optimal performance.

**Operating clusters**

You can create, configure, and maintain SageMaker HyperPod clusters graphically through the console user interface (UI) and programmatically through the Amazon command line interface (CLI) or Amazon SDK for Python (Boto3). With Amazon VPC, you can secure the cluster network and configure your cluster with resources in your VPC, such as Amazon FSx for Lustre, which offers high-throughput file storage. You can also assign different IAM roles to cluster instance groups to limit the actions that your cluster resources and users can perform. To learn more, see [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md).

**Configuring your ML environment**

SageMaker HyperPod runs [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami), which sets up an ML environment on the HyperPod clusters. You can configure additional customizations to the DLAMI by providing lifecycle scripts to support your use case. To learn more about how to set up lifecycle scripts, see [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md) and [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

**Scheduling jobs**

After you successfully create a HyperPod cluster, cluster users can log in to the cluster nodes (such as the head or controller node, login node, and worker nodes) and schedule jobs for running machine learning workloads. To learn more, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

**Resiliency against hardware failures**

SageMaker HyperPod runs health checks on cluster nodes and provides a workload auto-resume functionality. With the cluster resiliency features of HyperPod, you can resume your workload from the last checkpoint you saved, after faulty nodes are replaced with healthy ones in clusters with more than 16 nodes. To learn more, see [SageMaker HyperPod cluster resiliency](sagemaker-hyperpod-resiliency-slurm.md).
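Workload auto-resume assumes your training job checkpoints periodically and, on restart, loads the newest checkpoint. A minimal sketch of the resume side (generic Python, not a HyperPod API; the `ckpt-<step>.pt` naming convention is an assumption for illustration):

```
import os
import re
import tempfile

def latest_checkpoint(ckpt_dir):
    """Return the path of the highest-numbered checkpoint file
    (named like ckpt-<step>.pt), or None if none exist."""
    best_step, best_path = -1, None
    for name in os.listdir(ckpt_dir):
        m = re.fullmatch(r"ckpt-(\d+)\.pt", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best_path = os.path.join(ckpt_dir, name)
    return best_path

# Example: three checkpoints saved at steps 100, 250, and 50
with tempfile.TemporaryDirectory() as d:
    for step in (100, 250, 50):
        open(os.path.join(d, f"ckpt-{step}.pt"), "w").close()
    print(latest_checkpoint(d))  # path ending in ckpt-250.pt
```

On a replaced node, a training script would call a helper like this at startup and load the returned file before continuing.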

**Logging and managing clusters**

You can find SageMaker HyperPod resource utilization metrics and lifecycle logs in Amazon CloudWatch, and manage SageMaker HyperPod resources by tagging them. Each `CreateCluster` API run creates a distinct log stream, named in `<cluster-name>-<timestamp>` format. In the log stream, you can check the host names, the name of failed lifecycle scripts, and outputs from the failed scripts such as `stdout` and `stderr`. For more information, see [SageMaker HyperPod cluster management](sagemaker-hyperpod-cluster-management-slurm.md).
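When scripting against these logs, the stream name can be composed from the cluster name and its creation time. A small sketch (the exact timestamp format is an assumption here; check the stream names in your own log group):

```
from datetime import datetime, timezone

def log_stream_name(cluster_name, created_at):
    """Compose a stream name following the <cluster-name>-<timestamp>
    convention. The timestamp format below is illustrative only."""
    return f"{cluster_name}-{created_at.strftime('%Y-%m-%dT%H-%M-%S')}"

print(log_stream_name("my-hyperpod-cluster",
                      datetime(2024, 5, 1, 12, 30, 0, tzinfo=timezone.utc)))
# my-hyperpod-cluster-2024-05-01T12-30-00
```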

**Compatible with SageMaker AI tools**

Using SageMaker HyperPod, you can configure clusters with the Amazon-optimized collective communication libraries offered by SageMaker AI, such as the [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md). The SMDDP library implements the `AllGather` operation optimized for the Amazon compute and network infrastructure of the most performant SageMaker AI machine learning instances powered by NVIDIA A100 GPUs. To learn more, see [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).
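For reference, `AllGather` is the collective in which every rank contributes a shard and every rank receives the concatenation of all shards. A conceptual, in-process sketch (illustrative only; SMDDP implements this collectively over the network, not like this):

```
def all_gather(rank_shards):
    """Reference AllGather: every rank ends up with the concatenation of
    all ranks' shards, in rank order."""
    full = [x for shard in rank_shards for x in shard]
    return [list(full) for _ in rank_shards]

# 4 "ranks", each holding one shard of a sharded parameter
print(all_gather([[0, 1], [2, 3], [4, 5], [6, 7]]))
# every rank now holds [0, 1, 2, 3, 4, 5, 6, 7]
```

In sharded data parallel training, this is the operation that reassembles full parameters from shards before a forward or backward pass, which is why optimizing it matters for throughput.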

**Instance placement with UltraServers**

SageMaker AI automatically allocates jobs to instances within your UltraServers on a best-effort basis, using all of the instances in one UltraServer before drawing from another one. For example, if you request 14 instances and have 2 UltraServers in your training plan, SageMaker AI places all 14 instances in the first UltraServer. If you request 20 instances and have 2 UltraServers in your training plan, SageMaker AI uses all 17 instances in the first UltraServer and then uses 3 from the second UltraServer.
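The placement rule above can be sketched as a simple greedy fill (illustrative only; the capacities are hypothetical and the real scheduler handles many more constraints):

```
def allocate(requested, ultraserver_sizes):
    """Greedy best-effort placement: fill each UltraServer completely
    before drawing instances from the next one."""
    placement = []
    remaining = requested
    for size in ultraserver_sizes:
        if remaining == 0:
            break
        take = min(size, remaining)
        placement.append(take)
        remaining -= take
    return placement

# 20 instances over two 17-instance UltraServers
print(allocate(20, [17, 17]))  # [17, 3]
```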

**Topics**
+ [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md)
+ [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)
+ [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [SageMaker HyperPod multi-head node support](sagemaker-hyperpod-multihead-slurm.md)
+ [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md)
+ [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md)
+ [SageMaker HyperPod cluster resiliency](sagemaker-hyperpod-resiliency-slurm.md)
+ [Continuous provisioning for enhanced cluster operations with Slurm](sagemaker-hyperpod-scaling-slurm.md)
+ [SageMaker HyperPod cluster management](sagemaker-hyperpod-cluster-management-slurm.md)
+ [SageMaker HyperPod FAQs](sagemaker-hyperpod-faq-slurm.md)

# Getting started with SageMaker HyperPod
<a name="smcluster-getting-started-slurm"></a>

Get started by creating your first SageMaker HyperPod cluster and learning the cluster operation functionalities of SageMaker HyperPod. You can create a SageMaker HyperPod cluster through the SageMaker AI console UI or the Amazon CLI commands. This tutorial shows how to create a new SageMaker HyperPod cluster with Slurm, a popular workload scheduler. After you go through this tutorial, you will know how to log in to the cluster nodes using Amazon Systems Manager commands (`aws ssm`). After you complete this tutorial, see also [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md) to learn more about basic SageMaker HyperPod operations, and [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md) to learn how to schedule jobs on the provisioned cluster.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).
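The login step above uses Amazon Systems Manager Session Manager. A hedged sketch of composing the session target (the `sagemaker-cluster:` target format and the placeholder IDs are assumptions; substitute the values from your own cluster, then run the printed command):

```
# Hypothetical placeholder values - replace with your own.
CLUSTER_ID="abcdef12345"              # last segment of the cluster ARN
INSTANCE_GROUP="my-controller-group"  # instance group to log in to
INSTANCE_ID="i-0123456789abcdef0"     # EC2 instance ID of the node

# Compose the Session Manager target, then run the printed command.
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "aws ssm start-session --target ${TARGET}"
```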

**Topics**
+ [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md)
+ [Creating SageMaker HyperPod clusters using Amazon CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md)
+ [Getting started with SageMaker HyperPod using the Amazon CLI](smcluster-getting-started-slurm-cli.md)

# Getting started with SageMaker HyperPod using the SageMaker AI console
<a name="smcluster-getting-started-slurm-console"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the SageMaker AI console UI. Following the tutorial, you'll create a HyperPod cluster with three Slurm instance groups: `my-controller-group`, `my-login-group`, and `worker-group-1`.

**Topics**
+ [Create cluster](#smcluster-getting-started-slurm-console-create-cluster-page)
+ [Deploy resources](#smcluster-getting-started-slurm-console-create-cluster-deploy)
+ [Delete the cluster and clean resources](#smcluster-getting-started-slurm-console-delete-cluster-and-clean)

## Create cluster
<a name="smcluster-getting-started-slurm-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose **Slurm** orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Slurm**.

1. On the Slurm cluster creation page, you will see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI creates new resources such as a VPC, subnets, security groups, an Amazon S3 bucket, an IAM role, and an FSx for Lustre file system in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing Amazon resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-slurm-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one Controller instance group and at least one Compute instance group.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1. For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.

### Quick setup defaults
<a name="smcluster-getting-started-slurm-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new Amazon resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-slurm-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**.

### Networking
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-network"></a>

Configure your network settings for the cluster creation. These settings can't be changed after the cluster is created.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.amazonaws.cn/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the IPv4 CIDR block for your VPC (for example, `10.0.0.0/16`).

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security groups**, create a security group or choose up to five security groups configured with rules to allow inter-resource communication within the VPC.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1. For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.

### Lifecycle scripts
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-lifecycle"></a>

You can choose to use the default lifecycle scripts or the custom lifecycle scripts, which will be stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [Awesome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about the lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.

### Permissions
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary Amazon resources on your behalf.

### Storage
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose an existing FSx for Lustre file system, create a new FSx for Lustre file system, or choose not to provision a file system.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, view the value that's recommended for the new file systems.
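The **Throughput per unit of storage** and **Storage capacity** settings together determine the file system's aggregate throughput: the per-unit rate (MB/s per TiB) multiplied by the provisioned capacity. A quick sketch of the arithmetic (the 250 MB/s/TiB figure is just an example tier; check the FSx for Lustre documentation for current values):

```
def aggregate_throughput_mbs(per_unit_mbs, capacity_tib):
    """FSx for Lustre aggregate throughput = per-unit rate (MB/s per TiB)
    multiplied by provisioned capacity (TiB)."""
    return per_unit_mbs * capacity_tib

# e.g. a 250 MB/s/TiB tier on 12 TiB of storage
print(aggregate_throughput_mbs(250, 12))  # 3000
```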

### Tags - optional
<a name="smcluster-getting-started-slurm-console-create-cluster-tags"></a>

For **Tags - *optional***, add key and value pairs to the new cluster and manage the cluster as an Amazon resource. To learn more, see [Tagging your Amazon resources](https://docs.amazonaws.cn/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy"></a>

After you complete the cluster configuration using either **Quick setup** or **Custom setup**, choose one of the following options to start resource provisioning and cluster creation.
+ **Submit** - SageMaker AI starts provisioning the configured resources and creating the cluster.
+ **Download CloudFormation template parameters** - You download the configuration parameter JSON file and run an Amazon CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using Amazon CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md).

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-console-delete-cluster-and-clean"></a>

After you have successfully tested creating a SageMaker HyperPod cluster, it continues running in the `InService` state until you delete it. We recommend that you delete any clusters created using on-demand SageMaker AI instances when they are not in use, to avoid incurring continued service charges based on on-demand pricing. In this tutorial, you created a cluster that consists of three instance groups running on-demand instances, so make sure you delete the cluster by following the instructions at [Delete a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster).

However, if you have created a cluster with reserved compute capacity, the status of the clusters does not affect service billing.

To clean up the lifecycle scripts from the S3 bucket used for this tutorial, go to the S3 bucket you used during cluster creation and remove the files entirely.

If you have tested running any workloads on the cluster, check whether you uploaded any data or whether your jobs saved any artifacts to other S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.

# Creating SageMaker HyperPod clusters using Amazon CloudFormation templates
<a name="smcluster-getting-started-slurm-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install the Amazon CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-console)
+ [Configure resources and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-console"></a>

You can configure resources using the Amazon Web Services Management Console and deploy using the CloudFormation templates. 

Follow these steps.

1. Instead of choosing **Submit**, choose **Download CloudFormation template parameters** at the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md). The tutorial contains important configuration information that you need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the Amazon CLI** window will appear on the right side of the page.

1. On the **Using the configuration file to create the cluster using the Amazon CLI** window, choose **Download configuration parameters file**. The file will be downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is, if no change is required.

1. In the terminal, navigate to the directory that contains the downloaded parameter file (`params.json`).

1. Run the [create-stack](https://docs.amazonaws.cn//cli/latest/reference/cloudformation/create-stack.html) Amazon CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resource provisioning, navigate to the [CloudFormation console](https://console.amazonaws.cn/cloudformation).

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

## Configure resources and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-cfn"></a>

You can configure resources and deploy using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.amazonaws.cn//cli/latest/reference/cloudformation/create-stack.html) Amazon CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url URL_of_the_file_that_contains_the_template_body \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resource provisioning, navigate to the CloudFormation console.

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

# Getting started with SageMaker HyperPod using the Amazon CLI
<a name="smcluster-getting-started-slurm-cli"></a>

Create your first SageMaker HyperPod cluster using the Amazon CLI commands for HyperPod.

## Create your first SageMaker HyperPod cluster with Slurm
<a name="smcluster-getting-started-slurm-cli-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the [Amazon CLI commands for SageMaker HyperPod](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-cli). Following the tutorial, you'll create a HyperPod cluster with three Slurm instance groups: `my-controller-group`, `my-login-group`, and `worker-group-1`.

With the API-driven configuration approach, you define Slurm node types and partition assignments directly in the CreateCluster API request using `SlurmConfig`. This eliminates the need for a separate `provisioning_parameters.json` file and provides built-in validation, drift detection, and per-instance-group FSx configuration.

1. First, prepare and upload lifecycle scripts to an Amazon S3 bucket. During cluster creation, HyperPod runs them in each instance group. Upload lifecycle scripts to Amazon S3 using the following command.

   ```
   aws s3 sync \
       ~/local-dir-to-lifecycle-scripts/ \
       s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
   ```
**Note**  
The S3 bucket path must start with the prefix `sagemaker-`, because the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) with `AmazonSageMakerClusterInstanceRolePolicy` only allows access to Amazon S3 buckets whose names start with that prefix.

   If you are starting from scratch, use sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following sub-steps show how to download and upload the sample lifecycle scripts to an Amazon S3 bucket.

   1. Download a copy of the lifecycle script samples to a directory on your local computer.

      ```
      git clone https://github.com/aws-samples/awsome-distributed-training/
      ```

   1. Go into the directory [1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config), where you can find a set of lifecycle scripts.

      ```
      cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
      ```
      ```

      To learn more about the lifecycle script samples, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

   1. Upload the scripts to `s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src`. You can do so by using the Amazon S3 console, or by running the following Amazon CLI Amazon S3 command.

      ```
      aws s3 sync \
          ~/local-dir-to-lifecycle-scripts/ \
          s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
      ```
**Note**  
With API-driven configuration, you do not need to create or upload a `provisioning_parameters.json` file. The Slurm configuration is defined directly in the CreateCluster API request in the next step.

1. Prepare a [CreateCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateCluster.html) request file in JSON format and save as `create_cluster.json`.

   With API-driven configuration, you specify the Slurm node type and partition assignment for each instance group using the `SlurmConfig` field. You also configure the cluster-level Slurm settings using `Orchestrator.Slurm`.

   For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole",
               "InstanceStorageConfigs": [
                   {
                       "EbsVolumeConfig": {
                           "VolumeSizeInGB": 500
                       }
                   }
               ]
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       }
   }
   ```

   **SlurmConfig fields:**
   + `NodeType` – The Slurm role of the instance group: `Controller`, `Login`, or `Compute`.
   + `PartitionNames` – For `Compute` node groups, the Slurm partitions to assign the group's nodes to.

   **Orchestrator.Slurm fields:**
   + `SlurmConfigStrategy` – Determines how HyperPod manages the `slurm.conf` file. The available options are described below.

   **SlurmConfigStrategy options:**
   + `Managed` (recommended): HyperPod fully manages `slurm.conf` and detects unauthorized changes (drift detection). Updates fail if drift is detected.
   + `Overwrite`: HyperPod overwrites `slurm.conf` on updates, ignoring any manual changes.
   + `Merge`: HyperPod preserves manual changes and merges them with API configuration.
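Changing the strategy only changes the `Orchestrator` block of the request. For example, a sketch (based on the request above) that preserves and merges your manual `slurm.conf` edits:

```
"Orchestrator": {
    "Slurm": {
        "SlurmConfigStrategy": "Merge"
    }
}
```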

   **Adding FSx for Lustre (optional):**

   To mount an FSx for Lustre file system to your compute nodes, add `FsxLustreConfig` to the `InstanceStorageConfigs` for the instance group. This requires a custom VPC configuration.

   ```
   {
       "InstanceGroupName": "worker-group-1",
       "InstanceType": "ml.trn1.32xlarge",
       "InstanceCount": 1,
       "SlurmConfig": {
           "NodeType": "Compute",
           "PartitionNames": ["partition-1"]
       },
       "InstanceStorageConfigs": [
           {
               "FsxLustreConfig": {
                   "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                   "MountPath": "/fsx",
                   "MountName": "abcdefgh"
               }
           }
       ],
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
   }
   ```

   **Adding FSx for OpenZFS (optional):**

   You can also mount FSx for OpenZFS file systems:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
               "DnsName": "fs-0xyz789abc123456.fsx.us-west-2.amazonaws.com",
               "MountPath": "/shared"
           }
       }
   ]
   ```
**Note**  
Each instance group can have at most one FSx for Lustre and one FSx for OpenZFS configuration. Different instance groups can mount different filesystems.
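For illustration, a sketch of an `InstanceStorageConfigs` list that combines the options shown above — one EBS volume, one FSx for Lustre mount, and one FSx for OpenZFS mount — in a single instance group. The values are reused from the earlier examples; confirm the combination you need against [FSx configuration via InstanceStorageConfigs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-fsx-config).

```
"InstanceStorageConfigs": [
    {
        "EbsVolumeConfig": { "VolumeSizeInGB": 500 }
    },
    {
        "FsxLustreConfig": {
            "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
            "MountPath": "/fsx",
            "MountName": "abcdefgh"
        }
    },
    {
        "FsxOpenZfsConfig": {
            "DnsName": "fs-0xyz789abc123456.fsx.us-west-2.amazonaws.com",
            "MountPath": "/shared"
        }
    }
]
```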

   **Adding VPC configuration (required for FSx):**

   If you use FSx, you must specify a custom VPC configuration:

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
            }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "VpcConfig": {
           "SecurityGroupIds": ["sg-0abc123def456789a"],
           "Subnets": ["subnet-0abc123def456789a"]
       }
   }
   ```

1. Run the following command to create the cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the created cluster.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"
   }
   ```
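If you script cluster creation, you can extract the ARN from a saved copy of the response without re-calling the API. A sketch using `python3`; the response string below is the sample output shown above:

```shell
# Parse the ClusterArn out of a saved create-cluster response (sample shown above)
response='{"ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"}'
CLUSTER_ARN=$(printf '%s' "$response" | python3 -c "import json,sys; print(json.load(sys.stdin)['ClusterArn'])")
echo "$CLUSTER_ARN"
```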

   If you receive an error due to resource limits, ensure that you change the instance type to one with sufficient quotas in your account, or request additional quotas by following [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).


1. Run `describe-cluster` to check the status of the cluster.

   ```
   aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster",
       "ClusterName": "my-hyperpod-cluster",
       "ClusterStatus": "Creating",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "CreationTime": "2024-01-15T10:30:00Z"
   }
   ```

   After the cluster status changes to **InService**, proceed to the next step. Cluster creation typically takes 10-15 minutes.
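When scripting, you can pull just the status field out of the `describe-cluster` response. A sketch that parses a saved copy of the response locally (in practice you would re-run `describe-cluster` in a loop until the status is `InService`; the response string below is a stub with only the fields used here):

```shell
# Extract ClusterStatus from a saved describe-cluster response (stub values)
response='{"ClusterName": "my-hyperpod-cluster", "ClusterStatus": "Creating"}'
STATUS=$(printf '%s' "$response" | python3 -c "import json,sys; print(json.load(sys.stdin)['ClusterStatus'])")
echo "Cluster status: $STATUS"
```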

1. Run `list-cluster-nodes` to check the details of the cluster nodes.

   ```
   aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterNodeSummaries": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceId": "i-0abc123def456789a",
               "InstanceType": "ml.c5.xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceId": "i-0abc123def456789b",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceId": "i-0abc123def456789c",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:36:00Z"
           }
       ]
   }
   ```

   Cluster users need the `InstanceId` to log in to the cluster nodes through Amazon Systems Manager (`aws ssm`). For more information about logging in to the cluster nodes and running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
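The SSM target string used in the next step is composed from the cluster name, the instance group name, and the `InstanceId`, in the form shown below. A sketch assembling it from the values in the sample response above:

```shell
# Compose the SSM target string from the pieces returned by list-cluster-nodes
CLUSTER_NAME="my-hyperpod-cluster"
GROUP="my-login-group"
INSTANCE_ID="i-0abc123def456789b"
TARGET="sagemaker-cluster:${CLUSTER_NAME}_${GROUP}-${INSTANCE_ID}"
echo "$TARGET"
```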

1. Connect to your cluster using Amazon Systems Manager Session Manager.

   ```
   aws ssm start-session \
       --target sagemaker-cluster:my-hyperpod-cluster_my-login-group-i-0abc123def456789b \
       --region us-west-2
   ```

   Once connected, verify Slurm is configured correctly:

   ```
   # Check Slurm nodes
   sinfo
   
   # Check Slurm partitions
   sinfo -p partition-1
   
   # Submit a test job
   srun -p partition-1 --nodes=1 hostname
   ```

## Delete the cluster and clean up resources
<a name="smcluster-getting-started-slurm-cli-delete-cluster-and-clean"></a>

After you have successfully tested creating a SageMaker HyperPod cluster, it continues running in the `InService` state until you delete the cluster. We recommend that you delete any clusters created using on-demand SageMaker AI capacity when not in use to avoid incurring continued service charges based on on-demand pricing. In this tutorial, you have created a cluster that consists of three instance groups. Make sure you delete the cluster by running the following command.

```
aws sagemaker delete-cluster --cluster-name my-hyperpod-cluster
```

To clean up the lifecycle scripts from the Amazon S3 bucket used for this tutorial, go to the Amazon S3 bucket you used during cluster creation and remove the files entirely.

```
aws s3 rm s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src --recursive
```

If you have tested running any model training workloads on the cluster, also check if you have uploaded any data or if your job has saved any artifacts to different Amazon S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.

## Related topics
<a name="smcluster-getting-started-slurm-cli-related-topics"></a>
+ [SageMaker HyperPod Slurm configuration](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-configuration)
+ [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [FSx configuration via InstanceStorageConfigs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-fsx-config)
+ [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)

# SageMaker HyperPod Slurm cluster operations
<a name="sagemaker-hyperpod-operate-slurm"></a>

This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the Amazon Command Line Interface (CLI). You'll learn how to perform various management tasks, whether you prefer a visual interface or the command line.

**Topics**
+ [Managing SageMaker HyperPod Slurm clusters using the SageMaker console](sagemaker-hyperpod-operate-slurm-console-ui.md)
+ [Managing SageMaker HyperPod Slurm clusters using the Amazon CLI](sagemaker-hyperpod-operate-slurm-cli-command.md)

# Managing SageMaker HyperPod Slurm clusters using the SageMaker console
<a name="sagemaker-hyperpod-operate-slurm-console-ui"></a>

The following topics provide guidance on how to manage SageMaker HyperPod through the console UI.

**Topics**
+ [Create a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster)
+ [Browse your SageMaker HyperPod clusters](#sagemaker-hyperpod-operate-slurm-console-ui-browse-clusters)
+ [View details of each SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters)
+ [Edit a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters)
+ [Delete a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster)

## Create a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-create-cluster"></a>

See the instructions in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md) to create a new SageMaker HyperPod cluster through the SageMaker HyperPod console UI.

## Browse your SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-operate-slurm-console-ui-browse-clusters"></a>

On the SageMaker HyperPod console main page, all created clusters appear under the **Clusters** section, which provides a summary view of each cluster, including its ARN, status, and creation time.

## View details of each SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters"></a>

Under **Clusters** on the console main page, the cluster **Names** are activated as links. Choose the cluster name link to see details of each cluster.

## Edit a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to update.

1. Select your cluster, and choose **Edit**.

1. In the **Edit <your-cluster>** page, you can edit the configurations of existing instance groups, add more instance groups, delete instance groups, and change tags for the cluster. After making changes, choose **Submit**. 

   1. In the **Configure instance groups** section, you can add more instance groups by choosing **Create instance group**.

   1. In the **Configure instance groups** section, for an existing instance group, you can choose **Edit** to change its configuration or **Delete** to remove the instance group permanently.
**Important**  
When deleting an instance group, consider the following points:  
+ Your SageMaker HyperPod cluster must always maintain at least one instance group.
+ Ensure that all critical data is backed up before removal.
+ The removal process cannot be undone.
**Note**  
Deleting an instance group will terminate all compute resources associated with that group.

   1. In the **Tags** section, you can update tags for the cluster.

## Delete a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to delete.

1. Select your cluster, and choose **Delete**.

1. In the pop-up window for cluster deletion, review the cluster information carefully to confirm that you chose the right cluster to delete.

1. After you have reviewed the cluster information, choose **Yes, delete cluster**.

1. In the text field to confirm this deletion, type **delete**.

1. Choose **Delete** on the lower right corner of the pop-up window to finish sending the cluster deletion request.

# Managing SageMaker HyperPod Slurm clusters using the Amazon CLI
<a name="sagemaker-hyperpod-operate-slurm-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using the Amazon CLI commands.

**Topics**
+ [Create a new cluster](#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster)
+ [Describe a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster)
+ [List details of cluster nodes](#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes)
+ [Describe details of a cluster node](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster-node)
+ [List clusters](#sagemaker-hyperpod-operate-slurm-cli-command-list-clusters)
+ [Update cluster configuration](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster)
+ [Update the SageMaker HyperPod platform software of a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software)
+ [Scale down a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-scale-down)
+ [Delete a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-delete-cluster)

## Create a new cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-create-cluster"></a>

1. Prepare lifecycle configuration scripts and upload them to an S3 bucket, such as `s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/`. Step 2 assumes that there's an entry point script named `on_create.sh` in the specified S3 bucket.
**Important**  
Make sure that you set the S3 path to start with `s3://sagemaker-`. The [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) has the managed [AmazonSageMakerClusterInstanceRolePolicy](https://docs.amazonaws.cn/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html) attached, which allows access to S3 buckets with the specific prefix `sagemaker-`.

1. Prepare a [CreateCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. Configure the instance groups to match the Slurm cluster you design in the `provisioning_parameters.json` file, which is used during cluster creation as part of running a set of lifecycle scripts. To learn more, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The following template has two instance groups to meet the minimum requirement for a Slurm cluster: one controller (head) node and one compute (worker) node. For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

   ```
   // create_cluster.json
   {
       "ClusterName": "your-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "controller-group",
               "InstanceType": "ml.m5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
               // Optional: Configure an additional storage per instance group.
               "InstanceStorageConfigs": [
                   {
                      // Attach an additional EBS volume to each instance within the instance group.
                      // The default mount path for the additional EBS volume is /opt/sagemaker.
                      "EbsVolumeConfig":{
                         // Specify an integer between 1 and 16384 in gigabytes (GB).
                         "VolumeSizeInGB": integer,
                      }
                   }
               ]
           }, 
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.p4d.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
           }
       ],
       // Optional
       "Tags": [ 
           { 
              "Key": "string",
              "Value": "string"
           }
       ],
       // Optional
       "VpcConfig": { 
           "SecurityGroupIds": [ "string" ],
           "Subnets": [ "string" ]
       }
   }
   ```

   Depending on how you design the cluster structure through your lifecycle scripts, you can configure up to 20 instance groups under the `InstanceGroups` parameter.

   For the `Tags` request parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an Amazon resource. You can add tags to your cluster in the same way you add them to other Amazon services that support tagging. To learn more about tagging Amazon resources in general, see the [Tagging Amazon Resources User Guide](https://docs.amazonaws.cn/tag-editor/latest/userguide/tagging.html).

   For the `VpcConfig` request parameter, specify the information of a VPC you want to use. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).
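   Note that the template above includes `//` comments and the `integer` placeholder for readability; remove them and fill in real values so the file is valid JSON before you submit it. A quick local check, sketched here with a hypothetical stub file in place of your real request file:

```shell
# Validate the request file locally before calling create-cluster.
# The stub below stands in for your filled-in create_cluster.json.
printf '%s' '{"ClusterName": "your-hyperpod-cluster", "InstanceGroups": []}' > create_cluster.json
python3 -m json.tool create_cluster.json > /dev/null && echo "valid JSON"
```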

1. Run the [create-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/create-cluster.html) command as follows.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.

## Describe a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster turns to **InService**, proceed to the next step. Using this API, you can also retrieve failure messages from running other HyperPod API operations.

## List details of cluster nodes
<a name="sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

The response includes the `InstanceId` of each node, which you need for logging in to the nodes (using `aws ssm`).

## Describe details of a cluster node
<a name="sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from list-cluster-nodes output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-operate-slurm-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about the underlying API operation and the additional flags for filtering, see the [ListClusters](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ListClusters.html) API reference.

## Update cluster configuration
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
You can use the `UpdateCluster` API to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scale down a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-scale-down).

1. Create an `UpdateCluster` request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. You can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, you must match the instance type you initially specified to the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1.

   1. For `LifeCycleConfig`, you can change both `SourceS3Uri` and `OnCreate` values as you want to update the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.amazonaws.cn/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [
           {
               "InstanceGroupName": "name-of-instance-group-to-update",
               "InstanceType": "ml.m5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
               // Optional: Configure an additional storage per instance group.
               "InstanceStorageConfigs": [
                   {
                      // Attach an additional EBS volume to each instance within the instance group.
                      // The default mount path for the additional EBS volume is /opt/sagemaker.
                      "EbsVolumeConfig":{
                         // Specify an integer between 1 and 16384 in gigabytes (GB).
                         "VolumeSizeInGB": integer,
                      }
                   }
               ]
           },
           // add more blocks of instance groups as needed
           { ... }
       ]
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```

## Update the SageMaker HyperPod platform software of a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software"></a>

Run [update-cluster-software](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
Note that you must back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

This command calls the [UpdateClusterSoftware](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. After the API call, SageMaker HyperPod checks whether a newer DLAMI is available for the cluster instances. If a DLAMI update is required, SageMaker HyperPod updates the cluster instances to use the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) and runs your lifecycle scripts from the Amazon S3 bucket that you specified during cluster creation or update. If the cluster is already using the latest DLAMI, SageMaker HyperPod makes no changes to the cluster and doesn't run the lifecycle scripts again.

The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami)s to enhance security and improve user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future SageMaker HyperPod DLAMI updates for security patching, see the [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Tip**  
If the security patch fails, you can retrieve failure messages by running the [DescribeCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

### Use the backup script provided by SageMaker HyperPod
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup"></a>

SageMaker HyperPod provides a script to back up and restore your data at [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/patching-backup.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/patching-backup.sh) in the *Awsome Distributed Training GitHub repository*. The script provides the following two functions.

**To back up data to an S3 bucket before patching**

```
sudo bash patching-backup.sh --create <s3-backup-bucket-path>
```

After you run the command, the script runs `squeue` to check whether there are queued jobs, stops Slurm if there's no job in the queue, backs up `mariadb`, and copies the local items on disk defined under `LOCAL_ITEMS`. You can add more files and directories to `LOCAL_ITEMS`.

```
# Define files and directories to back up.
LOCAL_ITEMS=(
    "/var/spool/slurmd"
    "/var/spool/slurmctld"
    "/etc/systemd/system/slurmctld.service"
    "/home/ubuntu/backup_slurm_acct_db.sql"
    # ... Add more items as needed
)
```
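As a sketch of extending the list before running the backup script, you can append entries to the bash array (the appended path below is a hypothetical example; the base items are abridged from the script's defaults):

```shell
# Abridged defaults from the backup script
LOCAL_ITEMS=(
    "/var/spool/slurmd"
    "/var/spool/slurmctld"
)
# Hypothetical addition: a checkpoint directory for your workload
LOCAL_ITEMS+=("/fsx/checkpoints")
echo "${#LOCAL_ITEMS[@]} items to back up"
```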

Also, you can add custom code to the provided script to back up any applications for your use case.

**To restore data from an S3 bucket after patching**

```
sudo bash patching-backup.sh --restore <s3-backup-bucket-path>
```

## Scale down a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-scale-down"></a>

You can scale down the number of instances or delete instance groups in your SageMaker HyperPod cluster to optimize resource allocation or reduce costs.

You scale down by either using the `UpdateCluster` API operation to randomly terminate instances from your instance group down to a specified number, or by terminating specific instances using the `BatchDeleteClusterNodes` API operation. You can also completely remove entire instance groups using the `UpdateCluster` API. For more information about how to scale down using these methods, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

**Note**  
You cannot remove instances that are configured as Slurm controller nodes. Attempting to delete a Slurm controller node results in a validation error with the error code `NODE_ID_IN_USE`.
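As a sketch, a `BatchDeleteClusterNodes` request can be composed with Boto3 as follows. The cluster name and node IDs are placeholders; the commented-out call shows where the request would be sent.

```python
# Placeholder values; look up real node IDs with the ListClusterNodes API.
cluster_name = "your-hyperpod-cluster"
node_ids = ["i-1111111111example", "i-2222222222example"]

# Request payload for the BatchDeleteClusterNodes API operation.
request = {
    "ClusterName": cluster_name,
    "NodeIds": node_ids,
}

# Uncomment to run against your account (requires AWS credentials):
# import boto3
# sagemaker = boto3.client("sagemaker")
# response = sagemaker.batch_delete_cluster_nodes(**request)
```

Nodes configured as Slurm controller nodes would be rejected with the `NODE_ID_IN_USE` validation error described above.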

## Delete a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

# Customizing SageMaker HyperPod clusters using lifecycle scripts
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm"></a>

SageMaker HyperPod offers always up-and-running compute clusters that are highly customizable: you write lifecycle scripts that tell SageMaker HyperPod how to set up the cluster resources.

The following topics discuss in-depth best practices for preparing lifecycle scripts to set up Slurm configurations on SageMaker HyperPod.

## High-level overview
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview"></a>

The following procedure is the main flow of provisioning a HyperPod cluster and setting it up with Slurm. The steps are ordered following a ***bottom-up*** approach.

1. Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.

1. Prepare Slurm configuration. Choose one of the following approaches:
   + **Option A: API-driven configuration (recommended)** – Define Slurm node types and partitions directly in the `CreateCluster` API payload using `SlurmConfig` within each instance group. With this approach:
     + No `provisioning_parameters.json` file is needed
     + Slurm topology is defined in the API payload alongside instance group definitions
     + FSx filesystems are configured per-instance-group via `InstanceStorageConfigs`
     + Configuration strategy is controlled via `Orchestrator.Slurm.SlurmConfigStrategy`

     Example `SlurmConfig` in an instance group:

     ```
     {
         "InstanceGroupName": "gpu-compute",
         "InstanceType": "ml.p4d.24xlarge",
         "InstanceCount": 8,
         "SlurmConfig": {
             "NodeType": "Compute",
             "PartitionNames": ["gpu-training"]
         }
     }
     ```
   + **Option B: Legacy configuration** – Prepare a `provisioning_parameters.json` file, which is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). `provisioning_parameters.json` should contain the Slurm node configuration to be provisioned on the HyperPod cluster, reflecting the design of Slurm nodes from Step 1.

1. Prepare a set of lifecycle scripts to set up Slurm on HyperPod to install software packages and set up an environment in the cluster for your use case. You should structure the lifecycle scripts to collectively run in order in a central Python script (`lifecycle_script.py`), and write an entrypoint shell script (`on_create.sh`) to run the Python script. The entrypoint shell script is what you need to provide to a HyperPod cluster creation request later in Step 5. 

   Also, note that you should write the scripts to expect `resource_config.json` that will be generated by HyperPod during cluster creation. `resource_config.json` contains HyperPod cluster resource information such as IP addresses, instance types, and ARNs, and is what you need to use for configuring Slurm.

1. Collect all the files from the previous steps into a folder. The folder structure depends on the configuration approach you selected in Step 2.

   If you selected Option A (API-driven configuration):

   Your folder only needs lifecycle scripts for custom setup tasks. Slurm configuration and FSx mounting are handled automatically by HyperPod based on the API payload.

   ```
   └── lifecycle_files // your local folder
   
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```
**Note**  
The `provisioning_parameters.json` file is not required when using API-driven configuration.

   If you selected Option B (legacy configuration):

   Your folder must include `provisioning_parameters.json` and the full set of lifecycle scripts.

   ```
   └── lifecycle_files // your local folder
   
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

1. Upload all the files to an S3 bucket. Copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with `sagemaker-` because you need to choose an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attached with [`AmazonSageMakerClusterInstanceRolePolicy`](security-iam-awsmanpol-AmazonSageMakerClusterInstanceRolePolicy.md), which only allows S3 bucket paths starting with the prefix `sagemaker-`. The following command is an example command to upload all the files to an S3 bucket.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```

1. Prepare a HyperPod cluster creation request. 
   + Option 1: If you use the Amazon CLI, write a cluster creation request in JSON format (`create_cluster.json`) following the instructions at [Create a new cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster).
   + Option 2: If you use the SageMaker AI console UI, fill the **Create a cluster** request form in the HyperPod console UI following the instructions at [Create a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster).

   At this stage, make sure that you create instance groups in the same structure that you planned in Step 1 and 2. Also, make sure that you specify the S3 bucket from Step 5 in the request forms.

1. Submit the cluster creation request. HyperPod provisions a cluster based on the request, and then creates a `resource_config.json` file in the HyperPod cluster instances, and sets up Slurm on the cluster running the lifecycle scripts.
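To illustrate the expectation in Step 3, the following minimal sketch shows how a lifecycle script might read the HyperPod-generated `resource_config.json` and map instance groups to instance IP addresses. The field names follow the auto-generated file; the inline sample and the `group_ips` helper are illustrative.

```python
import json

def group_ips(config: dict) -> dict:
    # Map each instance group name to its instances' customer IP addresses.
    return {
        group["Name"]: [i["CustomerIpAddress"] for i in group["Instances"]]
        for group in config["InstanceGroups"]
    }

# In a real lifecycle script, load the file HyperPod generates on each instance:
# with open("/opt/ml/config/resource_config.json") as f:
#     config = json.load(f)

sample = {
    "InstanceGroups": [
        {"Name": "controller-machine",
         "Instances": [{"CustomerIpAddress": "10.0.0.10"}]},
        {"Name": "compute-nodes",
         "Instances": [{"CustomerIpAddress": "10.0.0.11"},
                       {"CustomerIpAddress": "10.0.0.12"}]},
    ]
}
print(group_ips(sample))
```

A lifecycle script can use a mapping like this to decide, for example, which address to use as the Slurm controller host.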

The following topics walk you through and dive deep into details on how to organize configuration files and lifecycle scripts to work properly during HyperPod cluster creation.

**Topics**
+ [High-level overview](#sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview)
+ [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md)
+ [What particular configurations HyperPod manages in Slurm configuration files](sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf.md)
+ [Slurm log rotations](sagemaker-hyperpod-slurm-log-rotation.md)
+ [Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx.md)
+ [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md)
+ [Validating runtime before running production workloads on a HyperPod Slurm cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime.md)
+ [Developing lifecycle scripts interactively on a HyperPod cluster node](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts.md)

# Base lifecycle scripts provided by HyperPod
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config"></a>

This section walks you through every component of the basic flow of setting up Slurm on HyperPod in a ***top-down*** approach. It starts from preparing a HyperPod cluster creation request to run the `CreateCluster` API, and dives deep into the hierarchical structure down to lifecycle scripts. Use the sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). Clone the repository by running the following command.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
```

The base lifecycle scripts for setting up a Slurm cluster on SageMaker HyperPod are available at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config).

```
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
```

The following flowchart shows a detailed overview of how you should design the base lifecycle scripts. The descriptions below the diagram and the procedural guide explain how they work during the HyperPod `CreateCluster` API call.

![\[A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod-lifecycle-structure.png)


***Figure:** A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts. (1) The dashed arrows point to where the boxes are "called into" and show the flow of preparing configuration files and lifecycle scripts. It starts with preparing `provisioning_parameters.json` and the lifecycle scripts. These are then coded in `lifecycle_script.py` for collective execution in order, and `lifecycle_script.py` is executed by the `on_create.sh` shell script, which is run in the HyperPod instance terminal. (2) The solid arrows show the main HyperPod cluster creation flow and how the boxes are "called into" or "submitted to". `on_create.sh` is required for the cluster creation request, either in `create_cluster.json` or in the **Create a cluster** request form in the console UI. After you submit the request, HyperPod runs the `CreateCluster` API based on the given configuration information from the request and the lifecycle scripts. (3) The dotted arrow indicates that the HyperPod platform creates `resource_config.json` in the cluster instances during cluster resource provisioning. `resource_config.json` contains HyperPod cluster resource information such as the cluster ARN, instance types, and IP addresses. It is important to prepare the lifecycle scripts to expect the `resource_config.json` file during cluster creation. For more information, see the procedural guide below.*

The following procedural guide explains what happens during HyperPod cluster creation and how the base lifecycle scripts are designed.

1. `create_cluster.json` – To submit a HyperPod cluster creation request, you prepare a `CreateCluster` request file in JSON format. In this best practices example, we assume that the request file is named `create_cluster.json`. Write `create_cluster.json` to provision a HyperPod cluster with instance groups. The best practice is to add the same number of instance groups as the number of Slurm nodes you plan to configure on the HyperPod cluster. Make sure that you give distinctive names to the instance groups that you'll assign to Slurm nodes you plan to set up.

   Also, you are required to specify an S3 bucket path to store your entire set of configuration files and lifecycle scripts to the field name `InstanceGroups.LifeCycleConfig.SourceS3Uri` in the `CreateCluster` request form, and specify the file name of an entrypoint shell script (assume that it's named `on_create.sh`) to `InstanceGroups.LifeCycleConfig.OnCreate`.
**Note**  
If you are using the **Create a cluster** submission form in the HyperPod console UI, the console manages filling and submitting the `CreateCluster` request on your behalf, and runs the `CreateCluster` API in the backend. In this case, you don't need to create `create_cluster.json`; instead, make sure that you specify the correct cluster configuration information to the **Create a cluster** submission form.

1. `on_create.sh` – For each instance group, you need to provide an entrypoint shell script, `on_create.sh`, to run commands, run scripts to install software packages, and set up the HyperPod cluster environment with Slurm. The two things you need to prepare are a `provisioning_parameters.json` required by HyperPod for setting up Slurm and a set of lifecycle scripts for installing software packages. This script should be written to find and run the following files as shown in the sample script at [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh).
**Note**  
Make sure that you upload the entire set of lifecycle scripts to the S3 location you specify in `create_cluster.json`. You should also place your `provisioning_parameters.json` in the same location.

   1. `provisioning_parameters.json` – This is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). The `on_create.sh` script finds this JSON file and defines an environment variable for identifying the path to it. Through this JSON file, you can configure Slurm nodes and storage options such as Amazon FSx for Lustre for Slurm to communicate with. In `provisioning_parameters.json`, make sure that you assign the HyperPod cluster instance groups using the names you specified in `create_cluster.json` to the Slurm nodes appropriately based on how you plan to set them up.

      The following diagram shows an example of how the two JSON configuration files `create_cluster.json` and `provisioning_parameters.json` should be written to assign HyperPod instance groups to Slurm nodes. In this example, we assume a case of setting up three Slurm nodes: a controller (management) node, a login node (which is optional), and a compute (worker) node.
**Tip**  
To help you validate these two JSON files, the HyperPod service team provides a validation script, [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). To learn more, see [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md).  
![\[Direct comparison between .json files.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod-lifecycle-slurm-config.png)

      ***Figure:** Direct comparison between `create_cluster.json` for HyperPod cluster creation and `provisioning_parameters.json` for Slurm configuration. The number of instance groups in `create_cluster.json` should match the number of nodes you want to configure as Slurm nodes. In the example in the figure, three Slurm nodes will be configured on a HyperPod cluster of three instance groups. You should assign the HyperPod cluster instance groups to Slurm nodes by specifying the instance group names accordingly.*

   1. `resource_config.json` – During cluster creation, the `lifecycle_script.py` script is written to expect a `resource_config.json` file from HyperPod. This file contains information about the cluster, such as instance types and IP addresses.

      When you run the `CreateCluster` API, HyperPod creates a resource configuration file at `/opt/ml/config/resource_config.json` based on the `create_cluster.json` file. The file path is saved to the environment variable named `SAGEMAKER_RESOURCE_CONFIG_PATH`. 
**Important**  
The `resource_config.json` file is auto-generated by the HyperPod platform; you DO NOT need to create it. The following code shows an example of the `resource_config.json` that would be created during cluster creation based on the `create_cluster.json` from the previous step, to help you understand what happens in the backend and how an auto-generated `resource_config.json` looks.

      ```
      {
      
          "ClusterConfig": {
              "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde01234yz",
              "ClusterName": "your-hyperpod-cluster"
          },
          "InstanceGroups": [
              {
                  "Name": "controller-machine",
                  "InstanceType": "ml.c5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "controller-machine-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "login-group",
                  "InstanceType": "ml.m5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "login-group-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "compute-nodes",
                  "InstanceType": "ml.trn1.32xlarge",
                  "Instances": [
                      {
                          "InstanceName": "compute-nodes-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-2",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-3",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-4",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              }
          ]
      }
      ```

   1. `lifecycle_script.py` – This is the main Python script that collectively runs lifecycle scripts setting up Slurm on the HyperPod cluster while being provisioned. This script reads in `provisioning_parameters.json` and `resource_config.json` from the paths that are specified or identified in `on_create.sh`, passes the relevant information to each lifecycle script, and then runs the lifecycle scripts in order.

      Lifecycle scripts are a set of scripts that you have complete flexibility to customize to install software packages and set up necessary or custom configurations during cluster creation, such as setting up Slurm, creating users, and installing Conda or Docker. The sample [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script is prepared to run other base lifecycle scripts in the repository, such as launching Slurm daemons ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh)), mounting Amazon FSx for Lustre ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh)), and setting up MariaDB accounting ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh)) and RDS accounting 
([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh)). You can also add more scripts, package them under the same directory, and add code lines to `lifecycle_script.py` to let HyperPod run the scripts. For more information about the base lifecycle scripts, see also [3.1 Lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#31-lifecycle-scripts) in the *Awsome Distributed Training GitHub repository*.
**Note**  
HyperPod runs [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) on each instance of a cluster, and the AMI has pre-installed software packages that are compatible with each other and with HyperPod functionalities. If you reinstall any of the pre-installed packages, you are responsible for installing compatible packages, and note that some HyperPod functionalities might not work as expected.

      In addition to the default setups, more scripts for installing the following software are available under the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The `lifecycle_script.py` file already includes code lines for running these installation scripts; see the following items to find those lines and uncomment them to activate the scripts.

      1. The following code lines are for installing [Docker](https://www.docker.com/), [Enroot](https://github.com/NVIDIA/enroot), and [Pyxis](https://github.com/NVIDIA/pyxis). These packages are required to run Docker containers on a Slurm cluster. 

         To enable this installation step, set the `enable_docker_enroot_pyxis` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install Docker/Enroot/Pyxis
         if Config.enable_docker_enroot_pyxis:
             ExecuteBashScript("./utils/install_docker.sh").run()
             ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type)
         ```

      1. You can integrate your HyperPod cluster with [Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.amazonaws.cn/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) to export metrics about the HyperPod cluster and cluster nodes to Amazon Managed Grafana dashboards. To export metrics and use the [Slurm dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/), the [NVIDIA DCGM Exporter dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/), and the [EFA Metrics dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/) on Amazon Managed Grafana, you need to install the [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter), the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter), and the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). For more information about installing the exporter packages and using Grafana dashboards on an Amazon Managed Grafana workspace, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). 

         To enable this installation step, set the `enable_observability` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install metric exporting software and Prometheus for observability
         
         if Config.enable_observability:
             if node_type == SlurmNodeType.COMPUTE_NODE:
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()
             
             if node_type == SlurmNodeType.HEAD_NODE:
                 wait_for_scontrol()
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_prometheus.sh").run()
         ```

1. Make sure that you upload all configuration files and setup scripts from **Step 2** to the S3 bucket you provide in the `CreateCluster` request in **Step 1**. For example, assume that your `create_cluster.json` has the following.

   ```
   "LifeCycleConfig": { 
   
       "SourceS3URI": "s3://sagemaker-hyperpod-lifecycle/src",
       "OnCreate": "on_create.sh"
   }
   ```

   Then, your `"s3://sagemaker-hyperpod-lifecycle/src"` should contain `on_create.sh`, `lifecycle_script.py`, `provisioning_parameters.json`, and all other setup scripts. Assume that you have prepared the files in a local folder as follows.

   ```
   └── lifecycle_files // your local folder
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

   To upload the files, use the S3 command as follows.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```
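Before uploading, you can sanity-check that the local folder contains the files this flow expects. This is an optional helper sketch; the required file names follow this guide (the legacy, `provisioning_parameters.json`-based layout), and the folder path is illustrative.

```python
from pathlib import Path

# Files expected in the local lifecycle folder for the legacy layout.
REQUIRED_FILES = ["on_create.sh", "lifecycle_script.py", "provisioning_parameters.json"]

def missing_files(folder: str) -> list:
    # Return the names of required files that are absent from the folder.
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# Example: an empty result means the folder is ready to upload with
# `aws s3 cp --recursive ./lifecycle_files s3://...`.
```

Running this locally before `aws s3 cp` can save a failed cluster creation caused by a missing `provisioning_parameters.json`.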

# What particular configurations HyperPod manages in Slurm configuration files
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf"></a>

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html) and [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites. 

**Important**  
We strongly recommend that you **do not** change these parameters managed by HyperPod.
+ In [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.
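As a sanity check, you can confirm the HyperPod-required values are present in `slurm.conf` without editing the file. The following sketch parses only the first `Key=Value` pair on each line, which is sufficient for the scalar settings checked here (not for compound `NodeName` or `PartitionName` lines); the helper names are illustrative.

```python
# Values that HyperPod requires in slurm.conf, per the list above.
HYPERPOD_REQUIRED = {
    "TaskPlugin": "task/none",
    "SchedulerParameters": "permit_job_expansion",
}

def parse_simple_params(text: str) -> dict:
    # Collect the first Key=Value pair on each non-comment line.
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields = value.split()
        params[key.strip()] = fields[0] if fields else ""
    return params

def hyperpod_settings_intact(conf_text: str) -> bool:
    parsed = parse_simple_params(conf_text)
    return all(parsed.get(k) == v for k, v in HYPERPOD_REQUIRED.items())
```

For example, passing the contents of `/opt/slurm/etc/slurm.conf` from the controller node should return `True` on a healthy HyperPod cluster.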

# Slurm log rotations
<a name="sagemaker-hyperpod-slurm-log-rotation"></a>

SageMaker HyperPod provides automatic log rotation for Slurm daemon logs to help manage disk space usage and maintain system performance. Log rotation is crucial for preventing logs from consuming excessive disk space and ensuring optimal system operation by automatically archiving and removing old log files while maintaining recent logging information. Slurm log rotations are enabled by default when you create a cluster.

## How log rotation works
<a name="sagemaker-hyperpod-slurm-log-rotation-how-it-works"></a>

When enabled, the log rotation configuration:
+ Monitors all Slurm log files with the extension `.log` located in the `/var/log/slurm/` folder on the controller, login, and compute nodes.
+ Rotates logs when they reach 50 MB in size.
+ Maintains up to two rotated log files before deleting them.
+ Sends SIGUSR2 signal to Slurm daemons (`slurmctld`, `slurmd`, and `slurmdbd`) after rotation.

## List of log files rotated
<a name="sagemaker-hyperpod-slurm-log-rotation-log-files-list"></a>

Slurm logs are located in the `/var/log/slurm/` directory. Log rotation is enabled for all files that match `/var/log/slurm/*.log`. When rotation occurs, rotated files have numerical suffixes (such as `slurmd.log.1`). The following list is not exhaustive but shows some of the critical log files that rotate automatically:
+ `/var/log/slurm/slurmctld.log`
+ `/var/log/slurm/slurmd.log`
+ `/var/log/slurm/slurmdbd.log`
+ `/var/log/slurm/slurmrestd.log`

## Enable or disable log rotation
<a name="sagemaker-hyperpod-slurm-log-rotation-enable-disable"></a>

You can control the log rotation feature using the `enable_slurm_log_rotation` parameter in the `config.py` script of your cluster's lifecycle scripts, as shown in the following example:

```
class Config:
    # Set false if you want to disable log rotation of Slurm daemon logs
    enable_slurm_log_rotation = True  # Default value
```

To disable log rotation, set the parameter to `False`, as shown in the following example:

```
enable_slurm_log_rotation = False
```

**Note**  
Lifecycle scripts run on all Slurm nodes (controller, login, and compute nodes) during cluster creation. They also run on new nodes when added to the cluster. Updating the log rotation configurations must be done manually after cluster creation. The log rotation configuration is stored in `/etc/logrotate.d/sagemaker-hyperpod-slurm`. We recommend keeping log rotation enabled to prevent log files from consuming excessive disk space. To disable log rotation, delete the `sagemaker-hyperpod-slurm` file or comment out its contents by adding `#` at the start of each line in the `sagemaker-hyperpod-slurm` file.

## Default log rotation settings
<a name="sagemaker-hyperpod-slurm-log-rotation-default-settings"></a>

The following settings are configured automatically for each log file rotated:


| Setting | Value | Description | 
| --- | --- | --- | 
| rotate | 2 | Number of rotated log files to keep | 
| size | 50 MB | Maximum size before rotation | 
| copytruncate | enabled | Copies and truncates the original log file | 
| compress | disabled | Rotated logs are not compressed | 
| missingok | enabled | No error if log file is missing | 
| notifempty | enabled | Doesn't rotate empty files | 
| noolddir | enabled | Rotated files stay in same directory | 
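These settings translate to a logrotate stanza similar to the following sketch. This is illustrative only; the file that HyperPod manages at `/etc/logrotate.d/sagemaker-hyperpod-slurm` is authoritative and may differ in detail.

```
/var/log/slurm/*.log {
    rotate 2
    size 50M
    copytruncate
    missingok
    notifempty
    # No compress or olddir directives: rotated logs are not
    # compressed and stay in the same directory.
}
```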

# Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx"></a>

To mount an Amazon FSx for Lustre shared file system to your HyperPod cluster, set up the following.

1. Use your Amazon VPC. 

   1. For HyperPod cluster instances to communicate within your VPC, make sure that you attach the permissions described in [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc) to the IAM role for SageMaker HyperPod. 

   1. In `create_cluster.json`, include the following VPC information.

      ```
      "VpcConfig": { 
          "SecurityGroupIds": [ "string" ],
          "Subnets": [ "string" ]
      }
      ```

      For more tips about setting up Amazon VPC, see [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

1. To finish configuring Slurm with Amazon FSx for Lustre, use one of the following approaches. You can find the Amazon FSx information either in the Amazon FSx for Lustre console in your account or by running the `aws fsx describe-file-systems` Amazon CLI command.

   **Option A: API-Driven Configuration (Recommended)**

   Specify the Amazon FSx configuration directly in the CreateCluster API payload using `InstanceStorageConfigs` within each instance group. This approach supports both FSx for Lustre and FSx for OpenZFS, and allows per-instance-group FSx configuration.

   ```
   "InstanceStorageConfigs": [
       {
           "FsxLustreConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx",
               "MountName": "1abcdefg"
           }
       }
   ]
   ```

   For FSx for OpenZFS, use `FsxOpenZfsConfig` instead:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx-openzfs"
           }
       }
   ]
   ```

   For more details, see [Getting started with SageMaker HyperPod using the Amazon CLI](sagemaker-hyperpod-quickstart.md).

   **Option B: Legacy Configuration**

   Specify the Amazon FSx DNS name and Amazon FSx mount name in `provisioning_parameters.json` as shown in the figure in the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section.

   ```
   "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
   "fsx_mountname": "1abcdefg"
   ```

# Validating the JSON configuration files before creating a Slurm cluster on HyperPod
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files"></a>

To validate the JSON configuration files before submitting a cluster creation request, use the configuration validation script [validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). This script parses and compares your HyperPod cluster configuration JSON file and Slurm configuration JSON file, and identifies any resource misconfiguration between the two files, as well as across your Amazon EC2, Amazon VPC, and Amazon FSx resources. For example, to validate the `create_cluster.json` and `provisioning_parameters.json` files from the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section, run the validation script as follows.

```
python3 validate-config.py --cluster-config create_cluster.json --provisioning-parameters provisioning_parameters.json
```

The following is an example output of a successful validation.

```
✔️  Validated instance group name worker-group-1 is correct ...

✔️  Validated subnet subnet-012345abcdef67890 ...
✔️  Validated security group sg-012345abcdef67890 ingress rules ...
✔️  Validated security group sg-012345abcdef67890 egress rules ...
✔️  Validated FSx Lustre DNS name fs-012345abcdef67890.fsx.us-east-1.amazonaws.com
✔️  Validated FSx Lustre mount name abcdefgh
✅ Cluster Validation succeeded
```

# Validating runtime before running production workloads on a HyperPod Slurm cluster
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime"></a>

To check the runtime before running any production workloads on a Slurm cluster on HyperPod, use the runtime validation script [hyperpod-precheck.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py). This script checks whether the Slurm cluster has all of the packages required for running Docker, whether the cluster has a properly mounted FSx for Lustre file system and a user directory sharing that file system, and whether the Slurm daemon (`slurmd`) is running on all compute nodes.

To run the script on multiple nodes at once, use `srun` as shown in the following example, which runs the script on a Slurm cluster of 8 nodes.

```
# The following command runs on 8 nodes
srun -N 8 python3 hyperpod-precheck.py
```

**Note**  
To learn more about the validation script such as what runtime validation functions the script provides and guidelines to resolve issues that don't pass the validations, see [Runtime validation before running workloads](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#35-runtime-validation-before-running-workloads) in the *Awsome Distributed Training GitHub repository*.

# Developing lifecycle scripts interactively on a HyperPod cluster node
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts"></a>

This section explains how you can interactively develop lifecycle scripts without repeatedly creating and deleting a HyperPod cluster.

1. Create a HyperPod cluster with the base lifecycle scripts.

1. Log in to a cluster node.

1. Develop a script (`configure_xyz.sh`) by editing and running it repeatedly on the node.

   1. HyperPod runs the lifecycle scripts as the root user, so we recommend that you also run `configure_xyz.sh` as the root user while developing, to make sure that the script is tested under the same conditions in which HyperPod runs it.

1. Integrate the script into `lifecycle_script.py` by adding a code line similar to the following.

   ```
   ExecuteBashScript("./utils/configure_xyz.sh").run()
   ```

1. Upload the updated lifecycle scripts to the S3 bucket that you initially used for uploading the base lifecycle scripts.

1. Test the integrated version of `lifecycle_script.py` by creating a new HyperPod cluster. You can also use manual instance replacement to test the updated lifecycle scripts by creating new instances. For detailed instructions, see [Manually replace a node](https://docs.amazonaws.cn//sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html#sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace). Note that only worker nodes are replaceable.
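Because lifecycle scripts also run on every node that is added later, and you will run your script repeatedly while developing, it helps to make `configure_xyz.sh` idempotent. The following is a minimal sketch (the marker path and script location are illustrative):

```shell
# Sketch of an idempotent configure_xyz.sh: repeated runs are harmless
rm -f /tmp/.configure_xyz_done          # reset for this demonstration
cat > /tmp/configure_xyz.sh <<'EOF'
#!/bin/bash
set -euo pipefail
MARKER=/tmp/.configure_xyz_done          # illustrative marker path
if [ -f "$MARKER" ]; then
    echo "already configured"
    exit 0
fi
# ... your actual setup commands go here ...
touch "$MARKER"
echo "configured"
EOF
chmod +x /tmp/configure_xyz.sh
/tmp/configure_xyz.sh   # first run: prints "configured"
/tmp/configure_xyz.sh   # second run: prints "already configured"
```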

# SageMaker HyperPod multi-head node support
<a name="sagemaker-hyperpod-multihead-slurm"></a>

You can create multiple controller (head) nodes in a single SageMaker HyperPod Slurm cluster, with one serving as the primary controller node and the others serving as backup controller nodes. The primary controller node is responsible for controlling the compute (worker) nodes and handling Slurm operations. The backup controller nodes constantly monitor the primary controller node. If the primary controller node fails or becomes unresponsive, one of the backup controller nodes will automatically take over as the new primary controller node.

Configuring multiple controller nodes in SageMaker HyperPod Slurm clusters provides several key benefits. It removes the single point of failure of having one controller node by providing redundant controller nodes, enables automatic failover to backup controller nodes for faster recovery, and allows you to manage your own accounting database and Slurm configuration independently.

## Key concepts
<a name="sagemaker-hyperpod-multihead-slurm-concepts"></a>

The following provides details about the concepts related to SageMaker HyperPod multiple controller (head) nodes support for Slurm clusters.

**Controller node**

A controller node is an Amazon EC2 instance within a cluster that runs critical Slurm services for managing and coordinating the cluster's operations. Specifically, it hosts the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) and the [Slurm database daemon (slurmdbd)](https://slurm.schedmd.com/slurmdbd.html). A controller node is also known as a head node.

**Primary controller node**

A primary controller node is the active controller node that currently manages the Slurm cluster. It receives and executes commands from users to control the cluster and to allocate resources on the compute nodes for running jobs.

**Backup controller node**

A backup controller node is a standby controller node in a Slurm cluster that is not currently managing the cluster. It runs the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) in standby mode. Any controller commands executed on a backup controller node are propagated to the primary controller node for execution. The backup controller node's primary purpose is to continuously monitor the primary controller node and take over its responsibilities if the primary controller node fails or becomes unresponsive.

**Compute node**

A compute node is an Amazon EC2 instance within a cluster that hosts the [Slurm worker daemon (slurmd)](https://slurm.schedmd.com/slurmd.html). The compute node's primary function is to execute jobs assigned by the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) running on the primary controller node. When a job is scheduled, the compute node receives instructions from the controller daemon to carry out the necessary tasks and computations for that job within the node itself. A compute node is also known as a worker node.

## How it works
<a name="sagemaker-hyperpod-multihead-slurm-how"></a>

The following diagram illustrates how different Amazon services work together to support the multiple controller (head) nodes architecture for SageMaker HyperPod Slurm clusters.

![\[SageMaker HyperPod multi-head nodes architecture diagram\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod/hyperpod-multihead-architecture.png)


The Amazon services that work together to support the SageMaker HyperPod multiple controller (head) nodes architecture include the following.


**Amazon services that work together to support the SageMaker HyperPod multiple controller nodes architecture**  

| Service | Description | 
| --- | --- | 
| IAM (Amazon Identity and Access Management) | Defines two IAM roles to control the access permissions: one role for the compute node instance group and the other for the controller node instance group. | 
| Amazon RDS for MariaDB | Stores accounting data for Slurm, which holds job records and metering data. | 
| Amazon Secrets Manager | Stores and manages credentials that can be accessed by Amazon FSx for Lustre. | 
| Amazon FSx for Lustre  | Stores Slurm configurations and runtime state. | 
| Amazon VPC | Provides an isolated network environment where the HyperPod cluster and its resources are deployed. | 
| Amazon SNS  | Sends notifications to administrators when there are status changes (Slurm controller is ON or OFF) related to the primary controller (head) node. | 

The HyperPod cluster itself consists of controller nodes (primary and backup) and compute nodes. The controller nodes run the Slurm controller daemon (`slurmctld`) and database daemon (`slurmdbd`), which manage and monitor the workload across the compute nodes.

The controller nodes access Slurm configurations and runtime state stored in the Amazon FSx for Lustre file system. The Slurm accounting data is stored in the Amazon RDS for MariaDB database. Amazon Secrets Manager provides secure access to the database credentials for the controller nodes.

If there is a status change (Slurm controller is `ON` or `OFF`) in the Slurm controller nodes, Amazon SNS sends notifications to the admin for further action.

This multiple controller nodes architecture eliminates the single point of failure of a single controller (head) node, enables fast and automatic failover recovery, and gives you control over the Slurm accounting database and configurations.

# Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster
<a name="sagemaker-hyperpod-multihead-slurm-setup"></a>

This topic explains how to configure multiple controller (head) nodes in a SageMaker HyperPod Slurm cluster using lifecycle scripts. Before you start, review the prerequisites listed in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md) and familiarize yourself with the lifecycle scripts in [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The instructions in this topic use Amazon CLI commands in Amazon Linux environment. Note that the environment variables used in these commands are available in the current session unless explicitly preserved.

**Topics**
+ [Provisioning resources using Amazon CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md)
+ [Creating and attaching an IAM policy](sagemaker-hyperpod-multihead-slurm-iam.md)
+ [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md)
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-multihead-slurm-create.md)
+ [Considering important notes](sagemaker-hyperpod-multihead-slurm-notes.md)
+ [Reviewing environment variables reference](sagemaker-hyperpod-multihead-slurm-variables-reference.md)

# Provisioning resources using Amazon CloudFormation stacks
<a name="sagemaker-hyperpod-multihead-slurm-cfn"></a>

To set up multiple controller nodes in a HyperPod Slurm cluster, provision Amazon resources through two Amazon CloudFormation stacks: [Provision basic resources](#sagemaker-hyperpod-multihead-slurm-cfn-basic) and [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

## Provision basic resources
<a name="sagemaker-hyperpod-multihead-slurm-cfn-basic"></a>

Follow these steps to provision basic resources for your Amazon SageMaker HyperPod Slurm cluster.

1. Download the [sagemaker-hyperpod.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml) template file to your machine. This YAML file is an Amazon CloudFormation template that defines the following resources to create for your Slurm cluster.
   + An execution IAM role for the compute node instance group
   + An Amazon S3 bucket to store the lifecycle scripts
   + Public and private subnets (private subnets have internet access through NAT gateways)
   + Internet Gateway/NAT gateways
   + Two Amazon EC2 security groups
   + An Amazon FSx volume to store configuration files

1. Run the following CLI command to create an Amazon CloudFormation stack named `sagemaker-hyperpod`. Define the Availability Zone (AZ) IDs for your cluster in `PrimarySubnetAZ` and `BackupSubnetAZ`. For example, *use1-az4* is an AZ ID for an Availability Zone in the `us-east-1` Region. For more information, see [Availability Zone IDs](https://docs.amazonaws.cn//ram/latest/userguide/working-with-az-ids.html) and [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod.yaml \
   --stack-name sagemaker-hyperpod \
   --parameter-overrides PrimarySubnetAZ=use1-az4 BackupSubnetAZ=use1-az1 \
   --capabilities CAPABILITY_IAM
   ```

   For more information, see [deploy](https://docs.amazonaws.cn//cli/latest/reference/cloudformation/deploy/) from the Amazon Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod
   ```

1. (Optional) Verify the stack in the [Amazon CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation pane, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod**.
   + Choose tabs such as **Resources** and **Outputs** to review the resources and outputs.

1. Create environment variables from the stack (`sagemaker-hyperpod`) outputs. You will use values of these variables to [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

   ```
   source .env
   PRIMARY_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`PrimaryPrivateSubnet`].OutputValue' --output text)
   BACKUP_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`BackupPrivateSubnet`].OutputValue' --output text)
   EMAIL=$(bash -c 'read -p "INPUT YOUR SNSSubEmailAddress HERE: " && echo $REPLY')
   DB_USER_NAME=$(bash -c 'read -p "INPUT YOUR DB_USER_NAME HERE: " && echo $REPLY')
   SECURITY_GROUP=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`SecurityGroup`].OutputValue' --output text)
   ROOT_BUCKET_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonS3BucketName`].OutputValue' --output text)
   SLURM_FSX_DNS_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemDNSname`].OutputValue' --output text)
   SLURM_FSX_MOUNT_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemMountname`].OutputValue' --output text)
   COMPUTE_NODE_ROLE=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonSagemakerClusterExecutionRoleArn`].OutputValue' --output text)
   ```

   When you see prompts asking for your email address and database user name, enter values like the following.

   ```
   INPUT YOUR SNSSubEmailAddress HERE: Email_address_to_receive_SNS_notifications
   INPUT YOUR DB_USER_NAME HERE: Database_user_name_you_define
   ```

   To verify variable values, use the `echo` command.

   ```
   echo $REGION
   us-east-1
   ```
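After sourcing the variables, you can also spot-check that none of them came back empty; an empty value usually means an output key name or stack name was mistyped. The following sketch uses bash indirect expansion and the variable names defined in the preceding step:

```shell
# Warn about any required variable that ended up empty
for v in PRIMARY_SUBNET BACKUP_SUBNET SECURITY_GROUP ROOT_BUCKET_NAME \
         SLURM_FSX_DNS_NAME SLURM_FSX_MOUNT_NAME COMPUTE_NODE_ROLE; do
    [ -n "${!v:-}" ] || echo "WARNING: $v is empty"
done
```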

## Provision additional resources to support multiple controller nodes
<a name="sagemaker-hyperpod-multihead-slurm-cfn-multihead"></a>

Follow these steps to provision additional resources for your Amazon SageMaker HyperPod Slurm cluster with multiple controller nodes.

1. Download the [sagemaker-hyperpod-slurm-multi-headnode.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod-slurm-multi-headnode.yaml) template file to your machine. This second YAML file is an Amazon CloudFormation template that defines the additional resources to create for multiple controller nodes support in your Slurm cluster.
   + An execution IAM role for the controller node instance group
   + An Amazon RDS for MariaDB instance
   + An Amazon SNS topic and subscription
   + Amazon Secrets Manager credentials for Amazon RDS for MariaDB

1. Run the following CLI command to create an Amazon CloudFormation stack named `sagemaker-hyperpod-mh`. This second stack uses the Amazon CloudFormation template to create additional Amazon resources to support the multiple controller nodes architecture.

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod-slurm-multi-headnode.yaml \
   --stack-name sagemaker-hyperpod-mh \
   --parameter-overrides \
   SlurmDBSecurityGroupId=$SECURITY_GROUP \
   SlurmDBSubnetGroupId1=$PRIMARY_SUBNET \
   SlurmDBSubnetGroupId2=$BACKUP_SUBNET \
   SNSSubEmailAddress=$EMAIL \
   SlurmDBUsername=$DB_USER_NAME \
   --capabilities CAPABILITY_NAMED_IAM
   ```

   For more information, see [deploy](https://docs.amazonaws.cn//cli/latest/reference/cloudformation/deploy/) from the Amazon Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod-mh
   ```

1. (Optional) Verify the stack in the [Amazon CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation pane, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod-mh**.
   + Choose tabs such as **Resources** and **Outputs** to review the resources and outputs.

1. Create environment variables from the stack (`sagemaker-hyperpod-mh`) outputs. You will use values of these variables to update the configuration file (`provisioning_parameters.json`) in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md).

   ```
   source .env
   SLURM_DB_ENDPOINT_ADDRESS=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBEndpointAddress`].OutputValue' --output text)
   SLURM_DB_SECRET_ARN=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBSecretArn`].OutputValue' --output text)
   SLURM_EXECUTION_ROLE_ARN=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmExecutionRoleArn`].OutputValue' --output text)
   SLURM_SNS_FAILOVER_TOPIC_ARN=$(aws --region us-east-1 cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmFailOverSNSTopicArn`].OutputValue' --output text)
   ```

# Creating and attaching an IAM policy
<a name="sagemaker-hyperpod-multihead-slurm-iam"></a>

This section explains how to create an IAM policy and attach it to the execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

1. Download the [IAM policy example](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/1.AmazonSageMakerClustersExecutionRolePolicy.json) to your machine from the GitHub repository.

1. Create an IAM policy with the downloaded example, using the [create-policy](https://docs.amazonaws.cn//cli/latest/reference/iam/create-policy.html) CLI command.

   ```
   aws --region us-east-1 iam create-policy \
       --policy-name AmazonSagemakerExecutionPolicy \
       --policy-document file://1.AmazonSageMakerClustersExecutionRolePolicy.json
   ```

   Example output of the command.

   ```
   {
       "Policy": {
           "PolicyName": "AmazonSagemakerExecutionPolicy",
           "PolicyId": "ANPAXISIWY5UYZM7WJR4W",
           "Arn": "arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy",
           "Path": "/",
           "DefaultVersionId": "v1",
           "AttachmentCount": 0,
           "PermissionsBoundaryUsageCount": 0,
           "IsAttachable": true,
           "CreateDate": "2025-01-22T20:01:21+00:00",
           "UpdateDate": "2025-01-22T20:01:21+00:00"
       }
   }
   ```

1. Attach the policy `AmazonSagemakerExecutionPolicy` to the Slurm execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead), using the [attach-role-policy](https://docs.amazonaws.cn//cli/latest/reference/iam/attach-role-policy.html) CLI command.

   ```
   aws --region us-east-1 iam attach-role-policy \
       --role-name AmazonSagemakerExecutionRole \
       --policy-arn arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy
   ```

   This command doesn't produce any output.

   (Optional) If you use environment variables, here are the example commands.
   + To get the role name and policy name 

     ```
      POLICY=$(aws --region $REGION iam list-policies --query 'Policies[?PolicyName==`AmazonSagemakerExecutionPolicy`].Arn' --output text)
      ROLENAME=$(aws --region $REGION iam list-roles --query "Roles[?Arn=='${SLURM_EXECUTION_ROLE_ARN}'].RoleName" --output text)
     ```
   + To attach the policy

     ```
      aws --region us-east-1 iam attach-role-policy \
          --role-name $ROLENAME --policy-arn $POLICY
     ```

For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

# Preparing and uploading lifecycle scripts
<a name="sagemaker-hyperpod-multihead-slurm-scripts"></a>

After creating all the required resources, you'll need to set up [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) for your SageMaker HyperPod cluster. These [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) provide a [base configuration](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) you can use to create a basic HyperPod Slurm cluster.

## Prepare the lifecycle scripts
<a name="sagemaker-hyperpod-multihead-slurm-prepare-scripts"></a>

Follow these steps to get the lifecycle scripts.

1. Download the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) from the GitHub repository to your machine.

1. Upload the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) to the Amazon S3 bucket you created in [Provision basic resources](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-basic), using the [cp](https://docs.amazonaws.cn//cli/latest/reference/s3/cp.html) CLI command.

   ```
   aws s3 cp --recursive LifeCycleScripts/base-config s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config
   ```

## Create configuration file
<a name="sagemaker-hyperpod-multihead-slurm-update-config-file"></a>

Follow these steps to create the configuration file and upload it to the same Amazon S3 bucket where you store the lifecycle scripts.

1. Create a configuration file named `provisioning_parameters.json` with the following configuration. Note that `slurm_sns_arn` is optional. If not provided, HyperPod will not set up the Amazon SNS notifications.

   ```
   cat <<EOF > /tmp/provisioning_parameters.json
   {
     "version": "1.0.0",
     "workload_manager": "slurm",
     "controller_group": "$CONTOLLER_IG_NAME",
     "login_group": "my-login-group",
     "worker_groups": [
       {
         "instance_group_name": "$COMPUTE_IG_NAME",
         "partition_name": "dev"
       }
     ],
     "fsx_dns_name": "$SLURM_FSX_DNS_NAME",
     "fsx_mountname": "$SLURM_FSX_MOUNT_NAME",
     "slurm_configurations": {
       "slurm_database_secret_arn": "$SLURM_DB_SECRET_ARN",
       "slurm_database_endpoint": "$SLURM_DB_ENDPOINT_ADDRESS",
       "slurm_shared_directory": "/fsx",
       "slurm_database_user": "$DB_USER_NAME",
       "slurm_sns_arn": "$SLURM_SNS_FAILOVER_TOPIC_ARN"
     }
   }
   EOF
   ```

1. Upload the `provisioning_parameters.json` file to the same Amazon S3 bucket where you store the lifecycle scripts.

   ```
   aws s3 cp /tmp/provisioning_parameters.json s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config/provisioning_parameters.json
   ```

**Note**  
If you are using API-driven configuration, the `provisioning_parameters.json` file is not required. With API-driven configuration, you define Slurm node types, partitions, and FSx mounting directly in the CreateCluster API payload. For details, see [Getting started with SageMaker HyperPod using the Amazon CLI](smcluster-getting-started-slurm-cli.md).
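Because the heredoc interpolates shell variables, it is easy to end up with a malformed file if an edit goes wrong. Before uploading, you can confirm that the generated file is at least well-formed JSON:

```shell
# Fails fast if the generated file is not well-formed JSON
python3 -m json.tool /tmp/provisioning_parameters.json > /dev/null \
    && echo "provisioning_parameters.json is valid JSON"
```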

## Verify files in Amazon S3 bucket
<a name="sagemaker-hyperpod-multihead-slurm-verify-s3"></a>

After you upload all the lifecycle scripts and the `provisioning_parameters.json` file, your Amazon S3 bucket should look like the following.

![\[Image showing all the lifecycle scripts uploaded to the Amazon S3 bucket in the Amazon Simple Storage Service console.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-scripts-s3.png)


For more information, see [Start with base lifecycle scripts provided by HyperPod](https://docs.amazonaws.cn//sagemaker/latest/dg/sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.html).

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-multihead-slurm-create"></a>

After setting up all the required resources and uploading the scripts to the Amazon S3 bucket, you can create a cluster.

1. To create a cluster, run the [create-cluster](https://docs.amazonaws.cn//cli/latest/reference/sagemaker/create-cluster.html) Amazon CLI command. The creation process can take up to 15 minutes to complete.

   ```
   aws --region $REGION sagemaker create-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --vpc-config '{
           "SecurityGroupIds":["'$SECURITY_GROUP'"],
           "Subnets":["'$PRIMARY_SUBNET'", "'$BACKUP_SUBNET'"]
       }' \
       --instance-groups '[{
           "InstanceGroupName": "'$CONTOLLER_IG_NAME'",
           "InstanceType": "ml.t3.medium",
           "InstanceCount": 2,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://'$BUCKET_NAME'",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "'$SLURM_EXECUTION_ROLE_ARN'",
           "ThreadsPerCore": 1
       },
       {
           "InstanceGroupName": "'$COMPUTE_IG_NAME'",
           "InstanceType": "ml.c5.xlarge",
           "InstanceCount": 2,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://'$BUCKET_NAME'",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "'$COMPUTE_NODE_ROLE'",
           "ThreadsPerCore": 1
       }]'
   ```

   After successful execution, the command returns the cluster ARN like the following.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster_id"
   }
   ```

1. (Optional) To check the status of your cluster, you can use the SageMaker AI console ([https://console.amazonaws.cn/sagemaker/](https://console.amazonaws.cn/sagemaker/)). From the left navigation, choose **HyperPod Clusters**, then choose **Cluster Management**. Choose a cluster name to open the cluster details page. If your cluster is created successfully, you will see the cluster status is **InService**.  
![\[Image showing a HyperPod Slurm cluster with multiple controller nodes in the Amazon SageMaker AI console.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-cluster.png)
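
The `ClusterArn` returned by `create-cluster` ends with the cluster ID, which you need later for tasks such as SSM access. The following is a minimal sketch of extracting it in the shell; the ARN value is a placeholder.

```shell
# Placeholder ARN copied from the create-cluster output.
CLUSTER_ARN="arn:aws:sagemaker:us-east-1:111122223333:cluster/aa11bbbbb222"

# The cluster ID is the final path segment of the ARN.
CLUSTER_ID="${CLUSTER_ARN##*/}"
echo "$CLUSTER_ID"
# → aa11bbbbb222
```

You can also pass your cluster name to `aws sagemaker describe-cluster` and check the `ClusterStatus` field until it reaches `InService`.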

# Considering important notes
<a name="sagemaker-hyperpod-multihead-slurm-notes"></a>

This section provides several important notes that you might find helpful. 

1. To migrate to a multi-controller Slurm cluster, complete these steps.

   1. Follow the instructions in [Provisioning resources using Amazon CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md) to provision all the required resources.

   1. Follow the instructions in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md) to upload the updated lifecycle scripts. When updating the `provisioning_parameters.json` file, move your existing controller group to the `worker_groups` section, and add a new controller group name in the `controller_group` section.

   1. Run the [update-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/update-cluster.html) CLI command to create the new controller group while keeping the original controller and compute instance groups.

1. To scale down the number of controller nodes, use the [update-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/update-cluster.html) CLI command. For each controller instance group, the minimum number of controller nodes you can scale down to is 1; you cannot scale down the number of controller nodes to 0.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [update-cluster](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/update-cluster.html) CLI command.

   The following is an example CLI command to scale down the number of controller nodes.

   ```
   aws sagemaker update-cluster \
       --cluster-name my_cluster \
       --instance-groups '[{
           "InstanceGroupName": "controller_ig_name",
           "InstanceType": "ml.t3.medium",
           "InstanceCount": 3,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "slurm_execution_role_arn",
           "ThreadsPerCore": 1
       },
       {
           "InstanceGroupName": "compute-ig_name",
           "InstanceType": "ml.c5.xlarge",
           "InstanceCount": 2,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "compute_node_role_arn",
           "ThreadsPerCore": 1
       }]'
   ```

1. To batch delete controller nodes, use the [batch-delete-cluster-nodes](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command. For each controller instance group, you must keep at least one controller node; if you attempt to batch delete all the controller nodes in a group, the API operation fails.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [batch-delete-cluster-nodes](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command.

   The following is an example CLI command to batch delete the controller nodes.

   ```
   aws sagemaker batch-delete-cluster-nodes --cluster-name my_cluster --node-ids instance_ids_to_delete
   ```

1. To troubleshoot your cluster creation issues, check the failure message from the cluster details page in your SageMaker AI console. You can also use CloudWatch logs to troubleshoot cluster creation issues. From the CloudWatch console, choose **Log groups**. Then, search `clusters` to see the list of log groups related to your cluster creation.  
![\[Image showing Amazon SageMaker HyperPod cluster log groups in the CloudWatch console.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-logs.png)

# Reviewing environment variables reference
<a name="sagemaker-hyperpod-multihead-slurm-variables-reference"></a>

The following environment variables are defined and used in the tutorial of [Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster](sagemaker-hyperpod-multihead-slurm-setup.md). These environment variables are only available in the current session unless explicitly preserved. They are defined using the `$variable_name` syntax. Variables with key/value pairs represent Amazon-created resources, while variables without keys are user-defined.


**Environment variables reference**  

| Variable | Description | 
| --- | --- | 
| `$BACKUP_SUBNET` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$COMPUTE_IG_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$COMPUTE_NODE_ROLE` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$CONTOLLER_IG_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$DB_USER_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$EMAIL` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$PRIMARY_SUBNET` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$POLICY` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$REGION` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$ROOT_BUCKET_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SECURITY_GROUP` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_DB_ENDPOINT_ADDRESS` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_DB_SECRET_ARN` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_EXECUTION_ROLE_ARN` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_FSX_DNS_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_FSX_MOUNT_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_SNS_FAILOVER_TOPIC_ARN` |  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
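
Because these variables live only in the current shell session, you typically export them at the start of each session. The following is a hedged sketch with placeholder values; the actual values come from your own resources and stack outputs.

```shell
# Placeholder values for illustration only; substitute your own resource values.
export REGION="us-west-2"
export PRIMARY_SUBNET="subnet-1111222233334444a"
export BACKUP_SUBNET="subnet-5555666677778888b"
export SECURITY_GROUP="sg-0123456789abcdef0"
export CONTOLLER_IG_NAME="controller-group"   # spelling kept as used in the tutorial
export COMPUTE_IG_NAME="worker-group-1"
echo "$REGION"
```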

# Jobs on SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-run-jobs-slurm"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters. Depending on how you have set up the environment on your HyperPod cluster, there are many ways to run ML workloads on HyperPod clusters. Examples of running ML workloads on HyperPod clusters are also provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following topics walk you through how to log in to the provisioned HyperPod clusters and get you started with running sample ML workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md)
+ [Scheduling a Slurm job on a SageMaker HyperPod cluster](sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job.md)
+ [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md)
+ [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md)

# Accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes"></a>

You can access your **InService** cluster through Amazon Systems Manager (SSM) by running the Amazon CLI command `aws ssm start-session` with the SageMaker HyperPod cluster host name in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [Amazon CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.  
If you haven't set up Amazon Systems Manager, follow the instructions provided at [Setting up Amazon Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```
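
As a sketch of how the `--target` value is composed, the host name simply concatenates the three identifiers in the format described above; the values below are placeholders.

```shell
# Placeholder identifiers retrieved from describe-cluster and list-cluster-nodes.
CLUSTER_ID="aa11bbbbb222"
INSTANCE_GROUP="controller-group"
INSTANCE_ID="i-111222333444555aa"

# Compose the SSM target: sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "$TARGET"
# → sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa
```

Running `aws ssm start-session --target "$TARGET"` then opens the session.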

For advanced settings for practical use of HyperPod clusters, see the following topics.

**Topics**
+ [Additional tips for accessing your SageMaker HyperPod cluster nodes](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips)
+ [Set up a multi-user environment through the Amazon FSx shared space](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space)
+ [Set up a multi-user environment by integrating HyperPod clusters with Active Directory](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory)

## Additional tips for accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips"></a>

**Use the `easy-ssh.sh` script provided by HyperPod for simplifying the connection process**

To turn the previous process into a single-line command, the HyperPod team provides the [easy-ssh.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh) script, which retrieves your cluster information, assembles it into the SSM command, and connects to the compute node. You don't need to manually look up the required HyperPod cluster information, because this script runs the `describe-cluster` and `list-cluster-nodes` commands and parses the information needed to complete the SSM command. The following example commands show how to run the `easy-ssh.sh` script. If it runs successfully, you're connected to the cluster as the root user. The script also prints a code snippet for setting up SSH by adding the HyperPod cluster as a remote host through an SSM proxy. By setting up SSH, you can connect your local development environment, such as Visual Studio Code, to the HyperPod cluster.

```
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c <node-group> <cluster-name>
Cluster id: <cluster_id>
Instance id: <instance_id>
Node Group: <node-group>
Add the following to your ~/.ssh/config to easily connect:

$ cat <<EOF >> ~/.ssh/config
Host <cluster-name>
  User ubuntu
  ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
EOF

Add your ssh keypair and then you can do:

$ ssh <cluster-name>

aws ssm start-session --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id>

Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```

**Set up for easy access with SSH by using the HyperPod compute node as a remote host**

To further simplify access to the compute node using SSH from a local machine, the `easy-ssh.sh` script outputs a code snippet that sets up the HyperPod cluster as a remote host, as shown in the previous section. The code snippet is auto-generated so that you can add it directly to the `~/.ssh/config` file on your local device. The following procedure shows how to set up easy SSH access through the SSM proxy, so that you or your cluster users can run `ssh <cluster-name>` to connect directly to the HyperPod cluster node.

1. On your local device, add the HyperPod compute node with a user name as a remote host to the `~/.ssh/config` file. The following command shows how to append the auto-generated code snippet from the `easy-ssh.sh` script to the `~/.ssh/config` file. Make sure that you copy it from the auto-generated output of the `easy-ssh.sh` script that has the correct cluster information.

   ```
   $ cat <<EOF >> ~/.ssh/config
   Host <cluster-name>
     User ubuntu
     ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
   EOF
   ```

1. On the HyperPod cluster node, add the public key on your local device to the `~/.ssh/authorized_keys` file on the HyperPod cluster node.

   1. Print the public key file on your local machine.

      ```
      $ cat ~/.ssh/id_rsa.pub
      ```

      This should return your key. Copy the output of this command. 

      (Optional) If you don't have a public key, create one by running the following command.

      ```
      $ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
      ```

   1. Connect to the cluster node and switch to the user to add the key. The following command is an example of accessing the node as the `ubuntu` user. Replace `ubuntu` with the user name for which you want to set up easy SSH access.

      ```
      $ ./easy-ssh.sh -c <node-group> <cluster-name>
      $ sudo su - ubuntu
      ubuntu@ip-111-22-333-444:/usr/bin#
      ```

   1. Open the `~/.ssh/authorized_keys` file and add the public key at the end of the file.

      ```
      ubuntu@ip-111-22-333-444:/usr/bin# vim ~/.ssh/authorized_keys
      ```

After you finish setting up, you can connect to the HyperPod cluster node as the user by running a simplified SSH command as follows.

```
$ ssh <cluster-name>
ubuntu@ip-111-22-333-444:/usr/bin#
```

Also, you can use the host for remote development from an IDE on your local device, such as [Visual Studio Code Remote - SSH](https://code.visualstudio.com/docs/remote/ssh).

## Set up a multi-user environment through the Amazon FSx shared space
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space"></a>

You can use the Amazon FSx shared space to manage a multi-user environment in a Slurm cluster on SageMaker HyperPod. If you configured your Slurm cluster with Amazon FSx during HyperPod cluster creation, this is a good option for setting up workspaces for your cluster users. Create a new user and set up the user's home directory on the Amazon FSx shared file system.

**Tip**  
To allow users to access your cluster through their user name and dedicated directories, you should also associate them with IAM roles or users by tagging them as guided in **Option 2** of step 5 under the procedure **To turn on Run As support for Linux and macOS managed nodes** provided at [Turn on Run As support for Linux and macOS managed nodes](https://docs.amazonaws.cn/systems-manager/latest/userguide/session-preferences-run-as.html) in the Amazon Systems Manager User Guide. See also [Setting up Amazon Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

**To set up a multi-user environment while creating a Slurm cluster on SageMaker HyperPod**

The SageMaker HyperPod service team provides the [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) script as part of the base lifecycle script samples.

1. Prepare a text file named `shared_users.txt` that you need to create in the following format. The first column is for user names, the second column is for unique user IDs, and the third column is for the user directories in the Amazon FSx shared space.

   ```
   username1,uid1,/fsx/username1
   username2,uid2,/fsx/username2
   ...
   ```

1. Make sure that you upload the `shared_users.txt` and [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) files to the S3 bucket for HyperPod lifecycle scripts. While cluster creation, cluster update, or cluster software update is in progress, the [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) script reads `shared_users.txt` and sets up the user directories.
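
As a quick sketch, you can generate and validate a `shared_users.txt` file like the following; the user names and UIDs are hypothetical.

```shell
# Hypothetical users; each line is username,uid,fsx-home-directory.
cat > shared_users.txt << 'EOF'
username1,2001,/fsx/username1
username2,2002,/fsx/username2
EOF

# Sanity-check the format: every line must have exactly three comma-separated fields.
awk -F, 'NF != 3 { exit 1 }' shared_users.txt && echo "format ok"
# → format ok
```

After checking the format, upload the file to your lifecycle-script S3 bucket, for example with `aws s3 cp shared_users.txt s3://your-bucket/` (bucket name is a placeholder).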

**To create new users and add them to an existing Slurm cluster running on SageMaker HyperPod**

1. On the head node, run the following command to save a script that helps create a user. Make sure that you run this with sudo permissions.

   ```
   $ cat > create-user.sh << EOL
   #!/bin/bash
   
   set -x
   
   # Prompt user to get the new user name.
   read -p "Enter the new user name, i.e. 'sean': 
   " USER
   
   # create home directory as /fsx/<user>
   # Create the new user on the head node
   sudo useradd \$USER -m -d /fsx/\$USER --shell /bin/bash;
   user_id=\$(id -u \$USER)
   
   # add user to docker group
   sudo usermod -aG docker \${USER}
   
   # setup SSH Keypair
   sudo -u \$USER ssh-keygen -t rsa -q -f "/fsx/\$USER/.ssh/id_rsa" -N ""
   sudo -u \$USER cat /fsx/\$USER/.ssh/id_rsa.pub | sudo -u \$USER tee /fsx/\$USER/.ssh/authorized_keys
   
   # add user to compute nodes
   read -p "Number of compute nodes in your cluster, i.e. 8: 
   " NUM_NODES
   srun -N \$NUM_NODES sudo useradd -u \$user_id \$USER -d /fsx/\$USER --shell /bin/bash;
   
   # add them as a sudoer
   read -p "Do you want this user to be a sudoer? (y/N):
   " SUDO
   if [ "\$SUDO" = "y" ]; then
           sudo usermod -aG sudo \$USER
           sudo srun -N \$NUM_NODES sudo usermod -aG sudo \$USER
           echo -e "If you haven't already you'll need to run:\n\nsudo visudo /etc/sudoers\n\nChange the line:\n\n%sudo   ALL=(ALL:ALL) ALL\n\nTo\n\n%sudo   ALL=(ALL:ALL) NOPASSWD: ALL\n\nOn each node."
   fi
   EOL
   ```

1. Run the script with the following command. You'll be prompted to enter the name of the new user and the number of compute nodes that you want to allow the user to access.

   ```
   $ bash create-user.sh
   ```

1. Test the user by running the following command.

   ```
   $ sudo su - <user> && ssh $(srun hostname)
   ```

1. Add the user information to the `shared_users.txt` file, so the user will be created on any new compute nodes or new clusters.

## Set up a multi-user environment by integrating HyperPod clusters with Active Directory
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory"></a>

In practical use cases, HyperPod clusters are typically used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files and run their own jobs without impacting each other's work. One way to set up a multi-user environment is to use the Linux user and group mechanism to statically create multiple users on each instance through lifecycle scripts. The drawback of this approach is that you need to duplicate user and group settings across multiple instances in the cluster to keep a consistent configuration across all instances when you make updates such as adding, editing, and removing users.

To solve this, you can use [Lightweight Directory Access Protocol (LDAP)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) and [LDAP over TLS/SSL (LDAPS)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) to integrate with a directory service such as [Amazon Directory Service for Microsoft Active Directory](https://aws.amazon.com/directoryservice/). To learn more about setting up Active Directory and a multi-user environment in a HyperPod cluster, see the blog post [Integrate HyperPod clusters with Active Directory for seamless multi-user login](https://amazonaws-china.com/blogs/machine-learning/integrate-hyperpod-clusters-with-active-directory-for-seamless-multi-user-login/).

# Scheduling a Slurm job on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job"></a>

You can launch training jobs using the standard Slurm `sbatch` or `srun` commands. For example, to launch an 8-node training job, you can run `srun -N 8 --exclusive train.sh`. SageMaker HyperPod supports training in a range of environments, including `conda`, `venv`, `docker`, and `enroot`. You can configure an ML environment by running lifecycle scripts on your SageMaker HyperPod clusters. You also have the option to attach a shared file system, such as Amazon FSx, which can also be used as a virtual environment.
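
For non-interactive submission, the same 8-node job can be wrapped in an `sbatch` script. The following is a minimal sketch, not a definitive template; the job name and log path pattern are illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --job-name=train-job     # illustrative job name
#SBATCH --nodes=8                # same node count as the srun example
#SBATCH --exclusive              # reserve whole nodes for the job
#SBATCH --output=%x_%j.out       # log file named after job name and job ID

# Launch the entry point script across the allocated nodes.
srun train.sh
```

Save this as, for example, `submit.sh` and submit it with `sbatch submit.sh`.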

The following example shows how to run a job for training Llama-2 with the Fully Sharded Data Parallelism (FSDP) technique on a SageMaker HyperPod cluster with an Amazon FSx shared file system. You can also find more examples from the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

**Tip**  
All SageMaker HyperPod examples are available in the `3.test_cases` folder of the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

1. Clone the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/), and copy the training job examples to your Amazon FSx file system. 

   ```
   $ TRAINING_DIR=/fsx/users/my-user/fsdp
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   ```

1. Run the [0.create_conda_env.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/0.create_conda_env.sh) script. This creates a `conda` environment on your Amazon FSx file system. Make sure that the file system is accessible to all nodes in the cluster.

1. Build the virtual Conda environment by launching a single-node Slurm job as follows.

   ```
   $ srun -N 1 /path_to/create_conda_env.sh
   ```

1. After the environment is built, you can launch a training job by pointing to the environment path on the shared volume. You can launch both single-node and multi-node training jobs with the same setup. To launch a job, create a job launcher script (also called an entry point script) as follows.

   ```
   #!/usr/bin/env bash
   set -ex
   
   ENV_PATH=/fsx/users/my_user/pytorch_env
   TORCHRUN=$ENV_PATH/bin/torchrun
   TRAINING_SCRIPT=/fsx/users/my_user/pt_train.py
   
   WORLD_SIZE_JOB=$SLURM_NTASKS
   RANK_NODE=$SLURM_NODEID
   PROC_PER_NODE=8
   MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
   MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
   
   DIST_ARGS="--nproc_per_node=$PROC_PER_NODE \
              --nnodes=$WORLD_SIZE_JOB \
              --node_rank=$RANK_NODE \
              --master_addr=$MASTER_ADDR \
              --master_port=$MASTER_PORT \
             "
             
   $TORCHRUN $DIST_ARGS $TRAINING_SCRIPT
   ```
**Tip**  
If you want to make your training job more resilient against hardware failures by using the auto-resume capability of SageMaker HyperPod, you need to properly set up the environment variable `MASTER_ADDR` in the entrypoint script. To learn more, see [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md).

   This tutorial assumes that this script is saved as `/fsx/users/my_user/train.sh`.

1. With this script in the shared volume at `/fsx/users/my_user/train.sh`, run the following `srun` command to schedule the Slurm job.

   ```
   $ cd /fsx/users/my_user/
   $ srun -N 8 train.sh
   ```
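
The launcher script shown earlier derives the rendezvous port deterministically from the Slurm job ID, so every node of the same job computes the same port. The following is a worked sketch of that derivation, assuming a sample job ID.

```shell
# Sample job ID; in a real job, Slurm sets SLURM_JOBID automatically.
SLURM_JOBID=123456789

# Take the last four characters of the job ID and offset by 10000 so the
# port stays in an unprivileged range and is stable across all nodes.
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo $MASTER_PORT
# → 16789
```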

# Running Docker containers on a Slurm compute node on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-docker"></a>

To run Docker containers with Slurm on SageMaker HyperPod, you need to use [Enroot](https://github.com/NVIDIA/enroot) and [Pyxis](https://github.com/NVIDIA/pyxis). The Enroot package converts Docker images into a runtime that Slurm can understand, while Pyxis enables scheduling that runtime as a Slurm job through an `srun` command, such as `srun --container-image=docker/image:tag`. 

**Tip**  
The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts as guided in [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md). Use the [base lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) provided by the HyperPod service team when creating a HyperPod cluster. Those base scripts install the packages by default. In the [config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) script, the `Config` class has a Boolean parameter for installing the packages that is set to `True` (`enable_docker_enroot_pyxis=True`). This parameter is read by the [lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script, which calls the `install_docker.sh` and `install_enroot_pyxis.sh` scripts from the [utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The installation scripts are where the actual installation of the packages takes place.  
Additionally, the installation scripts detect whether NVMe store paths are available on the instances they run on and, if so, set the root paths for Docker and Enroot to `/opt/dlami/nvme`. The default root volume of a fresh instance is mounted to `/tmp` with only a 100 GB EBS volume, which runs out of space if your workload involves training LLMs and therefore large Docker containers. If you use instance families such as P and G with local NVMe storage, make sure that you use the NVMe storage attached at `/opt/dlami/nvme`; the installation scripts take care of this configuration.

**To check if the root paths are set up properly**

On a compute node of your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle script worked properly and that the root volume of each node is set to `/opt/dlami/nvme/*`. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.

```
$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH        /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times
```

```
$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times
```

After you confirm that the runtime paths are properly set to `/opt/dlami/nvme/*`, you're ready to build and run Docker containers with Enroot and Pyxis.

**To test Docker with Slurm**

1. On your compute node, try the following commands to check if Docker and Enroot are properly installed.

   ```
   $ docker --help
   $ enroot --help
   ```

1. Test if Pyxis and Enroot are installed correctly by running one of the [NVIDIA CUDA Ubuntu](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda) images.

   ```
   $ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

   You can also test it by creating a script and running an `sbatch` command as follows.

   ```
   $ cat <<EOF > container-test.sh
   #!/bin/bash
   #SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   nvidia-smi
   EOF
   
   $ sbatch container-test.sh
   Submitted batch job <JobID>
   
   $ cat slurm-<JobID>.out
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```
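   The two-line batch script above follows a fixed shape, so you can also generate it. A minimal, illustrative sketch (the function name is hypothetical):

```python
def container_batch_script(image: str, command: str = "nvidia-smi") -> str:
    """Build a minimal Slurm batch script that runs one command inside a Pyxis container."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --container-image={image}",  # Pyxis pulls and imports this image
        command,
        "",  # trailing newline
    ])
```

Write the returned string to a file and submit it with `sbatch`, as in the example above. The CUDA image tag used in any example is a placeholder; pick a current tag from the NVIDIA catalog.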

**To run a test Slurm job with Docker**

After you have completed setting up Slurm with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following sample use case walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker AI model parallelism (SMP) library.

1. If you want to use one of the pre-built ECR images distributed by SageMaker AI or DLC, make sure that you give your HyperPod cluster permission to pull ECR images through the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) by adding the following permissions to the role. If you use your own or an open source Docker image, you can skip this step. In this tutorial, we use the [SMP Docker image](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2), which is pre-packaged with the SMP library.

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ecr:BatchCheckLayerAvailability",
                   "ecr:BatchGetImage",
                   "ecr-public:*",
                   "ecr:GetDownloadUrlForLayer",
                   "ecr:GetAuthorizationToken",
                   "sts:*"
               ],
               "Resource": "*"
           }
       ]
   }
   ```
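To sanity-check that a role's policy document includes the ECR pull actions above before attaching it, a sketch like the following can help. It only inspects the JSON locally and is not an IAM API call:

```python
import json

# The ECR actions the tutorial's policy grants for pulling private images.
REQUIRED_ECR_ACTIONS = {
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
    "ecr:GetAuthorizationToken",
}

def missing_ecr_actions(policy_json: str) -> set:
    """Return the required ECR pull actions that the policy document does not allow."""
    policy = json.loads(policy_json)
    allowed = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") == "Allow":
            actions = statement.get("Action", [])
            # Action may be a single string or a list of strings.
            allowed.update([actions] if isinstance(actions, str) else actions)
    return REQUIRED_ECR_ACTIONS - allowed
```

An empty return value means every required action is explicitly allowed; note this simple check does not expand wildcards such as `ecr:*`.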


1. On the compute node, clone the repository and go to the folder that contains the example scripts for training with SMP.

   ```
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   $ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
   ```

1. In this tutorial, run the sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh), which pulls the SMP Docker image, builds a Docker image, and imports it as an Enroot squash (`.sqsh`) file. You can modify this script as needed.

   ```
   $ cat docker_build.sh
   #!/usr/bin/env bash
   
   region=us-west-2
   dlc_account_id=658645717510
   aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
   
   docker build -t smpv2 .
   enroot import -o smpv2.sqsh  dockerd://smpv2:latest
   ```

   ```
   $ bash docker_build.sh
   ```
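The `docker login` target in `docker_build.sh` is the private ECR registry hostname, which is derived from the account ID and Region by a fixed convention. A small illustrative sketch of that convention (the `cn` flag covers the `amazonaws.com.cn` endpoints used in this guide):

```python
def ecr_registry(account_id: str, region: str, cn: bool = False) -> str:
    """Build the private ECR registry hostname that docker login authenticates against."""
    suffix = "amazonaws.com.cn" if cn else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{suffix}"
```

With the values in the script, this yields the registry that `aws ecr get-login-password` is piped into.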

1. Create a batch script to launch a training job using `sbatch`. In this tutorial, the provided sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh) launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts is provided at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts), and `launch_training_enroot.sh` takes `train_external.py` as the entry point script.
**Important**  
To use a Docker container on SageMaker HyperPod, you must mount the `/var/log` directory from the host machine, which in this case is the HyperPod compute node, onto the `/var/log` directory in the container. You can set this up by adding the following variable for Enroot.  

   ```
   "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
   ```
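Enroot and Pyxis accept mounts as comma-separated `host:container` pairs, which is how `launch_training_enroot.sh` assembles its `--container-mounts` value from variables such as `HYPERPOD_PATH`. A hypothetical helper illustrating the format:

```python
def container_mounts(paths: list) -> str:
    """Join host:container mount pairs into a comma-separated --container-mounts value.

    Each entry is either a (host, container) tuple or a single path that is
    mounted at the same location inside the container, like /var/log/aws/clusters.
    """
    pairs = []
    for entry in paths:
        host, container = entry if isinstance(entry, tuple) else (entry, entry)
        pairs.append(f"{host}:{container}")
    return ",".join(pairs)
```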

   ```
   $ cat launch_training_enroot.sh
   #!/bin/bash
   
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: MIT-0
   
   #SBATCH --nodes=8 # number of nodes to use, 2 p4d(e) = 16 A100 GPUs
   #SBATCH --job-name=smpv2_llama # name of your job
   #SBATCH --exclusive # job has exclusive use of the resource, no sharing
   #SBATCH --wait-all-nodes=1
   
   set -ex;
   
   ###########################
   ###### User Variables #####
   ###########################
   
   #########################
   model_type=llama_v2
   model_size=70b
   
   # Toggle this to use synthetic data
   use_synthetic_data=1
   
   
   # To run training on your own data  set Training/Test Data path  -> Change this to the tokenized dataset path in Fsx. Acceptable formats are huggingface (arrow) and Jsonlines.
   # Also change the use_synthetic_data to 0
   
   export TRAINING_DIR=/fsx/path_to_data
   export TEST_DIR=/fsx/path_to_data
   export CHECKPOINT_DIR=$(pwd)/checkpoints
   
   # Variables for Enroot
   : "${IMAGE:=$(pwd)/smpv2.sqsh}"
   : "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This is needed for validating its hyperpod cluster
   : "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
   : "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
   : "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"   
   
   
   ###########################
   ## Environment Variables ##
   ###########################
   
   #export NCCL_SOCKET_IFNAME=en
   export NCCL_ASYNC_ERROR_HANDLING=1
   
   export NCCL_PROTO="simple"
   export NCCL_SOCKET_IFNAME="^lo,docker"
   export RDMAV_FORK_SAFE=1
   export FI_EFA_USE_DEVICE_RDMA=1
   export NCCL_DEBUG_SUBSYS=off
   export NCCL_DEBUG="INFO"
   export SM_NUM_GPUS=8
   export GPU_NUM_DEVICES=8
   export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
   
   # async runtime error ...
   export CUDA_DEVICE_MAX_CONNECTIONS=1
   
   
   #########################
   ## Command and Options ##
   #########################
   
   if [ "$model_size" == "7b" ]; then
       HIDDEN_WIDTH=4096
       NUM_LAYERS=32
       NUM_HEADS=32
       LLAMA_INTERMEDIATE_SIZE=11008
       DEFAULT_SHARD_DEGREE=8
   # More Llama model size options
   elif [ "$model_size" == "70b" ]; then
       HIDDEN_WIDTH=8192
       NUM_LAYERS=80
       NUM_HEADS=64
       LLAMA_INTERMEDIATE_SIZE=28672
       # Reduce for better perf on p4de
       DEFAULT_SHARD_DEGREE=64
   fi
   
   
   if [ -z "$shard_degree" ]; then
       SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
   else
       SHARD_DEGREE=$shard_degree
   fi
   
   if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
       LLAMA_ARGS=""
   else
       LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
   fi
   
   
   if [ $use_synthetic_data == 1 ]; then
       echo "using synthetic data"
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
       )
   else
       echo "using real data...."
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
       )
   fi
   
   
   declare -a TORCHRUN_ARGS=(
       # change this to match the number of gpus per node:
       --nproc_per_node=8 \
       --nnodes=$SLURM_JOB_NUM_NODES \
       --rdzv_id=$SLURM_JOB_ID \
       --rdzv_backend=c10d \
       --rdzv_endpoint=$(hostname) \
   )
   
   srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
               --train_batch_size 4 \
               --max_steps 100 \
               --hidden_width $HIDDEN_WIDTH \
               --num_layers $NUM_LAYERS \
               --num_heads $NUM_HEADS \
               ${LLAMA_ARGS} \
               --shard_degree $SHARD_DEGREE \
               --model_type $model_type \
               --profile_nsys 1 \
               --use_smp_implementation 1 \
               --max_context_width 4096 \
               --tensor_parallel_degree 1 \
               --use_synthetic_data $use_synthetic_data \
               --training_dir $TRAINING_DIR \
               --test_dir $TEST_DIR \
               --dataset_type hf \
               --checkpoint_dir $CHECKPOINT_DIR \
                --checkpoint_freq 100
   
   $ sbatch launch_training_enroot.sh
   ```
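The model-size branch in the script above boils down to a lookup table plus a shard-degree override. A Python rendering of that logic, with values copied from the script (function and table names are illustrative):

```python
# Per-model-size hyperparameters from launch_training_enroot.sh.
MODEL_CONFIGS = {
    "7b":  {"hidden_width": 4096, "num_layers": 32, "num_heads": 32,
            "llama_intermediate_size": 11008, "default_shard_degree": 8},
    "70b": {"hidden_width": 8192, "num_layers": 80, "num_heads": 64,
            "llama_intermediate_size": 28672, "default_shard_degree": 64},
}

def resolve_shard_degree(model_size, shard_degree=None):
    """Use the caller's shard_degree if provided, else the model-size default."""
    return shard_degree if shard_degree is not None else MODEL_CONFIGS[model_size]["default_shard_degree"]
```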

To find the downloadable code examples, see [Run a model-parallel training job using the SageMaker AI model parallelism library, Docker and Enroot with Slurm](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2#option-2----run-training-using-docker-and-enroot) in the *Awsome Distributed Training GitHub repository*. For more information about distributed training with a Slurm cluster on SageMaker HyperPod, proceed to the next topic at [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).

# Running distributed training workloads with Slurm on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload"></a>

SageMaker HyperPod is specialized for workloads of training large language models (LLMs) and foundation models (FMs). These workloads often require the use of multiple parallelism techniques and optimized operations for ML infrastructure and resources. Using SageMaker HyperPod, you can use the following SageMaker AI distributed training frameworks:
+ The [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md) that offers collective communication operations optimized for Amazon.
+ The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) that implements various model parallelism techniques.

**Topics**
+ [Using SMDDP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp)
+ [Using SMP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp)

## Using SMDDP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp"></a>

The [SMDDP library](data-parallel.md) is a collective communication library that improves compute performance of distributed data parallel training. The SMDDP library works with the following open source distributed training frameworks:
+ [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html)
+ [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html)
+ [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)

The SMDDP library addresses the communication overhead of the key collective communication operations by offering the following for SageMaker HyperPod.
+ The library offers `AllGather` optimized for Amazon. `AllGather` is a key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries. These include the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).
+ The library performs optimized node-to-node communication by fully utilizing the Amazon network infrastructure and the SageMaker AI ML instance topology. 

**To run sample data-parallel training jobs**

Explore the following distributed training samples implementing data parallelism techniques using the SMDDP library.
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP)
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed)

**To set up an environment for using the SMDDP library on SageMaker HyperPod**

The following are training environment requirements for using the SMDDP library on SageMaker HyperPod.
+ PyTorch v2.0.1 and later
+ CUDA v11.8 and later
+ `libstdc++` runtime version greater than 3
+ Python v3.10.x and later
+ `ml.p4d.24xlarge` and `ml.p4de.24xlarge`, which are the instance types supported by the SMDDP library
+ `imdsv2` enabled on the training host
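A quick preflight sketch that encodes the requirements listed above. Version strings and instance names are compared literally; this is illustrative, not an official validator:

```python
def meets_smddp_requirements(pytorch: str, cuda: str, python: str, instance: str) -> bool:
    """Check an environment against the SMDDP minimums listed above."""
    def ver(s):
        # "2.0.1" -> (2, 0, 1) for tuple comparison
        return tuple(int(part) for part in s.split("."))
    return (ver(pytorch) >= (2, 0, 1)
            and ver(cuda) >= (11, 8)
            and ver(python) >= (3, 10)
            and instance in {"ml.p4d.24xlarge", "ml.p4de.24xlarge"})
```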

Depending on how you want to run the distributed training job, there are two options to install the SMDDP library:
+ A direct installation using the SMDDP binary file.
+ Using the SageMaker AI Deep Learning Containers (DLCs) pre-installed with the SMDDP library.

Docker images pre-installed with the SMDDP library or the URLs to the SMDDP binary files are listed at [Supported Frameworks](https://docs.amazonaws.cn/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation.

**To install the SMDDP library on the SageMaker HyperPod DLAMI**
+ `pip install --no-cache-dir https://smdataparallel.s3.amazonaws.com/binary/pytorch/<pytorch-version>/cuXYZ/YYYY-MM-DD/smdistributed_dataparallel-X.Y.Z-cp310-cp310-linux_x86_64.whl`
**Note**  
If you work in a Conda environment, ensure that you install PyTorch using `conda install` instead of `pip`.  

  ```
  conda install pytorch==X.Y.Z  torchvision==X.Y.Z torchaudio==X.Y.Z pytorch-cuda=X.Y.Z -c pytorch -c nvidia
  ```
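The binary URL in the `pip install` command above follows a fixed pattern with placeholder fields. The sketch below fills them in; the argument values used in any example are illustrative only, and real values must come from the Supported Frameworks page:

```python
def smddp_wheel_url(pytorch_version: str, cuda_tag: str, build_date: str,
                    smddp_version: str) -> str:
    """Fill in the placeholder fields of the SMDDP wheel URL pattern shown above."""
    return ("https://smdataparallel.s3.amazonaws.com/binary/pytorch/"
            f"{pytorch_version}/{cuda_tag}/{build_date}/"
            f"smdistributed_dataparallel-{smddp_version}-cp310-cp310-linux_x86_64.whl")
```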

**To use the SMDDP library on a Docker container**
+ The SMDDP library is pre-installed on the SageMaker AI Deep Learning Containers (DLCs). To find the list of SageMaker AI framework DLCs for PyTorch with the SMDDP library, see [Supported Frameworks](https://docs.amazonaws.cn/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation. You can also bring your own Docker container with required dependencies installed to use the SMDDP library. To learn more about setting up a custom Docker container to use the SMDDP library, see also [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).
**Important**  
To use the SMDDP library in a Docker container, mount the `/var/log` directory from the host machine onto `/var/log` in the container. This can be done by adding the following option when running your container.  

  ```
  docker run <OTHER_OPTIONS> -v /var/log:/var/log ...
  ```

To learn how to run data-parallel training jobs with SMDDP in general, see [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md).

## Using SMP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp"></a>

The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) offers various [state-of-the-art model parallelism techniques](model-parallel-core-features-v2.md), including:
+ fully sharded data parallelism
+ expert parallelism
+ mixed precision training with FP16/BF16 and FP8 data types
+ tensor parallelism

The SMP library is also compatible with open source frameworks such as PyTorch FSDP, NVIDIA Megatron, and NVIDIA Transformer Engine.

**To run a sample model-parallel training workload**

The SageMaker AI service teams provide sample training jobs implementing model parallelism with the SMP library at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2).

# SageMaker HyperPod cluster resources monitoring
<a name="sagemaker-hyperpod-cluster-observability-slurm"></a>

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, integrate the cluster with [Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.amazonaws.cn/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By using these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

![\[An overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.\]](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/hyperpod-observability-architecture.png)


Figure: This architecture diagram shows an overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Proceed to the following topics to set up for SageMaker HyperPod cluster observability.

**Topics**
+ [Prerequisites for SageMaker HyperPod cluster observability](sagemaker-hyperpod-cluster-observability-slurm-prerequisites.md)
+ [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md)
+ [Validating Prometheus setup on the head node of a HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup.md)
+ [Setting up an Amazon Managed Grafana workspace](sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws.md)
+ [Exported metrics reference](sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference.md)
+ [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md)

# Prerequisites for SageMaker HyperPod cluster observability
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites"></a>

Before proceeding with the steps to [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md), ensure that the following prerequisites are met.

## Enable IAM Identity Center
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites-iam-id-center"></a>

To enable observability for your SageMaker HyperPod cluster, you must first enable IAM Identity Center. This is a prerequisite for deploying the Amazon CloudFormation stack that sets up the Amazon Managed Grafana workspace and Amazon Managed Service for Prometheus. Both services require IAM Identity Center for authentication and authorization, ensuring secure user access and management of the monitoring infrastructure.

For detailed guidance on enabling IAM Identity Center, see the [Enabling IAM Identity Center](https://docs.amazonaws.cn/singlesignon/latest/userguide/get-set-up-for-idc.html) section in the *Amazon IAM Identity Center User Guide*. 

After successfully enabling IAM Identity Center, set up a user account that will serve as the administrative user throughout the following configuration procedures.

## Create and deploy an Amazon CloudFormation stack for SageMaker HyperPod observability
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites-cloudformation-stack"></a>

Create and deploy a CloudFormation stack for SageMaker HyperPod observability to monitor HyperPod cluster metrics in real time using Amazon Managed Service for Prometheus and Amazon Managed Grafana. Before you deploy the stack, you must enable [IAM Identity Center](https://console.amazonaws.cn/singlesignon).

Use the sample CloudFormation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml) that helps you set up Amazon VPC subnets, Amazon FSx for Lustre file systems, Amazon S3 buckets, and IAM roles required to create a HyperPod cluster observability stack.

# Installing metrics exporter packages on your HyperPod cluster
<a name="sagemaker-hyperpod-cluster-observability-slurm-install-exporters"></a>

The [base configuration lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that the SageMaker HyperPod team provides also include the installation of various metrics exporter packages. To activate the installation step, set the parameter `enable_observability=True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file. The lifecycle scripts are designed to bootstrap your cluster with the following open-source metrics exporter packages.


| Name | Script deployment target node | Exporter description |
| --- | --- | --- |
| [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter) | Head (controller) node | Exports Slurm accounting metrics. |
| [Elastic Fabric Adapter (EFA) node exporter](https://github.com/aws-samples/awsome-distributed-training/tree/main/4.validation_and_observability/3.efa-node-exporter) | Compute node | Exports metrics from cluster nodes and EFA. The package is a fork of the [Prometheus node exporter](https://github.com/prometheus/node_exporter). |
| [NVIDIA Data Center GPU Management (DCGM) exporter](https://github.com/NVIDIA/dcgm-exporter) | Compute node | Exports NVIDIA DCGM metrics about the health and performance of NVIDIA GPUs. |

With `enable_observability=True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file, the following installation step is activated in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script. 

```
# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()
```

On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is an exporter for Prometheus that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter, on the other hand, gathers metrics related to the EFA network interface, which is essential for low-latency and high-bandwidth communication in HPC clusters.

On the head node, the script installs the Slurm exporter for Prometheus and the [Prometheus open-source software](https://prometheus.io/docs/introduction/overview/). The Slurm exporter provides Prometheus with metrics related to Slurm jobs, partitions, and node states.

Note that the lifecycle scripts are designed to install all the exporter packages as Docker containers, so Docker must also be installed on both the head and compute nodes. The scripts for these components are provided in the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder of the *Awsome Distributed Training GitHub repository*.

After you have successfully set up your HyperPod cluster installed with the exporter packages, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.

# Validating Prometheus setup on the head node of a HyperPod cluster
<a name="sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup"></a>

After you have successfully set up your HyperPod cluster installed with the exporter packages, check if Prometheus is properly set up on the head node of your HyperPod cluster.

1. Connect to the head node of your cluster. For instructions on accessing a node, see [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md).

1. Run the following command to verify that the Prometheus service created by the lifecycle script `install_prometheus.sh` is running on the controller node. The output should show the **Active** status as **active (running)**.

   ```
   $ sudo systemctl status prometheus
   ● prometheus.service - Prometheus Exporter
        Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset: disabled)
        Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; Ss ago
      Main PID: 12345 (prometheus)
         Tasks: 7 (limit: 9281)
        Memory: 35M
           CPU: 234ms
        CGroup: /system.slice/prometheus.service
                └─12345 /usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml
   ```

1. Validate the Prometheus configuration file as follows. The output should be similar to the following, with the three exporters configured with the correct compute node IP addresses.

   ```
   $ cat /etc/prometheus/prometheus.yml
   global:
     scrape_interval: 15s
     evaluation_interval: 15s
     scrape_timeout: 15s
   
   scrape_configs:
     - job_name: 'slurm_exporter'
       static_configs:
         - targets:
             - 'localhost:8080'
     - job_name: 'dcgm_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9400'
             - '<ComputeNodeIP>:9400'
     - job_name: 'efa_node_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9100'
             - '<ComputeNodeIP>:9100'
   
   remote_write:
     - url: <AMPRemoteWriteURL>
       queue_config:
         max_samples_per_send: 1000
         max_shards: 200
         capacity: 2500
       sigv4:
         region: <Region>
   ```
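The lifecycle script renders this file from the cluster's node list. A sketch of how the three scrape jobs map to exporter ports, as shown above: 8080 for the Slurm exporter on the head node, 9400 for DCGM, and 9100 for the EFA node exporter on each compute node (function name illustrative):

```python
def scrape_configs(compute_node_ips: list) -> list:
    """Build the three Prometheus scrape jobs shown above from compute node IPs."""
    return [
        {"job_name": "slurm_exporter",
         "static_configs": [{"targets": ["localhost:8080"]}]},
        {"job_name": "dcgm_exporter",
         "static_configs": [{"targets": [f"{ip}:9400" for ip in compute_node_ips]}]},
        {"job_name": "efa_node_exporter",
         "static_configs": [{"targets": [f"{ip}:9100" for ip in compute_node_ips]}]},
    ]
```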

1. To test if Prometheus is exporting Slurm, DCGM, and EFA metrics properly, run the following `curl` command for Prometheus on port `:9090` on the head node.

   ```
   $ curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
   ```

   With the metrics exported to Amazon Managed Service for Prometheus Workspace through the Prometheus remote write configuration from the controller node, you can proceed to the next topic to set up Amazon Managed Grafana dashboards to display the metrics.

# Setting up an Amazon Managed Grafana workspace
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws"></a>

Create a new Amazon Managed Grafana workspace or update an existing Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source.

**Topics**
+ [Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create)
+ [Open the Grafana workspace and finish setting up the data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source)
+ [Import open-source Grafana dashboards](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards)

## Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create"></a>

To visualize metrics from Amazon Managed Service for Prometheus, create an Amazon Managed Grafana workspace and set it up to use Amazon Managed Service for Prometheus as a data source.

1. To create a Grafana workspace, follow the instructions at [Creating a workspace](https://docs.amazonaws.cn/grafana/latest/userguide/AMG-create-workspace.html#creating-workspace) in the *Amazon Managed Grafana User Guide*.

   1. In Step 13, select Amazon Managed Service for Prometheus as the data source.

   1. In Step 17, you can add the admin user and other users from your IAM Identity Center.

For more information, see also the following resources.
+ [Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus](https://docs.amazonaws.cn/prometheus/latest/userguide/AMP-amg.html) in the *Amazon Managed Service for Prometheus User Guide*
+ [Use Amazon data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.amazonaws.cn/grafana/latest/userguide/AMP-adding-AWS-config.html) in the *Amazon Managed Grafana User Guide*

## Open the Grafana workspace and finish setting up the data source
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source"></a>

After you have successfully created or updated an Amazon Managed Grafana workspace, select the workspace URL to open the workspace. This prompts you to enter the user name and password of a user that you set up in IAM Identity Center. Log in as the admin user to finish setting up the workspace.

1. In the workspace **Home** page, choose **Apps**, **Amazon Data Sources**, and **Data sources**.

1. On the **Data sources** page, choose the **Data sources** tab.

1. For **Service**, choose Amazon Managed Service for Prometheus.

1. In the **Browse and provision data sources** section, choose the Amazon Web Services Region where you provisioned an Amazon Managed Service for Prometheus workspace.

1. From the list of data sources in the selected Region, choose the one for Amazon Managed Service for Prometheus. Make sure that you check the resource ID and the resource alias of the Amazon Managed Service for Prometheus workspace that you set up for the HyperPod observability stack.

## Import open-source Grafana dashboards
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards"></a>

After you've set up your Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source, metrics start flowing into Prometheus, and you can visualize them in dashboards. The Grafana open source software provides various prebuilt dashboards, and you can import them into Amazon Managed Grafana.

**To import open-source Grafana dashboards to Amazon Managed Grafana**

1. In the **Home** page of your Amazon Managed Grafana workspace, choose **Dashboards**.

1. Choose the **New** drop-down menu, and select **Import**.

1. Paste the URL to the [Slurm Dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/).

   ```
   https://grafana.com/grafana/dashboards/4323-slurm-dashboard/
   ```

1. Select **Load**.

1. Repeat the previous steps to import the following dashboards.

   1. [Node Exporter Full Dashboard](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)

      ```
      https://grafana.com/grafana/dashboards/1860-node-exporter-full/
      ```

   1. [NVIDIA DCGM Exporter Dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/)

      ```
      https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
      ```

   1. [EFA Metrics Dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/)

      ```
      https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/
      ```

   1. [FSx for Lustre Metrics Dashboard](https://grafana.com/grafana/dashboards/20906-fsx-lustre/)

      ```
      https://grafana.com/grafana/dashboards/20906-fsx-lustre/
      ```

# Exported metrics reference
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference"></a>

The following sections present comprehensive lists of metrics exported from SageMaker HyperPod to Amazon Managed Service for Prometheus upon the successful configuration of the Amazon CloudFormation stack for SageMaker HyperPod observability. You can start monitoring these metrics visualized in the Amazon Managed Grafana dashboards.

## Slurm exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-slurm-exporter"></a>

Provides visualized information of Slurm clusters on SageMaker HyperPod.

**Types of metrics**
+ **Cluster Overview:** Displaying the total number of nodes, jobs, and their states.
+ **Job Metrics:** Visualizing job counts and states over time.
+ **Node Metrics:** Showing node states, allocation, and available resources.
+ **Partition Metrics:** Monitoring partition-specific metrics such as CPU, memory, and GPU utilization.
+ **Job Efficiency:** Calculating job efficiency based on resources utilized.

**List of metrics**


| Metric name | Description | 
| --- | --- | 
| slurm\_job\_count | Total number of jobs in the Slurm cluster | 
| slurm\_job\_state\_count | Count of jobs in each state (e.g., running, pending, completed) | 
| slurm\_node\_count | Total number of nodes in the Slurm cluster | 
| slurm\_node\_state\_count | Count of nodes in each state (e.g., idle, alloc, mix) | 
| slurm\_partition\_node\_count | Count of nodes in each partition | 
| slurm\_partition\_job\_count | Count of jobs in each partition | 
| slurm\_partition\_alloc\_cpus | Total number of allocated CPUs in each partition | 
| slurm\_partition\_free\_cpus | Total number of available CPUs in each partition | 
| slurm\_partition\_alloc\_memory | Total allocated memory in each partition | 
| slurm\_partition\_free\_memory | Total available memory in each partition | 
| slurm\_partition\_alloc\_gpus | Total allocated GPUs in each partition | 
| slurm\_partition\_free\_gpus | Total available GPUs in each partition | 
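If you build custom Grafana panels on top of these metrics, queries along the following lines can serve as a starting point. This is an illustrative sketch only: the exact metric and label names depend on the exporter version, so verify them against your Amazon Managed Service for Prometheus workspace before use.

```
# Fraction of allocated CPUs per partition
slurm_partition_alloc_cpus
  / (slurm_partition_alloc_cpus + slurm_partition_free_cpus)

# Node counts per state over time
slurm_node_state_count
```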

## Node exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-node-exporter"></a>

Provides visualized information of system metrics collected by the [Prometheus node exporter](https://github.com/prometheus/node_exporter) from the HyperPod cluster nodes.

**Types of metrics**
+ **System overview:** Displaying CPU load averages and memory usage.
+ **Memory metrics:** Visualizing memory utilization including total memory, free memory, and swap space.
+ **Disk usage:** Monitoring disk space utilization and availability.
+ **Network traffic:** Showing network bytes received and transmitted over time.
+ **File system metrics:** Analyzing file system usage and availability.
+ **Disk I/O metrics:** Visualizing disk read and write activity.

**List of metrics**

For a complete list of exported metrics, see the [Node exporter](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default) and [procfs](https://github.com/prometheus/procfs?tab=readme-ov-file) GitHub repositories. The following table shows a subset of the metrics that provide insights into system resource utilization such as CPU load, memory usage, disk space, and network activity.


| Metric name | Description | 
| --- | --- | 
| node\_load1 | 1-minute load average | 
| node\_load5 | 5-minute load average | 
| node\_load15 | 15-minute load average | 
| node\_memory\_MemTotal | Total system memory | 
| node\_memory\_MemFree | Free system memory | 
| node\_memory\_MemAvailable | Available memory for allocation to processes | 
| node\_memory\_Buffers | Memory used by the kernel for buffering | 
| node\_memory\_Cached | Memory used by the kernel for caching file system data | 
| node\_memory\_SwapTotal | Total swap space available | 
| node\_memory\_SwapFree | Free swap space | 
| node\_memory\_SwapCached | Memory that was once swapped out and swapped back in, but still resides in swap | 
| node\_filesystem\_avail\_bytes | Available disk space in bytes | 
| node\_filesystem\_size\_bytes | Total disk space in bytes | 
| node\_filesystem\_free\_bytes | Free disk space in bytes | 
| node\_network\_receive\_bytes | Network bytes received | 
| node\_network\_transmit\_bytes | Network bytes transmitted | 
| node\_disk\_read\_bytes | Disk bytes read | 
| node\_disk\_written\_bytes | Disk bytes written | 
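Dashboards typically derive a utilization percentage from these raw gauges. The following is a minimal sketch of the common formula based on `node_memory_MemTotal` and `node_memory_MemAvailable`; the byte values are illustrative, not from a real node.

```python
# Illustrative values for the two node exporter memory gauges.
mem_total_bytes = 16 * 1024**3       # node_memory_MemTotal
mem_available_bytes = 4 * 1024**3    # node_memory_MemAvailable

# Utilization counts memory that is not available for new allocations.
mem_utilization_pct = 100 * (1 - mem_available_bytes / mem_total_bytes)
print(round(mem_utilization_pct, 1))  # 75.0
```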

## NVIDIA DCGM exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-nvidia-dcgm-exporter"></a>

Provides visualized information of NVIDIA GPU metrics collected by the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter).

**Types of metrics**
+ **GPU Overview:** Displaying GPU utilization, temperatures, power usage, and memory usage. 
+ **Temperature Metrics:** Visualizing GPU temperatures over time. 
+ **Power Usage:** Monitoring GPU power draw and power usage trends. 
+ **Memory Utilization:** Analyzing GPU memory usage including used, free, and total memory. 
+ **Fan Speed:** Showing GPU fan speeds and variations. 
+ **ECC Errors:** Tracking GPU memory ECC errors and pending errors.

**List of metrics**

The following table shows a list of the metrics that provide insights into NVIDIA GPU health and performance, including clock frequencies, temperatures, power usage, memory utilization, fan speeds, and error metrics.


| Metric name | Description | 
| --- | --- | 
| DCGM\_FI\_DEV\_SM\_CLOCK | SM clock frequency (in MHz) | 
| DCGM\_FI\_DEV\_MEM\_CLOCK | Memory clock frequency (in MHz) | 
| DCGM\_FI\_DEV\_MEMORY\_TEMP | Memory temperature (in °C) | 
| DCGM\_FI\_DEV\_GPU\_TEMP | GPU temperature (in °C) | 
| DCGM\_FI\_DEV\_POWER\_USAGE | Power draw (in W) | 
| DCGM\_FI\_DEV\_TOTAL\_ENERGY\_CONSUMPTION | Total energy consumption since boot (in mJ) | 
| DCGM\_FI\_DEV\_PCIE\_REPLAY\_COUNTER | Total number of PCIe retries | 
| DCGM\_FI\_DEV\_MEM\_COPY\_UTIL | Memory utilization (in %) | 
| DCGM\_FI\_DEV\_ENC\_UTIL | Encoder utilization (in %) | 
| DCGM\_FI\_DEV\_DEC\_UTIL | Decoder utilization (in %) | 
| DCGM\_FI\_DEV\_XID\_ERRORS | Value of the last XID error encountered | 
| DCGM\_FI\_DEV\_FB\_FREE | Frame buffer memory free (in MiB) | 
| DCGM\_FI\_DEV\_FB\_USED | Frame buffer memory used (in MiB) | 
| DCGM\_FI\_DEV\_NVLINK\_BANDWIDTH\_TOTAL | Total number of NVLink bandwidth counters for all lanes | 
| DCGM\_FI\_DEV\_VGPU\_LICENSE\_STATUS | vGPU license status | 
| DCGM\_FI\_DEV\_UNCORRECTABLE\_REMAPPED\_ROWS | Number of remapped rows for uncorrectable errors | 
| DCGM\_FI\_DEV\_CORRECTABLE\_REMAPPED\_ROWS | Number of remapped rows for correctable errors | 
| DCGM\_FI\_DEV\_ROW\_REMAP\_FAILURE | Whether remapping of rows has failed | 
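Because `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` is a counter in millijoules, average power over a scrape interval can be derived from two consecutive samples. The following sketch uses illustrative counter values, not measurements from a real GPU.

```python
# Two illustrative samples of the energy counter, in millijoules.
energy_t0_mj = 5_000_000      # counter at time t0
energy_t1_mj = 5_600_000      # counter at time t1
interval_s = 15               # scrape interval in seconds

# Convert the millijoule delta to joules, then divide by elapsed seconds.
avg_power_w = (energy_t1_mj - energy_t0_mj) / 1000 / interval_s
print(avg_power_w)  # 40.0
```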

## EFA metrics dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-efa-exporter"></a>

Provides visualized information of the metrics from [Amazon Elastic Fabric Adapter (EFA)](https://docs.amazonaws.cn/AWSEC2/latest/UserGuide/efa.html) equipped on P instances collected by the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md).

**Types of metrics**
+ **EFA error metrics:** Visualizing errors such as allocation errors, command errors, and memory map errors.
+ **EFA network traffic:** Monitoring received and transmitted bytes, packets, and work requests.
+ **EFA RDMA performance:** Analyzing RDMA read and write operations, including bytes transferred and error rates.
+ **EFA port lifespan:** Displaying the lifespan of EFA ports over time.
+ **EFA keep-alive packets:** Tracking the number of keep-alive packets received.

**List of metrics**

The following table shows a list of the metrics that provide insights into various aspects of EFA operation, including errors, completed commands, network traffic, and resource utilization.


| Metric name | Description | 
| --- | --- | 
| node\_amazonefa\_info | Non-numeric data from /sys/class/infiniband/; value is always 1 | 
| node\_amazonefa\_lifespan | Lifespan of the port | 
| node\_amazonefa\_rdma\_read\_bytes | Number of bytes read with RDMA | 
| node\_amazonefa\_rdma\_read\_resp\_bytes | Number of RDMA read response bytes | 
| node\_amazonefa\_rdma\_read\_wr\_err | Number of RDMA read work request errors | 
| node\_amazonefa\_rdma\_read\_wrs | Number of RDMA read work requests | 
| node\_amazonefa\_rdma\_write\_bytes | Number of bytes written with RDMA | 
| node\_amazonefa\_rdma\_write\_recv\_bytes | Number of RDMA write bytes received | 
| node\_amazonefa\_rdma\_write\_wr\_err | Number of RDMA write work request errors | 
| node\_amazonefa\_rdma\_write\_wrs | Number of RDMA write work requests | 
| node\_amazonefa\_recv\_bytes | Number of bytes received | 
| node\_amazonefa\_recv\_wrs | Number of receive work requests | 
| node\_amazonefa\_rx\_bytes | Number of bytes received | 
| node\_amazonefa\_rx\_drops | Number of packets dropped | 
| node\_amazonefa\_rx\_pkts | Number of packets received | 
| node\_amazonefa\_send\_bytes | Number of bytes sent | 
| node\_amazonefa\_send\_wrs | Number of send work requests | 
| node\_amazonefa\_tx\_bytes | Number of bytes transmitted | 
| node\_amazonefa\_tx\_pkts | Number of packets transmitted | 
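A ratio of the receive counters gives a quick packet-drop indicator for an EFA device. The following sketch uses made-up counter values for illustration.

```python
# Illustrative samples of the EFA receive counters.
rx_pkts = 1_000_000    # node_amazonefa_rx_pkts
rx_drops = 250         # node_amazonefa_rx_drops

# Fraction of received packets that were dropped.
drop_ratio = rx_drops / rx_pkts
print(f"{drop_ratio:.6f}")  # 0.000250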

## FSx for Lustre metrics dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-fsx-exporter"></a>

Provides visualized information of the [metrics from Amazon FSx for Lustre](https://docs.amazonaws.cn/fsx/latest/LustreGuide/monitoring-cloudwatch.html) file system collected by [Amazon CloudWatch](https://docs.amazonaws.cn/fsx/latest/LustreGuide/monitoring-cloudwatch.html).

**Note**  
The Grafana FSx for Lustre dashboard utilizes Amazon CloudWatch as its data source, which differs from the other dashboards that you have configured to use Amazon Managed Service for Prometheus. To ensure accurate monitoring and visualization of metrics related to your FSx for Lustre file system, configure the FSx for Lustre dashboard to use Amazon CloudWatch as the data source, specifying the same Amazon Web Services Region where your FSx for Lustre file system is deployed.

**Types of metrics**
+ **DataReadBytes:** The number of bytes for file system read operations.
+ **DataWriteBytes:** The number of bytes for file system write operations.
+ **DataReadOperations:** The number of read operations.
+ **DataWriteOperations:** The number of write operations.
+ **MetadataOperations:** The number of metadata operations.
+ **FreeDataStorageCapacity:** The amount of available storage capacity.

# Amazon SageMaker HyperPod Slurm metrics
<a name="smcluster-slurm-metrics"></a>

Amazon SageMaker HyperPod provides a set of Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on your HyperPod clusters and are available in the `/aws/sagemaker/Clusters` CloudWatch namespace.

## Cluster level metrics
<a name="smcluster-slurm-metrics-cluster"></a>

The following cluster-level metrics are available for HyperPod. These metrics use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| cluster\_node\_count | Total number of nodes in the cluster | cluster\_node\_count | 
| cluster\_idle\_node\_count | Number of idle nodes in the cluster | N/A | 
| cluster\_failed\_node\_count | Number of failed nodes in the cluster | cluster\_failed\_node\_count | 
| cluster\_cpu\_count | Total CPU cores in the cluster | node\_cpu\_limit | 
| cluster\_idle\_cpu\_count | Number of idle CPU cores in the cluster | N/A | 
| cluster\_gpu\_count | Total GPUs in the cluster | node\_gpu\_limit | 
| cluster\_idle\_gpu\_count | Number of idle GPUs in the cluster | N/A | 
| cluster\_running\_task\_count | Number of running Slurm jobs in the cluster | N/A | 
| cluster\_pending\_task\_count | Number of pending Slurm jobs in the cluster | N/A | 
| cluster\_preempted\_task\_count | Number of preempted Slurm jobs in the cluster | N/A | 
| cluster\_avg\_task\_wait\_time | Average wait time for Slurm jobs in the cluster | N/A | 
| cluster\_max\_task\_wait\_time | Maximum wait time for Slurm jobs in the cluster | N/A | 
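These metrics can also be queried directly from CloudWatch. The following sketch shows the shape of a `GetMetricData` query for `cluster_node_count`, using the `/aws/sagemaker/Clusters` namespace and `ClusterId` dimension described above. `<cluster-id>` is a placeholder; with boto3, this dict would be passed in `MetricDataQueries` to `cloudwatch.get_metric_data()`.

```python
# Shape of a single CloudWatch GetMetricData query for a HyperPod
# cluster-level metric. <cluster-id> is a placeholder for your cluster ID.
query = {
    "Id": "nodeCount",
    "MetricStat": {
        "Metric": {
            "Namespace": "/aws/sagemaker/Clusters",
            "MetricName": "cluster_node_count",
            "Dimensions": [{"Name": "ClusterId", "Value": "<cluster-id>"}],
        },
        "Period": 300,       # 5-minute aggregation
        "Stat": "Average",
    },
}
```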

## Instance level metrics
<a name="smcluster-slurm-metrics-instance"></a>

The following instance-level metrics are available for HyperPod. These metrics also use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| node\_gpu\_utilization | Average GPU utilization across all instances | node\_gpu\_utilization | 
| node\_gpu\_memory\_utilization | Average GPU memory utilization across all instances | node\_gpu\_memory\_utilization | 
| node\_cpu\_utilization | Average CPU utilization across all instances | node\_cpu\_utilization | 
| node\_memory\_utilization | Average memory utilization across all instances | node\_memory\_utilization | 

# SageMaker HyperPod cluster resiliency
<a name="sagemaker-hyperpod-resiliency-slurm"></a>

SageMaker HyperPod through Slurm orchestration provides the following cluster resiliency features.

**Topics**
+ [Health monitoring agent](sagemaker-hyperpod-resiliency-slurm-cluster-health-check.md)
+ [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md)
+ [Manually replace or reboot a node using Slurm](sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.md)

# Health monitoring agent
<a name="sagemaker-hyperpod-resiliency-slurm-cluster-health-check"></a>

This section describes the set of health checks that SageMaker HyperPod uses to regularly monitor cluster instance health for issues with devices such as accelerators (GPU and Trainium cores) and networking (EFA). The SageMaker HyperPod health monitoring agent (HMA) continuously monitors the health status of each GPU-based or Trainium-based instance. When it detects any instance or GPU failure, the agent marks the instance as unhealthy.

SageMaker HyperPod HMA performs the same health checks for both EKS and Slurm orchestrators. For more information about HMA, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).

# Automatic node recovery and auto-resume
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume"></a>

**Note**  
As of September 11, 2025, HyperPod with Slurm orchestration now supports health monitoring agents. Run [UpdateClusterSoftware](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) and update to the latest version of the AMI in order to use this functionality.

This section describes Amazon SageMaker HyperPod's two complementary resilience features: automatic node recovery, which replaces faulty infrastructure without manual intervention, and auto-resume, which restarts training jobs from the last checkpoint after hardware failures.

## How automatic node recovery works
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-how"></a>

During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend setting the `Automatic` option. By default, the clusters are set up with Automatic node recovery.

Automatic node recovery runs when issues are found by the health monitoring agent, basic health checks, or deep health checks. If set to `None`, the health monitoring agent labels the instances when a fault is detected, but it does not automatically initiate any repair or recovery actions on the affected nodes. We do not recommend this option.

## Running a training job with the Amazon SageMaker HyperPod auto-resume functionality
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-job"></a>

This section describes how to run a training job with the SageMaker HyperPod auto-resume functionality, which provides a zero-touch resiliency infrastructure to automatically recover a training job from the last saved checkpoint in the event of a hardware failure.

With the auto-resume functionality, if a job fails due to a hardware failure or any transient issue during training, SageMaker HyperPod auto-resume starts the node replacement workflow and restarts the job after the faulty nodes are replaced. The following hardware checks are run whenever a job fails while using auto-resume:


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NVIDIA SMI | GPU | [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) utility is a well-known CLI to manage and monitor GPUs. The built-in health checker parses the output from nvidia-smi to determine the health of the instance. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Network | EFA | GPU and Trainium | To aid in the diagnostic of Elastic Fabric Adaptor (EFA) devices, the EFA health checker runs a series of connectivity tests using all available EFA cards within the instance. | 

**Note**  
When [Generic Resources (GRES)](https://slurm.schedmd.com/gres.html) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically re-queues any faulty job associated with the GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

**Using the SageMaker HyperPod auto-resume functionality with Slurm**

When you use SageMaker HyperPod auto-resume with Slurm, you should run the job inside an exclusive allocation acquired by using either `salloc` or `sbatch`. In either case, you need to modify the entrypoint script so that all setup steps run in a single `srun` command when resuming the job. Through the entrypoint script, it is important to set up the environment on the replaced node to be consistent with the environment that the job step was running in before it was stopped. The following procedure shows how to prepare an entrypoint script that keeps the environment consistent and runs it as a single `srun` command.

**Tip**  
If you use `sbatch`, you can keep the batch script simple by creating a separate script for setting up the environment and using a single `srun` command.

1. Create a script using the following code example and save it as `train_auto_resume.sh`. This script sets up the training environment, assuming that no manual configuration was previously made on the replaced node. This ensures that the environment is node-agnostic, so that when a node is replaced, the same environment is provisioned on the node before resuming the job.
**Note**  
The following code example shows how to discover the Slurm node list associated with the job. Do not use the `$SLURM_JOB_NODELIST` environment variable provided by Slurm, because its value might be outdated after SageMaker HyperPod auto-resumes the job. The following code example shows how to define a new `NODE_LIST` variable to replace `SLURM_JOB_NODELIST`, and then set up the `MASTER_NODE` and `MASTER_ADDR` variables off of the `NODE_LIST` variable.

   ```
   #!/bin/bash
   
   # Filename: train_auto_resume.sh
   # Sample script to launch a training job with a single srun command
   # so that the job can be auto-resumed.
   
   # Place your training environment setup here. 
   # Example: Install conda, docker, activate virtual env, etc.
   
   # Get the list of nodes for the given job: show the job details,
   # extract the NodeList field, and exclude nodes marked as excluded.
   NODE_LIST=$(scontrol show jobid=$SLURM_JOBID | \
               awk -F= '/NodeList=/{print $2}' | \
               grep -v Exc)
   
   # Determine the master node: convert the node list to hostnames
   # and select the first hostname.
   MASTER_NODE=$(scontrol show hostname $NODE_LIST | head -n 1)
   
   # Get the master node address by extracting the NodeAddr field.
   MASTER_ADDR=$(scontrol show node=$MASTER_NODE | \
                 awk -F= '/NodeAddr=/{print $2}' | \
                 awk '{print $1}')
   
   # Torchrun command to launch the training job
   torchrun_cmd="torchrun --nnodes=$SLURM_NNODES \
                          --nproc_per_node=1 \
                          --node_rank=$SLURM_NODEID \
                          --master_addr=$MASTER_ADDR \
                          --master_port=1234 \
                          <your_training_script.py>"
   
   # Execute the torchrun command in the 'pytorch' Conda environment, 
   # streaming output live
   /opt/conda/bin/conda run --live-stream -n pytorch $torchrun_cmd
   ```
**Tip**  
You can use the preceding script to add more commands for installing any additional dependencies for your job. However, we recommend that you keep the dependency installation scripts to the [set of lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that are used during cluster creation. If you use a virtual environment hosted on a shared directory, you can also utilize this script to activate the virtual environment.

1. Launch the job with SageMaker HyperPod auto-resume enabled by adding the flag `--auto-resume=1` to indicate that the `srun` command should be automatically retried in case of hardware failure. 
**Note**  
If you have set up a resource allocation using `sbatch` or `salloc`, you can run multiple `srun` commands within the allocation. In the event of a failure, the SageMaker HyperPod auto-resume functionality only operates in the current [job step](https://slurm.schedmd.com/job_launch.html#step_allocation) of the `srun` command with the flag `--auto-resume=1`. In other words, activating auto-resume in an `srun` command doesn't apply to other `srun` commands launched within a resource allocation session.

   The following are `srun` command examples with `auto-resume` enabled.

   **Using sbatch**

   Because most of the logic for setting up the environment is already in `train_auto_resume.sh`, the batch script should be simple and similar to the following code example. Assume that the following batch script is saved as `batch.sh`.

   ```
   #!/bin/bash
   #SBATCH --nodes 2
   #SBATCH --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```

   Run the preceding batch script using the following command.

   ```
   sbatch batch.sh
   ```

   **Using salloc**

   Start by acquiring an exclusive allocation, and run the `srun` command with the `--auto-resume` flag and the entrypoint script.

   ```
   salloc -N 2 --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```

## How automatic node recovery and auto-resume work together
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-node-recovery"></a>

When both automatic node recovery and auto-resume are active, they follow a coordinated approach to handling failures. If the HMA detects a hardware fault, the node is marked for drain regardless of job-level status. With automatic node recovery enabled, the nodes are automatically replaced once all jobs running on them exit. In this scenario, for jobs with auto-resume enabled, if a job step exits with a non-zero status, auto-resume kicks in and the jobs resume once the nodes are replaced. Jobs without auto-resume enabled simply exit and require manual resubmission by administrators or users.

**Note**  
If you use auto-resume, the nodes are always replaced (no reboots) when hardware failures are detected.

# Manually replace or reboot a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance"></a>

This section describes when you should manually reboot or replace a node, with instructions for doing both.

## When to manually reboot or replace a node
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-when"></a>

The HyperPod auto-resume functionality monitors whether the state of your Slurm nodes changes to `fail` or `down`. You can check the state of Slurm nodes by running `sinfo`.

If a node remains stuck or unresponsive and the auto-resume process does not recover it, you can manually initiate recovery. The choice between rebooting and replacing a node depends on the nature of the issue. Consider rebooting when facing temporary or software-related problems, such as system hangs, memory leaks, GPU driver issues, kernel updates, or hung processes. However, if you encounter persistent or hardware-related problems like failing GPUs, memory or networking faults, repeated health check failures, or nodes that remain unresponsive after multiple reboot attempts, node replacement is the more appropriate solution.
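The guidance above can be summarized as a small decision helper. This is an illustrative sketch only; the symptom labels are hypothetical and not part of any HyperPod API or health check output.

```python
# Symptom categories taken from the guidance above; the string labels
# themselves are hypothetical and only serve this example.
TRANSIENT = {"system hang", "memory leak", "gpu driver issue",
             "kernel update", "hung process"}
PERSISTENT = {"failing gpu", "memory fault", "networking fault",
              "repeated health check failure",
              "unresponsive after reboots"}

def recommended_action(symptom):
    """Map a symptom category to the suggested manual recovery action."""
    if symptom in TRANSIENT:
        return "reboot"
    if symptom in PERSISTENT:
        return "replace"
    return "investigate"

print(recommended_action("hung process"))   # reboot
print(recommended_action("failing gpu"))    # replace
```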

## Ways to manually reboot or replace nodes
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-ways"></a>

SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is using the SageMaker HyperPod Reboot and Replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use traditional Slurm commands like `scontrol update`, though this legacy method requires direct access to the Slurm controller node. Both methods activate the same SageMaker HyperPod recovery processes.

## Manually reboot a node using reboot API
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-reboot-api"></a>

You can use the **BatchRebootClusterNodes** API to manually reboot faulty nodes in your SageMaker HyperPod cluster.

The following example runs the reboot operation on two instances of a cluster using the Amazon Command Line Interface:

```
 aws sagemaker batch-reboot-cluster-nodes \
                --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
                --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

## Manually replace a node using replace API
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace-api"></a>

You can use the **BatchReplaceClusterNodes** API to manually replace a faulty node in your SageMaker HyperPod cluster.

Here is an example of running the replace operation on two instances of a cluster using the Amazon Command Line Interface:

```
aws sagemaker batch-replace-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

## Manually reboot a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-reboot"></a>

You can also use `scontrol` Slurm commands to trigger node recovery. These commands interact directly with the Slurm control plane and invoke the same underlying SageMaker HyperPod recovery mechanisms.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance that you want to reboot.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Reboot"
```

This marks the node as FAIL with the specified reason. SageMaker HyperPod detects this and reboots the instance. Avoid changing the node state or restarting the Slurm controller during the operation.
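
To verify progress during the reboot, you can query the node from the controller node (a sketch; the node name shown is hypothetical, so replace it with your instance's host name):

```shell
# Check the node's state and the recorded reason while HyperPod reboots it.
scontrol show node ip-10-0-1-23 | grep -E "State|Reason"
```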

## Manually replace a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace"></a>

You can use the `scontrol update` command as follows to replace a node.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance you want to replace.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Replace"
```

After running this command, the node goes into the `fail` state, waits for the currently running jobs to finish, is replaced with a healthy instance, and is recovered with the same host name. This process takes time depending on instance availability in your Availability Zone and the time it takes to run your lifecycle scripts. During the update and replacement processes, avoid manually changing the state of the node again or restarting the Slurm controller; doing so can lead to a replacement failure. If the node doesn't recover or return to the `idle` state after a long time, contact [Amazon Support](https://console.aws.amazon.com/support/).

## Manually force change a node
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-force"></a>

If the faulty node is continuously stuck in the `fail` state, as a last resort you can manually force the node state to `down`. This requires administrator privileges (sudo permissions).

**Warning**  
Proceed carefully before you run the following command, as it forcefully kills all jobs, and you might lose all unsaved work.

```
scontrol update node=<ip-ipv4> state=down reason="Action:Replace"
```

# Continuous provisioning for enhanced cluster operations with Slurm
<a name="sagemaker-hyperpod-scaling-slurm"></a>

Amazon SageMaker HyperPod clusters created with Slurm orchestration now support continuous provisioning, a capability that enables greater flexibility and efficiency when running large-scale AI/ML workloads. Continuous provisioning lets you start training quickly, scale seamlessly, perform maintenance without disrupting operations, and have granular visibility into cluster operations.

**Note**  
Continuous provisioning is available as an optional configuration for new HyperPod clusters created with Slurm orchestration. Existing clusters using the previous scaling model cannot be migrated to continuous provisioning at this time.

## How it works
<a name="sagemaker-hyperpod-scaling-slurm-how"></a>

The continuous provisioning system introduces a desired-state architecture that replaces the traditional all-or-nothing scaling model. In the previous model, if any instance group could not be fully provisioned, the entire cluster creation or update operation failed and rolled back. With continuous provisioning, the system accepts partial capacity and continues to provision remaining instances asynchronously.

The continuous provisioning system:
+ **Accepts the request**: Records the target instance count for each instance group.
+ **Initiates provisioning**: Begins launching instances for all instance groups in parallel.
+ **Provisions priority nodes first**: The cluster transitions to `InService` after at least one controller node (and one login node, if a login instance group is specified) is successfully provisioned.
+ **Tracks progress**: Monitors each instance launch attempt and records the status.
+ **Handles failures**: Automatically retries failed launches for worker nodes asynchronously.

Continuous provisioning is disabled by default. To use this feature, set `NodeProvisioningMode` to `Continuous` in your `CreateCluster` request.

With continuous provisioning enabled, you can initiate multiple scaling operations simultaneously without waiting for previous operations to complete. This lets you scale different instance groups in the same cluster concurrently and submit multiple scaling requests to the same instance group.
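
For example, with continuous provisioning enabled you could submit two `update-cluster` requests back to back without waiting for the first to finish (the cluster name and payload file names here are hypothetical):

```shell
# Scale two different instance groups in the same cluster concurrently.
aws sagemaker update-cluster \
    --cluster-name my-training-cluster \
    --instance-groups file://scale-worker-gpu-a.json

aws sagemaker update-cluster \
    --cluster-name my-training-cluster \
    --instance-groups file://scale-worker-gpu-b.json
```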

## Priority-based provisioning
<a name="sagemaker-hyperpod-scaling-slurm-priority"></a>

Slurm clusters require a controller node to be operational before worker nodes can register and accept jobs. Continuous provisioning handles this automatically through priority-based provisioning:

1. The controller instance group is provisioned first.

1. Once one controller node is healthy, login nodes and worker nodes begin provisioning in parallel.

1. The cluster transitions to `InService` when one controller node is up and one login node is up (if a login instance group is specified). If no login instance group is specified, the cluster transitions to `InService` as soon as the controller node is provisioned.

1. Worker nodes that cannot be immediately provisioned due to capacity constraints enter an asynchronous retry loop and are added to the Slurm cluster automatically as they become available.
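
You can watch this progression from the CLI (the cluster name is illustrative); the cluster status and the node list update as instances come online:

```shell
# The cluster becomes "InService" once the controller (and login node, if
# specified) is up, even while worker nodes are still provisioning.
aws sagemaker describe-cluster \
    --cluster-name my-training-cluster \
    --query ClusterStatus

# Worker nodes appear in this list as they are provisioned and retried.
aws sagemaker list-cluster-nodes --cluster-name my-training-cluster
```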

## Controller failure handling
<a name="sagemaker-hyperpod-scaling-slurm-controller-failure"></a>

During cluster creation, if the controller node fails to provision, the behavior depends on whether the error is retryable or non-retryable.

**Retryable errors** (for example, unhealthy instance or transient failures):
+ HyperPod continuously replaces the instance and retries provisioning until the controller comes up.
+ Worker and login nodes that have already been provisioned remain available, but the cluster does not transition to `InService` until the controller is healthy.

**Non-retryable errors** (for example, no capacity available for the controller instance type or lifecycle script failure):
+ The cluster is marked as `Failed`.
+ You are notified of the failure reason and must take corrective action, such as choosing a different instance type, fixing lifecycle scripts, or retrying in a different Availability Zone.

## Prerequisites
<a name="sagemaker-hyperpod-scaling-slurm-prerequisites"></a>

Continuous provisioning requires that Slurm provisioning parameters (node types, partition names) are provided via the API payload in each instance group's `SlurmConfig` field. Clusters that rely on the legacy `provisioning_parameters.json` file in Amazon S3 are not compatible with continuous provisioning.

**Note**  
The following features are not currently supported with continuous provisioning on Slurm clusters: migration of existing clusters, multi-head node configuration via API-based Slurm topology, and `SlurmConfigStrategy`. Continuous provisioning operates exclusively in merge mode for `slurm.conf` management.

## Usage metering
<a name="sagemaker-hyperpod-scaling-slurm-metering"></a>

HyperPod clusters with continuous provisioning use instance-level metering to provide accurate billing that reflects actual resource usage. This metering approach differs from traditional cluster-level billing by tracking each instance independently.

**Instance-level billing**

With continuous provisioning, billing starts and stops at the individual instance level rather than waiting for cluster-level state changes. This approach provides the following benefits:
+ **Precise billing accuracy**: Billing starts when lifecycle script execution begins. If the lifecycle script fails, instance provisioning is retried, and you are charged for the duration of the lifecycle script runtime.
+ **Independent metering**: Each instance's billing lifecycle is managed separately, preventing cascading billing errors.
+ **Real-time billing updates**: Billing starts when an instance begins executing its lifecycle configuration script and stops when the instance enters a terminating state.

**Billing lifecycle**

Each instance in your HyperPod cluster follows this billing lifecycle:
+ **Billing starts**: When the instance successfully launches and begins executing its lifecycle configuration script.
+ **Billing continues**: Throughout the instance's operational lifetime.
+ **Billing stops**: When the instance enters a terminating state, regardless of the reason for termination.

**Note**  
Billing does not start for instances that fail to launch. If an instance launch fails due to insufficient capacity or other issues, you are not charged for that failed attempt. Billing is calculated at the instance level and costs are aggregated and reported under your cluster's Amazon Resource Name (ARN).

## Create a cluster with continuous provisioning enabled
<a name="sagemaker-hyperpod-scaling-slurm-create"></a>

**Note**  
Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket that your execution role can access. For more information, see [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md).

Prepare a `CreateCluster` API request file in JSON format. Set `NodeProvisioningMode` to `Continuous` and provide Slurm topology information in each instance group's `SlurmConfig` field.

```
// create_cluster.json
{
    "ClusterName": "my-training-cluster",
    "NodeProvisioningMode": "Continuous",
    "Orchestrator": {
        "Slurm": {}
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Controller"
            }
        },
        {
            "InstanceGroupName": "login-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Login"
            }
        },
        {
            "InstanceGroupName": "worker-gpu-a",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["gpu-training"]
            }
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-12345678"],
        "Subnets": ["subnet-12345678"]
    }
}
```

Run the `create-cluster` command to submit the request.

```
aws sagemaker create-cluster \
    --cli-input-json file://complete/path/to/create_cluster.json
```

This returns the ARN of the new cluster.

```
{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde12345"
}
```

## Slurm configuration management
<a name="sagemaker-hyperpod-scaling-slurm-config"></a>

Continuous provisioning operates exclusively in merge mode for `slurm.conf` partition management. In merge mode, HyperPod applies its partition configuration changes additively on top of whatever you have modified in `slurm.conf`. HyperPod only updates the partition-related sections of `slurm.conf` (such as partition name and node name entries); other Slurm configuration parameters are not modified. This means:
+ Your manual edits to `slurm.conf` are preserved.
+ There is no automated drift detection or resolution of conflicts between your modifications and HyperPod's expected state.

The `SlurmConfigStrategy` parameter (`Managed`, `Merge`, `Overwrite`) is not supported with continuous provisioning. Passing any `SlurmConfigStrategy` value results in an API error.
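
To see the partition-related entries that HyperPod maintains in merge mode, you can inspect `slurm.conf` on the controller node (the file path follows HyperPod's standard layout; the output varies by cluster):

```shell
# Show only the partition and node entries that HyperPod manages;
# everything else in slurm.conf is left to you.
grep -E "^(PartitionName|NodeName)" /opt/slurm/etc/slurm.conf
```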

# SageMaker HyperPod cluster management
<a name="sagemaker-hyperpod-cluster-management-slurm"></a>

The following topics discuss logging and managing SageMaker HyperPod clusters.

## Logging SageMaker HyperPod events
<a name="sagemaker-hyperpod-cluster-management-slurm-logging-hyperpod-events"></a>

All events and logs from SageMaker HyperPod are saved to Amazon CloudWatch under the log group name `/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]`. Every call to the `CreateCluster` API creates a new log group. The following list contains all of the available log streams collected in each log group.


| Log Group Name | Log Stream Name |
| --- | --- |
| /aws/sagemaker/Clusters/[ClusterName]/[ClusterID] | LifecycleConfig/[instance-group-name]/[instance-id] |

## Logging SageMaker HyperPod at instance level
<a name="sagemaker-hyperpod-cluster-management-slurm-logging-at-instance-level"></a>

You can access the LifecycleScript logs published to CloudWatch during cluster instance configuration. Every instance within the created cluster generates a separate log stream, distinguishable by the `LifecycleConfig/[instance-group-name]/[instance-id]` format. 

All logs that are written to `/var/log/provision/provisioning.log` are uploaded to the preceding CloudWatch stream. Sample LifecycleScripts at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) redirect their `stdout` and `stderr` to this location. If you are using your custom scripts, write your logs to the `/var/log/provision/provisioning.log` location for them to be available in CloudWatch.
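
If you write custom lifecycle scripts, one way to make their output land in CloudWatch is to mirror `stdout` and `stderr` to the watched file — a minimal sketch, assuming your script runs with permission to write to `/var/log/provision`:

```shell
#!/bin/bash
# Mirror all script output to the file that the CloudWatch agent watches,
# so the logs appear in the instance's LifecycleConfig log stream.
PROVISION_LOG=/var/log/provision/provisioning.log
mkdir -p "$(dirname "$PROVISION_LOG")"
exec > >(tee -a "$PROVISION_LOG") 2>&1

echo "Starting custom provisioning steps..."
# ... your provisioning commands here ...
```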

**Lifecycle script log markers**

CloudWatch logs for lifecycle scripts include specific markers to help you track execution progress and identify issues:


| Marker | Description |
| --- | --- |
| START | Indicates the beginning of lifecycle script logs for the instance |
| [SageMaker] Lifecycle scripts were provided, with S3 uri: [s3://bucket-name/] and entrypoint script: [script-name.sh] | Indicates the S3 location and entrypoint script that will be used |
| [SageMaker] Downloading lifecycle scripts | Indicates scripts are being downloaded from the specified S3 location |
| [SageMaker] Lifecycle scripts have been downloaded | Indicates scripts have been successfully downloaded from S3 |
| [SageMaker] The lifecycle scripts succeeded | Indicates successful completion of all lifecycle scripts |
| [SageMaker] The lifecycle scripts failed | Indicates failed execution of lifecycle scripts |

These markers help you quickly identify where in the lifecycle script execution process an issue occurred. When troubleshooting failures, review the log entries to identify where the process stopped or failed.

**Lifecycle script failure messages**

If the lifecycle script exists but fails during execution, you receive an error message that includes the CloudWatch log group name and log stream name. If lifecycle scripts fail across multiple instances, the error message indicates only one failed instance, but the log group contains streams for all instances.

You can view the error message by running the [DescribeCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeCluster.html) API or by viewing the cluster details page in the SageMaker console. In the console, a **View lifecycle script logs** button is provided that navigates directly to the CloudWatch log stream. The error message has the following format:

```
Instance [instance-id] failed to provision with the following error: "Lifecycle scripts did not run successfully. To view lifecycle script logs,
visit log group '/aws/sagemaker/Clusters/[cluster-name]/[cluster-id]' and log stream 'LifecycleConfig/[instance-group-name]/[instance-id]'.
If you cannot find corresponding lifecycle script logs in CloudWatch, please make sure you follow one of the options here:
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-faq-slurm.html#hyperpod-faqs-q1." Note that multiple instances may be impacted.
```
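
You can also retrieve the failure message from the CLI (the cluster name shown is illustrative):

```shell
# Print the provisioning failure message, including the log group and stream names.
aws sagemaker describe-cluster \
    --cluster-name my-training-cluster \
    --query FailureMessage
```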

## Tagging resources
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging"></a>

The Amazon tagging system helps you manage, identify, organize, search for, and filter resources. SageMaker HyperPod supports tagging, so you can manage clusters as Amazon resources. You can add or edit tags for a cluster when you create it or when you edit an existing cluster. To learn more about tagging in general, see [Tagging your Amazon resources](https://docs.amazonaws.cn/tag-editor/latest/userguide/tagging.html).

### Using the SageMaker HyperPod console UI
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-in-console"></a>

When you are [creating a new cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster) and [editing a cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters), you can add, remove, or edit tags.

### Using the SageMaker HyperPod APIs
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-in-api-request"></a>

When you write a [CreateCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format, edit the `Tags` section.

### Using the Amazon CLI tagging commands for SageMaker AI
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-using-cli"></a>

**To tag a cluster**

Use the [add-tags](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/add-tags.html) CLI command as follows.

```
aws sagemaker add-tags --resource-arn cluster_ARN --tags Key=string,Value=string
```

**To untag a cluster**

Use the [delete-tags](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/delete-tags.html) CLI command as follows.

```
aws sagemaker delete-tags --resource-arn cluster_ARN --tag-keys "tag_key"
```

**To list tags for a resource**

Use the [list-tags](https://docs.amazonaws.cn/cli/latest/reference/sagemaker/list-tags.html) CLI command as follows.

```
aws sagemaker list-tags --resource-arn cluster_ARN
```

# SageMaker HyperPod FAQs
<a name="sagemaker-hyperpod-faq-slurm"></a>

Use the following frequently asked questions to troubleshoot problems with using SageMaker HyperPod.

**Topics**
+ [Why can't I find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?](#hyperpod-faqs-q1)
+ [What particular configurations does HyperPod manage in Slurm configuration files such as `slurm.conf` and `gres.conf`?](#hyperpod-faqs-q2)
+ [How do I run Docker on Slurm nodes on HyperPod?](#hyperpod-faqs-q3)
+ [Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?](#hyperpod-faqs-q4)
+ [How do I use local NVMe store of P instances for launching Docker or Enroot containers with Slurm?](#hyperpod-faqs-q5)
+ [How to set up EFA security groups?](#hyperpod-faqs-q6)
+ [How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?](#hyperpod-faqs-q7)
+ [Can I add an additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.](#hyperpod-faqs-q8)
+ [Why are my compute nodes showing as "DOWN" or "DRAINED" after a reboot?](#hyperpod-faqs-q9)
+ [Why do my nodes keep getting drained due to Out of Memory (OOM) issues?](#hyperpod-faqs-q10)
+ [How can I ensure resources are properly cleaned up after jobs complete?](#hyperpod-faqs-q11)

## Why can't I find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?
<a name="hyperpod-faqs-q1"></a>

By default, agent logs and instance start-up logs are sent to the HyperPod platform account’s CloudWatch. In case of user lifecycle scripts, lifecycle configuration logs are sent to your account’s CloudWatch.

If you use the [sample lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) provided by the HyperPod service team, you can expect to find the lifecycle configuration logs written to `/var/log/provision/provisioning.log`, and you wouldn’t encounter this problem.

However, if you use custom paths for collecting logs from lifecycle provisioning and can’t find the log groups appearing in your account's CloudWatch, it might be due to mismatches in the log file paths specified in your lifecycle scripts and what the CloudWatch agent running on the HyperPod cluster instances looks for. In this case, it means that you need to properly set up your lifecycle scripts to send logs to the CloudWatch agent, and also set up the CloudWatch agent configuration accordingly. To resolve the problem, choose one of the following options.
+ **Option 1:** Update your lifecycle scripts to write logs to `/var/log/provision/provisioning.log`.
+ **Option 2:** Update the CloudWatch agent to look for your custom paths for logging lifecycle provisioning.

  1. Each HyperPod cluster instance contains a CloudWatch agent configuration file in JSON format at `/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json`. In the configuration file, find the field name `logs.logs_collected.files.collect_list.file_path`. With the default setup by HyperPod, the key-value pair should be `"file_path": "/var/log/provision/provisioning.log"` as documented at [Logging SageMaker HyperPod at instance level](sagemaker-hyperpod-cluster-management-slurm.md#sagemaker-hyperpod-cluster-management-slurm-logging-at-instance-level). The following code snippet shows how the JSON file looks with the HyperPod default configuration.

     ```
     "logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/provision/provisioning.log",
                         "log_group_name": "/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]",
                         "log_stream_name": "LifecycleConfig/[InstanceGroupName]/{instance_id}",
                         "retention_in_days": -1
                     }
                 ]
             }
         },
         "force_flush_interval": 3
     }
     ```

  1. Replace the value for the `"file_path"` field name with the custom path you use in your lifecycle scripts. For example, if you have set up your lifecycle scripts to write to `/var/log/custom-provision/custom-provisioning.log`, update the value to match with it as follows.

     ```
     "file_path": "/var/log/custom-provision/custom-provisioning.log"
     ```

  1. Restart the CloudWatch agent with the configuration file to finish applying the custom path. For example, the following CloudWatch command shows how to restart the CloudWatch agent with the CloudWatch agent configuration file from step 1. For more information, see also [Troubleshooting the CloudWatch agent](https://docs.amazonaws.cn/AmazonCloudWatch/latest/monitoring/troubleshooting-CloudWatch-Agent.html).

     ```
     sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
         -a fetch-config -m ec2 -s -c \
         file:/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json
     ```

## What particular configurations does HyperPod manage in Slurm configuration files such as `slurm.conf` and `gres.conf`?
<a name="hyperpod-faqs-q2"></a>

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) and [gres.conf](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites.

**Important**  
We strongly recommend that you DON’T change these parameters managed by HyperPod.
+ In [slurm.conf](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [gres.conf](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.

## How do I run Docker on Slurm nodes on HyperPod?
<a name="hyperpod-faqs-q3"></a>

To help you run Docker on your Slurm nodes running on HyperPod, the HyperPod service team provides setup scripts that you can include as part of the lifecycle configuration for cluster creation. To learn more, see [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) and [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md).

## Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?
<a name="hyperpod-faqs-q4"></a>

By default, systemd on Linux uses the setting `RemoveIPC=yes` (in `/etc/systemd/logind.conf`, the line may appear commented out as `#RemoveIPC=yes`, which still leaves the default in effect). Slurm and mpirun jobs that use NCCL generate inter-process communication (IPC) resources under non-root user sessions. These user sessions might log out during the job process.

When you run jobs with Slurm or mpirun, if `systemd` detects that the user isn't logged in, it cleans up the IPC resources. Slurm and mpirun jobs can run without the user being logged in, but that requires that you disable cleanup at the systemd level and set it up at the Slurm level instead. For more information, see [Systemd in the NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#systemd).

To disable cleanup at the systemd level, complete the following steps.

1. If you're running training jobs that use Slurm and NCCL, set `RemoveIPC=no` in the file `/etc/systemd/logind.conf` (uncommenting the line if necessary), and restart `systemd-logind` for the change to take effect.

1.  By default, Slurm doesn't clean up shared resources. We recommend that you set up a Slurm epilog script to clean up shared resources. This cleanup is useful when you have a lot of shared resources and want to clean them up after training jobs. The following is an example script.

   ```
   #!/bin/bash
   : <<'SUMMARY'
   Script: epilog.sh
   
   Use this script with caution, as it can potentially delete unnecessary resources and cause issues if you don't use it correctly.
   
   Note: You must save this script in a shared location that is accessible to all nodes in the cluster, such as an /fsx volume.
   Workers must be able to access the script to run the script after jobs.
   
   SUMMARY
   
   # Define the log directory and create it if it doesn't exist
   LOG_DIR="/<PLACEHOLDER>/epilogue" #NOTE: Update PLACEHOLDER to be a shared value path, such as /fsx/epilogue.
   mkdir -p "$LOG_DIR"
   
   # Name the log file using the Slurm job name and job ID
   log_file="$LOG_DIR/epilogue-${SLURM_JOB_NAME}_${SLURM_JOB_ID}.log"
   
   logging() {
       echo "[$(date)] $1" | tee -a "$log_file"
   }
   
   # Slurm epilogue script to clean up IPC resources
   logging "Starting IPC cleanup for Job $SLURM_JOB_ID"
   
   # Clean up shared memory segments by username
   for seg in $(ipcs -m | awk -v owner="$SLURM_JOB_USER" '$3 == owner {print $2}'); do
       if ipcrm -m "$seg"; then
           logging "Removed shared memory segment $seg"
       else
           logging "Failed to remove shared memory segment $seg"
       fi
   done
   
   # Clean up semaphores by username
   for sem in $(ipcs -s | awk -v user="$SLURM_JOB_USER" '$3 == user {print $2}'); do
       if ipcrm -s "$sem"; then
           logging "Removed semaphore $sem"
       else
           logging "Failed to remove semaphore $sem"
       fi
   done
   
   # Clean up NCCL IPC
   NCCL_IPC_PATH="/dev/shm/nccl-*"
   for file in $NCCL_IPC_PATH; do
       if [ -e "$file" ]; then
           if rm "$file"; then
               logging "Removed NCCL IPC file $file"
           else
               logging "Failed to remove NCCL IPC file $file"
           fi
       fi
   done
   logging "IPC cleanup completed for Job $SLURM_JOB_ID"
   exit 0
   ```

   For more information about the Epilog parameter, see [Slurm documentation](https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog).

1. In the `slurm.conf` file from the controller node, add in a line to point to the epilog script you created.

   ```
   Epilog=/path/to/epilog.sh  # For example: /fsx/epilogue/epilog.sh
   ```

1. Run the following commands to change permissions of the script and to make it executable.

   ```
   chown slurm:slurm /path/to/epilog.sh
   chmod +x /path/to/epilog.sh
   ```

1. To apply all of your changes, run `scontrol reconfigure`.

## How do I use local NVMe store of P instances for launching Docker or Enroot containers with Slurm?
<a name="hyperpod-faqs-q5"></a>

Because the default root volume of your head node is usually limited to a 100 GB Amazon EBS volume, you need to set up Docker and Enroot to use the local NVMe instance store. To learn how to set up the NVMe store and use it for launching Docker containers, see [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md).

## How to set up EFA security groups?
<a name="hyperpod-faqs-q6"></a>

If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see [Step 1: Prepare an EFA-enabled security group](https://docs.amazonaws.cn/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

## How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?
<a name="hyperpod-faqs-q7"></a>

To gain observability into the resource utilization of your HyperPod cluster, we recommend that you integrate the HyperPod cluster with Amazon Managed Grafana and Amazon Managed Service for Prometheus. With various open-source Grafana dashboards and exporter packages, you can export and visualize metrics related to the HyperPod cluster resources. To learn more about setting up SageMaker HyperPod with Amazon Managed Grafana and Amazon Managed Service for Prometheus, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). Note that SageMaker HyperPod currently doesn't support exporting system metrics to Amazon CloudWatch.

## Can I add additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.
<a name="hyperpod-faqs-q8"></a>

If the default instance storage is insufficient for your workload, you can configure additional storage per instance. Starting from the [release on June 20, 2024](sagemaker-hyperpod-release-notes.md#sagemaker-hyperpod-release-notes-20240620), you can add an additional Amazon Elastic Block Store (Amazon EBS) volume to each instance in your SageMaker HyperPod cluster. Note that this capability cannot be applied to the existing instance groups of clusters created before June 20, 2024; to use it with such clusters, patch the cluster and add new instance groups to it. This capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.
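As a sketch of what adding a new instance group with an extra EBS volume might look like, the following uses the `InstanceStorageConfigs` field of the instance group definition; the cluster name, group name, S3 URI, and role ARN are all placeholders.

```
aws sagemaker update-cluster \
  --cluster-name my-hyperpod-cluster \
  --instance-groups '[{
    "InstanceGroupName": "worker-group-2",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 4,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle/",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
    "InstanceStorageConfigs": [
      {"EbsVolumeConfig": {"VolumeSizeInGB": 500}}
    ]
  }]'
```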

## Why are my compute nodes showing as "DOWN" or "DRAINED" after a reboot?
<a name="hyperpod-faqs-q9"></a>

This typically occurs when nodes are rebooted using `sudo reboot` instead of Slurm's control interface. To properly reboot nodes, use the Slurm command `scontrol reboot nextstate=resume <list_of_nodes>`. This ensures Slurm maintains proper control of the node state and resumes normal operation after reboot.

For GPU instances (such as NVIDIA P5 instances), this can also happen if the node can't complete its boot process within Slurm's default time limit of 60 seconds. To resolve this, increase the `ResumeTimeout` parameter in `slurm.conf` to 300 seconds. This gives GPU instances sufficient time to boot and initialize drivers.
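For example, the change might look like the following; `ResumeTimeout` is the Slurm parameter that bounds how long a node may take to come back after a resume or reboot request, and the file path varies by installation.

```
# In slurm.conf: allow up to 300 seconds for nodes to boot and
# initialize GPU drivers before Slurm marks them DOWN.
ResumeTimeout=300
```

After editing the file, apply the change with `sudo scontrol reconfigure`.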

## Why do my nodes keep getting drained due to Out of Memory (OOM) issues?
<a name="hyperpod-faqs-q10"></a>

OOM issues occur when jobs exceed the node's memory capacity. To prevent this, implement `cgroups` to enforce memory limits per job. This prevents a single job from affecting the entire node and improves isolation and stability.

Example setup in `slurm.conf`: 

```
TaskPlugin=task/cgroup
```

Example setup in `cgroup.conf`:

```
CgroupAutomount=yes
ConstrainCores=yes
CgroupPlugin=autodetect
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
SignalChildrenProcesses=yes
MaxRAMPercent=99
MaxSwapPercent=80
MinRAMSpace=100
```

For more information, see [Control Group in Slurm](https://slurm.schedmd.com/cgroups.html), [Cgroup and PAM-based login control for Slurm compute nodes](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/pam_adopt_cgroup_wheel.sh#L197), and [Configure Cgroups for Slurm](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/16-enable-cgroups).
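With the settings above in place, a job that declares its memory requirement is confined to that amount by the cgroup; the script name below is a placeholder.

```
# Request 8 GB per node; under ConstrainRAMSpace=yes, processes that
# exceed this limit are killed by the cgroup instead of draining the node.
sbatch --mem=8G --wrap "srun python train.py"
```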

## How can I ensure resources are properly cleaned up after jobs complete?
<a name="hyperpod-faqs-q11"></a>

Implement epilogue scripts to automatically clean up resources after jobs complete. Resources might not be cleared correctly when jobs crash unexpectedly, contain bugs that prevent normal cleanup, or when shared memory buffers (including those shared between processes and GPU drivers) remain allocated.

Epilogue scripts can perform tasks such as clearing GPU memory, removing temporary files, and unmounting file systems. These scripts have limitations when resources are not exclusively allocated to a single job. For detailed instructions and sample scripts, see the second bullet point of the question [Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?](#hyperpod-faqs-q4). For more information, see [Enable Slurm epilog script](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/18-slurm-epilogue).
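As an illustration, a minimal epilog sketch might look like the following. The `job-<id>-*` file-naming convention is an assumption about how your jobs name their scratch files; Slurm exports `SLURM_JOB_USER` and `SLURM_JOB_ID` when it invokes the script (the defaults below only make the sketch runnable standalone).

```
#!/bin/bash
# Hypothetical epilog sketch run by slurmd after each job completes.
JOB_USER="${SLURM_JOB_USER:-$(whoami)}"
JOB_ID="${SLURM_JOB_ID:-0}"

# Remove per-job scratch files the job left under /tmp.
find /tmp -maxdepth 1 -user "$JOB_USER" -name "job-${JOB_ID}-*" -exec rm -rf {} +

# Remove stale shared-memory segments (for example, leftover NCCL buffers).
find /dev/shm -maxdepth 1 -user "$JOB_USER" -name "job-${JOB_ID}-*" \
  -exec rm -rf {} + 2>/dev/null || true
```

To activate a script like this, point the `Epilog` parameter in `slurm.conf` at it and run `scontrol reconfigure`, as described earlier in this section.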