

# Customizing SageMaker HyperPod clusters using lifecycle scripts
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm"></a>

SageMaker HyperPod offers persistent, always-running compute clusters that are highly customizable: you write lifecycle scripts that tell SageMaker HyperPod how to set up the cluster resources.

The following topics discuss in-depth best practices for preparing lifecycle scripts to set up Slurm configurations on SageMaker HyperPod.

## High-level overview
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview"></a>

The following procedure is the main flow of provisioning a HyperPod cluster and setting it up with Slurm. The steps follow a ***bottom-up*** approach.

1. Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.

1. Prepare Slurm configuration. Choose one of the following approaches:
   + **Option A: API-driven configuration (recommended)** – Define Slurm node types and partitions directly in the `CreateCluster` API payload using `SlurmConfig` within each instance group. With this approach:
     + No `provisioning_parameters.json` file is needed
     + Slurm topology is defined in the API payload alongside instance group definitions
     + FSx filesystems are configured per-instance-group via `InstanceStorageConfigs`
     + Configuration strategy is controlled via `Orchestrator.Slurm.SlurmConfigStrategy`

     Example `SlurmConfig` in an instance group:

     ```
     {
         "InstanceGroupName": "gpu-compute",
         "InstanceType": "ml.p4d.24xlarge",
         "InstanceCount": 8,
         "SlurmConfig": {
             "NodeType": "Compute",
             "PartitionNames": ["gpu-training"]
         }
     }
     ```
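
     At the cluster level, the configuration strategy mentioned above sits under `Orchestrator` in the same `CreateCluster` payload. The following fragment sketches its shape based on the field path described above; the strategy value is a placeholder, so see the API reference for the allowed values.

     ```
     {
         "Orchestrator": {
             "Slurm": {
                 "SlurmConfigStrategy": "..."
             }
         }
     }
     ```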
   + **Option B: Legacy configuration** – Prepare a `provisioning_parameters.json` file, following the [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). The file should contain the Slurm node configuration information to be provisioned on the HyperPod cluster, reflecting the design of Slurm nodes from Step 1.
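
     For example, a minimal `provisioning_parameters.json` for one controller group and one worker group might look like the following sketch. The group and partition names are placeholders; see the linked configuration form for the full schema.

     ```
     {
         "version": "1.0.0",
         "workload_manager": "slurm",
         "controller_group": "controller-machine",
         "worker_groups": [
             {
                 "instance_group_name": "gpu-compute",
                 "partition_name": "gpu-training"
             }
         ]
     }
     ```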

1. Prepare a set of lifecycle scripts to set up Slurm on HyperPod, install software packages, and set up an environment in the cluster for your use case. Structure the lifecycle scripts so that they collectively run, in order, from a central Python script (`lifecycle_script.py`), and write an entrypoint shell script (`on_create.sh`) that runs the Python script. The entrypoint shell script is what you provide to the HyperPod cluster creation request later in Step 6.

   Also, note that you should write the scripts to expect `resource_config.json` that will be generated by HyperPod during cluster creation. `resource_config.json` contains HyperPod cluster resource information such as IP addresses, instance types, and ARNs, and is what you need to use for configuring Slurm.

1. Collect all the files from the previous steps into a folder. The folder structure depends on the configuration approach you selected in Step 2.

   If you selected Option A (API-driven configuration):

   Your folder only needs lifecycle scripts for custom setup tasks. Slurm configuration and FSx mounting are handled automatically by HyperPod based on the API payload.

   ```
   └── lifecycle_files // your local folder
   
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```
   **Note**  
   The `provisioning_parameters.json` file is not required when using API-driven configuration.

   If you selected Option B (legacy configuration):

   Your folder must include `provisioning_parameters.json` and the full set of lifecycle scripts.

   ```
   └── lifecycle_files // your local folder
   
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

1. Upload all the files to an S3 bucket, then copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with `sagemaker-`, because you need to choose an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attached with [`AmazonSageMakerClusterInstanceRolePolicy`](security-iam-awsmanpol-AmazonSageMakerClusterInstanceRolePolicy.md), which only allows S3 bucket paths starting with the prefix `sagemaker-`. The following is an example command to upload all the files to an S3 bucket.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```

1. Prepare a HyperPod cluster creation request. 
   + Option 1: If you use the AWS CLI, write a cluster creation request in JSON format (`create_cluster.json`) following the instructions at [Create a new cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster).
   + Option 2: If you use the SageMaker AI console UI, fill the **Create a cluster** request form in the HyperPod console UI following the instructions at [Create a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster).

   At this stage, make sure that you create instance groups in the same structure that you planned in Steps 1 and 2, and that you specify the S3 bucket path from Step 5 in the request.
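
   For the CLI option, a minimal `create_cluster.json` might look like the following sketch. The cluster name, instance group, role ARN, and S3 path are placeholders; `LifeCycleConfig` points at the entrypoint script from Step 3 in the S3 prefix from Step 5.

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "controller-machine",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
               "ThreadsPerCore": 1
           }
       ]
   }
   ```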

1. Submit the cluster creation request. HyperPod provisions a cluster based on the request, creates a `resource_config.json` file on the HyperPod cluster instances, and sets up Slurm on the cluster by running the lifecycle scripts.
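
The central Python script described in Step 3 can be sketched as follows. The `resource_config.json` path and the `InstanceGroups`/`Name` field names here are assumptions for illustration (inspect the generated file on a cluster instance for the exact schema), and the setup script names are hypothetical.

```python
import json
import subprocess
from pathlib import Path

# Path where HyperPod writes resource_config.json on each instance
# (an assumption here; verify on a running instance).
RESOURCE_CONFIG = Path("/opt/ml/config/resource_config.json")

def read_resource_config(path=RESOURCE_CONFIG):
    """Load cluster resource info (IP addresses, instance types, ARNs)."""
    with open(path) as f:
        return json.load(f)

def run_setup_scripts(scripts):
    """Run each setup script in order, failing fast on the first error."""
    for script in scripts:
        subprocess.run(["bash", script], check=True)

def main():
    config = read_resource_config()
    # "InstanceGroups" / "Name" are assumed field names for illustration.
    groups = [g["Name"] for g in config.get("InstanceGroups", [])]
    print(f"Provisioned instance groups: {groups}")
    run_setup_scripts(["install_packages.sh", "configure_slurm.sh"])  # hypothetical scripts

if __name__ == "__main__" and RESOURCE_CONFIG.exists():
    main()
```

The entrypoint `on_create.sh` from Step 3 would simply invoke this script with `python3`.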

The following topics walk you through the details of how to organize configuration files and lifecycle scripts so that they work properly during HyperPod cluster creation.

**Topics**
+ [High-level overview](#sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview)
+ [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md)
+ [What particular configurations HyperPod manages in Slurm configuration files](sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf.md)
+ [Slurm log rotations](sagemaker-hyperpod-slurm-log-rotation.md)
+ [Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx.md)
+ [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md)
+ [Validating runtime before running production workloads on a HyperPod Slurm cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime.md)
+ [Developing lifecycle scripts interactively on a HyperPod cluster node](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts.md)