Customizing SageMaker HyperPod clusters using lifecycle scripts
SageMaker HyperPod offers always up-and-running compute clusters that are highly customizable: you write lifecycle scripts that tell SageMaker HyperPod how to set up the cluster resources. The following topics describe best practices for preparing lifecycle scripts to set up SageMaker HyperPod clusters with open source workload manager tools, with an in-depth focus on setting up Slurm configurations on SageMaker HyperPod.
High-level overview
The following procedure outlines the main flow for provisioning a HyperPod cluster and setting it up with Slurm. The steps are ordered bottom-up, and later steps refer to earlier ones by their position in the list (Step 1, Step 2, and so on).
- Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.
- Prepare a provisioning_parameters.json file, which is a configuration form for provisioning Slurm nodes on HyperPod. provisioning_parameters.json should contain the configuration of the Slurm nodes to be provisioned on the HyperPod cluster, and should reflect the Slurm node design from Step 1.
- Prepare a set of lifecycle scripts that set up Slurm on HyperPod, install software packages, and set up an environment in the cluster for your use case. Structure the lifecycle scripts so that a central Python script (lifecycle_script.py) runs them collectively in order, and write an entrypoint shell script (on_create.sh) that runs the Python script. The entrypoint shell script is what you provide to the HyperPod cluster creation request later in Step 6. Also, write the scripts to expect a resource_config.json file, which HyperPod generates during cluster creation. resource_config.json contains HyperPod cluster resource information such as IP addresses, instance types, and ARNs, and is what you use to configure Slurm.
- Collect all the files from the previous steps into a folder.
    └── lifecycle_files // your local folder
        ├── provisioning_parameters.json
        ├── on_create.sh
        ├── lifecycle_script.py
        └── ... // more setup scripts to be fed into lifecycle_script.py
- Upload all the files to an S3 bucket. Copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with sagemaker- because you need to choose an IAM role for SageMaker HyperPod that has AmazonSageMakerClusterInstanceRolePolicy attached, and that policy only allows S3 bucket paths starting with the prefix sagemaker-. The following is an example command to upload all the files to an S3 bucket.
    aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
- Prepare a HyperPod cluster creation request.
    - Option 1: If you use the Amazon CLI, write a cluster creation request in JSON format (create_cluster.json) following the instructions at Create a new cluster.
    - Option 2: If you use the SageMaker AI console UI, fill in the Create a cluster request form in the HyperPod console UI following the instructions at Create a SageMaker HyperPod cluster.
  At this stage, make sure that you create instance groups in the same structure that you planned in Steps 1 and 2. Also, make sure that you specify the S3 bucket from Step 5 in the request forms.
- Submit the cluster creation request. HyperPod provisions a cluster based on the request, creates a resource_config.json file on the HyperPod cluster instances, and sets up Slurm on the cluster by running the lifecycle scripts.
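To illustrate the central script described in Step 3, the following is a minimal sketch of how lifecycle_script.py might run setup commands in order and stop at the first failure. The run_steps helper and the script names mentioned in its comments are hypothetical and are not part of HyperPod. The sketch uses only Python's standard subprocess module.

```python
import subprocess
import sys
from typing import List, Sequence


def run_steps(steps: Sequence[Sequence[str]]) -> List[List[str]]:
    """Run each command in order and return the commands that completed.

    Each step is an argument list, for example
    ["bash", "./install_packages.sh"] (a hypothetical script name).
    """
    completed: List[List[str]] = []
    for cmd in steps:
        print(f"[lifecycle] running: {' '.join(cmd)}", flush=True)
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on a partially configured node.
        subprocess.run(list(cmd), check=True)
        completed.append(list(cmd))
    return completed
```

In a real lifecycle_script.py, you would likely call run_steps from a main guard with your own list of setup scripts and exit with a non-zero status on failure, so that the entrypoint script (on_create.sh) can report the error.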
The following topics walk you through, in detail, how to organize configuration files and lifecycle scripts so that they work properly during HyperPod cluster creation.
Topics
- What particular configurations HyperPod manages in Slurm configuration files 
- Validating the JSON configuration files before creating a Slurm cluster on HyperPod 
- Validating runtime before running production workloads on a HyperPod Slurm cluster 
- Developing lifecycle scripts interactively on a HyperPod cluster node
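As a companion to the validation topics listed above, the following is a minimal sketch of a pre-submission check that compares create_cluster.json with provisioning_parameters.json. It rests on assumptions to verify against the current documentation: that the cluster request lists instance groups under InstanceGroups with InstanceGroupName and LifeCycleConfig.SourceS3Uri fields, and that provisioning_parameters.json names its groups under controller_group, login_group, and worker_groups[].instance_group_name. The validate_configs function is a hypothetical helper, not part of any AWS tool.

```python
from typing import Any, Dict, List


def validate_configs(create_cluster: Dict[str, Any],
                     provisioning: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: List[str] = []
    groups = create_cluster.get("InstanceGroups", [])
    cluster_names = {g.get("InstanceGroupName") for g in groups}

    # Instance group names that provisioning_parameters.json refers to
    # (field names are assumptions, see the lead-in).
    referenced = [provisioning.get("controller_group")]
    if provisioning.get("login_group"):
        referenced.append(provisioning["login_group"])
    referenced += [w.get("instance_group_name")
                   for w in provisioning.get("worker_groups", [])]

    for name in referenced:
        if name not in cluster_names:
            problems.append(
                f"instance group {name!r} is not defined in the cluster request")

    # The managed instance role policy only allows S3 paths that start
    # with the sagemaker- prefix.
    for g in groups:
        uri = g.get("LifeCycleConfig", {}).get("SourceS3Uri", "")
        if not uri.startswith("s3://sagemaker-"):
            problems.append(
                f"{g.get('InstanceGroupName')}: S3 path {uri!r} "
                "does not start with s3://sagemaker-")
    return problems
```

You could load both files with json.load and print the returned problems before submitting the request. An empty list only means that these two checks passed, not that the configuration is otherwise valid.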