SageMaker HyperPod lifecycle configuration best practices
SageMaker HyperPod offers always up-and-running compute clusters, which are highly customizable: you can write lifecycle scripts that tell SageMaker HyperPod how to set up the cluster resources. The following topics are best practices for preparing lifecycle scripts to set up SageMaker HyperPod clusters with open source workload manager tools.
Prepare lifecycle scripts for setting up Slurm on SageMaker HyperPod
The following topics discuss how to prepare lifecycle scripts to set up Slurm.
Topics
- High-level overview
- Start with base lifecycle scripts provided by HyperPod
- What particular configurations HyperPod manages in Slurm configuration files
- Mount Amazon FSx for Lustre to your HyperPod cluster
- Validate the JSON configuration files before creating a Slurm cluster on HyperPod
- Validate runtime before running production workloads on a Slurm cluster on HyperPod
- Develop lifecycle scripts interactively on a cluster node
- Update a cluster with new or updated lifecycle scripts
- Considerations
High-level overview
The following procedure is the main flow of provisioning a HyperPod cluster and setting it up with Slurm. The steps are listed in a bottom-up order.

1. Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.

2. Prepare a `provisioning_parameters.json` file, which is a configuration form for provisioning Slurm nodes on HyperPod. `provisioning_parameters.json` should contain the Slurm node configuration information to be provisioned on the HyperPod cluster. This should reflect the design of the Slurm nodes from Step 1.

3. Prepare a set of lifecycle scripts to set up Slurm on HyperPod, install software packages, and set up an environment in the cluster for your use case. You should structure the lifecycle scripts so that they run collectively, in order, from a central Python script (`lifecycle_script.py`), and write an entrypoint shell script (`on_create.sh`) to run the Python script. The entrypoint shell script is what you need to provide in the HyperPod cluster creation request later in Step 6.

   Also, note that you should write the scripts to expect a `resource_config.json` file that will be generated by HyperPod during cluster creation. `resource_config.json` contains HyperPod cluster resource information such as IP addresses, instance types, and ARNs, and is what you need to use for configuring Slurm.

4. Collect all the files from the previous steps into a folder.

   ```
   └── lifecycle_files               // your local folder
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ...                       // more setup scripts to be fed into lifecycle_script.py
   ```

5. Upload all the files to an S3 bucket. Copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with `sagemaker-`, because you need to choose an IAM role for SageMaker HyperPod with AmazonSageMakerClusterInstanceRolePolicy attached, which only allows S3 bucket paths starting with the prefix `sagemaker-`. The following command is an example of uploading all the files to an S3 bucket.

   ```shell
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```

6. Prepare a HyperPod cluster creation request.

   - Option 1: If you use the AWS CLI, write a cluster creation request in JSON format (`create_cluster.json`) following the instructions at Create a new cluster.

   - Option 2: If you use the SageMaker console UI, fill out the Create a cluster request form in the HyperPod console UI following the instructions at Create a SageMaker HyperPod cluster.

   At this stage, make sure that you create instance groups in the same structure that you planned in Steps 1 and 2. Also, make sure that you specify the S3 bucket from Step 5 in the request forms.

7. Submit the cluster creation request. HyperPod provisions a cluster based on the request, creates a `resource_config.json` file on the HyperPod cluster instances, and then sets up Slurm on the cluster by running the lifecycle scripts.
The following section walks you through the details of how to organize configuration files and lifecycle scripts so that they work properly during HyperPod cluster creation.
Start with base lifecycle scripts provided by HyperPod
This section walks you through every component of the basic flow of setting up Slurm on HyperPod in a top-down approach. It starts from preparing a HyperPod cluster creation request to run the `CreateCluster` API, and dives deep into the hierarchical structure down to the lifecycle scripts. Use the sample lifecycle scripts provided in the Awsome Distributed Training GitHub repository.

```shell
git clone https://github.com/aws-samples/awsome-distributed-training/
```

The base lifecycle scripts for setting up a Slurm cluster on SageMaker HyperPod are available at `1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config`.

```shell
cd awsome-distributed-training/1.architectures/5.sagemaker_hyperpods/LifecycleScripts/base-config
```
The following flowchart shows a detailed overview of how you should design the base lifecycle scripts. The descriptions below the diagram and the procedural guide explain how they work during the HyperPod `CreateCluster` API call.
The following procedural guide explains what happens during HyperPod cluster creation and how the base lifecycle scripts are designed.
1. `create_cluster.json` – To submit a HyperPod cluster creation request, you prepare a `CreateCluster` request file in JSON format. In this best practices example, we assume that the request file is named `create_cluster.json`. Write `create_cluster.json` to provision a HyperPod cluster with instance groups. The best practice is to add the same number of instance groups as the number of Slurm nodes you plan to configure on the HyperPod cluster. Make sure that you give distinctive names to the instance groups that you'll assign to the Slurm nodes you plan to set up.

   Also, you are required to specify an S3 bucket path that stores your entire set of configuration files and lifecycle scripts in the field `InstanceGroups.LifeCycleConfig.SourceS3Uri` of the `CreateCluster` request form, and to specify the file name of an entrypoint shell script (assume that it's named `on_create.sh`) in `InstanceGroups.LifeCycleConfig.OnCreate`.

   Note

   If you are using the Create a cluster submission form in the HyperPod console UI, the console manages filling in and submitting the `CreateCluster` request on your behalf, and runs the `CreateCluster` API in the backend. In this case, you don't need to create `create_cluster.json`; instead, make sure that you specify the correct cluster configuration information in the Create a cluster submission form.
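For reference, a `create_cluster.json` request body might look like the following sketch. All values here (cluster name, group names, instance types, counts, and the role ARN) are illustrative placeholders, and the fields shown are a minimal subset; confirm the exact schema against the `CreateCluster` API reference before use.

```json
{
  "ClusterName": "your-hyperpod-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.c5.xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "compute-nodes",
      "InstanceType": "ml.trn1.32xlarge",
      "InstanceCount": 4,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
      "ThreadsPerCore": 1
    }
  ]
}
```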
2. `on_create.sh` – For each instance group, you need to provide an entrypoint shell script, `on_create.sh`, to run commands, run scripts that install software packages, and set up the HyperPod cluster environment with Slurm. The two things you need to prepare are a `provisioning_parameters.json` file required by HyperPod for setting up Slurm and a set of lifecycle scripts for installing software packages. This script should be written to find and run the following files, as shown in the sample script `on_create.sh`.

   Note

   Make sure that you upload the entire set of lifecycle scripts to the S3 location you specify in `create_cluster.json`. You should also place your `provisioning_parameters.json` in the same location.
   - `provisioning_parameters.json` – This is a configuration form for provisioning Slurm nodes on HyperPod. The `on_create.sh` script finds this JSON file and defines an environment variable that identifies the path to it. Through this JSON file, you can configure Slurm nodes and storage options, such as Amazon FSx for Lustre, for Slurm to communicate with. In `provisioning_parameters.json`, make sure that you assign the HyperPod cluster instance groups, using the names you specified in `create_cluster.json`, to the Slurm nodes appropriately, based on how you plan to set them up.

     The following diagram shows an example of how the two JSON configuration files `create_cluster.json` and `provisioning_parameters.json` should be written to assign HyperPod instance groups to Slurm nodes. In this example, we assume a case of setting up three Slurm nodes: a controller (management) node, a log-in node (which is optional), and a compute (worker) node.

     Tip

     To help you validate these two JSON files, the HyperPod service team provides a validation script, `validate-config.py`. To learn more, see Validate the JSON configuration files before creating a Slurm cluster on HyperPod.
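Tying this together, a minimal `provisioning_parameters.json` might look like the following sketch, modeled on the sample file in the Awsome Distributed Training repository. The group names must match the instance group names in `create_cluster.json`; the partition name and FSx values are placeholders, and the login group and FSx fields apply only if you use them. Verify the field names against the version of the repository you cloned.

```json
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "compute-nodes",
      "partition_name": "dev"
    }
  ],
  "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
  "fsx_mountname": "1abcdefg"
}
```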
   - `resource_config.json` – During cluster creation, the `lifecycle_script.py` script is written to expect a `resource_config.json` file from HyperPod. This file contains information about the cluster, such as instance types and IP addresses.

     When you run the `CreateCluster` API, HyperPod creates a resource configuration file at `/opt/ml/config/resource_config.json` based on the `create_cluster.json` file. The file path is saved to the environment variable named `SAGEMAKER_RESOURCE_CONFIG_PATH`.

     Important

     The `resource_config.json` file is auto-generated by the HyperPod platform; you DO NOT need to create it. The following code shows an example of the `resource_config.json` that would be created from the cluster creation based on `create_cluster.json` in the previous step, to help you understand what happens in the backend and how an auto-generated `resource_config.json` would look.

     ```json
     {
       "ClusterConfig": {
         "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde01234yz",
         "ClusterName": "your-hyperpod-cluster"
       },
       "InstanceGroups": [
         {
           "Name": "controller-machine",
           "InstanceType": "ml.c5.xlarge",
           "Instances": [
             {
               "InstanceName": "controller-machine-1",
               "AgentIpAddress": "111.222.333.444",
               "CustomerIpAddress": "111.222.333.444",
               "InstanceId": "i-12345abcedfg67890"
             }
           ]
         },
         {
           "Name": "login-group",
           "InstanceType": "ml.m5.xlarge",
           "Instances": [
             {
               "InstanceName": "login-group-1",
               "AgentIpAddress": "111.222.333.444",
               "CustomerIpAddress": "111.222.333.444",
               "InstanceId": "i-12345abcedfg67890"
             }
           ]
         },
         {
           "Name": "compute-nodes",
           "InstanceType": "ml.trn1.32xlarge",
           "Instances": [
             {
               "InstanceName": "compute-nodes-1",
               "AgentIpAddress": "111.222.333.444",
               "CustomerIpAddress": "111.222.333.444",
               "InstanceId": "i-12345abcedfg67890"
             },
             {
               "InstanceName": "compute-nodes-2",
               "AgentIpAddress": "111.222.333.444",
               "CustomerIpAddress": "111.222.333.444",
               "InstanceId": "i-12345abcedfg67890"
             },
             {
               "InstanceName": "compute-nodes-3",
               "AgentIpAddress": "111.222.333.444",
               "CustomerIpAddress": "111.222.333.444",
               "InstanceId": "i-12345abcedfg67890"
             },
             {
               "InstanceName": "compute-nodes-4",
               "AgentIpAddress": "111.222.333.444",
               "CustomerIpAddress": "111.222.333.444",
               "InstanceId": "i-12345abcedfg67890"
             }
           ]
         }
       ]
     }
     ```
3. `lifecycle_script.py` – This is the main Python script that collectively runs the lifecycle scripts that set up Slurm on the HyperPod cluster while it is being provisioned. This script reads in `provisioning_parameters.json` and `resource_config.json` from the paths that are specified or identified in `on_create.sh`, passes the relevant information to each lifecycle script, and then runs the lifecycle scripts in order.

   Lifecycle scripts are a set of scripts that you have complete flexibility to customize to install software packages and set up necessary or custom configurations during cluster creation, such as setting up Slurm, creating users, and installing Conda or Docker. The sample `lifecycle_script.py` script is prepared to run other base lifecycle scripts in the repository, such as launching Slurm daemons (`start_slurm.sh`), mounting Amazon FSx for Lustre (`mount_fsx.sh`), and setting up MariaDB accounting (`setup_mariadb_accounting.sh`) and RDS accounting (`setup_rds_accounting.sh`). You can also add more scripts, package them under the same directory, and add code lines to `lifecycle_script.py` to let HyperPod run the scripts. For more information about the base lifecycle scripts, see also 3.1 Lifecycle scripts in the Awsome Distributed Training GitHub repository.

   In addition to the default setups, scripts for installing the following software are available under the `utils` folder. The `lifecycle_script.py` file already includes the code lines for running the installation scripts, so see the following items to find those lines and uncomment them to activate the installations.

   - The following code lines are for installing Docker, Enroot, and Pyxis. These packages are required to run Docker containers on a Slurm cluster. To enable this installation step, set the `enable_docker_enroot_pyxis` parameter to `True` in the `config.py` file.

     ```python
     # Install Docker/Enroot/Pyxis
     if Config.enable_docker_enroot_pyxis:
         ExecuteBashScript("./utils/install_docker.sh").run()
         ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type)
     ```
   - You can integrate your HyperPod cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about the HyperPod cluster and cluster nodes to Amazon Managed Grafana dashboards. To export metrics and use the Slurm dashboard, the NVIDIA DCGM Exporter dashboard, and the EFA Metrics dashboard on Amazon Managed Grafana, you need to install the Slurm exporter for Prometheus, the NVIDIA DCGM exporter, and the EFA node exporter. For more information about installing the exporter packages and using Grafana dashboards on an Amazon Managed Grafana workspace, see Monitor SageMaker HyperPod cluster resources. To enable this installation step, set the `enable_observability` parameter to `True` in the `config.py` file.

     ```python
     # Install metric exporting software and Prometheus for observability
     if Config.enable_observability:
         if node_type == SlurmNodeType.COMPUTE_NODE:
             ExecuteBashScript("./utils/install_docker.sh").run()
             ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
             ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

         if node_type == SlurmNodeType.HEAD_NODE:
             wait_for_scontrol()
             ExecuteBashScript("./utils/install_docker.sh").run()
             ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
             ExecuteBashScript("./utils/install_prometheus.sh").run()
     ```
4. Make sure that you upload all configuration files and setup scripts from Step 2 to the S3 bucket you provide in the `CreateCluster` request in Step 1. For example, assume that your `create_cluster.json` has the following.

   ```json
   "LifeCycleConfig": {
       "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle/src",
       "OnCreate": "on_create.sh"
   }
   ```

   Then, your `s3://sagemaker-hyperpod-lifecycle/src` should contain `on_create.sh`, `lifecycle_script.py`, `provisioning_parameters.json`, and all other setup scripts. Assume that you have prepared the files in a local folder as follows.

   ```
   └── lifecycle_files               // your local folder
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ...                       // more setup scripts to be fed into lifecycle_script.py
   ```

   To upload the files, use the S3 command as follows.

   ```shell
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```
What particular configurations HyperPod manages in Slurm configuration files
When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the `slurm.conf` and `gres.conf` files at `/opt/slurm/etc/` to manage the Slurm cluster, based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites.
Important
We strongly recommend that you do not change these parameters managed by HyperPod.
- In `slurm.conf`, HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the Auto-resume functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters to be set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
- In `gres.conf`, HyperPod manages `NodeName` for GPU nodes.
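As a precaution, a custom lifecycle script that appends settings to `slurm.conf` can first check that it is not about to touch one of these managed parameters. The following helper is an illustrative sketch (an assumption for this guide, not something HyperPod provides).

```python
# Parameters that the HyperPod agent manages and overwrites, per the list above.
HYPERPOD_MANAGED_PARAMS = {
    "ClusterName", "SlurmctldHost", "PartitionName", "NodeName",  # slurm.conf basics
    "TaskPlugin", "SchedulerParameters",                          # required for Auto-resume
}

def conflicting_params(custom_conf_text):
    """Return the HyperPod-managed parameter names found in a slurm.conf snippet."""
    found = set()
    for line in custom_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key = line.split("=", 1)[0].strip()
        if key in HYPERPOD_MANAGED_PARAMS:
            found.add(key)
    return found
```

A script could call this on its custom snippet and refuse to append any line whose key comes back in the result.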
Mount Amazon FSx for Lustre to your HyperPod cluster
To mount an Amazon FSx for Lustre shared file system to your HyperPod cluster, set up the following.
1. Use your Amazon VPC.

   - For HyperPod cluster instances to communicate within your VPC, make sure that you attach the (Optional) Additional permissions for using SageMaker HyperPod with Amazon Virtual Private Cloud to the IAM role for SageMaker HyperPod.

   - In `create_cluster.json`, include the following VPC information.

     ```json
     "VpcConfig": {
         "SecurityGroupIds": [
             "string"
         ],
         "Subnets": [
             "string"
         ]
     }
     ```

     For more tips about setting up Amazon VPC, see (Optional) Set up SageMaker HyperPod with your Amazon VPC.

2. To finish configuring Slurm with Amazon FSx for Lustre, specify the Amazon FSx DNS name and Amazon FSx mount name in `provisioning_parameters.json`, as shown in the figure in the Start with base lifecycle scripts provided by HyperPod section. You can find the Amazon FSx information either from the Amazon FSx for Lustre console in your account or by running the following AWS CLI command: `aws fsx describe-file-systems`.

   ```json
   "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
   "fsx_mountname": "1abcdefg"
   ```
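To see how the two values fit together: Lustre clients mount a file system using a source string of the form `<DNS name>@tcp:/<mount name>`. The helper below is a small illustrative sketch; the base `mount_fsx.sh` script performs the equivalent step in shell.

```python
def fsx_mount_source(fsx_dns_name, fsx_mountname):
    """Compose the Lustre mount source from the two provisioning_parameters.json fields."""
    return f"{fsx_dns_name}@tcp:/{fsx_mountname}"

# The resulting string is what you would pass to a mount command such as:
#   sudo mount -t lustre -o relatime,flock <source> /fsx
```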
Validate the JSON configuration files before creating a Slurm cluster on HyperPod
To validate the JSON configuration files before submitting a cluster creation request, use the configuration validation script `validate-config.py`. After preparing the `create_cluster.json` and `provisioning_parameters.json` files from the Start with base lifecycle scripts provided by HyperPod section, run the validation script as follows.

```shell
python3 validate-config.py --cluster-config create_cluster.json --provisioning-parameters provisioning_parameters.json
```
The following is an example output of a successful validation.
```
✔️  Validated instance group name worker-group-1 is correct ...
✔️  Validated subnet subnet-012345abcdef67890 ...
✔️  Validated security group sg-012345abcdef67890 ingress rules ...
✔️  Validated security group sg-012345abcdef67890 egress rules ...
✔️  Validated FSx Lustre DNS name fs-012345abcdef67890.fsx.us-east-1.amazonaws.com
✔️  Validated FSx Lustre mount name abcdefgh
✅ Cluster Validation succeeded
```
Validate runtime before running production workloads on a Slurm cluster on HyperPod
To check the runtime before running any production workloads on a Slurm cluster on HyperPod, use the runtime validation script `hyperpod-precheck.py`. To run the script on multiple nodes at once, use `srun`, as shown in the following example command that runs the script on a Slurm cluster of 8 nodes.

```shell
# The following command runs on 8 nodes
srun -N 8 python3 hyperpod-precheck.py
```
Note

To learn more about the validation script, such as what runtime validation functions the script provides and guidelines to resolve issues that don't pass the validations, see Runtime validation before running workloads.
Develop lifecycle scripts interactively on a cluster node
This section explains how you can interactively develop lifecycle scripts without repeatedly creating and deleting a HyperPod cluster.
1. Create a HyperPod cluster with the base lifecycle scripts.

2. Log in to a cluster node.

3. Develop a script (`configure_xyz.sh`) by editing and running it repeatedly on the node.

   - HyperPod runs the lifecycle scripts as the root user, so we recommend that you run `configure_xyz.sh` as the root user while developing it, to make sure that the script is tested under the same conditions as when it is run by HyperPod.

4. Integrate the script into `lifecycle_script.py` by adding a code line similar to the following.

   ```python
   ExecuteBashScript("./utils/configure_xyz.sh").run()
   ```

5. Upload the updated lifecycle scripts to the S3 bucket that you initially used for uploading the base lifecycle scripts.

6. Test the integrated version of `lifecycle_script.py` by creating a new HyperPod cluster.
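For context on what the integration line in Step 4 does, a helper like `ExecuteBashScript` can be sketched as a thin `subprocess` wrapper, as below. This is a hypothetical, simplified stand-in; the actual helper in the Awsome Distributed Training repository may differ in its details.

```python
import subprocess

class ExecuteBashScript:
    """Illustrative stand-in: run a setup script with bash, forwarding arguments."""

    def __init__(self, script_path):
        self.script_path = script_path

    def run(self, *args):
        cmd = ["bash", self.script_path] + [str(a) for a in args]
        print(f"Executing: {' '.join(cmd)}")
        # check=True stops the lifecycle flow on the first failing script
        return subprocess.run(cmd, check=True).returncode
```

Chaining `.run(node_type)` calls in `lifecycle_script.py`, as the base scripts do, then executes each setup script in order and aborts provisioning if any of them fails.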
Update a cluster with new or updated lifecycle scripts
There are three ways to run new or updated lifecycle scripts on a HyperPod cluster.
- The `UpdateClusterSoftware` API for patching the HyperPod software re-runs the lifecycle scripts on the entire instance group.

- The `UpdateCluster` API only runs the lifecycle scripts for new instance groups.

- You can also run lifecycle scripts directly on the HyperPod instances.
Considerations
Consider the following when using SageMaker HyperPod.
- HyperPod runs the SageMaker HyperPod DLAMI on each instance of a cluster, and the AMI has pre-installed software packages that are verified to be compatible with each other and with HyperPod functionalities. Note that if you reinstall any of the pre-installed packages, you are responsible for installing compatible packages, and some HyperPod functionalities might not work as expected.