Automatic node recovery and auto-resume
Note
As of September 11, 2025, HyperPod with Slurm orchestration supports health-monitoring agents. To use this functionality, run UpdateClusterSoftware to update your cluster to the latest version of the AMI.
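If you use the AWS CLI, the update can be triggered with the update-cluster-software command; the cluster name below is a placeholder.

```bash
# Patch the cluster to the latest SageMaker-provided software/AMI version.
# "my-hyperpod-cluster" is a placeholder; substitute your cluster name.
aws sagemaker update-cluster-software --cluster-name my-hyperpod-cluster
```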
This section describes Amazon SageMaker HyperPod's two complementary resilience features: automatic node recovery, which replaces faulty infrastructure without manual intervention, and auto-resume, which restarts training jobs from the last checkpoint after hardware failures.
How automatic node recovery works
During cluster creation or update, cluster admin users can set the node (instance) recovery option at the cluster level to either Automatic (Recommended) or None. If set to Automatic, SageMaker HyperPod reboots or replaces faulty nodes automatically.
Important
We recommend the Automatic option. By default, clusters are set up with automatic node recovery.
Automatic node recovery runs when issues are detected by the health-monitoring agent, basic health checks, or deep health checks. If set to None, the health-monitoring agent labels instances when a fault is detected, but it does not automatically initiate any repair or recovery actions on the affected nodes. We do not recommend this option.
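If you manage clusters with the AWS CLI, the recovery mode corresponds to the NodeRecovery setting. The following is a minimal sketch of enabling it at cluster creation; the cluster name and the instance-groups file are placeholders.

```bash
# Create a HyperPod cluster with automatic node recovery enabled.
# The cluster name and the instance-groups definition file are placeholders.
aws sagemaker create-cluster \
    --cluster-name my-hyperpod-cluster \
    --instance-groups file://instance-groups.json \
    --node-recovery Automatic
```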
Running a training job with the Amazon SageMaker HyperPod auto-resume functionality
This section describes how to run a training job with the SageMaker HyperPod auto-resume functionality, which provides a zero-touch resiliency infrastructure to automatically recover a training job from the last saved checkpoint in the event of a hardware failure.
With the auto-resume functionality, if a job fails due to a hardware failure or a transient issue during training, SageMaker HyperPod auto-resume starts the node replacement workflow and restarts the job after the faulty nodes are replaced. The following hardware checks are run whenever a job fails while using auto-resume:
| Category | Utility name | Instance type compatibility | Description |
|---|---|---|---|
| Accelerator | NVIDIA SMI | GPU | Uses the nvidia-smi utility to determine the health of the instance. |
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from Neuron sysfs. |
| Network | EFA | GPU and Trainium | To aid in the diagnosis of Elastic Fabric Adapter (EFA) devices, the EFA health checker runs a series of connectivity tests using all available EFA cards within the instance. |
Note
When Generic Resources (GRES)
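For a rough sense of what the accelerator check looks for, you can query GPU health yourself with nvidia-smi. This is only an illustrative sketch, not the health-monitoring agent's actual implementation.

```bash
# Illustrative only: verify that the GPUs on this node respond to nvidia-smi.
# A non-zero exit code usually indicates an unhealthy accelerator or a driver issue.
if ! nvidia-smi --query-gpu=index,name,temperature.gpu,ecc.errors.uncorrected.volatile.total \
        --format=csv,noheader; then
    echo "nvidia-smi failed; this node is likely unhealthy" >&2
    exit 1
fi
```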
Using the SageMaker HyperPod auto-resume functionality with Slurm
When you use SageMaker HyperPod auto-resume with Slurm, you should run the job inside an exclusive allocation acquired by using either salloc or sbatch. In either case, you need to modify the entrypoint script to make sure that all setup steps run in a single srun command when resuming the job. Through the entrypoint script, it is important to set up the environment on the replaced node to be consistent with the environment the job step was running in before it was stopped. The following procedure shows how to prepare an entrypoint script that keeps the environment consistent and runs it as a single srun command.
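To make that requirement concrete, the following sketch contrasts the two patterns; the script names are placeholders, and the full entrypoint example appears in the procedure below.

```bash
# Avoid launching setup and training as separate job steps; a replaced node
# would miss the environment setup performed in the earlier step.
#   srun ./setup_env.sh
#   srun python train.py

# Prefer a single job step whose entrypoint performs setup and then training.
srun ./train_auto_resume.sh
```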
Tip
If you use sbatch, you can keep the batch script simple by creating a separate script for setting up the environment and using a single srun command.
-
Create a script using the following code example and save it as train_auto_resume.sh. This script deploys the training environment setup, assuming that no manual configuration was previously made to the replaced node. This ensures that the environment is node-agnostic, so that when a node is replaced, the same environment is provisioned on the node before resuming the job.
Note
The following code example shows how to discover the Slurm node list associated with the job. Do not use the $SLURM_JOB_NODELIST environment variable provided by Slurm, because its value might be outdated after SageMaker HyperPod auto-resumes the job. The code example defines a new NODE_LIST variable to replace SLURM_JOB_NODELIST, and then sets up the MASTER_NODE and MASTER_ADDR variables based on the NODE_LIST variable.

```bash
#!/bin/bash
# Filename: train_auto_resume.sh
# Sample containerized script to launch a training job with a single srun
# which can be auto-resumed.

# Place your training environment setup here.
# Example: Install conda, docker, activate virtual env, etc.

# Get the list of nodes for a given job:
# show details of the Slurm job, extract the NodeList field,
# and exclude the ExcNodeList line.
NODE_LIST=$(scontrol show jobid=$SLURM_JOBID | \
    awk -F= '/NodeList=/{print $2}' | \
    grep -v Exc)

# Determine the master node: convert the node list to hostnames
# and select the first hostname as the master node.
MASTER_NODE=$(scontrol show hostname $NODE_LIST | head -n 1)

# Get the master node address: show the node information,
# extract NodeAddr, and keep its first part.
MASTER_ADDR=$(scontrol show node=$MASTER_NODE | \
    awk -F= '/NodeAddr=/{print $2}' | \
    awk '{print $1}')

# Torchrun command to launch the training job
torchrun_cmd="torchrun --nnodes=$SLURM_NNODES \
    --nproc_per_node=1 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=1234 \
    <your_training_script.py>"

# Execute the torchrun command in the 'pytorch' Conda environment,
# streaming output live
/opt/conda/bin/conda run --live-stream -n pytorch $torchrun_cmd
```
Tip
You can use the preceding script to add more commands for installing any additional dependencies for your job. However, we recommend that you keep the dependency installation scripts to the set of lifecycle scripts that are used during cluster creation. If you use a virtual environment hosted on a shared directory, you can also utilize this script to activate the virtual environment.
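For example, if your virtual environment lives on a shared file system, the activation step inside the entrypoint script might look like the following sketch; the /fsx path is an assumption, so substitute your own shared mount point.

```bash
# Hypothetical: activate a virtual environment on a shared file system so that
# a replaced node picks up exactly the same Python environment.
source /fsx/envs/training-venv/bin/activate
python -c "import torch; print(torch.__version__)"  # quick sanity check
```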
-
Launch the job with SageMaker HyperPod auto-resume enabled by adding the --auto-resume=1 flag to indicate that the srun command should be automatically retried in case of hardware failure.
Note
If you have set up a resource allocation using sbatch or salloc, you can run multiple srun commands within the allocation. In the event of a failure, the SageMaker HyperPod auto-resume functionality only operates in the current job step of the srun command with the --auto-resume=1 flag. In other words, activating auto-resume in an srun command doesn't apply to other srun commands launched within a resource allocation session.
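To illustrate the scope of the flag, here is a hedged sketch of a batch script with two job steps, where only the second step is retried after a hardware failure; the script names are placeholders.

```bash
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --exclusive

# This job step is NOT retried by auto-resume if it fails.
srun ./preprocess_data.sh

# Only this job step is retried after the faulty nodes are replaced,
# because it carries the --auto-resume=1 flag.
srun --auto-resume=1 ./train_auto_resume.sh
```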
The following are srun command examples with auto-resume enabled.
Using sbatch
Because most of the logic for setting up the environment is already in train_auto_resume.sh, the batch script should be simple and similar to the following code example. Assume that the following batch script is saved as batch.sh.

```bash
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --exclusive
srun --auto-resume=1 train_auto_resume.sh
```
Run the preceding batch script using the following command.

```bash
sbatch batch.sh
```
Using salloc
Start by acquiring an exclusive allocation, and run the srun command with the --auto-resume flag and the entrypoint script.

```bash
salloc -N 2 --exclusive
srun --auto-resume=1 train_auto_resume.sh
```
How automatic node recovery and auto-resume work together
When both automatic node recovery and auto-resume are active, they follow a coordinated approach to handling failures. If the health-monitoring agent (HMA) detects a hardware fault, the node is marked for drain regardless of job-level status. With automatic node recovery enabled, the nodes are automatically replaced once all jobs running on them exit. For jobs with auto-resume enabled, if the job step exits with a non-zero status, auto-resume kicks in and the job resumes once the nodes are replaced. Jobs without auto-resume enabled simply exit and require manual resubmission by administrators or users.
Note
If you use auto-resume, the nodes are always replaced (no reboots) when hardware failures are detected.
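One way to observe this coordination from the Slurm side is to watch node and job state with standard Slurm commands; this is optional and only for monitoring.

```bash
# List nodes that are down, drained, or draining, with the recorded reason.
sinfo -R

# Check whether your job was requeued and is waiting for node replacement.
squeue -u "$USER" -o "%.18i %.9P %.20j %.8T %.10M %R"
```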