Manually quarantine, replace, or reboot a node - Amazon SageMaker AI

Manually quarantine, replace, or reboot a node

Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.

To quarantine a node and force delete a training pod

kubectl cordon <node-name>

After quarantining the node, force delete the pod. This is useful when a pod has been stuck in termination for more than 30 minutes, or when kubectl describe pod shows 'Node is not ready' in Events.

kubectl delete pods <pod-name> --grace-period=0 --force
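The two steps above can be wrapped in small shell functions, shown here as a minimal sketch rather than an official procedure. The function names quarantine_node and force_delete_pod are hypothetical, and the default namespace is an assumption. Listing the pods still bound to the node uses the standard spec.nodeName field selector.

```shell
# Hypothetical helper: cordon a node, then list the pods still bound to it.
quarantine_node() {
  local node="$1"
  # Mark the node unschedulable so that no new pods are placed on it.
  kubectl cordon "$node" || return 1
  # Show pods in every namespace that are still assigned to this node.
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=${node}" -o wide
}

# Hypothetical helper: force delete a single pod.
# The namespace defaults to "default" (an assumption; pass your own as $2).
force_delete_pod() {
  local pod="$1" namespace="${2:-default}"
  # --grace-period=0 --force removes the pod object without waiting for the
  # kubelet to confirm termination, so use it only for pods that are stuck.
  kubectl delete pod "$pod" --namespace "$namespace" --grace-period=0 --force
}
```

For example, quarantine_node <node-name> followed by force_delete_pod <pod-name> <namespace> reproduces the commands in this section, with a pod listing in between to help identify which pods are stuck.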

To replace a node

Label the node to replace with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement, which triggers SageMaker HyperPod automatic node recovery. The label takes effect only if automatic node recovery was activated when the cluster was created or updated.

kubectl label nodes <node-name> sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
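Because the label only has an effect when automatic node recovery is active, it can help to check the cluster setting before labeling. The following is a sketch, assuming the AWS CLI is configured for the cluster's account and Region and that the DescribeCluster response exposes the setting in a NodeRecovery field. The function name node_recovery_mode is hypothetical.

```shell
# Hypothetical helper: print the node recovery setting of a HyperPod cluster.
# Assumes the DescribeCluster response contains a NodeRecovery field.
node_recovery_mode() {
  local cluster="$1"
  aws sagemaker describe-cluster \
    --cluster-name "$cluster" \
    --query 'NodeRecovery' \
    --output text
}
```

If the printed value does not indicate automatic recovery, the replacement and reboot labels are not expected to trigger any action until the cluster is updated.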

To reboot a node

Label the node to reboot with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot, which triggers SageMaker HyperPod automatic node recovery. The label takes effect only if automatic node recovery was activated when the cluster was created or updated.

kubectl label nodes <node-name> sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot

After the UnschedulablePendingReplacement or UnschedulablePendingReboot label is applied, you should see the node terminated or rebooted within a few minutes.
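The labeling step and a follow-up check can be combined in one function, sketched below. The function name mark_node is hypothetical. The --overwrite flag is an assumption added for the case where the node already carries a value for this label; drop it if you prefer kubectl to fail rather than replace an existing value.

```shell
# Hypothetical helper: apply a HyperPod health-status label and show the result.
mark_node() {
  local node="$1" status="$2"
  # Accept only the two label values described in this document.
  case "$status" in
    UnschedulablePendingReplacement|UnschedulablePendingReboot) ;;
    *) echo "unsupported status: $status" >&2; return 1 ;;
  esac
  # --overwrite (an assumption) replaces an existing value of the label.
  kubectl label nodes "$node" \
    "sagemaker.amazonaws.com/node-health-status=${status}" --overwrite || return 1
  # -L prints the label value as an extra column next to the node status.
  kubectl get nodes "$node" -L sagemaker.amazonaws.com/node-health-status
}
```

For example, mark_node <node-name> UnschedulablePendingReboot applies the reboot label. Running kubectl get nodes --watch afterwards lets you follow the node as it is rebooted or replaced.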