Creating a SageMaker HyperPod cluster

Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the Amazon CLI.

  1. Before creating a SageMaker HyperPod cluster:

    1. Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see Create an Amazon EKS cluster in the Amazon EKS User Guide.

    2. Install the Helm chart as instructed in Installing packages on the Amazon EKS cluster using Helm.

  2. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/.

    For a quick start, download the sample script on_create.sh from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. This script sets up the log file /var/log/provision/provisioning.log, which CloudWatch requires to gather logs from Pod containers. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.

    Important

    If you create the IAM role for SageMaker HyperPod with only the managed AmazonSageMakerClusterInstanceRolePolicy attached, your cluster can access only Amazon S3 buckets whose names begin with the prefix sagemaker-.
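As a rough illustration of this step, the following Python sketch uploads a lifecycle script through a boto3 S3 client and checks whether a bucket name carries the sagemaker- prefix. The helper names and the bucket and prefix values are hypothetical. The only library call assumed is the S3 client's upload_file method, and the client is passed in as an argument so that nothing touches your account until you call the helper yourself.

```python
import os
import posixpath


def bucket_allowed_by_managed_policy(bucket):
    """Return True if the bucket name starts with the 'sagemaker-' prefix.

    Assumption: the role carries only AmazonSageMakerClusterInstanceRolePolicy,
    so buckets without this prefix would need extra permissions.
    """
    return bucket.startswith("sagemaker-")


def upload_lifecycle_script(s3_client, local_path, bucket, prefix):
    """Upload one lifecycle script and return its S3 URI.

    s3_client is expected to behave like boto3.client("s3").
    """
    key = posixpath.join(prefix.strip("/"), os.path.basename(local_path))
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Hypothetical usage (requires boto3 and valid credentials):
#   import boto3
#   upload_lifecycle_script(
#       boto3.client("s3"),
#       "on_create.sh",
#       "sagemaker-example-bucket",        # placeholder bucket name
#       "Lifecycle-scripts/base-config",   # placeholder prefix
#   )
```

Passing the client in keeps the helper easy to try against a stub before it is pointed at a real bucket.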

  3. Prepare a CreateCluster API request file in JSON format. For ExecutionRole, provide the ARN of the IAM role you created with the managed AmazonSageMakerClusterInstanceRolePolicy from the section IAM role for SageMaker HyperPod.

    Note

    Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.

    // create_cluster.json
    {
      "ClusterName": "string",
      "InstanceGroups": [{
        "InstanceGroupName": "string",
        "InstanceType": "string",
        "InstanceCount": number,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/<lifecycle-script-directory>/src/",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "string",
        "ThreadsPerCore": number,
        "OnStartDeepHealthChecks": [
          "InstanceStress", "InstanceConnectivity"
        ]
      }],
      "VpcConfig": {
        "SecurityGroupIds": ["string"],
        "Subnets": ["string"]
      },
      "Tags": [{
        "Key": "string",
        "Value": "string"
      }],
      "Orchestrator": {
        "Eks": {
          "ClusterArn": "string"
        }
      },
      "NodeRecovery": "Automatic"
    }

    Note the following when configuring a new SageMaker HyperPod cluster to associate with an EKS cluster.

    • You can configure up to 20 instance groups under the InstanceGroups parameter.

    • For Orchestrator.Eks.ClusterArn, specify the ARN of the EKS cluster you want to use as the orchestrator.

    • For OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks.

    • For NodeRecovery, specify Automatic to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

    • For the Tags parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an Amazon resource. You can add tags to your cluster in the same way you add them in other Amazon services that support tagging. To learn more about tagging Amazon resources in general, see the Tagging Amazon Resources User Guide.

    • For the VpcConfig parameter, specify the subnets and security groups of the VPC used by the EKS cluster. The subnets must be private.
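The configuration notes above can be sketched in Python as a small builder that assembles the request body and enforces the 20-group limit before the file is written. This is only an illustration: the function names are hypothetical, and every ARN, ID, bucket name, and instance type in the usage example is a placeholder to replace with your own values.

```python
import json

MAX_INSTANCE_GROUPS = 20  # limit stated in the configuration notes


def make_instance_group(name, instance_type, instance_count, source_s3_uri,
                        execution_role_arn, threads_per_core):
    """Return one InstanceGroups entry with both deep health checks enabled."""
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "LifeCycleConfig": {"SourceS3Uri": source_s3_uri, "OnCreate": "on_create.sh"},
        "ExecutionRole": execution_role_arn,
        "ThreadsPerCore": threads_per_core,
        "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
    }


def build_create_cluster_request(cluster_name, instance_groups, eks_cluster_arn,
                                 security_group_ids, subnet_ids, tags=None):
    """Assemble the CreateCluster request body as a plain dict."""
    groups = list(instance_groups)
    if not 1 <= len(groups) <= MAX_INSTANCE_GROUPS:
        raise ValueError(
            f"expected 1 to {MAX_INSTANCE_GROUPS} instance groups, got {len(groups)}"
        )
    if not eks_cluster_arn:
        raise ValueError("Orchestrator.Eks.ClusterArn is required")
    request = {
        "ClusterName": cluster_name,
        "InstanceGroups": groups,
        "VpcConfig": {
            "SecurityGroupIds": list(security_group_ids),
            "Subnets": list(subnet_ids),  # must be private subnets
        },
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "NodeRecovery": "Automatic",
    }
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


# Usage with placeholder values only; none of these resources exist.
example_request = build_create_cluster_request(
    cluster_name="example-hyperpod-cluster",
    instance_groups=[make_instance_group(
        "worker-group-1", "ml.g5.8xlarge", 2,
        "s3://sagemaker-example-bucket/lifecycle/src/",
        "arn:aws:iam::111122223333:role/ExampleHyperPodRole", 1,
    )],
    eks_cluster_arn="arn:aws:eks:us-west-2:111122223333:cluster/example-eks",
    security_group_ids=["sg-0123456789abcdef0"],
    subnet_ids=["subnet-0123456789abcdef0"],
    tags={"team": "ml-platform"},
)
with open("create_cluster.json", "w", encoding="utf-8") as f:
    json.dump(example_request, f, indent=2)
```

Generating the file from code avoids hand-editing mistakes such as trailing commas, which would make the JSON invalid.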

  4. Run the create-cluster command as follows.

    Important

    When running the create-cluster command with the --cli-input-json parameter, you must include the file:// prefix before the complete path to the JSON file. The prefix tells the Amazon CLI to treat the value as a file path. Omitting the file:// prefix results in a parameter parsing error.

    aws sagemaker create-cluster \
        --cli-input-json file://complete/path/to/create_cluster.json

    This should return the ARN of the new cluster.
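If you drive the CLI from a script, the same call can be sketched in Python as shown below. The helper names are hypothetical. The sketch assumes that the aws executable is on your PATH, that credentials are already configured, and that the response is a JSON object with a ClusterArn field. It adds the file:// prefix for you and requests JSON output explicitly so that the response can be parsed.

```python
import json
import os
import subprocess


def cli_input_json_arg(json_path):
    """Return the value for --cli-input-json, including the file:// prefix."""
    return "file://" + os.path.abspath(json_path)


def create_cluster_command(json_path):
    """Build the argument list for the create-cluster call."""
    return [
        "aws", "sagemaker", "create-cluster",
        "--cli-input-json", cli_input_json_arg(json_path),
        "--output", "json",
    ]


def parse_cluster_arn(cli_stdout):
    """Extract the cluster ARN from the JSON printed by the CLI."""
    return json.loads(cli_stdout)["ClusterArn"]


def create_cluster(json_path):
    """Run the CLI and return the new cluster's ARN.

    Raises subprocess.CalledProcessError if the CLI exits with an error.
    """
    result = subprocess.run(
        create_cluster_command(json_path),
        check=True, capture_output=True, text=True,
    )
    return parse_cluster_arn(result.stdout)
```

Building the argument as a list rather than a shell string sidesteps quoting problems when the path contains spaces.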