Cluster management with custom AMIs - Amazon SageMaker AI
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Cluster management with custom AMIs

After the custom AMI is built, you can use it for creating or updating an Amazon SageMaker HyperPod cluster. You can also scale up or add instance groups that use the new AMI.

Permissions required for cluster operations

Add the following permissions to the cluster admin user who operates and configures SageMaker HyperPod clusters. The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters with custom AMI.

Note that AMI and AMI EBS snapshot sharing permissions are included through ModifyImageAttribute and ModifySnapshotAttribute API permissions as part of the following policy. For scoping down the sharing permissions, you can take the following steps:

  • Add tags to control the AMI sharing permissions to AMI and AMI snapshot. For example, you can tag the AMI with AllowSharing as true.

  • Add the context key in the policy to only allow AMI sharing for AMIs tagged with certain tags.

The following policy is a scoped down policy to ensure only AMIs tagged with AllowSharing as true are allowed.

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "iam:PassRole", "Resource": "arn:aws:iam::111122223333:role/your-execution-role-name" }, { "Effect": "Allow", "Action": [ "sagemaker:CreateCluster", "sagemaker:DeleteCluster", "sagemaker:DescribeCluster", "sagemaker:DescribeCluterNode", "sagemaker:ListClusterNodes", "sagemaker:ListClusters", "sagemaker:UpdateCluster", "sagemaker:UpdateClusterSoftware", "sagemaker:BatchDeleteClusterNodes", "eks:DescribeCluster", "eks:CreateAccessEntry", "eks:DescribeAccessEntry", "eks:DeleteAccessEntry", "eks:AssociateAccessPolicy", "iam:CreateServiceLinkedRole", "ec2:DescribeImages", "ec2:DescribeSnapshots" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "ec2:ModifyImageAttribute", "ec2:ModifySnapshotAttribute" ], "Resource": "*", "Condition": { "StringEquals": { "ec2:ResourceTag/AllowSharing": "true" } } } ] }

Create a cluster

You can specify your custom AMI in the ImageId field for the CreateCluster operation.

The following example shows how to create a cluster with a custom AMI:

aws sagemaker create-cluster \ --cluster-name <exampleClusterName> \ --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \ --node-provisioning-mode Continuous \ --instance-groups '{ "InstanceGroupName": "<exampleGroupName>", "InstanceType": "ml.c5.2xlarge", "InstanceCount": 2, "LifeCycleConfig": { "SourceS3Uri": "<s3://amzn-s3-demo-bucket>", "OnCreate": "on_create_noop.sh" }, "ImageId": "<your_custom_ami>", "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>", "ThreadsPerCore": 1, "InstanceStorageConfigs": [ { "EbsVolumeConfig": { "VolumeSizeInGB": 200 } } ] }' --vpc-config '{ "SecurityGroupIds": ["<security_group>"], "Subnets": ["<subnet>"] }'

Update the cluster software

If you want to update an existing instance group on your cluster with your custom AMI, you can use the UpdateClusterSoftware operation and specify your custom AMI in the ImageId field. Note that unless you specify the name of a specific instance group in your request, then the new image is applied to all of the instance groups in your cluster.

The following example shows how to update a cluster's platform software with a custom AMI:

aws sagemaker update-cluster-software \ --cluster-name <exampleClusterName> \ --instance-groups <instanceGroupToUpdate> \ --image-id <customAmiId>

Scale up an instance group

The following example shows how to scale up an instance group for a cluster using a custom AMI:

aws sagemaker update-cluster \ --cluster-name <exampleClusterName> --instance-groups '[{ "InstanceGroupName": "<exampleGroupName>", "InstanceType": "ml.c5.2xlarge", "InstanceCount": 2, "LifeCycleConfig": { "SourceS3Uri": "<s3://amzn-s3-demo-bucket>", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>", "ThreadsPerCore": 1, "ImageId": "<your_custom_ami>" }]'

Add an instance group

The following example shows how to add an instance group to a cluster using a custom AMI:

aws sagemaker update-cluster \ --cluster-name "<exampleClusterName>" \ --instance-groups '{ "InstanceGroupName": "<exampleGroupName>", "InstanceType": "ml.c5.2xlarge", "InstanceCount": 2, "LifeCycleConfig": { "SourceS3Uri": "<s3://amzn-s3-demo-bucket>", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>", "ThreadsPerCore": 1, "ImageId": "<your_custom_ami>" }' '{ "InstanceGroupName": "<exampleGroupName2>", "InstanceType": "ml.c5.2xlarge", "InstanceCount": 1, "LifeCycleConfig": { "SourceS3Uri": "<s3://amzn-s3-demo-bucket>", "OnCreate": "on_create_noop.sh" }, "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>", "ThreadsPerCore": 1, "ImageId": "<your_custom_ami>" }'