SageMaker HyperPod prerequisites - Amazon SageMaker

SageMaker HyperPod prerequisites

The following sections walk you through prerequisites you need to prepare before you get started with SageMaker HyperPod.

SageMaker HyperPod quotas

You can create SageMaker HyperPod clusters up to the cluster usage quotas that apply to your Amazon account.

Important

To learn more about SageMaker HyperPod pricing, see SageMaker HyperPod pricing and Amazon SageMaker Pricing.

To view Amazon SageMaker HyperPod quotas using the Amazon Management Console

Look up the default and applied values of the quotas (also referred to as limits) for cluster usage, which apply to SageMaker HyperPod.

  1. Open the Service Quotas console.

  2. In the left navigation pane, choose Amazon services.

  3. From the Amazon services list, search for and select Amazon SageMaker.

  4. In the Service quotas list, you can see the service quota name, applied value (if it's available), Amazon default quota, and whether the quota value is adjustable.

  5. In the search bar, type cluster usage. This shows quotas for cluster usage, applied quotas, and the default quotas.
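
The same lookup can be sketched programmatically. The following is a minimal example, assuming that boto3 is installed and that credentials with Service Quotas read access are configured. It mirrors the console search by keeping only quotas whose name contains "cluster usage". The function names are hypothetical.

```python
def filter_cluster_usage_quotas(quotas):
    """Keep only quotas whose name mentions 'cluster usage' (case-insensitive)."""
    return [q for q in quotas if "cluster usage" in q.get("QuotaName", "").lower()]


def list_cluster_usage_quotas(region_name):
    """Fetch the applied SageMaker quotas and return the cluster usage ones."""
    import boto3  # imported lazily so the filter above can be used without boto3

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return filter_cluster_usage_quotas(quotas)
```

Each returned entry is a dictionary that includes fields such as QuotaName, QuotaCode, Value, and Adjustable, which correspond to the columns shown in the console.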

To increase Amazon SageMaker HyperPod quotas using the Amazon Management Console

Increase your quotas at the account or resource level.

  1. To increase the quota of instances for cluster usage, select the quota that you want to increase.

  2. If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the Adjustability column.

  3. For Increase quota value, enter the new value. The new value must be greater than the current value.

  4. Choose Request.

  5. To view any pending or recently resolved requests in the console, navigate to the Request history tab from the service's details page, or choose Dashboard from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is Pending. After the status changes to Quota requested, you see the case number with Amazon Web Services Support. Choose the case number to open the ticket for your request.

To learn more about requesting a quota increase in general, see Requesting a Quota Increase in the Amazon Service Quotas User Guide.
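
As an alternative to the console steps, a quota increase can also be requested through the Service Quotas API. The following is a minimal sketch, assuming boto3 and suitable credentials. The quota code is a placeholder that you would look up first, and the function names are hypothetical. The helper enforces the same rule as the console: the new value must be greater than the current value.

```python
def build_quota_increase_request(quota_code, current_value, desired_value):
    """Build request parameters, rejecting values that are not an increase."""
    if desired_value <= current_value:
        raise ValueError("The new quota value must be greater than the current value.")
    return {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota_code,
        "DesiredValue": float(desired_value),
    }


def request_quota_increase(region_name, quota_code, current_value, desired_value):
    """Submit the increase request and return its initial status."""
    import boto3  # imported lazily so the builder above can be used without boto3

    client = boto3.client("service-quotas", region_name=region_name)
    params = build_quota_increase_request(quota_code, current_value, desired_value)
    response = client.request_service_quota_increase(**params)
    return response["RequestedQuota"]["Status"]
```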

Set up IAM users and roles for SageMaker HyperPod users and resources

Important

Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see Provide permissions for tagging SageMaker resources.

Amazon Managed Policies for Amazon SageMaker that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

There are three main layers of SageMaker HyperPod users: Amazon account admins, cluster administrators (such as cloud architects), and cluster users (such as machine learning scientists). The Amazon account admin should set up IAM users for cluster administrators by attaching the right permissions or policies. The Amazon account admin should also create IAM roles that cluster administrators can provide to SageMaker HyperPod clusters. The clusters assume these roles to run and to communicate with necessary Amazon resources, such as Amazon S3, Amazon CloudWatch, and Amazon Systems Manager (SSM). Finally, cluster administrators can grant cluster users permissions to log in to the SageMaker HyperPod clusters through SSM Agent.

Set up IAM users for cluster administrators

Cluster administrators are cloud architects who operate and configure SageMaker HyperPod clusters, performing the tasks in Operate SageMaker HyperPod. The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage any cluster within your Amazon account.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateCluster",
        "sagemaker:ListClusters"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:DeleteCluster",
        "sagemaker:DescribeCluster",
        "sagemaker:DescribeClusterNode",
        "sagemaker:ListClusterNodes",
        "sagemaker:UpdateCluster",
        "sagemaker:UpdateClusterSoftware"
      ],
      "Resource": "arn:aws:sagemaker:region:account-id:cluster/*"
    }
  ]
}
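
The region and account-id placeholders in the cluster ARN must be replaced with your own values. The following is a minimal sketch of generating the policy document with those values filled in. The function name and the partition parameter are assumptions for illustration (for example, the China Regions use a different ARN partition than the default shown here).

```python
import json


def cluster_admin_policy(region, account_id, partition="aws"):
    """Return the minimum cluster administrator policy as a dictionary."""
    cluster_arn = f"arn:{partition}:sagemaker:{region}:{account_id}:cluster/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Create and list actions are not scoped to a specific cluster.
                "Effect": "Allow",
                "Action": ["sagemaker:CreateCluster", "sagemaker:ListClusters"],
                "Resource": "*",
            },
            {
                # Management actions apply to any cluster in the account and Region.
                "Effect": "Allow",
                "Action": [
                    "sagemaker:DeleteCluster",
                    "sagemaker:DescribeCluster",
                    "sagemaker:DescribeClusterNode",
                    "sagemaker:ListClusterNodes",
                    "sagemaker:UpdateCluster",
                    "sagemaker:UpdateClusterSoftware",
                ],
                "Resource": cluster_arn,
            },
        ],
    }


def cluster_admin_policy_json(region, account_id, partition="aws"):
    """Serialize the policy for pasting into the IAM console or passing to an API."""
    return json.dumps(cluster_admin_policy(region, account_id, partition), indent=2)
```

For example, calling cluster_admin_policy_json("us-west-2", "111122223333") produces a document in which the second statement is scoped to clusters in that Region and account.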

To grant permissions to access the SageMaker console, use the sample policy provided at Permissions Required to Use the Amazon SageMaker Console.

To grant permissions to access the SSM console, use the sample policy provided at Using the Amazon Systems Manager console in the Amazon Systems Manager User Guide.

You might also consider attaching the AmazonSageMakerFullAccess policy to the IAM users; however, note that the AmazonSageMakerFullAccess policy grants permissions to all SageMaker API calls, features, and resources.

For guidance on IAM users in general, see IAM users in the Amazon Identity and Access Management User Guide.

Set up IAM users for cluster users

Cluster users are machine learning engineers who log in to SageMaker HyperPod cluster nodes provisioned by cluster administrators and run ML workloads on them. For cluster users in your Amazon account, you should grant the permission "ssm:StartSession" so that they can run the SSM start-session command. The following is a policy example for IAM users.

IAM permissions to all resources

Add the following policy to give an IAM user SSM session permissions to connect to an SSM target for all resources.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:StartSession",
        "ssm:TerminateSession"
      ],
      "Resource": "*"
    }
  ]
}
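
With this policy attached, a cluster user can start a session with the SSM start-session command. The following is a minimal sketch of assembling that command. The target format shown (sagemaker-cluster:, then the cluster ID, an underscore, the instance group name, a hyphen, and the instance ID) is an assumption that is not stated on this page, so verify it against the cluster access instructions before relying on it. The function names are hypothetical.

```python
def hyperpod_ssm_target(cluster_id, instance_group_name, instance_id):
    """Build an SSM target string for a cluster node (assumed format)."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group_name}-{instance_id}"


def start_session_command(target, region):
    """Return the AWS CLI arguments for starting an SSM session."""
    return ["aws", "ssm", "start-session", "--target", target, "--region", region]
```

The returned list can be passed to subprocess.run on a machine that has the CLI and the Session Manager plugin installed.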

IAM role for SageMaker HyperPod

For SageMaker HyperPod clusters to run and communicate with necessary Amazon resources, you need to create an IAM role that has the managed AmazonSageMakerClusterInstanceRolePolicy attached and provide that role to the cluster instance groups. The instance groups assume this role to communicate with Amazon CloudWatch, Amazon S3, and Amazon Systems Manager Agent (SSM Agent). This managed policy is the minimum requirement for SageMaker HyperPod resources to run properly, so you must provide an IAM role with this policy to all instance groups. The AmazonSageMakerClusterInstanceRolePolicy has the following permissions:

  • logs - Needed to allow SageMaker HyperPod to publish log streams.

  • cloudwatch - Needed to allow SageMaker HyperPod to post CloudWatch metrics.

  • s3 - Needed to allow SageMaker HyperPod to list and retrieve files from an Amazon S3 bucket in your account with the prefix sagemaker-.

  • ssmmessages - Needed to allow the SSM Agent to communicate with the SSM backend services. Principals can use SSM Agent for creating and opening control and data channels. SageMaker starts and manages the SSM Agent when it initiates a cluster instance.

Tip

Depending on how you want to design the level of permissions for multiple instance groups, you can also set up multiple IAM roles and attach them to different instance groups. When you, as an Amazon account admin or cluster administrator, set up cluster user access to specific cluster nodes through Amazon Systems Manager (see also Set up Amazon Systems Manager and Run As for cluster user access control), the cluster nodes assume the role with the selective permissions that you manually attached.

After you create the IAM roles, note their names and ARNs. You use the roles when creating a SageMaker HyperPod cluster, granting the correct permissions required for each instance group to communicate with necessary Amazon resources.
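
The role setup described above can be sketched with boto3 as follows. This is a minimal example under several assumptions: that the role trusts the sagemaker.amazonaws.com service principal, that the managed policy ARN follows the standard arn:aws:iam::aws:policy/ form, and that your credentials allow creating roles. Confirm the principal and ARN for your partition before use. The function names are hypothetical.

```python
import json


def sagemaker_trust_policy():
    """Trust policy that lets SageMaker assume the role (assumed principal)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_cluster_instance_role(role_name, partition="aws"):
    """Create the role, attach the managed policy, and return (name, ARN)."""
    import boto3  # imported lazily so the trust policy helper works without boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy()),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn=f"arn:{partition}:iam::aws:policy/AmazonSageMakerClusterInstanceRolePolicy",
    )
    return role["Role"]["RoleName"], role["Role"]["Arn"]
```

The returned name and ARN are the values to record for use when you create the cluster.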

(Optional) Additional permissions for using SageMaker HyperPod with Amazon Virtual Private Cloud

If you want to use your own Amazon Virtual Private Cloud (VPC) instead of the default SageMaker VPC, you should add the following additional permissions to the IAM role for SageMaker HyperPod.

{
  "Effect": "Allow",
  "Action": [
    "ec2:CreateNetworkInterface",
    "ec2:CreateNetworkInterfacePermission",
    "ec2:DeleteNetworkInterface",
    "ec2:DeleteNetworkInterfacePermission",
    "ec2:DescribeNetworkInterfaces",
    "ec2:DescribeVpcs",
    "ec2:DescribeDhcpOptions",
    "ec2:DescribeSubnets",
    "ec2:DescribeSecurityGroups",
    "ec2:DetachNetworkInterface"
  ],
  "Resource": "*"
},
{
  "Effect": "Allow",
  "Action": "ec2:CreateTags",
  "Resource": [
    "arn:aws:ec2:*:*:network-interface/*"
  ]
}

The following list breaks down which permissions are needed to enable SageMaker HyperPod cluster functionalities when you configure the cluster with your own Amazon VPC.

  • The following ec2 permissions are required to enable configuring a SageMaker HyperPod cluster with your VPC.

    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
    }
  • The following ec2 permission is required to enable the SageMaker HyperPod auto-resume functionality.

    {
      "Effect": "Allow",
      "Action": [
        "ec2:DetachNetworkInterface"
      ],
      "Resource": "*"
    }
  • The following ec2 permission allows SageMaker HyperPod to create tags on the network interfaces within your account.

    {
      "Effect": "Allow",
      "Action": "ec2:CreateTags",
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*"
      ]
    }
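
Because the statements above are fragments rather than a complete policy, they need to be placed inside the Statement array of a policy document. The following is a minimal sketch that appends them to an existing policy dictionary without modifying the original. The function name is hypothetical, and attaching the result as an inline policy on the role (for example, with the IAM put_role_policy call) is one possible approach, not the only one.

```python
import copy

# The two additional statements for using your own VPC, as listed above.
VPC_STATEMENTS = [
    {
        "Effect": "Allow",
        "Action": [
            "ec2:CreateNetworkInterface",
            "ec2:CreateNetworkInterfacePermission",
            "ec2:DeleteNetworkInterface",
            "ec2:DeleteNetworkInterfacePermission",
            "ec2:DescribeNetworkInterfaces",
            "ec2:DescribeVpcs",
            "ec2:DescribeDhcpOptions",
            "ec2:DescribeSubnets",
            "ec2:DescribeSecurityGroups",
            "ec2:DetachNetworkInterface",
        ],
        "Resource": "*",
    },
    {
        "Effect": "Allow",
        "Action": "ec2:CreateTags",
        "Resource": ["arn:aws:ec2:*:*:network-interface/*"],
    },
]


def add_vpc_statements(policy):
    """Return a copy of the policy with the VPC statements appended."""
    updated = copy.deepcopy(policy)
    updated.setdefault("Version", "2012-10-17")
    updated.setdefault("Statement", []).extend(copy.deepcopy(VPC_STATEMENTS))
    return updated
```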

Set up Amazon Systems Manager and Run As for cluster user access control

The SageMaker HyperPod DLAMI comes with Amazon Systems Manager (SSM) preinstalled to help you manage access to your SageMaker HyperPod cluster instance groups. This section describes how to create operating system (OS) users in your SageMaker HyperPod clusters and associate them with IAM users and roles. Doing so lets you authenticate SSM sessions using the credentials of the OS user account.

Enable Run As in your Amazon account

As an Amazon account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in SSM. With this feature, you can start each SSM session using the OS user associated with the IAM role or user.

To enable Run As in your Amazon account, follow the steps in Turn on Run As support for Linux and macOS managed nodes. If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in Option 2 of step 5 under To turn on Run As support for Linux and macOS managed nodes.
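
The tagging step can be sketched with boto3 as follows. This is a minimal example that assumes the tag key SSMSessionRunAs, which is the key described in the Systems Manager Run As instructions; confirm it there before use. The function names and the example OS user name are hypothetical.

```python
def run_as_tags(os_user_name):
    """Build the tag that maps an IAM principal to an OS user (assumed key)."""
    return [{"Key": "SSMSessionRunAs", "Value": os_user_name}]


def tag_role_for_run_as(role_name, os_user_name):
    """Tag an IAM role so that its SSM sessions start as the given OS user."""
    import boto3  # imported lazily so the tag builder works without boto3

    boto3.client("iam").tag_role(RoleName=role_name, Tags=run_as_tags(os_user_name))


def tag_user_for_run_as(user_name, os_user_name):
    """Tag an IAM user so that its SSM sessions start as the given OS user."""
    import boto3

    boto3.client("iam").tag_user(UserName=user_name, Tags=run_as_tags(os_user_name))
```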

Set up Linux users using an Amazon FSx file system attached to SageMaker HyperPod as a shared space

To finish setting up cluster users to access a HyperPod cluster through SSM and a shared space, you need to configure a script that adds users when you prepare the lifecycle configuration scripts for creating a HyperPod cluster. The GitHub repository introduced in the section Start with base lifecycle scripts provided by HyperPod contains a script named add_users.sh that reads user data from shared_users.txt. Note that you need to upload the two files when you prepare and upload lifecycle scripts to an S3 bucket, which is covered in the section Getting started with SageMaker HyperPod and the section Set up a multi-user environment through the Amazon FSx shared space.
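
The following is a minimal sketch of generating the contents of shared_users.txt. The line format used here (user name, numeric user ID, and home directory under the shared file system, separated by commas) is an assumption that this page does not specify, so check it against the sample file in the repository before use. The function name and the /fsx mount path are hypothetical.

```python
def render_shared_users(users, home_root="/fsx"):
    """Render (username, uid) pairs as comma-separated lines (assumed format)."""
    seen_uids = set()
    lines = []
    for username, uid in users:
        if uid in seen_uids:
            # Duplicate user IDs would make two OS users share file ownership.
            raise ValueError(f"Duplicate user ID: {uid}")
        seen_uids.add(uid)
        lines.append(f"{username},{uid},{home_root}/{username}")
    return "\n".join(lines) + "\n"
```

Using the same user ID for a given user on every node keeps file ownership consistent on the shared file system, which is why the sketch rejects duplicate IDs.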

(Optional) Set up SageMaker HyperPod with your Amazon VPC

If you don't provide a VPC, SageMaker HyperPod uses the default SageMaker VPC. To set up a SageMaker HyperPod cluster with your Amazon VPC, check the following items.

  • If you want to use your own VPC to connect SageMaker HyperPod with Amazon resources in your VPC, you need to provide the VPC name, ID, Amazon Web Services Region, subnet ID, and security group ID when you create SageMaker HyperPod. If you want to create a new VPC, see Create a default VPC or Create a VPC in the Amazon Virtual Private Cloud User Guide.

  • Create all your resources in the same Amazon Web Services Region and Availability Zone, and configure security group rules to allow connections between the resources in your VPC. For example, assume that you create a VPC in us-west-2. You should create a subnet in this VPC in Availability Zone us-west-2a, and create a security group that allows all incoming (inbound) traffic from inside the security group and all outbound traffic.

  • You also need to ensure that your VPC has a connection to Amazon Simple Storage Service (S3). If you configure a VPC, SageMaker HyperPod instance groups don't have access to the internet, and therefore can't connect to Amazon S3 to access or store files such as lifecycle scripts, training data, and model artifacts. To establish a connection with Amazon S3 while using a VPC, you should create a VPC endpoint. By creating a VPC endpoint, you can allow the SageMaker HyperPod instance groups to access the S3 buckets within the same VPC. We recommend that you also create a custom policy that only allows requests from your private VPC to access your S3 buckets. For more information, see Endpoints for Amazon S3 in the Amazon PrivateLink Guide.

  • If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see Step 1: Prepare an EFA-enabled security group in the Amazon EC2 User Guide.
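
The self-referencing security group rule described in the list above can be sketched with boto3 as follows. This is a minimal example, assuming an existing security group and credentials that allow modifying it. The function names are hypothetical. An IpProtocol value of "-1" means all protocols and ports.

```python
def self_referencing_rule(security_group_id):
    """Build a rule that allows all traffic from or to the group itself."""
    return [
        {
            "IpProtocol": "-1",
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
    ]


def allow_traffic_within_group(security_group_id, region_name):
    """Add inbound and outbound self-referencing rules to the security group."""
    import boto3  # imported lazily so the rule builder works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    permissions = self_referencing_rule(security_group_id)
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id, IpPermissions=permissions
    )
    ec2.authorize_security_group_egress(
        GroupId=security_group_id, IpPermissions=permissions
    )
```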

(Optional) Set up SageMaker HyperPod with Amazon FSx for Lustre

To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the Amazon Web Services Regions supported by SageMaker HyperPod. After choosing the Amazon Web Services Region you prefer, you should also determine which Availability Zone (AZ) to use. If you use SageMaker HyperPod compute nodes in an AZ different from the AZ where your FSx for Lustre file system is set up within the same Amazon Web Services Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid any cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that the file system is configured with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with VPC.
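
One way to check for cross-AZ placement is to compare Availability Zone IDs, which identify the physical AZ (unlike AZ names, which can map differently in each account). The following is a minimal sketch, assuming boto3, credentials with read access to Amazon EC2 and Amazon FSx, and that you know the subnet ID planned for the cluster. The function names are hypothetical.

```python
def subnets_share_az(subnets):
    """Return True if every subnet description has the same AZ ID."""
    return len({subnet["AvailabilityZoneId"] for subnet in subnets}) == 1


def cluster_and_fsx_share_az(cluster_subnet_id, file_system_id, region_name):
    """Compare the physical AZ of the cluster subnet and the file system subnets."""
    import boto3  # imported lazily so the comparison above works without boto3

    fsx = boto3.client("fsx", region_name=region_name)
    ec2 = boto3.client("ec2", region_name=region_name)
    file_system = fsx.describe_file_systems(FileSystemIds=[file_system_id])[
        "FileSystems"
    ][0]
    # Remove duplicates while preserving order before describing the subnets.
    subnet_ids = list(dict.fromkeys([cluster_subnet_id] + file_system["SubnetIds"]))
    subnets = ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]
    return subnets_share_az(subnets)
```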