设置 HyperPod 集群以进行模型部署 - 亚马逊 SageMaker AI
Amazon Web Services 文档中描述的 Amazon Web Services 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅 中国的 Amazon Web Services 服务入门 (PDF)

本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。

设置 HyperPod 集群以进行模型部署

本指南向您展示如何在 Amazon SageMaker HyperPod 集群上启用推理功能。您将设置机器学习工程师部署和管理推理端点所需的基础架构、权限和操作员。

注意

要创建预装了推理运算符的集群,请参阅。创建 EKS 编排集群 SageMaker HyperPod要在现有集群上安装推理运算符,请继续执行以下步骤。

您可以使用 SageMaker AI 控制台安装推理运算符以获得简化的体验,也可以使用 Amazon CLI 进行更多控制。本指南涵盖了两种安装方法。

方法 1:通过 SageMaker AI 控制台安装 HyperPod 推理插件(推荐)

A SageMaker I 控制台提供最简化的体验,有两个安装选项:

  • 快速安装:使用优化的默认值自动创建所有必需的资源,包括 IAM 角色、Amazon S3 存储桶和依赖项插件。将创建一个新的 Studio 域,该域具有将 JumpStart 模型部署到相关集群所需的权限。此选项非常适合以最少的配置决策快速入门。

  • 自定义安装:提供指定现有资源或自定义配置的灵活性,同时保持一键式体验。客户可以根据自己的组织要求选择重复使用现有 IAM 角色、Amazon S3 存储桶或依赖项插件。

先决条件

  • 采用 Amazon EKS 编排功能的现有 HyperPod 集群

  • Amazon EKS 集群管理的 IAM 权限

  • 为集群访问配置了 kubectl

安装步骤

  1. 导航到 SageMaker AI 控制台,然后转到HyperPod 集群集群管理

  2. 选择要在其中安装推理运算符的集群。

  3. 导航到 “推理” 选项卡。选择 “快速安装” 进行自动设置,或选择 “自定义安装” 以获得配置灵活性。

  4. 如果选择 “自定义安装”,请指定现有资源或根据需要自定义设置。

  5. 单击 “安装” 开始自动安装过程。

  6. 通过控制台或运行以下命令验证安装状态:

    kubectl get pods -n hyperpod-inference-system
    aws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION

成功安装插件后,您可以使用模型部署文档部署模型或导航至验证推理运算符是否正常工作

方法 2:使用 CLI 安装推理运算符 Amazon

C Amazon LI 安装方法可以更好地控制安装过程,适用于自动化和高级配置。

先决条件

推理运算符支持在 Amazon EKS 集群上部署和管理机器学习推理终端节点。安装之前,请确保您的集群具有所需的安全配置和支持基础架构。完成以下步骤以配置 IAM 角色、安装 Amazon 负载均衡器控制器、设置 Amazon S3 和 Amazon FSx CSI 驱动程序以及部署 KEDA 和证书管理器:

注意

或者,您可以使用 Amazon CloudFormation 模板自动完成先决条件设置。有关更多信息,请参阅 使用 Amazon CloudFormation 模板创建必备堆栈

Connect 连接到您的集群并设置环境变量

在继续操作之前,请验证您的 Amazon 凭证配置是否正确并具有必要的权限。使用具有管理员权限和 Amazon EKS 集群的集群管理员权限的 IAM 委托人运行以下步骤。确保您已使用创建 HyperPod 集群使用 Amazon EKS 编排创建 SageMaker HyperPod 集群。安装 helm、eksctl 和 kubectl 命令行实用程序。

要获得 Kubernetes 对 Amazon EKS 集群的管理权限,请打开 Amazon EKS 控制台并选择您的集群。在 “访问权限” 选项卡中,选择 IAM 访问条目。如果您的 IAM 委托人没有条目,请选择创建访问条目。选择所需的 IAM 委托人并将其AmazonEKSClusterAdminPolicy与之关联。

  1. 将 kubectl 配置为连接到由 Amazon EKS HyperPod 集群编排的新创建的集群。指定区域和 HyperPod 集群名称。

    export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name> export REGION=<region> # S3 bucket where tls certificates will be uploaded export BUCKET_NAME="hyperpod-tls-<your-bucket-suffix>" # Bucket should have prefix: hyperpod-tls-* export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text | \ cut -d'/' -f2) aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
    注意

    如果使用的自定义存储桶名称不是以开头hyperpod-tls-,请将以下策略附加到您的执行角色:

    { "Version": "2012-10-17", "Statement": [ { "Sid": "TLSBucketDeleteObjectsPermission", "Effect": "Allow", "Action": ["s3:DeleteObject"], "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"], "Condition": { "StringEquals": { "aws:ResourceAccount": "${aws:PrincipalAccount}" } } }, { "Sid": "TLSBucketGetObjectAccess", "Effect": "Allow", "Action": ["s3:GetObject"], "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"] }, { "Sid": "TLSBucketPutObjectAccess", "Effect": "Allow", "Action": ["s3:PutObject", "s3:PutObjectTagging"], "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"], "Condition": { "StringEquals": { "aws:ResourceAccount": "${aws:PrincipalAccount}" } } } ] }
  2. 设置默认 env 变量。

    HYPERPOD_INFERENCE_ROLE_NAME="SageMakerHyperPodInference-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_NAMESPACE="hyperpod-inference-system"
  3. 从集群 ARN 中提取 Amazon EKS 集群名称,更新本地 kubeconfig,并通过列出命名空间中的所有容器组(pod)来验证连接。

    kubectl get pods --all-namespaces
  4. (可选)安装 NVIDIA 设备插件以在集群上启用 GPU 支持。

    # Install nvidia device plugin kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml # Verify that GPUs are visible to k8s kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu

为推理操作员配置 IAM 角色

  1. 收集在 Amazon EKS、A SageMaker I 和 IAM 组件之间配置服务集成 ARNs 所需的基本 Amazon 资源标识符。

    %%bash -x export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text) export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5) export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
  2. 将 IAM OIDCidentity 提供商与您的 EKS 集群关联。

    eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
  3. 创建 HyperPod 推理运算符 IAM 角色所需的信任策略。这些策略支持在 Amazon EKS、A SageMaker I 和其他服务之间进行安全的跨 Amazon 服务通信。

    %%bash -x # Create trust policy JSON cat << EOF > trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "sagemaker.amazonaws.com" ] }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager" } } } ] } EOF
  4. 为推理运算符创建执行角色。

    aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
  5. 为推理运算符资源创建命名空间

    kubectl create namespace $HYPERPOD_INFERENCE_NAMESPACE

创建 ALB 控制器角色

  1. 创建信任策略和权限策略。

    # Create trust policy cat <<EOF > /tmp/alb-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:aws-load-balancer-controller", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json # Create the role aws iam create-role \ --role-name alb-role \ --assume-role-policy-document file:///tmp/alb-trust-policy.json # Create the policy ALB_POLICY_ARN=$(aws iam create-policy \ --policy-name $ALBController_IAM_POLICY_NAME \ --policy-document file://AWSLoadBalancerControllerIAMPolicy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name alb-role \ --policy-arn $ALB_POLICY_ARN
  2. 将标签 (kubernetes.io.role/elb) 应用于 Amazon EKS 集群中的所有子网(包括公有子网和私有子网)。

    export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text) # Add Tags aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | \ xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1 # Verify Tags are added aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
  3. 创建 Amazon S3 VPC 端点。

    aws ec2 create-vpc-endpoint \ --region ${REGION} \ --vpc-id ${VPC_ID} \ --vpc-endpoint-type Gateway \ --service-name "com.amazonaws.${REGION}.s3" \ --route-table-ids $(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')

创建 KEDA 运算符角色

  1. 创建信任策略和权限策略。

    # Create trust policy cat <<EOF > /tmp/keda-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:keda-operator", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy cat <<EOF > /tmp/keda-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "aps:QueryMetrics", "aps:GetLabels", "aps:GetSeries", "aps:GetMetricMetadata" ], "Resource": "*" } ] } EOF # Create the role aws iam create-role \ --role-name keda-operator-role \ --assume-role-policy-document file:///tmp/keda-trust-policy.json # Create the policy KEDA_POLICY_ARN=$(aws iam create-policy \ --policy-name KedaOperatorPolicy \ --policy-document file:///tmp/keda-policy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name keda-operator-role \ --policy-arn $KEDA_POLICY_ARN
  2. 如果您使用的是门控模型,请创建一个 IAM 角色来访问门控模型。

    1. 创建一个 IAM 策略。

      %%bash -s $REGION JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}" cat <<EOF > /tmp/trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } }, { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } EOF
    2. 创建 IAM 角色。

      # Create the role using existing trust policy aws iam create-role \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --assume-role-policy-document file:///tmp/trust-policy.json aws iam attach-role-policy \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      JUMPSTART_GATED_ROLE_ARN_LIST= !aws iam get-role --role-name=$JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text JUMPSTART_GATED_ROLE_ARN = JUMPSTART_GATED_ROLE_ARN_LIST[0] !echo $JUMPSTART_GATED_ROLE_ARN

安装依赖项 EKS 附加组件

在安装推理运算符之前,您必须在集群上安装以下必需的 EKS 插件。如果缺少任何依赖项,则推理运算符将无法安装。为了与 Inference 插件兼容,每个插件都有最低版本要求。

重要

在尝试安装推理运算符之前,请安装所有依赖项插件。缺少依赖关系将导致安装失败并显示特定的错误消息。

必需的附加组件

  1. 亚马逊 S3 Mountpoint CSI 驱动程序(最低版本:v1.14.1-eksbuild.1)

    在推理工作负载中将 S3 存储桶作为永久卷挂载是必需的。

    aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name aws-mountpoint-s3-csi-driver \ --region $REGION \ --service-account-role-arn $S3_CSI_ROLE_ARN

    有关包括所需的 IAM 权限在内的详细安装说明,请参阅 Mountpoint for Amazon S3 CSI 驱动程序

  2. 亚马逊 FSx CSI 驱动程序(最低版本:v1.6.0-eksbuild.1)

    为高性能模型存储挂载 FSx 文件系统所必需的。

    aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name aws-fsx-csi-driver \ --region $REGION \ --service-account-role-arn $FSX_CSI_ROLE_ARN

    有关包括所需的 IAM 权限在内的详细安装说明,请参阅 A mazon f FSx or Lustre CSI 驱动程序

  3. Metrics Server(最低版本:v0.7.2-eksbuild.4)

    自动缩放功能和资源指标收集所必需的。

    aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name metrics-server \ --region $REGION

    有关详细的安装说明,请参阅 Metrics 服务器

  4. 证书管理器(最低版本:v1.18.2-eksbuild.2)

    用于安全推理端点的 TLS 证书管理所必需的。

    aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name cert-manager \ --region $REGION

    有关详细的安装说明,请参阅证书管理器

验证附加组件的安装

安装所需的插件后,请验证它们是否正常运行:

# Check add-on status aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION # Verify pods are running kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)" kubectl get pods -n cert-manager

在继续安装推理运算符之前,所有插件的状态都应显示为 “ACTIVE”,并且所有 pod 都应处于 “正在运行” 状态。

注意

如果您使用快速设置或自定义设置选项创建了 HyperPod 集群,则可能已经安装了 FSx CSI 驱动程序和证书管理器。使用上面的命令验证他们的存在。

使用 EKS 附加组件安装推理运算符

EKS 插件安装方法通过自动更新和集成依赖项验证提供托管体验。这是安装推理运算符的推荐方法。

安装推理运算符附加组件
  1. 通过收集所有必需的配置文件 ARNs 并创建配置文件来准备插件配置:

    # Gather required ARNs export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text) export HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $REGION --query "ClusterArn" --output text) export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text) export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text) # Verify all ARNs are set correctly echo "Execution Role ARN: $EXECUTION_ROLE_ARN" echo "HyperPod Cluster ARN: $HYPERPOD_CLUSTER_ARN" echo "KEDA Role ARN: $KEDA_ROLE_ARN" echo "ALB Role ARN: $ALB_ROLE_ARN" echo "TLS S3 Bucket: $BUCKET_NAME"
  2. 创建包含所有必需设置的插件配置文件:

    cat > addon-config.json << EOF { "executionRoleArn": "$EXECUTION_ROLE_ARN", "tlsCertificateS3Bucket": "$BUCKET_NAME", "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN", "jumpstartGatedModelDownloadRoleArn": "$JUMPSTART_GATED_ROLE_ARN", "alb": { "serviceAccount": { "create": true, "roleArn": "$ALB_ROLE_ARN" } }, "keda": { "auth": { "aws": { "irsa": { "roleArn": "$KEDA_ROLE_ARN" } } } } } EOF # Verify the configuration file cat addon-config.json
  3. 安装推理运算符附加组件(最低版本:v1.0.0-eksbuild.1):

    aws eks create-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name amazon-sagemaker-hyperpod-inference \ --configuration-values file://addon-config.json \ --region $REGION
  4. 监控安装进度并验证是否成功完成:

    # Check installation status (repeat until status shows "ACTIVE") aws eks describe-addon \ --cluster-name $EKS_CLUSTER_NAME \ --addon-name amazon-sagemaker-hyperpod-inference \ --region $REGION \ --query "addon.{Status:status,Health:health}" \ --output table # Verify pods are running kubectl get pods -n hyperpod-inference-system # Check operator logs for any issues kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50

有关安装问题的详细疑难解答,请参阅HyperPod 推理疑难解答

要验证推理运算符是否正常工作,请继续。验证推理运算符是否正常工作

使用 Amazon CloudFormation 模板创建必备堆栈

除了手动配置先决条件之外,您还可以使用 Amazon CloudFormation 模板为推理运算符自动创建所需的 IAM 角色和策略。

  1. 设置输入变量。将占位符值替换为自己的值:

    #!/bin/bash set -e # ===== INPUT VARIABLES ===== HP_CLUSTER_NAME="my-hyperpod-cluster" # Replace with your HyperPod cluster name REGION="us-east-1" # Replace with your AWS region PREFIX="my-prefix" # Replace with your resource prefix SHORT_PREFIX="12a34d56" # Replace with your short prefix (maximum 8 characters) CREATE_DOMAIN="true" # Set to "false" if you don't need a SageMaker Studio domain STACK_NAME="hyperpod-inference-prerequisites" # Replace with your stack name TEMPLATE_URL="https://aws-sagemaker-hyperpod-cluster-setup-${REGION}-prod.s3.${REGION}.amazonaws.com/templates/main-stack-inference-operator-addon-template.yaml"
  2. 派生集群和网络信息:

    # ===== DERIVE EKS CLUSTER NAME ===== EKS_CLUSTER_NAME=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'Orchestrator.Eks.ClusterArn' --output text | awk -F'/' '{print $NF}') echo "EKS_CLUSTER_NAME=$EKS_CLUSTER_NAME" # ===== GET VPC AND OIDC ===== VPC_ID=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text) echo "VPC_ID=$VPC_ID" OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.identity.oidc.issuer' --output text | sed 's|https://||') echo "OIDC_PROVIDER=$OIDC_PROVIDER" # ===== GET PRIVATE ROUTE TABLES ===== ALL_ROUTE_TABLES=$(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].RouteTableId' --output text) EKS_PRIVATE_ROUTE_TABLES="" for rtb in $ALL_ROUTE_TABLES; do HAS_IGW=$(aws ec2 describe-route-tables --region $REGION --route-table-ids $rtb --query 'RouteTables[0].Routes[?GatewayId && starts_with(GatewayId, `igw-`)]' --output text 2>/dev/null) if [ -z "$HAS_IGW" ]; then EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb" fi done echo "EKS_PRIVATE_ROUTE_TABLES=$EKS_PRIVATE_ROUTE_TABLES" # ===== CHECK S3 VPC ENDPOINT ===== S3_ENDPOINT_EXISTS=$(aws ec2 describe-vpc-endpoints --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" --query 'VpcEndpoints[0].VpcEndpointId' --output text) CREATE_S3_ENDPOINT_STACK=$([ "$S3_ENDPOINT_EXISTS" == "None" ] && echo "true" || echo "false") echo "CREATE_S3_ENDPOINT_STACK=$CREATE_S3_ENDPOINT_STACK" # ===== GET HYPERPOD DETAILS ===== HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'ClusterArn' --output text) echo "HYPERPOD_CLUSTER_ARN=$HYPERPOD_CLUSTER_ARN" # ===== GET DEFAULT VPC FOR DOMAIN ===== DOMAIN_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=isDefault,Values=true" --query 'Vpcs[0].VpcId' --output text) echo "DOMAIN_VPC_ID=$DOMAIN_VPC_ID" DOMAIN_SUBNET_IDS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$DOMAIN_VPC_ID" --query 'Subnets[0].SubnetId' --output text) echo "DOMAIN_SUBNET_IDS=$DOMAIN_SUBNET_IDS" # ===== GET INSTANCE GROUPS ===== INSTANCE_GROUPS=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'InstanceGroups[].InstanceGroupName' --output json | python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join([f'\\\\\\\"' + g + '\\\\\\\"' for g in groups]) + ']')") echo "INSTANCE_GROUPS=$INSTANCE_GROUPS"
  3. 创建参数文件并部署堆栈:

    # ===== CREATE PARAMETERS JSON ===== cat > /tmp/cfn-params.json << EOF [ {"ParameterKey":"ResourceNamePrefix","ParameterValue":"$PREFIX"}, {"ParameterKey":"ResourceNameShortPrefix","ParameterValue":"$SHORT_PREFIX"}, {"ParameterKey":"VpcId","ParameterValue":"$VPC_ID"}, {"ParameterKey":"EksPrivateRouteTableIds","ParameterValue":"$EKS_PRIVATE_ROUTE_TABLES"}, {"ParameterKey":"EKSClusterName","ParameterValue":"$EKS_CLUSTER_NAME"}, {"ParameterKey":"OIDCProviderURLWithoutProtocol","ParameterValue":"$OIDC_PROVIDER"}, {"ParameterKey":"HyperPodClusterArn","ParameterValue":"$HYPERPOD_CLUSTER_ARN"}, {"ParameterKey":"HyperPodClusterName","ParameterValue":"$HP_CLUSTER_NAME"}, {"ParameterKey":"CreateDomain","ParameterValue":"$CREATE_DOMAIN"}, {"ParameterKey":"DomainVpcId","ParameterValue":"$DOMAIN_VPC_ID"}, {"ParameterKey":"DomainSubnetIds","ParameterValue":"$DOMAIN_SUBNET_IDS"}, {"ParameterKey":"CreateS3EndpointStack","ParameterValue":"$CREATE_S3_ENDPOINT_STACK"}, {"ParameterKey":"TieredStorageConfig","ParameterValue":"{\"Mode\":\"Enable\",\"InstanceMemoryAllocationPercentage\":20}"}, {"ParameterKey":"TieredKVCacheConfig","ParameterValue":"{\"KVCacheMode\":\"Enable\",\"InstanceGroup\":$INSTANCE_GROUPS,\"NVMeMode\":\"Enable\"}"} ] EOF echo -e "\n===== CREATING CLOUDFORMATION STACK =====" aws cloudformation create-stack \ --region $REGION \ --stack-name $STACK_NAME \ --template-url $TEMPLATE_URL \ --parameters file:///tmp/cfn-params.json \ --capabilities CAPABILITY_NAMED_IAM
  4. 监控堆栈创建状态:

    aws cloudformation describe-stacks \ --stack-name $STACK_NAME \ --region $REGION \ --query 'Stacks[0].StackStatus'
  5. 成功创建堆栈后,检索输出值以用于推理运算符安装:

    aws cloudformation describe-stacks \ --stack-name $STACK_NAME \ --region $REGION \ --query 'Stacks[0].Outputs'

创建 Amazon CloudFormation 堆栈后,使用 EKS 附加组件安装推理运算符继续安装推理运算符。

方法 3:安装头盔图

如果您需要对安装配置进行更多控制,或者如果您所在的地区没有 EKS 附加组件,请使用此方法。

先决条件

在继续操作之前,请验证您的 Amazon 凭证配置是否正确并具有必要的权限。以下步骤需要由对 Amazon EKS 集群具有管理员权限和集群管理员访问权限的 IAM 委托人运行。确认您已使用创建 HyperPod 集群使用 Amazon EKS 编排创建 SageMaker HyperPod 集群。确认您已安装 helm、eksctl 和 kubectl 命令行实用程序。

要获得 Kubernetes 对 Amazon EKS 集群的管理权限,请前往 Amazon EKS 控制台并选择您正在使用的集群。在访问选项卡中查看,然后选择“IAM 访问条目”。如果您的 IAM 主体没有条目,请选择创建访问条目。然后选择所需的 IAM 主体并将 AmazonEKSClusterAdminPolicy 与之关联。

  1. 将 kubectl 配置为连接到由 Amazon EKS HyperPod 集群编排的新创建的集群。指定区域和 HyperPod 集群名称。

    export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name> export REGION=<region> # S3 bucket where tls certificates will be uploaded BUCKET_NAME="<Enter name of your s3 bucket>" # This should be bucket name, not URI export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text | \ cut -d'/' -f2) aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
  2. 设置默认 env 变量。

    LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME" LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME" S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME" S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME" KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME" KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME" PRESIGNED_URL_ACCESS_POLICY_NAME="PresignedUrlAccessPolicy-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_ACCESS_POLICY_NAME="HyperpodInferenceAccessPolicy-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME" HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller" HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system" JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME" FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME"
  3. 从集群 ARN 中提取 Amazon EKS 集群名称,更新本地 kubeconfig,并通过列出命名空间中的所有容器组(pod)来验证连接。

    kubectl get pods --all-namespaces
  4. (可选)安装 NVIDIA 设备插件以在集群上启用 GPU 支持。

    #Install nvidia device plugin kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml # Verify that GPUs are visible to k8s kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu

为推理运算符安装准备环境

  1. 收集在 Amazon EKS、A SageMaker I 和 IAM 组件之间配置服务集成 ARNs 所需的基本 Amazon 资源标识符。

    %%bash -x export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text) export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5) export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
  2. 将 IAM OIDCidentity 提供商与您的 EKS 集群关联。

    eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
  3. 创建 HyperPod 推理运算符 IAM 角色所需的信任策略和权限策略 JSON 文档。这些策略支持在 Amazon EKS、A SageMaker I 和其他服务之间进行安全的跨 Amazon 服务通信。

    bash # Create trust policy JSON cat << EOF > trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "sagemaker.amazonaws.com" ] }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager" } } } ] } EOF # Create permission policy JSON cat << EOF > permission-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "S3Access", "Effect": "Allow", "Action": [ "s3:Get*", "s3:List*", "s3:Describe*", "s3:PutObject" ], "Resource": [ "*" ] }, { "Sid": "ECRAccess", "Effect": "Allow", "Action": [ "ecr:GetAuthorizationToken", "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:GetRepositoryPolicy", "ecr:DescribeRepositories", "ecr:ListImages", "ecr:DescribeImages", "ecr:BatchGetImage", "ecr:GetLifecyclePolicy", "ecr:GetLifecyclePolicyPreview", "ecr:ListTagsForResource", "ecr:DescribeImageScanFindings" ], "Resource": [ "*" ] }, { "Sid": "EC2Access", "Effect": "Allow", "Action": [ "ec2:AssignPrivateIpAddresses", "ec2:AttachNetworkInterface", "ec2:CreateNetworkInterface", "ec2:DeleteNetworkInterface", "ec2:DescribeInstances", "ec2:DescribeTags", "ec2:DescribeNetworkInterfaces", "ec2:DescribeInstanceTypes", "ec2:DescribeSubnets", "ec2:DetachNetworkInterface", "ec2:ModifyNetworkInterfaceAttribute", "ec2:UnassignPrivateIpAddresses", "ec2:CreateTags", "ec2:DescribeInstances", "ec2:DescribeInstanceTypes", "ec2:DescribeRouteTables", "ec2:DescribeSecurityGroups", "ec2:DescribeSubnets", "ec2:DescribeVolumes", "ec2:DescribeVolumesModifications", "ec2:DescribeVpcs", "ec2:CreateVpcEndpointServiceConfiguration", "ec2:DeleteVpcEndpointServiceConfigurations", "ec2:DescribeVpcEndpointServiceConfigurations", "ec2:ModifyVpcEndpointServicePermissions" ], "Resource": [ "*" ] }, { "Sid": "EKSAuthAccess", "Effect": "Allow", "Action": [ "eks-auth:AssumeRoleForPodIdentity" ], "Resource": [ "*" ] }, { "Sid": "EKSAccess", "Effect": "Allow", "Action": [ "eks:AssociateAccessPolicy", "eks:Describe*", "eks:List*", "eks:AccessKubernetesApi" ], "Resource": [ "*" ] }, { "Sid": "ApiGatewayAccess", "Effect": "Allow", "Action": [ "apigateway:POST", "apigateway:GET", "apigateway:PUT", "apigateway:PATCH", "apigateway:DELETE", "apigateway:UpdateRestApiPolicy" ], "Resource": [ "arn:aws:apigateway:*::/vpclinks", "arn:aws:apigateway:*::/vpclinks/*", "arn:aws:apigateway:*::/restapis", "arn:aws:apigateway:*::/restapis/*" ] }, { "Sid": "ElasticLoadBalancingAccess", "Effect": "Allow", "Action": [ "elasticloadbalancing:CreateLoadBalancer", "elasticloadbalancing:DescribeLoadBalancers", "elasticloadbalancing:DescribeLoadBalancerAttributes", "elasticloadbalancing:DescribeListeners", "elasticloadbalancing:DescribeListenerCertificates", "elasticloadbalancing:DescribeSSLPolicies", "elasticloadbalancing:DescribeRules", "elasticloadbalancing:DescribeTargetGroups", "elasticloadbalancing:DescribeTargetGroupAttributes", "elasticloadbalancing:DescribeTargetHealth", "elasticloadbalancing:DescribeTags", "elasticloadbalancing:DescribeTrustStores", "elasticloadbalancing:DescribeListenerAttributes" ], "Resource": [ "*" ] }, { "Sid": "SageMakerAccess", "Effect": "Allow", "Action": [ "sagemaker:*" ], "Resource": [ "*" ] }, { "Sid": "AllowPassRoleToSageMaker", "Effect": "Allow", "Action": [ "iam:PassRole" ], "Resource": "arn:aws:iam::*:role/*", "Condition": { "StringEquals": { "iam:PassedToService": "sagemaker.amazonaws.com" } } }, { "Sid": "AcmAccess", "Effect": "Allow", "Action": [ "acm:ImportCertificate", "acm:DeleteCertificate" ], "Resource": [ "*" ] } ] } EOF
  4. 为推理运算符创建执行角色。

    aws iam create-policy --policy-name $HYPERPOD_INFERENCE_ACCESS_POLICY_NAME --policy-document file://permission-policy.json export policy_arn="arn:aws:iam::${ACCOUNT_ID}:policy/$HYPERPOD_INFERENCE_ACCESS_POLICY_NAME"
    aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json aws iam put-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-name InferenceOperatorInlinePolicy --policy-document file://permission-policy.json
  5. 下载并创建 Load Balancer Controller 管理您的 EKS 集群中的应用程序负 Amazon 载均衡器和网络负载均衡器所需的 IAM 策略。

    %%bash -x export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json
  6. 创建一个 IAM 服务账户,将 Kubernetes 服务账户与 IAM 策略关联起来,使 Amazon 负载均衡器控制器能够通过 IRSA(服务账户的 IAM 角色)获得必要的 Amazon 权限。

    %%bash -x export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME" # Create IAM service account with gathered values eksctl create iamserviceaccount \ --approve \ --override-existing-serviceaccounts \ --name=aws-load-balancer-controller \ --namespace=kube-system \ --cluster=$EKS_CLUSTER_NAME \ --attach-policy-arn=$ALB_POLICY_ARN \ --region=$REGION # Print the values for verification echo "Cluster Name: $EKS_CLUSTER_NAME" echo "Region: $REGION" echo "Policy ARN: $ALB_POLICY_ARN"
  7. 将标签 (kubernetes.io.role/elb) 应用于 Amazon EKS 集群中的所有子网(包括公有子网和私有子网)。

    export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text) # Add Tags aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | \ xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1 # Verify Tags are added aws ec2 describe-subnets \ --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \ --query 'Subnets[*].SubnetId' --output text | \ tr '\t' '\n' | xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
  8. 为 KEDA 和证书管理器创建命名空间。

    kubectl create namespace keda kubectl create namespace cert-manager
  9. 创建 Amazon S3 VPC 端点。

    aws ec2 create-vpc-endpoint \ --vpc-id ${VPC_ID} \ --vpc-endpoint-type Gateway \ --service-name "com.amazonaws.${REGION}.s3" \ --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
  10. 配置 S3 存储访问权限:

    1. 创建一个 IAM 策略来授予使用适用于 Amazon S3 的 Mountpoint 所需的 S3 权限,以便文件系统能够从集群内访问 S3 存储桶。

      %%bash -x export S3_CSI_BUCKET_NAME=“<bucketname_for_mounting_through_filesystem>” cat <<EOF> s3accesspolicy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "MountpointAccess", "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:AbortMultipartUpload", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::${S3_CSI_BUCKET_NAME}", "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*" ] } ] } EOF aws iam create-policy \ --policy-name S3MountpointAccessPolicy \ --policy-document file://s3accesspolicy.json cat <<EOF> s3accesstrustpolicy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringEquals": { "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com", "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:${s3-csi-driver-sa}" } } } ] } EOF aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy"
    2. (可选)为 Amazon S3 CSI 驱动程序创建 IAM 服务账户。Amazon S3 CSI 驱动程序需要一个具有适当权限的 IAM 服务账户,才能将 S3 存储桶作为永久卷挂载到您的 Amazon EKS 集群中。在此步骤中,使用所需的 S3 访问策略创建必要的 IAM 角色和 Kubernetes 服务账户。

      %%bash -x export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION" export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' | tr -d '"') eksctl create iamserviceaccount \ --name s3-csi-driver-sa \ --namespace kube-system \ --cluster $EKS_CLUSTER_NAME \ --attach-policy-arn $S3_CSI_POLICY_ARN \ --approve \ --role-name $S3_CSI_ROLE_NAME \ --region $REGION kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite
    3. (可选)安装 Amazon S3 CSI 驱动程序加载项。此驱动程序使您的容器组(pod)能够将 S3 存储桶作为持久性卷挂载,从而提供从 Kubernetes 工作负载中直接访问 S3 存储的权限。

      %%bash -x export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text) eksctl create addon --name aws-mountpoint-s3-csi-driver --cluster $EKS_CLUSTER_NAME --service-account-role-arn $S3_CSI_ROLE_ARN --force
    4. (可选)为 S3 存储创建持久性卷声明(PVC)。此 PVC 使容器组(pod)能够请求并使用 S3 存储,就像使用传统文件系统一样。

      %%bash -x cat <<EOF> pvc_s3.yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: s3-claim spec: accessModes: - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany storageClassName: "" # required for static provisioning resources: requests: storage: 1200Gi # ignored, required volumeName: s3-pv EOF kubectl apply -f pvc_s3.yaml
  11. (可选)配置 FSx 存储访问权限。为 Amazon FSx CSI 驱动程序创建一个 IAM 服务账户。 FSx CSI 驱动程序将使用此服务账户代表您的集群与 Amazon FSx 服务进行交互。

    %%bash -x eksctl create iamserviceaccount \ --name fsx-csi-controller-sa \ --namespace kube-system \ --cluster $EKS_CLUSTER_NAME \ --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \ --approve \ --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \ --region $REGION

创建 KEDA 运算符角色

  1. 创建信任策略和权限策略。

    # Create trust policy cat <<EOF > /tmp/keda-trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } } ] } EOF # Create permissions policy cat <<EOF > /tmp/keda-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics" ], "Resource": "*" }, { "Effect": "Allow", "Action": [ "aps:QueryMetrics", "aps:GetLabels", "aps:GetSeries", "aps:GetMetricMetadata" ], "Resource": "*" } ] } EOF # Create the role aws iam create-role \ --role-name keda-operator-role \ --assume-role-policy-document file:///tmp/keda-trust-policy.json # Create the policy KEDA_POLICY_ARN=$(aws iam create-policy \ --policy-name KedaOperatorPolicy \ --policy-document file:///tmp/keda-policy.json \ --query 'Policy.Arn' \ --output text) # Attach the policy to the role aws iam attach-role-policy \ --role-name keda-operator-role \ --policy-arn $KEDA_POLICY_ARN
  2. 如果您使用的是门控模型,请创建一个 IAM 角色来访问门控模型。

    1. 创建一个 IAM 策略。

      %%bash -s $REGION cat <<EOF> /tmp/presignedurl-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "CreatePresignedUrlAccess", "Effect": "Allow", "Action": [ "sagemaker:CreateHubContentPresignedUrls" ], "Resource": [ "arn:aws:sagemaker:$1:aws:hub/SageMakerPublicHub", "arn:aws:sagemaker:$1:aws:hub-content/SageMakerPublicHub/*/*" ] } ] } EOF aws iam create-policy --policy-name PresignedUrlAccessPolicy --policy-document file:///tmp/presignedurl-policy.json JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}" cat <<EOF > /tmp/trust-policy.json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringLike": { "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-controller-manager", "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" } } }, { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] } EOF
    2. 创建 IAM 角色。

      # Create the role using existing trust policy aws iam create-role \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --assume-role-policy-document file:///tmp/trust-policy.json # Attach the existing PresignedUrlAccessPolicy to the role aws iam attach-role-policy \ --role-name $JUMPSTART_GATED_ROLE_NAME \ --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/PresignedUrlAccessPolicy
      JUMPSTART_GATED_ROLE_ARN_LIST= !aws iam get-role --role-name=$JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text JUMPSTART_GATED_ROLE_ARN = JUMPSTART_GATED_ROLE_ARN_LIST[0] !echo $JUMPSTART_GATED_ROLE_ARN
    3. SageMakerFullAccess 策略添加到执行角色。

      aws iam attach-role-policy --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --policy-arn=arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

安装推理运算符

  1. 安装 HyperPod 推理运算符。此步骤收集所需的 Amazon 资源标识符,并生成带相应配置参数的 Helm 安装命令。

    https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart 访问掌舵图

    git clone https://github.com/aws/sagemaker-hyperpod-cli cd sagemaker-hyperpod-cli cd helm_chart/HyperPodHelmChart helm dependencies update charts/inference-operator
    %%bash -x HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text) echo $HYPERPOD_INFERENCE_ROLE_ARN S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text) echo $S3_CSI_ROLE_ARN HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn") # Verify values echo "Cluster Name: $EKS_CLUSTER_NAME" echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN" echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN" # Run the the HyperPod inference operator installation. helm install hyperpod-inference-operator charts/inference-operator \ -n kube-system \ --set region=$REGION \ --set eksClusterName=$EKS_CLUSTER_NAME \ --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \ --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \ --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \ --set s3.node.serviceAccount.create=false \ --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \ --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \ --set alb.region=$REGION \ --set alb.clusterName=$EKS_CLUSTER_NAME \ --set alb.vpcId=$VPC_ID # For JumpStart Gated Model usage, Add # --set jumpstartGatedModelDownloadRoleArn=$UMPSTART_GATED_ROLE_ARN
  2. 为 IAM 集成配置服务账户注释。此注释使运算符的服务账户能够代入必要的 IAM 权限,以便管理推理端点并与 Amazon 服务进行交互。

    %%bash -x EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///') # Annotate service account kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \ -n hyperpod-inference-system \ eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \ --overwrite

验证推理运算符是否正常工作

按照以下步骤部署和测试简单模型,验证推理运算符的安装是否正常运行。

部署测试模型以验证操作员
  1. 创建一个模型部署配置文件。这将创建一个 Kubernetes 清单文件,用于为推理运算符定义 JumpStart 模型部署。 HyperPod

    cat <<EOF>> simple_model_install.yaml --- apiVersion: inference.sagemaker.aws.amazon.com/v1 kind: JumpStartModel metadata: name: testing-deployment-bert namespace: default spec: model: modelId: "huggingface-eqa-bert-base-cased" sageMakerEndpoint: name: "hp-inf-ep-for-testing" server: instanceType: "ml.c5.2xlarge" environmentVariables: - name: SAMPLE_ENV_VAR value: "sample_value" maxDeployTimeInSeconds: 1800 EOF
  2. 部署模型并清理配置文件。

    kubectl create -f simple_model_install.yaml rm -f simple_model_install.yaml
  3. 验证服务帐户配置以确保操作员可以获得 Amazon 权限。

    # Get the service account details kubectl get serviceaccount -n hyperpod-inference-system # Check if the service account has the Amazon annotations kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
配置部署设置(如果使用 Studio 用户界面)
  1. 在 “部署设置” 下查看推荐的实例类型。

  2. 如果修改实例类型,请确保与您的 HyperPod 集群兼容。如果没有兼容的实例,请联系您的管理员。

  3. 对于启用了 MIG 的 GPU 分区实例,请从可用的 MIG 配置文件中选择适当的 GPU 分区以优化 GPU 利用率。有关更多信息,请参阅 在亚马逊中使用 GPU 分区 SageMaker HyperPod

  4. 如果使用任务治理,请为模型部署抢占功能配置优先级设置。

  5. 输入管理员提供的命名空间。如有必要,请联系您的管理员获取正确的命名空间。

(可选)在 SageMaker AI Studio Classic 中通过用户 JumpStart 界面设置用户访问权限

有关为 Studio Classic 用户设置 SageMaker HyperPod 访问权限以及为数据科学家用户配置精细的 Kubernetes RBAC 权限的更多背景信息,请阅读和。在 Studio 中设置 Amazon EKS 集群 设置 Kubernetes 基于角色的访问控制

  1. 确定数据科学家用户将使用哪个 IAM 角色来管理模型并将其部署到 A SageMaker I Studio Classic。 SageMaker HyperPod 这通常是 Studio Classic 用户的用户配置文件执行角色或域执行角色。

    %%bash -x export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>" export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy" export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \ --query 'Orchestrator.Eks.ClusterArn' --output text) export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace"
  2. 附加支持模型部署访问的身份策略。

    %%bash -x # Create access policy cat << EOF > hyperpod-deployment-ui-access-policy.json { "Version": "2012-10-17", "Statement": [ { "Sid": "DescribeHyerpodClusterPermissions", "Effect": "Allow", "Action": [ "sagemaker:DescribeCluster" ], "Resource": "$HYPERPOD_CLUSTER_ARN" }, { "Sid": "UseEksClusterPermissions", "Effect": "Allow", "Action": [ "eks:DescribeCluster", "eks:AccessKubernetesApi", "eks:MutateViaKubernetesApi", "eks:DescribeAddon" ], "Resource": "$EKS_CLUSTER_ARN" }, { "Sid": "ListPermission", "Effect": "Allow", "Action": [ "sagemaker:ListClusters", "sagemaker:ListEndpoints" ], "Resource": "*" }, { "Sid": "SageMakerEndpointAccess", "Effect": "Allow", "Action": [ "sagemaker:DescribeEndpoint", "sagemaker:InvokeEndpoint" ], "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*" } ] } EOF aws iam put-role-policy --role-name DATASCIENTIST_ROLE_NAME --policy-name HyperPodDeploymentUIAccessInlinePolicy --policy-document file://hyperpod-deployment-ui-access-policy.json
  3. 为用户创建 EKS 访问条目,将其映射到 kubernetes 组。

    %%bash -x aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \ --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \ --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
  4. 为用户创建 Kubernetes RBAC 策略。

    %%bash -x cat << EOF > cluster_level_config.yaml kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: hyperpod-scientist-user-cluster-role rules: - apiGroups: [""] resources: ["pods"] verbs: ["list"] - apiGroups: [""] resources: ["nodes"] verbs: ["list"] - apiGroups: [""] resources: ["namespaces"] verbs: ["list"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: hyperpod-scientist-user-cluster-role-binding subjects: - kind: Group name: hyperpod-scientist-user-cluster-level apiGroup: rbac.authorization.k8s.io roleRef: kind: ClusterRole name: hyperpod-scientist-user-cluster-role apiGroup: rbac.authorization.k8s.io EOF kubectl apply -f cluster_level_config.yaml cat << EOF > namespace_level_role.yaml kind: Role apiVersion: rbac.authorization.k8s.io/v1 metadata: namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE name: hyperpod-scientist-user-namespace-level-role rules: - apiGroups: [""] resources: ["pods"] verbs: ["create", "get"] - apiGroups: [""] resources: ["nodes"] verbs: ["get", "list"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get", "list"] - apiGroups: [""] resources: ["pods/exec"] verbs: ["get", "create"] - apiGroups: ["kubeflow.org"] resources: ["pytorchjobs", "pytorchjobs/status"] verbs: ["get", "list", "create", "delete", "update", "describe"] - apiGroups: [""] resources: ["configmaps"] verbs: ["create", "update", "get", "list", "delete"] - apiGroups: [""] resources: ["secrets"] verbs: ["create", "get", "list", "delete"] - apiGroups: [ "inference.sagemaker.aws.amazon.com" ] resources: [ "inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel" ] verbs: [ "get", "list", "create", "delete", "update", "describe" ] - apiGroups: [ "autoscaling" ] resources: [ "horizontalpodautoscalers" ] verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE name: hyperpod-scientist-user-namespace-level-role-binding subjects: - kind: Group name: hyperpod-scientist-user-namespace-level apiGroup: rbac.authorization.k8s.io roleRef: kind: Role name: hyperpod-scientist-user-namespace-level-role apiGroup: rbac.authorization.k8s.io EOF kubectl apply -f namespace_level_role.yaml