Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS


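SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the MLflow application on Amazon SageMaker Studio. Cluster admins set up the MLflow server and connect it to the SageMaker HyperPod clusters. Data scientists can gain insights into the model.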

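To set up an MLflow server using the AWS CLI

The MLflow tracking server should be created by the cluster admin.

  1. Create a SageMaker AI MLflow tracking server by following the instructions in Create a tracking server using the AWS CLI.

  2. Make sure that the eks-auth:AssumeRoleForPodIdentity permission exists in the IAM execution role for SageMaker HyperPod.

  3. If the eks-pod-identity-agent add-on is not already installed on your EKS cluster, install it.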

    aws eks create-addon \
      --cluster-name <eks_cluster_name> \
      --addon-name eks-pod-identity-agent \
      --addon-version vx.y.z-eksbuild.1
  4. Create a trust-relationship.json file for a new role that Pods use to call MLflow APIs.

    cat >trust-relationship.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
          "Effect": "Allow",
          "Principal": {
            "Service": "pods.eks.amazonaws.com"
          },
          "Action": [
            "sts:AssumeRole",
            "sts:TagSession"
          ]
        }
      ]
    }
    EOF

    Run the following command to create the role and attach the trust relationship.

    aws iam create-role --role-name hyperpod-mlflow-role \
      --assume-role-policy-document file://trust-relationship.json \
      --description "allow pods to emit mlflow metrics and put data in s3"
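  5. Create the following policy, which grants Pods permission to call all sagemaker-mlflow operations and to put model artifacts in S3. The tracking server already has S3 permissions, but when model artifacts are too large, the MLflow client code uploads them by calling S3 directly.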

    cat >hyperpod-mlflow-policy.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "sagemaker-mlflow:AccessUI",
            "sagemaker-mlflow:CreateExperiment",
            "sagemaker-mlflow:SearchExperiments",
            "sagemaker-mlflow:GetExperiment",
            "sagemaker-mlflow:GetExperimentByName",
            "sagemaker-mlflow:DeleteExperiment",
            "sagemaker-mlflow:RestoreExperiment",
            "sagemaker-mlflow:UpdateExperiment",
            "sagemaker-mlflow:CreateRun",
            "sagemaker-mlflow:DeleteRun",
            "sagemaker-mlflow:RestoreRun",
            "sagemaker-mlflow:GetRun",
            "sagemaker-mlflow:LogMetric",
            "sagemaker-mlflow:LogBatch",
            "sagemaker-mlflow:LogModel",
            "sagemaker-mlflow:LogInputs",
            "sagemaker-mlflow:SetExperimentTag",
            "sagemaker-mlflow:SetTag",
            "sagemaker-mlflow:DeleteTag",
            "sagemaker-mlflow:LogParam",
            "sagemaker-mlflow:GetMetricHistory",
            "sagemaker-mlflow:SearchRuns",
            "sagemaker-mlflow:ListArtifacts",
            "sagemaker-mlflow:UpdateRun",
            "sagemaker-mlflow:CreateRegisteredModel",
            "sagemaker-mlflow:GetRegisteredModel",
            "sagemaker-mlflow:RenameRegisteredModel",
            "sagemaker-mlflow:UpdateRegisteredModel",
            "sagemaker-mlflow:DeleteRegisteredModel",
            "sagemaker-mlflow:GetLatestModelVersions",
            "sagemaker-mlflow:CreateModelVersion",
            "sagemaker-mlflow:GetModelVersion",
            "sagemaker-mlflow:UpdateModelVersion",
            "sagemaker-mlflow:DeleteModelVersion",
            "sagemaker-mlflow:SearchModelVersions",
            "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
            "sagemaker-mlflow:TransitionModelVersionStage",
            "sagemaker-mlflow:SearchRegisteredModels",
            "sagemaker-mlflow:SetRegisteredModelTag",
            "sagemaker-mlflow:DeleteRegisteredModelTag",
            "sagemaker-mlflow:DeleteModelVersionTag",
            "sagemaker-mlflow:DeleteRegisteredModelAlias",
            "sagemaker-mlflow:SetRegisteredModelAlias",
            "sagemaker-mlflow:GetModelVersionByAlias"
          ],
          "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject"
          ],
          "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
        }
      ]
    }
    EOF
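    Note

    The ARNs should be those of the MLflow tracking server and of the S3 bucket that you configured for it when you created the server by following the instructions in Set up MLflow infrastructure.

  6. Attach the mlflow-metrics-emit-policy policy to hyperpod-mlflow-role using the policy document saved in the previous step.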

    aws iam put-role-policy \
      --role-name hyperpod-mlflow-role \
      --policy-name mlflow-metrics-emit-policy \
      --policy-document file://hyperpod-mlflow-policy.json
  7. Create a Kubernetes service account that Pods use to access the MLflow server.

    cat >mlflow-service-account.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: mlflow-service-account
      namespace: kubeflow
    EOF

    Run the following command to apply it to the EKS cluster.

    kubectl apply -f mlflow-service-account.yaml
  8. Create a Pod identity association.

    aws eks create-pod-identity-association \
      --cluster-name EKS_CLUSTER_NAME \
      --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
      --namespace kubeflow \
      --service-account mlflow-service-account
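
The policy document in step 5 is edited by hand, which makes it easy to mistype an account ID or a resource ARN. As a minimal sketch (not part of the official procedure), the following Python builds the same document shape from parameters. The function names, the example bucket name `my-mlflow-bucket`, and the shortened action list are illustrative assumptions; the `aws` partition is also assumed, and the S3 statement uses the object ARN form (`bucket/*`) because `s3:PutObject` is an object-level action.

```python
import json
import re


def tracking_server_arn(region: str, account_id: str, server_name: str) -> str:
    """Build a tracking server ARN in the shape used by this guide's examples."""
    # AWS account IDs are exactly 12 digits; reject anything else early.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"AWS account IDs have 12 digits, got {account_id!r}")
    return (
        f"arn:aws:sagemaker:{region}:{account_id}:"
        f"mlflow-tracking-server/{server_name}"
    )


def build_policy(region, account_id, server_name, bucket_name, mlflow_actions):
    """Return the permissions policy as a dict ready for json.dumps."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(mlflow_actions),
                "Resource": tracking_server_arn(region, account_id, server_name),
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions are scoped to objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Shortened action list for illustration; use the full list from step 5.
    actions = ["sagemaker-mlflow:LogMetric", "sagemaker-mlflow:LogParam"]
    policy = build_policy(
        "us-west-2", "111122223333", "tracking-server-name",
        "my-mlflow-bucket", actions,
    )
    print(json.dumps(policy, indent=2))
```

Redirecting the printed JSON to hyperpod-mlflow-policy.json would give a file that can be passed to the `aws iam put-role-policy` command in step 6.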

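To collect metrics from training jobs to the MLflow server

Data scientists need to set up the training script and the Docker image to emit metrics to the MLflow server.

  1. Add the following lines at the beginning of your training script.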

    import os
    import mlflow

    # Set the Tracking Server URI using the ARN of the Tracking Server you created
    mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
    # Enable autologging in MLflow
    mlflow.autolog()
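  2. Build a Docker image with the training script and push it to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see Pushing a Docker image in the ECR User Guide.

    Tip

    Make sure that your Dockerfile installs the mlflow and sagemaker-mlflow packages. To learn more about installing the packages, their requirements, and compatible versions, see Install MLflow and the SageMaker AI MLflow plugin.

  3. Add the service account to the training job Pods to give them access to hyperpod-mlflow-role. This allows the Pods to call MLflow APIs. Save the following SageMaker HyperPod CLI job submission template as mlflow-test.yaml.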

    defaults:
      - override hydra/job_logging: stdout

    hydra:
      run:
        dir: .
      output_subdir: null

    training_cfg:
      entry_script: ./train.py
      script_args: []
      run:
        name: test-job-with-mlflow # Current run name
        nodes: 2 # Number of nodes to use for current training
        # ntasks_per_node: 1 # Number of devices to use per node
    cluster:
      cluster_type: k8s # currently k8s only
      instance_type: ml.c5.2xlarge
      cluster_config:
        # name of service account associated with the namespace
        service_account_name: mlflow-service-account
        # persistent volume, usually used to mount FSx
        persistent_volume_claims: null
        namespace: kubeflow
        # required node affinity to select nodes with SageMaker HyperPod
        # labels and passed health check if burn-in enabled
        label_selector:
          required:
            sagemaker.amazonaws.com/node-health-status:
              - Schedulable
          preferred:
            sagemaker.amazonaws.com/deep-health-check-status:
              - Passed
          weights:
            - 100
        pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
        restartPolicy: OnFailure # restart policy

    base_results_dir: ./result # Location to store the results, checkpoints and logs.
    container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use

    env_vars:
      NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
      MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
  4. Start the job using the YAML file as follows.

    hyperpod start-job --config-file /path/to/mlflow-test.yaml
  5. Generate a pre-signed URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

    aws sagemaker create-presigned-mlflow-tracking-server-url \
      --tracking-server-name "tracking-server-name" \
      --session-expiration-duration-in-seconds 1800 \
      --expires-in-seconds 300 \
      --region region
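
The training script in step 1 reads MLFLOW_TRACKING_ARN directly from the environment, so a missing or mistyped value in env_vars only surfaces as a KeyError or a failed MLflow call inside the Pod. The sketch below shows one way to fail earlier with a clearer message. The helper names `resolve_tracking_arn` and `configure_mlflow` and the ARN pattern are assumptions inferred from the examples in this guide, not part of the MLflow or SageMaker APIs; only `mlflow.set_tracking_uri` and `mlflow.autolog` are taken from the original script.

```python
import os
import re

# Pattern inferred from the ARN shape shown in this guide (an assumption):
# arn:<partition>:sagemaker:<region>:<12-digit account>:mlflow-tracking-server/<name>
_ARN_RE = re.compile(
    r"^arn:aws[\w-]*:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/.+$"
)


def resolve_tracking_arn(env=None) -> str:
    """Return the tracking server ARN from the environment, or raise clearly."""
    env = os.environ if env is None else env
    arn = env.get("MLFLOW_TRACKING_ARN", "")
    if not arn:
        raise RuntimeError(
            "MLFLOW_TRACKING_ARN is not set; check env_vars in the job config"
        )
    if not _ARN_RE.match(arn):
        raise RuntimeError(
            f"MLFLOW_TRACKING_ARN does not look like a tracking server ARN: {arn!r}"
        )
    return arn


def configure_mlflow(env=None) -> None:
    """Point MLflow at the tracking server and enable autologging."""
    # Imported lazily so the ARN check can run where mlflow is not installed.
    import mlflow

    mlflow.set_tracking_uri(resolve_tracking_arn(env))
    mlflow.autolog()
```

Calling `configure_mlflow()` once at the top of train.py would then take the place of the two MLflow lines from step 1.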