
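Step 2: Launch a Training Job Using the SageMaker Python SDK

The SageMaker Python SDK supports managed training of models with machine learning frameworks such as TensorFlow and PyTorch. To launch a training job with one of these frameworks, you define a SageMaker TensorFlow estimator, a SageMaker PyTorch estimator, or a SageMaker generic Estimator that uses your modified training script and the model parallelism configuration.

Using the SageMaker TensorFlow and PyTorch Estimators

The TensorFlow and PyTorch estimator classes contain the distribution parameter, which you can use to specify configuration parameters for distributed training frameworks. The SageMaker model parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option with the library.

The following TensorFlow and PyTorch estimator templates show how to configure the distribution parameter to use the SageMaker model parallel library with MPI.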

Using the SageMaker TensorFlow estimator


import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled":True,              # Required
    "parameters": {
        "partitions": 2,         # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,         # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled" : True,            # Required
    "processes_per_host" : 8,    # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6.3',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

Using the SageMaker PyTorch estimator


import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled":True,
    "parameters": {                        # Required
        "pipeline_parallel_degree": 2,     # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8,              # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

To enable the library, pass configuration dictionaries to the "smdistributed" and "mpi" keys through the distribution argument of the SageMaker estimator constructors.
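The required shape of the distribution argument can be checked before a job is submitted. The following is a minimal sketch, not part of the SageMaker Python SDK: the helper name validate_distribution is hypothetical, and it only checks the keys that the templates above mark as required.

```python
def validate_distribution(distribution):
    """Return a list of problems found in a `distribution` dictionary.

    Hypothetical helper for illustration; it checks only the keys that
    the estimator templates mark as required.
    """
    problems = []
    smd = distribution.get("smdistributed", {})
    mpi = distribution.get("mpi", {})

    # The model parallel options must be nested under "modelparallel".
    if "modelparallel" not in smd:
        problems.append('missing "smdistributed" -> "modelparallel"')
    elif not smd["modelparallel"].get("enabled"):
        problems.append('"modelparallel" requires "enabled": True')

    # MPI must be enabled and told how many processes to start per host.
    if not mpi.get("enabled"):
        problems.append('"mpi" requires "enabled": True')
    if "processes_per_host" not in mpi:
        problems.append('"mpi" requires "processes_per_host"')
    return problems
```

An empty list means that the dictionary has the expected shape. Passing smp_options directly as the value of "smdistributed", without the "modelparallel" level, is reported as a problem.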

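Configuration Parameters for SageMaker Model Parallelism

For the "smdistributed" key, pass a dictionary with the "modelparallel" key and the following inner dictionaries.

Note
Using "modelparallel" and "dataparallel" in one training job is not supported.
- "enabled" – Required. To enable model parallelism, set "enabled": True.
- "parameters" – Required. Specify a set of parameters for SageMaker model parallelism.
  - For a complete list of common parameters, see Parameters for smdistributed in the SageMaker Python SDK documentation.
    
    For TensorFlow, see TensorFlow-specific parameters.
    
    For PyTorch, see PyTorch-specific parameters.
  - "pipeline_parallel_degree" (or "partitions" in smdistributed-modelparallel<v1.6.0) – Required. Among the parameters for smdistributed, this parameter is required to specify how many partitions the model is split into.
    
    Important
    The parameter name has a breaking change. The "pipeline_parallel_degree" parameter replaces "partitions" since smdistributed-modelparallel v1.6.0. For more information, see Common Parameters for SageMaker model parallel configuration and SageMaker Distributed Model Parallel Release Notes in the SageMaker Python SDK documentation.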
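The renamed parameter can be selected from the library version. The following is a minimal sketch under stated assumptions: the helper names degree_parameter_name and make_smp_options are hypothetical, and the version is assumed to be a plain dotted number such as "1.7.0" with no pre-release suffix.

```python
def degree_parameter_name(library_version):
    """Pick the partition-count key for a smdistributed-modelparallel version.

    Assumes a plain dotted numeric version string, for example "1.5.1".
    """
    numbers = [int(part) for part in library_version.split(".")[:3]]
    numbers += [0] * (3 - len(numbers))  # pad "1.6" to (1, 6, 0)
    if tuple(numbers) >= (1, 6, 0):
        return "pipeline_parallel_degree"
    return "partitions"


def make_smp_options(library_version, degree, **extra_parameters):
    """Build the dictionary passed under the "modelparallel" key."""
    parameters = {degree_parameter_name(library_version): degree}
    parameters.update(extra_parameters)
    return {"enabled": True, "parameters": parameters}
```

For example, make_smp_options("1.7.0", 2, microbatches=4) uses "pipeline_parallel_degree", while the same call with "1.5.1" uses "partitions".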
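For the "mpi" key, pass a dictionary that contains the following:
- "enabled" – Required. Set to True to launch the distributed training job with MPI.
- "processes_per_host" – Required. Specify the number of processes that MPI should launch on each host. In SageMaker AI, a host is a single Amazon EC2 ML instance. The SageMaker Python SDK maintains a one-to-one mapping between processes and GPUs across model and data parallelism. This means that SageMaker AI schedules each process on a single, separate GPU, and no GPU contains more than one process. If you are using PyTorch, you must restrict each process to its own device through torch.cuda.set_device(smp.local_rank()). To learn more, see Automated splitting with PyTorch.
  
  Important
  processes_per_host must not be greater than the number of GPUs per instance, and is typically equal to the number of GPUs per instance.
- "custom_mpi_options" (optional) – Use this key to pass any custom MPI options that you might need. If you do not pass any custom MPI options to the key, the MPI option is set by default to the following flag.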
```
--mca btl_vader_single_copy_mechanism none
```
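  Note
  You do not need to explicitly specify this default flag for the key. If you specify it explicitly, your distributed model parallel training job might fail with the following error:
  The following MCA parameter has been listed multiple times on the command line: MCA param: btl_vader_single_copy_mechanism MCA parameters can only be listed once on a command line to ensure there is no ambiguity as to its value. Please correct the situation and try again.
  Tip
  If you launch a training job using an EFA-enabled instance type, such as ml.p4d.24xlarge and ml.p3dn.24xlarge, use the following flags for best performance: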
  -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1
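The rules for the "mpi" dictionary can be combined into one helper. The following is a minimal sketch, not an SDK function: build_mpi_options is a hypothetical name, and the GPU table is an illustrative assumption that you would extend for the instance types you actually use.

```python
# Illustrative GPU counts (assumption: extend for your instance types).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}
# EFA-enabled instance types named in the tip above.
EFA_INSTANCE_TYPES = {"ml.p3dn.24xlarge", "ml.p4d.24xlarge"}
EFA_FLAGS = "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1"
DEFAULT_FLAG = "--mca btl_vader_single_copy_mechanism none"


def build_mpi_options(instance_type, processes_per_host=None, custom_mpi_options=""):
    """Build the "mpi" dictionary for the estimator's distribution argument."""
    gpus = GPUS_PER_INSTANCE[instance_type]
    if processes_per_host is None:
        processes_per_host = gpus  # typical case: one process per GPU
    if not 1 <= processes_per_host <= gpus:
        raise ValueError(
            f"processes_per_host must be between 1 and {gpus} for {instance_type}"
        )

    # Drop the default flag if it was passed explicitly, because listing
    # it twice can make the training job fail.
    flags = custom_mpi_options.replace(DEFAULT_FLAG, "").split()
    if instance_type in EFA_INSTANCE_TYPES:
        flags = EFA_FLAGS.split() + flags

    options = {"enabled": True, "processes_per_host": processes_per_host}
    if flags:
        options["custom_mpi_options"] = " ".join(flags)
    return options
```

For example, build_mpi_options("ml.p4d.24xlarge") returns a dictionary with eight processes per host and the EFA flags, while build_mpi_options("ml.p3.16xlarge") omits "custom_mpi_options" entirely.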

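To launch the training job using the estimator and your training script configured for SageMaker model parallelism, run the estimator.fit() function.

Use the following resources to learn more about using the model parallelism features in the SageMaker Python SDK:

- If you are a new user, we recommend that you use a SageMaker notebook instance. To see an example of how to launch a training job using a SageMaker notebook instance, see Amazon SageMaker AI model parallelism library v2 examples.
- You can also submit a distributed training job from your machine using the Amazon CLI. To set up the Amazon CLI on your machine, see Set up your Amazon credentials and Region for development.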

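Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library

To extend a pre-built container and use SageMaker's model parallelism library, you must use one of the available Amazon Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC images with CUDA (cuxyz). For a complete list of DLC images, see Available Amazon Deep Learning Containers Images in the Deep Learning Containers GitHub repository.

Tip

We recommend that you use the image that contains the latest version of TensorFlow or PyTorch to access the most up-to-date version of the SageMaker model parallelism library.

For example, your Dockerfile should contain a FROM statement similar to the following: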


# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM aws-dlc-account-id.dkr.ecr.aws-region.amazonaws.com/framework-training:{framework-version-tag}

# Add your dependencies here
RUN ...

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker AI container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

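In addition, when you define a PyTorch or TensorFlow estimator, you must specify the entry_point for your training script. This should be the same path identified with ENV SAGEMAKER_SUBMIT_DIRECTORY in your Dockerfile.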
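One way to keep the two settings consistent is to read the directory from the Dockerfile instead of typing it twice. The following is a minimal sketch: the helper names are hypothetical, and it assumes that the training script is expected inside the directory named by SAGEMAKER_SUBMIT_DIRECTORY.

```python
import posixpath
import re

# Matches both "ENV NAME value" and "ENV NAME=value" forms.
_SUBMIT_DIR = re.compile(
    r"^\s*ENV\s+SAGEMAKER_SUBMIT_DIRECTORY(?:\s+|=)(\S+)", re.MULTILINE
)


def submit_directory(dockerfile_text):
    """Return the SAGEMAKER_SUBMIT_DIRECTORY value, or None if it is not set."""
    match = _SUBMIT_DIR.search(dockerfile_text)
    return match.group(1).strip('"') if match else None


def script_path_in_container(dockerfile_text, entry_point):
    """Join the submit directory with the entry point file name.

    Assumption: the script lives directly inside the submit directory.
    """
    directory = submit_directory(dockerfile_text)
    if directory is None:
        raise ValueError("SAGEMAKER_SUBMIT_DIRECTORY is not set in the Dockerfile")
    return posixpath.join(directory, entry_point)
```

With the Dockerfile shown above, submit_directory returns /opt/ml/code, and script_path_in_container resolves your_training_script.py under that directory.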

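Tip

You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR) and use the image URI (image_uri) to define a SageMaker estimator for training. For more information, see Extend a Pre-built Container.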
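The image URI follows the account, Region, repository, and tag layout used in the examples in this section. The following is a minimal sketch: ecr_image_uri is a hypothetical helper, and the amazonaws.com domain suffix is an assumption that may differ in other partitions.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Format an Amazon ECR image URI for the estimator's image_uri argument.

    Assumption: the registry uses the amazonaws.com domain suffix.
    """
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

For example, ecr_image_uri("123456789012", "us-west-2", "smd-mp-demo", "v1") produces a value that can be passed as image_uri. The account ID, repository name, and tag here are placeholders.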

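After you finish hosting the Docker container and retrieving its image URI, create a SageMaker estimator object as follows. This example assumes that you have already defined smp_options and mpi_options.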


import sagemaker
from sagemaker.estimator import Estimator

sagemaker_session = sagemaker.Session()

smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    image_uri='your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag',
    instance_count=1,
    distribution={
        # smp_options must be nested under the "modelparallel" key
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

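Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library

To build your own Docker container for training and use the SageMaker model parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides the minimum set of code blocks that you must include to properly prepare a SageMaker training environment and the model parallel library in your own Docker container.

Note

This custom Docker option with the SageMaker model parallel library as a binary is available only for PyTorch.

To create a Dockerfile with the SageMaker training toolkit and the model parallel library

Start with one of the NVIDIA CUDA base images.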
```
FROM <cuda-cudnn-base-image>
```
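Tip
The official Amazon Deep Learning Container (DLC) images are built from the NVIDIA CUDA base images. We recommend that you look at the official Dockerfiles of Amazon Deep Learning Containers for PyTorch to find which versions of the libraries you need to install and how to configure them. The official Dockerfiles are complete, benchmark tested, and managed by the SageMaker and Deep Learning Container service teams. In the provided link, choose the PyTorch version you use, choose the CUDA (cuxyz) folder, and choose the Dockerfile ending with .gpu or .sagemaker.gpu.

To set up a distributed training environment, you need to install software for communication and network devices, such as Elastic Fabric Adapter (EFA), NVIDIA Collective Communications Library (NCCL), and Open MPI. Depending on the PyTorch and CUDA versions you choose, you must install compatible versions of the libraries.

Important
Because the SageMaker model parallel library requires the SageMaker data parallel library in the subsequent steps, we highly recommend that you follow the instructions at Create your own Docker container with the SageMaker AI distributed data parallel library to properly set up a SageMaker training environment for distributed training.

For more information about setting up EFA with NCCL and Open MPI, see Get started with EFA and MPI and Get started with EFA and NCCL.

Add the following arguments to specify the URLs of the SageMaker distributed training packages for PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use cross-node Remote Direct Memory Access (RDMA).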


ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
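The wheel file names in these URLs encode the package version and the Python version they were built for, which helps when checking compatibility with your base image. The following is a minimal sketch: the helper names are hypothetical, and it assumes a five-part wheel file name without the optional build tag.

```python
import posixpath
from urllib.parse import urlparse


def parse_wheel_url(url):
    """Split the wheel file name at the end of a URL into its tags.

    Assumes the five-part form name-version-python-abi-platform.whl.
    """
    filename = posixpath.basename(urlparse(url).path)
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel URL: {url}")
    name, version, python_tag, abi_tag, platform_tag = filename[:-4].split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def python_tag_for(major, minor):
    """Return the CPython tag for an interpreter version, such as cp38."""
    return f"cp{major}{minor}"
```

Both URLs above carry the cp38 tag, so parse_wheel_url(url)["python_tag"] == python_tag_for(3, 8) holds for them. A container built on a different Python version would need wheels with a matching tag.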

Install the dependencies that the SageMaker model parallel library requires.

Install the METIS library.


ARG METIS=metis-5.1.0

RUN rm /etc/apt/sources.list.d/* \
  && wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
  && gunzip -f ${METIS}.tar.gz \
  && tar -xvf ${METIS}.tar \
  && cd ${METIS} \
  && apt-get update \
  && make config shared=1 \
  && make install \
  && cd .. \
  && rm -rf ${METIS}.tar* \
  && rm -rf ${METIS} \
  && rm -rf /var/lib/apt/lists/* \
  && apt-get clean

Install the RAPIDS Memory Manager library. This requires CMake 3.14 or later.


ARG RMM_VERSION=0.15.0

RUN  wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
  && tar -xvf v${RMM_VERSION}.tar.gz \
  && cd rmm-${RMM_VERSION} \
  && INSTALL_PREFIX=/usr/local ./build.sh librmm \
  && cd .. \
  && rm -rf v${RMM_VERSION}.tar* \
  && rm -rf rmm-${RMM_VERSION}
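The CMake 3.14 requirement for the RAPIDS Memory Manager build can be checked before the build starts. The following is a minimal sketch: the helper names are hypothetical, and it parses the first line that the cmake --version command prints.

```python
import re


def cmake_version(version_output):
    """Extract (major, minor, patch) from the output of `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("could not find a CMake version in the output")
    # A missing patch component is treated as 0.
    return tuple(int(group or 0) for group in match.groups())


def cmake_is_new_enough(version_output, minimum=(3, 14, 0)):
    """Return True if the reported CMake version meets the minimum."""
    return cmake_version(version_output) >= minimum
```

In a container, the input string could come from subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout.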

Install the SageMaker model parallel library.


RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}

Install the SageMaker data parallel library.


RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}

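Install the sagemaker-training toolkit. The toolkit contains the common functionality that is necessary to create a container compatible with the SageMaker training platform and the SageMaker Python SDK.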
```
RUN pip install sagemaker-training
```
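After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to build the Docker container and host it in Amazon ECR.

Tip

For more general information about creating a custom Dockerfile for training in SageMaker AI, see Use Your Own Training Algorithms.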
