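Use the PyTorch framework estimators in the SageMaker Python SDK

You can launch distributed training by adding the distribution argument to the SageMaker AI framework estimators, PyTorch or TensorFlow. For more details, choose one of the frameworks supported by the SageMaker AI distributed data parallelism (SMDDP) library from the following selections.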

PyTorch

The following launcher options are available for launching PyTorch distributed training.

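pytorchddp: This option runs mpirun and sets up the environment variables needed to run PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.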
```
{ "pytorchddp": { "enabled": True } }
```
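torch_distributed: This option runs torchrun and sets up the environment variables needed to run PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.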
```
{ "torch_distributed": { "enabled": True } }
```
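smdistributed: This option also runs mpirun, but uses smddprun to set up the environment variables needed to run PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the distribution parameter.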
```
{ "smdistributed": { "dataparallel": { "enabled": True } } }
```
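
As a minimal sketch, the three dictionaries above can be produced by one small helper so that the launcher is chosen by name in a single place. The function name `build_distribution` and the two launcher sets are hypothetical and are not part of the SageMaker Python SDK; only the dictionary shapes come from the options listed above.

```python
from typing import Any, Dict

# Launcher groups as described above: two run mpirun, one runs torchrun.
MPIRUN_LAUNCHERS = frozenset({"pytorchddp", "smdistributed"})
TORCHRUN_LAUNCHERS = frozenset({"torch_distributed"})


def build_distribution(launcher: str) -> Dict[str, Any]:
    """Return the dictionary to pass as the estimator's distribution argument.

    Hypothetical helper: it only reproduces the three dictionary shapes
    shown in this section.
    """
    if launcher in ("pytorchddp", "torch_distributed"):
        return {launcher: {"enabled": True}}
    if launcher == "smdistributed":
        # smdistributed nests the switch under the "dataparallel" key.
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    raise ValueError(f"Unknown launcher option: {launcher!r}")
```

The result can then be passed directly, for example `distribution=build_distribution("torch_distributed")`.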

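If you chose to replace NCCL AllGather with SMDDP AllGather, you can use any of the three options. Choose the one that fits your use case.

If you chose to replace NCCL AllReduce with SMDDP AllReduce, choose one of the mpirun-based options: smdistributed or pytorchddp. With these options you can also pass additional MPI options, as follows.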


```
{
    "pytorchddp": {
        "enabled": True,
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```
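
The custom_mpi_options value is a single string, and it sits at a different nesting level for each of the two mpirun-based launchers. The following sketch shows one way to assemble that string and place it in the right spot. The names `format_mpi_options` and `add_mpi_options` are hypothetical helpers, not SDK functions, and rejecting the torch_distributed launcher is a design choice of this sketch, based on the statement above that MPI options belong to the mpirun-based options.

```python
import copy
from typing import Any, Dict, Mapping


def format_mpi_options(env: Mapping[str, str], verbose: bool = False) -> str:
    """Build an options string such as "-verbose -x NCCL_DEBUG=VERSION"."""
    parts = ["-verbose"] if verbose else []
    # "-x NAME=VALUE" is the mpirun flag for exporting an environment variable.
    parts += [f"-x {name}={value}" for name, value in env.items()]
    return " ".join(parts)


def add_mpi_options(distribution: Dict[str, Any], options: str) -> Dict[str, Any]:
    """Return a copy of a distribution dict with custom_mpi_options set."""
    result = copy.deepcopy(distribution)  # leave the caller's dict untouched
    if "pytorchddp" in result:
        result["pytorchddp"]["custom_mpi_options"] = options
    elif "smdistributed" in result:
        result["smdistributed"]["dataparallel"]["custom_mpi_options"] = options
    else:
        # Assumption of this sketch: only mpirun-based launchers take MPI options.
        raise ValueError("custom_mpi_options requires pytorchddp or smdistributed")
    return result
```

For example, `add_mpi_options({"pytorchddp": {"enabled": True}}, format_mpi_options({"NCCL_DEBUG": "VERSION"}, verbose=True))` reproduces the first dictionary shown above.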

The following code sample shows the basic structure of a PyTorch estimator with the distributed training options.


```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP (keep exactly one of the following lines)
    distribution={ "pytorchddp": { "enabled": True } },  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } },  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } },  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

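Note

PyTorch Lightning and its utility libraries, such as Lightning Bolts, are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following requirements.txt file and save it in the source directory where you keep the training script.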


```
# requirements.txt
pytorch-lightning
lightning-bolts
```

For example, the directory tree should look like the following.


```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├── adapted-training-script.py
    └── requirements.txt
```

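For more information about specifying the source directory that holds the requirements.txt file together with your training script, and about submitting the job, see Using third-party libraries in the Amazon SageMaker AI Python SDK documentation.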
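
If you prepare the source directory from a notebook, the requirements file described in this note can be written with a few lines of standard-library Python. This is only a sketch: `write_requirements` is a hypothetical helper, and the demo writes into a temporary directory rather than a real project folder.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_requirements(source_dir: Path, packages: Iterable[str]) -> Path:
    """Create source_dir if needed and write one package name per line."""
    source_dir.mkdir(parents=True, exist_ok=True)
    requirements = source_dir / "requirements.txt"
    requirements.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return requirements


# Demo in a throwaway directory; in practice, point this at the folder
# that you pass to the estimator as source_dir.
with tempfile.TemporaryDirectory() as tmp:
    path = write_requirements(
        Path(tmp) / "sub-folder-for-your-code",
        ["pytorch-lightning", "lightning-bolts"],
    )
    demo_contents = path.read_text(encoding="utf-8")
```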

Considerations for activating SMDDP collective operations and choosing the right distributed training launcher option

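SMDDP AllReduce and SMDDP AllGather are currently not compatible with each other.
With smdistributed or pytorchddp, the mpirun-based launchers, SMDDP AllReduce is activated by default and AllGather uses NCCL.
With the torch_distributed launcher, SMDDP AllGather is activated by default and AllReduce falls back to NCCL.
With the mpirun-based launchers, you can also activate SMDDP AllGather by setting an additional environment variable, as follows.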
```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```
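
The considerations above amount to a small decision table, sketched here as a function. The name `active_smddp_collective` and its `optimize_sdp` parameter are hypothetical; `optimize_sdp` stands for whether SMDATAPARALLEL_OPTIMIZE_SDP is set to true. The sketch encodes only what this section states and reports a single collective, because the two are described as not compatible with each other.

```python
def active_smddp_collective(launcher: str, optimize_sdp: bool = False) -> str:
    """Return which SMDDP collective is active for a launcher option.

    optimize_sdp models SMDATAPARALLEL_OPTIMIZE_SDP=true, which this section
    describes only for the mpirun-based launchers.
    """
    if launcher == "torch_distributed":
        # torchrun: SMDDP AllGather by default, AllReduce falls back to NCCL.
        return "AllGather"
    if launcher in ("pytorchddp", "smdistributed"):
        # mpirun: SMDDP AllReduce by default, AllGather only with the variable set.
        return "AllGather" if optimize_sdp else "AllReduce"
    raise ValueError(f"Unknown launcher option: {launcher!r}")
```

A check like this can catch a mismatch early, for example when a job is meant to use SMDDP AllReduce but is configured with the torch_distributed launcher.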

TensorFlow

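Important

The SMDDP library has discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow versions later than 2.11.0. To find earlier TensorFlow DLCs with the SMDDP library installed, see TensorFlow (deprecated).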


```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name="training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library:
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```
