How to run a distributed training job with the SageMaker distributed data parallelism library

The SageMaker distributed data parallelism (SMDDP) library is designed for ease of use and seamless integration with PyTorch.

When you train a deep learning model with the SMDDP library on SageMaker, you can focus on writing your training script and training your model.

To get started, import the SMDDP library in your training script to use its collective operations optimized for AWS infrastructure. The following topics provide instructions on what to add to your training script depending on which collective operation you want to optimize.
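For orientation, the following is a minimal sketch of the common pattern in a PyTorch training script: importing the library's PyTorch module registers smddp as a torch.distributed backend, and the process group is then initialized with that backend. The exact steps depend on the collective operation you want to optimize, as described in the topics that follow.

```python
import torch.distributed as dist

# Importing this module registers "smddp" as a torch.distributed backend.
import smdistributed.dataparallel.torch.torch_smddp

# Initialize the process group with the SMDDP backend so that collective
# operations such as AllReduce and AllGather use the AWS-optimized
# implementations.
dist.init_process_group(backend="smddp")
```

After the process group is initialized this way, the rest of the training script can use standard torch.distributed APIs; the SMDDP backend handles the inter-node communication.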