Core Features of the SageMaker Model Parallelism Library
Amazon SageMaker's model parallelism library offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, model partitioning by layers for pipeline scheduling, and checkpointing. The model parallelism strategies and techniques help distribute large models across multiple devices while optimizing training speed and memory consumption. The library also provides Python helper functions, context managers, and wrapper functions to adapt your training script for automated or manual partitioning of your model.
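As an illustration of how these strategies are selected, the library's behavior is typically controlled through a parameters dictionary passed to the SageMaker Python SDK. The sketch below builds such a dictionary; the parameter names shown (`partitions`, `microbatches`, `pipeline`, `tensor_parallel_degree`, `sharded_data_parallel_degree`) follow the library's documented configuration keys, but treat the exact values and valid combinations as assumptions to verify against the library version you use.

```python
# Sketch: a configuration dictionary for the SageMaker model parallelism
# library. Parameter names follow the library's documented configuration
# keys; values are illustrative assumptions.
smp_parameters = {
    "partitions": 2,                    # number of model partitions (pipeline stages)
    "microbatches": 4,                  # split each batch for pipeline scheduling
    "pipeline": "interleaved",          # pipeline schedule: "interleaved" or "simple"
    "optimize": "speed",                # partitioning objective: "speed" or "memory"
    "auto_partition": True,             # let the library partition the model by layers
    "tensor_parallel_degree": 2,        # degree of tensor parallelism
    "sharded_data_parallel_degree": 4,  # degree of sharded data parallelism
}

def check_smp_parameters(params):
    """Basic sanity checks on a model parallelism parameter dictionary."""
    assert params.get("partitions", 1) >= 1
    assert params.get("microbatches", 1) >= 1
    assert params.get("pipeline", "interleaved") in ("interleaved", "simple")
    return params

check_smp_parameters(smp_parameters)
```

In practice this dictionary is not used on its own; it is nested inside the `distribution` argument of a SageMaker estimator, as described in the topics that follow.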
When you apply model parallelism to your training job, you keep the same two-step workflow shown in Run a SageMaker Distributed Training Job with Model Parallelism.
To get started with examples, see the following Jupyter notebooks that demonstrate how to use the SageMaker model parallelism library.
To dive deep into the core features of the library, see the following topics.
Note
The SageMaker distributed training libraries are available through the AWS Deep Learning Containers for PyTorch, Hugging Face, and TensorFlow within the SageMaker Training platform. To use the features of the distributed training libraries, we recommend the SageMaker Python SDK. You can also configure the libraries manually in JSON request syntax when you use the SageMaker APIs through the SDK for Python (Boto3) or the AWS Command Line Interface. Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.
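For concreteness, the sketch below assembles the `distribution` argument that the SageMaker Python SDK uses to enable the model parallelism library together with MPI, and prints the equivalent JSON. The dictionary shape mirrors the SDK's documented `distribution` parameter, but the specific values, instance types, and the commented-out estimator call are illustrative assumptions, not a complete training job.

```python
import json

# Sketch: the `distribution` argument for a SageMaker Python SDK estimator
# that enables the model parallelism library. Values are illustrative
# assumptions; check the SDK documentation for supported parameters.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 2,     # number of model partitions
                "microbatches": 4,   # microbatches for pipeline scheduling
            },
        }
    },
    "mpi": {
        "enabled": True,             # the library launches worker processes via MPI
        "processes_per_host": 8,
    },
}

# In a real training job, this dictionary would be passed to an estimator,
# for example (hypothetical values, not executed here):
#
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(
#       entry_point="train.py",            # your adapted training script
#       role="<your-IAM-role-ARN>",
#       instance_type="ml.p4d.24xlarge",   # example instance type
#       instance_count=2,
#       framework_version="1.13",          # example version; check availability
#       py_version="py39",
#       distribution=distribution,
#   )
#   estimator.fit()

# When calling the SageMaker APIs directly (Boto3 or the AWS CLI), the same
# settings are expressed in JSON request syntax:
print(json.dumps(distribution, indent=2))
```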
Important
The SageMaker model parallelism library supports all of its core features for PyTorch and supports pipeline parallelism for TensorFlow.