Release notes for the SageMaker model parallelism library
See the following release notes to track the latest updates for the SageMaker model parallelism (SMP) library. If you have further questions about the SMP library, contact the SMP service team at sm-model-parallel-feedback@amazon.com.
The SageMaker model parallelism library v2.3.1
Date: May 9, 2024
Bug fixes
- Fixed an ImportError issue when using moe_load_balancing=balanced in torch.sagemaker.moe.moe_config.MoEConfig for expert parallelism.
- Fixed a fine-tuning issue where the torch.sagemaker.transform call raised KeyError when load_state_dict_from_rank0 is enabled.
- Fixed an out-of-memory (OOM) error raised when loading large Mixture of Experts (MoE) models, such as Mixtral 8x22B, for fine-tuning.
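The load-balancing option mentioned in the first fix is passed to the MoEConfig class. The following sketch shows only the shape of that configuration; it uses a plain Python dict so it runs outside an SMP container, since torch.sagemaker is available only inside the SMP Docker images.

```python
# Sketch: the moe_load_balancing setting whose ImportError was fixed in v2.3.1.
# torch.sagemaker is only available inside the SMP Docker containers, so this
# stand-in dict mirrors how the option would be passed to
# torch.sagemaker.moe.moe_config.MoEConfig.
moe_config_kwargs = {
    "moe_load_balancing": "balanced",  # the value that previously raised ImportError
}

# Inside an SMP container, the real construction would look like:
#   from torch.sagemaker.moe.moe_config import MoEConfig
#   moe_config = MoEConfig(moe_load_balancing="balanced")
print(moe_config_kwargs["moe_load_balancing"])
```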
SMP Docker container
The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. This release incorporates the preceding bug fixes into the following SMP Docker image.
- SMP Docker container for PyTorch v2.2.0 with CUDA v12.1
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
The SageMaker model parallelism library v2.3.0
Date: April 11, 2024
New features
- Added a new core feature, expert parallelism, to support Mixture of Experts transformer models. To learn more, see Expert parallelism.
SMP Docker container
The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.214.4 or later.
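The distribution configuration mentioned above can be sketched as follows. The dict shape follows the SageMaker Python SDK convention for enabling SMP v2; the hybrid_shard_degree value is illustrative only, so check the SMP v2 configuration reference for the full parameter list.

```python
# Sketch: a distribution configuration that tells the SageMaker PyTorch
# estimator to use SMP v2, so SageMaker picks up the SMP Docker container.
# The "hybrid_shard_degree" value is illustrative, not a recommendation.
distribution = {
    "torch_distributed": {"enabled": True},
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {"hybrid_shard_degree": 8},  # illustrative value
        }
    },
}

# This dict would be passed as sagemaker.pytorch.PyTorch(..., distribution=distribution)
# with SageMaker Python SDK v2.214.4 or later.
print(distribution["smdistributed"]["modelparallel"]["enabled"])
```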
- SMP Docker container for PyTorch v2.2.0 with CUDA v12.1
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
- Pre-installed packages in this Docker container
  - The SMDDP library v2.2.0
  - CUDNN v8.9.5.29
  - FlashAttention v2.3.3
  - TransformerEngine v1.2.1
  - Hugging Face Transformers v4.37.1
  - Hugging Face Datasets library v2.16.1
  - Megatron-core 0.5.0
  - EFA v1.30.0
  - NCCL v2.19.4
The SageMaker model parallelism library v2.2.0
Date: March 7, 2024
New features
- Added support for FP8 training of the following Hugging Face transformer models on P5 instances with Transformer Engine integration:
  - GPT-NeoX
  - Llama 2
Bug fixes
- Fixed a bug where tensors were not guaranteed to be contiguous before the AllGather collective call during tensor parallelism training.
Currency updates
- Added support for PyTorch v2.2.0.
- Upgraded the SMDDP library to v2.2.0.
- Upgraded the FlashAttention library to v2.3.3.
- Upgraded the NCCL library to v2.19.4.
Deprecation
- Discontinued support for Transformer Engine versions before v1.2.0.
Known issues
- The SMP activation offloading feature currently does not work. Use the native PyTorch activation offloading instead.
Other changes
- Included a patch to fix the performance regression discussed in the issue thread at https://github.com/pytorch/pytorch/issues/117748 in the PyTorch GitHub repository.
SMP Docker container
The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.212.0 or later.
- SMP Docker container for PyTorch v2.2.0 with CUDA v12.1
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  Available for P4d, P4de, and P5 instances
- Pre-installed packages in this Docker container
  - The SMDDP library v2.2.0
  - CUDNN v8.9.5.29
  - FlashAttention v2.3.3
  - TransformerEngine v1.2.1
  - Hugging Face Transformers v4.37.1
  - Hugging Face Datasets library v2.16.1
  - EFA v1.30.0
  - NCCL v2.19.4
The SageMaker model parallelism library v2.1.0
Date: February 6, 2024
Currency updates
- Added support for PyTorch v2.1.2.
Deprecation
- Discontinued support for Hugging Face Transformers v4.31.0.
Known issues
- An issue was discovered where fine-tuning the Hugging Face Llama 2 model with attn_implementation=flash_attention_2 and FSDP causes the model to diverge. For reference, see the issue ticket in the Hugging Face Transformers GitHub repository. To avoid the divergence issue, use attn_implementation=sdpa. Alternatively, use the SMP transformer model implementation by setting use_smp_implementation=True.
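The two workarounds above can be sketched as keyword arguments. Plain dicts are used so the sketch runs without transformers or torch.sagemaker installed; inside a training script, the first dict would be unpacked into AutoModelForCausalLM.from_pretrained, and the second into the SMP configuration.

```python
# Sketch: the two workarounds for the Llama 2 divergence issue, expressed
# as keyword arguments. These are plain dicts so the example is runnable
# outside a SageMaker training environment.

# Workaround 1: load the Hugging Face model with the SDPA attention backend
# instead of flash_attention_2.
hf_kwargs = {"attn_implementation": "sdpa"}

# Workaround 2: switch to the SMP transformer model implementation.
smp_kwargs = {"use_smp_implementation": True}

print(hf_kwargs["attn_implementation"], smp_kwargs["use_smp_implementation"])
```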
SMP Docker container
The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.207.0 or later.
- SMP Docker container for PyTorch v2.1.2 with CUDA v12.1
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
  Available for P4d, P4de, and P5 instances
- Pre-installed packages in this Docker container
  - The SMDDP library v2.1.0
  - CUDNN v8.9.5.29
  - FlashAttention v2.3.3
  - TransformerEngine v1.2.1
  - Hugging Face Transformers v4.37.1
  - Hugging Face Datasets library v2.16.1
  - EFA v1.30.0
SMP Conda channel
The following S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in a highly customizable compute environment, such as a SageMaker HyperPod cluster, use this Conda channel to properly install the SMP library.
- https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/

For more information about Conda channels in general, see Channels in the Conda documentation.
The SageMaker model parallelism library v2.0.0
Date: December 19, 2023
New features
Released the SageMaker model parallelism (SMP) library v2.0.0 with the following new offerings.
- A new torch.sagemaker package, entirely revamped from the previous smdistributed.modelparallel.torch package in SMP v1.x.
- Support for PyTorch 2.0.1.
- Support for PyTorch FSDP.
- Tensor parallelism implementation through integration with the Transformer Engine library.
- Support for both SageMaker Training and SageMaker HyperPod.
Breaking changes
- SMP v2 entirely revamped the APIs and provides the torch.sagemaker package. In most cases, you only need to initialize with the torch.sagemaker.init() module and pass model parallel configuration parameters. With this new package, you can significantly simplify code modifications in your training script. To learn more about adapting your training script to use SMP v2, see Get started with the SageMaker model parallelism library v2.
- If you've used SMP v1 for training Hugging Face Transformer models and want to reuse the models in SMP v2, see Upgrade from SMP v1 to SMP v2.
- For PyTorch FSDP training, you should use SMP v2.
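The initialization pattern described above can be sketched as follows. The torch.sagemaker import works only inside an SMP v2 container, so it is shown as a comment here; the parameter names in the config dict are assumptions drawn from the SMP v2 configuration reference, not guaranteed names.

```python
# Sketch of the SMP v2 training-script pattern: initialize torch.sagemaker,
# then proceed with PyTorch FSDP as usual. The parameter names below are
# assumptions for illustration; consult the SMP v2 configuration reference.
smp_config = {
    "hybrid_shard_degree": 8,    # illustrative sharded data parallel degree
    "tensor_parallel_degree": 1,  # illustrative; 1 disables tensor parallelism
}

# Inside an SMP v2 container, the script would do roughly:
#   import torch.sagemaker as tsm
#   tsm.init(smp_config)
#   ...then wrap the model with PyTorch FSDP as usual.
print(sorted(smp_config))
```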
Known issues
- Activation checkpointing currently works only with the following wrapping policy with FSDP:
  - auto_wrap_policy = functools.partial(transformer_auto_wrap_policy, ...)
- To use activation offloading, the FSDP activation checkpointing type must be REENTRANT.
- When running with tensor parallelism enabled and the sharded data parallel degree set to 1, you must use backend = nccl. The smddp backend option is not supported in this scenario.
- Transformer Engine is required to use PyTorch with the SMP library, even when not using tensor parallelism.
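The wrapping-policy pattern from the first known issue can be sketched as below. The real transformer_auto_wrap_policy lives in torch.distributed.fsdp.wrap; the stand-in function here mirrors its signature so the partial-application pattern runs without torch installed, and TransformerBlock is a hypothetical layer class.

```python
# Sketch: the only supported activation-checkpointing wrapping policy
# pattern, functools.partial over transformer_auto_wrap_policy. The
# stand-in below mimics torch.distributed.fsdp.wrap.transformer_auto_wrap_policy,
# which wraps a module when it is an instance of a given layer class.
import functools

def transformer_auto_wrap_policy(module, recurse, nonwrapped_numel, transformer_layer_cls):
    # Stand-in for the real policy: keep recursing, and wrap modules that
    # are instances of one of the given transformer layer classes.
    return recurse or isinstance(module, tuple(transformer_layer_cls))

class TransformerBlock:  # hypothetical stand-in for a model's transformer layer class
    pass

auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)

# FSDP would call the policy once per module during wrapping:
print(auto_wrap_policy(TransformerBlock(), recurse=False, nonwrapped_numel=0))
```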
Other changes
- Starting from this release, the documentation for the SageMaker model parallelism library is fully available in this Amazon SageMaker Developer Guide. In favor of this complete developer guide for SMP v2, the additional reference for SMP v1.x in the SageMaker Python SDK documentation is deprecated. If you still need the documentation for SMP v1.x, the developer guide is available at (Archived) SageMaker model parallelism library v1.x, and the SMP Python library v1.x reference is available in the SageMaker Python SDK v2.199.0 documentation.
Deprecations
- Discontinued support for TensorFlow.
- There is no pipeline parallelism support in SMP v2.
- There is no support for the DeepSpeed library, in favor of native PyTorch FSDP.
SMP Docker container
The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.207.0 or later.
- SMP Docker container for PyTorch v2.0.1 with CUDA v12.1
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.0.1-gpu-py310-cu121