SageMaker data parallelism library release notes
See the following release notes to track the latest updates for the SageMaker distributed data parallelism (SMDDP) library.
The SageMaker distributed data parallelism library v2.2.0
Date: March 4, 2024
New features
- Added support for PyTorch v2.2.0 with CUDA v12.1.
Integration into Docker containers distributed by the SageMaker model parallelism (SMP) library
This version of the SMDDP library is migrated to the SageMaker model parallelism library v2.2.0.
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
For Regions where the SMP Docker images are available, see Amazon Web Services Regions.
Binary file of this release
You can download or install the library using the following URL.
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl
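As a sketch, the wheel above can be installed with pip into an existing Python 3.10 environment that already has PyTorch v2.2.0 with CUDA v12.1 (the URL is the one from this release; the target environment is an assumption):

```shell
# Install the SMDDP v2.2.0 binary wheel from this release into an existing
# PyTorch 2.2.0 / CUDA 12.1 / Python 3.10 environment on x86_64 Linux.
pip install https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl
```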
The SageMaker distributed data parallelism library v2.1.0
Date: March 1, 2024
New features
- Added support for PyTorch v2.1.0 with CUDA v12.1.
Bug fixes
- Fixed the CPU memory leak issue in SMDDP v2.0.1.
Integration into SageMaker Framework Containers
This version of the SMDDP library passed benchmark testing and is migrated to the following SageMaker Framework Container:
- PyTorch v2.1.0
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker
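To illustrate, a training job that pulls this framework container and activates SMDDP can be configured through the SageMaker Python SDK roughly as follows; the entry point, role, and instance settings are hypothetical placeholders, not values from this release:

```python
# Sketch (SageMaker Python SDK): launch a PyTorch 2.1.0 training job with the
# SMDDP data parallelism backend enabled. The entry point, role, and instance
# settings below are placeholders for illustration only.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",           # your training script (placeholder)
    role="<your-sagemaker-execution-role>",
    framework_version="2.1.0",        # matches the container in this release
    py_version="py310",
    instance_type="ml.p4d.24xlarge",  # SMDDP supports a limited set of instance types
    instance_count=2,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()
```

This only runs against a real AWS account with SageMaker permissions; it is not a standalone script.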
Integration into Docker containers distributed by the SageMaker model parallelism (SMP) library
This version of the SMDDP library is migrated to the SageMaker model parallelism library v2.1.0.
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
For Regions where the SMP Docker images are available, see Amazon Web Services Regions.
Binary file of this release
You can download or install the library using the following URL.
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
The SageMaker distributed data parallelism library v2.0.1
Date: December 7, 2023
New features
- Added a new SMDDP implementation of the AllGather collective operation, optimized for Amazon compute resources and network infrastructure. To learn more, see SMDDP AllGather collective operation.
- The SMDDP AllGather collective operation is compatible with PyTorch FSDP and DeepSpeed. To learn more, see Use the SMDDP library in your PyTorch training script.
- Added support for PyTorch v2.0.1.
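As a minimal sketch of what this integration looks like in a training script (assuming a SageMaker training environment where the SMDDP binary is installed), importing the library's PyTorch module registers an `smddp` backend that can be passed to `torch.distributed`:

```python
# Sketch: activate the SMDDP collectives in a PyTorch distributed training
# script. Importing torch_smddp registers the "smddp" process-group backend.
# This only runs inside a SageMaker training environment with SMDDP installed.
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401 (registers "smddp")

dist.init_process_group(backend="smddp")
# From here, wrap the model with PyTorch FSDP or DDP as usual.
```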
Known issues
- There is a CPU memory leak: CPU memory usage gradually increases while training with the SMDDP AllReduce collective operation in DDP mode.
Integration into SageMaker Framework Containers
This version of the SMDDP library passed benchmark testing and is migrated to the following SageMaker Framework Container:
- PyTorch v2.0.1
763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
Binary file of this release
You can download or install the library using the following URL.
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl
Other changes
- Starting from this release, documentation for the SMDDP library is fully available in this Amazon SageMaker Developer Guide. In favor of the complete developer guide for SMDDP v2 in this guide, the additional reference for SMDDP v1.x in the SageMaker Python SDK documentation is no longer maintained. If you still need the SMDDP v1.x documentation, see the snapshot of it in the SageMaker Python SDK v2.212.0 documentation.