Amazon Deep Learning Containers for PyTorch 2.5 Training on EC2, ECS and EKS

Amazon Deep Learning Containers (DLCs) for Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), and Amazon Elastic Kubernetes Service (EKS) are now available with PyTorch 2.5 and support for CUDA 12.4 on Ubuntu 22.04. You can launch the new versions of the Deep Learning Containers on any of the EC2, ECS and EKS services. For a complete list of frameworks and versions supported by the Amazon Deep Learning Containers, see below.

This release includes container images for training on GPU, optimized for performance and scale on Amazon. These Docker images have been tested with the EC2, ECS, and EKS services, and they ship stable versions of NVIDIA CUDA, Intel MKL, and other components for running deep learning workloads on Amazon. All software components in these images are scanned for security vulnerabilities and updated or patched in accordance with Amazon Security best practices. These new DLCs are designed to be used on any of the EC2, ECS, and EKS services. If you are looking for a DLC to use with SageMaker, please refer to this documentation.

A list of available containers can be found in our documentation. Get started quickly with the Amazon Deep Learning Containers using the getting-started guides and the beginner- to advanced-level tutorials in our developer guide. You can also subscribe to our discussion forum to get launch announcements and post your questions.

Release Notes

  • Introduced containers for PyTorch 2.5.1 for training which support EC2, ECS, and EKS. For details about this release, check out our GitHub release tag.

  • PyTorch 2.5 features a new cuDNN backend for scaled dot-product attention (SDPA), enabling speedups by default for users of SDPA on H100 or newer GPUs. Additionally, regional compilation of torch.compile reduces the cold start-up time of torch.compile by letting users compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Finally, the TorchInductor CPP backend offers solid performance speedups through numerous enhancements such as FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.

  • Includes the fix for wheels from PyPI being unusable out-of-the-box on RPM-based Linux distributions, as addressed in PyTorch 2.5.1.

  • Please refer to the official PyTorch 2.5.0 release notes here and PyTorch 2.5.1 release notes here for the full description of updates.

  • NVIDIA/apex has been removed in favor of native torch operations. For more information on migrating from apex to torch built-in operations, see here.

  • Added Python 3.11 support

  • Added CUDA 12.4 support

  • Added Ubuntu 22.04 support

  • The GPU Docker Image includes the following libraries:

    • CUDA 12.4.1

    • cuDNN 9.1.0.70

    • NCCL 2.23.4

    • Amazon OFI NCCL plugin 1.12.1

    • EFA installer 1.36.0

    • Transformer Engine 1.11

    • Flash Attention 2.6.3

    • GDRCopy 2.4.2

  • The Dockerfile for CPU can be found here, and the Dockerfile for GPU can be found here.

For the latest updates, please refer to the aws/deep-learning-containers GitHub repo.

Security Advisory

Amazon recommends that customers monitor critical security updates in the Amazon Security Bulletin.

Python 3.11 Support

Python 3.11 is supported in the PyTorch Training containers.

CPU Instance Type Support

The containers support x86_64 instance types.

GPU Instance Type Support

The containers support GPU instance types and contain the following software components for GPU support:

  • CUDA 12.4.1

  • cuDNN 9.1.0.70+cuda12.4

  • NCCL 2.23.4+cuda12.4
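
The component versions listed above can be checked from inside a running container. The following is a minimal sketch and not part of the release: the helper names (EXPECTED, matches, find_mismatches, collect_versions) are hypothetical, the expected values are copied from this page, and the cuDNN decoding assumes the cuDNN 9 integer encoding (major * 10000 + minor * 100 + patch).

```python
"""Compare library versions found in a running container with the documented ones."""
import sys

# Versions copied from this page; build suffixes such as "+cuda12.4" are
# dropped before comparison.
EXPECTED = {
    "python": "3.11",
    "torch": "2.5.1",
    "cuda": "12.4",
    "cudnn": "9.1.0",
    "nccl": "2.23.4",
}


def matches(expected: str, observed: str) -> bool:
    """True when `observed` begins with every dotted component of `expected`."""
    want = expected.split(".")
    got = observed.split("+", 1)[0].split(".")  # strip a local suffix like "+cu124"
    return got[: len(want)] == want


def find_mismatches(observed: dict) -> dict:
    """Return {name: (expected, observed)} for each missing or differing component."""
    return {
        name: (want, observed.get(name))
        for name, want in EXPECTED.items()
        if observed.get(name) is None or not matches(want, observed[name])
    }


def collect_versions() -> dict:
    """Read versions from the interpreter and, when it is installed, from PyTorch."""
    found = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        import torch
    except ImportError:
        return found  # not running inside a PyTorch environment
    found["torch"] = torch.__version__
    if torch.version.cuda:  # None on CPU-only builds
        found["cuda"] = torch.version.cuda
    cudnn = torch.backends.cudnn.version()  # integer such as 90100, or None
    if cudnn:
        found["cudnn"] = f"{cudnn // 10000}.{cudnn // 100 % 100}.{cudnn % 100}"
    if torch.cuda.is_available():
        # Assumption: nccl.version() returns a (major, minor, patch, ...) tuple.
        found["nccl"] = ".".join(str(part) for part in torch.cuda.nccl.version())
    return found


if __name__ == "__main__":
    for name, (want, got) in find_mismatches(collect_versions()).items():
        print(f"{name}: expected {want}, found {got}")
```

Run on a GPU instance inside the container, the script should print nothing when every component matches; any line it prints names a component whose version differs from this page or could not be detected.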

Amazon Regions Support

The containers are available in the following regions:

Region                      Code
US East (Ohio)              us-east-2
US East (N. Virginia)       us-east-1
US West (Oregon)            us-west-2
US West (N. California)     us-west-1
Africa (Cape Town)          af-south-1
Asia Pacific (Hong Kong)    ap-east-1
Asia Pacific (Hyderabad)    ap-south-2
Asia Pacific (Mumbai)       ap-south-1
Asia Pacific (Osaka)        ap-northeast-3
Asia Pacific (Seoul)        ap-northeast-2
Asia Pacific (Tokyo)        ap-northeast-1
Asia Pacific (Melbourne)    ap-southeast-4
Asia Pacific (Jakarta)      ap-southeast-3
Asia Pacific (Sydney)       ap-southeast-2
Asia Pacific (Singapore)    ap-southeast-1
Asia Pacific (Malaysia)     ap-southeast-5
Canada (Central)            ca-central-1
Canada (Calgary)            ca-west-1
EU (Zurich)                 eu-central-2
EU (Frankfurt)              eu-central-1
EU (Ireland)                eu-west-1
EU (London)                 eu-west-2
EU (Paris)                  eu-west-3
EU (Spain)                  eu-south-2
EU (Milan)                  eu-south-1
EU (Stockholm)              eu-north-1
Israel (Tel Aviv)           il-central-1
Middle East (Bahrain)       me-south-1
Middle East (UAE)           me-central-1
SA (São Paulo)              sa-east-1
China (Beijing)             cn-north-1
China (Ningxia)             cn-northwest-1
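
The Region codes listed above can be used to assemble an image reference for a container registry. The sketch below is illustrative only and rests on assumptions not stated on this page: the repository name pytorch-training and the tag layout 2.5.1-gpu-py311-cu124-ubuntu22.04-ec2 are assumed (confirm the exact value in the list of available containers), image_uri and registry_domain are hypothetical helpers, and the registry account ID is left as a parameter because it can differ by Region.

```python
"""Build a container image reference for a supported Region (hypothetical helper)."""

# Region codes copied from the table on this page.
SUPPORTED_REGIONS = frozenset({
    "us-east-2", "us-east-1", "us-west-2", "us-west-1", "af-south-1",
    "ap-east-1", "ap-south-2", "ap-south-1", "ap-northeast-3",
    "ap-northeast-2", "ap-northeast-1", "ap-southeast-4", "ap-southeast-3",
    "ap-southeast-2", "ap-southeast-1", "ap-southeast-5", "ca-central-1",
    "ca-west-1", "eu-central-2", "eu-central-1", "eu-west-1", "eu-west-2",
    "eu-west-3", "eu-south-2", "eu-south-1", "eu-north-1", "il-central-1",
    "me-south-1", "me-central-1", "sa-east-1", "cn-north-1", "cn-northwest-1",
})

# Assumed repository name and tag layout; verify both before use.
REPOSITORY = "pytorch-training"
IMAGE_TAG = "2.5.1-gpu-py311-cu124-ubuntu22.04-ec2"


def registry_domain(region: str) -> str:
    """Return the registry host suffix; China Regions use amazonaws.com.cn."""
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"dkr.ecr.{region}.{suffix}"


def image_uri(account_id: str, region: str) -> str:
    """Return '<account>.dkr.ecr.<region>.<suffix>/<repository>:<tag>'."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"Region {region!r} is not in the supported list")
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"{account_id}.{registry_domain(region)}/{REPOSITORY}:{IMAGE_TAG}"


if __name__ == "__main__":
    # 123456789012 is a placeholder, not the real registry account.
    print(image_uri("123456789012", "us-west-2"))
```

Rejecting unknown Region codes early keeps a typo from surfacing later as an opaque image pull failure.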

Build and Test

  • Built on: c5.18xlarge

  • Tested on: g3.16xlarge, p3.16xlarge, p3dn.24xlarge, p4d.24xlarge, p4de.24xlarge, g4dn.xlarge, p5.48xlarge

  • Tested with ResNet50 and BERT, along with ImageNet datasets, on EC2, the ECS AMI (Amazon Linux AMI 2.0.20240515), and the EKS AMI (amazon-eks-gpu-node-1.25.16-20240514)

Known Issues

  • Customers using Transformer Engine may see the following warning, which results from the deprecation of NVFuser since PyTorch 2.2: "[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())". For more information, please check this issue.