Amazon Deep Learning OSS AMI GPU PyTorch 2.7 (Ubuntu 22.04)
For help getting started, see Getting started with DLAMI.
AMI name format
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04) ${YYYY-MM-DD}
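In practice the date stamp in released AMI names is an 8-digit YYYYMMDD suffix (see the release entries below). As an illustrative sketch, a name can be checked against that pattern (the example name is taken from this document):

```shell
# Validate an AMI name against the documented name format; released names
# end in an 8-digit YYYYMMDD date stamp (the ${YYYY-MM-DD} placeholder above).
name='Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04) 20250602'
if printf '%s\n' "$name" | grep -Eq '^Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2\.7 \(Ubuntu 22\.04\) [0-9]{8}$'; then
  echo match
fi
```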
Supported EC2 instances
Please refer to Important changes to DLAMI
G4dn, G5, G6, Gr6, P4, P4de, P5, P5e, P5en, P6-B200
The AMI includes the following:
Supported Amazon Service: Amazon EC2
Operating System: Ubuntu 22.04
Compute Architecture: x86
Linux Kernel: 6.8
NVIDIA Driver: 570.133.20
NVIDIA CUDA 12.8 stack:
CUDA, NCCL and cuDNN installation directories: /usr/local/cuda-12.8/
NCCL Tests Location:
all_reduce, all_gather, and reduce_scatter:
/usr/local/cuda-12.8/efa/test-cuda-12.8/
To run NCCL tests, LD_LIBRARY_PATH is already updated with needed paths.
Common PATHs are already added to LD_LIBRARY_PATH:
/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/amazon/ofi-nccl/lib:/usr/local/lib:/usr/lib
LD_LIBRARY_PATH is updated with CUDA version paths:
/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
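As a hedged sketch of how the prebuilt tests above might be invoked: the binary location comes from this AMI, the flags are standard nccl-tests options, and the GPU count of 8 is an assumption for a single multi-GPU instance (the command is only assembled and printed here; run it on the instance itself):

```shell
# Assemble the command line for the prebuilt all_reduce NCCL test.
# -b/-e set min/max message sizes, -f the size multiplication factor,
# -g the number of GPUs per process (assumed 8 here).
NCCL_TEST_DIR=/usr/local/cuda-12.8/efa/test-cuda-12.8
CMD="$NCCL_TEST_DIR/all_reduce_perf -b 8 -e 1G -f 2 -g 8"
echo "$CMD"   # on the instance, execute $CMD directly
```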
Compiled NCCL Version:
For the CUDA 12.8 directory, the compiled NCCL version is 2.26.2+CUDA12.8
Default CUDA: 12.8
The /usr/local/cuda symlink points to CUDA 12.8
Updated environment variables:
LD_LIBRARY_PATH includes /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib
PATH includes /usr/local/cuda/bin/:/usr/local/cuda/include/
EFA Installer: 1.40.0
Nvidia GDRCopy: 2.5
Nvidia Transformer Engine: 1.11.0
Amazon OFI NCCL: 1.14.2-aws
Installation path: /opt/amazon/ofi-nccl/. Path /opt/amazon/ofi-nccl/lib is added to LD_LIBRARY_PATH
AWS CLI v2 at /usr/local/bin/aws
EBS volume type: gp3
Nvidia container toolkit: 1.17.7
Version command: nvidia-container-cli -V
Docker: 28.2.2
Python: /usr/bin/python3.12
Query AMI-ID with SSM Parameter (example region is us-east-1):
aws ssm get-parameter --region us-east-1 \
  --name /aws/service/deeplearning/ami/x86_64/oss-nvidia-driver-gpu-pytorch-2.7-ubuntu-22.04/latest/ami-id \
  --query "Parameter.Value" \
  --output text

Query AMI-ID with AWS CLI (example region is us-east-1):
aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
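The SSM parameter path above follows a predictable pattern. As a sketch, a small helper (the function name `ssm_param_name` is hypothetical, not part of the AMI) can assemble that path for a given framework and OS version:

```shell
# Build the SSM parameter name for a DLAMI, following the documented
# path pattern: the framework version and Ubuntu version vary per series.
ssm_param_name() {
  fw_ver="$1"; os_ver="$2"
  printf '/aws/service/deeplearning/ami/x86_64/oss-nvidia-driver-gpu-pytorch-%s-ubuntu-%s/latest/ami-id\n' "$fw_ver" "$os_ver"
}
ssm_param_name 2.7 22.04
```

The result can be passed directly to `aws ssm get-parameter --name`.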
Notices
Flash Attention
Flash Attention does not yet have an official release for PyTorch 2.7. For this reason, it is temporarily removed from this AMI; once an official release is made for PyTorch 2.7, we will include it in this AMI. Without Flash Attention, Transformer Engine defaults to cuDNN fused attention. There are currently known issues with fused attention on Blackwell GPUs, such as P6-B200 instances.
"With compute capability sm10.0 (Blackwell-architecture) GPUs, the FP8 datatype with scaled dot-product attention contains a deadlock that causes the kernel to hang under some circumstances, such as when the problem size is large or the GPU is running multiple kernels simultaneously. A fix is planned for a future release." [cuDNN 9.10.0 release notes] For users seeking to run P6-B200 instances with FP8 data and scaled dot-product attention, please consider installing Flash Attention manually.
P6-B200 Instances
P6-B200 instances require CUDA 12.8 or later and NVIDIA driver version 570 or newer.
P6-B200 instances contain 8 network interface cards and can be launched using the following AWS CLI command:
aws ec2 run-instances --region $REGION \
  --instance-type $INSTANCETYPE \
  --image-id $AMI --key-name $KEYNAME \
  --iam-instance-profile "Name=dlami-builder" \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
  --network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  ... \
  "NetworkCardIndex=7,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
P5/P5e Instances
DeviceIndex is unique to each NetworkCard and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, meaning that the only valid values for DeviceIndex are 0 and 1. Below is an example EC2 P5 instance launch command using the AWS CLI, showing NetworkCardIndex numbers 0-31 with DeviceIndex 0 for the first interface and 1 for the remaining 31 interfaces.
aws ec2 run-instances --region $REGION \
  --instance-type $INSTANCETYPE \
  --image-id $AMI --key-name $KEYNAME \
  --iam-instance-profile "Name=dlami-builder" \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
  --network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  "NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
  ... \
  "NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"
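The repetitive --network-interfaces arguments can also be generated in a loop rather than written out by hand. A minimal sketch following the DeviceIndex rule described above (the security-group and subnet IDs are placeholders):

```shell
# Generate the 32 EFA network-interface specs for a P5 instance:
# DeviceIndex is 0 on NetworkCardIndex 0 and 1 on every remaining card.
SG=sg-0123456789abcdef0          # placeholder security group ID
SUBNET=subnet-0123456789abcdef0  # placeholder subnet ID
for i in $(seq 0 31); do
  if [ "$i" -eq 0 ]; then d=0; else d=1; fi
  printf '"NetworkCardIndex=%s,DeviceIndex=%s,Groups=%s,SubnetId=%s,InterfaceType=efa"\n' \
    "$i" "$d" "$SG" "$SUBNET"
done
```

The printed lines can be pasted into (or command-substituted into) the run-instances call.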
Kernel
Kernel version is pinned using command:
echo linux-aws hold | sudo dpkg --set-selections
echo linux-headers-aws hold | sudo dpkg --set-selections
echo linux-image-aws hold | sudo dpkg --set-selections
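To check which of these packages are currently held, the dpkg selections can be filtered. The sketch below simulates the dpkg output with example lines so it is self-contained; on the AMI, pipe the real `dpkg --get-selections` output instead:

```shell
# Filter packages marked "hold" whose names match the pinned kernel packages.
# The printf simulates `dpkg --get-selections` output for illustration only.
printf 'linux-aws\thold\nlinux-headers-aws\thold\nlinux-image-aws\thold\nvim\tinstall\n' \
  | awk '$2 == "hold" && $1 ~ /linux.*aws/ {print $1}'
```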
We recommend that users avoid updating their kernel version (except for security patches) to ensure compatibility with the installed drivers and package versions. Users who still wish to update can run the following commands to unpin their kernel version:
echo linux-aws install | sudo dpkg --set-selections
echo linux-headers-aws install | sudo dpkg --set-selections
echo linux-image-aws install | sudo dpkg --set-selections
sudo apt-get upgrade -y
For each new version of the DLAMI, the latest available compatible kernel is used.
PyTorch Deprecation of Anaconda Channel
Starting with PyTorch 2.6, PyTorch has deprecated support for Conda (see the official announcement).
Release Date: 2025-06-03
AMI name: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04) 20250602
Added
Initial release of the Deep Learning AMI GPU PyTorch 2.7 (Ubuntu 22.04) series, including a Python virtual environment pytorch (source /opt/pytorch/bin/activate) complemented with NVIDIA Driver R570, CUDA=12.8, cuDNN=9.10, PyTorch NCCL=2.26.5, and EFA=1.40.0.
Known Issues
"With compute capability sm10.0 (Blackwell-architecture) GPUs, the FP8 datatype with scaled dot-product attention contains a deadlock that causes the kernel to hang under some circumstances, such as when the problem size is large or the GPU is running multiple kernels simultaneously. A fix is planned for a future release." [cuDNN 9.10.0 release notes] For users seeking to run P6-B200 instances with FP8 data and scaled dot-product attention, please consider installing Flash Attention manually.