Machine learning training using Elastic Fabric Adapter
This capability is not available in China Amazon Web Services Regions.
This topic describes how to integrate Elastic Fabric Adapter (EFA) with pods deployed in
your Amazon EKS cluster. Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances
that enables you to run applications requiring high levels of inter-node communications at
scale on Amazon. Its custom-built operating system bypass hardware interface enhances the
performance of inter-instance communications, which is critical to scaling these
applications. With EFA, High Performance Computing (HPC) applications using the Message
Passing Interface (MPI) and Machine Learning (ML) applications using NVIDIA Collective
Communications Library (NCCL) can scale to thousands of CPUs or GPUs. As a result, you get
the application performance of on-premises HPC clusters with the on-demand elasticity and
flexibility of the Amazon cloud. Integrating EFA with applications running on Amazon EKS clusters
can reduce the time to complete large-scale distributed training workloads without having to
add more instances to your cluster. For more information about EFA, see Elastic Fabric Adapter.
The EFA plugin described in this topic fully supports Amazon EC2 P4d instances, which represent the current state of the art in distributed machine learning in the cloud. Each p4d.24xlarge instance has eight NVIDIA A100 GPUs and 400 Gbps GPUDirectRDMA over EFA. GPUDirectRDMA enables you to have direct GPU-to-GPU communication across nodes with CPU bypass, increasing collective communication bandwidth and lowering latency. Amazon EKS and EFA integration with P4d instances provides a seamless method to take advantage of the highest-performing Amazon EC2 computing instance for distributed machine learning training.
Prerequisites
- An existing Amazon EKS cluster. If you don't have an existing cluster, use one of our Getting started with Amazon EKS guides to create one. Your cluster must be deployed in a VPC that has at least one private subnet with enough available IP addresses to deploy nodes in. The private subnet must have outbound internet access provided by an external device, such as a NAT gateway. If you plan to use eksctl to create your node group, eksctl can also create a cluster for you.
- Version 2.11.3 or later or 1.27.93 or later of the Amazon CLI installed and configured on your device or Amazon CloudShell. You can check your current version with aws --version | cut -d / -f2 | cut -d ' ' -f1. Package managers such as yum, apt-get, or Homebrew for macOS are often several versions behind the latest version of the Amazon CLI. To install the latest version, see Installing, updating, and uninstalling the Amazon CLI and Quick configuration with aws configure in the Amazon Command Line Interface User Guide. The Amazon CLI version installed in Amazon CloudShell may also be several versions behind the latest version. To update it, see Installing Amazon CLI to your home directory in the Amazon CloudShell User Guide.
- The kubectl command line tool is installed on your device or Amazon CloudShell. The version can be the same as or up to one minor version earlier or later than the Kubernetes version of your cluster. For example, if your cluster version is 1.24, you can use kubectl version 1.23, 1.24, or 1.25 with it. To install or upgrade kubectl, see Installing or updating kubectl.
- You must have the Amazon VPC CNI plugin for Kubernetes version 1.7.10 or later installed before launching worker nodes that support multiple Elastic Fabric Adapters, such as the p4d.24xlarge. For more information about updating your Amazon VPC CNI plugin for Kubernetes version, see Working with the Amazon VPC CNI plugin for Kubernetes Amazon EKS add-on.
Create node group
The following procedure helps you create a node group of p4d.24xlarge nodes with EFA interfaces and GPUDirect RDMA, and run an example NVIDIA Collective Communications Library (NCCL) test for multi-node NCCL performance using EFAs. The example can be used as a template for distributed deep learning training on Amazon EKS using EFAs.
- Determine which Amazon EC2 instance types that support EFA are available in the Amazon Web Services Region that you want to deploy nodes in. Replace region-code with the Amazon Web Services Region that you want to deploy your node group in.
  aws ec2 describe-instance-types --region region-code --filters Name=network-info.efa-supported,Values=true \
      --query "InstanceTypes[*].[InstanceType]" --output text
  When you deploy nodes, the instance type that you want to deploy must be available in the Amazon Web Services Region that your cluster is in.
- Determine which Availability Zones the instance type that you want to deploy is available in. In this tutorial, the p4d.24xlarge instance type is used and must be returned in the output for the Amazon Web Services Region that you specified in the previous step. When you deploy nodes in a production cluster, replace p4d.24xlarge with any instance type returned in the previous step.
  aws ec2 describe-instance-type-offerings --region region-code --location-type availability-zone \
      --filters Name=instance-type,Values=p4d.24xlarge --query 'InstanceTypeOfferings[*].Location' --output text
  The example output is as follows.
  cn-north-1a
  cn-north-1c
  cn-north-1b
  Note the Availability Zones returned for use in later steps. When you deploy nodes to a cluster, your VPC must have subnets with available IP addresses in one of the Availability Zones returned in the output.
- Create a node group using either eksctl or the Amazon CLI and Amazon CloudFormation. A minimal eksctl configuration sketch follows this procedure.
- Deploy the EFA Kubernetes device plugin. The EFA Kubernetes device plugin detects and advertises EFA interfaces as allocatable resources to Kubernetes. An application can consume the extended resource type vpc.amazonaws.com/efa in a pod request spec just like CPU and memory. For more information, see Consuming extended resources in the Kubernetes documentation. Once requested, the plugin automatically assigns and mounts an EFA interface to the pod. Using the device plugin simplifies EFA setup and does not require a pod to run in privileged mode. A minimal pod spec sketch that requests this resource also follows this procedure.
  kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml
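For the node group creation step with eksctl, the following is a minimal configuration sketch rather than a definitive template: the cluster name, Region, Kubernetes version, node group name, and Availability Zone are placeholders that you replace with your own values, and it assumes the efaEnabled setting available in current eksctl releases, which attaches the EFA interfaces and creates the EFA security group for you.
cat > efa-nodegroup.yaml << EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder - your existing cluster name
  region: region-code       # placeholder - the Region used in the earlier steps
  version: "1.24"           # placeholder - your cluster's Kubernetes version
nodeGroups:
  - name: my-efa-ng         # placeholder node group name
    instanceType: p4d.24xlarge
    minSize: 2
    desiredCapacity: 2
    maxSize: 2
    availabilityZones: ["cn-north-1c"]   # one Availability Zone returned in the earlier step
    privateNetworking: true              # place nodes in a private subnet
    efaEnabled: true                     # attach EFA interfaces and create the EFA security group
EOF
eksctl create nodegroup --config-file efa-nodegroup.yaml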
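To illustrate how a pod consumes the vpc.amazonaws.com/efa extended resource after the device plugin is deployed, here is a minimal sketch; the pod name, container name, image, and command are hypothetical placeholders and are not part of the device plugin itself.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: efa-smoke-test               # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: efa-app                  # hypothetical container name
      image: public.ecr.aws/amazonlinux/amazonlinux:2   # placeholder image
      command: ["sleep", "3600"]
      resources:
        requests:
          vpc.amazonaws.com/efa: 1   # request one EFA interface from the device plugin
        limits:
          vpc.amazonaws.com/efa: 1
EOF
kubectl describe pod efa-smoke-test  # the vpc.amazonaws.com/efa request appears under Limits and Requests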
(Optional) Deploy a sample EFA-compatible application
Deploy the Kubeflow MPI Operator
For the NCCL tests, you can apply the Kubeflow MPI Operator. The MPI Operator makes it easy to run Allreduce-style distributed training on Kubernetes. For more information, see MPI Operator.
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1alpha2/mpi-operator.yaml
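Before moving on, you can confirm that the operator's custom resource definition was registered; this quick check assumes the CRD name used by this version of the MPI Operator.
# Confirm that the MPIJob custom resource definition is registered in the cluster.
kubectl get crd mpijobs.kubeflow.org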
Run the multi-node NCCL Performance Test to verify GPUDirectRDMA/EFA
To verify NCCL performance with GPUDirectRDMA over EFA, run the standard NCCL performance test. For more information, see the official NCCL-Tests repository. You can use the sample Dockerfile that comes with this test, already built for CUDA 11.2 and the latest version of EFA.
Alternatively, you can download an Amazon Docker image that's available from an Amazon ECR repository.
An important consideration for adopting EFA with Kubernetes is configuring and managing Huge Pages as a resource in the cluster. For more information, see Manage Huge Pages in the Kubernetes documentation.
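One way to see how Huge Pages and EFA interfaces are exposed to the scheduler is to inspect what the nodes advertise; a minimal sketch:
# List the hugepages-2Mi and EFA resources that the nodes advertise (shown under Capacity and Allocatable).
kubectl describe nodes | grep -E 'hugepages-2Mi|vpc.amazonaws.com/efa'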
Complete the following steps to run a two-node NCCL Performance Test. In the example NCCL test job, each worker requests eight GPUs, 5210Mi of hugepages-2Mi, four EFAs, and 8000Mi of memory, which effectively means each worker consumes all the resources of a p4d.24xlarge instance.
- Create the NCCL-tests job.
kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/examples/simple/nccl-efa-tests.yaml
The example output is as follows.
mpijob.kubeflow.org/nccl-tests-efa created
- View your running pods.
kubectl get pods
The example output is as follows.
NAME                            READY   STATUS     RESTARTS   AGE
nccl-tests-efa-launcher-nbql9   0/1     Init:0/1   0          2m49s
nccl-tests-efa-worker-0         1/1     Running    0          2m49s
nccl-tests-efa-worker-1         1/1     Running    0          2m49s
The MPI Operator creates a launcher pod and 2 worker pods (one on each node).
- View the log for the nccl-tests-efa-launcher pod. Replace nbql9 with the value from your output.
  kubectl logs -f nccl-tests-efa-launcher-nbql9
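When the test finishes, or if the pods don't all reach the Running state, the following commands can help you check the MPIJob and then clean up. The job name matches the one created by the example manifest earlier.
# Check the status of the MPIJob created by the example manifest.
kubectl get mpijob nccl-tests-efa
# Delete the test job and its pods when you're done.
kubectl delete -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/examples/simple/nccl-efa-tests.yaml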
For more examples, see the Amazon EKS EFA samples repository on GitHub.