Elastic Fabric Adapter - Amazon ParallelCluster

Elastic Fabric Adapter

Elastic Fabric Adapter (EFA) is a network device that has OS-bypass capabilities for low-latency network communications with other instances on the same subnet. EFA is exposed through Libfabric, and can be used by applications that use the Message Passing Interface (MPI).

To use EFA with Amazon ParallelCluster and a Slurm scheduler, set SlurmQueues / ComputeResources / Efa / Enabled to true.
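As a minimal sketch, the relevant part of a cluster configuration file might look like the following. The queue name, compute resource name, instance type, subnet ID, and node counts are placeholder assumptions chosen for illustration. Only the Efa / Enabled setting comes from the text above, and other required sections of the configuration file are omitted.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: efa-queue                   # hypothetical queue name
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0    # placeholder subnet ID
      ComputeResources:
        - Name: efa-compute             # hypothetical compute resource name
          InstanceType: c5n.18xlarge    # assumed example; use any EFA-capable instance type
          MinCount: 0
          MaxCount: 4
          Efa:
            Enabled: true               # enables EFA for this compute resource
```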

To view the list of Amazon EC2 instances that support EFA, see Supported instance types in the Amazon EC2 User Guide for Linux Instances.

We recommend that you run your EFA-enabled instances in a placement group. This way, the instances are launched into a low-latency group in a single Availability Zone. For more information about how to configure placement groups with Amazon ParallelCluster, see SlurmQueues / Networking / PlacementGroup.
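A placement group can be requested in the Networking section of a queue. The fragment below is a sketch that shows only those keys. The queue name and subnet ID are placeholder assumptions, and the ComputeResources section that a real queue also needs is left out. Listing a single subnet keeps all of the queue's nodes in one Availability Zone.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: efa-queue                   # hypothetical queue name
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0    # placeholder; one subnet, so one Availability Zone
        PlacementGroup:
          Enabled: true                 # launch the queue's instances into a placement group
      # ComputeResources omitted in this fragment
```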

Note

EFA isn't supported across different Availability Zones. For more information, see Scheduling / SlurmQueues / Networking / SubnetIds.

Note

By default, Ubuntu distributions enable ptrace (process trace) protection. In Amazon ParallelCluster, ptrace protection is disabled so that Libfabric works properly. For more information, see Disable ptrace protection in the Amazon EC2 User Guide.

Default EFA network configuration

Starting in Amazon ParallelCluster 3.15.0, when EFA is enabled, Amazon ParallelCluster automatically configures EFA-only network interfaces to separate EFA traffic from IP traffic. This maximizes EFA bandwidth while minimizing IP address consumption. Amazon ParallelCluster determines the optimal configuration based on the capabilities of the instance type.

This default configuration is recommended for most workloads, including tightly coupled HPC and distributed AI/ML training.

Customizing EFA network interfaces

If your workload requires a different network configuration, such as maximizing ENA bandwidth on secondary network cards or configuring only a subset of the available network cards, you can override the default settings by using the SlurmQueues / ComputeResources / LaunchTemplateOverrides parameter. The override replaces the entire network interface configuration of the compute nodes with the configuration defined in your launch template.

For a step-by-step walkthrough, see Customize compute node network interfaces with launch template overrides.

Warning

If you configure network interfaces in a way that the instance type doesn't support, instances fail to launch. To verify the supported network configurations for your instance type, see DescribeInstanceTypes in the Amazon EC2 API Reference.

For more information, see Elastic Fabric Adapter in the Amazon EC2 User Guide and Scale HPC workloads with elastic fabric adapter and Amazon ParallelCluster in the Amazon Open Source Blog.