Elastic Fabric Adapter

Elastic Fabric Adapter (EFA) is a network device that has OS-bypass capabilities for low-latency network communications with other instances on the same subnet. EFA is exposed through Libfabric, and can be used by applications that use the Message Passing Interface (MPI).

To use EFA with Amazon ParallelCluster and a Slurm scheduler, set SlurmQueues / ComputeResources / Efa / Enabled to true.
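As a minimal sketch of where this setting sits in a cluster configuration file, the fragment below enables EFA on one compute resource. The queue name, compute resource name, instance type, counts, and subnet ID are placeholders chosen for illustration, not values from this guide; pick an instance type from the supported list.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: efa-queue                     # placeholder queue name
      ComputeResources:
        - Name: efa-compute               # placeholder compute resource name
          InstanceType: c5n.18xlarge      # assumed example; must be an EFA-capable type
          MinCount: 0
          MaxCount: 4
          Efa:
            Enabled: true                 # SlurmQueues / ComputeResources / Efa / Enabled
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0      # placeholder subnet ID
```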

To view the list of Amazon EC2 instances that support EFA, see Supported instance types in the Amazon EC2 User Guide for Linux Instances.

We recommend that you run your EFA-enabled instances in a placement group. This way the instances are launched into a low-latency group in a single Availability Zone. For more information on how to configure placement groups with Amazon ParallelCluster, see SlurmQueues / Networking / PlacementGroup.
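For illustration, a queue-level fragment that requests a placement group might look like the following sketch. The queue name and subnet ID are placeholders. With only Enabled set, the placement group is expected to be managed for you; check the SlurmQueues / Networking / PlacementGroup reference if you need to use an existing one.

```yaml
SlurmQueues:
  - Name: efa-queue                       # placeholder queue name
    Networking:
      SubnetIds:
        - subnet-0123456789abcdef0        # placeholder; a single subnet keeps nodes in one Availability Zone
      PlacementGroup:
        Enabled: true                     # SlurmQueues / Networking / PlacementGroup
```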

Note

Elastic Fabric Adapter (EFA) isn't supported across different Availability Zones. For more information, see Scheduling / SlurmQueues / Networking / SubnetIds.

Note

By default, Ubuntu distributions enable ptrace (process trace) protection. In Amazon ParallelCluster, ptrace protection is disabled so that Libfabric works properly. For more information, see Disable ptrace protection in the Amazon EC2 User Guide.

Default EFA network configuration

Starting in Amazon ParallelCluster 3.15.0, when EFA is enabled, Amazon ParallelCluster automatically configures EFA-only network interfaces to separate EFA traffic from IP traffic. This maximizes EFA bandwidth while minimizing IP address consumption. Amazon ParallelCluster determines the optimal configuration based on the capabilities of the instance type. Therefore, EFA-enabled compute nodes are launched with more than one network interface, even when they use a single-network-card instance type, provided that instance type supports more than one network interface.

This default configuration is recommended for most workloads, including tightly-coupled HPC and distributed AI/ML training.

Note

Amazon EC2 does not auto-assign a public IP address to an instance launched with more than one network interface. Because EFA-enabled compute nodes launch with multiple network interfaces, they fail to bootstrap if they rely on an auto-assigned public IP address for internet access (for example, in a public subnet with no NAT gateway). Place these compute nodes in a private subnet with a NAT gateway and set AssignPublicIp to false. This requirement previously applied only to instance types with multiple network cards.
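The recommendation in this note could be expressed as the following sketch. The queue name and subnet ID are placeholders, and the subnet is assumed to be a private subnet whose route table sends internet-bound traffic through a NAT gateway.

```yaml
SlurmQueues:
  - Name: efa-queue                       # placeholder queue name
    Networking:
      SubnetIds:
        - subnet-0123456789abcdef0        # placeholder; assumed private subnet routed through a NAT gateway
      AssignPublicIp: false               # don't rely on an auto-assigned public IP address
```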

Customizing EFA network interfaces

If your workload requires a different network configuration, such as maximizing ENA bandwidth on secondary network cards or configuring a subset of available network cards, you can override the default settings using the SlurmQueues / ComputeResources / LaunchTemplateOverrides parameter. This replaces the entire network interface configuration of the compute nodes with the configuration defined in your launch template.

For a step-by-step walkthrough, see Customize compute node network interfaces with launch template overrides.
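As a rough sketch of the override, the fragment below points a compute resource at a launch template. The LaunchTemplateId and Version keys reflect our understanding of the LaunchTemplateOverrides schema and should be verified against the configuration reference. The launch template ID, version number, names, and instance type are placeholders, and the launch template itself (which defines the network interfaces) is assumed to exist already.

```yaml
SlurmQueues:
  - Name: efa-queue                       # placeholder queue name
    ComputeResources:
      - Name: efa-compute                 # placeholder compute resource name
        InstanceType: p5.48xlarge         # assumed example of a multi-network-card instance type
        Efa:
          Enabled: true
        LaunchTemplateOverrides:
          LaunchTemplateId: lt-0123456789abcdef0   # placeholder launch template ID
          Version: 1                               # placeholder launch template version
```

Because the override replaces the whole network interface configuration, the launch template needs to define every interface that the compute nodes should have, not only the ones that differ from the default.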

Warning

If you configure network interfaces in a way that is not supported by the instance type, instances will fail to launch. To verify the supported network configurations for your instance type, see DescribeInstanceTypes in the Amazon EC2 API Reference.

For more information, see Elastic Fabric Adapter in the Amazon EC2 User Guide and Scale HPC workloads with elastic fabric adapter and Amazon ParallelCluster in the Amazon Open Source Blog.
