Creating a cluster with an EFA-enabled FSx Lustre file system
In this tutorial, you will create a cluster that uses an EFA-enabled FSx Lustre file system as shared storage. Enabling EFA on an FSx Lustre file system can improve performance by up to 8x. To determine whether an EFA-enabled file system fits your needs, see Working with EFA-enabled file systems in the FSx for Lustre User Guide.
When you use Amazon ParallelCluster, you only pay for the Amazon resources that are created when you create or update Amazon ParallelCluster images and clusters. For more information, see Amazon services used by Amazon ParallelCluster.
Requirements
-
The Amazon CLI is installed and configured.
-
The ParallelCluster CLI is installed and configured.
-
An Amazon EC2 key pair to log into the cluster.
-
An IAM role with the permissions that are required to run the ParallelCluster CLI.
Create Security Groups
Create two security groups in the same VPC where the cluster and the file system will be deployed: one for the client running on cluster nodes and one for the file system.
# Create security group for the FSx client
aws ec2 create-security-group \
    --group-name Fsx-Client-SecurityGroup \
    --description "Allow traffic for the FSx Lustre client" \
    --vpc-id vpc-cluster \
    --region region

# Create security group for the FSx file system
aws ec2 create-security-group \
    --group-name Fsx-FileSystem-SecurityGroup \
    --description "Allow traffic for the FSx Lustre File System" \
    --vpc-id vpc-cluster \
    --region region
In the remainder of the tutorial, we assume that sg-client and sg-file-system are the security group IDs of the client and the file system, respectively.
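If you did not record the IDs when the groups were created, they can be looked up by name. The following is a minimal sketch, assuming the group names used above; sg_id_by_name is a hypothetical helper introduced here for illustration, not part of the Amazon CLI.

```shell
# Sketch: look up a security group ID by group name within a VPC.
# sg_id_by_name is a hypothetical helper, not an Amazon CLI command.
# Arguments: group name, VPC ID, region.
sg_id_by_name() (
  group_name="$1"; vpc_id="$2"; region="$3"
  aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=${group_name}" "Name=vpc-id,Values=${vpc_id}" \
    --query 'SecurityGroups[0].GroupId' \
    --output text \
    --region "${region}"
)

# Example usage (replace vpc-cluster and region with your own values):
#   SG_CLIENT="$(sg_id_by_name Fsx-Client-SecurityGroup vpc-cluster region)"
#   SG_FILE_SYSTEM="$(sg_id_by_name Fsx-FileSystem-SecurityGroup vpc-cluster region)"
```

If no group matches the name in that VPC, the query has nothing to return, so check the value before using it in later commands.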
Configure the security group for the client to allow all inbound/outbound traffic to and from the file system, as required by EFA.
# Allow all inbound traffic from the file system to the client
aws ec2 authorize-security-group-ingress \
    --group-id sg-client \
    --protocol -1 \
    --port -1 \
    --source-group sg-file-system \
    --region region

# Allow all outbound traffic from the client to the file system
aws ec2 authorize-security-group-egress \
    --group-id sg-client \
    --protocol -1 \
    --port -1 \
    --source-group sg-file-system \
    --region region
Configure the security group for the file system to allow all inbound/outbound traffic within itself and all inbound traffic from the client, as required by EFA.
# Allow all inbound traffic within this security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-file-system \
    --protocol -1 \
    --port -1 \
    --source-group sg-file-system \
    --region region

# Allow all outbound traffic within this security group
aws ec2 authorize-security-group-egress \
    --group-id sg-file-system \
    --protocol -1 \
    --port -1 \
    --source-group sg-file-system \
    --region region

# Allow all inbound traffic from the client
aws ec2 authorize-security-group-ingress \
    --group-id sg-file-system \
    --protocol -1 \
    --port -1 \
    --source-group sg-client \
    --region region

# Allow all outbound traffic to the client
aws ec2 authorize-security-group-egress \
    --group-id sg-file-system \
    --protocol -1 \
    --port -1 \
    --source-group sg-client \
    --region region
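Before moving on, it can help to confirm that the rules landed on the right groups. The sketch below lists the rules of one security group; list_sg_rules is a hypothetical helper name, and the three columns selected in the query are a choice made for this example rather than anything the tutorial prescribes.

```shell
# Sketch: list the rules attached to a security group.
# list_sg_rules is a hypothetical helper, not an Amazon CLI command.
# Arguments: security group ID, region.
# Prints one line per rule: whether it is an egress rule, the IP protocol
# (-1 means all protocols), and the ID of the referenced security group.
list_sg_rules() (
  group_id="$1"; region="$2"
  aws ec2 describe-security-group-rules \
    --filters "Name=group-id,Values=${group_id}" \
    --query 'SecurityGroupRules[].[IsEgress,IpProtocol,ReferencedGroupInfo.GroupId]' \
    --output text \
    --region "${region}"
)

# Example usage:
#   list_sg_rules sg-client region
#   list_sg_rules sg-file-system region
```

For sg-client you would expect an ingress and an egress rule that reference sg-file-system; for sg-file-system, ingress and egress rules that reference both itself and sg-client.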
Create the file system
Create the file system in the same Availability Zone (AZ) as the compute nodes, and replace subnet-compute-nodes in the following code with the ID of the compute nodes' subnet. This is required for EFA to work with your file system. Note that, as part of the file system creation, EFA is enabled with the EfaEnabled property.
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 38400 \
    --storage-type SSD \
    --subnet-ids subnet-compute-nodes \
    --security-group-ids sg-file-system \
    --lustre-configuration DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=125,EfaEnabled=true,MetadataConfiguration={Mode=AUTOMATIC} \
    --region region
Take note of the file system ID returned by the previous command. In the remainder of the tutorial, replace fs-id with this file system ID.
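File system creation is asynchronous, so it is worth waiting until the file system is available before creating the cluster. The following is a minimal polling sketch; fsx_lifecycle and wait_for_fsx are hypothetical helper names, and the 30-second default interval is an arbitrary choice for this example.

```shell
# Sketch: wait for an FSx file system to become AVAILABLE.
# fsx_lifecycle and wait_for_fsx are hypothetical helpers, not Amazon CLI commands.

# Print the current lifecycle state of a file system.
# Arguments: file system ID, region.
fsx_lifecycle() (
  aws fsx describe-file-systems \
    --file-system-ids "$1" \
    --query 'FileSystems[0].Lifecycle' \
    --output text \
    --region "$2"
)

# Poll until the file system is AVAILABLE (returns 0) or has FAILED or is
# DELETING (returns 1). Any other state is treated as "still in progress".
# Arguments: file system ID, region, optional polling interval in seconds.
wait_for_fsx() (
  fs_id="$1"; region="$2"; interval="${3:-30}"
  while :; do
    state="$(fsx_lifecycle "${fs_id}" "${region}")"
    echo "File system ${fs_id} is ${state}"
    case "${state}" in
      AVAILABLE) exit 0 ;;
      FAILED|DELETING) exit 1 ;;
    esac
    sleep "${interval}"
  done
)

# Example usage:
#   wait_for_fsx fs-id region
```

The loop has no overall timeout, so you may want to add one if you run it unattended.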
Create the cluster
-
Create the cluster with the following configurations set in the Amazon ParallelCluster YAML configuration file:
-
AMI based on a supported OS, such as Ubuntu 22.04.
-
Compute nodes must use an EFA-supported instance type based on Nitro v4 or later, such as g6.16xlarge.
-
Compute nodes must be in the same AZ as the file system.
-
Compute nodes must have Efa/Enabled set to true.
-
Compute nodes must configure the FSx Lustre client to use EFA by following Configuring EFA clients in the FSx for Lustre User Guide. That guide provides the configure-efa-fsx-lustre-client package, which you download, extract, and run with setup.sh. To apply it automatically on every compute node, run these steps from an OnNodeStart custom action, as shown below.
-
Create the file configure-efa-fsx-lustre-client-wrapper.sh and upload it to a bucket, for example your-bucket, that is reachable from the compute nodes. Following the steps in the FSx documentation referenced above, the wrapper performs the equivalent of:

#!/bin/bash
set -euo pipefail

# Download the FSx Lustre EFA client configuration package.
# See: https://docs.aws.amazon.com/fsx/latest/LustreGuide/configure-efa-clients.html
# Replace the source below with a location reachable from your compute nodes
# (for example, your own S3 bucket populated from the FSx documentation package).
cd /tmp
aws s3 cp s3://your-bucket/configure-efa-fsx-lustre-client.zip .
unzip -o configure-efa-fsx-lustre-client.zip
cd configure-efa-fsx-lustre-client

# Configure the FSx Lustre client to use EFA.
./setup.sh
-
Create a cluster configuration file config.yaml:

Region: region
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-xxxxxxxxxx
    AdditionalSecurityGroups:
      - sg-client
  Ssh:
    KeyName: my-ssh-key
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: q1
      ComputeResources:
        - Name: cr1
          Instances:
            - InstanceType: g6.16xlarge
          MinCount: 1
          MaxCount: 3
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxx # Subnet in the same AZ where the file system is
        AdditionalSecurityGroups:
          - sg-client
        PlacementGroup:
          Enabled: false
      Iam:
        S3Access:
          - BucketName: your-bucket
      CustomActions:
        OnNodeStart:
          # Point this at the wrapper script you created and hosted (see step above).
          Script: s3://your-bucket/configure-efa-fsx-lustre-client-wrapper.sh
SharedStorage:
  - MountDir: /fsx
    Name: my-fsxlustre-efa-external
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-id

Then create a cluster using that configuration:
pcluster create-cluster \
    --cluster-name fsx-efa-tutorial \
    --cluster-configuration config.yaml \
    --region region
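Cluster creation takes some time, so you need to check its status before logging in to a compute node. The sketch below extracts the clusterStatus field from the JSON printed by pcluster describe-cluster; cluster_status is a hypothetical helper, and the sed expression assumes the field is printed on its own line, as in the CLI's default pretty-printed output.

```shell
# Sketch: print the status of a cluster, for example CREATE_IN_PROGRESS
# or CREATE_COMPLETE.
# cluster_status is a hypothetical helper, not a ParallelCluster CLI command.
# Arguments: cluster name, region.
cluster_status() (
  pcluster describe-cluster \
    --cluster-name "$1" \
    --region "$2" |
    sed -n 's/.*"clusterStatus": *"\([A-Z_]*\)".*/\1/p'
)

# Example usage:
#   cluster_status fsx-efa-tutorial region
```

Once the status is CREATE_COMPLETE, the cluster is ready for the validation steps.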
Validate FSx with EFA is working
To verify that Lustre network traffic is using EFA, use the Lustre lnetctl tool, which shows the network traffic for a given network interface. To do so, run the following commands on a compute node:
# Take note of the number of packets flowing through the interface,
# which are specified in statistics:send_count and statistics:recv_count
sudo lnetctl net show --net efa -v

# Generate traffic to the file system
echo 'Hello World' > /fsx/hello-world.txt

# Take note of the number of packets flowing through the interface,
# which are specified in statistics:send_count and statistics:recv_count
sudo lnetctl net show --net efa -v
If the feature is working, the number of packets flowing through the interface is expected to increase.
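Comparing the counters by eye is error-prone, so the comparison can be scripted. The following is a minimal sketch, assuming the lnetctl output lists the counters as "send_count: N" and "recv_count: N" lines; lnet_count is a hypothetical helper that sums one counter across all interfaces in the output.

```shell
# Sketch: sum a statistics counter from `lnetctl net show -v` output.
# lnet_count is a hypothetical helper, not part of Lustre.
# Argument: counter name (send_count or recv_count).
# Reads the lnetctl output on standard input and prints the total.
lnet_count() (
  field="$1"
  awk -v key="${field}:" '$1 == key { total += $2 } END { print total + 0 }'
)

# Example usage on a compute node:
#   before="$(sudo lnetctl net show --net efa -v | lnet_count send_count)"
#   echo 'Hello World' > /fsx/hello-world.txt
#   after="$(sudo lnetctl net show --net efa -v | lnet_count send_count)"
#   [ "${after}" -gt "${before}" ] && echo "send_count increased: EFA is carrying Lustre traffic"
```

The same check can be repeated with recv_count to confirm traffic in the other direction.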