Seeing errors in compute node initializations
Seeing Node bootstrap error in clustermgtd.log
The problem is related to compute nodes failing to bootstrap. For information on how to debug a cluster protected mode issue, see How to debug protected mode.
I configured On-Demand Capacity Reservations (ODCRs) or zonal Reserved Instances
ODCRs that include instances that have multiple network interfaces, such as P4d, P4de, and Amazon Trainium (Trn)
In the cluster configuration file, check that the HeadNode is in a public subnet and that the compute nodes are in a private subnet.
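As a sketch, the relevant parts of the cluster configuration look like the following. The subnet IDs and the queue name are placeholders; substitute the public subnet for the head node and the private subnet for the compute nodes.

```yaml
HeadNode:
  Networking:
    SubnetId: subnet-0123456789abcdef0    # public subnet (placeholder ID)
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1                        # placeholder queue name
      Networking:
        SubnetIds:
          - subnet-0fedcba9876543210     # private subnet (placeholder ID)
```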
ODCRs are targeted ODCRs
Seeing Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. even though I already have /opt/slurm/etc/pcluster/run_instances_overrides.json in place by following the instructions given in Launch instances with ODCR (On-Demand Capacity Reservations)
If you are using Amazon ParallelCluster versions 3.1.1 to 3.2.1 with targeted ODCRs, and you are also using the run instances override JSON file, it's possible that you don't have the JSON file formatted correctly. You could see an error in clustermgtd.log, such as the following:
Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. Using default: {} in /var/log/parallelcluster/clustermgtd.
Validate that the JSON file format is correct by passing the file itself to jq (not echoing the path as a string):
$ jq . /opt/slurm/etc/pcluster/run_instances_overrides.json
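If the file contains malformed JSON, jq prints a parse error and exits with a nonzero status, so the command above doubles as a validity check. A minimal sketch of both outcomes, using throwaway files instead of the real overrides path:

```shell
# Create one valid and one malformed JSON file, then validate each with jq.
tmpdir=$(mktemp -d)
echo '{"queue1": {}}' > "$tmpdir/valid.json"
echo '{"queue1": }'   > "$tmpdir/broken.json"

jq . "$tmpdir/valid.json"  > /dev/null 2>&1 && echo "valid.json: OK"
jq . "$tmpdir/broken.json" > /dev/null 2>&1 || echo "broken.json: parse error"

rm -rf "$tmpdir"
```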
Seeing Found RunInstances parameters override. in clustermgtd.log when cluster creation fails, or in slurm_resume.log when running a job fails
If you are using the run instances override JSON file, check that you correctly set the queue name and the compute resource name in the /opt/slurm/etc/pcluster/run_instances_overrides.json file.
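As a sketch of the expected layout (the queue name q1, compute resource name cr1, and the reservation ID are placeholders; the top-level keys must match the Name values of the queue and compute resource in your cluster configuration):

```shell
# Write a minimal overrides file to a scratch location (not the real path)
# and confirm the queue and compute-resource keys with jq.
cat > /tmp/run_instances_overrides.json <<'EOF'
{
  "q1": {
    "cr1": {
      "CapacityReservationSpecification": {
        "CapacityReservationTarget": {
          "CapacityReservationId": "cr-0123456789abcdef0"
        }
      }
    }
  }
}
EOF

jq -r 'keys[]' /tmp/run_instances_overrides.json        # queue names
jq -r '.q1 | keys[]' /tmp/run_instances_overrides.json  # compute resources
```

The first jq query prints q1 and the second prints cr1; if either differs from the names in your cluster configuration, the override is silently skipped for that queue or compute resource.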
Seeing An error occurred (InsufficientInstanceCapacity) in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster
Using PG-ODCR (Placement Group ODCR)
When creating an ODCR with an associated placement group, the same placement group name must be used in the configuration file. Set the corresponding placement group name in the cluster configuration.
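A sketch of the relevant configuration, assuming an existing placement group named my-odcr-pg (the queue and placement group names are placeholders):

```yaml
Scheduling:
  SlurmQueues:
    - Name: queue1
      Networking:
        PlacementGroup:
          Enabled: true
          Id: my-odcr-pg   # must match the placement group associated with the ODCR
```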
Using zonal Reserved Instances
If you are using zonal Reserved Instances with PlacementGroup / Enabled set to true in the cluster configuration, you might see an error, such as the following:
We currently do not have sufficient trn1.32xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get trn1.32xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1e, us-east-1f.
You might see this because the zonal Reserved Instances aren't placed on the same network spine, which can cause insufficient capacity errors (ICEs) when placement groups are used. To check whether this is the cause, disable the PlacementGroup setting in the cluster configuration and determine whether the cluster can allocate the instances.
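For that check, the setting can be sketched as follows (the queue name is a placeholder):

```yaml
Scheduling:
  SlurmQueues:
    - Name: queue1
      Networking:
        PlacementGroup:
          Enabled: false   # temporarily disable to test whether instances can launch
```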
Seeing An error occurred (VcpuLimitExceeded) in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster
Check the vCPU limits on your account for the specific EC2 instance type that you are using. If your limit is zero, or lower than the number of vCPUs you are requesting, request an increase for your limits. For information about how to view current limits and request new limits, see Amazon EC2 service quotas in the Amazon EC2 User Guide for Linux Instances.
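Note that the quota is counted in vCPUs, not instances, so multiply the instance count by the vCPUs per instance before comparing against the limit. A sketch of the arithmetic (the counts and quota below are examples; fetching the real limit via the AWS CLI service-quotas command is an assumption about your setup and requires credentials):

```shell
# Example: 4 x trn1.32xlarge at 128 vCPUs each against a 256-vCPU quota.
instances=4
vcpus_per_instance=128
quota=256   # example value; look up the real one in the Service Quotas console

needed=$((instances * vcpus_per_instance))
if [ "$needed" -gt "$quota" ]; then
  echo "need $needed vCPUs, quota is $quota: request an increase"
else
  echo "within quota"
fi
```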
Seeing An error occurred (InsufficientInstanceCapacity) in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster
You are experiencing an insufficient capacity issue. Follow the guidance in https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/.
Seeing nodes are in DOWN state with Reason (Code:InsufficientInstanceCapacity)...
You are experiencing an insufficient capacity issue. Follow the guidance in https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/.
Seeing cannot change locale (en_US.utf-8) because it has an invalid name in slurm_resume.log
This can occur if an unsuccessful yum installation process left the locale settings in an inconsistent state. For example, this can happen when a user terminates the install process.
To verify the cause, take the following actions:
- Run su - pcluster-admin. The shell shows an error, such as cannot change locale...no such file or directory.
- Run localedef --list-archive. It returns an empty list or doesn't contain the default locale.
- Check the last yum command with yum history and yum history info #ID. Does the last ID have Return-Code: Success? If the last ID doesn't have Return-Code: Success, the post-install scripts might not have run successfully.
To fix the issue, try rebuilding the locale with yum reinstall glibc-all-langpacks. After the rebuild, su - pcluster-admin doesn't show an error or warning if the issue is fixed.
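The locale check above can be sketched as a small script (the grep pattern assumes the default locale is an en_US variant; adjust it for your cluster's configuration):

```shell
# Check whether the default locale is present in the compiled locale archive.
# If localedef is unavailable or the archive is empty, the check falls through
# to the "missing" branch.
if localedef --list-archive 2>/dev/null | grep -qi 'en_US'; then
  echo "locale present"
else
  echo "locale missing: try 'yum reinstall glibc-all-langpacks' and re-check"
fi
```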
None of the previous scenarios apply to my situation
To troubleshoot compute node initialization issues, see Troubleshooting node initialization issues.
Check to see whether your scenario is covered in GitHub Known Issues.
For additional support, see Additional support.