Troubleshooting scaling issues
This section is relevant to clusters that were installed using Amazon ParallelCluster version 3.0.0 and later with the Slurm job scheduler. For more information about configuring multiple queues, see Configuration of multiple queues.
If one of your running clusters is experiencing issues, stop the compute fleet before you begin to troubleshoot. This places the cluster in a STOPPED state and prevents unexpected costs. Run the following command:
$ pcluster update-compute-fleet --cluster-name mycluster \
    --status STOP_REQUESTED
You can list the log streams available from the cluster nodes by using the pcluster list-cluster-log-streams command and filtering by the private-dns-name of one of the failing nodes or the head node:
$ pcluster list-cluster-log-streams --cluster-name mycluster --region eu-west-1 \
    --filters 'Name=private-dns-name,Values=ip-10-0-0-101'
Then, you can retrieve the content of a log stream to analyze it by using the pcluster get-cluster-log-events command and passing the --log-stream-name that corresponds to one of the key logs mentioned in the following section:
$ pcluster get-cluster-log-events --cluster-name mycluster \
    --region eu-west-1 --log-stream-name ip-10-0-0-13.i-04e91cc1f4ea796fe.cfn-init
Amazon ParallelCluster creates cluster CloudWatch log streams in log groups. You can view these logs in the CloudWatch console Custom Dashboards or Log groups. For more information, see Integration with Amazon CloudWatch Logs and Amazon CloudWatch dashboard.
Key logs for debugging
The following list provides an overview of the key logs for the head node:
-
/var/log/cfn-init.log - This is the Amazon CloudFormation init log. It contains all commands that were run when an instance was set up. Use it to troubleshoot initialization issues.
-
/var/log/chef-client.log - This is the Chef client log. It contains all commands that were run through Chef/CINC. Use it to troubleshoot initialization issues.
-
/var/log/parallelcluster/slurm_resume.log - This is the ResumeProgram log. ResumeProgram launches instances for dynamic nodes. Use this log to troubleshoot dynamic node launch issues.
-
/var/log/parallelcluster/slurm_suspend.log - This is the SuspendProgram log. SuspendProgram is called when instances are terminated for dynamic nodes. Use this log to troubleshoot dynamic node termination issues. When you check this log, also check the clustermgtd log.
-
/var/log/parallelcluster/clustermgtd - This is the clustermgtd log. clustermgtd runs as the centralized daemon that manages most cluster operation actions. Use this log to troubleshoot any launch, termination, or cluster operation issues.
-
/var/log/slurmctld.log - This is the Slurm control daemon log. Amazon ParallelCluster doesn't make scaling decisions. Rather, it only attempts to launch resources to satisfy the Slurm requirements. This log is useful for scaling and allocation issues, job-related issues, and any scheduler-related launch and termination issues.
-
/var/log/parallelcluster/compute_console_output - This log records the console output from a sample subset of static compute nodes that have unexpectedly terminated. Use this log if static compute nodes terminate and the compute node logs aren't available in CloudWatch. The compute_console_output log content is the same as what you receive when you use the Amazon EC2 console or Amazon CLI to retrieve the instance console output.
These are the key logs for the compute nodes:
-
/var/log/cloud-init-output.log - This is the cloud-init log. It contains all commands that were run when an instance was set up. Use it to troubleshoot initialization issues.
-
/var/log/parallelcluster/computemgtd - This is the computemgtd log. computemgtd runs on each compute node to monitor the node in the uncommon event that the clustermgtd daemon on the head node is offline. Use this log to troubleshoot unexpected termination issues.
-
/var/log/slurmd.log - This is the Slurm compute daemon log. Use it to troubleshoot initialization and compute failure issues.
Seeing InsufficientInstanceCapacity error in slurm_resume.log when I fail to run a job, or in clustermgtd.log when I fail to create a cluster
If the cluster uses a Slurm scheduler, you are experiencing an insufficient capacity issue. If there aren't enough instances available when an instance launch request is made, an InsufficientInstanceCapacity error is returned.
For static instance capacity, you can find the error in the clustermgtd log at /var/log/parallelcluster/clustermgtd.
For dynamic instance capacity, you can find the error in the ResumeProgram log at /var/log/parallelcluster/slurm_resume.log.
The message looks similar to the following example:
An error occurred (InsufficientInstanceCapacity) when calling the RunInstances/CreateFleet operation...
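As a minimal sketch, a small shell function can search both logs for this error at once. The function name find_capacity_errors is hypothetical and isn't part of Amazon ParallelCluster. The paths in the usage comment are the two log locations named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ParallelCluster): print every line that
# contains InsufficientInstanceCapacity, prefixed with file name and line number.
find_capacity_errors() {
    # -H: always print the file name, -n: print line numbers,
    # -F: treat the pattern as a fixed string
    grep -HnF 'InsufficientInstanceCapacity' "$@"
}

# Example usage on the head node:
#   find_capacity_errors /var/log/parallelcluster/clustermgtd \
#       /var/log/parallelcluster/slurm_resume.log
```

The function returns a non-zero exit status when no matching line is found, which is standard grep behavior.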
Based on your use case, consider using one of the following methods to avoid getting these types of error messages:
-
Disable the placement group if it's enabled. For more information, see Placement groups and instance launch issues.
-
Reserve capacity for the instances and launch them with ODCR (On-Demand Capacity Reservations). For more information, see Launch instances with On-Demand Capacity Reservations (ODCR).
-
Configure multiple compute resources with different instance types. If your workload doesn't require a specific instance type, you can leverage fast insufficient capacity fail over with multiple compute resources. For more information, see Slurm cluster fast insufficient capacity fail-over.
-
Configure multiple instance types in the same compute resource, and leverage the multiple instance type allocation. For more information about configuring multiple instance types, see Multiple instance type allocation with Slurm and Scheduling / SlurmQueues / ComputeResources / Instances.
-
Move the queue to a different Availability Zone by changing the subnet ID in the cluster configuration Scheduling / SlurmQueues / Networking / SubnetIds.
-
If your workload isn't tightly coupled, span the queue across different Availability Zones. For more information about configuring multiple subnets, see Scheduling / SlurmQueues / Networking / SubnetIds.
Troubleshooting node initialization issues
This section covers how you can troubleshoot node initialization issues. This includes issues where the node fails to launch, power up, or join a cluster.
Head node
Applicable logs:
-
/var/log/cfn-init.log
-
/var/log/chef-client.log
-
/var/log/parallelcluster/clustermgtd
-
/var/log/parallelcluster/slurm_resume.log
-
/var/log/slurmctld.log
Check the /var/log/cfn-init.log and /var/log/chef-client.log logs or the corresponding log streams. These logs contain all the actions that were run when the head node was set up. Most errors that occur during setup should have error messages in the /var/log/chef-client.log log. If OnNodeStart or OnNodeConfigured scripts are specified in the cluster configuration, check the log messages to verify that the scripts ran successfully.
When a cluster is created, the head node must wait for the compute nodes to join the cluster before it can join the cluster. Because of this, if the compute nodes fail to join the cluster, then the head node also fails. To troubleshoot this type of issue, follow one of these sets of procedures, depending on the type of compute nodes that you use:
Compute nodes
-
Applicable logs:
  -
  /var/log/cloud-init-output.log
  -
  /var/log/slurmd.log
-
If a compute node is launched, first check /var/log/cloud-init-output.log, which should contain setup logs similar to the /var/log/chef-client.log log on the head node. Most errors that occur during setup should have error messages in the /var/log/cloud-init-output.log log. If pre-install or post-install scripts are specified in the cluster configuration, check that they ran successfully.
-
If you're using a custom AMI with modifications to the Slurm configuration, there might be a Slurm-related error that prevents the compute node from joining the cluster. For scheduler-related errors, check the /var/log/slurmd.log log.
Dynamic compute nodes:
-
Search the ResumeProgram log (/var/log/parallelcluster/slurm_resume.log) for your compute node name to see if ResumeProgram was ever called with the node. If ResumeProgram was never called, check the slurmctld log (/var/log/slurmctld.log) to determine if Slurm ever tried to call ResumeProgram with the node.
-
Note that incorrect permissions for ResumeProgram might cause ResumeProgram to fail silently. If you're using a custom AMI with modifications to the ResumeProgram setup, check that ResumeProgram is owned by the slurm user and has the 744 (rwxr--r--) permission.
-
If ResumeProgram is called, check to see if an instance is launched for the node. If no instance was launched, you can see an error message that describes the launch failure.
-
If the instance is launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID in the ResumeProgram log. Moreover, you can look at the corresponding setup logs for the specific instance. For more information about troubleshooting a setup error with a compute node, see the next section.
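The first two checks for dynamic nodes can be sketched as two shell helpers. Both function names are hypothetical, and so are the node name and the /path/to/ResumeProgram placeholder in the usage comment, because the script location isn't given in this section. The ownership check assumes GNU stat (coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ParallelCluster).

# Print the log lines that mention the given node name.
# $1: path to a log file, $2: node name
lines_for_node() {
    grep -F -- "$2" "$1"
}

# Succeed if the file is owned by the expected user and has the expected mode.
# $1: path to the file, $2: expected owner, $3: expected octal mode (for example, 744)
has_owner_and_mode() {
    # GNU stat: %U is the owner's user name, %a is the octal access mode
    [ "$(stat -c '%U %a' "$1")" = "$2 $3" ]
}

# Example usage on the head node (node name and script path are placeholders):
#   lines_for_node /var/log/parallelcluster/slurm_resume.log queue1-dy-compute-1
#   has_owner_and_mode /path/to/ResumeProgram slurm 744 || echo "check owner and mode"
```

If lines_for_node prints nothing for the ResumeProgram log, run the same search against /var/log/slurmctld.log.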
Static compute nodes:
-
Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if instances were launched for the node. If they weren't launched, there should be a clear error message detailing the launch failure.
-
If the instance is launched, there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID in the clustermgtd log. Moreover, you can look at the corresponding setup logs for the specific instance.
Compute nodes backed by Spot Instances:
-
If it's the first time that you're using Spot Instances and the job remains in the PD (pending) state, check the /var/log/parallelcluster/slurm_resume.log file. You'll probably find an error like the following:
2022-05-20 13:06:24,796 - [slurm_plugin.common:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['spot-dy-t2micro-2']: An error occurred (AuthFailure.ServiceLinkedRoleCreationNotPermitted) when calling the RunInstances operation: The provided credentials do not have permission to create the service-linked role for Amazon EC2 Spot Instances.
When using Spot Instances, an AWSServiceRoleForEC2Spot service-linked role must exist in your account. To create this role in your account using the Amazon CLI, run the following command:
$ aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
For more information, see Working with Spot Instances in the Amazon ParallelCluster User Guide and Service-linked role for Spot Instance requests in the Amazon EC2 User Guide.
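Before creating the role, it can help to confirm that this error is what's blocking the job and whether the role already exists. The following is a sketch only, and both function names are hypothetical. The second function calls aws iam get-role, so it needs the CLI installed and credentials that allow reading IAM roles.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ParallelCluster).

# Succeed if the given log contains the service-linked role error.
# $1: path to the ResumeProgram log
has_spot_role_error() {
    grep -qF 'AuthFailure.ServiceLinkedRoleCreationNotPermitted' "$1"
}

# Succeed if the AWSServiceRoleForEC2Spot role exists in the account.
# Note: a failure can also mean missing credentials or IAM permissions.
spot_role_exists() {
    aws iam get-role --role-name AWSServiceRoleForEC2Spot > /dev/null 2>&1
}

# Example usage on the head node:
#   if has_spot_role_error /var/log/parallelcluster/slurm_resume.log; then
#       spot_role_exists || echo "AWSServiceRoleForEC2Spot not found"
#   fi
```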
Troubleshooting unexpected node replacements and terminations
This section continues to explore how you can troubleshoot node-related issues, specifically when a node is replaced or terminated unexpectedly.
-
Applicable logs:
  -
  /var/log/parallelcluster/clustermgtd (head node)
  -
  /var/log/slurmctld.log (head node)
  -
  /var/log/parallelcluster/computemgtd (compute node)
Nodes replaced or terminated unexpectedly
-
Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if clustermgtd replaced or terminated a node. Note that clustermgtd handles all normal node maintenance actions.
-
If clustermgtd replaced or terminated the node, there should be a message detailing why this action was taken on the node. If the reason is scheduler related (for example, because the node is in the DOWN state), check the slurmctld log for more information. If the reason is Amazon EC2 related, there should be an informative message detailing the Amazon EC2 related issue that required the replacement.
-
If clustermgtd didn't terminate the node, first check if this was an expected termination by Amazon EC2, more specifically a Spot termination. computemgtd, which runs on a compute node, can also terminate a node if clustermgtd is determined to be unhealthy. Check the computemgtd log (/var/log/parallelcluster/computemgtd) to see if computemgtd terminated the node.
Nodes failed
-
Check the slurmctld log (/var/log/slurmctld.log) to see why a job or a node failed. Note that jobs are automatically re-queued if a node fails.
-
If slurm_resume reports that a node is launched and clustermgtd reports after several minutes that there's no corresponding instance in Amazon EC2 for that node, the node might have failed during setup. To retrieve the log from a compute node (/var/log/cloud-init-output.log), do the following steps:
  -
  Submit a job to let Slurm spin up a new node.
  -
  Wait for the compute node to start.
  -
  Modify the instance-initiated shutdown behavior so that a failing compute node is stopped rather than terminated.
  $ aws ec2 modify-instance-attribute \
      --instance-id i-1234567890abcdef0 \
      --instance-initiated-shutdown-behavior "{\"Value\": \"stop\"}"
  -
  Enable termination protection.
  $ aws ec2 modify-instance-attribute \
      --instance-id i-1234567890abcdef0 \
      --disable-api-termination
  -
  Tag the node so that it's easy to identify.
  $ aws ec2 create-tags \
      --resources i-1234567890abcdef0 \
      --tags Key=Name,Value=QUARANTINED-Compute
  -
  Detach the node from the cluster by changing the parallelcluster:cluster-name tag.
  $ aws ec2 create-tags \
      --resources i-1234567890abcdef0 \
      --tags Key=parallelcluster:cluster-name,Value=QUARANTINED-ClusterName
  -
  Retrieve the console output from the node with this command.
  $ aws ec2 get-console-output --instance-id i-1234567890abcdef0 --output text
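The quarantine steps above can be collected into one shell function, sketched here under a few assumptions. The names run, quarantine_node, and DRY_RUN are hypothetical. The tag key follows the step text (parallelcluster:cluster-name). By default the sketch only prints the commands, so you can review them before running anything against your account.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of ParallelCluster).
# With DRY_RUN=1 (the default), commands are printed instead of run.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: instance ID of the compute node to quarantine
quarantine_node() {
    local id="$1"
    # Stop, rather than terminate, the instance when it shuts itself down.
    run aws ec2 modify-instance-attribute --instance-id "$id" \
        --instance-initiated-shutdown-behavior '{"Value": "stop"}'
    # Enable termination protection.
    run aws ec2 modify-instance-attribute --instance-id "$id" \
        --disable-api-termination
    # Tag the node so that it's easy to identify.
    run aws ec2 create-tags --resources "$id" \
        --tags Key=Name,Value=QUARANTINED-Compute
    # Detach the node from the cluster by changing the cluster name tag.
    run aws ec2 create-tags --resources "$id" \
        --tags Key=parallelcluster:cluster-name,Value=QUARANTINED-ClusterName
    # Retrieve the console output from the node.
    run aws ec2 get-console-output --instance-id "$id" --output text
}

# Example usage:
#   quarantine_node i-1234567890abcdef0             # print the commands only
#   DRY_RUN=0 quarantine_node i-1234567890abcdef0   # run them
```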
Replacing, terminating, or powering down problematic instances and nodes
-
Applicable logs:
  -
  /var/log/parallelcluster/clustermgtd (head node)
  -
  /var/log/parallelcluster/slurm_suspend.log (head node)
-
In most cases, clustermgtd handles all expected instance termination actions. Check the clustermgtd log to see why it failed to replace or terminate a node.
-
For dynamic nodes failing SlurmSettings Properties, check the SuspendProgram log to see if SuspendProgram was called by slurmctld with the specific node as an argument. Note that SuspendProgram doesn't actually perform any action. Rather, it only logs when it's called. All instance termination and NodeAddr resets are done by clustermgtd. Slurm automatically puts nodes back into a POWER_SAVING state after SuspendTimeout.
-
If compute nodes are failing continuously due to bootstrap failures, verify that they're being launched with Slurm cluster protected mode enabled. If protected mode isn't enabled, modify the protected mode settings to enable it. Then troubleshoot and fix the bootstrap script.
Queue (partition) Inactive status
If you run sinfo and the output shows queues with an AVAIL status of inact, your cluster might have Slurm cluster protected mode enabled and the queue has been set to the INACTIVE state for a pre-defined period of time.
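To list only the affected queues, you can filter the sinfo output. The following is a sketch that assumes the default sinfo column layout (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST) with one header line. The function name inactive_partitions is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ParallelCluster or Slurm).
# Reads default-format sinfo output on stdin and prints the name of each
# partition whose AVAIL column is "inact", one name per line.
inactive_partitions() {
    # Skip the header line, strip the "*" that marks the default partition,
    # and remove duplicates (sinfo prints one row per node state).
    awk 'NR > 1 && $2 == "inact" { sub(/\*$/, "", $1); print $1 }' | sort -u
}

# Example usage on the head node:
#   sinfo | inactive_partitions
```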
Troubleshooting other known node and job issues
Another type of known issue is that Amazon ParallelCluster might fail to allocate jobs or make scaling decisions. With this type of issue, Amazon ParallelCluster only launches, terminates, or maintains resources according to Slurm instructions. For these issues, check the slurmctld log to troubleshoot them.