
Amazon ParallelCluster troubleshooting

The Amazon ParallelCluster community maintains a Wiki page that provides many troubleshooting tips on the Amazon ParallelCluster GitHub Wiki. For a list of known issues, see Known issues.

Retrieving and preserving logs

Logs are a useful resource for troubleshooting issues. Before you can use logs to troubleshoot issues for your Amazon ParallelCluster resources, you should first create an archive of cluster logs. Follow the steps described in the Creating an Archive of a Cluster’s Logs topic on the Amazon ParallelCluster GitHub Wiki to start this process.

If one of your running clusters is experiencing issues, you should place the cluster in a STOPPED state by running the pcluster stop <cluster_name> command before you begin to troubleshoot. This prevents incurring any unexpected costs.

If pcluster stops functioning or if you want to delete a cluster while still preserving its logs, run the pcluster delete --keep-logs <cluster_name> command. Running this command deletes the cluster but retains the log group that's stored in Amazon CloudWatch. For more information about this command, see the pcluster delete documentation.
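For example, with a cluster named mycluster, the sequence might look like the following. The log group name prefix in the last command is an assumption based on the default /aws/parallelcluster/ naming; verify the exact log group name in the Amazon CloudWatch console.

$ pcluster stop mycluster
$ pcluster delete --keep-logs mycluster
# The log group prefix below is an assumption; check the CloudWatch console for the exact name.
$ aws logs describe-log-groups --log-group-name-prefix /aws/parallelcluster/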

Troubleshooting stack deployment issues

If your cluster fails to create and stack creation rolls back, you can look through the following log files to diagnose the issue. Look for ROLLBACK_IN_PROGRESS output in these logs. The failure message should look like the following:

$ pcluster create mycluster
Creating stack named: parallelcluster-mycluster
Status: parallelcluster-mycluster - ROLLBACK_IN_PROGRESS
Cluster creation failed. Failed events:
  - AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-07af1cb218dd6a081

To diagnose the issue, create the cluster again using pcluster create, including the --norollback flag. Then, SSH into the cluster:

$ pcluster create mycluster --norollback
...
$ pcluster ssh mycluster

After you're logged into the head node, you should find three primary log files that you can use to pinpoint the error.

  • /var/log/cfn-init.log is the log for the cfn-init script. Check this log first. You're likely to see an error like Command chef failed in this log. Look at the lines immediately before this error for more specifics about the failure (a minimal grep sketch follows this list). For more information, see cfn-init.

  • /var/log/cloud-init.log is the log for cloud-init. If you don't see anything in cfn-init.log, then try checking this log next.

  • /var/log/cloud-init-output.log is the output of commands that were run by cloud-init. This includes the output from cfn-init. In most cases, you don't need to look at this log to troubleshoot this type of issue.
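As a minimal sketch, assuming the default log locations and the Command chef failed error message mentioned in the first item, you can pull the relevant context on the head node as follows:

$ sudo grep -n -B 20 "Command chef failed" /var/log/cfn-init.log
$ sudo tail -n 50 /var/log/cloud-init.log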

Troubleshooting issues in multiple queue mode clusters

This section is relevant to clusters that were installed using Amazon ParallelCluster version 2.9.0 and later with the Slurm job scheduler. For more information about multiple queue mode, see Multiple queue mode.

Key logs

The following table provides an overview of the key logs for the head node:

/var/log/cfn-init.log

This is the Amazon CloudFormation init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

/var/log/chef-client.log

This is the Chef client log. It contains all commands that were run through Chef/CINC. It's useful for troubleshooting initialization issues.

/var/log/parallelcluster/slurm_resume.log

This is the ResumeProgram log. ResumeProgram launches instances for dynamic nodes, so this log is useful for troubleshooting dynamic node launch issues.

/var/log/parallelcluster/slurm_suspend.log

This is the SuspendProgram log. SuspendProgram is called when instances for dynamic nodes are terminated, so this log is useful for troubleshooting dynamic node termination issues. When you check this log, you should also check the clustermgtd log.

/var/log/parallelcluster/clustermgtd

This is the clustermgtd log. It runs as the centralized daemon that manages most cluster operation actions. It's useful for troubleshooting any launch, termination, or cluster operation issue.

/var/log/slurmctld.log

This is the Slurm control daemon log. Amazon ParallelCluster doesn't make scaling decisions. Rather, it only attempts to launch resources to satisfy the Slurm requirements. It's useful for scaling and allocation issues, job-related issues, and any scheduler-related launch and termination issues.

The following table provides an overview of the key logs for the compute nodes:

/var/log/cloud-init-output.log

This is the cloud-init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

/var/log/parallelcluster/computemgtd

This is the computemgtd log. It runs on each compute node to monitor the node in the rare event that the clustermgtd daemon on the head node is offline. It's useful for troubleshooting unexpected termination issues.

/var/log/slurmd.log

This is the Slurm compute daemon log. It's useful for troubleshooting initialization and compute failure related issues.

Troubleshooting node initialization issues

This section covers how you can troubleshoot node initialization issues. This includes issues where the node fails to launch, power up, or join a cluster.

Head node:

Applicable logs:

  • /var/log/cfn-init.log

  • /var/log/chef-client.log

  • /var/log/parallelcluster/clustermgtd

  • /var/log/parallelcluster/slurm_resume.log

  • /var/log/slurmctld.log

Check the /var/log/cfn-init.log and /var/log/chef-client.log logs. These logs should contain all the actions that were run when the head node was set up. Most errors that occur during setup produce error messages in the /var/log/chef-client.log log. If pre-install or post-install scripts are specified in the cluster configuration, verify through the log messages that the scripts ran successfully.
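A quick way to confirm this on the head node is to scan the Chef client log for failures and for the output of your custom scripts. The search terms below are illustrative, not exhaustive:

$ sudo grep -inE "error|fail|fatal" /var/log/chef-client.log | tail -n 20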

When a cluster is created, the head node waits for the compute nodes to join the cluster before its own setup can complete. As such, if the compute nodes fail to join the cluster, then the head node setup also fails. You can follow one of these sets of procedures, depending on the type of compute nodes you use, to troubleshoot this type of issue:

Dynamic compute nodes:

  • Search the ResumeProgram log (/var/log/parallelcluster/slurm_resume.log) for your compute node name to see if ResumeProgram was ever called with the node. (If ResumeProgram wasn’t ever called, you can check the slurmctld log (/var/log/slurmctld.log) to determine if Slurm ever tried to call ResumeProgram with the node.)

  • Note that incorrect permissions might cause ResumeProgram to fail silently. If you're using a custom AMI with modifications to the ResumeProgram setup, check that ResumeProgram is owned by the slurm user and has the 744 (rwxr--r--) permission (see the sketch after this list).

  • If ResumeProgram is called, check to see if an instance is launched for the node. If no instance was launched, you should be able to see an error message that describes the launch failure.

  • If the instance is launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID from the ResumeProgram log. Moreover, you can look at corresponding setup logs for the specific instance. For more information about troubleshooting a setup error with a compute node, see the next section.
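A minimal sketch of these checks from the head node follows. The node name queue1-dy-c5xlarge-1 and the /opt/slurm/etc/slurm.conf path are placeholder assumptions; substitute your own node name and the ResumeProgram path from your Slurm configuration.

# The node name is a placeholder; use the name of the failing dynamic node.
$ grep "queue1-dy-c5xlarge-1" /var/log/parallelcluster/slurm_resume.log
$ grep "queue1-dy-c5xlarge-1" /var/log/slurmctld.log
# Find the configured ResumeProgram path, then check its owner and permissions (expect slurm and 744).
$ sudo grep -i ResumeProgram /opt/slurm/etc/slurm.conf
$ stat -c '%U %a %n' <ResumeProgram_path>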

Static compute nodes:

  • Check the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if instances were launched for the node. If they weren't launched, there should be a clear error message detailing the launch failure.

  • If an instance was launched, then there might have been a problem during the setup process. You should see the corresponding private IP address and instance ID in the clustermgtd log. Moreover, you can look at the corresponding setup logs for the specific instance.

Compute nodes:

Applicable logs:

  • /var/log/cloud-init-output.log

  • /var/log/slurmd.log

If the compute node was launched, first check /var/log/cloud-init-output.log, which should contain setup logs similar to the /var/log/chef-client.log log on the head node. Most errors that occur during setup produce error messages in the /var/log/cloud-init-output.log log. If pre-install or post-install scripts are specified in the cluster configuration, check that they ran successfully.

If you're using a custom AMI with modifications to the Slurm configuration, then there might be a Slurm-related error that prevents the compute node from joining the cluster. For scheduler-related errors, check the /var/log/slurmd.log log.
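If you need to inspect a specific compute node directly, a minimal sketch is to connect to it from the head node using the private IP address found in the slurm_resume or clustermgtd log, and then read the setup and scheduler logs. The IP address shown is a placeholder:

# The private IP address is a placeholder; use the one reported in the head node logs.
$ ssh 10.0.0.101
$ sudo tail -n 50 /var/log/cloud-init-output.log
$ sudo tail -n 50 /var/log/slurmd.log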

Troubleshooting unexpected node replacements and terminations

This section continues to explore how you can troubleshoot node related issues, specifically when a node is replaced or terminated unexpectedly.

  • Applicable logs:

    • /var/log/parallelcluster/clustermgtd (head node)

    • /var/log/slurmctld.log (head node)

    • /var/log/parallelcluster/computemgtd (compute node)

  • Nodes replaced or terminated unexpectedly

    • Check in the clustermgtd log (/var/log/parallelcluster/clustermgtd) to see if clustermgtd took the action to replace or terminate a node. Note that clustermgtd handles all normal node maintenance actions.

    • If clustermgtd replaced or terminated the node, there should be a message detailing why this action was taken on the node. If the reason is scheduler related (for example, because the node is in DOWN), check the slurmctld log for more information. If the reason is Amazon EC2 related, there should be an informative message detailing the Amazon EC2 related issue that required the replacement.

    • If clustermgtd didn't terminate the node, first check whether this was an expected termination by Amazon EC2, more specifically a Spot Instance interruption. computemgtd, which runs on each compute node, can also terminate a node if it determines that clustermgtd on the head node is unhealthy. Check the computemgtd log (/var/log/parallelcluster/computemgtd) to see if computemgtd terminated the node.

  • Nodes failed

    • Check the slurmctld log (/var/log/slurmctld.log) to see why a job or a node failed. Note that jobs are automatically requeued if a node fails.

    • If slurm_resume reports that the node was launched and clustermgtd reports after several minutes that there's no corresponding instance in Amazon EC2 for that node, the node might have failed during setup. To retrieve the log from a compute node (/var/log/cloud-init-output.log), do the following steps (a combined sketch follows them):

      • Submit a job to let Slurm spin up a new node.

      • After the node starts, enable termination protection using this command.

        aws ec2 modify-instance-attribute --instance-id i-xyz --disable-api-termination
      • Retrieve the console output from the node with this command.

        aws ec2 get-console-output --instance-id i-xyz --output text
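      Putting these steps together, a minimal sketch might look like the following. The wrapped command, partition name, and instance ID are placeholders for your own values:

        # Submit a placeholder job so that Slurm powers up a node in the problem partition.
        sbatch --wrap "sleep 300" --nodes 1 --partition queue1
        aws ec2 modify-instance-attribute --instance-id i-xyz --disable-api-termination
        aws ec2 get-console-output --instance-id i-xyz --output text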

Replacing, terminating, or powering down problem instances and nodes

  • Applicable logs:

    • /var/log/parallelcluster/clustermgtd (head node)

    • /var/log/parallelcluster/slurm_suspend.log (head node)

  • In most cases, clustermgtd handles all expected instance termination actions. Check in the clustermgtd log to see why it failed to replace or terminate a node.

  • For dynamic nodes that are powered down after being idle for scaledown_idletime, check the SuspendProgram log to see if SuspendProgram was called by slurmctld with the specific node as an argument. Note that SuspendProgram doesn't actually perform any action. Rather, it only logs when it's called. All instance termination and NodeAddr resets are done by clustermgtd. Slurm automatically puts nodes back into a POWER_SAVING state after SuspendTimeout.
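  • To see how Slurm currently classifies a node (for example, a "~" suffix in sinfo output typically indicates a powered-down, POWER_SAVING node), you can query the scheduler from the head node. The node name below is a placeholder:

    # Replace the node name with one from your own cluster.
    $ sinfo --Node --long
    $ scontrol show node queue1-dy-c5xlarge-1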

Troubleshooting other known node and job issues

Another type of known issue is when Amazon ParallelCluster appears to fail to allocate jobs or make scaling decisions. Amazon ParallelCluster doesn't make these decisions itself; it only launches, terminates, or maintains resources according to Slurm instructions. For these issues, check the slurmctld log.

Troubleshooting issues in single queue mode clusters

Note

Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

This section applies to clusters that don't use multiple queue mode and that have one of the following two configurations:

  • Launched using an Amazon ParallelCluster version earlier than 2.9.0 and the SGE, Torque, or Slurm job schedulers.

  • Launched using Amazon ParallelCluster version 2.9.0 or later and SGE or Torque job schedulers.

Key logs

The following log files are the key logs for the head node.

For Amazon ParallelCluster version 2.9.0 or later:

/var/log/chef-client.log

This is the CINC (chef) client log. It contains all commands that were run through CINC. It's useful for troubleshooting initialization issues.

For all Amazon ParallelCluster versions:

/var/log/cfn-init.log

This is the cfn-init log. It contains all commands that were run when an instance is set up, and is therefore useful for troubleshooting initialization issues. For more information, see cfn-init.

/var/log/clustermgtd.log

This is the clustermgtd log for Slurm schedulers. clustermgtd runs as the centralized daemon that manages most cluster operation actions. It's useful for troubleshooting any launch, termination, or cluster operation issue.

/var/log/jobwatcher

This is the jobwatcher log for SGE and Torque schedulers. jobwatcher monitors the scheduler queue and updates the Auto Scaling Group. It's useful for troubleshooting issues related to scaling up nodes.

/var/log/sqswatcher

This is the sqswatcher log for SGE and Torque schedulers. sqswatcher processes the instance ready event sent by a compute instance after successful initialization. It also adds compute nodes to the scheduler configuration. This log is useful for troubleshooting why a node or nodes failed to join a cluster.

The following are the key logs for the compute nodes.

Amazon ParallelCluster version 2.9.0 or later

/var/log/cloud-init-output.log

This is the cloud-init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

Amazon ParallelCluster versions before 2.9.0

/var/log/cfn-init.log

This is the CloudFormation init log. It contains all commands that were run when an instance is set up. It's useful for troubleshooting initialization issues.

All versions

/var/log/nodewatcher

This is the nodewatcher log. The nodewatcher daemon runs on each compute node when using SGE and Torque schedulers, and it scales down a node if the node is idle. This log is useful for any issues related to scaling down resources.

Troubleshooting failed launch and join operations

  • Applicable logs:

    • /var/log/cfn-init-cmd.log (head node and compute node)

    • /var/log/sqswatcher (head node)

  • If nodes failed to launch, check in the /var/log/cfn-init-cmd.log log to see the specific error message. In most cases, node launch failures are due to a setup failure.

  • If compute nodes failed to join the scheduler configuration despite a successful setup, check the /var/log/sqswatcher log to see whether sqswatcher processed the event. In most cases, these issues occur because sqswatcher didn't process the event.

Troubleshooting scaling issues

  • Applicable logs:

    • /var/log/jobwatcher (head node)

    • /var/log/nodewatcher (compute node)

  • Scale up issues: For the head node, check the /var/log/jobwatcher log to see if the jobwatcher daemon calculated the proper number of required nodes and updated the Auto Scaling Group. Note that jobwatcher monitors the scheduler queue and updates the Auto Scaling Group.

  • Scale down issues: For compute nodes, check the /var/log/nodewatcher log on the problem node to see why the node was scaled down. Note that nodewatcher daemons scale down a compute node if it’s idle.

One known issue is random compute node failures on large-scale clusters, specifically clusters with 500 or more compute nodes. This issue is related to a limitation of the scaling architecture of single queue clusters. If you want to use a large-scale cluster, are using Amazon ParallelCluster version 2.9.0 or later, are using Slurm, and want to avoid this issue, you should upgrade and switch to a multiple queue mode supported cluster. You can do so by running pcluster-config convert.
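A minimal sketch of that conversion follows. The -c and -o options shown are assumptions about the pcluster-config interface, and the file paths are placeholders; confirm both with pcluster-config convert --help for your installed version:

$ pcluster-config convert --help
# The config and output paths below are placeholders.
$ pcluster-config convert -c ~/.parallelcluster/config -o ~/.parallelcluster/config.multi_queue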

For ultra-large scale clusters, additional tuning to your system might be required. For more information, contact Amazon Web Services Support.

Placement groups and instance launch issues

To get the lowest inter-node latency, use a placement group. A placement group guarantees that your instances are on the same networking backbone. If there aren't enough instances available when a request is made, an InsufficientInstanceCapacity error is returned. To reduce the possibility of receiving this error when using cluster placement groups, set the placement_group parameter to DYNAMIC and set the placement parameter to compute.

If you require a high performance shared filesystem, consider using FSx for Lustre.

If the head node must be in the placement group, use the same instance type and subnet for both the head node and all of the compute nodes. By doing this, the compute_instance_type parameter has the same value as the master_instance_type parameter, the placement parameter is set to cluster, and the compute_subnet_id parameter isn't specified. With this configuration, the value of the master_subnet_id parameter is used for the compute nodes.
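A minimal configuration sketch for this layout follows. The instance type, VPC ID, subnet ID, and section names are placeholders; the parameters are the ones named above.

[cluster default]
scheduler = slurm
master_instance_type = c5n.18xlarge
compute_instance_type = c5n.18xlarge
placement = cluster
placement_group = DYNAMIC
vpc_settings = public

[vpc public]
vpc_id = vpc-0123456789abcdef0
master_subnet_id = subnet-1234567890abcdef0
# compute_subnet_id is intentionally omitted so that compute nodes use master_subnet_id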

For more information, see Troubleshooting instance launch issues and Placement group rules and limitations in the Amazon EC2 User Guide.

Directories that cannot be replaced

The following directories are shared between the nodes and cannot be replaced.

/home

This includes the default user home folder (/home/ec2-user on Amazon Linux, /home/centos on CentOS, and /home/ubuntu on Ubuntu).

/opt/intel

This includes Intel MPI, Intel Parallel Studio, and related files.

/opt/sge
Note

Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

This includes Son of Grid Engine and related files. (Conditional, only if scheduler = sge.)

/opt/slurm

This includes Slurm Workload Manager and related files. (Conditional, only if scheduler = slurm.)

/opt/torque
Note

Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

This includes Torque Resource Manager and related files. (Conditional, only if scheduler = torque.)

Troubleshooting issues in NICE DCV

Logs for NICE DCV

The logs for NICE DCV are written to files in the /var/log/dcv/ directory. Reviewing these logs can help to troubleshoot issues.

NICE DCV instance type memory

The instance type should have at least 1.7 gibibyte (GiB) of RAM to run NICE DCV. Nano and micro instance types don't have enough memory to run NICE DCV.

Ubuntu NICE DCV issues

When running Gnome Terminal over a DCV session on Ubuntu, you might not automatically have access to the user environment that Amazon ParallelCluster makes available through the login shell. The user environment provides environment modules such as openmpi or intelmpi, and other user settings.

Gnome Terminal's default settings prevent the shell from starting as a login shell. This means that shell profiles aren't automatically sourced and the Amazon ParallelCluster user environment isn't loaded.

To properly source the shell profile and access the Amazon ParallelCluster user environment, do one of the following:

  • Change the default terminal settings:
    1. Choose the Edit menu in the Gnome terminal.

    2. Select Preferences, then Profiles.

    3. Choose Command and select Run Command as login shell.

    4. Open a new terminal.

  • Use the command line to source the available profiles:

    $ source /etc/profile && source $HOME/.bashrc

Troubleshooting issues in clusters with Amazon Batch integration

This section is relevant to clusters with Amazon Batch scheduler integration.

Head node issues

You can troubleshoot head node related setup issues in the same way as for a single queue cluster. For more information about these issues, see Troubleshooting issues in single queue mode clusters.

Amazon Batch multi-node parallel jobs submission issues

If you have problems submitting multi-node parallel jobs when using Amazon Batch as the job scheduler, you should upgrade to Amazon ParallelCluster version 2.5.0. If that's not feasible, you can use the workaround that's detailed in the topic: Self patch a cluster used for submitting multi-node parallel jobs through Amazon Batch.

Compute issues

Amazon Batch manages the scaling and compute aspects of your services. If you encounter compute related issues, see the Amazon Batch troubleshooting documentation for help.

Job failures

If a job fails, you can run the awsbout command to retrieve the job output. You can also run the awsbstat -d command to obtain a link to the job logs stored by Amazon CloudWatch.
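For example, with a failed job whose ID is a1b2c3d4-5678-90ab-cdef-11111EXAMPLE (a placeholder), the commands might look like the following:

$ awsbout a1b2c3d4-5678-90ab-cdef-11111EXAMPLE
$ awsbstat -d a1b2c3d4-5678-90ab-cdef-11111EXAMPLE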

Troubleshooting when a resource fails to create

This section is relevant to cluster resources when they fail to create.

When a resource fails to create, ParallelCluster returns an error message like the following.

pcluster create -c config my-cluster
Beginning cluster creation for cluster: my-cluster
WARNING: The instance type 'p4d.24xlarge' cannot take public IPs. Please make sure that the subnet with id 'subnet-1234567890abcdef0' has the proper routing configuration to allow private IPs reaching the Internet (e.g. a NAT Gateway and a valid route table).
WARNING: The instance type 'p4d.24xlarge' cannot take public IPs. Please make sure that the subnet with id 'subnet-1234567890abcdef0' has the proper routing configuration to allow private IPs reaching the Internet (e.g. a NAT Gateway and a valid route table).
Info: There is a newer version 3.0.3 of AWS ParallelCluster available.
Creating stack named: parallelcluster-my-cluster
Status: parallelcluster-my-cluster - ROLLBACK_IN_PROGRESS
Cluster creation failed. Failed events:
  - AWS::CloudFormation::Stack MasterServerSubstack Embedded stack arn:aws:cloudformation:region-id:123456789012:stack/parallelcluster-my-cluster-MasterServerSubstack-ABCDEFGHIJKL/a1234567-b321-c765-d432-dcba98766789 was not successfully created: The following resource(s) failed to create: [MasterServer].
  - AWS::CloudFormation::Stack parallelcluster-my-cluster-MasterServerSubstack-ABCDEFGHIJKL The following resource(s) failed to create: [MasterServer].
  - AWS::EC2::Instance MasterServer You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. (Service: AmazonEC2; Status Code: 400; Error Code: VcpuLimitExceeded; Request ID: a9876543-b321-c765-d432-dcba98766789; Proxy: null)

As an example, if you see the status message shown in the previous command response, you must use instance types that won't exceed your current vCPU limit or request more vCPU capacity.

You can also use the CloudFormation console to see information about the "Cluster creation failed" status.

To view CloudFormation error messages from the console:

  1. Log in to the Amazon Web Services Management Console and navigate to https://console.amazonaws.cn/cloudformation.

  2. Select the stack named parallelcluster-cluster_name.

  3. Choose the Events tab.

  4. Check the Status for the resource that failed to create by scrolling through the list of resource events by Logical ID. If a subtask failed to create, work backwards to find the failed resource event.

  5. An example of an Amazon CloudFormation error message:

    2022-02-07 11:59:14 UTC-0800 MasterServerSubstack CREATE_FAILED Embedded stack arn:aws:cloudformation:region-id:123456789012:stack/parallelcluster-my-cluster-MasterServerSubstack-ABCDEFGHIJKL/a1234567-b321-c765-d432-dcba98766789 was not successfully created: The following resource(s) failed to create: [MasterServer].

Troubleshooting IAM policy size issues

Refer to IAM and Amazon STS quotas, name requirements, and character limits to verify quotas on managed policies attached to roles. If a managed policy size exceeds the quota, split the policy into two or more policies. If you exceed the quota on the number of policies attached to an IAM role, create additional roles and distribute the policies among them to meet the quota.

Additional support

For a list of known issues, see the main GitHub Wiki page or the issues page. For more urgent issues, contact Amazon Web Services Support or open a new GitHub issue.