Resolving OS configuration changes that cause errors or failures
OS configuration changes on Amazon ParallelCluster nodes can cause cluster creation, update, or operation failures. This section explains how to identify and resolve common OS configuration issues.
Common OS configuration issues
Locale configuration issues
One of the most common OS configuration issues involves locale settings. It shows up as an error like:
cannot change locale (en_US.utf-8) because it has an invalid name
This typically occurs when:
-
A yum installation process was unsuccessful and left locale settings in an inconsistent state
-
A user terminated an installation process prematurely
-
Locale packages are missing or corrupted
How to diagnose
-
Check if you can switch to the pcluster-admin user:
$
su - pcluster-admin
If you see an error like "cannot change locale...no such file or directory", this confirms the issue.
-
Check available locales:
$
localedef --list
If this returns an empty list or doesn't contain the default locale, your locale configuration is broken.
-
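Check the last yum command:
$
yum history
$
yum history info ID
Replace ID with the transaction ID shown by yum history (a literal # would start a shell comment). If the most recent transaction doesn't show "Return-Code: Success", the post-install scripts might not have run successfully.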
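The locale and transaction checks above can also be scripted. The following is a minimal sketch, not part of ParallelCluster: has_locale and yum_txn_succeeded are hypothetical helper names, both read command output from standard input, en_US.UTF-8 is assumed to be the default locale (taken from the error message above), and the layout of the Return-Code field is assumed to match the text quoted above.

```shell
# Hypothetical helpers (not part of ParallelCluster); both read from stdin.

# normalize_locale: lowercase locale names and drop hyphens so that
# "en_US.UTF-8" and "en_US.utf8" compare equal.
normalize_locale() {
    tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# has_locale NAME: succeed if NAME appears in the locale list on stdin.
has_locale() {
    want=$(printf '%s\n' "$1" | normalize_locale)
    normalize_locale | grep -Fqx -- "$want"
}

# yum_txn_succeeded: succeed if the transaction details on stdin contain
# "Return-Code: Success" (spacing around the colon may vary).
yum_txn_succeeded() {
    grep -Eq '^Return-Code[[:space:]]*:[[:space:]]*Success'
}

# Example usage (ID is the transaction ID from yum history):
#   localedef --list | has_locale en_US.UTF-8 || echo "default locale missing"
#   yum history info ID | yum_txn_succeeded || echo "transaction did not succeed"
```

Reading from standard input keeps the helpers independent of which command produced the list, so the same check works with the output of locale -a.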
How to resolve
Rebuild the locale by reinstalling the language packs:
$
sudo yum reinstall glibc-all-langpacks
After the rebuild, verify the issue is fixed by running:
$
su - pcluster-admin
If no error or warning appears, the issue has been resolved.
OS package conflicts
When installing custom packages or modifying system packages, conflicts can arise that prevent proper cluster operation.
How to diagnose
-
Check the chef-client log for package-related errors:
$
less /var/log/chef-client.log
-
Look for package dependency conflicts in the cfn-init log:
$
less /var/log/cfn-init.log
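Paging through both logs by hand can be slow. As a rough sketch, the hypothetical helper below searches one or more log files for package-related messages and prints each match with its file name and line number. The search patterns are assumptions, not an exhaustive list of what yum, apt, or chef-client can emit, so adjust them to the messages you actually see.

```shell
# find_pkg_errors FILE...: print lines that look like package problems,
# prefixed with file name and line number. Hypothetical helper; the
# pattern list is an assumption and is matched case-insensitively.
find_pkg_errors() {
    grep -HniE 'conflict|dependenc|no package .* available|nothing provides' "$@"
}

# Example usage:
#   find_pkg_errors /var/log/chef-client.log /var/log/cfn-init.log
```

The function returns a nonzero status when nothing matches, which makes it usable in an if statement.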
How to resolve
-
If a specific package is causing issues, try reinstalling it:
$
sudo yum reinstall package-name
-
For dependency conflicts, you may need to remove conflicting packages:
$
sudo yum remove conflicting-package
-
If the issue persists, consider creating a custom AMI with your required packages pre-installed using the pcluster build-image command. For more information, see Amazon ParallelCluster AMI customization.
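To illustrate the custom AMI route, here is a hedged sketch that writes a minimal image configuration file and shows how it might be passed to pcluster build-image. The function name write_image_config, the instance type, the AMI ID, the image ID, and the S3 URL are all placeholders, and the configuration covers only the fields needed for a single install script. Check the AMI customization documentation for the full schema.

```shell
# write_image_config FILE PARENT_AMI SCRIPT_URL
# Hypothetical helper: writes a minimal build-image configuration to FILE.
# The instance type below is a placeholder, not a recommendation.
write_image_config() {
    cat > "$1" <<EOF
Build:
  InstanceType: c5.xlarge
  ParentImage: $2
  Components:
    - Type: script
      Value: $3
EOF
}

# Example usage (every value is a placeholder):
#   write_image_config image-config.yaml ami-EXAMPLE s3://bucket-name/install-packages.sh
#   pcluster build-image --image-id my-custom-image \
#       --image-configuration image-config.yaml
```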
System configuration file modifications
Modifying critical system configuration files can cause cluster failures, especially if these files are managed by Amazon ParallelCluster.
How to diagnose
-
Check for errors in the chef-client log that mention specific configuration files:
$
grep -i "config" /var/log/chef-client.log
-
Look for permission or syntax errors in configuration files:
$
less /var/log/cfn-init.log
How to resolve
-
Restore modified configuration files to their original state:
$
sudo cp /etc/file.conf.bak /etc/file.conf
-
If you need to make persistent changes to system configuration files, use custom bootstrap actions instead of directly modifying files:
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://bucket-name/config-script.sh
For more information, see Custom bootstrap actions.
-
For configuration changes that must be made directly to system files, consider creating a custom AMI. For more information, see Amazon ParallelCluster AMI customization.
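A bootstrap script may run more than once over the life of a node, so it helps if each change is safe to repeat. The sketch below shows one way a script such as config-script.sh could add a setting without duplicating it. ensure_line is a hypothetical helper, and the file path and setting in the usage comment are placeholders.

```shell
# ensure_line FILE LINE: append LINE to FILE unless an identical line is
# already present. Hypothetical helper for an idempotent bootstrap script.
ensure_line() {
    grep -Fqx -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example usage inside a bootstrap script (placeholder path and setting):
#   ensure_line /etc/file.conf 'option = value'
```

Because the check matches the whole line exactly, running the script again leaves the file unchanged.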
Kernel updates and compatibility issues
Kernel updates can cause compatibility issues with certain Amazon services, particularly with Amazon FSx for Lustre.
How to diagnose
-
Check if kernel updates have been applied:
$
uname -r
-
Look for Amazon FSx mount failures in the logs:
$
grep -i "fsx" /var/log/chef-client.log
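The output of uname -r is only useful if you know which kernel release the node is supposed to run. As a small sketch, the hypothetical helper below compares the running release against an expected value that you supply, for example one recorded when the cluster was known to work. The second argument exists only so the comparison can be exercised without changing kernels.

```shell
# kernel_matches EXPECTED [RUNNING]: succeed if the running kernel release
# equals EXPECTED. RUNNING defaults to the output of uname -r.
# Hypothetical helper, not part of ParallelCluster.
kernel_matches() {
    running=${2:-$(uname -r)}
    [ "$running" = "$1" ]
}

# Example usage (EXPECTED_RELEASE is a placeholder you set yourself):
#   kernel_matches "$EXPECTED_RELEASE" || echo "kernel changed: $(uname -r)"
```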
How to resolve
-
For Ubuntu 22.04, avoid updating to the latest kernel as there is no Amazon FSx client for that kernel. For more information, see Operating system considerations.
-
If you've already updated the kernel and are experiencing issues, consider downgrading to a compatible kernel version:
$
sudo apt install linux-image-previous-version
-
For persistent kernel customizations, create a custom AMI with the specific kernel version you need. For more information, see Amazon ParallelCluster AMI customization.
Best practices for OS configuration changes
To minimize issues when making OS configuration changes:
-
Use Custom Bootstrap Actions: Instead of directly modifying system files, use OnNodeStart or OnNodeConfigured scripts to make changes in a controlled manner. For more information, see Custom bootstrap actions.
-
Create Custom AMIs: For significant OS modifications, create a custom AMI using pcluster build-image rather than making changes to running instances. For more information, see Amazon ParallelCluster AMI customization.
-
Test Changes First: Before applying changes to a production cluster, test them on a small test cluster to ensure compatibility.
-
Document Changes: Keep track of all OS configuration changes made to facilitate troubleshooting.
-
Back Up Configuration Files: Before modifying any system configuration file, create a backup:
$
sudo cp /etc/file.conf /etc/file.conf.bak
-
Check Logs After Changes: After making OS configuration changes, check the logs for any errors:
$
less /var/log/cfn-init.log
$
less /var/log/chef-client.log
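The backup practice above can be wrapped in a pair of small functions. This is a sketch under the same .bak naming used in this section. backup_file and restore_file are hypothetical names, and refusing to overwrite an existing backup is a design choice made here so that the first known-good copy is not replaced by an already modified file.

```shell
# backup_file FILE: copy FILE to FILE.bak, preserving mode and timestamps.
# Fails if FILE.bak already exists so the original copy is never overwritten.
backup_file() {
    if [ -e "$1.bak" ]; then
        echo "backup already exists: $1.bak" >&2
        return 1
    fi
    cp -p -- "$1" "$1.bak"
}

# restore_file FILE: copy FILE.bak back over FILE.
restore_file() {
    cp -p -- "$1.bak" "$1"
}

# Example usage (run with sufficient privileges for files under /etc):
#   backup_file /etc/file.conf
#   restore_file /etc/file.conf
```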
By following these guidelines, you can minimize the risk of OS configuration changes causing cluster failures and more effectively troubleshoot any issues that do arise.