AMI patching and EC2 instance replacement
To ensure that all dynamically launched cluster compute nodes behave in a consistent manner, Amazon ParallelCluster disables automatic OS updates on cluster instances. Additionally, a specific set of Amazon ParallelCluster AMIs is built for each version of Amazon ParallelCluster and its associated CLI. This set of AMIs remains unchanged, and each set is supported only by the Amazon ParallelCluster version it was built for. Amazon ParallelCluster AMIs for released versions aren't updated.
However, to address emerging security issues, you might want to patch these AMIs and then update your clusters with the patched AMI. This is consistent with the Amazon ParallelCluster Shared Responsibility Model.
To view the specific set of Amazon ParallelCluster AMIs supported by the Amazon ParallelCluster CLI version you are currently using, run:
$ pcluster version
$ pcluster list-official-images
The Amazon ParallelCluster head node is a static instance, and you can manually update it. Restarting and rebooting the head node are fully supported starting with Amazon ParallelCluster version 3.0.0.
If your instances have ephemeral instance stores, you must remember to save instance store data before manual updates. For more information, see the HeadNode / LocalStorage / EphemeralVolume cluster configuration and Instance types with instance store volumes in the Amazon EC2 User Guide for Linux Instances.
The compute nodes are ephemeral instances. By default you can only access them from the head node. Starting with Amazon ParallelCluster version 3.0.0, you can update the AMI associated with compute instances by modifying the Scheduling / SlurmQueues / Image / CustomAmi parameter and running the pcluster update-cluster command, after stopping the compute fleet with pcluster update-compute-fleet:
$ pcluster update-compute-fleet --cluster-name cluster-name --status STOP_REQUESTED
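The full stop-update-start flow can be sketched as a small shell helper. This is a sketch, not the documented procedure verbatim: the cluster name and configuration path are placeholders, and the CustomAmi change itself is made by editing the cluster configuration before calling the function.

```shell
# Sketch of the compute AMI update flow. Before running this, edit the
# cluster configuration so Scheduling/SlurmQueues/Image/CustomAmi points
# at the new (patched) AMI.
update_compute_ami() {
  # $1 = cluster name, $2 = cluster config edited with the new CustomAmi
  pcluster update-compute-fleet --cluster-name "$1" --status STOP_REQUESTED &&
  # In practice, wait for the fleet status to reach STOPPED before updating.
  pcluster update-cluster --cluster-name "$1" --cluster-configuration "$2" &&
  pcluster update-compute-fleet --cluster-name "$1" --status START_REQUESTED
}

# Example (placeholder names):
# update_compute_ami my-cluster cluster-config.yaml
```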
It's possible to automate the creation of an updated custom AMI for the compute nodes by using one of the following methods:
- Use the pcluster build-image command with an updated Build / ParentImage.
- Run the build with Build / UpdateOsPackages / Enabled: true.
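For the second method, a build-image configuration might look like the following sketch. The ParentImage, InstanceType, and image id values are placeholders; substitute your own.

```shell
# Sketch: write a build-image configuration that applies OS package updates
# during the AMI build. All concrete values below are placeholders.
cat > /tmp/build-image.yaml <<'EOF'
Build:
  ParentImage: ami-0123456789abcdef0   # base ParallelCluster AMI to patch
  InstanceType: c5.xlarge              # instance used for the build
  UpdateOsPackages:
    Enabled: true                      # apply OS updates during the build
EOF

# Then run the build (image id is a placeholder):
# pcluster build-image --image-id patched-compute-ami \
#   --image-configuration /tmp/build-image.yaml
```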
Head node instance update or replacement
In some circumstances, you might be required to restart or reboot the head node. For example, this is required when you manually update the OS, or when an Amazon EC2 scheduled instance retirement requires a head node restart.
If your instance doesn't have ephemeral drives, you can stop and start it again at any time. In the case of a scheduled retirement, starting the stopped instance migrates it to new hardware. For this case and for other cases of instances without ephemeral volumes, continue to Stop and start a cluster's head node.
If your instance has ephemeral drives and it's been stopped, the data in the instance store is lost. You can determine whether the instance type used for the head node has instance stores from the table found in Instance store volumes.
Save data from ephemeral drives
Starting with Amazon ParallelCluster version 3.0.0, head node restart and reboot are fully supported for every instance type. However, if an instance with an ephemeral drive is stopped, the drive's data is lost. Follow the next steps to preserve your data before a head node restart or reboot.
To check whether you have data that needs to be preserved, view the contents of the EphemeralVolume / MountDir folder (/scratch by default).
You can transfer the data to the root volume or to shared storage attached to the cluster, such as Amazon FSx, Amazon EFS, or Amazon EBS. Note that transferring data to remote storage can incur additional costs.
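One way to do the transfer is a plain recursive copy to a shared directory. The paths in this sketch are assumptions: /scratch is the default EphemeralVolume mount dir, and the destination can be any root-volume or shared-storage directory on your cluster.

```shell
# Sketch: copy ephemeral data to shared storage before stopping the head node.
save_scratch() {
  # $1 = source dir (ephemeral mount), $2 = destination dir
  mkdir -p "$2"
  cp -a "$1"/. "$2"/   # -a preserves ownership, permissions, timestamps
}

# Example (assumed default mount point and a hypothetical shared dir):
# save_scratch /scratch /shared/scratch-backup
```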
After saving the data, continue to Stop and start a cluster's head node.
Stop and start a cluster's head node
- Verify that there aren't any running jobs in the cluster. When using the Slurm scheduler:
  - If the sbatch --no-requeue option isn't specified, running jobs are requeued.
  - If the --no-requeue option is specified, running jobs fail.
Request a cluster compute fleet stop:
$
pcluster update-compute-fleet --cluster-name
cluster-name
--status STOP_REQUESTED{ "status": "STOP_REQUESTED", ... }
- Wait until the compute fleet status is STOPPED:
$ pcluster describe-compute-fleet --cluster-name cluster-name
{ "status": "STOPPED", ... }
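Instead of re-running describe-compute-fleet by hand, the wait can be scripted as a polling loop. This is a sketch: the grep-based JSON match is a simplification (a jq query would be more robust), and the timeout is an arbitrary choice.

```shell
# Sketch: poll until the compute fleet reports a target status.
wait_for_fleet_status() {
  # $1 = target status; remaining args = command that prints the fleet JSON
  target="$1"; shift
  attempts=0
  while [ "$attempts" -lt 60 ]; do
    if "$@" | grep -q "\"status\": \"$target\""; then
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 10
  done
  return 1   # timed out after ~10 minutes
}

# Example:
# wait_for_fleet_status STOPPED pcluster describe-compute-fleet --cluster-name cluster-name
```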
- For manual updates with an OS reboot or instance restart, you can use the Amazon Web Services Management Console or the Amazon CLI. The following is an example using the Amazon CLI.
# Retrieve the head node instance id
$ pcluster describe-cluster --cluster-name cluster-name
{ "headNode": { "instanceId": "i-1234567890abcdef0", ... }, ... }
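If you want the instance id in a variable for the commands that follow, it can be pulled out of the JSON with standard tools. This is a sketch: the inline sample payload stands in for real pcluster describe-cluster output, and a jq one-liner would work equally well.

```shell
# Sketch: extract the head node instance id from describe-cluster JSON.
json='{ "headNode": { "instanceId": "i-1234567890abcdef0" } }'
instance_id=$(printf '%s' "$json" | sed -n 's/.*"instanceId": *"\([^"]*\)".*/\1/p')
echo "$instance_id"   # i-1234567890abcdef0
```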
# Stop and start the instance
$ aws ec2 stop-instances --instance-ids i-1234567890abcdef0
{ "StoppingInstances": [ { "CurrentState": { "Name": "stopping" ... }, "InstanceId": "i-1234567890abcdef0", "PreviousState": { "Name": "running" ... } } ] }
$ aws ec2 start-instances --instance-ids i-1234567890abcdef0
{ "StartingInstances": [ { "CurrentState": { "Name": "pending" ... }, "InstanceId": "i-1234567890abcdef0", "PreviousState": { "Name": "stopped" ... } } ] }
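Because start-instances returns while the instance is still pending, the sequence can be made to block on each state transition with the aws ec2 wait subcommands. A minimal sketch, with the instance id as a placeholder:

```shell
# Sketch: stop and start the head node, blocking until each transition
# completes via the AWS CLI waiters.
restart_head_node() {
  # $1 = head node instance id (from `pcluster describe-cluster`)
  aws ec2 stop-instances --instance-ids "$1" &&
  aws ec2 wait instance-stopped --instance-ids "$1" &&
  aws ec2 start-instances --instance-ids "$1" &&
  aws ec2 wait instance-running --instance-ids "$1"
}

# Example:
# restart_head_node i-1234567890abcdef0
```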
- Start the cluster compute fleet:
$ pcluster update-compute-fleet --cluster-name cluster-name --status START_REQUESTED
{ "status": "START_REQUESTED", ... }