
Amazon ParallelCluster processes

This section applies only to HPC clusters that are deployed with one of the supported traditional job schedulers (SGE, Slurm, or Torque). When used with these schedulers, Amazon ParallelCluster manages compute node provisioning and removal by interacting with both the Auto Scaling group and the underlying job scheduler.
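
The scheduler is chosen in the cluster configuration file. As a minimal sketch, a version 2.x configuration section selecting Slurm might look like the following (the section name and values are illustrative; initial_queue_size, max_queue_size, and maintain_initial_size shape the Auto Scaling group that the processes described below manage):

    [cluster default]
    scheduler = slurm
    initial_queue_size = 0
    max_queue_size = 10
    maintain_initial_size = false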

For HPC clusters that are based on Amazon Batch, Amazon ParallelCluster relies on the capabilities that Amazon Batch provides for compute node management.

Note

Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers. You can continue using them in versions up to and including 2.11.4, but they aren't eligible for future updates or troubleshooting support from the Amazon service and Amazon Support teams.

SGE and Torque integration processes

Note

This section only applies to Amazon ParallelCluster versions up to and including version 2.11.4. Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE and Torque schedulers, Amazon SNS, and Amazon SQS.

General overview

A cluster's lifecycle begins when a user creates it, typically from the command line interface (CLI). After it's created, a cluster exists until it's deleted. Amazon ParallelCluster daemons run on the cluster nodes, mainly to manage HPC cluster elasticity. The following diagram shows a user workflow and the cluster lifecycle. The sections that follow describe the Amazon ParallelCluster daemons that are used to manage the cluster.

[Diagram: Cluster lifecycle]

With SGE and Torque schedulers, Amazon ParallelCluster uses nodewatcher, jobwatcher, and sqswatcher processes.

jobwatcher

When a cluster is running, the jobwatcher process, owned by the root user, monitors the configured scheduler (SGE or Torque). Once each minute, it evaluates the queue to decide when to scale up.

[Diagram: jobwatcher workflow]
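
As a minimal sketch of this polling behavior (not the actual jobwatcher implementation), assume a hypothetical helper get_pending_job_slots() that queries the scheduler, and a hypothetical Auto Scaling group named parallelcluster-compute:

    import time

    import boto3

    ASG_NAME = "parallelcluster-compute"  # hypothetical Auto Scaling group name
    asg = boto3.client("autoscaling")

    def get_pending_job_slots():
        """Hypothetical helper: query the scheduler (for example, qstat for
        SGE) and return the number of slots requested by pending jobs."""
        return 0  # placeholder; a real implementation parses scheduler output

    while True:
        pending = get_pending_job_slots()
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME]
        )["AutoScalingGroups"][0]
        if pending > 0:
            # Grow the fleet to cover pending work, but never beyond
            # the group's configured maximum size.
            desired = min(group["DesiredCapacity"] + pending, group["MaxSize"])
            if desired > group["DesiredCapacity"]:
                asg.set_desired_capacity(
                    AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired
                )
        time.sleep(60)  # evaluate the queue once each minute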

sqswatcher

The sqswatcher process monitors for Amazon SQS messages that Auto Scaling sends to report state changes within the cluster. When an instance comes online, it submits an "instance ready" message to Amazon SQS. This message is picked up by sqswatcher, which runs on the head node. These messages notify the queue manager when instances come online or are terminated, so that they can be added to or removed from the queue.

[Diagram: sqswatcher workflow]
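
The message-handling loop can be sketched as follows (the queue URL and message fields are assumptions for illustration; add_host_to_scheduler and remove_host_from_scheduler stand in for the scheduler-specific commands):

    import json

    import boto3

    # Hypothetical queue URL; the real queue is created with the cluster.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cluster-queue"
    sqs = boto3.client("sqs")

    def add_host_to_scheduler(hostname):
        """Hypothetical helper: register the host with SGE or Torque."""

    def remove_host_from_scheduler(hostname):
        """Hypothetical helper: remove the host from the scheduler."""

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])  # assumed message shape
            if body.get("Type") == "instance_ready":
                add_host_to_scheduler(body["Hostname"])
            elif body.get("Type") == "instance_terminated":
                remove_host_from_scheduler(body["Hostname"])
            # Delete each message once it has been processed.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )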

nodewatcher

The nodewatcher process runs on each node in the compute fleet. If a node remains idle for longer than the user-defined scaledown_idletime period, nodewatcher terminates the instance.

[Diagram: nodewatcher workflow]
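
A minimal sketch of the idle-timeout check follows. Here is_node_idle() is a hypothetical helper that asks the scheduler whether this host is running jobs, and the self-termination step is simplified:

    import time
    import urllib.request

    import boto3

    SCALEDOWN_IDLETIME = 10 * 60  # seconds; stands in for the user-defined setting

    def is_node_idle():
        """Hypothetical helper: return True if the scheduler reports
        no jobs running on this host."""
        return False  # placeholder; a real implementation queries the scheduler

    def self_terminate():
        # Simplified: read this instance's ID from instance metadata,
        # then ask Amazon EC2 to terminate it.
        instance_id = urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-id"
        ).read().decode()
        boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])

    idle_since = None
    while True:
        if is_node_idle():
            idle_since = idle_since or time.monotonic()
            if time.monotonic() - idle_since >= SCALEDOWN_IDLETIME:
                self_terminate()
        else:
            idle_since = None  # reset the timer whenever jobs are running
        time.sleep(60)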

Slurm integration processes

With the Slurm scheduler, Amazon ParallelCluster uses the clustermgtd and computemgtd processes.

clustermgtd

Clusters that run in heterogeneous mode (indicated by specifying a queue_settings value) have a cluster management daemon (clustermgtd) process that runs on the head node. The cluster management daemon performs the following tasks (see the sketch after this list):

  • Inactive partition clean-up

  • Static capacity management: makes sure that static capacity is always up and healthy

  • Synchronization of the scheduler with Amazon EC2

  • Orphaned instance clean-up

  • Restoration of the scheduler node status when an Amazon EC2 termination happens outside of the suspend workflow

  • Management of unhealthy Amazon EC2 instances (those failing Amazon EC2 health checks)

  • Management of scheduled maintenance events

  • Management of unhealthy scheduler nodes (those failing scheduler health checks)
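
These responsibilities can be pictured as one periodic maintenance loop. In the following sketch, each function is a placeholder for one task from the list above, not the actual clustermgtd internals:

    import time

    # Placeholder functions, one per task from the list above.
    def clean_up_inactive_partitions(): ...
    def maintain_static_capacity(): ...
    def sync_scheduler_with_ec2(): ...
    def clean_up_orphaned_instances(): ...
    def handle_out_of_band_terminations(): ...
    def handle_unhealthy_ec2_instances(): ...
    def handle_scheduled_maintenance_events(): ...
    def handle_unhealthy_scheduler_nodes(): ...

    while True:
        clean_up_inactive_partitions()
        maintain_static_capacity()
        sync_scheduler_with_ec2()
        clean_up_orphaned_instances()
        handle_out_of_band_terminations()
        handle_unhealthy_ec2_instances()
        handle_scheduled_maintenance_events()
        handle_unhealthy_scheduler_nodes()
        time.sleep(60)  # illustrative interval; the real daemon's cadence may differ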

computemgtd

Clusters that run in heterogeneous mode (indicated by specifying a queue_settings value) have compute management daemon (computemgtd) processes that run on each compute node. Every five minutes, the compute management daemon confirms that the head node can be reached and is healthy. If five minutes pass during which the head node can't be reached or isn't healthy, the compute node is shut down.
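
A minimal sketch of that self-protection loop, assuming a single ping as the health probe (the real daemon's check and shutdown path may differ in detail):

    import subprocess
    import time

    HEAD_NODE = "head-node-hostname"  # hypothetical; resolved from cluster configuration in practice
    CHECK_INTERVAL = 5 * 60           # seconds between health checks

    def head_node_is_healthy():
        """Illustrative probe: a single ping to the head node."""
        result = subprocess.run(
            ["ping", "-c", "1", HEAD_NODE], capture_output=True
        )
        return result.returncode == 0

    while True:
        if not head_node_is_healthy():
            # Simplified self-shutdown when the head node is unreachable.
            subprocess.run(["shutdown", "-h", "now"])
        time.sleep(CHECK_INTERVAL)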