
Amazon ParallelCluster processes

This section applies only to HPC clusters that are deployed with one of the supported traditional job schedulers (SGE, Slurm, or Torque). When used with these schedulers, Amazon ParallelCluster manages compute node provisioning and removal by interacting with both the Auto Scaling group and the underlying job scheduler.
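
The scheduler is chosen in the cluster configuration file. As a minimal sketch, a version 2.x configuration section selecting Slurm might look like the following (the section name and values are illustrative; initial_queue_size, max_queue_size, and maintain_initial_size shape the Auto Scaling group that the processes described below manage):

    [cluster default]
    scheduler = slurm
    initial_queue_size = 0
    max_queue_size = 10
    maintain_initial_size = false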

For HPC clusters that are based on Amazon Batch, Amazon ParallelCluster relies on the capabilities that Amazon Batch provides for compute node management.

Note

Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers. You can continue using them in versions up to and including 2.11.4, but they aren't eligible for future updates or troubleshooting support from the Amazon service and Amazon Support teams.

SGE and Torque integration processes

Note

This section only applies to Amazon ParallelCluster versions up to and including version 2.11.4. Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE and Torque schedulers, Amazon SNS, and Amazon SQS.

General overview

A cluster's lifecycle begins when a user creates it, typically from the command line interface (CLI). After it's created, a cluster exists until it's deleted. Amazon ParallelCluster daemons run on the cluster nodes, mainly to manage HPC cluster elasticity. The following diagram shows a user workflow and the cluster lifecycle. The sections that follow describe the Amazon ParallelCluster daemons that are used to manage the cluster.

[Diagram: Cluster lifecycle]

With SGE and Torque schedulers, Amazon ParallelCluster uses nodewatcher, jobwatcher, and sqswatcher processes.

jobwatcher

When a cluster is running, the jobwatcher process, owned by the root user, monitors the configured scheduler (SGE or Torque). Once each minute, it evaluates the queue to decide when to scale up.

[Diagram: jobwatcher workflow]
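
As a minimal sketch of this polling behavior (not the actual jobwatcher implementation), assume a hypothetical helper get_pending_job_slots() that queries the scheduler, and a hypothetical Auto Scaling group named parallelcluster-compute:

    import time

    import boto3

    ASG_NAME = "parallelcluster-compute"  # hypothetical Auto Scaling group name
    asg = boto3.client("autoscaling")

    def get_pending_job_slots():
        """Hypothetical helper: query the scheduler (for example, qstat for
        SGE) and return the number of slots requested by pending jobs."""
        return 0  # placeholder; a real implementation parses scheduler output

    while True:
        pending = get_pending_job_slots()
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME]
        )["AutoScalingGroups"][0]
        if pending > 0:
            # Grow the fleet to cover pending work, but never beyond
            # the group's configured maximum size.
            desired = min(group["DesiredCapacity"] + pending, group["MaxSize"])
            if desired > group["DesiredCapacity"]:
                asg.set_desired_capacity(
                    AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired
                )
        time.sleep(60)  # evaluate the queue once each minute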

sqswatcher

The sqswatcher process monitors for Amazon SQS messages that Auto Scaling sends to report state changes within the cluster. When an instance comes online, it submits an "instance ready" message to Amazon SQS. This message is picked up by sqswatcher, which runs on the head node. These messages notify the queue manager when instances come online or are terminated, so that they can be added to or removed from the queue.

[Diagram: sqswatcher workflow]
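
The message-handling loop can be sketched as follows (the queue URL and message fields are assumptions for illustration; add_host_to_scheduler and remove_host_from_scheduler stand in for the scheduler-specific commands):

    import json

    import boto3

    # Hypothetical queue URL; the real queue is created with the cluster.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cluster-queue"
    sqs = boto3.client("sqs")

    def add_host_to_scheduler(hostname):
        """Hypothetical helper: register the host with SGE or Torque."""

    def remove_host_from_scheduler(hostname):
        """Hypothetical helper: remove the host from the scheduler."""

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])  # assumed message shape
            if body.get("Type") == "instance_ready":
                add_host_to_scheduler(body["Hostname"])
            elif body.get("Type") == "instance_terminated":
                remove_host_from_scheduler(body["Hostname"])
            # Delete each message once it has been processed.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )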

nodewatcher

The nodewatcher process runs on each node in the compute fleet. If a node remains idle for longer than the user-defined scaledown_idletime period, nodewatcher terminates the instance.

[Diagram: nodewatcher workflow]
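
A minimal sketch of the idle-timeout check follows. Here is_node_idle() is a hypothetical helper that asks the scheduler whether this host is running jobs, and the self-termination step is simplified:

    import time
    import urllib.request

    import boto3

    SCALEDOWN_IDLETIME = 10 * 60  # seconds; stands in for the user-defined setting

    def is_node_idle():
        """Hypothetical helper: return True if the scheduler reports
        no jobs running on this host."""
        return False  # placeholder; a real implementation queries the scheduler

    def self_terminate():
        # Simplified: read this instance's ID from instance metadata,
        # then ask Amazon EC2 to terminate it.
        instance_id = urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-id"
        ).read().decode()
        boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])

    idle_since = None
    while True:
        if is_node_idle():
            idle_since = idle_since or time.monotonic()
            if time.monotonic() - idle_since >= SCALEDOWN_IDLETIME:
                self_terminate()
        else:
            idle_since = None  # reset the timer whenever jobs are running
        time.sleep(60)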

Slurm integration processes

With the Slurm scheduler, Amazon ParallelCluster uses the clustermgtd and computemgtd processes.

clustermgtd

Clusters that run in heterogeneous mode (indicated by specifying a queue_settings value) have a cluster management daemon (clustermgtd) process that runs on the head node. The cluster management daemon performs the following tasks (see the sketch after this list):

  • Inactive partition clean-up

  • Static capacity management: makes sure that static capacity is always up and healthy

  • Synchronization of the scheduler with Amazon EC2

  • Orphaned instance clean-up

  • Restoration of the scheduler node status when an Amazon EC2 termination happens outside of the suspend workflow

  • Management of unhealthy Amazon EC2 instances (those failing Amazon EC2 health checks)

  • Management of scheduled maintenance events

  • Management of unhealthy scheduler nodes (those failing scheduler health checks)
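
These responsibilities can be pictured as one periodic maintenance loop. In the following sketch, each function is a placeholder for one task from the list above, not the actual clustermgtd internals:

    import time

    # Placeholder functions, one per task from the list above.
    def clean_up_inactive_partitions(): ...
    def maintain_static_capacity(): ...
    def sync_scheduler_with_ec2(): ...
    def clean_up_orphaned_instances(): ...
    def handle_out_of_band_terminations(): ...
    def handle_unhealthy_ec2_instances(): ...
    def handle_scheduled_maintenance_events(): ...
    def handle_unhealthy_scheduler_nodes(): ...

    while True:
        clean_up_inactive_partitions()
        maintain_static_capacity()
        sync_scheduler_with_ec2()
        clean_up_orphaned_instances()
        handle_out_of_band_terminations()
        handle_unhealthy_ec2_instances()
        handle_scheduled_maintenance_events()
        handle_unhealthy_scheduler_nodes()
        time.sleep(60)  # illustrative interval; the real daemon's cadence may differ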

computemgtd

Clusters that run in heterogeneous mode (indicated by specifying a queue_settings value) have compute management daemon (computemgtd) processes that run on each compute node. Every five minutes, the compute management daemon confirms that the head node can be reached and is healthy. If five minutes pass during which the head node can't be reached or isn't healthy, the compute node is shut down.
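
A minimal sketch of that self-protection loop, assuming a single ping as the health probe (the real daemon's check and shutdown path may differ in detail):

    import subprocess
    import time

    HEAD_NODE = "head-node-hostname"  # hypothetical; resolved from cluster configuration in practice
    CHECK_INTERVAL = 5 * 60           # seconds between health checks

    def head_node_is_healthy():
        """Illustrative probe: a single ping to the head node."""
        result = subprocess.run(
            ["ping", "-c", "1", HEAD_NODE], capture_output=True
        )
        return result.returncode == 0

    while True:
        if not head_node_is_healthy():
            # Simplified self-shutdown when the head node is unreachable.
            subprocess.run(["shutdown", "-h", "now"])
        time.sleep(CHECK_INTERVAL)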