
Slurm cluster fast insufficient capacity fail-over

Starting with Amazon ParallelCluster version 3.2.0, clusters run with the fast insufficient capacity fail-over mode enabled by default. This minimizes the time spent retrying to queue a job when Amazon EC2 insufficient capacity errors are detected. This is particularly effective when you configure your cluster with multiple kinds of instance types.
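For example, a queue that spreads its capacity across multiple compute resources with different instance types gives the scheduler an alternative target when one instance type runs out of capacity. The following cluster configuration sketch is illustrative only: the instance types, counts, and subnet ID are placeholder values, and the queue and compute resource names (queue1, c-1, c-2) simply mirror the examples later on this page.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
      ComputeResources:
        - Name: c-1
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 15
        - Name: c-2
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 15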

The following Amazon EC2 insufficient capacity failures are detected:
  • InsufficientInstanceCapacity

  • InsufficientHostCapacity

  • InsufficientReservedInstanceCapacity

  • MaxSpotInstanceCountExceeded

  • SpotMaxPriceTooLow: Activated if the Spot request price is lower than the minimum required Spot request fulfillment price.

  • Unsupported: Activated with the use of an instance type that isn't supported in a specific Amazon Web Services Region.

In fast insufficient capacity fail-over mode, if an insufficient capacity error is detected when a job is assigned to a SlurmQueues compute resource, Amazon ParallelCluster does the following:

  1. It sets the compute resource to a disabled (DOWN) state for a predefined period of time.

  2. It uses POWER_DOWN_FORCE to cancel the failing node's jobs and to suspend the failing node. It sets the failing node to the IDLE and POWER_DOWN (!) state, and then to POWERING_DOWN (%).

  3. It requeues the job to another compute resource.

The static and powered up nodes of the disabled compute resource aren't impacted. Jobs can complete on these nodes.

This cycle repeats until the job is successfully assigned to a compute resource node or nodes. For information about node states, see the Slurm guide for multiple queue mode.

If no compute resources are found to run the job, the job is set to the PENDING state until the predefined period of time elapses. In this case, you can modify the predefined period of time as described in the following section.

Insufficient capacity timeout parameter

insufficient_capacity_timeout

insufficient_capacity_timeout specifies the period of time (in seconds) that the compute resource is kept in the disabled (down) state when an insufficient capacity error is detected.

By default, insufficient_capacity_timeout is enabled.

The default insufficient_capacity_timeout is 600 seconds (10 minutes).

If the insufficient_capacity_timeout value is less than or equal to zero, fast insufficient capacity fail-over mode is disabled.

You can change the insufficient_capacity_timeout value by adding the parameter to the clustermgtd config file located at /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf on the head node.

The parameter can be updated at any time without stopping the compute fleet.

For example:

  • insufficient_capacity_timeout=600:

    If an insufficient capacity error is detected, the compute resource is set to the disabled (DOWN) state. After 10 minutes, its failed node is set to the idle~ (POWER_SAVING) state.

  • insufficient_capacity_timeout=60:

    If an insufficient capacity error is detected, the compute resource is set to the disabled (DOWN) state. After 1 minute, its failed node is set to the idle~ state.

  • insufficient_capacity_timeout=0:

    Fast insufficient capacity fail-over mode is disabled. The compute resource isn't disabled.
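As a sketch of how you might make the change described above, edit the parameter on the head node. The 180-second value below is an arbitrary example, and any entries already present in the file should be left in place.

$ sudo vi /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf

# Add or edit this line; a value less than or equal to zero disables the mode.
insufficient_capacity_timeout=180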

Note

There might be a delay of up to one minute between the time when nodes fail with insufficient capacity errors and the time when the cluster management daemon detects the node failures. This is because the cluster management daemon checks for node insufficient capacity failures and sets the compute resources to the down state at one-minute intervals.

Fast insufficient capacity fail-over mode status

When a cluster is in fast insufficient capacity fail-over mode, you can check its status and node states.

Node states

When a job is submitted to a compute resource dynamic node and an insufficient capacity error is detected, the node is placed in the down# state with the reason:

(Code:InsufficientInstanceCapacity)Failure when resuming nodes

Then, the powered off nodes (nodes in the idle~ state) within the compute resource are set to down~ with the reason:

(Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity

The job is requeued to other compute resources in the queue.

The compute resource static nodes and nodes that are UP aren't impacted by fast insufficient capacity fail-over mode.

Consider the node states shown in the following example.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite     30  idle~ queue1-dy-c-1-[1-15],queue1-dy-c-2-[1-15]
queue2       up   infinite     30  idle~ queue2-dy-c-1-[1-15],queue2-dy-c-2-[1-15]

We submit a job to queue1 that requires one node.
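For illustration, a one-node job could be submitted to queue1 as follows; the --wrap command is just a placeholder workload, not part of the output shown below.

$ sbatch --partition=queue1 --nodes=1 --wrap "sleep 300"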

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite      1  down# queue1-dy-c-1-1
queue1*      up   infinite     15  idle~ queue1-dy-c-2-[1-15]
queue1*      up   infinite     14  down~ queue1-dy-c-1-[2-15]
queue2       up   infinite     30  idle~ queue2-dy-c-1-[1-15],queue2-dy-c-2-[1-15]

Node queue1-dy-c-1-1 is launched to run the job. However, the instance fails to launch because of an insufficient capacity error. Node queue1-dy-c-1-1 is set to the down# state. The powered off dynamic nodes within the compute resource (queue1-dy-c-1) are set to the down~ state.

You can check the node reason with scontrol show nodes.

$ scontrol show nodes queue1-dy-c-1-1
NodeName=queue1-dy-c-1-1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=96 CPULoad=0.00
   ...
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=(Code:InsufficientInstanceCapacity)Failure when resuming nodes [root@2022-03-10T22:17:50]

$ scontrol show nodes queue1-dy-c-1-2
NodeName=queue1-dy-c-1-2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=96 CPULoad=0.00
   ...
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=(Code:InsufficientInstanceCapacity)Temporarily disabling node due to insufficient capacity [root@2022-03-10T22:17:50]

The job is requeued to another instance type within the queue's compute resources.

After the insufficient_capacity_timeout elapses, nodes in the compute resource are reset to the idle~ state.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite     30  idle~ queue1-dy-c-1-[1-15],queue1-dy-c-2-[1-15]
queue2       up   infinite     30  idle~ queue2-dy-c-1-[1-15],queue2-dy-c-2-[1-15]

After the insufficient_capacity_timeout elapses and nodes in the compute resource are reset to the idle~ state, the Slurm scheduler gives the nodes lower priority. The scheduler keeps selecting nodes from other queue compute resources with higher weights unless one of the following occurs:

  • A job's submission requirements match the recovered compute resource.

  • No other compute resources are available because they are at capacity.

  • slurmctld is restarted.

  • The Amazon ParallelCluster compute fleet is stopped and started to power down and power up all nodes (see the example after this list).
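For the last case, the following is a minimal sketch of stopping and then starting the compute fleet with the Amazon ParallelCluster CLI. The cluster name is a placeholder, and you might also need to pass --region depending on your CLI configuration.

$ pcluster update-compute-fleet --cluster-name my-cluster --status STOP_REQUESTED
$ pcluster update-compute-fleet --cluster-name my-cluster --status START_REQUESTED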

Related logs

Logs related to insufficient capacity errors and fast insufficient capacity fail-over mode can be found in Slurm's resume log and clustermgtd log in the head node.
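For example, assuming you are logged in to the head node, you can follow both logs at once (root or sudo access is typically required):

$ sudo tail -f /var/log/parallelcluster/slurm_resume.log /var/log/parallelcluster/clustermgtd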

Slurm resume (/var/log/parallelcluster/slurm_resume.log)

Error messages when a node fails to launch because of insufficient capacity.

[slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: dcd0c252-90d4-44a7-9c79-ef740f7ecd87
[slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['queue1-dy-c-1-1']: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c.

Slurm clustermgtd (/var/log/parallelcluster/clustermgtd)

Compute resource c-1 in queue1 is disabled because of insufficient capacity.

[slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'queue1': {'c-1': ComputeResourceFailureEvent(timestamp=datetime.datetime(2022, 4, 14, 23, 0, 4, 769380, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources are reset after insufficient capacity timeout (600 seconds) expired

After the insufficient capacity timeout expires, the compute resource is reset and the nodes within the compute resource are set to the idle~ state.

[root:_reset_insufficient_capacity_timeout_expired_nodes] - INFO - Reset the following compute resources because insufficient capacity timeout expired: {'queue1': ['c-1']}