
Cluster scaling for dynamic nodes

ParallelCluster supports Slurm's methods to dynamically scale clusters by using Slurm's power saving plugin. For more information, see the Cloud Scheduling Guide and the Slurm Power Saving Guide in the Slurm documentation.

Starting with ParallelCluster version 3.8.0, ParallelCluster uses job-level resume, or job-level scaling, as the default dynamic node allocation strategy to scale the cluster: ParallelCluster scales up the cluster based on the requirements of each job, the number of nodes allocated to the job, and which nodes need to be resumed. ParallelCluster gets this information from the SLURM_RESUME_FILE environment variable.

Scaling dynamic nodes is a two-step process that involves launching Amazon EC2 instances and assigning the launched Amazon EC2 instances to Slurm nodes. Each of these two steps can be done using either all-or-nothing or best-effort logic.

For launch of the Amazon EC2 instances:

  • all-or-nothing calls the Amazon EC2 launch API with the minimum target capacity equal to the total target capacity

  • best-effort calls the Amazon EC2 launch API with the minimum target capacity equal to 1 and the total target capacity equal to the requested capacity

For assignment of the Amazon EC2 instances to Slurm nodes:

  • all-or-nothing assigns Amazon EC2 instances to Slurm nodes only if it's possible to assign an Amazon EC2 instance to every requested node

  • best-effort assigns Amazon EC2 instances to Slurm nodes even if the launched Amazon EC2 instance capacity doesn't cover all of the requested nodes

The possible combinations of these strategies translate into the ParallelCluster launch strategies.

The ParallelCluster launch strategies that can be set in the ScalingStrategy cluster configuration parameter and used with job-level scaling are the following:

all-or-nothing scaling:

With this strategy, Amazon ParallelCluster initiates, for each job, an Amazon EC2 launch instance API call that requires all of the instances needed for the requested compute nodes to launch successfully. This ensures that the cluster scales only when the required capacity per job is available, avoiding idle instances left at the end of the scaling process.

The strategy uses all-or-nothing logic for the launch of the Amazon EC2 instances for each job, plus all-or-nothing logic for the assignment of the Amazon EC2 instances to Slurm nodes.

The strategy groups launch requests into batches, one for each compute resource requested and up to 500 nodes each. For requests spanning multiple compute resources or exceeding 500 nodes, ParallelCluster sequentially processes multiple batches.

The failure of any single resource's batch results in the termination of all associated unused capacity, ensuring that no idle instances will be left at the end of the scaling process.

Limitations

  • The time taken for scaling is directly proportional to the number of jobs submitted per execution of the Slurm resume program.

  • The scaling operation is limited by the RunInstances resource account limit, set at 1000 instances by default. This limitation is in accordance with Amazon EC2 API throttling policies. For more details, refer to the Amazon EC2 API throttling documentation.

  • When you submit a job in a compute resource with a single instance type, in a queue that spans multiple Availability Zones, the all-or-nothing EC2 launch API call only succeeds if all of the capacity can be provided in a single Availability Zone.

  • When you submit a job in a compute resource with multiple instance types, in a queue with a single Availability Zone, the all-or-nothing Amazon EC2 launch API call only succeeds if all of the capacity can be provided by a single instance type.

  • When you submit a job in a compute resource with multiple instance types, in a queue spanning multiple Availability Zones, the all-or-nothing Amazon EC2 launch API call isn't supported and ParallelCluster performs best-effort scaling instead.
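
For reference, the following is a minimal, hypothetical sketch of the queue shape described in the last limitation above: a compute resource with multiple instance types in a queue that spans multiple Availability Zones, which causes ParallelCluster to fall back to best-effort scaling. This is only a fragment of the Scheduling/SlurmQueues section of a cluster configuration; the queue name, compute resource name, subnet IDs, and instance types are illustrative only.

SlurmQueues:
  - Name: multi-az-queue              # illustrative queue name
    Networking:
      SubnetIds:                      # subnets in two different Availability Zones
        - subnet-11111111111111111
        - subnet-22222222222222222
    ComputeResources:
      - Name: flexible-cr             # illustrative compute resource name
        Instances:                    # multiple instance types in one compute resource
          - InstanceType: c5.xlarge
          - InstanceType: c5a.xlarge
        MinCount: 0
        MaxCount: 40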

greedy-all-or-nothing scaling:

This variant of the all-or-nothing strategy still ensures that the cluster scales only when the required capacity per job is available, avoiding idle instances at the end of the scaling process. However, it involves ParallelCluster initiating an Amazon EC2 launch instance API call that aims for a minimum target capacity of 1, attempting to maximize the number of nodes launched up to the requested capacity. The strategy uses best-effort logic for the launch of the Amazon EC2 instances for all of the jobs, plus all-or-nothing logic for the assignment of the Amazon EC2 instances to Slurm nodes for each job.

The strategy groups launch requests into batches, one for each compute resource requested and up to 500 nodes each. For requests spanning multiple compute resources or exceeding 500 nodes, ParallelCluster sequentially processes multiple batches.

It ensures that no idle instances are left at the end of the scaling process, maximizing throughput at the cost of temporary over-scaling during the scaling process.

Limitations

  • Temporary over-scaling is possible, leading to additional costs for instances that transition to a running state before scaling completion.

  • The same instance limit as in the all-or-nothing strategy applies, subject to Amazon's RunInstances resource account limit.

best-effort scaling:

This strategy calls the Amazon EC2 launch instance API with a minimum target capacity of 1, aiming to achieve the total requested capacity, at the cost of leaving idle instances after the scaling process if not all of the requested capacity is available. The strategy uses best-effort logic for the launch of the Amazon EC2 instances for all of the jobs, plus best-effort logic for the assignment of the Amazon EC2 instances to Slurm nodes for each job.

The strategy groups launch requests into batches, one for each compute resource requested and up to 500 nodes each. For requests spanning multiple compute resources or exceeding 500 nodes, ParallelCluster sequentially processes multiple batches.

This strategy allows scaling far beyond the default 1000-instance limit over multiple scaling process executions, at the cost of having idle instances across the different scaling processes.

Limitations

  • Idle running instances might remain at the end of the scaling process when it isn't possible to allocate all of the nodes requested by the jobs.
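
For reference, the following is a minimal sketch of where the ScalingStrategy setting might appear in the cluster configuration. The queue name, compute resource name, instance type, and counts are hypothetical and only illustrate the shape of the configuration:

Scheduling:
  Scheduler: slurm
  ScalingStrategy: greedy-all-or-nothing   # one of: all-or-nothing | greedy-all-or-nothing | best-effort
  SlurmQueues:
    - Name: queue1                         # illustrative queue name
      ComputeResources:
        - Name: cr1                        # illustrative compute resource name
          InstanceType: c5.xlarge          # illustrative instance type
          MinCount: 0
          MaxCount: 40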

The following example shows how the scaling of dynamic nodes behaves with the different ParallelCluster launch strategies. Suppose you submit two jobs requesting 20 nodes each, for a total of 40 nodes of the same type, but there are only 30 Amazon EC2 instances available to cover the requested capacity.

all-or-nothing scaling:

  • For the first job, an all-or-nothing Amazon EC2 launch instance API call is made, requesting 20 instances. A successful call results in the launch of 20 instances.

  • The all-or-nothing assignment of the 20 launched instances to Slurm nodes for the first job succeeds.

  • Another all-or-nothing Amazon EC2 launch instance API call is made, requesting 20 instances for the second job. The call isn't successful, because there is capacity for only another 10 instances. No instances are launched at this time.

greedy-all-or-nothing scaling:

  • A best-effort Amazon EC2 launch instance API call is made, requesting 40 instances, which is the total capacity requested by all of the jobs. This results in the launch of 30 instances.

  • An all-or-nothing assignment of 20 of the launched instances to Slurm nodes for the first job succeeds.

  • Another all-or-nothing assignment of the remaining launched instances to Slurm nodes for the second job is attempted, but because only 10 instances are available out of the 20 requested by the job, the assignment isn't successful.

  • The 10 unassigned launched instances are terminated.

best-effort scaling:

  • A best-effort Amazon EC2 launch instance API call is made, requesting 40 instances, which is the total capacity requested by all of the jobs. This results in the launch of 30 instances.

  • A best-effort assignment of 20 of the launched instances to Slurm nodes for the first job is successful.

  • Another best-effort assignment of the remaining 10 launched instances to Slurm nodes for the second job succeeds, even though the total requested capacity was 20. Because the job requested 20 nodes and it was possible to assign Amazon EC2 instances to only 10 of them, the job can't start. The instances are left running idle until either enough capacity is found to launch the missing 10 instances in a later run of the scaling process, or the scheduler places the job on other, already running compute nodes.

In ParallelCluster versions 3.7.x, ParallelCluster uses two types of dynamic node allocation strategies to scale the cluster:

  • Allocation based on available requested node information:
    • All-nodes resume or node-list scaling:

      ParallelCluster scales up the cluster based only on Slurm's requested node list names when Slurm's ResumeProgram runs. It allocates compute resources to nodes only by node name. The list of node names can span multiple jobs.

    • Job-level resume or job-level scaling:

      ParallelCluster scales up the cluster based on the requirements of each job, the current number of nodes that are allocated to the job, and which nodes need to be resumed. ParallelCluster gets this information from the SLURM_RESUME_FILE environment variable.

  • Allocation with an Amazon EC2 launch strategy:
    • Best-effort scaling:

      ParallelCluster scales up the cluster by using an Amazon EC2 launch instance API call with the minimum target capacity equal to 1, to launch some, but not necessarily all, of the instances needed to support the requested nodes.

    • All-or-nothing scaling:

      ParallelCluster scales up the cluster by using an Amazon EC2 launch instance API call that only succeeds if all of the instances needed to support the requested nodes are launched. In this case, it calls the Amazon EC2 launch instance API with the minimum target capacity equal to the total requested capacity.

By default, ParallelCluster uses node-list scaling with a best-effort Amazon EC2 launch strategy to launch some, but not necessarily all, of the instances needed to support the requested nodes. It tries to provision as much capacity as possible to serve the submitted workload.

Starting with ParallelCluster version 3.7.0, ParallelCluster uses job-level scaling with an all-or-nothing EC2 launch strategy for jobs submitted in exclusive mode. When you submit a job in exclusive mode, the job has exclusive access to its allocated nodes. For more information, see EXCLUSIVE in the Slurm documentation.

To submit a job in exclusive mode:

  • Pass the exclusive flag when submitting a Slurm job to the cluster. For example, sbatch ... --exclusive.

    OR

  • Submit a job to a cluster queue that has been configured with JobExclusiveAllocation set to true.
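
The following is a minimal sketch of the second option: a cluster configuration with a queue that has JobExclusiveAllocation enabled. The queue name, compute resource name, instance type, and counts are hypothetical and only illustrate the shape of the configuration:

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: exclusive-queue            # illustrative queue name
      JobExclusiveAllocation: true     # jobs submitted to this queue get exclusive access to their nodes
      ComputeResources:
        - Name: cr1                    # illustrative compute resource name
          InstanceType: c5.xlarge      # illustrative instance type
          MinCount: 0
          MaxCount: 100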

When submitting a job in exclusive mode:

  • ParallelCluster currently batches launch requests to include up to 500 nodes. If a job requests more than 500 nodes, ParallelCluster makes an all-or-nothing launch request for each set of 500 nodes and an additional launch request for the remainder of nodes.

  • If node allocation is in a single compute resource, ParallelCluster makes an all-or-nothing launch request for each set of 500 nodes and an additional launch request for the remainder of nodes. If a launch request fails, ParallelCluster terminates the unused capacity created by all of the launch requests.

  • If node allocation spans multiple compute resources, ParallelCluster needs to make an all-or-nothing launch request for each compute resource. These requests are also batched. If a launch request fails for one of the compute resources, ParallelCluster terminates the unused capacity created by all of the compute resource launch requests.

Known limitations of job-level scaling with the all-or-nothing launch strategy:

  • When you submit a job in a compute resource with a single instance type, in a queue that spans multiple Availability Zones, the all-or-nothing EC2 launch API call only succeeds if all of the capacity can be provided in a single Availability Zone.

  • When you submit a job in a compute resource with multiple instance types, in a queue with a single Availability Zone, the all-or-nothing Amazon EC2 launch API call only succeeds if all of the capacity can be provided by a single instance type.

  • When you submit a job in a compute resource with multiple instance types, in a queue spanning multiple Availability Zones, the all-or-nothing Amazon EC2 launch API call isn't supported and ParallelCluster performs best-effort scaling instead.

In ParallelCluster versions before 3.7.0, Amazon ParallelCluster uses only one type of dynamic node allocation strategy to scale the cluster:

  • Allocation based on available requested node information:

    • All-nodes resume or node-list scaling: ParallelCluster scales up the cluster based only on Slurm's requested node list names when Slurm's ResumeProgram runs. It allocates compute resources to nodes only by node name. The list of node names can span multiple jobs.

  • Allocation with an Amazon EC2 launch strategy:

    • Best-effort scaling: ParallelCluster scales up the cluster by using an Amazon EC2 launch instance API call with the minimum target capacity equal to 1, to launch some, but not necessarily all, of the instances needed to support the requested nodes.

ParallelCluster uses node-list scaling with a best-effort Amazon EC2 launch strategy to launch some, but not necessarily all, of the instances needed to support the requested nodes. It tries to provision as much capacity as possible to serve the submitted workload.

Limitations

  • Idle running instances might remain at the end of the scaling process when it isn't possible to allocate all of the nodes requested by the jobs.