Working with Spot Instances - Amazon ParallelCluster

Working with Spot Instances

Amazon ParallelCluster uses Spot Instances if cluster_type = spot is set in the cluster configuration. Spot Instances are more cost effective than On-Demand Instances, but they might be interrupted. The effect of an interruption varies depending on the scheduler that's used. You can take advantage of Spot Instance interruption notices, which provide a two-minute warning before Amazon EC2 stops or terminates your Spot Instance. For more information, see Spot Instance interruptions in the Amazon EC2 User Guide. The following sections describe three scenarios in which Spot Instances can be interrupted.
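For example, you can check from within a running compute instance whether an interruption notice has been issued by querying the instance metadata service. The following is a minimal sketch using IMDSv2; the endpoint returns an HTTP 404 response until a notice is posted, and a small JSON document with the action and time after one is issued.

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/spot/instance-action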

Note

Using Spot Instances requires that the AWSServiceRoleForEC2Spot service-linked role exist in your account. To create this role in your account using the Amazon CLI, run the following command:

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
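To verify that the role already exists in your account, you can run the following command. It returns the role details if the role is present, and an error otherwise.

aws iam get-role --role-name AWSServiceRoleForEC2Spot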

For more information, see Service-linked role for Spot Instance requests in the Amazon EC2 User Guide.

Scenario 1: Spot Instance with no running jobs is interrupted

When this interruption occurs, Amazon ParallelCluster tries to replace the instance if the scheduler queue has pending jobs that require additional instances, or if the number of active instances is lower than the initial_queue_size setting. If Amazon ParallelCluster can't provision new instances, the request for new instances is repeated periodically.
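The cluster_type and initial_queue_size settings are defined in the [cluster] section of the cluster configuration file. The following excerpt is a minimal sketch assuming a Slurm-based cluster; the values shown are only examples.

[cluster default]
scheduler = slurm
cluster_type = spot
initial_queue_size = 2
max_queue_size = 10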

Scenario 2: Spot Instance running single node jobs is interrupted

The behavior of this interruption depends on the scheduler being used.

Slurm

The job fails with a state code of NODE_FAIL, and the job is requeued (unless --no-requeue is specified when the job is submitted). If the node is a static node, it's replaced. If the node is a dynamic node, the node is terminated and reset. For more information about sbatch, including the --no-requeue parameter, see sbatch in the Slurm documentation.

Note

This behavior changed in Amazon ParallelCluster version 2.9.0. Earlier versions terminated the job with a state code of NODE_FAIL and the node was removed from the scheduler queue.
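For example, to opt a job out of automatic requeueing at submission time, and to check the state of a job after an interruption, you can use commands such as the following. The job script name and job ID are placeholders.

sbatch --no-requeue job.sh
sacct -j 1234 --format=JobID,JobName,State,ExitCode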

SGE
Note

This only applies to Amazon ParallelCluster versions up to and including version 2.11.4. Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

The job is terminated. If the rerun flag is enabled for the job (using either qsub -r yes or qalter -r yes), or if the queue has the rerun configuration set to TRUE, the job is rescheduled. The compute instance is removed from the scheduler queue. This behavior comes from the following SGE configuration parameters (an example of setting the rerun flag follows the list):

  • reschedule_unknown 00:00:30

  • ENABLE_FORCED_QDEL_IF_UNKNOWN

  • ENABLE_RESCHEDULE_KILL=1
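For example, the rerun flag can be enabled when a job is submitted, or added to a job that's already queued. The job script name and job ID are placeholders.

qsub -r yes job.sh
qalter -r yes 1234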

Torque
Note

This only applies to Amazon ParallelCluster versions up to and including version 2.11.4. Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

The job is removed from the system and the node is removed from the scheduler. The job isn't rerun. If multiple jobs are running on the instance when it's interrupted, Torque might time out during node removal. An error might appear in the sqswatcher log file. This doesn't affect the scaling logic, and proper cleanup is performed by subsequent retries.
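To inspect such an error, you can check the sqswatcher log on the head node. The following command assumes the log location /var/log/sqswatcher, which is typical for the ParallelCluster versions that support these schedulers but might vary by version.

tail -n 50 /var/log/sqswatcher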

Scenario 3: Spot Instance running multi-node jobs is interrupted

The behavior of this interruption depends on the scheduler being used.

Slurm

The job fails with a state code of NODE_FAIL, and the job is requeued (unless --no-requeue was specified when the job was submitted). If the node is a static node, it's replaced. If the node is a dynamic node, the node is terminated and reset. Other nodes that were running the terminated job might be allocated to other pending jobs, or scaled down after the configured scaledown_idletime has passed.

Note

This behavior changed in Amazon ParallelCluster version 2.9.0. Earlier versions terminated the job with a state code of NODE_FAIL and removed the node from the scheduler queue. Other nodes that were running the terminated job might be scaled down after the configured scaledown_idletime has passed.
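The scaledown_idletime value is configured (in minutes) in a [scaling] section that the [cluster] section references. The following excerpt is a minimal sketch; the section name and value are only examples.

[cluster default]
scaling_settings = custom

[scaling custom]
scaledown_idletime = 10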

SGE
Note

This only applies to Amazon ParallelCluster versions up to and including version 2.11.4. Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

The job isn't terminated and continues to run on the remaining nodes. The compute node is removed from the scheduler queue, but it appears in the hosts list as an orphaned and unavailable node.

You must delete the job when this occurs (qdel <jobid>). The node still appears in the hosts list (qhost), although this doesn't affect Amazon ParallelCluster. To remove the host from the list, run the following command after replacing the instance.

sudo -- bash -c 'source /etc/profile.d/sge.sh; qconf -dattr hostgroup hostlist <hostname> @allhosts; qconf -de <hostname>'
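For example, assuming the interrupted job had ID 1234 and the orphaned host was named ip-10-0-0-123 (both placeholders), the cleanup sequence might look like the following. Running qhost afterward confirms that the host no longer appears in the list.

qdel 1234
sudo -- bash -c 'source /etc/profile.d/sge.sh; qconf -dattr hostgroup hostlist ip-10-0-0-123 @allhosts; qconf -de ip-10-0-0-123'
qhost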

Torque
Note

This only applies to Amazon ParallelCluster versions up to and including version 2.11.4. Starting with version 2.11.5, Amazon ParallelCluster doesn't support the use of SGE or Torque schedulers.

The job is removed from the system and the node is removed from the scheduler. The job isn't rerun. If multiple jobs are running on the instance when it's interrupted, Torque might time out during node removal. An error might appear in the sqswatcher log file. This doesn't affect the scaling logic, and proper cleanup is performed by subsequent retries.

For more information about Spot Instances, see Spot Instances in the Amazon EC2 User Guide.