Slurm guide for multiple queue mode
Here you can learn how Amazon ParallelCluster and Slurm manage queue (partition) nodes and how you can monitor the queue and node states.
Overview
The scaling architecture is based on Slurm’s Cloud Scheduling Guide.
Cloud node lifecycle
Throughout their lifecycle, cloud nodes enter several if not all of the
following states: POWER_SAVING, POWER_UP
(pow_up), ALLOCATED (alloc), and
POWER_DOWN (pow_dn). In some cases, a cloud node
might enter the OFFLINE state. The following list details several
aspects of these states in the cloud node lifecycle.
- A node in a POWER_SAVING state appears with a ~ suffix (for example idle~) in sinfo. In this state, no EC2 instances are backing the node. However, Slurm can still allocate jobs to the node.
- A node transitioning to a POWER_UP state appears with a # suffix (for example idle#) in sinfo. A node automatically transitions to a POWER_UP state when Slurm allocates a job to a node in a POWER_SAVING state. Alternatively, you can transition the nodes to the POWER_UP state manually as an su root user with the command:
  $ scontrol update nodename=nodename state=power_up
  In this stage, the ResumeProgram is invoked, EC2 instances are launched and configured, and the node transitions to the POWER_UP state.
- A node that is currently available for use appears without a suffix (for example idle) in sinfo. After the node is set up and has joined the cluster, it becomes available to run jobs. In this stage, the node is properly configured and ready for use. As a general rule, we recommend that the number of Amazon EC2 instances be the same as the number of available nodes. In most cases, static nodes are available after the cluster is created.
- A node that is transitioning to a POWER_DOWN state appears with a % suffix (for example idle%) in sinfo. Dynamic nodes automatically enter the POWER_DOWN state after ScaledownIdletime. In contrast, static nodes in most cases aren't powered down. However, you can place the nodes in the POWER_DOWN state manually as an su root user with the command:
  $ scontrol update nodename=nodename state=down reason="manual draining"
  In this state, the instances associated with a node are terminated, and the node is set back to the POWER_SAVING state and available for use after ScaledownIdletime. The ScaledownIdletime setting is saved to the Slurm configuration SuspendTimeout setting.
- A node that is offline appears with a * suffix (for example down*) in sinfo. A node goes offline if the Slurm controller can't contact the node or if the static nodes are disabled and the backing instances are terminated.
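To watch nodes move between these states in real time, you can periodically refresh sinfo. The following is a minimal sketch that assumes the standard Linux watch utility is available on the head node:
$ watch -n 30 sinfo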
Consider the node states shown in the following sinfo
example.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
efa          up   infinite      4  idle~ efa-dy-efacompute1-[1-4]
efa          up   infinite      1   idle efa-st-efacompute1-1
gpu          up   infinite      1  idle% gpu-dy-gpucompute1-1
gpu          up   infinite      9  idle~ gpu-dy-gpucompute1-[2-10]
ondemand     up   infinite      2   mix# ondemand-dy-ondemandcompute1-[1-2]
ondemand     up   infinite     18  idle~ ondemand-dy-ondemandcompute1-[3-10],ondemand-dy-ondemandcompute2-[1-10]
spot*        up   infinite     13  idle~ spot-dy-spotcompute1-[1-10],spot-dy-spotcompute2-[1-3]
spot*        up   infinite      2   idle spot-st-spotcompute2-[1-2]
The spot-st-spotcompute2-[1-2] and
efa-st-efacompute1-1 nodes already have backing instances set
up and are available for use. The
ondemand-dy-ondemandcompute1-[1-2] nodes are in the
POWER_UP state and should be available within a few minutes.
The gpu-dy-gpucompute1-1 node is in the POWER_DOWN
state, and it transitions into the POWER_SAVING state after ScaledownIdletime (defaults to 10 minutes).
All of the other nodes are in the POWER_SAVING state with no EC2
instances backing them.
Working with an available node
An available node is backed by an Amazon EC2 instance. By default, the node name can
be used to directly SSH into the instance (for example ssh
efa-st-efacompute1-1). The private IP address of the instance can be
retrieved using the command:
$ scontrol show nodes nodename
Check for the IP address in the returned NodeAddr field.
For nodes that aren't available, the NodeAddr field shouldn't
point to a running Amazon EC2 instance. Rather, it should be the same as the node
name.
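For example, the following sketch pulls only the NodeAddr value for the static node shown in the earlier output; the node name is illustrative and grep simply filters the scontrol output:
$ scontrol show nodes efa-st-efacompute1-1 | grep -o 'NodeAddr=[^ ]*'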
Job states and submission
In most cases, submitted jobs are immediately allocated to nodes in the system, or placed in the pending state if all of the nodes are allocated.
If nodes allocated for a job include any nodes in a POWER_SAVING
state, the job starts out with a CF, or CONFIGURING
state. At this time, the job waits for the nodes in the
POWER_SAVING state to transition to the POWER_UP
state and become available.
After all nodes allocated for a job are available, the job enters the
RUNNING (R) state.
By default, all jobs are submitted to the default queue (known as a partition
in Slurm). This is signified by a * suffix after the queue name.
You can select a queue using the -p job submission option.
All nodes are configured with the following features, which can be used in job submission commands:
- An instance type (for example c5.xlarge)
- A node type (This is either dynamic or static.)
You can see the features for a particular node by using the command:
$ scontrol show nodes nodename
In the returned output, check the AvailableFeatures list.
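You can also list every node together with its available features in one pass. The following is a minimal sketch using standard sinfo format options (%N for the node name, %f for the available features):
$ sinfo -N -o "%N %f"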
Consider the initial state of the cluster, which you can view by running the
sinfo command.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
efa          up   infinite      4  idle~ efa-dy-efacompute1-[1-4]
efa          up   infinite      1   idle efa-st-efacompute1-1
gpu          up   infinite     10  idle~ gpu-dy-gpucompute1-[1-10]
ondemand     up   infinite     20  idle~ ondemand-dy-ondemandcompute1-[1-10],ondemand-dy-ondemandcompute2-[1-10]
spot*        up   infinite     13  idle~ spot-dy-spotcompute1-[1-10],spot-dy-spotcompute2-[1-3]
spot*        up   infinite      2   idle spot-st-spotcompute2-[1-2]
Note that spot is the default queue. It is indicated by the
* suffix.
Submit a job to one static node in the default queue
(spot).
$ sbatch --wrap "sleep 300" -N 1 -C static
Submit a job to one dynamic node in the EFA queue.
$ sbatch --wrap "sleep 300" -p efa -C dynamic
Submit a job to eight (8) c5.2xlarge nodes and two (2)
t2.xlarge nodes in the ondemand queue.
$ sbatch --wrap "sleep 300" -p ondemand -N 10 -C "[c5.2xlarge*8&t2.xlarge*2]"
Submit a job to one GPU node in the gpu queue.
$ sbatch --wrap "sleep 300" -p gpu -G 1
Consider the state of the jobs using the squeue command.
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   12  ondemand     wrap   ubuntu CF       0:36     10 ondemand-dy-ondemandcompute1-[1-8],ondemand-dy-ondemandcompute2-[1-2]
   13       gpu     wrap   ubuntu CF       0:05      1 gpu-dy-gpucompute1-1
    7      spot     wrap   ubuntu  R       2:48      1 spot-st-spotcompute2-1
    8       efa     wrap   ubuntu  R       0:39      1 efa-dy-efacompute1-1
Jobs 7 and 8 (in the spot and efa queues) are
already running (R). Jobs 12 and 13 are still configuring
(CF), probably waiting for instances to become
available.
# Node states correspond to the states of running jobs
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
efa          up   infinite      3  idle~ efa-dy-efacompute1-[2-4]
efa          up   infinite      1    mix efa-dy-efacompute1-1
efa          up   infinite      1   idle efa-st-efacompute1-1
gpu          up   infinite      1   mix~ gpu-dy-gpucompute1-1
gpu          up   infinite      9  idle~ gpu-dy-gpucompute1-[2-10]
ondemand     up   infinite     10   mix# ondemand-dy-ondemandcompute1-[1-8],ondemand-dy-ondemandcompute2-[1-2]
ondemand     up   infinite     10  idle~ ondemand-dy-ondemandcompute1-[9-10],ondemand-dy-ondemandcompute2-[3-10]
spot*        up   infinite     13  idle~ spot-dy-spotcompute1-[1-10],spot-dy-spotcompute2-[1-3]
spot*        up   infinite      1    mix spot-st-spotcompute2-1
spot*        up   infinite      1   idle spot-st-spotcompute2-2
Node state and features
In most cases, node states are fully managed by Amazon ParallelCluster according to the specific processes in the cloud node lifecycle described earlier in this topic.
However, Amazon ParallelCluster also replaces or terminates unhealthy nodes in
DOWN and DRAINED states and nodes that have
unhealthy backing instances. For more information, see clustermgtd.
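To see which nodes Slurm currently marks as down or drained, along with the recorded reason, you can use the standard sinfo reason listing (a minimal sketch):
$ sinfo -R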
Partition states
Amazon ParallelCluster supports the following partition states. A Slurm partition is a queue in Amazon ParallelCluster.
- UP: Indicates that the partition is in an active state. This is the default state of a partition. In this state, all nodes in the partition are active and available for use.
- INACTIVE: Indicates that the partition is in the inactive state. In this state, all instances backing nodes of an inactive partition are terminated. New instances aren't launched for nodes in an inactive partition.
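As a quick check of partition states, the following sketch uses standard sinfo format options to print each partition with its availability (up or inact):
$ sinfo -o "%P %a"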
pcluster update-compute-fleet
- Stopping the compute fleet - When the following command is executed, all partitions transition to the INACTIVE state, and Amazon ParallelCluster processes keep the partitions in the INACTIVE state.
  $ pcluster update-compute-fleet --cluster-name testSlurm \
      --region eu-west-1 --status STOP_REQUESTED
- Starting the compute fleet - When the following command is executed, all partitions initially transition to the UP state. However, Amazon ParallelCluster processes don't keep the partitions in an UP state. You need to change partition states manually. All static nodes become available after a few minutes. Note that setting a partition to UP doesn't power up any dynamic capacity.
  $ pcluster update-compute-fleet --cluster-name testSlurm \
      --region eu-west-1 --status START_REQUESTED
When update-compute-fleet is run, you can check the state of the
cluster by running the pcluster describe-compute-fleet command and
checking the Status. The following lists possible states:
- STOP_REQUESTED: The stop compute fleet request is sent to the cluster.
- STOPPING: The pcluster process is currently stopping the compute fleet.
- STOPPED: The pcluster process finished the stopping process, all partitions are in the INACTIVE state, and all compute instances are terminated.
- START_REQUESTED: The start compute fleet request is sent to the cluster.
- STARTING: The pcluster process is currently starting the cluster.
- RUNNING: The pcluster process finished the starting process, all partitions are in the UP state, and static nodes are available after a few minutes.
- PROTECTED: This status indicates that some partitions have consistent bootstrap failures. Affected partitions are inactive. Investigate the issue and then run update-compute-fleet to re-enable the fleet.
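For example, to check the fleet status for the cluster used in the earlier commands (the cluster name and Region are illustrative, and the exact JSON field names in the response can vary by Amazon ParallelCluster CLI version):
$ pcluster describe-compute-fleet --cluster-name testSlurm --region eu-west-1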
Manual control of queues
In some cases, you might want to have some manual control over the nodes or
queue (known as a partition in Slurm) in a cluster. You can manage nodes in a
cluster through the following common procedures using the scontrol
command.
- Power up dynamic nodes in the POWER_SAVING state
  Run the command as an su root user:
  $ scontrol update nodename=nodename state=power_up
  You can also submit a placeholder sleep 1 job requesting a certain number of nodes and then rely on Slurm to power up the required number of nodes (see the example after this list).
- Power down dynamic nodes before ScaledownIdletime
  We recommend that you set dynamic nodes to DOWN as an su root user with the command:
  $ scontrol update nodename=nodename state=down reason="manually draining"
  Amazon ParallelCluster automatically terminates and resets the downed dynamic nodes.
  In general, we don't recommend that you set nodes to POWER_DOWN directly using the scontrol update nodename=nodename state=power_down command. This is because Amazon ParallelCluster automatically handles the power down process.
- Disable a queue (partition) or stop all static nodes in a specific partition
  Set a specific queue to INACTIVE as an su root user with the command:
  $ scontrol update partition=queuename state=inactive
  Doing this terminates all instances backing nodes in the partition.
- Enable a queue (partition)
  Set a specific queue to UP as an su root user with the command:
  $ scontrol update partition=queuename state=up
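The following is a minimal sketch of the placeholder-job approach mentioned in the first item: it asks Slurm to power up two dynamic nodes in the spot queue from the earlier examples by submitting a trivial job:
$ sbatch --wrap "sleep 1" -p spot -N 2 -C dynamic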
Scaling behavior and adjustments
Here is an example of the normal scaling workflow:
- The scheduler receives a job that requires two nodes.
- The scheduler transitions two nodes to a POWER_UP state and calls ResumeProgram with the node names (for example queue1-dy-spotcompute1-[1-2]).
- ResumeProgram launches two Amazon EC2 instances and assigns the private IP addresses and hostnames of queue1-dy-spotcompute1-[1-2], waiting for ResumeTimeout (the default period is 30 minutes) before resetting the nodes.
- Instances are configured and join the cluster. A job starts running on the instances.
- The job completes and stops running.
- After the configured SuspendTime has elapsed (which is set to ScaledownIdletime), the scheduler sets the instances to the POWER_SAVING state. The scheduler then sets queue1-dy-spotcompute1-[1-2] to the POWER_DOWN state and calls SuspendProgram with the node names.
- SuspendProgram is called for the two nodes. The nodes remain in the POWER_DOWN state, for example, by remaining idle% for SuspendTimeout (the default period is 120 seconds (2 minutes)). After clustermgtd detects that the nodes are powering down, it terminates the backing instances. Then, it transitions queue1-dy-spotcompute1-[1-2] to the idle state and resets the private IP addresses and hostnames so the nodes are ready to power up for future jobs.
If things go wrong and an instance for a particular node can't be launched for some reason, then the following happens:
- The scheduler receives a job that requires two nodes.
- The scheduler transitions two cloud bursting nodes to the POWER_UP state and calls ResumeProgram with the node names (for example queue1-dy-spotcompute1-[1-2]).
- ResumeProgram launches only one (1) Amazon EC2 instance and configures queue1-dy-spotcompute1-1, with one (1) instance, queue1-dy-spotcompute1-2, failing to launch.
- queue1-dy-spotcompute1-1 isn't impacted and comes online after reaching the POWER_UP state.
- queue1-dy-spotcompute1-2 transitions to the POWER_DOWN state, and the job is requeued automatically because Slurm detects a node failure.
- queue1-dy-spotcompute1-2 becomes available after SuspendTimeout (the default is 120 seconds (2 minutes)). In the meantime, the job is requeued and can start running on another node.
- The above process repeats until the job can run on an available node without a failure occurring.
There are two timing parameters that can be adjusted if needed:
- ResumeTimeout (the default is 30 minutes): ResumeTimeout controls the time Slurm waits before transitioning the node to the down state.
  - It might be useful to extend ResumeTimeout if your pre/post installation process takes nearly that long.
  - ResumeTimeout is also the maximum time that Amazon ParallelCluster waits before replacing or resetting a node if there is an issue. Compute nodes self-terminate if any error occurs during launch or setup. Amazon ParallelCluster processes replace a node upon detection of a terminated instance.
- SuspendTimeout (the default is 120 seconds (2 minutes)): SuspendTimeout controls how quickly nodes get placed back into the system and are ready for use again.
  - A shorter SuspendTimeout means that nodes are reset more quickly, and Slurm can try to launch instances more frequently.
  - A longer SuspendTimeout means that failed nodes are reset more slowly. In the meantime, Slurm tries to use other nodes. If SuspendTimeout is more than a few minutes, Slurm tries to cycle through all nodes in the system. A longer SuspendTimeout might be beneficial for large-scale systems (over 1,000 nodes) to reduce stress on Slurm when it tries to frequently re-queue failing jobs.
  - Note that SuspendTimeout doesn't refer to the time Amazon ParallelCluster waits to terminate a backing instance for a node. Backing instances for POWER_DOWN nodes are immediately terminated. The terminate process usually finishes in a few minutes. However, during this time, the node remains in the POWER_DOWN state and isn't available for the scheduler's use.
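To confirm the values currently in effect on your cluster, you can query the Slurm configuration from the head node. This is a minimal sketch; the grep pattern simply filters the relevant settings:
$ scontrol show config | grep -E "ResumeTimeout|SuspendTimeout|SuspendTime"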
Logs for the architecture
The following list contains the key logs. The log stream name used with Amazon CloudWatch Logs has the format {hostname}.{instance_id}.{logIdentifier}, where logIdentifier follows the log names.
- ResumeProgram: /var/log/parallelcluster/slurm_resume.log (slurm_resume)
- SuspendProgram: /var/log/parallelcluster/slurm_suspend.log (slurm_suspend)
- clustermgtd: /var/log/parallelcluster/clustermgtd.log (clustermgtd)
- computemgtd: /var/log/parallelcluster/computemgtd.log (computemgtd)
- slurmctld: /var/log/slurmctld.log (slurmctld)
- slurmd: /var/log/slurmd.log (slurmd)
Common issues and how to debug:
Nodes that failed to launch, power up, or join the cluster
- Dynamic nodes:
  - Check the ResumeProgram log to see if ResumeProgram was called with the node. If not, check the slurmctld log to determine if Slurm tried to call ResumeProgram with the node. Note that incorrect permissions on ResumeProgram might cause it to fail silently.
  - If ResumeProgram is called, check to see if an instance was launched for the node. If the instance didn't launch, there should be a clear error message as to why the instance failed to launch.
  - If an instance was launched, there may have been some problem during the bootstrap process. Find the corresponding private IP address and instance ID from the ResumeProgram log and look at the corresponding bootstrap logs for the specific instance in CloudWatch Logs.
- Static nodes:
  - Check the clustermgtd log to see if instances were launched for the node. If instances didn't launch, there should be clear errors on why the instances failed to launch.
  - If an instance was launched, there is some problem with the bootstrap process. Find the corresponding private IP address and instance ID from the clustermgtd log and look at the corresponding bootstrap logs for the specific instance in CloudWatch Logs.
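For example, to check whether ResumeProgram was called for a particular dynamic node, you can search its log for the node name (the node name here is illustrative):
$ grep "queue1-dy-spotcompute1-2" /var/log/parallelcluster/slurm_resume.log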
Nodes replaced or terminated unexpectedly, and node failures
- Nodes replaced/terminated unexpectedly:
  - In most cases, clustermgtd handles all node maintenance actions. To check if clustermgtd replaced or terminated a node, check the clustermgtd log.
  - If clustermgtd replaced or terminated the node, there should be a message indicating the reason for the action. If the reason is scheduler related (for example, the node was DOWN), check in the slurmctld log for more details. If the reason is Amazon EC2 related, use tools such as Amazon CloudWatch or the Amazon EC2 console, CLI, or SDKs to check the status or logs for that instance. For example, you can check if the instance had scheduled events or failed Amazon EC2 health status checks.
  - If clustermgtd didn't terminate the node, check if computemgtd terminated the node or if EC2 terminated the instance to reclaim a Spot Instance.
- Node failures:
  - In most cases, jobs are automatically requeued if a node fails. Look in the slurmctld log to see why a job or a node failed and assess the situation from there.
Failure when replacing or terminating instances, failure when powering down nodes
- In general, clustermgtd handles all expected instance termination actions. Look in the clustermgtd log to see why it failed to replace or terminate a node.
- For dynamic nodes failing ScaledownIdletime, look in the SuspendProgram log to see if slurmctld processes made calls with the specific node as an argument. Note that SuspendProgram doesn't actually perform any specific action. Rather, it only logs when it's called. All instance termination and NodeAddr resets are completed by clustermgtd. Slurm transitions nodes to IDLE after SuspendTimeout.
Other issues:
- Amazon ParallelCluster doesn't make job allocation or scaling decisions. It only tries to launch, terminate, and maintain resources according to Slurm’s instructions. For issues regarding job allocation, node allocation, and scaling decisions, look at the slurmctld log for errors.
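For example, a quick way to scan the scheduler log for recent errors (a minimal sketch; adjust the pattern as needed):
$ grep -i error /var/log/slurmctld.log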