Job lifecycle for MNP jobs - Amazon Batch

Job lifecycle for MNP jobs

When you submit a multi-node parallel (MNP) job, the job enters the SUBMITTED status. The job then waits for any job dependencies to finish, after which it moves to the RUNNABLE status. Finally, Amazon Batch provisions the instance capacity that's required to run your job and launches these instances.
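The status progression above can be observed by polling the job. The following is a minimal sketch, not a definitive implementation: it assumes a boto3 Batch client is passed in as `batch`, and the helper name `wait_for_status` and its polling defaults are hypothetical choices made for illustration.

```python
import time

# Terminal statuses: polling stops here even if the target was never reached.
TERMINAL = {"SUCCEEDED", "FAILED"}


def wait_for_status(batch, job_id, targets, poll_seconds=15, max_polls=240):
    """Poll DescribeJobs until the job reaches a status in `targets`
    or a terminal status, and return that status.

    `batch` is assumed to be a boto3 Batch client, for example
    boto3.client("batch"). The polling interval and limit are
    illustrative defaults, not recommended values.
    """
    for _ in range(max_polls):
        jobs = batch.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise LookupError(f"job {job_id} not found")
        status = jobs[0]["status"]
        if status in targets or status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not reach {sorted(targets)}")
```

For example, `wait_for_status(batch, job_id, {"RUNNABLE"})` would return once the job's dependencies are satisfied, or earlier if the job fails.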

Each multi-node parallel job contains a main node. The main node is a single subtask that Amazon Batch monitors to determine the outcome of the submitted multi-node job. The main node is launched first and moves to the STARTING status. The timeout value specified in the attemptDurationSeconds parameter applies to the whole job, not to individual nodes.
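As an illustration of where the job-level timeout is set, the sketch below builds the arguments for a SubmitJob call. The helper name `build_submit_args` and the example job, queue, and definition names are hypothetical; the `timeout` structure with `attemptDurationSeconds` follows the SubmitJob request shape, and the 60-second minimum is the documented lower bound for Batch timeouts.

```python
def build_submit_args(job_name, job_queue, job_definition, timeout_seconds):
    """Return keyword arguments for a boto3 Batch submit_job call.

    The timeout is a single job-level value: it covers the whole
    multi-node job attempt rather than each node separately.
    """
    if timeout_seconds < 60:
        raise ValueError("timeout must be at least 60 seconds")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


# Hypothetical names, for illustration only.
args = build_submit_args("mnp-demo", "my-queue", "my-mnp-job-def", 3600)
# With a boto3 client: batch.submit_job(**args)
```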

When the main node reaches the RUNNING status (that is, after the node's container is running), the child nodes are launched and also move to the STARTING status. The child nodes come up in random order, with no guarantees on the timing or ordering of their launch. To ensure that all the nodes of the job are in the RUNNING status, your application code can query the Amazon Batch API to get the main node and child node information. Alternatively, the application code can wait until all nodes are online before starting any distributed processing task. The private IP address of the main node is available as the AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS environment variable in each child node. Your application code can use this information to coordinate and communicate data between tasks.
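A node can work out its own role and check whether its peers are up along the following lines. This is a sketch under stated assumptions: besides the variable named above, it reads AWS_BATCH_JOB_NODE_INDEX, AWS_BATCH_JOB_MAIN_NODE_INDEX, and AWS_BATCH_JOB_NUM_NODES, which Batch sets for multi-node jobs but which this page does not describe. It also assumes that individual nodes can be described with IDs of the form `jobId#nodeIndex`. The function names are hypothetical.

```python
import os


def node_role(env=os.environ):
    """Summarize this node's position in the multi-node job."""
    node_index = int(env["AWS_BATCH_JOB_NODE_INDEX"])
    main_index = int(env["AWS_BATCH_JOB_MAIN_NODE_INDEX"])
    is_main = node_index == main_index
    return {
        "node_index": node_index,
        "num_nodes": int(env["AWS_BATCH_JOB_NUM_NODES"]),
        "is_main": is_main,
        # The main node's address is provided to child nodes only.
        "main_ip": None if is_main
        else env.get("AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS"),
    }


def all_nodes_running(batch, job_id, num_nodes):
    """Return True when every node of the job reports RUNNING.

    `batch` is assumed to be a boto3 Batch client and `job_id` the
    parent job ID without a node suffix. DescribeJobs accepts at most
    100 IDs per call, so larger jobs would need the list chunked.
    """
    node_ids = [f"{job_id}#{i}" for i in range(num_nodes)]
    jobs = batch.describe_jobs(jobs=node_ids)["jobs"]
    return len(jobs) == num_nodes and all(
        job["status"] == "RUNNING" for job in jobs
    )
```

A child node might call `node_role()` at startup and connect to `main_ip`, while the main node loops on `all_nodes_running` before starting the distributed task.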

As individual nodes exit, they move to SUCCEEDED or FAILED, depending on their exit code. If the main node exits, the job is considered finished, and all of the child nodes are stopped. If a child node fails, Amazon Batch doesn't take any action on the other nodes in the job. If you don't want your job to continue with a reduced number of nodes, you must handle this in your application code, for example by terminating or canceling the job when a node failure is detected.
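One way to avoid running with a reduced number of nodes is to have the main node check its peers periodically and stop the job when any of them has failed. The sketch below is one possible approach, not a prescribed one: it assumes a boto3 Batch client passed in as `batch`, node IDs of the form `jobId#nodeIndex`, and a hypothetical helper name.

```python
def stop_job_if_node_failed(batch, job_id, num_nodes):
    """Terminate the whole job if any node reports FAILED.

    Returns True if a termination request was sent, False otherwise.
    `job_id` is the parent job ID without a node suffix.
    """
    node_ids = [f"{job_id}#{i}" for i in range(num_nodes)]
    jobs = batch.describe_jobs(jobs=node_ids)["jobs"]
    failed = [job["jobId"] for job in jobs if job["status"] == "FAILED"]
    if not failed:
        return False
    # TerminateJob requires a reason, which is recorded with the job.
    batch.terminate_job(
        jobId=job_id,
        reason=f"Node(s) failed: {', '.join(failed)}",
    )
    return True
```

Exiting the main node's process with a non-zero code is a simpler alternative when the check runs on the main node itself, because the job finishes and the child nodes are stopped once the main node exits.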