Multi-node parallel jobs on Amazon EKS

You can use Amazon Batch on Amazon Elastic Kubernetes Service (Amazon EKS) to run multi-node parallel (MNP) jobs on your managed Kubernetes clusters. This option is commonly used for large, tightly coupled, high-performance jobs that can’t be run on a single Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information, see Multi-node parallel jobs.

You can use this feature to run Kubernetes-specific high-performance computing applications on Amazon EKS, large language model training, and other artificial intelligence (AI) and machine learning (ML) jobs.

Running MNP jobs

Amazon Batch supports MNP jobs on Amazon Elastic Container Service and Amazon EKS using Amazon EC2. The following sections describe the instance and container quotas for the feature.

Instance quotas for MNP on Amazon EKS

  • Up to 1000 instances can be used for a single MNP job.

  • Up to 5000 instances can join a single Amazon EKS cluster.

  • Up to 5 compute environments can be clustered and attached to a job queue.

For example, you can scale up to 5 clustered compute environments in a job queue and 1000 instances in each compute environment.

In addition to these instance quotas, note that you can’t use Fargate for MNP jobs through either service.

You can use only one instance type in each MNP job. You can change the instance type by updating the compute environment or by defining a new compute environment. You can also specify the instance type and provide vCPU and memory requirements when you create the job definition.

Container quotas for MNP on Amazon EKS

  • A multi-node parallel job supports one pod per node.

  • Up to 10 containers (or 10 init containers) in each pod. For more information, see Init Containers in the Kubernetes documentation.

  • Up to 5 node ranges in each MNP job.

  • Up to 10 distinct container images in each node range.

For example, you can run up to a maximum of 10,000 containers in a single MNP job that contains 5 node ranges and a total of 50 unique images.
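The figures in that example follow from multiplying the quotas listed above. The sketch below shows one plausible reading of the arithmetic; the constant and function names are illustrative only and aren't part of any Amazon Batch API.

```python
# Quotas quoted in this section for MNP jobs on Amazon EKS.
# The names below are hypothetical and exist only for this illustration.
MAX_NODES_PER_JOB = 1000        # instances in a single MNP job
PODS_PER_NODE = 1               # one pod per node
MAX_CONTAINERS_PER_POD = 10
MAX_NODE_RANGES_PER_JOB = 5
MAX_IMAGES_PER_NODE_RANGE = 10


def max_containers_per_job() -> int:
    """Upper bound on containers across every node of one MNP job."""
    return MAX_NODES_PER_JOB * PODS_PER_NODE * MAX_CONTAINERS_PER_POD


def max_unique_images_per_job() -> int:
    """Upper bound on distinct images across all node ranges of one MNP job."""
    return MAX_NODE_RANGES_PER_JOB * MAX_IMAGES_PER_NODE_RANGE


print(max_containers_per_job())     # 1000 nodes x 1 pod x 10 containers
print(max_unique_images_per_job())  # 5 node ranges x 10 images
```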

Running MNP jobs in a private Amazon VPC and an Amazon EKS cluster

MNP jobs can run on any Amazon EKS cluster, whether or not it has public internet access. When you use an Amazon EKS cluster with only private network access, make sure that Amazon Batch can access the Amazon EKS control plane and the managed Kubernetes API server. You can grant the necessary access through Amazon Virtual Private Cloud endpoints. For more information, see Configure an endpoint service.

Amazon EKS cluster pods can’t download an image from a public source because the private VPC doesn’t have internet access. Your Amazon EKS cluster must pull images from a container registry that's within your Amazon VPC. You can create an Amazon Elastic Container Registry (Amazon ECR) repository in your Amazon VPC and copy container images to it for your nodes to access.

You can also create a pull through cache rule with Amazon ECR. Once a pull through cache rule is created for an external public registry, you can simply pull an image from that external public registry using your Amazon ECR private registry URI. Then Amazon ECR creates a repository and caches the image. When a cached image is pulled using the Amazon ECR private registry URI, Amazon ECR checks the remote registry to see if there is a new version of the image and will update your private registry up to one time every 24 hours. For more information, see Creating a pull through cache rule in Amazon ECR.
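As a rough illustration of pulling through the private registry URI, the following sketch rewrites an upstream image reference into the form a pull through cache rule would typically serve. The function name, the `ecr-public` repository prefix, the placeholder account ID, and the default DNS suffix are all assumptions; use the prefix you chose when you created the rule and the registry domain for your Region.

```python
def to_pull_through_uri(upstream_image: str, account_id: str, region: str,
                        cache_prefix: str = "ecr-public",
                        dns_suffix: str = "amazonaws.com") -> str:
    """Rewrite an upstream image reference to use an ECR private registry URI.

    Assumes the registry form
    <account>.dkr.ecr.<region>.<dns_suffix>/<cache_prefix>/<upstream path>.
    The prefix and DNS suffix are assumptions; they depend on your rule and Region.
    """
    upstream_registry, sep, path = upstream_image.partition("/")
    if not sep or not path:
        raise ValueError(
            f"expected '<registry>/<repository>[:tag]', got {upstream_image!r}"
        )
    private_registry = f"{account_id}.dkr.ecr.{region}.{dns_suffix}"
    return f"{private_registry}/{cache_prefix}/{path}"


# Placeholder account ID and Region, for illustration only.
print(to_pull_through_uri(
    "public.ecr.aws/amazonlinux/amazonlinux:2", "111122223333", "us-east-1"
))
```

The resulting URI is what you would put in the `image` field of a container in the job definition so that nodes in the private VPC never contact the public registry directly.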

For more information about this topic, see Getting started with Amazon Batch on Amazon EKS Private Clusters.

Error notification

If your MNP jobs are blocked, you can receive notifications through the Amazon Web Services Management Console and Amazon EventBridge. For example, if an MNP job is stuck at the head of the queue, you can be notified about the issue, along with information about what caused it, so that you can take prompt action to unblock your job queue. Optionally, you can automatically terminate the MNP job if no action is taken within a specified amount of time, which you define in the job queue template. For more information, see Job queue blocked events.
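The sketch below shows what an EventBridge event pattern for these notifications and a job queue time limit action might look like. It assumes that blocked-queue events use the source `aws.batch` and the detail-type `Batch Job Queue Blocked`, and that the job queue accepts a `jobStateTimeLimitActions` list; confirm the field names and allowed values against the Job queue blocked events topic and the job queue API reference before relying on them. The `matches` helper is a simplified stand-in for EventBridge matching, not the service's actual logic.

```python
import json

# Assumed event fields; verify against the Job queue blocked events topic.
BLOCKED_QUEUE_PATTERN = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job Queue Blocked"],
}

# Assumed job queue setting that cancels a job left in RUNNABLE for 10 minutes.
# The reason string and the time limit are illustrative values.
JOB_STATE_TIME_LIMIT_ACTIONS = [
    {
        "reason": "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
        "state": "RUNNABLE",
        "maxTimeSeconds": 600,
        "action": "CANCEL",
    }
]


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matching: every top-level pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


sample_event = {"source": "aws.batch", "detail-type": "Batch Job Queue Blocked"}
print(matches(BLOCKED_QUEUE_PATTERN, sample_event))
print(json.dumps(BLOCKED_QUEUE_PATTERN, indent=2))
```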

Create an Amazon EKS MNP job definition

To define and run MNP jobs on Amazon EKS, use the new parameters within the RegisterJobDefinition and SubmitJob API operations.

You can perform these actions through the API operations or the Amazon Web Services Management Console.

Register the Amazon EKS MNP job definition request payload

The following example illustrates how you can register an Amazon EKS MNP job definition with two nodes.

{
  "jobDefinitionName": "MyEksMnpJobDefinition",
  "type": "multinode",
  "nodeProperties": {
    "numNodes": 2,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:",
        "eksProperties": {
          "podProperties": {
            "containers": [
              {
                "name": "test-eks-container-1",
                "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
                "command": [
                  "sleep",
                  "60"
                ],
                "resources": {
                  "limits": {
                    "cpu": "1",
                    "memory": "1024Mi"
                  }
                },
                "securityContext": {
                  "runAsUser": 1000,
                  "runAsGroup": 3000,
                  "privileged": true,
                  "readOnlyRootFilesystem": true,
                  "runAsNonRoot": true
                }
              }
            ],
            "initContainers": [
              {
                "name": "init-ekscontainer",
                "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
                "command": [
                  "echo",
                  "helloWorld"
                ],
                "resources": {
                  "limits": {
                    "cpu": "1",
                    "memory": "1024Mi"
                  }
                }
              }
            ],
            "metadata": {
              "labels": {
                "environment": "test"
              }
            }
          }
        }
      }
    ]
  }
}

To register the job definition using the Amazon CLI, copy the definition to a local file named MyEksMnpJobDefinition.json and run the following command.

aws batch register-job-definition --cli-input-json file://MyEksMnpJobDefinition.json

You will receive the following JSON response.

{
  "jobDefinitionName": "MyEksMnpJobDefinition",
  "jobDefinitionArn": "arn:aws:batch:us-east-1:0123456789:job-definition/MyEksMnpJobDefinition:1",
  "revision": 1
}
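The job definition above uses the node range "0:" to cover both nodes. As a minimal sketch of how such a range can be read, the helper below expands a targetNodes value into node indexes, assuming zero-based indexes, an inclusive end, and that an omitted start or end defaults to the first or last node. The function name is hypothetical, and rejecting a value without a colon is a simplification for this example.

```python
def expand_target_nodes(target_nodes: str, num_nodes: int) -> range:
    """Return the node indexes covered by a range such as "0:", "1:3", or ":2".

    Assumes zero-based indexes and an inclusive end index. An omitted start
    defaults to 0 and an omitted end defaults to the highest node index.
    """
    start_text, sep, end_text = target_nodes.partition(":")
    if not sep:
        raise ValueError(f"expected 'start:end', got {target_nodes!r}")
    start = int(start_text) if start_text else 0
    end = int(end_text) if end_text else num_nodes - 1
    if not 0 <= start <= end < num_nodes:
        raise ValueError(
            f"range {target_nodes!r} does not fit a job with {num_nodes} nodes"
        )
    return range(start, end + 1)


print(list(expand_target_nodes("0:", 2)))   # every node of the two-node job
print(list(expand_target_nodes("1:3", 5)))  # a sub-range of a five-node job
```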

Submit the Amazon EKS MNP job

To submit a job using the registered job definition, enter the following command. Replace the value of <EKS_JOB_QUEUE_NAME> with the name or ARN of a pre-existing job queue associated with an Amazon EKS compute environment.

aws batch submit-job --job-queue <EKS_JOB_QUEUE_NAME> \
    --job-definition MyEksMnpJobDefinition \
    --job-name myFirstEksMnpJob

You will receive the following JSON response.

{
  "jobArn": "arn:aws-cn:batch:region:account:job/9b979cce-9da0-446d-90e2-ffa16d52af68",
  "jobName": "myFirstEksMnpJob",
  "jobId": "<JOB_ID>"
}

You can check the status of the job using the returned jobId with the following command.

aws batch describe-jobs --jobs <JOB_ID>
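If you script this check, you might parse the command's JSON output rather than read it by eye. The following sketch assumes the output contains a top-level `jobs` list whose entries carry `jobId` and `status` fields, and it treats SUCCEEDED and FAILED as the terminal states. The helper names and the sample output are made up for illustration.

```python
import json

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def job_status(describe_jobs_output: str, job_id: str) -> str:
    """Return the status of job_id from `aws batch describe-jobs` JSON output."""
    response = json.loads(describe_jobs_output)
    for job in response.get("jobs", []):
        if job.get("jobId") == job_id:
            return job["status"]
    raise LookupError(f"job {job_id!r} not found in describe-jobs output")


def is_finished(status: str) -> bool:
    """True when the job has reached a state it will not leave."""
    return status in TERMINAL_STATES


# Hand-written sample output, not captured from a real job.
sample_output = json.dumps(
    {"jobs": [{"jobId": "example-job-id",
               "jobName": "myFirstEksMnpJob",
               "status": "RUNNING"}]}
)
status = job_status(sample_output, "example-job-id")
print(status, is_finished(status))
```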

Override an Amazon EKS MNP job definition

Optionally, you can override job definition details, such as the MNP job size or child job details. The following example JSON request payload submits a five-node MNP job and changes the command of the test-eks-container-1 container.

{
  "numNodes": 5,
  "nodePropertyOverrides": [
    {
      "targetNodes": "0:",
      "eksPropertiesOverride": {
        "podProperties": {
          "containers": [
            {
              "name": "test-eks-container-1",
              "command": [
                "sleep",
                "150"
              ]
            }
          ]
        }
      }
    }
  ]
}

Submit the Amazon EKS MNP job with overrides

To submit a job with these overrides, save the example to a local file named eks-mnp-job-nodeoverride.json, and use the Amazon CLI to submit the job with the overrides.

aws batch submit-job --job-queue <EKS_JOB_QUEUE_NAME> \
    --job-definition MyEksMnpJobDefinition \
    --node-overrides file://./eks-mnp-job-nodeoverride.json \
    --job-name fiveLongSleeps
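If you submit many variations of the same job, you might generate the override payload and the CLI command instead of editing them by hand. The sketch below rebuilds the override payload from this section and composes the matching command line. The helper names and the job queue name `my-eks-job-queue` are hypothetical, and the code only prints the command; it doesn't call Amazon Batch.

```python
import json
import shlex


def build_node_overrides(num_nodes: int, container_name: str,
                         command: list, target_nodes: str = "0:") -> dict:
    """Build a node overrides payload that resizes the job and replaces one command."""
    return {
        "numNodes": num_nodes,
        "nodePropertyOverrides": [
            {
                "targetNodes": target_nodes,
                "eksPropertiesOverride": {
                    "podProperties": {
                        "containers": [
                            {"name": container_name, "command": list(command)}
                        ]
                    }
                },
            }
        ],
    }


def submit_command(job_queue: str, job_definition: str,
                   job_name: str, overrides_path: str) -> str:
    """Compose the submit-job command line as a shell-safe string."""
    argv = [
        "aws", "batch", "submit-job",
        "--job-queue", job_queue,
        "--job-definition", job_definition,
        "--node-overrides", f"file://{overrides_path}",
        "--job-name", job_name,
    ]
    return shlex.join(argv)


overrides = build_node_overrides(5, "test-eks-container-1", ["sleep", "150"])
print(json.dumps(overrides, indent=2))
# "my-eks-job-queue" is a placeholder for your own job queue name or ARN.
print(submit_command("my-eks-job-queue", "MyEksMnpJobDefinition",
                     "fiveLongSleeps", "./eks-mnp-job-nodeoverride.json"))
```

Write the printed JSON to eks-mnp-job-nodeoverride.json before you run the printed command.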