To create a GPU-based job on EKS resources - Amazon Batch
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

To create a GPU-based job on EKS resources

This section covers how to run an Amazon EKS GPU workload on Amazon Batch.

To create GPU-based Kubernetes cluster on Amazon EKS

Before you create a GPU-based Kubernetes cluster on Amazon EKS, you must have completed the steps in Getting started with Amazon Batch on Amazon EKS. In addition, also consider the following:

  • Amazon Batch supports instance types with NVIDIA GPUs.

  • By default, Amazon Batch selects the Amazon EKS accelerated AMI with the Kubernetes version that matches your Amazon EKS cluster control plane version.

$ cat <<EOF > ./batch-eks-gpu-ce.json { "computeEnvironmentName": "My-Eks-GPU-CE1", "type": "MANAGED", "state": "ENABLED", "eksConfiguration": { "eksClusterArn": "arn:aws-cn:eks:<region>:<account>:cluster/<cluster-name>", "kubernetesNamespace": "my-aws-batch-namespace" }, "computeResources": { "type": "EC2", "allocationStrategy": "BEST_FIT_PROGRESSIVE", "minvCpus": 0, "maxvCpus": 1024, "instanceTypes": [ "p3dn.24xlarge", "p4d.24xlarge" ], "subnets": [ "<eks-cluster-subnets-with-access-to-internet-for-image-pull>" ], "securityGroupIds": [ "<eks-cluster-sg>" ], "instanceRole": "<eks-instance-profile>" } } EOF $ aws batch create-compute-environment --cli-input-json file://./batch-eks-gpu-ce.json

Amazon Batch doesn’t manage the NVIDIA GPU device plugin on your behalf. You must install this plugin into your Amazon EKS cluster and allow it to target the Amazon Batch nodes. For more information, see Enabling GPU Support in Kubernetes on GitHub.

To configure the NVIDIA device plugin (DaemonSet) to target the Amazon Batch nodes, run the following commands.

# pull nvidia daemonset spec $ curl -O https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml # using your favorite editor, add Batch node toleration # this will allow the DaemonSet to run on Batch nodes - key: "batch.amazonaws.com/batch-node" operator: "Exists" $ kubectl apply -f nvidia-device-plugin.yml

We do not recommend that you mix compute-based (CPU and memory) workloads with GPU-based workloads in the same pairings of compute environment and job queue. This is because compute jobs can use up GPU capacity.

To attach job queues, run the following commands.

$ cat <<EOF > ./batch-eks-gpu-jq.json { "jobQueueName": "My-Eks-GPU-JQ1", "priority": 10, "computeEnvironmentOrder": [ { "order": 1, "computeEnvironment": "My-Eks-GPU-CE1" } ] } EOF $ aws batch create-job-queue --cli-input-json file://./batch-eks-gpu-jq.json

To create an Amazon EKS GPU job definition

Only nvidia.com/gpu is supported at this time and resource value that you set must be a whole number. You can’t use fractions of GPU. For more information, see Schedule GPUs in the Kubernetes documentation.

To register a GPU job definition for Amazon EKS, run the following commands.

$ cat <<EOF > ./batch-eks-gpu-jd.json { "jobDefinitionName": "MyGPUJobOnEks_Smi", "type": "container", "eksProperties": { "podProperties": { "hostNetwork": true, "containers": [ { "image": "nvcr.io/nvidia/cuda:10.2-runtime-centos7", "command": ["nvidia-smi"], "resources": { "limits": { "cpu": "1", "memory": "1024Mi", "nvidia.com/gpu": "1" } } } ] } } } EOF $ aws batch register-job-definition --cli-input-json file://./batch-eks-gpu-jd.json

To run a GPU job in your Amazon EKS cluster

The GPU resource is non-compressible. Amazon Batch creates a pod spec for GPU jobs where the value of request equals the value of limits. This is a Kubernetes requirement.

To submit a GPU job, run the following commands.

$ aws batch submit-job --job-queue My-Eks-GPU-JQ1 --job-definition MyGPUJobOnEks_Smi --job-name My-Eks-GPU-Job # locate information that can help debug or find logs (if using Amazon CloudWatch Logs with Fluent Bit) $ aws batch describe-jobs --job <job-id> | jq '.jobs[].eksProperties.podProperties | {podName, nodeName}' { "podName": "aws-batch.f3d697c4-3bb5-3955-aa6c-977fcf1cb0ca", "nodeName": "ip-192-168-59-101.ec2.internal" }