Getting started with Amazon Batch on Amazon EKS Private Clusters - Amazon Batch

Getting started with Amazon Batch on Amazon EKS Private Clusters

Amazon Batch is a managed service that orchestrates batch workloads in your Amazon Elastic Kubernetes Service (Amazon EKS) clusters. This includes queuing, dependency tracking, managed job retries and priorities, pod management, and node scaling. This feature connects your existing private Amazon EKS cluster with Amazon Batch to run your jobs at scale. You can use eksctl (a command line interface for Amazon EKS), the Amazon console, or the Amazon Command Line Interface to create a private Amazon EKS cluster with all the other necessary resources. Support for private Amazon EKS clusters on Amazon Batch is generally available in commercial Amazon Web Services Regions where Amazon Batch is available.

Private-only Amazon EKS clusters have no inbound or outbound internet access and use only private subnets. Amazon VPC endpoints are used to enable private access to other Amazon services. eksctl supports creating fully-private clusters using a pre-existing Amazon VPC and subnets. eksctl also creates Amazon VPC endpoints in the supplied Amazon VPC and modifies route tables for the supplied subnets.

Each subnet should have an explicit route table associated with it because eksctl does not modify the main route table. Your cluster must pull images from a container registry that's in your Amazon VPC. You can also create an Amazon Elastic Container Registry in your Amazon VPC and copy container images to it for your nodes to pull from. For more information, see Copy a container image from one repository to another repository. To get started with Amazon ECR private repositories, see Amazon ECR private repositories.

You can optionally create a pull through cache rule with Amazon ECR. Once a pull through cache rule is created for an external public registry, you can pull an image from that external public registry using your Amazon ECR private registry uniform resource identifier (URI). Amazon ECR then creates a repository and caches the image. When a cached image is pulled using the Amazon ECR private registry URI, Amazon ECR checks the remote registry to see if there is a new version of the image and updates your private registry up to once every 24 hours.
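
As a rough illustration of how an upstream reference maps to a private registry URI, the following sketch rewrites a public image reference into the form served through a pull through cache rule. The function name, the account ID, the Region, and the ecr-public repository prefix are placeholders assumed for illustration. The prefix is whatever you chose when you created the rule, and the China Regions use the amazonaws.com.cn domain suffix.

```shell
# Sketch: map an upstream image reference to its pull through cache URI.
# All argument values passed to this function are placeholders.
#   $1 upstream reference, e.g. public.ecr.aws/amazonlinux/amazonlinux:2
#   $2 account ID   $3 Region   $4 repository prefix of the cache rule
#   $5 registry domain suffix (optional, defaults to amazonaws.com)
cached_image_uri() {
  upstream=$1 account=$2 region=$3 prefix=$4 suffix=${5:-amazonaws.com}
  # Drop the upstream registry host (everything up to the first "/").
  path=${upstream#*/}
  printf '%s.dkr.ecr.%s.%s/%s/%s\n' "$account" "$region" "$suffix" "$prefix" "$path"
}
```

For example, calling cached_image_uri with public.ecr.aws/amazonlinux/amazonlinux:2, account 123456789012, Region us-east-1, and prefix ecr-public prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/ecr-public/amazonlinux/amazonlinux:2.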

Prerequisites

Before starting this tutorial, you must install and configure the following tools and resources that you need to create and manage both Amazon Batch and Amazon EKS resources. You also need to create all the necessary resources, including a VPC, subnets, route tables, VPC endpoints, and an Amazon EKS cluster. You need to use the Amazon CLI.

  • Amazon CLI – A command line tool to work with Amazon services, including Amazon EKS. This guide requires that you use version 2.8.6 or later or 1.26.0 or later. For more information, see Installing, updating, and uninstalling the Amazon CLI in the Amazon Command Line Interface User Guide.

    After installing the Amazon CLI, we recommend that you configure it. For more information, see Quick configuration with aws configure in the Amazon Command Line Interface User Guide.

  • kubectl – A command line tool to work with Kubernetes clusters. This guide requires that you use version 1.23 or later. For more information, see Installing or updating kubectl in the Amazon EKS User Guide.

  • eksctl – A command line tool to work with Amazon EKS clusters that automates many individual tasks. This guide requires that you use version 0.115.0 or later. For more information, see Installing or updating eksctl in the Amazon EKS User Guide.

  • Required Amazon Identity and Access Management (IAM) permissions – The IAM security principal that you're using must have permissions to work with Amazon EKS IAM roles and service linked roles, Amazon CloudFormation, and a VPC and related resources. For more information, see Actions, resources, and condition keys for Amazon Elastic Kubernetes Service and Using service-linked roles in the IAM User Guide. You must complete all steps in this guide as the same user.

  • Creating an Amazon EKS cluster – For more information, see Getting started with Amazon EKS – eksctl in the Amazon EKS User Guide.

    Note

    Amazon Batch doesn't provide managed-node orchestration for CoreDNS or other deployment pods. If you need CoreDNS, see Adding the CoreDNS Amazon EKS add-on in the Amazon EKS User Guide. Alternatively, use eksctl create cluster to create the cluster; it includes CoreDNS by default.

  • Permissions – Users calling the CreateComputeEnvironment API operation to create a compute environment that uses Amazon EKS resources require permissions to the eks:DescribeCluster API operation. Using the Amazon Web Services Management Console to create a compute resource using Amazon EKS resources requires permissions to both eks:DescribeCluster and eks:ListClusters.

  • Create a private EKS cluster in the us-east-1 region using the sample eksctl config file.

    kind: ClusterConfig
    apiVersion: eksctl.io/v1alpha5
    availabilityZones:
      - us-east-1a
      - us-east-1b
      - us-east-1d
    managedNodeGroups:
    privateNetworking: true
    privateCluster:
      enabled: true
      skipEndpointCreation: false

    Create your resources using the command:

    $ eksctl create cluster -f clusterConfig.yaml

  • Batch managed nodes must be deployed to subnets that have the VPC interface endpoints that you require. For more information, see Private cluster requirements.
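
The minimum tool versions listed above can be checked with a small helper before you start. The following is a minimal sketch, not part of any of the tools: version_ge is a hypothetical function name, and it assumes that you pass bare dot-separated version numbers (without a leading "v" or other text).

```shell
# Sketch: succeed when version $1 is greater than or equal to version $2.
# Compares dot-separated numeric fields left to right; missing fields count as 0.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    n = split(have, h, "."); m = split(want, w, ".")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      a = h[i] + 0; b = w[i] + 0
      if (a > b) exit 0
      if (a < b) exit 1
    }
    exit 0
  }'
}
```

For example, version_ge 0.114.9 0.115.0 returns a non-zero exit status, which indicates that an eksctl release with that version number is older than this guide requires.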

Step 1: Preparing your EKS cluster for Amazon Batch

All steps are required.

  1. Create a dedicated namespace for Amazon Batch jobs

    Use kubectl to create a new namespace.

    $ namespace=my-aws-batch-namespace
    $ cat - <<EOF | kubectl create -f -
    {
      "apiVersion": "v1",
      "kind": "Namespace",
      "metadata": {
        "name": "${namespace}",
        "labels": {
          "name": "${namespace}"
        }
      }
    }
    EOF

    Output:

    namespace/my-aws-batch-namespace created
  2. Enable access via role-based access control (RBAC)

    Use kubectl to create a Kubernetes role for the cluster to allow Amazon Batch to watch nodes and pods, and to bind the role. You must do this once for each Amazon EKS cluster.

    Note

    For more information about using RBAC authorization, see Using RBAC Authorization in the Kubernetes documentation.

    $ cat - <<EOF | kubectl apply -f -
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: aws-batch-cluster-role
    rules:
      - apiGroups: [""]
        resources: ["namespaces"]
        verbs: ["get"]
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["configmaps"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["apps"]
        resources: ["daemonsets", "deployments", "statefulsets", "replicasets"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["rbac.authorization.k8s.io"]
        resources: ["clusterroles", "clusterrolebindings"]
        verbs: ["get", "list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: aws-batch-cluster-role-binding
    subjects:
      - kind: User
        name: aws-batch
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: aws-batch-cluster-role
      apiGroup: rbac.authorization.k8s.io
    EOF

    Output:

    clusterrole.rbac.authorization.k8s.io/aws-batch-cluster-role created
    clusterrolebinding.rbac.authorization.k8s.io/aws-batch-cluster-role-binding created

    Create a namespace-scoped Kubernetes role that lets Amazon Batch manage the pod lifecycle, and bind it. You must do this once for each unique namespace.

    $ namespace=my-aws-batch-namespace
    $ cat - <<EOF | kubectl apply -f - --namespace "${namespace}"
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: aws-batch-compute-environment-role
      namespace: ${namespace}
    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "get", "list", "watch", "delete", "patch"]
      - apiGroups: [""]
        resources: ["serviceaccounts"]
        verbs: ["get", "list"]
      - apiGroups: ["rbac.authorization.k8s.io"]
        resources: ["roles", "rolebindings"]
        verbs: ["get", "list"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: aws-batch-compute-environment-role-binding
      namespace: ${namespace}
    subjects:
      - kind: User
        name: aws-batch
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: aws-batch-compute-environment-role
      apiGroup: rbac.authorization.k8s.io
    EOF

    Output:

    role.rbac.authorization.k8s.io/aws-batch-compute-environment-role created
    rolebinding.rbac.authorization.k8s.io/aws-batch-compute-environment-role-binding created

    Update the Kubernetes aws-auth configuration map to map the preceding RBAC permissions to the Amazon Batch service-linked role.

    $ eksctl create iamidentitymapping \
        --cluster my-cluster-name \
        --arn "arn:aws-cn:iam::<your-account>:role/AWSServiceRoleForBatch" \
        --username aws-batch

    Output:

    2022-10-25 20:19:57 [ℹ] adding identity "arn:aws-cn:iam::<your-account>:role/AWSServiceRoleForBatch" to auth ConfigMap
    Note

    The path aws-service-role/batch.amazonaws.com/ has been removed from the ARN of the service-linked role. This is because of an issue with the aws-auth configuration map. For more information, see Roles with paths don't work when the path is included in their ARN in the aws-auth configmap.
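
If you copy the service-linked role ARN from IAM, it still contains the path described in the preceding note. The following sketch shows one way to drop the path before passing the ARN to eksctl. The function name strip_role_path is hypothetical, and the account ID in the example is a placeholder.

```shell
# Sketch: remove the path component from an IAM role ARN, keeping only the
# role name after "role/". The function name is illustrative.
strip_role_path() {
  arn=$1
  prefix=${arn%%:role/*}   # everything before ":role/"
  name=${arn##*/}          # last path segment, which is the role name
  printf '%s:role/%s\n' "$prefix" "$name"
}
```

For example, passing arn:aws-cn:iam::123456789012:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch prints arn:aws-cn:iam::123456789012:role/AWSServiceRoleForBatch. An ARN that has no path is printed unchanged.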

Step 2: Creating an Amazon EKS compute environment

Amazon Batch compute environments define compute resource parameters to meet your batch workload needs. In a managed compute environment, Amazon Batch helps you to manage the capacity and instance types of the compute resources (Kubernetes nodes) within your Amazon EKS cluster. This is based on the compute resource specification that you define when you create the compute environment. You can use EC2 On-Demand Instances or EC2 Spot Instances.

Now that the AWSServiceRoleForBatch service-linked role has access to your Amazon EKS cluster, you can create Amazon Batch resources. First, create a compute environment that points to your Amazon EKS cluster.

$ cat <<EOF > ./batch-eks-compute-environment.json
{
  "computeEnvironmentName": "My-Eks-CE1",
  "type": "MANAGED",
  "state": "ENABLED",
  "eksConfiguration": {
    "eksClusterArn": "arn:aws-cn:eks:<region>:123456789012:cluster/<cluster-name>",
    "kubernetesNamespace": "my-aws-batch-namespace"
  },
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 0,
    "maxvCpus": 128,
    "instanceTypes": [
      "m5"
    ],
    "subnets": [
      "<eks-cluster-subnets-with-access-to-the-image-for-image-pull>"
    ],
    "securityGroupIds": [
      "<eks-cluster-sg>"
    ],
    "instanceRole": "<eks-instance-profile>"
  }
}
EOF
$ aws batch create-compute-environment --cli-input-json file://./batch-eks-compute-environment.json
Notes
  • Don't specify the serviceRole parameter. When it's omitted, the Amazon Batch service-linked role is used. Amazon Batch on Amazon EKS only supports the Amazon Batch service-linked role.

  • Only BEST_FIT_PROGRESSIVE, SPOT_CAPACITY_OPTIMIZED, and SPOT_PRICE_CAPACITY_OPTIMIZED allocation strategies are supported for Amazon EKS compute environments.

    Note

    We recommend that you use SPOT_PRICE_CAPACITY_OPTIMIZED rather than SPOT_CAPACITY_OPTIMIZED in most instances.

  • For the instanceRole, see Creating the Amazon EKS node IAM role and Enabling IAM principal access to your cluster in the Amazon EKS User Guide. If you're using pod networking, see Configuring the Amazon VPC CNI plugin for Kubernetes to use IAM roles for service accounts in the Amazon EKS User Guide.

  • A way to get working subnets for the subnets parameter is to use the Amazon EKS managed node groups public subnets that were created by eksctl when creating an Amazon EKS cluster. Otherwise, use subnets that have a network path that supports pulling images.

  • The securityGroupIds parameter can use the same security group as the Amazon EKS cluster. This command retrieves the security group ID for the cluster.

    $ aws eks describe-cluster \
        --name <cluster-name> \
        --query cluster.resourcesVpcConfig.clusterSecurityGroupId
  • Maintenance of an Amazon EKS compute environment is a shared responsibility. For more information, see Security in Amazon EKS.
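
Two of the preceding notes (no serviceRole, and a supported allocation strategy) lend themselves to a quick pre-flight check of the request file before you call create-compute-environment. The following is a minimal sketch under stated assumptions: check_eks_ce_request is a hypothetical helper, it inspects the JSON as plain text rather than parsing it, and it treats a missing allocationStrategy as unsupported.

```shell
# Sketch: text-based pre-flight check of a compute environment request file.
# This greps the JSON text and is not a real JSON parser.
check_eks_ce_request() {
  file=$1
  # serviceRole must be absent so that the service-linked role is used.
  if grep -q '"serviceRole"' "$file"; then
    echo "serviceRole must not be specified" >&2
    return 1
  fi
  # Pull the allocationStrategy value out of the text, if present.
  strategy=$(sed -n 's/.*"allocationStrategy"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$file")
  case $strategy in
    BEST_FIT_PROGRESSIVE|SPOT_CAPACITY_OPTIMIZED|SPOT_PRICE_CAPACITY_OPTIMIZED) return 0 ;;
    *) echo "unsupported allocation strategy: $strategy" >&2; return 1 ;;
  esac
}
```

Running check_eks_ce_request ./batch-eks-compute-environment.json against the sample file from this step is expected to succeed, because that file omits serviceRole and uses BEST_FIT_PROGRESSIVE.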

Important

It's important to confirm that the compute environment is healthy before proceeding. The DescribeComputeEnvironments API operation can be used to do this.

$ aws batch describe-compute-environments --compute-environments My-Eks-CE1

Confirm that the status parameter is not INVALID. If it is, look at the statusReason parameter for the cause. For more information, see Troubleshooting Amazon Batch.
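
Because a new compute environment can take a short while to settle, it may be convenient to poll the status instead of checking it once. The following sketch wraps the same describe-compute-environments call in a loop. The function names, the attempt count, and the delay are assumptions chosen for illustration.

```shell
# Sketch: print the status field reported for a compute environment.
ce_status() {
  aws batch describe-compute-environments \
    --compute-environments "$1" \
    --query 'computeEnvironments[0].status' \
    --output text
}

# Returns 0 once the status is VALID, 1 if it is INVALID or the attempts run out.
#   $1 name, $2 max attempts (default 30), $3 seconds between attempts (default 10)
wait_for_ce() {
  name=$1 attempts=${2:-30} delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(ce_status "$name")
    case $status in
      VALID) return 0 ;;
      INVALID)
        echo "compute environment $name is INVALID; check statusReason" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $name (last status: $status)" >&2
  return 1
}
```

With these definitions loaded, wait_for_ce My-Eks-CE1 blocks until the environment from this step is VALID, and returns a non-zero exit status if it becomes INVALID or the attempts run out.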

Step 3: Create a job queue and attach the compute environment

Jobs submitted to this new job queue are run as pods on Amazon Batch managed nodes that joined the Amazon EKS cluster that's associated with your compute environment.

$ cat <<EOF > ./batch-eks-job-queue.json
{
  "jobQueueName": "My-Eks-JQ1",
  "priority": 10,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "My-Eks-CE1"
    }
  ]
}
EOF
$ aws batch create-job-queue --cli-input-json file://./batch-eks-job-queue.json
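
Before submitting jobs, you might want to confirm that the new queue is usable. The following sketch reads the state and status fields through the describe-job-queues command. The function name job_queue_ready is hypothetical, and the check assumes that a usable queue reports ENABLED and VALID.

```shell
# Sketch: succeed only when the job queue reports state ENABLED and status VALID.
job_queue_ready() {
  out=$(aws batch describe-job-queues \
    --job-queues "$1" \
    --query 'jobQueues[0].[state,status]' \
    --output text)
  # With --output text the two fields arrive on one line, separated by whitespace.
  set -- $out
  [ "$1" = "ENABLED" ] && [ "$2" = "VALID" ]
}
```

For example, job_queue_ready My-Eks-JQ1 returns a zero exit status when the queue created in this step is ready to accept jobs.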

Step 4: Create a job definition

In the image field of the job definition, instead of providing a link to an image in a public ECR repository, provide the link to the image stored in your private ECR repository. See the following sample job definition:

$ cat <<EOF > ./batch-eks-job-definition.json
{
  "jobDefinitionName": "MyJobOnEks_Sleep",
  "type": "container",
  "eksProperties": {
    "podProperties": {
      "hostNetwork": true,
      "containers": [
        {
          "image": "account-id.dkr.ecr.region.amazonaws.com/amazonlinux:2",
          "command": [
            "sleep",
            "60"
          ],
          "resources": {
            "limits": {
              "cpu": "1",
              "memory": "1024Mi"
            }
          }
        }
      ],
      "metadata": {
        "labels": {
          "environment": "test"
        }
      }
    }
  }
}
EOF
$ aws batch register-job-definition --cli-input-json file://./batch-eks-job-definition.json
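
Since a public image reference in the job definition is an easy mistake to make in a private cluster, a simple naming check can catch it early. The following sketch only inspects the registry host name. It does not verify that the repository exists or that your nodes can reach it, and is_private_ecr_image is a hypothetical helper name.

```shell
# Sketch: succeed when an image reference points at an Amazon ECR private
# registry host (<account-id>.dkr.ecr.<region>.amazonaws.com or .com.cn).
is_private_ecr_image() {
  host=${1%%/*}   # the registry host is everything before the first "/"
  case $host in
    *.dkr.ecr.*.amazonaws.com|*.dkr.ecr.*.amazonaws.com.cn) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, the check succeeds for account-id.dkr.ecr.region.amazonaws.com/amazonlinux:2 and fails for public.ecr.aws/amazonlinux/amazonlinux:2.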

To run kubectl commands, you will need private access to your Amazon EKS cluster. This means all traffic to your cluster API server must come from within your cluster's VPC or a connected network.

Step 5: Submit a job

$ aws batch submit-job --job-queue My-Eks-JQ1 \
    --job-definition MyJobOnEks_Sleep --job-name My-Eks-Job1
$ aws batch describe-jobs --jobs <jobId-from-submit-response>
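
To follow a submitted job to completion without rerunning describe-jobs by hand, a small polling loop can help. The following is a sketch under stated assumptions: the function names, the attempt count, and the delay are illustrative, and the loop treats SUCCEEDED and FAILED as the only terminal states.

```shell
# Sketch: print the current status of a job.
job_status() {
  aws batch describe-jobs --jobs "$1" --query 'jobs[0].status' --output text
}

# Poll until the job reaches a terminal state.
# Returns 0 for SUCCEEDED, 1 for FAILED or when the attempts run out.
#   $1 job ID, $2 max attempts (default 60), $3 seconds between attempts (default 10)
wait_for_job() {
  job_id=$1 attempts=${2:-60} delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(job_status "$job_id")
    case $status in
      SUCCEEDED) return 0 ;;
      FAILED)
        echo "job $job_id FAILED" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for job $job_id (last status: $status)" >&2
  return 1
}
```

To capture the job ID for wait_for_job, you can add --query jobId --output text to the submit-job command and store its output in a shell variable.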

(Optional) Submit a job with overrides

This job overrides the command passed to the container.

$ cat <<EOF > ./submit-job-override.json
{
  "jobName": "EksWithOverrides",
  "jobQueue": "My-Eks-JQ1",
  "jobDefinition": "MyJobOnEks_Sleep",
  "eksPropertiesOverride": {
    "podProperties": {
      "containers": [
        {
          "command": [
            "/bin/sh"
          ],
          "args": [
            "-c",
            "echo hello world"
          ]
        }
      ]
    }
  }
}
EOF
$ aws batch submit-job --cli-input-json file://./submit-job-override.json

Troubleshooting

If nodes launched by Amazon Batch don't have access to the Amazon ECR repository (or any other repository) that stores your image, then your jobs could remain in the STARTING state. This is because the pod will not be able to download the image and run your Amazon Batch job. If you click on the pod name launched by Amazon Batch you should be able to see the error message and confirm the issue. The error message should look similar to the following:

Failed to pull image "public.ecr.aws/amazonlinux/amazonlinux:2": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/amazonlinux/amazonlinux:2": failed to resolve reference "public.ecr.aws/amazonlinux/amazonlinux:2": failed to do request: Head "https://public.ecr.aws/v2/amazonlinux/amazonlinux/manifests/2": dial tcp: i/o timeout
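
When several pods are stuck, it can help to list which images failed to pull. The following sketch extracts the image name from messages of the form shown above. The function name failed_pull_images is hypothetical, and the pattern assumes the message wording in this example.

```shell
# Sketch: extract the image name from "Failed to pull image" event messages.
# Reads text on standard input and prints each distinct image once.
failed_pull_images() {
  sed -n 's/.*Failed to pull image "\([^"]*\)".*/\1/p' | sort -u
}
```

For example, you can pipe the output of kubectl describe pod for a stuck pod in my-aws-batch-namespace into failed_pull_images. Any image that it reports from a public registry needs to be copied to, or cached in, a registry that your private subnets can reach.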

For other common troubleshooting scenarios, see Troubleshooting Amazon Batch. For troubleshooting based on pod status, see How do I troubleshoot the pod status in Amazon EKS?.