
Troubleshooting Amazon Batch

You might need to troubleshoot issues that are related to your compute environments, job queues, job definitions, or jobs. This chapter describes how to troubleshoot and resolve such issues in your Amazon Batch environment.

Amazon Batch uses IAM policies, roles, and permissions, and runs on Amazon EC2, Amazon ECS, Amazon Fargate, and Amazon Elastic Kubernetes Service infrastructure. To troubleshoot issues that are related to these services, see the troubleshooting documentation for those services.

Amazon Batch

INVALID compute environment

It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an INVALID state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.

Incorrect role name or ARN

The most common cause for a compute environment to enter an INVALID state is that the Amazon Batch service role or the Amazon EC2 Spot Fleet role has an incorrect name or Amazon Resource Name (ARN). This is more common with compute environments that are created using the Amazon CLI or the Amazon SDKs. When you create a compute environment in the Amazon Web Services Management Console, Amazon Batch helps you choose the correct service or Spot Fleet roles. However, if you manually enter the name or the ARN and enter it incorrectly, the resulting compute environment is also INVALID.

However, suppose that you manually enter the name or ARN for an IAM resource in an Amazon CLI command or your SDK code. In this case, Amazon Batch can't validate the string. Instead, Amazon Batch must accept the bad value and attempt to create the environment. If Amazon Batch fails to create the environment, the environment moves to an INVALID state, and you see the following errors.

For an invalid service role:

CLIENT_ERROR - Not authorized to perform sts:AssumeRole (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: dc0e2d28-2e99-11e7-b372-7fcc6fb65fe7)

For an invalid Spot Fleet role:

CLIENT_ERROR - Parameter: SpotFleetRequestConfig.IamFleetRole is invalid. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidSpotFleetRequestConfig; Request ID: 331205f0-5ae3-4cea-bac4-897769639f8d) Parameter: SpotFleetRequestConfig.IamFleetRole is invalid

One common cause for this issue is the following scenario. You only specify the name of an IAM role when using the Amazon CLI or the Amazon SDKs, instead of the full Amazon Resource Name (ARN). Depending on how you created the role, the ARN might contain an aws-service-role path prefix. For example, if you manually create the Amazon Batch service role using the procedures in Using service-linked roles for Amazon Batch, your service role ARN might look like the following.

arn:aws-cn:iam::123456789012:role/AWSBatchServiceRole

However, if you created the service role as part of the console first run wizard today, your service role ARN might look like the following.

arn:aws-cn:iam::123456789012:role/aws-service-role/AWSBatchServiceRole

This issue can also occur if you attach the Amazon Batch service-level policy (AWSBatchServiceRole) to a non-service role. For example, you might receive an error message that resembles the following in this scenario:

CLIENT_ERROR - User: arn:aws:sts::account_number:assumed-role/batch-replacement-role/aws-batch is not authorized to perform: action on resource ...

To resolve this issue, do one of the following.

  • Use an empty string for the service role when you create the Amazon Batch compute environment.

  • Specify the service role in the following format: arn:aws:iam::account_number:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch.

When you only specify the name of an IAM role when using the Amazon CLI or the Amazon SDKs, Amazon Batch assumes that your ARN doesn't use the aws-service-role path prefix. Because of this, we recommend that you specify the full ARN for your IAM roles when you create compute environments.
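
For example, the following Amazon CLI command shows where the full service role ARN goes when you create a compute environment. This is a minimal sketch; the compute environment name, subnet, and security group IDs are placeholders for illustration only, and the service role ARN follows the format shown above.

$ aws batch create-compute-environment \
    --compute-environment-name my-fargate-ce \
    --type MANAGED \
    --state ENABLED \
    --service-role arn:aws:iam::account_number:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch \
    --compute-resources type=FARGATE,maxvCpus=16,subnets=subnet-0abc1234,securityGroupIds=sg-0abc1234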

To repair a compute environment that's misconfigured this way, see Repairing an INVALID compute environment.

Repairing an INVALID compute environment

When you have a compute environment in an INVALID state, update it to repair the invalid parameter. For an Incorrect role name or ARN, update the compute environment using the correct service role.

To repair a misconfigured compute environment
  1. Open the Amazon Batch console at https://console.amazonaws.cn/batch/.

  2. From the navigation bar, select the Amazon Web Services Region to use.

  3. In the navigation pane, choose Compute environments.

  4. On the Compute environments page, select the radio button next to the compute environment to edit, and then choose Edit.

  5. On the Update compute environment page, for Service role, choose the IAM role to use with your compute environment. The Amazon Batch console only displays roles that have the correct trust relationship for compute environments.

  6. Choose Save to update your compute environment.
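
If you prefer the Amazon CLI, you can apply the same fix by setting the correct service role on the compute environment directly. This is a minimal sketch; the compute environment name is a placeholder, and the ARN follows the format described earlier.

$ aws batch update-compute-environment \
    --compute-environment my-compute-env \
    --service-role arn:aws:iam::account_number:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch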

Jobs stuck in a RUNNABLE status

Suppose that your compute environment contains compute resources, but your jobs don't progress beyond the RUNNABLE status. Then, it's likely that something is preventing the jobs from being placed on a compute resource and causing your job queues to be blocked. Here's how to know if your job is waiting for its turn or stuck and blocking the queue.

If Amazon Batch detects that you have a RUNNABLE job at the head and blocking the queue, you'll receive a blocked job queue event from Amazon CloudWatch Events with the reason. The same reason is also updated into the statusReason field as a part of ListJobs and DescribeJobs API calls.
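
For example, you can inspect the statusReason of a job that remains in the RUNNABLE status with the Amazon CLI. The job ID below is a placeholder.

$ aws batch describe-jobs \
    --jobs 1a2b3c4d-example-job-id \
    --query "jobs[].[jobId,status,statusReason]"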

Optionally, you can configure the jobStateTimeLimitActions parameter through CreateJobQueue and UpdateJobQueue API actions.

Note

Currently, the only action you can use with jobStateTimeLimitActions.action is to cancel a job.

The jobStateTimeLimitActions parameter is used to specify a set of actions that Amazon Batch performs on jobs in a specific state. You can set a time threshold in seconds through the maxTimeSeconds field.

When a job has been in a RUNNABLE state with the defined statusReason, Amazon Batch performs the action specified after maxTimeSeconds have elapsed.

For example, you can set the jobStateTimeLimitActions parameter to wait up to 4 hours for any job in the RUNNABLE state that is waiting for sufficient capacity to become available. You can do this by setting reason to CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY and maxTimeSeconds to 14400 before canceling the job and allowing the next job to advance to the head of the job queue.
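
The following Amazon CLI command is a sketch of that configuration on an existing job queue; the job queue name is a placeholder, and 14400 seconds corresponds to the 4 hours described above.

$ aws batch update-job-queue \
    --job-queue my-job-queue \
    --job-state-time-limit-actions reason=CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY,state=RUNNABLE,maxTimeSeconds=14400,action=CANCEL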

The following are the reasons that Amazon Batch provides when it detects that a job queue is blocked. This list provides the messages returned from the ListJobs and DescribeJobs API actions. These are also the same values you can define for the jobStateTimeLimitActions.reason parameter.

  1. Reason: All connected compute environments have insufficient capacity errors. When requested, Amazon Batch detects Amazon EC2 instances that experience insufficient capacity errors. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue.

    • statusReason message while the job is stuck: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY - Service cannot fulfill the capacity requested for instance type [instanceTypeName]

    • reason used for jobStateTimeLimitActions: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY

    • statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: CAPACITY:INSUFFICIENT_INSTANCE_CAPACITY

    Note:

    1. The Amazon Batch service role requires autoscaling:DescribeScalingActivities permission for this detection to work. If you use the AWSServiceRoleForBatch service-linked role (SLR) or the AWSBatchServiceRolePolicy managed policy, then you don’t need to take any action because their permission policies are updated.

    2. If you don't use the SLR or the managed policy, you must add the autoscaling:DescribeScalingActivities and ec2:DescribeSpotFleetRequestHistory permissions so that you can receive blocked job queue events and updated job status when in RUNNABLE. In addition, Amazon Batch needs these permissions to perform cancellation actions through the jobStateTimeLimitActions parameter even if they are configured on the job queue.

    3. In the case of a multi-node parallel (MNP) job, if the attached high-priority Amazon EC2 compute environment experiences insufficient capacity errors, it blocks the queue even if a lower-priority compute environment doesn't experience this error.

  2. Reason: All compute environments have a maxvCpus parameter that is smaller than the job requirements. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue. Optionally, you can increase the maxvCpus parameter of the primary compute environment to meet the needs of the blocked job.

    • statusReason message while the job is stuck: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE - CE(s) associated with the job queue cannot meet the CPU requirement of the job.

    • reason used for jobStateTimeLimitActions: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE

    • statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE

  3. Reason: None of the compute environments have instances that meet the job requirements. When a job requests resources, Amazon Batch detects that no attached compute environment is able to accommodate the incoming job. Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue. Optionally, you can redefine the compute environment's allowed instance types to add the necessary job resources.

    • statusReason message while the job is stuck: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT - The job resource requirement (vCPU/memory/GPU) is higher than that can be met by the CE(s) attached to the job queue.

    • reason used for jobStateTimeLimitActions: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT

    • statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT

  4. Reason: All compute environments have service role issues. To resolve this, compare your service role permissions to the Amazon Batch managed service role permissions and address any gaps.

    It's a best practice to use the Amazon Batch SLR for compute environments to avoid similar errors.

    Canceling the job, either manually or by setting the jobStateTimeLimitActions parameter on statusReason, allows the subsequent job to move to the head of the queue. Without resolving the service role issues, it is likely that the next job will also be blocked. It's best to manually investigate and resolve this issue.

    • statusReason message while the job is stuck: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS – Batch service role has a permission issue.

    • reason used for jobStateTimeLimitActions: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS

    • statusReason message after the job is canceled: Canceled by JobStateTimeLimit action due to reason: MISCONFIGURATION:SERVICE_ROLE_PERMISSIONS

  5. Reason: All compute environments are invalid. For more information, see INVALID compute environment. Note: You can't configure a programmable action through the jobStateTimeLimitActions parameter to resolve this error.

    • statusReason message while the job is stuck: ACTION_REQUIRED - CE(s) associated with the job queue are invalid.

  6. Reason: Amazon Batch has detected a blocked queue, but is unable to determine the reason. Note: You can't configure a programmable action through the jobStateTimeLimitActions parameter to resolve this error. For more information about troubleshooting, see Why is my Amazon Batch job stuck in RUNNABLE status? in re:Post.

    • statusReason message while the job is stuck: UNDETERMINED - Batch job is blocked, root cause is undetermined.

If you didn't receive an event from CloudWatch Events, or you received an event with the UNDETERMINED reason, the following are some common causes for this issue.

The awslogs log driver isn't configured on your compute resources

Amazon Batch jobs send their log information to CloudWatch Logs. To enable this, you must configure your compute resources to use the awslogs log driver. Suppose that you base your compute resource AMI off of the Amazon ECS optimized AMI (or Amazon Linux). Then, this driver is registered by default with the ecs-init package. Now suppose that you use a different base AMI. Then, you must verify that the awslogs log driver is specified as an available log driver with the ECS_AVAILABLE_LOGGING_DRIVERS environment variable when the Amazon ECS container agent is started. For more information, see Compute resource AMI specification and Creating a compute resource AMI.
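
For example, if you build on a custom base AMI, you might append a line similar to the following to /etc/ecs/ecs.config before the Amazon ECS container agent starts. This is a sketch; the json-file entry is included only as a commonly available additional driver.

$ echo 'ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]' | sudo tee -a /etc/ecs/ecs.config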

Insufficient resources

If your job definitions specify more CPU or memory resources than your compute resources can allocate, then your jobs are never placed. For example, suppose that your job specifies 4 GiB of memory, and your compute resources have less than that available. Then the job can't be placed on those compute resources. In this case, you must reduce the specified memory in your job definition or add larger compute resources to your environment. Some memory is reserved for the Amazon ECS container agent and other critical system processes. For more information, see Compute Resource Memory Management.

No internet access for compute resources

Compute resources need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your compute resources having public IP addresses.

For more information about interface VPC endpoints, see Amazon ECS Interface VPC Endpoints (Amazon PrivateLink) in the Amazon Elastic Container Service Developer Guide.

If you don't have an interface VPC endpoint configured and your compute resources don't have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see NAT gateways and Create a VPC in the Amazon VPC User Guide.

Amazon EC2 instance limit reached

The number of Amazon EC2 instances that your account can launch in an Amazon Web Services Region is determined by your EC2 instance quota. Certain instance types also have a per-instance-type quota. For more information about your account's Amazon EC2 instance quota including how to request a limit increase, see Amazon EC2 Service Limits in the Amazon EC2 User Guide for Linux Instances.
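
For example, you can check your current quota for running On-Demand Standard instances with the Service Quotas Amazon CLI command below. The quota code L-1216C47A corresponds to Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances and is shown here only as an example.

$ aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A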

Amazon ECS container agent isn't installed

The Amazon ECS container agent must be installed on the Amazon Machine Image (AMI) to let Amazon Batch run jobs. The Amazon ECS container agent is installed by default on Amazon ECS optimized AMIs. For more information about the Amazon ECS container agent, see Amazon ECS container agent in the Amazon Elastic Container Service Developer Guide.

For more information, see Why is my Amazon Batch job stuck in RUNNABLE status? in re:Post.

Spot Instances not tagged on creation

Spot Instance tagging for Amazon Batch compute resources is supported as of October 25, 2017. Before that date, the recommended IAM managed policy (AmazonEC2SpotFleetRole) for the Amazon EC2 Spot Fleet role didn't contain permissions to tag Spot Instances at launch. The new recommended IAM managed policy is called AmazonEC2SpotFleetTaggingRole. It supports tagging Spot Instances at launch.

To fix Spot Instance tagging on creation, use the following procedure to apply the current recommended IAM managed policy to your Amazon EC2 Spot Fleet role. That way, any future Spot Instances that are created with that role have permissions to apply instance tags when they're created.

To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role
  1. Open the IAM console at https://console.amazonaws.cn/iam/.

  2. Choose Roles, and choose your Amazon EC2 Spot Fleet role.

  3. Choose Attach policy.

  4. Select the AmazonEC2SpotFleetTaggingRole and choose Attach policy.

  5. Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.

  6. Select the x to the right of the AmazonEC2SpotFleetRole policy, and choose Detach.

Spot Instances not scaling down

Amazon Batch introduced the AWSServiceRoleForBatch service-linked role on March 10, 2021. If no role is specified in the serviceRole parameter of the compute environment, this service-linked role is used as the service role. However, suppose that the service-linked role is used in an EC2 Spot compute environment, but the Spot role used doesn't include the AmazonEC2SpotFleetTaggingRole managed policy. Then, the Spot Instance doesn't scale down. As a result, you will receive an error with the following message: "You are not authorized to perform this operation." Use the following steps to update the spot fleet role that you use in the spotIamFleetRole parameter. For more information, see Using service-linked roles and Creating a role to delegate permissions to an Amazon Service in the IAM User Guide.

Attach AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role in the Amazon Web Services Management Console

To apply the current IAM managed policy to your Amazon EC2 Spot Fleet role
  1. Open the IAM console at https://console.amazonaws.cn/iam/.

  2. Choose Roles, and choose your Amazon EC2 Spot Fleet role.

  3. Choose Attach policy.

  4. Select the AmazonEC2SpotFleetTaggingRole and choose Attach policy.

  5. Choose your Amazon EC2 Spot Fleet role again to remove the previous policy.

  6. Select the x to the right of the AmazonEC2SpotFleetRole policy, and choose Detach.

Attach AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role with the Amazon CLI

The example commands assume that your Amazon EC2 Spot Fleet role is named AmazonEC2SpotFleetRole. If your role uses a different name, adjust the commands to match.

To attach the AmazonEC2SpotFleetTaggingRole managed policy to your Spot Fleet role
  1. To attach the AmazonEC2SpotFleetTaggingRole managed IAM policy to your AmazonEC2SpotFleetRole role, run the following command using the Amazon CLI.

    $ aws iam attach-role-policy \
        --policy-arn arn:aws-cn:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole \
        --role-name AmazonEC2SpotFleetRole

  2. To detach the AmazonEC2SpotFleetRole managed IAM policy from your AmazonEC2SpotFleetRole role, run the following command using the Amazon CLI.

    $ aws iam detach-role-policy \
        --policy-arn arn:aws-cn:iam::aws:policy/service-role/AmazonEC2SpotFleetRole \
        --role-name AmazonEC2SpotFleetRole

Can't retrieve Secrets Manager secrets

If you use an AMI with an Amazon ECS agent that's earlier than version 1.16.0-1, then you must use the Amazon ECS agent configuration variable ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true to use this feature. You can add it to the /etc/ecs/ecs.config file on a new container instance when you create that instance. Or, you can add it to an existing instance. If you add it to an existing instance, you must restart the ECS agent afterward. For more information, see Amazon ECS Container Agent Configuration in the Amazon Elastic Container Service Developer Guide.
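
On an existing container instance, the change might look like the following sketch, which assumes an Amazon Linux 2 based instance where the agent runs as the ecs systemd service.

$ echo "ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true" | sudo tee -a /etc/ecs/ecs.config
$ sudo systemctl restart ecs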

Can't override job definition resource requirements

The memory and vCPU overrides that are specified in the memory and vcpus members of the containerOverrides structure, which is passed to SubmitJob, can't override the memory and vCPU requirements that are specified in the resourceRequirements structure in the job definition.

If you try to override these resource requirements, you might see the following error message:

"This value was submitted in a deprecated key and may conflict with the value provided by the job definition's resource requirements."

To correct this, specify the memory and vCPU requirements in the resourceRequirements member of the containerOverrides. For example, if your memory and vCPU overrides are specified in the following lines.

"containerOverrides": { "memory": 8192, "vcpus": 4 }

Change them to the following:

"containerOverrides": { "resourceRequirements": [ { "type": "MEMORY", "value": "8192" }, { "type": "VCPU", "value": "4" } ], }

Make the same change to the memory and vCPU requirements that are specified in the containerProperties object in the job definition. For example, if your memory and vCPU requirements are specified in the following lines.

{ "containerProperties": { "memory": 4096, "vcpus": 2, }

Change them to the following:

"containerProperties": { "resourceRequirements": [ { "type": "MEMORY", "value": "4096" }, { "type": "VCPU", "value": "2" } ], }

Error message when you update the desiredvCpus setting

You see the following error message when you use the Amazon Batch API to update the desired vCPUs (desiredvCpus) setting.

Manually scaling down compute environment is not supported. Disconnecting job queues from compute environment will cause it to scale-down to minvCpus.

This issue occurs if the updated desiredvCpus value is less than the current desiredvCpus value. When you update the desiredvCpus value, both of the following must be true:

  • The desiredvCpus value must be between the minvCpus and maxvCpus values.

  • The updated desiredvCpus value must be greater than or equal to the current desiredvCpus value.
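
For example, the following Amazon CLI sketch increases desiredvCpus on a compute environment whose minvCpus is 0, maxvCpus is 16, and current desiredvCpus is 4, so both conditions are met. The compute environment name and values are placeholders.

$ aws batch update-compute-environment \
    --compute-environment my-compute-env \
    --compute-resources desiredvCpus=8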

Amazon Batch on Amazon EKS

INVALID compute environment

It's possible that you might have incorrectly configured a managed compute environment. If you did, the compute environment enters an INVALID state and can't accept jobs for placement. The following sections describe the possible causes and how to troubleshoot based on the cause.

Unsupported Kubernetes version

You might see an error message that resembles the following when you use the CreateComputeEnvironment API operation or UpdateComputeEnvironment API operation to create or update a compute environment. This issue occurs if you specify an unsupported Kubernetes version in EC2Configuration.

At least one imageKubernetesVersion in EC2Configuration is not supported.

To resolve this issue, delete the compute environment and then re-create it with a supported Kubernetes version.

You can perform a minor version upgrade on your Amazon EKS cluster. For example, you can upgrade the cluster from 1.xx to 1.yy even if the minor version isn't supported.

However, the compute environment status might change to INVALID after a major version update. For example, if you perform a major version upgrade from 1.xx to 2.yy. If the major version isn't supported by Amazon Batch, you see an error message that resembles the following.

reason=CLIENT_ERROR - ... EKS Cluster version [2.yy] is unsupported

To resolve this issue, specify a supported Kubernetes version when you use an API operation to create or update a compute environment.

Amazon Batch on Amazon EKS currently supports the following Kubernetes versions:

  • 1.29

  • 1.28

  • 1.27

  • 1.26

  • 1.25

  • 1.24

  • 1.23
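
When you re-create the compute environment, you can pin one of the supported versions listed above with the imageKubernetesVersion field in ec2Configuration. The following Amazon CLI command is a sketch only; the names, cluster ARN, namespace, subnet, security group, and instance profile are placeholders, and EKS_AL2 is used as an example image type.

$ aws batch create-compute-environment \
    --compute-environment-name my-eks-ce \
    --type MANAGED \
    --eks-configuration eksClusterArn=arn:aws-cn:eks:cn-north-1:123456789012:cluster/my-cluster,kubernetesNamespace=my-aws-batch-namespace \
    --compute-resources 'type=EC2,minvCpus=0,maxvCpus=16,instanceTypes=m5.large,instanceRole=ecsInstanceRole,subnets=subnet-0abc1234,securityGroupIds=sg-0abc1234,ec2Configuration=[{imageType=EKS_AL2,imageKubernetesVersion=1.29}]'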

Instance profile doesn't exist

If the specified instance profile does not exist, the Amazon Batch on Amazon EKS compute environment status is changed to INVALID. You see an error set in the statusReason parameter that resembles the following.

CLIENT_ERROR - Instance profile arn:aws-cn:iam::...:instance-profile/<name> does not exist

To resolve this issue, specify or create a working instance profile. For more information, see Amazon EKS node IAM role in the Amazon EKS User Guide.
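
You can confirm whether the instance profile exists in your account with the Amazon CLI. The profile name ecsInstanceRole below is only an example.

$ aws iam get-instance-profile --instance-profile-name ecsInstanceRole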

Invalid Kubernetes namespace

If Amazon Batch on Amazon EKS can't validate the namespace for the compute environment, the compute environment status is changed to INVALID. For example, this issue can occur if the namespace doesn't exist.

You see an error message set in the statusReason parameter that resembles the following.

CLIENT_ERROR - Unable to validate Kubernetes Namespace

This issue can occur if any of the following are true:

  • The Kubernetes namespace string in the CreateComputeEnvironment call doesn't exist. For more information, see CreateComputeEnvironment.

  • The required Role-Based Access Control (RBAC) permissions to manage the namespace are not configured correctly.

  • Amazon Batch doesn't have access to the Amazon EKS Kubernetes API server endpoint.

To resolve this issue, see Verify that the aws-auth ConfigMap is configured correctly. For more information, see Getting started with Amazon Batch on Amazon EKS.
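
As a quick check, you can confirm that the namespace exists and that the Amazon Batch RBAC objects are present in it. The namespace name below is a placeholder.

$ kubectl get namespace my-aws-batch-namespace
$ kubectl get role,rolebinding -n my-aws-batch-namespace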

Deleted compute environment

Suppose that you delete an Amazon EKS cluster before you delete the attached Amazon Batch on Amazon EKS compute environment. Then, the compute environment status is changed to INVALID. In this scenario, the compute environment doesn't work properly if you re-create the Amazon EKS cluster with the same name.

To resolve this issue, delete and then re-create the Amazon Batch on Amazon EKS compute environment.

Nodes don't join the Amazon EKS cluster

Amazon Batch on Amazon EKS scales down a compute environment if it determines that not all nodes joined the Amazon EKS cluster. When Amazon Batch on Amazon EKS scales down the compute environment, the compute environment status is changed to INVALID.

Note

Amazon Batch doesn't change the compute environment status immediately so that you can debug the issue.

You see an error message set in the statusReason parameter that resembles one of the following:

Your compute environment has been INVALIDATED and scaled down because none of the instances joined the underlying ECS Cluster. Common issues preventing instances joining are the following: VPC/Subnet configuration preventing communication to ECS, incorrect Instance Profile policy preventing authorization to ECS, or customized AMI or LaunchTemplate configurations affecting ECS agent.

Your compute environment has been INVALIDATED and scaled down because none of the nodes joined the underlying Amazon EKS Cluster. Common issues preventing nodes joining are the following: networking configuration preventing communication to Amazon EKS Cluster, incorrect Amazon EKS Instance Profile or Kubernetes RBAC policy preventing authorization to Amazon EKS Cluster, customized AMI or LaunchTemplate configurations affecting Amazon EKS/Kubernetes node bootstrap.

When you use a default Amazon EKS AMI, the most common causes of this issue are the networking, instance profile, and RBAC configuration problems listed in the preceding error messages.

Amazon Batch on Amazon EKS job is stuck in RUNNABLE status

An aws-auth ConfigMap is automatically created and applied to your cluster when you create a managed node group or a node group using eksctl. An aws-auth ConfigMap is initially created to allow nodes to join your cluster. However, you also use the aws-auth ConfigMap to add role-based access control (RBAC) access to users and roles.

To verify that the aws-auth ConfigMap is configured correctly:

  1. Retrieve the mapped roles in the aws-auth ConfigMap:

    $ kubectl get configmap -n kube-system aws-auth -o yaml
  2. Verify that the roleARN is configured as follows.

    rolearn: arn:aws-cn:iam::aws_account_number:role/AWSServiceRoleForBatch

    Note

    You can also review the Amazon EKS control plane logs. For more information, see Amazon EKS control plane logging in the Amazon EKS User Guide.

To resolve an issue where a job is stuck in a RUNNABLE status, we recommend that you use kubectl to re-apply the manifest. For more information, see Step 1: Preparing your Amazon EKS cluster for Amazon Batch. Or, you can use kubectl to manually edit the aws-auth ConfigMap. For more information, see Enabling IAM user and role access to your cluster in the Amazon EKS User Guide.

Verify that the aws-auth ConfigMap is configured correctly

To verify that the aws-auth ConfigMap is configured correctly:

  1. Retrieve the mapped roles in the aws-auth ConfigMap.

    $ kubectl get configmap -n kube-system aws-auth -o yaml
  2. Verify that the roleARN is configured as follows.

    rolearn: arn:aws-cn:iam::aws_account_number:role/AWSServiceRoleForBatch

    Note

    The path aws-service-role/batch.amazonaws.com/ has been removed from the ARN of the service-linked role. This is because of an issue with the aws-auth configuration map. For more information, see Roles with paths do not work when the path is included in their ARN in the aws-auth ConfigMap.

    Note

    You can also review the Amazon EKS control plane logs. For more information, see Amazon EKS control plane logging in the Amazon EKS User Guide.

To resolve an issue where a job is stuck in a RUNNABLE status, we recommend that you use kubectl to re-apply the manifest. For more information, see Step 1: Preparing your Amazon EKS cluster for Amazon Batch. Or, you can use kubectl to manually edit the aws-auth ConfigMap. For more information, see Enabling IAM user and role access to your cluster in the Amazon EKS User Guide.

RBAC permissions or bindings aren't configured properly

If you experience any RBAC permissions or binding issues, verify that the aws-batch Kubernetes role can access the Kubernetes namespace:

$ kubectl get namespace namespace --as=aws-batch
$ kubectl auth can-i get ns --as=aws-batch

You can also use the kubectl describe command to view the authorizations for a cluster role or Kubernetes namespace.

$ kubectl describe clusterrole aws-batch-cluster-role

The following is example output.

Name:         aws-batch-cluster-role
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                                      Non-Resource URLs  Resource Names  Verbs
  ---------                                      -----------------  --------------  -----
  configmaps                                     []                 []              [get list watch]
  nodes                                          []                 []              [get list watch]
  pods                                           []                 []              [get list watch]
  daemonsets.apps                                []                 []              [get list watch]
  deployments.apps                               []                 []              [get list watch]
  replicasets.apps                               []                 []              [get list watch]
  statefulsets.apps                              []                 []              [get list watch]
  clusterrolebindings.rbac.authorization.k8s.io  []                 []              [get list]
  clusterroles.rbac.authorization.k8s.io         []                 []              [get list]
  namespaces                                     []                 []              [get]
$ kubectl describe role aws-batch-compute-environment-role -n my-aws-batch-namespace

The following is example output.

Name:         aws-batch-compute-environment-role
Labels:       <none>
Annotations:  <none>
PolicyRule:
  Resources                                Non-Resource URLs  Resource Names  Verbs
  ---------                                -----------------  --------------  -----
  pods                                     []                 []              [create get list watch delete patch]
  serviceaccounts                          []                 []              [get list]
  rolebindings.rbac.authorization.k8s.io   []                 []              [get list]
  roles.rbac.authorization.k8s.io          []                 []              [get list]

To resolve this issue, re-apply the RBAC permissions and rolebinding commands. For more information, see Step 1: Preparing your Amazon EKS cluster for Amazon Batch.