Troubleshooting Amazon ECS Managed Instances

When launching tasks with Amazon ECS Managed Instances, Amazon ECS first attempts to place tasks on existing capacity and requests additional capacity for tasks that cannot be placed. If instance provisioning fails, the Amazon EC2 request ID is included in the task failure message. You can use this request ID to look up details of the failed request in CloudTrail for further troubleshooting.
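
As a rough sketch of that lookup, the helper below searches recent Amazon EC2 events in CloudTrail for a given request ID. The function name find_ec2_request and the 200-event cap are illustrative assumptions, not part of Amazon ECS. The search runs in the Region configured for the CLI and filters on the client side, because CloudTrail does not offer a request ID lookup attribute.

```shell
# Hypothetical helper: print CloudTrail records for Amazon EC2 calls whose
# raw event JSON contains the given request ID.
#   $1 = request ID copied from the task failure message
find_ec2_request() {
  local request_id="$1"

  # --max-items caps how many events the CLI fetches (assumed value: 200).
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventSource,AttributeValue=ec2.amazonaws.com \
    --max-items 200 \
    --query "Events[?contains(CloudTrailEvent, '${request_id}')].CloudTrailEvent" \
    --output text
}

# Example (placeholder request ID):
#   find_ec2_request "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"
```

If nothing is returned, the event may be older than the fetched window or not yet delivered, so widening the cap or retrying after a few minutes may help.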

Note

If you choose to apply least-privilege permissions and specify your own permissions for the instance profile instead of using the AmazonECSInstanceRolePolicyForManagedInstances managed policy, you can add the following permissions to help with troubleshooting task-related issues with Amazon ECS Managed Instances:

  • ecs:StartTelemetrySession

  • ecs:PutSystemLogEvents

Task definition is incompatible with Amazon ECS Managed Instances

Common cause

This error occurs when your task definition contains parameters or configurations that are not supported by Amazon ECS Managed Instances. Common incompatibilities include unsupported network modes, task roles, or resource requirements.

Resolution

  1. Verify that the requiresCompatibilities parameter of your task definition includes MANAGED_INSTANCES.

  2. Ensure your task definition uses the awsvpc network mode.

  3. Check that CPU and memory values are within supported ranges for Amazon ECS Managed Instances.

  4. Review the detailed error message for specific incompatibility details.
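
One way to review the fields mentioned in the steps above is to print them from the registered task definition. This is a minimal sketch; the function name show_task_def_compat is a hypothetical helper, and the task definition name in the example is a placeholder.

```shell
# Hypothetical helper: show the compatibility-related fields of a task definition.
#   $1 = task definition family, family:revision, or full ARN
show_task_def_compat() {
  aws ecs describe-task-definition \
    --task-definition "$1" \
    --query 'taskDefinition.{requiresCompatibilities: requiresCompatibilities, networkMode: networkMode, cpu: cpu, memory: memory}' \
    --output json
}

# Example (placeholder name):
#   show_task_def_compat my-task:1
```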

Capacity provider not associated with cluster

Common cause

This error occurs when the capacity provider specified in your capacity provider strategy is not associated with the cluster or does not exist.

Resolution

  1. Verify that the capacity provider exists in your account and Region.

  2. Associate the capacity provider with your cluster using the Amazon ECS console or CLI.

  3. Ensure the capacity provider is in ACTIVE status before using it.
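
The checks above might be scripted roughly as follows. The function name check_capacity_provider is a hypothetical helper. The sketch assumes a recent AWS CLI version in which describe-capacity-providers accepts the --cluster option; if your version rejects that option, upgrade the CLI before relying on this.

```shell
# Hypothetical helper: show a capacity provider's status and the capacity
# providers currently associated with a cluster.
#   $1 = cluster name or ARN
#   $2 = capacity provider name
check_capacity_provider() {
  local cluster="$1"
  local provider="$2"

  # Status should be ACTIVE before the provider is used.
  # Assumption: --cluster is supported by the installed CLI version.
  aws ecs describe-capacity-providers \
    --cluster "$cluster" \
    --capacity-providers "$provider" \
    --query 'capacityProviders[].[name, status]' \
    --output text

  # The provider name should appear in this list.
  aws ecs describe-clusters \
    --clusters "$cluster" \
    --query 'clusters[].capacityProviders[]' \
    --output text
}

# Example (placeholder names):
#   check_capacity_provider MyCluster my-managed-instances-cp
```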

Infrastructure role permission errors

Common cause

This error occurs when the Amazon ECS infrastructure role lacks the necessary permissions to perform Amazon EC2 operations on your behalf, or when the role cannot be assumed due to trust relationship issues.

Resolution

  1. Verify that your infrastructure role has the proper trust relationship with Amazon ECS.

  2. Ensure the role has the required permissions, including the Amazon EC2 actions ec2:RunInstances and ec2:DescribeInstances and the IAM action iam:PassRole.

  3. Check the encoded authorization failure message in CloudTrail for specific permission details. You can decode the message with the aws sts decode-authorization-message command.

  4. Update the role policy to include missing permissions identified in the error message.
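
The following sketch shows one way to inspect the role and decode a failure message from the command line. The three function names are hypothetical helpers, and the role name in the examples is a placeholder. Decoding requires that the caller is allowed to use sts:DecodeAuthorizationMessage.

```shell
# Hypothetical helper: print the trust policy of the infrastructure role.
#   $1 = role name
show_role_trust() {
  aws iam get-role \
    --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument' \
    --output json
}

# Hypothetical helper: list the managed policies attached to the role.
#   $1 = role name
list_role_policies() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}

# Hypothetical helper: decode an encoded authorization failure message.
#   $1 = encoded message copied from the error or the CloudTrail event
decode_auth_failure() {
  aws sts decode-authorization-message \
    --encoded-message "$1" \
    --query 'DecodedMessage' \
    --output text
}

# Examples (placeholder values):
#   show_role_trust ecsInfrastructureRole
#   list_role_policies ecsInfrastructureRole
#   decode_auth_failure "<encoded-message>"
```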

VcpuLimitExceeded error

Common cause

This error occurs when you've reached your vCPU service quota for the instance type family in the current Region. Amazon ECS Managed Instances cannot launch additional instances until your vCPU usage drops below the quota or the quota is increased.

Resolution

  1. Request a service quota increase for the affected instance type family through the Amazon Support Center.

  2. Consider using different instance types that fall under a different vCPU quota category.

  3. Terminate unused Amazon EC2 instances to free up vCPU capacity.

  4. Review your capacity provider configuration to use instance types with lower vCPU requirements.
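
To see which vCPU quota applies before you request an increase, you could list the current On-Demand quotas for Amazon EC2. This is a sketch that assumes the Service Quotas API is available in your Region; the function name list_ec2_vcpu_quotas is a hypothetical helper, and the filter simply matches quota names that contain "On-Demand".

```shell
# Hypothetical helper: list Amazon EC2 On-Demand quotas (measured in vCPUs)
# with their quota codes and current values for the configured Region.
list_ec2_vcpu_quotas() {
  aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaCode, Value, QuotaName]" \
    --output text
}

# Example:
#   list_ec2_vcpu_quotas
```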

InsufficientCapacity and related capacity errors

Common cause

These errors occur when Amazon EC2 doesn't have sufficient capacity to fulfill your instance request. This can include insufficient instance capacity, address capacity, or volume capacity in the requested Availability Zone.

Resolution

  1. Try launching instances in different Availability Zones by configuring multiple subnets in your capacity provider.

  2. Consider using different instance types that may have more available capacity.

  3. Wait and retry the operation, because capacity availability changes frequently.

  4. For persistent capacity needs, consider using Reserved Instances or Savings Plans.
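
When choosing alternative Availability Zones or instance types, it can help to check where a given instance type is offered at all. The sketch below does that; the function name list_azs_offering_type is a hypothetical helper. The result only shows where the type is offered and says nothing about how much capacity is available at the moment.

```shell
# Hypothetical helper: list the Availability Zones in the configured Region
# that offer the given instance type.
#   $1 = instance type, for example m5.large
list_azs_offering_type() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Example:
#   list_azs_offering_type m5.large
```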

UnauthorizedOperation error

Common cause

This error occurs when the Amazon ECS service doesn't have the necessary permissions to perform Amazon EC2 operations or pass IAM roles. Common scenarios include missing ec2:RunInstances permissions or iam:PassRole permissions for the instance profile.

Resolution

  1. Verify that your Amazon ECS infrastructure role has the necessary permissions to launch Amazon EC2 instances.

  2. Ensure the infrastructure role has iam:PassRole permissions for the instance profile used by your Amazon ECS Managed Instances.

  3. Check the encoded authorization failure message in CloudTrail for specific permission details.

  4. Update the role policy to include the missing permissions identified in the error message.
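
As a quick first check before editing policies, the IAM policy simulator can report whether the role's policies allow the two actions named above. This is a sketch with a hypothetical function name, simulate_infra_role. The simulation evaluates the actions against all resources and without request context, so a result of allowed does not guarantee that a real call with specific resources or conditions succeeds.

```shell
# Hypothetical helper: simulate ec2:RunInstances and iam:PassRole for a role
# and print one "action decision" pair per line.
#   $1 = ARN of the Amazon ECS infrastructure role
simulate_infra_role() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names ec2:RunInstances iam:PassRole \
    --query 'EvaluationResults[].[EvalActionName, EvalDecision]' \
    --output text
}

# Example (placeholder ARN):
#   simulate_infra_role arn:aws:iam::111122223333:role/ecsInfrastructureRole
```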

Task timed out waiting for capacity

Common cause

This error occurs when instances take longer than expected to launch and register with the cluster. This can happen due to Amazon EC2 capacity constraints, instance launch failures, or network connectivity issues.

Resolution

  1. Check Amazon EC2 service health in your Region for any ongoing issues.

  2. Verify that your subnets have sufficient IP addresses available.

  3. Ensure your security groups allow the necessary traffic for Amazon ECS agent communication.

  4. Consider using multiple Availability Zones to improve capacity availability.

  5. Retry the task launch operation, because capacity constraints are often temporary.
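
The subnet check in the steps above could be sketched as follows. The function name show_subnet_free_ips is a hypothetical helper, and the subnet IDs in the example are placeholders.

```shell
# Hypothetical helper: print subnet ID, Availability Zone, and the number of
# free IP addresses for each subnet passed as an argument.
#   $@ = one or more subnet IDs
show_subnet_free_ips() {
  aws ec2 describe-subnets \
    --subnet-ids "$@" \
    --query 'Subnets[].[SubnetId, AvailabilityZone, AvailableIpAddressCount]' \
    --output text
}

# Example (placeholder IDs):
#   show_subnet_free_ips subnet-0123456789abcdef0 subnet-0fedcba9876543210
```

A subnet that reports very few free addresses is a likely reason for instances or tasks failing to obtain a network interface.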

Network configuration errors

Common cause

These errors occur when there are mismatches between your task's network requirements and the capacity provider's network configuration, such as VPC mismatches or missing network configuration.

Resolution

  1. Verify that your capacity provider is configured with the correct VPC and subnets.

  2. Ensure that security groups and subnets belong to the same VPC.

  3. Check that your task definition's network configuration is compatible with the capacity provider.

  4. Update your capacity provider configuration with the correct network settings.
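
To confirm that a subnet and a security group belong to the same VPC, you could compare their VPC IDs. The sketch below uses a hypothetical helper, same_vpc, which prints both VPC IDs and returns a non-zero exit status when they differ.

```shell
# Hypothetical helper: compare the VPC of a subnet with the VPC of a
# security group. Returns 0 when they match and non-zero otherwise.
#   $1 = subnet ID
#   $2 = security group ID
same_vpc() {
  local subnet_vpc
  local sg_vpc

  subnet_vpc="$(aws ec2 describe-subnets \
    --subnet-ids "$1" \
    --query 'Subnets[0].VpcId' \
    --output text)"

  sg_vpc="$(aws ec2 describe-security-groups \
    --group-ids "$2" \
    --query 'SecurityGroups[0].VpcId' \
    --output text)"

  echo "subnet $1 is in $subnet_vpc; security group $2 is in $sg_vpc"
  [ "$subnet_vpc" = "$sg_vpc" ]
}

# Example (placeholder IDs):
#   same_vpc subnet-0123456789abcdef0 sg-0123456789abcdef0 || echo "VPC mismatch"
```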

Capacity provider can't be deleted due to stuck instances

Common cause

This error occurs when Amazon ECS Managed Instances remain stuck in the ACTIVE or DRAINING state even though no tasks are running on them.

Resolution

To allow the deletion of the capacity provider to proceed, you can force deregister the instances that are stuck using the following command.

aws ecs deregister-container-instance \
    --cluster arn:aws:ecs:us-east-1:111122223333:cluster/MyCluster \
    --container-instance arn:aws:ecs:us-east-1:111122223333:container-instance/a1b2c3d4-5678-90ab-cdef-11111EXAMPLE \
    --force
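
Before forcing deregistration, it may be worth confirming which instances are actually idle. The following sketch lists container instances in a given state that report no running or pending tasks. The function name list_idle_instances is a hypothetical helper, and it assumes that the stuck instances are visible through list-container-instances.

```shell
# Hypothetical helper: print the ARN and status of container instances that
# have no running or pending tasks.
#   $1 = cluster name or ARN
#   $2 = status to filter on (defaults to DRAINING)
list_idle_instances() {
  local cluster="$1"
  local status="${2:-DRAINING}"
  local arns

  arns="$(aws ecs list-container-instances \
    --cluster "$cluster" \
    --status "$status" \
    --query 'containerInstanceArns[]' \
    --output text)"

  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "No $status container instances found in $cluster"
    return 0
  fi

  # $arns is intentionally unquoted so that each ARN becomes its own argument.
  # shellcheck disable=SC2086
  aws ecs describe-container-instances \
    --cluster "$cluster" \
    --container-instances $arns \
    --query 'containerInstances[?runningTasksCount==`0` && pendingTasksCount==`0`].[containerInstanceArn, status]' \
    --output text
}

# Example (placeholder cluster name):
#   list_idle_instances MyCluster DRAINING
#   list_idle_instances MyCluster ACTIVE
```

Each ARN printed by the helper can then be passed to the deregister-container-instance command shown above.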