Troubleshooting local clusters for Amazon EKS on Amazon Outposts
This topic covers some common errors that you might see while using local clusters and how to troubleshoot them. Local clusters are similar to Amazon EKS clusters in the cloud, but there are some differences in how they're managed by Amazon EKS.
Local clusters are created through the Amazon EKS API, but are run in an asynchronous manner. This means that requests to the Amazon EKS API return immediately for local clusters. However, these requests might succeed, fail fast because of input validation errors, or fail with descriptive validation errors. This behavior is similar to the Kubernetes API.
Local clusters don't transition to a FAILED status. Amazon EKS continuously attempts to reconcile the cluster state with the user-requested desired state. As a result, a local cluster might remain in the CREATING state for an extended period of time until the underlying issue is resolved.
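Because local clusters never transition to FAILED, automation that waits for cluster creation needs its own timeout. The following is a minimal Python sketch of such a poll loop; the fetch_status callable, the function name, and the timeout values are illustrative assumptions, not part of the Amazon EKS API.

```python
import time
from typing import Callable

def wait_for_active(fetch_status: Callable[[], str],
                    timeout_s: float = 1800, interval_s: float = 30) -> str:
    """Poll cluster status until it leaves CREATING or the timeout elapses.

    `fetch_status` is a caller-supplied function (for example, one wrapping
    `aws eks describe-cluster` or boto3's describe_cluster) that returns the
    current status string, such as "CREATING" or "ACTIVE".
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "CREATING":
            return status
        time.sleep(interval_s)
    # Local clusters never move to FAILED, so a long-running CREATING
    # state means the underlying issue still needs to be resolved.
    return "CREATING"

# Example with a stubbed fetcher that becomes ACTIVE on the third poll:
responses = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_for_active(lambda: next(responses), timeout_s=5, interval_s=0))  # ACTIVE
```

In real use, you would pass a fetcher that calls the Amazon EKS API and inspect cluster.health when the timeout is hit.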
Local cluster issues can be discovered using the describe-cluster Amazon EKS Amazon CLI command. Local cluster issues are surfaced by the cluster.health field of the describe-cluster command's response. The message contained in this field includes an error code, a descriptive message, and related resource IDs. This information is available through the Amazon EKS API and Amazon CLI only. In the following example, replace my-cluster with the name of your local cluster.

aws eks describe-cluster --name my-cluster --query 'cluster.health'
An example output is as follows.
{
    "issues": [
        {
            "code": "ConfigurationConflict",
            "message": "The instance type 'm5.large' is not supported in Outpost 'my-outpost-arn'.",
            "resourceIds": [
                "my-cluster-arn"
            ]
        }
    ]
}
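If you script around this output, a small helper can flatten each issue into a readable line. The following Python sketch is illustrative; the summarize_health_issues function and the sample payload are assumptions modeled on the example response above.

```python
def summarize_health_issues(health: dict) -> list:
    """Flatten the cluster.health 'issues' list into readable one-liners."""
    lines = []
    for issue in health.get("issues", []):
        resources = ", ".join(issue.get("resourceIds", [])) or "none"
        lines.append(f"[{issue['code']}] {issue['message']} (resources: {resources})")
    return lines

# Sample payload shaped like the example output above:
health = {
    "issues": [
        {
            "code": "ConfigurationConflict",
            "message": "The instance type 'm5.large' is not supported in Outpost 'my-outpost-arn'.",
            "resourceIds": ["my-cluster-arn"],
        }
    ]
}
for line in summarize_health_issues(health):
    print(line)
```

You could feed this function the parsed output of the describe-cluster command shown earlier.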
If the problem is beyond repair, you might need to delete the local cluster and create a new one. For example, this can happen if you try to provision a cluster with an instance type that's not available on your Outpost. The following table includes common health-related errors.
| Error scenario | Code | Message | ResourceIds |
|---|---|---|---|
| Provided subnets couldn't be found. | | | All provided subnet IDs |
| Provided subnets don't belong to the same VPC. | | | All provided subnet IDs |
| Some provided subnets don't belong to the specified Outpost. | | | Problematic subnet ID |
| Some provided subnets don't belong to any Outpost. | | | Problematic subnet ID |
| Some provided subnets don't have enough free addresses to create elastic network interfaces for control plane instances. | | | Problematic subnet ID |
| The specified control plane instance type isn't supported on your Outpost. | | | Cluster ARN |
| You terminated a control plane Amazon EC2 instance, or run-instance succeeded but the observed state changed to Terminated. This can happen for a period of time after your Outpost reconnects, when Amazon EBS internal errors cause an Amazon EC2 internal workflow to fail. | | | Cluster ARN |
| You have insufficient capacity on your Outpost. This can also happen when a cluster is being created if an Outpost is disconnected from the Amazon Web Services Region. | | | Cluster ARN |
| Your account exceeded your security group quota. | | Error message returned by Amazon EC2 API | Target VPC ID |
| Your account exceeded your elastic network interface quota. | | Error message returned by Amazon EC2 API | Target subnet ID |
| Control plane instances weren't reachable through Amazon Systems Manager. For resolution, see Control plane instances aren't reachable through Amazon Systems Manager. | | Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation. | Amazon EC2 instance IDs |
| An error occurred while getting details for a managed security group or elastic network interface. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | All managed security group IDs |
| An error occurred while authorizing or revoking security group ingress rules. This applies to both the cluster and control plane security groups. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic security group ID |
| An error occurred while deleting an elastic network interface for a control plane instance. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic elastic network interface ID |
The following table lists errors from other Amazon Web Services that are presented in the health field of the describe-cluster response.
| Amazon EC2 error code | Cluster health issue code | Description |
|---|---|---|
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service-linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these Amazon resources. |
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service-linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these Amazon resources. |
| | | This error occurs when the subnet ID for the ingress rules of a security group can't be found. |
| | | This error occurs when the permissions for the ingress rules of a security group aren't correct. |
| | | This error occurs when the group of the ingress rules of a security group can't be found. |
| | | This error occurs when the network interface ID for the ingress rules of a security group can't be found. |
| | | This error occurs when the subnet resource quota is exceeded. |
| | | This error occurs when the Outpost capacity quota is exceeded. |
| | | This error occurs when the elastic network interface quota is exceeded. |
| | | This error occurs when the security group quota is exceeded. |
| | | This is observed when creating an Amazon EC2 instance in a new account. The error might be similar to the following: "You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit." |
| | | Amazon EC2 returns this error code if the specified instance type isn't supported on the Outpost. |
| All other failures | | None |
Local clusters require different permissions and policies than Amazon EKS clusters that are hosted in the cloud. When a cluster fails to create and produces an InvalidPermissions error, double-check that the cluster role that you're using has the AmazonEKSLocalOutpostClusterPolicy managed policy attached to it. All other API calls require the same set of permissions as Amazon EKS clusters in the cloud.
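One way to verify the attachment is to inspect a parsed aws iam list-attached-role-policies response for the policy name. The following Python sketch is illustrative; the function name and sample data are assumptions.

```python
def has_local_outpost_policy(attached_policies: list) -> bool:
    """Check a parsed `aws iam list-attached-role-policies` response
    (its "AttachedPolicies" list) for AmazonEKSLocalOutpostClusterPolicy."""
    return any(
        p.get("PolicyName") == "AmazonEKSLocalOutpostClusterPolicy"
        for p in attached_policies
    )

# Sample "AttachedPolicies" list from the IAM response:
policies = [
    {
        "PolicyName": "AmazonEKSLocalOutpostClusterPolicy",
        "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKSLocalOutpostClusterPolicy",
    }
]
print(has_local_outpost_policy(policies))  # True
```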
The amount of time it takes to create a local cluster varies depending on several factors. These factors include your network configuration, Outpost configuration, and the cluster's configuration. In general, a local cluster is created and changes to the ACTIVE status within 15–20 minutes. If a local cluster remains in the CREATING state, you can call describe-cluster for information about the cause in the cluster.health output field.
The most common issues are the following:
Amazon Systems Manager (Systems Manager) encounters the following issues:

- Your cluster can't connect to the control plane instance from the Amazon Web Services Region that Systems Manager is in. You can verify this by calling aws ssm start-session --target instance-id from an in-Region bastion host. If that command doesn't work, check whether Systems Manager is running on the control plane instance. Alternatively, you can delete the cluster and then recreate it.
- The control plane instances might not have internet access. Check whether the subnet that you provided when you created the cluster has a NAT gateway and a VPC with an internet gateway. Use VPC Reachability Analyzer to verify that the control plane instance can reach the internet gateway. For more information, see Getting started with VPC Reachability Analyzer.
- The role ARN that you provided is missing policies. Check whether the Amazon managed policy AmazonEKSLocalOutpostClusterPolicy was removed from the role. This can also occur if an Amazon CloudFormation stack is misconfigured.
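To check reachability in bulk, you can compare your control plane instance IDs against the managed instances that Systems Manager reports as Online in an aws ssm describe-instance-information response. The following Python sketch is illustrative; the function name and sample data are assumptions.

```python
def unreachable_instances(instance_info: list, instance_ids: set) -> list:
    """Return control plane instance IDs that SSM doesn't report as Online.

    `instance_info` is the "InstanceInformationList" from a parsed
    `aws ssm describe-instance-information` response.
    """
    online = {
        i["InstanceId"] for i in instance_info if i.get("PingStatus") == "Online"
    }
    return sorted(instance_ids - online)

# Sample response: one instance is Online, one is missing entirely.
info = [{"InstanceId": "i-0abc", "PingStatus": "Online"}]
print(unreachable_instances(info, {"i-0abc", "i-0def"}))  # ['i-0def']
```

Any IDs this returns are candidates for the SSM troubleshooting steps above.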
Multiple subnets are misconfigured and specified when a cluster is created:

- All the provided subnets must be associated with the same Outpost and must be able to reach each other. When multiple subnets are specified, Amazon EKS attempts to spread the control plane instances across them.
- The Amazon EKS managed security groups are applied at the elastic network interface. However, other configuration elements, such as NACL firewall rules, might conflict with the rules for the elastic network interface.

VPC and subnet DNS configuration is misconfigured or missing. Review Amazon EKS local cluster VPC and subnet requirements and considerations.
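One quick sanity check for the subnet requirement is to confirm that every subnet in a parsed aws ec2 describe-subnets response carries the same OutpostArn. The following Python sketch is illustrative; the function name and sample data are assumptions.

```python
def subnets_on_same_outpost(subnets: list) -> bool:
    """Verify every subnet has the same, non-empty OutpostArn.

    `subnets` is the "Subnets" list from a parsed
    `aws ec2 describe-subnets` response. Subnets without an
    OutpostArn (in-Region subnets) fail the check.
    """
    arns = {s.get("OutpostArn") for s in subnets}
    return len(arns) == 1 and None not in arns

# Sample response: both subnets on the same (hypothetical) Outpost.
subnets = [
    {"SubnetId": "subnet-1", "OutpostArn": "arn:aws:outposts:...:outpost/op-123"},
    {"SubnetId": "subnet-2", "OutpostArn": "arn:aws:outposts:...:outpost/op-123"},
]
print(subnets_on_same_outpost(subnets))  # True
```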
The most common causes of nodes failing to join the cluster are the following:

- AMI issues:
  - You're using an unsupported AMI. You must use v20220620 or later of the Amazon EKS optimized Amazon Linux AMIs.
  - If you used an Amazon CloudFormation template to create your nodes, make sure it isn't using an unsupported AMI.
- Missing the Amazon IAM Authenticator ConfigMap – If it's missing, you must create it. For more information, see Apply the aws-auth ConfigMap to your cluster.
- The wrong security group is used – Make sure to use eks-cluster-sg-cluster-name-uniqueid for your worker nodes' security group. The selected security group is changed by Amazon CloudFormation to allow a new security group each time the stack is used.
- Following unexpected private link VPC steps – Wrong CA data (--b64-cluster-ca) or API endpoint (--apiserver-endpoint) is passed.
- Misconfigured Pod security policy:
  - The CoreDNS and Amazon VPC CNI plugin for Kubernetes DaemonSets must run on nodes for nodes to join and communicate with the cluster.
  - The Amazon VPC CNI plugin for Kubernetes requires some privileged networking features to work properly. You can view the privileged networking features with the following command: kubectl describe psp eks.privileged
  - We don't recommend modifying the default pod security policy. For more information, see Pod security policy.
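To catch the wrong-security-group case programmatically, you can check a node security group's name against the expected eks-cluster-sg-cluster-name-uniqueid shape. The following Python sketch is illustrative; the function name is hypothetical, and it assumes the unique ID is an alphanumeric suffix.

```python
import re

def is_cluster_security_group(sg_name: str, cluster_name: str) -> bool:
    """Check whether a security group name matches the
    eks-cluster-sg-<cluster-name>-<uniqueid> pattern.

    Assumes the unique ID is a run of word characters; adjust the
    pattern if your generated suffix differs.
    """
    pattern = rf"^eks-cluster-sg-{re.escape(cluster_name)}-\w+$"
    return re.match(pattern, sg_name) is not None

print(is_cluster_security_group("eks-cluster-sg-my-cluster-1234567890", "my-cluster"))  # True
print(is_cluster_security_group("my-custom-node-sg", "my-cluster"))  # False
```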
When an Outpost gets disconnected from the Amazon Web Services Region that it's associated with, the Kubernetes cluster will likely continue working normally. However, if the cluster doesn't work properly, follow the troubleshooting steps in Preparing for network disconnects. If you encounter other issues, contact Amazon Web Services Support. Amazon Web Services Support can guide you on downloading and running a log collection tool so that you can collect logs from your Kubernetes cluster control plane instances and send them to Amazon Web Services Support for further investigation.
When the Amazon EKS control plane instances aren't reachable through Amazon Systems Manager (Systems Manager), Amazon EKS displays the following error for your cluster.
Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation.
To resolve this issue, make sure that your VPC and subnets meet the requirements in Amazon EKS local cluster VPC and subnet requirements and considerations and that you completed the steps in Setting up Session Manager in the Amazon Systems Manager User Guide.