Troubleshooting local clusters for Amazon EKS on Amazon Outposts
This topic covers some common errors that you might see while using local clusters and how to troubleshoot them. Local clusters are similar to Amazon EKS clusters in the cloud, but there are some differences in how they're managed by Amazon EKS.
Local clusters are created through the Amazon EKS API, but are run in an asynchronous manner. This means that requests to the Amazon EKS API return immediately for local clusters. However, these requests might succeed, fail fast because of input validation errors, or fail with descriptive validation errors. This behavior is similar to the Kubernetes API.
Local clusters don't transition to a FAILED status. Amazon EKS continuously attempts to reconcile the cluster state with the user-requested desired state. As a result, a local cluster might remain in the CREATING state for an extended period of time until the underlying issue is resolved.
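Because local clusters never transition to FAILED, automation that waits for cluster creation needs its own timeout. The following is a minimal Python sketch of such a poll loop; the fetch_status callable, the function name, and the timeout values are illustrative assumptions, not part of the Amazon EKS API.

```python
import time
from typing import Callable

def wait_for_active(fetch_status: Callable[[], str],
                    timeout_s: float = 1800, interval_s: float = 30) -> str:
    """Poll cluster status until it leaves CREATING or the timeout elapses.

    `fetch_status` is a caller-supplied function (for example, one wrapping
    `aws eks describe-cluster` or boto3's describe_cluster) that returns the
    current status string, such as "CREATING" or "ACTIVE".
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "CREATING":
            return status
        time.sleep(interval_s)
    # Local clusters never move to FAILED, so a long-running CREATING
    # state means the underlying issue still needs to be resolved.
    return "CREATING"

# Example with a stubbed fetcher that becomes ACTIVE on the third poll:
responses = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_for_active(lambda: next(responses), timeout_s=5, interval_s=0))  # ACTIVE
```

In real use, you would pass a fetcher that calls the Amazon EKS API and inspect cluster.health when the timeout is hit.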
Local cluster issues can be discovered using the describe-cluster Amazon EKS Amazon CLI command. Local cluster issues are surfaced by the cluster.health field of the describe-cluster command's response. The message contained in this field includes an error code, a descriptive message, and related resource IDs. This information is available through the Amazon EKS API and Amazon CLI only. In the following example, replace my-cluster with the name of your local cluster.

aws eks describe-cluster --name my-cluster --query 'cluster.health'
An example output is as follows.
{
    "issues": [
        {
            "code": "ConfigurationConflict",
            "message": "The instance type 'm5.large' is not supported in Outpost 'my-outpost-arn'.",
            "resourceIds": [
                "my-cluster-arn"
            ]
        }
    ]
}
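If you script around this output, a small helper can flatten each issue into a readable line. The following Python sketch is illustrative; the summarize_health_issues function and the sample payload are assumptions modeled on the example response above.

```python
def summarize_health_issues(health: dict) -> list:
    """Flatten the cluster.health 'issues' list into readable one-liners."""
    lines = []
    for issue in health.get("issues", []):
        resources = ", ".join(issue.get("resourceIds", [])) or "none"
        lines.append(f"[{issue['code']}] {issue['message']} (resources: {resources})")
    return lines

# Sample payload shaped like the example output above:
health = {
    "issues": [
        {
            "code": "ConfigurationConflict",
            "message": "The instance type 'm5.large' is not supported in Outpost 'my-outpost-arn'.",
            "resourceIds": ["my-cluster-arn"],
        }
    ]
}
for line in summarize_health_issues(health):
    print(line)
```

You could feed this function the parsed output of the describe-cluster command shown earlier.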
If the problem is beyond repair, you might need to delete the local cluster and create a new one. For example, this can happen if you try to provision a cluster with an instance type that's not available on your Outpost. The following table includes common health-related errors.
| Error scenario | Code | Message | ResourceIds |
|---|---|---|---|
| Provided subnets couldn't be found. | | | All provided subnet IDs |
| Provided subnets don't belong to the same VPC. | | | All provided subnet IDs |
| Some provided subnets don't belong to the specified Outpost. | | | Problematic subnet ID |
| Some provided subnets don't belong to any Outpost. | | | Problematic subnet ID |
| Some provided subnets don't have enough free addresses to create elastic network interfaces for control plane instances. | | | Problematic subnet ID |
| The specified control plane instance type isn't supported on your Outpost. | | | Cluster ARN |
| You terminated a control plane Amazon EC2 instance, or run-instance succeeded but the observed state changed to Terminated. This can happen for a period of time after your Outpost reconnects, when Amazon EBS internal errors cause an Amazon EC2 internal workflow to fail. | | | Cluster ARN |
| You have insufficient capacity on your Outpost. This can also happen when a cluster is being created if an Outpost is disconnected from the Amazon Web Services Region. | | | Cluster ARN |
| Your account exceeded your security group quota. | | Error message returned by Amazon EC2 API | Target VPC ID |
| Your account exceeded your elastic network interface quota. | | Error message returned by Amazon EC2 API | Target subnet ID |
| Control plane instances weren't reachable through Amazon Systems Manager. For resolution, see Control plane instances aren't reachable through Amazon Systems Manager. | | Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation. | Amazon EC2 instance IDs |
| An error occurred while getting details for a managed security group or elastic network interface. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | All managed security group IDs |
| An error occurred while authorizing or revoking security group ingress rules. This applies to both the cluster and control plane security groups. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic security group ID |
| An error occurred while deleting an elastic network interface for a control plane instance. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic elastic network interface ID |
The following table lists errors from other Amazon Web Services that are presented in the health field of the describe-cluster response.
| Amazon EC2 error code | Cluster health issue code | Description |
|---|---|---|
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service-linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these Amazon resources. |
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service-linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these Amazon resources. |
| | | This error occurs when the subnet ID for the ingress rules of a security group can't be found. |
| | | This error occurs when the permissions for the ingress rules of a security group aren't correct. |
| | | This error occurs when the group of the ingress rules of a security group can't be found. |
| | | This error occurs when the network interface ID for the ingress rules of a security group can't be found. |
| | | This error occurs when the subnet resource quota is exceeded. |
| | | This error occurs when the Outpost capacity quota is exceeded. |
| | | This error occurs when the elastic network interface quota is exceeded. |
| | | This error occurs when the security group quota is exceeded. |
| | | This is observed when creating an Amazon EC2 instance in a new account. The error might be similar to the following: "You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit." |
| | | Amazon EC2 returns this error code if the specified instance type isn't supported on the Outpost. |
| All other failures | | None |
Local clusters require different permissions and policies than Amazon EKS clusters that are hosted in the cloud. When a cluster fails to create and produces an InvalidPermissions error, double-check that the cluster role that you're using has the AmazonEKSLocalOutpostClusterPolicy managed policy attached to it. All other API calls require the same set of permissions as Amazon EKS clusters in the cloud.
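One way to verify the attachment is to inspect a parsed aws iam list-attached-role-policies response for the policy name. The following Python sketch is illustrative; the function name and sample data are assumptions.

```python
def has_local_outpost_policy(attached_policies: list) -> bool:
    """Check a parsed `aws iam list-attached-role-policies` response
    (its "AttachedPolicies" list) for AmazonEKSLocalOutpostClusterPolicy."""
    return any(
        p.get("PolicyName") == "AmazonEKSLocalOutpostClusterPolicy"
        for p in attached_policies
    )

# Sample "AttachedPolicies" list from the IAM response:
policies = [
    {
        "PolicyName": "AmazonEKSLocalOutpostClusterPolicy",
        "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKSLocalOutpostClusterPolicy",
    }
]
print(has_local_outpost_policy(policies))  # True
```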
The amount of time it takes to create a local cluster varies depending on several factors. These factors include your network configuration, Outpost configuration, and the cluster's configuration. In general, a local cluster is created and changes to the ACTIVE status within 15–20 minutes. If a local cluster remains in the CREATING state, you can call describe-cluster for information about the cause in the cluster.health output field.
The most common issues are the following:
Amazon Systems Manager (Systems Manager) encounters the following issues:

- Your cluster can't connect to the control plane instance from the Amazon Web Services Region that Systems Manager is in. You can verify this by calling aws ssm start-session --target instance-id from an in-Region bastion host. If that command doesn't work, check whether Systems Manager is running on the control plane instance. Alternatively, you can delete the cluster and then recreate it.
- The control plane instances might not have internet access. Check whether the subnet that you provided when you created the cluster has a NAT gateway and a VPC with an internet gateway. Use VPC Reachability Analyzer to verify that the control plane instance can reach the internet gateway. For more information, see Getting started with VPC Reachability Analyzer.
- The role ARN that you provided is missing policies. Check whether the Amazon managed policy AmazonEKSLocalOutpostClusterPolicy was removed from the role. This can also occur if an Amazon CloudFormation stack is misconfigured.
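To check reachability in bulk, you can compare your control plane instance IDs against the managed instances that Systems Manager reports as Online in an aws ssm describe-instance-information response. The following Python sketch is illustrative; the function name and sample data are assumptions.

```python
def unreachable_instances(instance_info: list, instance_ids: set) -> list:
    """Return control plane instance IDs that SSM doesn't report as Online.

    `instance_info` is the "InstanceInformationList" from a parsed
    `aws ssm describe-instance-information` response.
    """
    online = {
        i["InstanceId"] for i in instance_info if i.get("PingStatus") == "Online"
    }
    return sorted(instance_ids - online)

# Sample response: one instance is Online, one is missing entirely.
info = [{"InstanceId": "i-0abc", "PingStatus": "Online"}]
print(unreachable_instances(info, {"i-0abc", "i-0def"}))  # ['i-0def']
```

Any IDs this returns are candidates for the SSM troubleshooting steps above.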
Multiple subnets are misconfigured and specified when a cluster is created:

- All the provided subnets must be associated with the same Outpost and must be able to reach each other. When multiple subnets are specified, Amazon EKS attempts to spread the control plane instances across them.
- The Amazon EKS managed security groups are applied at the elastic network interface. However, other configuration elements, such as NACL firewall rules, might conflict with the rules for the elastic network interface.

VPC and subnet DNS configuration is misconfigured or missing. Review Amazon EKS local cluster VPC and subnet requirements and considerations.
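One quick sanity check for the subnet requirement is to confirm that every subnet in a parsed aws ec2 describe-subnets response carries the same OutpostArn. The following Python sketch is illustrative; the function name and sample data are assumptions.

```python
def subnets_on_same_outpost(subnets: list) -> bool:
    """Verify every subnet has the same, non-empty OutpostArn.

    `subnets` is the "Subnets" list from a parsed
    `aws ec2 describe-subnets` response. Subnets without an
    OutpostArn (in-Region subnets) fail the check.
    """
    arns = {s.get("OutpostArn") for s in subnets}
    return len(arns) == 1 and None not in arns

# Sample response: both subnets on the same (hypothetical) Outpost.
subnets = [
    {"SubnetId": "subnet-1", "OutpostArn": "arn:aws:outposts:...:outpost/op-123"},
    {"SubnetId": "subnet-2", "OutpostArn": "arn:aws:outposts:...:outpost/op-123"},
]
print(subnets_on_same_outpost(subnets))  # True
```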
The most common causes of nodes failing to join the cluster are the following:

- AMI issues:
  - You're using an unsupported AMI. You must use v20220620 or later of the Amazon EKS optimized Amazon Linux AMIs.
  - If you used an Amazon CloudFormation template to create your nodes, make sure it isn't using an unsupported AMI.
- Missing the Amazon IAM Authenticator ConfigMap – If it's missing, you must create it. For more information, see Apply the aws-auth ConfigMap to your cluster.
- The wrong security group is used – Make sure to use eks-cluster-sg-cluster-name-uniqueid for your worker nodes' security group. The selected security group is changed by Amazon CloudFormation to allow a new security group each time the stack is used.
- Following unexpected private link VPC steps – Wrong CA data (--b64-cluster-ca) or API endpoint (--apiserver-endpoint) is passed.
- Misconfigured Pod security policy:
  - The CoreDNS and Amazon VPC CNI plugin for Kubernetes DaemonSets must run on nodes for nodes to join and communicate with the cluster.
  - The Amazon VPC CNI plugin for Kubernetes requires some privileged networking features to work properly. You can view the privileged networking features with the following command: kubectl describe psp eks.privileged
  - We don't recommend modifying the default pod security policy. For more information, see Pod security policy.
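To catch the wrong-security-group case programmatically, you can check a node security group's name against the expected eks-cluster-sg-cluster-name-uniqueid shape. The following Python sketch is illustrative; the function name is hypothetical, and it assumes the unique ID is an alphanumeric suffix.

```python
import re

def is_cluster_security_group(sg_name: str, cluster_name: str) -> bool:
    """Check whether a security group name matches the
    eks-cluster-sg-<cluster-name>-<uniqueid> pattern.

    Assumes the unique ID is a run of word characters; adjust the
    pattern if your generated suffix differs.
    """
    pattern = rf"^eks-cluster-sg-{re.escape(cluster_name)}-\w+$"
    return re.match(pattern, sg_name) is not None

print(is_cluster_security_group("eks-cluster-sg-my-cluster-1234567890", "my-cluster"))  # True
print(is_cluster_security_group("my-custom-node-sg", "my-cluster"))  # False
```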
When an Outpost gets disconnected from the Amazon Web Services Region that it's associated with, the Kubernetes cluster will likely continue working normally. However, if the cluster doesn't work properly, follow the troubleshooting steps in Preparing for network disconnects. If you encounter other issues, contact Amazon Web Services Support. Amazon Web Services Support can guide you on downloading and running a log collection tool so that you can collect logs from your Kubernetes cluster control plane instances and send them to Amazon Web Services Support for further investigation.
When the Amazon EKS control plane instances aren't reachable through Amazon Systems Manager (Systems Manager), Amazon EKS displays the following error for your cluster.
Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation.
To resolve this issue, make sure that your VPC and subnets meet the requirements in Amazon EKS local cluster VPC and subnet requirements and considerations and that you completed the steps in Setting up Session Manager in the Amazon Systems Manager User Guide.