Troubleshooting hybrid nodes

This topic covers some common errors that you may see while using Amazon EKS Hybrid Nodes and how to fix them. For other troubleshooting information, see Troubleshoot problems with Amazon EKS clusters and nodes and Knowledge Center tag for Amazon EKS on Amazon re:Post. If you cannot resolve the issue, contact Amazon Support.

You can run the nodeadm debug command from your hybrid nodes to validate networking and credential requirements are met. For more information on the nodeadm debug command, see Hybrid nodes nodeadm reference.

Installing hybrid nodes troubleshooting

The troubleshooting topics in this section are related to installing the hybrid nodes dependencies on hosts with the nodeadm install command.

nodeadm command failed must run as root

The nodeadm install command must be run with a user that has root/sudo privileges on your host. If you run nodeadm install with a user that does not have root/sudo privileges, you will see the following error in the nodeadm output.

"msg":"Command failed","error":"must run as root"

Unable to connect to dependencies

The nodeadm install command installs the dependencies required for hybrid nodes. The hybrid nodes dependencies include containerd, kubelet, kubectl, and Amazon SSM or Amazon IAM Roles Anywhere components. You must have access from where you are running nodeadm install to download these dependencies. For more information on the list of locations that you must be able to access, see Prepare networking for hybrid nodes. If you do not have access, you will see errors similar to the following in the nodeadm install output.

"msg":"Command failed","error":"failed reading file from url: ...: max retries achieved for http request"

Failed to update package manager

The nodeadm install command runs apt update or yum/dnf update before installing the hybrid nodes dependencies. If this step does not succeed, you may see errors similar to the following. To remediate, you can run apt update or yum/dnf update before running nodeadm install, or you can re-run nodeadm install.

failed to run update using package manager
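
For example, you can refresh the package metadata manually before retrying, assuming your host has outbound access to its package repositories. Use the command that matches your operating system.

sudo apt update          # Ubuntu
sudo yum update -y       # or: sudo dnf update -y (RHEL, AL2023)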

Timeout or context deadline exceeded

When running nodeadm install, if you see issues at various stages of the install process with errors that indicate a timeout or context deadline exceeded, you may have a slow connection that is preventing the hybrid nodes dependencies from downloading before the timeouts are reached. To work around these issues, you can use the --timeout flag in nodeadm to extend the duration of the timeouts for downloading the dependencies.

nodeadm install K8S_VERSION --credential-provider CREDS_PROVIDER --timeout 20m0s

Connecting hybrid nodes troubleshooting

The troubleshooting topics in this section are related to the process of connecting hybrid nodes to EKS clusters with the nodeadm init command.

Operation errors or unsupported scheme

When running nodeadm init, if you see errors related to operation error or unsupported scheme, check your nodeConfig.yaml to make sure it is properly formatted and passed to nodeadm. For more information on the format and options for nodeConfig.yaml, see Hybrid nodes nodeadm reference.

"msg":"Command failed","error":"operation error ec2imds: GetRegion, request canceled, context deadline exceeded"

Hybrid Nodes IAM role missing permissions for the eks:DescribeCluster action

When running nodeadm init, nodeadm attempts to gather information about your EKS cluster by calling DescribeCluster. If your Hybrid Nodes IAM role does not have permission for the eks:DescribeCluster action, the call fails with an error similar to the following. For more information on the required permissions for the Hybrid Nodes IAM role, see Prepare credentials for hybrid nodes.

"msg":"Command failed","error":"operation error EKS: DescribeCluster, https response error StatusCode: 403 ... AccessDeniedException"

Hybrid nodes are not appearing in EKS cluster

If you ran nodeadm init and it completed but your hybrid nodes do not appear in your cluster, there may be issues with the network connection between your hybrid nodes and the EKS control plane, you may not have the required security group permissions configured, or you may not have the required mapping of your Hybrid Nodes IAM role to Kubernetes Role-Based Access Control (RBAC). You can start the debugging process by checking the status of kubelet and the kubelet logs with the following commands. Run the following commands from the hybrid nodes that failed to join your cluster.

systemctl status kubelet
journalctl -u kubelet -f

Unable to communicate with cluster

If your hybrid node was unable to communicate with the cluster control plane, you may see logs similar to the following.

"Failed to ensure lease exists, will retry" err="Get ..."
"Unable to register node with API server" err="Post ..."
Failed to contact API server when waiting for CSINode publishing ... dial tcp <ip address>: i/o timeout

If you see these messages, check the following to ensure your environment meets the hybrid nodes requirements detailed in Prepare networking for hybrid nodes.

  • Confirm the VPC passed to your EKS cluster has routes for your on-premises node CIDRs and, optionally, pod CIDRs with your Transit Gateway (TGW) or Virtual Private Gateway (VGW) as the target (example check commands follow this list).

  • Confirm you have an additional security group for your EKS cluster with inbound rules for your on-premises node CIDRs and, optionally, pod CIDRs.

  • Confirm your on-premises router is configured to allow traffic to and from the EKS control plane.
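
The following Amazon CLI commands are one way to spot-check the first two items, assuming you know your VPC ID and the ID of the additional security group; both values are placeholders.

aws ec2 describe-route-tables --filters "Name=vpc-id,Values=VPC_ID" --query "RouteTables[].Routes[]"
aws ec2 describe-security-group-rules --filters "Name=group-id,Values=SECURITY_GROUP_ID"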

Unauthorized

If your hybrid node was able to communicate with the EKS control plane but was not able to register, you may see logs similar to the following. Note the key difference in the log messages below is the Unauthorized error. This signals that the node was not able to perform its tasks because it does not have the required permissions.

"Failed to ensure lease exists, will retry" err="Unauthorized"
"Unable to register node with API server" err="Unauthorized"
Failed to contact API server when waiting for CSINode publishing: Unauthorized

If you see these messages, check the following to ensure your configuration meets the hybrid nodes requirements detailed in Prepare credentials for hybrid nodes and Prepare cluster access for hybrid nodes.

  • Confirm the identity of the hybrid nodes matches your expected Hybrid Nodes IAM role. This can be done by running sudo aws sts get-caller-identity from your hybrid nodes.

  • Confirm your Hybrid Nodes IAM role has the required permissions.

  • Confirm that in your cluster you have an EKS access entry for your Hybrid Nodes IAM role or confirm that your aws-auth ConfigMap has an entry for your Hybrid Nodes IAM role. If you are using EKS access entries, confirm your access entry for your Hybrid Nodes IAM role has the HYBRID_LINUX access type. If you are using the aws-auth ConfigMap, confirm your entry for the Hybrid Nodes IAM role meets the requirements and formatting detailed in Prepare cluster access for hybrid nodes.
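
If you are using EKS access entries, the following commands are a hedged example of listing existing access entries and creating one for the Hybrid Nodes IAM role; the cluster name, account ID, and role name are placeholders.

aws eks list-access-entries --cluster-name CLUSTER_NAME
aws eks create-access-entry --cluster-name CLUSTER_NAME \
  --principal-arn arn:aws:iam::AWS_ACCOUNT_ID:role/HYBRID_NODES_ROLE \
  --type HYBRID_LINUX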

Hybrid nodes registered with EKS cluster but show status Not Ready

If your hybrid nodes successfully registered with your EKS cluster but show status Not Ready, the first thing to check is your Container Network Interface (CNI) status. If you have not installed a CNI, it is expected that your hybrid nodes have status Not Ready. Once a CNI is installed and running successfully, nodes transition to status Ready. If you attempted to install a CNI but it is not running successfully, see Hybrid nodes CNI troubleshooting on this page.

Certificate Signing Requests (CSRs) are stuck Pending

After connecting hybrid nodes to your EKS cluster, if you see pending CSRs for your hybrid nodes, your hybrid nodes are not meeting the requirements for automatic approval. CSRs for hybrid nodes are automatically approved and issued if they were created by a node with the eks.amazonaws.com/compute-type: hybrid label and the CSR has the following Subject Alternative Names (SANs): at least one DNS SAN equal to the node name, and IP SANs that belong to the remote node network CIDRs.
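
You can inspect the pending CSRs and their SANs with standard kubectl commands and, if your security posture allows it, approve them manually while you correct the node configuration; the CSR name below is a placeholder.

kubectl get csr
kubectl describe csr CSR_NAME
kubectl certificate approve CSR_NAME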

Hybrid profile already exists

If you changed your nodeadm configuration and attempt to re-register the node with the new configuration, you may see an error stating that the hybrid profile already exists but its contents have changed. Instead of running nodeadm init between configuration changes, run nodeadm uninstall followed by nodeadm install and nodeadm init. This ensures a proper cleanup with the changed configuration.

"msg":"Command failed","error":"hybrid profile already exists at /etc/aws/hybrid/config but its contents do not align with the expected configuration"

Hybrid node failed to resolve Private API

After running nodeadm init, if you see an error in the kubelet logs that shows failures to contact the EKS Kubernetes API server because there is no such host, you may need to change your DNS entry for the EKS Kubernetes API endpoint in your on-premises network or at the host level. See Forwarding inbound DNS queries to your VPC in the Amazon Route 53 documentation.

Failed to contact API server when waiting for CSINode publishing: Get ... no such host
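
A quick way to confirm whether the failure is DNS resolution rather than routing is to resolve the cluster endpoint from the affected node. The cluster name and endpoint hostname below are placeholders; you can retrieve your endpoint with the first command.

aws eks describe-cluster --name CLUSTER_NAME --query "cluster.endpoint" --output text
nslookup CLUSTER_ENDPOINT_HOSTNAME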

Can’t view hybrid nodes in the EKS console

If you have registered your hybrid nodes but are unable to view them in the EKS console, check the permissions of the IAM principal you are using to view the console. The IAM principal you’re using must have specific minimum IAM and Kubernetes permissions to view resources in the console. For more information, see View Kubernetes resources in the Amazon Web Services Management Console.

Running hybrid nodes troubleshooting

If your hybrid nodes registered with your EKS cluster, had status Ready, and then transitioned to status Not Ready, a wide range of issues may have contributed to the unhealthy status, such as the node lacking sufficient CPU, memory, or available disk space, or the node being disconnected from the EKS control plane. You can use the steps below to troubleshoot your nodes, and if you cannot resolve the issue, contact Amazon Support.

Run nodeadm debug from your hybrid nodes to validate networking and credential requirements are met. For more information on the nodeadm debug command, see Hybrid nodes nodeadm reference.

Get node status

kubectl get nodes -o wide

Check node conditions and events

kubectl describe node NODE_NAME

Get pod status

kubectl get pods -A -o wide

Check pod conditions and events

kubectl describe pod POD_NAME

Check pod logs

kubectl logs POD_NAME

Check kubelet status and logs

systemctl status kubelet
journalctl -u kubelet -f

Pod liveness probes failing or webhooks are not working

If applications, add-ons, or webhooks running on your hybrid nodes are not starting properly, you may have networking issues that block communication to the pods. For the EKS control plane to contact webhooks running on hybrid nodes, you must configure your EKS cluster with a remote pod network and have routes for your on-premises pod CIDR in your VPC routing table with the target as your transit gateway (TGW), virtual private gateway (VGW), or other gateway you are using to connect your VPC with your on-premises network. For more information on the networking requirements for hybrid nodes, see Prepare networking for hybrid nodes. You must additionally allow this traffic in your on-premises firewall and ensure your router can properly route to your pods.

A common pod log message for this scenario is shown below, where ip-address is the Cluster IP of the Kubernetes service.

dial tcp <ip-address>:443: connect: no route to host
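
You can confirm which remote node and pod networks your cluster is configured with by describing the cluster. The query path below is a sketch of where this configuration surfaces in the DescribeCluster output, so check the full output if it returns nothing; the cluster name is a placeholder.

aws eks describe-cluster --name CLUSTER_NAME --query "cluster.remoteNetworkConfig"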

Hybrid nodes CNI troubleshooting

If you run into issues with initially starting Cilium or Calico with hybrid nodes, it is most often due to networking issues between the hybrid nodes, or the CNI pods running on them, and the EKS control plane. Make sure your environment meets the requirements in Prepare networking for hybrid nodes. It’s useful to break down the problem into parts.

EKS cluster configuration

Are the RemoteNodeNetwork and RemotePodNetwork configurations correct?

VPC configuration

Are there routes for the RemoteNodeNetwork and RemotePodNetwork in the VPC routing table with the target of the Transit Gateway or Virtual Private Gateway?

Security group configuration

Are there inbound and outbound rules for the RemoteNodeNetwork and RemotePodNetwork?

On-premises network

Are there routes and access to/from the EKS control plane and to/from the hybrid nodes and the pods running on hybrid nodes?

CNI configuration

If you are using an overlay network, does the IP pool configuration for the CNI match the RemotePodNetwork configured for the EKS cluster (required if you are using webhooks)?

Hybrid node has status Ready without a CNI installed

If your hybrid nodes are showing status Ready, but you have not installed a CNI on your cluster, it is possible that there are old CNI artifacts on your hybrid nodes. By default, when you uninstall Cilium and Calico with tools such as Helm, the on-disk resources are not removed from your physical or virtual machines. Additionally, the Custom Resource Definitions (CRDs) for these CNIs may still be present on your cluster from an old installation. For more information, see the Delete Cilium and Delete Calico sections of Configure a CNI for hybrid nodes.
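
A hedged way to check for leftover artifacts is to list the CNI configuration directory on the node and the CNI-related CRDs in the cluster. The path and grep pattern below are common defaults rather than guarantees for every installation.

ls /etc/cni/net.d/
kubectl get crds | grep -iE 'cilium|calico|tigera'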

Cilium troubleshooting

If you are having issues running Cilium on hybrid nodes, see the troubleshooting steps in the Cilium documentation. The sections below cover issues that may be unique to deploying Cilium on hybrid nodes.

Cilium isn’t starting

If the Cilium agents that run on each hybrid node are not starting, check the logs of the Cilium agent pods for errors. The Cilium agent requires connectivity to the EKS Kubernetes API endpoint to start. Cilium agent startup will fail if this connectivity is not correctly configured. In this case, you will see log messages similar to the following in the Cilium agent pod logs.

msg="Unable to contact k8s api-server" level=fatal msg="failed to start: Get \"https://<k8s-cluster-ip>:443/api/v1/namespaces/kube-system\": dial tcp <k8s-cluster-ip>:443: i/o timeout"

The Cilium agent runs on the host network. Your EKS cluster must be configured with a RemoteNodeNetwork for Cilium connectivity. Confirm you have an additional security group for your EKS cluster with an inbound rule for your RemoteNodeNetwork, that you have routes in your VPC for your RemoteNodeNetwork, and that your on-premises network is configured correctly to allow connectivity to the EKS control plane.

If the Cilium operator is running and some of your Cilium agents are running but not all, confirm that you have available pod IPs to allocate for all nodes in your cluster. You configure the size of your allocatable Pod CIDRs when using cluster pool IPAM with clusterPoolIPv4PodCIDRList in your Cilium configuration. The per-node CIDR size is configured with the clusterPoolIPv4MaskSize setting in your Cilium configuration. See Expanding the cluster pool in the Cilium documentation for more information.
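
If you install Cilium with Helm, these settings live under the IPAM operator values. The snippet below is a sketch with a placeholder pod CIDR and an example mask size of 25; align the values with your RemotePodNetwork and the number of nodes you expect.

ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - POD_CIDR          # placeholder: your on-premises pod CIDR
    clusterPoolIPv4MaskSize: 25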

Cilium BGP is not working

If you are using Cilium BGP Control Plane to advertise your pod or service addresses to your on-premises network, you can use the following Cilium CLI commands to check if BGP is advertising the routes to your resources. For steps to install the Cilium CLI, see Install the Cilium CLI in the Cilium documentation.

If BGP is working correctly, you should see your hybrid nodes with Session State established in the output. You may need to work with your networking team to identify the correct values for your environment’s Local AS, Peer AS, and Peer Address.

cilium bgp peers
cilium bgp routes

If you are using Cilium BGP to advertise the IPs of Services with type LoadBalancer, you must have the same label on both your CiliumLoadBalancerIPPool and Service, and that label should be used in the selector of your CiliumBGPAdvertisement. An example is shown below. Note that when you use Cilium BGP to advertise the IPs of Services with type LoadBalancer, the BGP routes may be disrupted during a Cilium agent restart. For more information, see Failure Scenarios in the Cilium documentation.

Service

kind: Service
apiVersion: v1
metadata:
  name: guestbook
  labels:
    app: guestbook
spec:
  ports:
    - port: 3000
      targetPort: http-server
  selector:
    app: guestbook
  type: LoadBalancer

CiliumLoadBalancerIPPool

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: guestbook-pool
  labels:
    app: guestbook
spec:
  blocks:
    - cidr: <CIDR to advertise>
  serviceSelector:
    matchExpressions:
      - { key: app, operator: In, values: [ guestbook ] }

CiliumBGPAdvertisement

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-guestbook
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - ExternalIP
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: app, operator: In, values: [ guestbook ] }

Calico troubleshooting

If you are having issues running Calico on hybrid nodes, see the troubleshooting steps in the Calico documentation. The sections below cover issues that may be unique to deploying Calico on hybrid nodes.

The table below summarizes the Calico components and whether they run on the node or pod network by default. If you configured Calico to use NAT for outgoing pod traffic, your on-premises network must be configured to route traffic to your on-premises node CIDR, and your VPC routing tables must be configured with a route for your on-premises node CIDR with your transit gateway (TGW) or virtual private gateway (VGW) as the target. If you did not configure Calico to use NAT for outgoing pod traffic, your on-premises network must be configured to route traffic to your on-premises pod CIDR, and your VPC routing tables must be configured with a route for your on-premises pod CIDR with your transit gateway (TGW) or virtual private gateway (VGW) as the target.

Component                 Network
Calico API server         Node
Calico kube controllers   Pod
Calico node agent         Node
Calico typha              Node
Calico CSI node driver    Pod
Calico operator           Node

Calico resources are scheduled or running on cordoned nodes

The Calico resources that don’t run as a DaemonSet have flexible tolerations by default that enable them to be scheduled on cordoned nodes that are not ready for scheduling or running pods. You can tighten the tolerations for the non-DaemonSet Calico resources by changing your operator installation to include the following.

installation:
  ...
  controlPlaneTolerations:
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
  calicoKubeControllersDeployment:
    spec:
      template:
        spec:
          tolerations:
            - effect: NoExecute
              key: node.kubernetes.io/unreachable
              operator: Exists
              tolerationSeconds: 300
            - effect: NoExecute
              key: node.kubernetes.io/not-ready
              operator: Exists
              tolerationSeconds: 300
  typhaDeployment:
    spec:
      template:
        spec:
          tolerations:
            - effect: NoExecute
              key: node.kubernetes.io/unreachable
              operator: Exists
              tolerationSeconds: 300
            - effect: NoExecute
              key: node.kubernetes.io/not-ready
              operator: Exists
              tolerationSeconds: 300

Credentials troubleshooting

For both Amazon SSM hybrid activations and Amazon IAM Roles Anywhere, you can validate that credentials for the Hybrid Nodes IAM role are correctly configured on your hybrid nodes by running the following command from your hybrid nodes. Confirm the node name and Hybrid Nodes IAM Role name are what you expect.

sudo aws sts get-caller-identity
{ "UserId": "ABCDEFGHIJKLM12345678910:<node-name>", "Account": "<aws-account-id>", "Arn": "arn:aws:sts::<aws-account-id>:assumed-role/<hybrid-nodes-iam-role/<node-name>" }

Amazon Systems Manager (SSM) troubleshooting

If you are using Amazon SSM hybrid activations for your hybrid nodes credentials, be aware of the following SSM directories and artifacts that are installed on your hybrid nodes by nodeadm. For more information on the SSM agent, see Working with the SSM agent in the Amazon Systems Manager User Guide.

Description          Location
SSM agent            Ubuntu - /snap/amazon-ssm-agent/current/amazon-ssm-agent; RHEL & AL2023 - /usr/bin/amazon-ssm-agent
SSM agent logs       /var/log/amazon/ssm
Amazon credentials   /root/.aws/credentials
SSM Setup CLI        /opt/ssm/ssm-setup-cli

Restarting the SSM agent

Some issues can be resolved by restarting the SSM agent. You can use the commands below to restart it.

AL2023 and other operating systems

systemctl restart amazon-ssm-agent

Ubuntu

systemctl restart snap.amazon-ssm-agent.amazon-ssm-agent

Check connectivity to SSM endpoints

Confirm you can connect to the SSM endpoints from your hybrid nodes. For a list of the SSM endpoints, see Amazon Systems Manager endpoints and quotas. Replace us-west-2 in the command below with the Amazon Web Services Region for your Amazon SSM hybrid activation.

ping ssm.us-west-2.amazonaws.com

View connection status of registered SSM instances

You can check the connection status of the instances that are registered with SSM hybrid activations with the following Amazon CLI command. Replace the machine ID with the machine ID of your instance.

aws ssm get-connection-status --target mi-012345678abcdefgh

SSM Setup CLI checksum mismatch

When running nodeadm install, if you see an ssm-setup-cli checksum mismatch, confirm there are no older existing SSM installations on your host. If there are older SSM installations on your host, remove them and re-run nodeadm install to resolve the issue.

Failed to perform agent-installation/on-prem registration: error while verifying installed ssm-setup-cli checksum: checksum mismatch with latest ssm-setup-cli.

SSM InvalidActivation

If you see an error registering your instance with Amazon SSM, confirm the region, activationCode, and activationId in your nodeConfig.yaml are correct. The Amazon Web Services Region for your EKS cluster must match the region of your SSM hybrid activation. If these values are misconfigured, you may see an error similar to the following.

ERROR Registration failed due to error registering the instance with Amazon SSM. InvalidActivation

SSM ExpiredTokenException: The security token included in the request is expired

If the SSM agent is not able to refresh credentials, you may see an ExpiredTokenException. In this scenario, if you are able to connect to the SSM endpoints from your hybrid nodes, you may need to restart the SSM agent to force a credential refresh.

"msg":"Command failed","error":"operation error SSM: DescribeInstanceInformation, https response error StatusCode: 400, RequestID: eee03a9e-f7cc-470a-9647-73d47e4cf0be, api error ExpiredTokenException: The security token included in the request is expired"

SSM error in running register machine command

If you see an error registering the machine with SSM, you may need to re-run nodeadm install to make sure all of the SSM dependencies are properly installed.

"error":"running register machine command: , error: fork/exec /opt/aws/ssm-setup-cli: no such file or directory"

SSM ActivationExpired

When running nodeadm init, if you see an error registering the instance with SSM due to an expired activation, you need to create a new SSM hybrid activation, update your nodeConfig.yaml with the activationCode and activationId of your new SSM hybrid activation, and re-run nodeadm init.

"msg":"Command failed","error":"SSM activation expired. Please use a valid activation"
ERROR Registration failed due to error registering the instance with Amazon SSM. ActivationExpired
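
You can create a new hybrid activation with the Amazon CLI. The role name, registration limit, and Region below are placeholders; the command prints the new activationCode and activationId to use in your nodeConfig.yaml.

aws ssm create-activation \
  --iam-role HYBRID_NODES_ROLE \
  --registration-limit 10 \
  --region AWS_REGION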

SSM failed to refresh cached credentials

If you see a failure to refresh cached credentials, the /root/.aws/credentials file may have been deleted on your host. First check your SSM hybrid activation and ensure it is active and your hybrid nodes are configured correctly to use the activation. Check the SSM agent logs at /var/log/amazon/ssm and re-run the nodeadm init command once you have resolved the issue on the SSM side.

"Command failed","error":"operation error SSM: DescribeInstanceInformation, get identity: get credentials: failed to refresh cached credentials"

Clean up SSM

To remove the SSM agent from your host, you can run the following commands.

dnf remove -y amazon-ssm-agent
sudo apt remove --purge amazon-ssm-agent
snap remove amazon-ssm-agent
rm -rf /var/lib/amazon/ssm/Vault/Store/RegistrationKey

Amazon IAM Roles Anywhere troubleshooting

If you are using Amazon IAM Roles Anywhere for your hybrid nodes credentials, be aware of the following directories and artifacts that are installed on your hybrid nodes by nodeadm. For more information on troubleshooting IAM Roles Anywhere, see Troubleshooting Amazon IAM Roles Anywhere identity and access in the Amazon IAM Roles Anywhere User Guide.

Description                             Location
IAM Roles Anywhere CLI                  /usr/local/bin/aws_signing_helper
Default certificate location and name   /etc/iam/pki/server.pem
Default key location and name           /etc/iam/pki/server.key

IAM Roles Anywhere failed to refresh cached credentials

If you see a failure to refresh cached credentials, review the contents of /etc/aws/hybrid/config and confirm that IAM Roles Anywhere was configured correctly in your nodeadm configuration. Confirm that /etc/iam/pki exists. Each node must have a unique certificate and key. By default, when using IAM Roles Anywhere as the credential provider, nodeadm uses /etc/iam/pki/server.pem for the certificate location and name, and /etc/iam/pki/server.key for the private key. You may need to create the directories before placing the certificates and keys in the directories with sudo mkdir -p /etc/iam/pki. You can verify the content of your certificate with the command below.

openssl x509 -text -noout -in server.pem
open /etc/iam/pki/server.pem: no such file or directory
could not parse PEM data
Command failed {"error": "... get identity: get credentials: failed to refresh cached credentials, process provider error: error in credential_process: exit status 1"}

IAM Roles Anywhere not authorized to perform sts:AssumeRole

In the kubelet logs, if you see an access denied issue for the sts:AssumeRole operation when using IAM Roles Anywhere, check the trust policy of your Hybrid Nodes IAM role to confirm the IAM Roles Anywhere service principal is allowed to assume the Hybrid Nodes IAM Role. Additionally confirm that the trust anchor ARN is configured properly in your Hybrid Nodes IAM role trust policy and that your Hybrid Nodes IAM role is added to your IAM Roles Anywhere profile.

could not get token: AccessDenied: User: ... is not authorized to perform: sts:AssumeRole on resource: ...
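
A trust policy along the following lines is a sketch of what the Hybrid Nodes IAM role typically needs for IAM Roles Anywhere. The Region, account ID, and trust anchor ID are placeholders, and the exact actions and condition keys you require may differ, so verify against Prepare credentials for hybrid nodes.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "rolesanywhere.amazonaws.com" },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession",
                "sts:SetSourceIdentity"
            ],
            "Condition": {
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:rolesanywhere:AWS_REGION:AWS_ACCOUNT_ID:trust-anchor/TRUST_ANCHOR_ID"
                }
            }
        }
    ]
}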

IAM Roles Anywhere not authorized to set roleSessionName

In the kubelet logs, if you see an access denied issue for setting the roleSessionName, confirm you have set acceptRoleSessionName to true for your IAM Roles Anywhere profile.

AccessDeniedException: Not authorized to set roleSessionName
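
Assuming you manage the profile with the Amazon CLI, the following command is one way to enable this setting. The flag name is an assumption that mirrors the acceptRoleSessionName API field, and the profile ID is a placeholder; you can also enable the setting in the IAM Roles Anywhere console.

aws rolesanywhere update-profile --profile-id PROFILE_ID --accept-role-session-name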

Operating system troubleshooting

RHEL

Entitlement or subscription manager registration failures

If you are running nodeadm install and encounter a failure to install the hybrid nodes dependencies due to entitlement registration issues, ensure you have properly set your Red Hat username and password on your host.

This system is not registered with an entitlement server
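
You can register the host with Red Hat subscription management before re-running nodeadm install; the username and password below are placeholders for your Red Hat credentials.

sudo subscription-manager register --username RHEL_USERNAME --password RHEL_PASSWORD
sudo subscription-manager status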

GLIBC not found

If you are using Ubuntu for your operating system and IAM Roles Anywhere for your credential provider with hybrid nodes and see an issue with GLIBC not found, you can install that dependency manually to resolve the issue.

GLIBC_2.32 not found (required by /usr/local/bin/aws_signing_helper)

Run the following commands to install the dependency:

ldd --version
sudo apt update && sudo apt install libc6
sudo apt install glibc-source