Node monitoring agent Node auto repair Node health issues

Enable node auto repair and investigate node health issues

Node health refers to the operational status and capability of a node to effectively run workloads. A healthy node maintains expected connectivity, has sufficient resources, and can successfully run Pods without disruption. For information on getting details about your nodes, see View the health status of your nodes and Retrieve node logs for a managed node using kubectl and S3.

To help with maintaining healthy nodes, Amazon EKS offers the node monitoring agent and node auto repair.

Important

The node monitoring agent and node auto repair are only available on Linux. These features aren’t available on Windows.

Node monitoring agent

The node monitoring agent automatically reads node logs to detect certain health issues. It parses through node logs to detect failures and surfaces various status information about worker nodes. A dedicated NodeCondition is applied on the worker nodes for each category of issues detected, such as storage and networking issues. Descriptions of detected health issues are made available in the observability dashboard. For more information, see Node health issues.

The node monitoring agent is included as a capability for all Amazon EKS Auto Mode clusters. For other cluster types, you can add the monitoring agent as an Amazon EKS add-on. For more information, see Create an Amazon EKS add-on.

Node auto repair

Node auto repair is an additional feature that continuously monitors the health of nodes, automatically reacting to detected problems and replacing nodes when possible. This helps overall availability of the cluster with minimal manual intervention. If a health check fails, the node is automatically cordoned so that no new Pods are scheduled on the node.

By itself, node auto repair can react to the Ready condition of the kubelet and any node objects that are manually deleted. When paired with the node monitoring agent, node auto repair can react to more conditions that wouldn’t be detected otherwise. These additional conditions include KernelReady, NetworkingReady, and StorageReady.

This automated node recovery automatically addresses intermittent node issues such as failures to join the cluster, unresponsive kubelets, and increased accelerator (device) errors. The improved reliability helps reduce application downtime and improve cluster operations. Node auto repair cannot handle certain problems that are reported such as DiskPressure, MemoryPressure, and PIDPressure. Amazon EKS waits 10 minutes before acting on the AcceleratedHardwareReady NodeConditions, and 30 minutes for all other conditions.

Managed node groups will also automatically disable node repairs for safety reasons under two scenarios. Any repair operations that are previously in progress will continue for both situations.

If a zonal shift for your cluster has been triggered through the Application Recovery Controller (ARC), all subsequent repair operations are halted.
If your node group has more than five nodes and more than 20% of the nodes in your node group are in an unhealthy state, repair operations are halted.

You can enable node auto repair when creating or editing a managed node group.

When using the Amazon EKS console, activate the Enable node auto repair checkbox for the managed node group. For more information, see Create a managed node group for your cluster.
When using the Amazon CLI, add the --node-repair-config enabled=true to the eks create nodegroup or eks update-nodegroup-config command.
For an example eksctl ClusterConfig that uses a managed node group with node auto repair, see 44-node-repair.yaml on GitHub.

Amazon EKS provides more granular control over the node auto repair behavior through the following:

maxUnhealthyNodeThresholdCount and maxUnhealthyNodeThresholdPercentage
- These fields allow you to specify a count or percentage threshold of unhealthy nodes, above which node auto repair actions will stop. This provides more control over the "blast radius" of node auto repairs.
- You can set either the absolute count or percentage, but not both.
maxParallelNodesRepairedCount and maxParallelNodesRepairedPercentage
- These fields allow you to specify the maximum number of nodes that can be repaired concurrently or in parallel, expressed as either a count or percentage of all unhealthy nodes. This gives you finer-grained control over the pace of node replacements.
- As with the unhealthy node threshold, you can set either the absolute count or percentage, but not both.
nodeRepairConfigOverrides
- This is a complex structure that allows you to set granular overrides for specific repair actions. These overrides control the repair action and the repair delay time before a node is considered eligible for repair.
- The specific fields in this structure are:
  - nodeMonitoringCondition: The unhealthy condition reported by the node monitoring agent.
  - nodeUnhealthyReason: The reason why the node monitoring agent identified the node as unhealthy.
  - minRepairWaitTimeMins: The minimum time (in minutes) that the repair condition and unhealthy reason must persist before the node is eligible for repair.
  - repairAction: The action the repair system should take when the above conditions are met.
- If you use this field, you must specify all the fields in the structure. You can also provide a list of these overrides.
- The nodeMonitoringCondition and nodeUnhealthyReason are manual text inputs that you set to indicate you want to deviate from the system’s default behavior.
- The minRepairWaitTimeMins and repairAction fields allow you to specify deviations from the system’s default behavior.

Node health issues

The following tables describe node health issues that can be detected by the node monitoring agent. There are two types of issues:

Condition – A terminal issue that warrants a remediation action like an instance replacement or reboot. When auto repair is enabled, Amazon EKS will do a repair action, either as a node replacement or reboot. For more information, see Node conditions.
Event – A temporary issue or sub-optimal node configuration. No auto repair action will take place. For more information, see Node events.

AcceleratedHardware node health issues

The monitoring condition is AcceleratedHardwareReady for issues in the following table that have a severity of “Condition”.

If auto repair is enabled, the repair actions that are listed start 10 minutes after the issue is detected. For more information on XID errors, see Xid Errors in the NVIDIA GPU Deployment and Management Documentation. For more information on the individual XID messages, see Understanding Xid Messages in the NVIDIA GPU Deployment and Management Documentation.

Name	Severity	Description
DCGMDiagnosticFailure	Condition	A test case from the DCGM active diagnostics test suite failed.
DCGMError	Condition	Connection to the DCGM host process was lost or could not be established.
DCGMFieldError[Code]	Event	DCGM detected GPU degredation through a field identifier.
DCGMHealthCode[Code]	Event	A DCGM health check failed in a non-fatal manner.
DCGMHealthCode[Code]	Condition	A DCGM health check failed in a fatal manner.
NeuronDMAError	Condition	A DMA engine encountered an unrecoverable error.
NeuronHBMUncorrectableError	Condition	An HBM encountered an uncorrectable error and produced incorrect results.
NeuronNCUncorrectableError	Condition	A Neuron Core uncorrectable memory error was detected.
NeuronSRAMUncorrectableError	Condition	An on-chip SRAM encountered a parity error and produced incorrect results.
NvidiaDeviceCountMismatch	Event	The number of GPUs visible through NVML is inconsistent with the NVIDIA device count on the filesystem.
NvidiaDoubleBitError	Condition	A double bit error was produced by the GPU driver.
NvidiaNCCLError	Event	A segfault occurred in the NVIDIA Collective Communications library (`libnccl`).
NvidiaNVLinkError	Condition	NVLink errors were reported by the GPU driver.
NvidiaPCIeError	Event	PCIe replays were triggered to recover from transmission errors.
NvidiaPageRetirement	Event	The GPU driver has marked a memory page for retirement. This may occur if there is a single double bit error or two single bit errors are encountered at the same address.
NvidiaPowerError	Event	Power utilization of GPUs breached the allowed thresholds.
NvidiaThermalError	Event	Thermal status of GPUs breached the allowed thresholds.
NvidiaXID[Code]Error	Condition	A critical GPU error occurred.
NvidiaXID[Code]Warning	Event	A non-critical GPU error occurred.

ContainerRuntime node health issues

The monitoring condition is ContainerRuntimeReady for issues in the following table that have a severity of “Condition”.

Name	Severity	Description
ContainerRuntimeFailed	Event	The container runtime has failed to create a container, likely related to any reported issues if occurring repeatedly.
DeprecatedContainerdConfiguration	Event	A container image using deprecated image manifest version 2, schema 1 was recently pulled onto the node through `containerd`.
KubeletFailed	Event	The kubelet entered a failed state.
LivenessProbeFailures	Event	A liveness probe failure was detected, potentially indicating application code issues or insufficient timeout values if occurring repeatedly.
PodStuckTerminating	Condition	A Pod is or was stuck terminating for an excessive amount of time, which can be caused by CRI errors preventing pod state progression.
ReadinessProbeFailures	Event	A readiness probe failure was detected, potentially indicating application code issues or insufficient timeout values if occurring repeatedly.
[Name]RepeatedRestart	Event	A systemd unit is restarting frequently.
ServiceFailedToStart	Event	A systemd unit failed to start.

Kernel node health issues

The monitoring condition is KernelReady for issues in the following table that have a severity of “Condition”.

Name	Severity	Description
AppBlocked	Event	The task has been blocked for a long period of time from scheduling, usually caused by being blocked on input or output.
AppCrash	Event	An application on the node has crashed.
ApproachingKernelPidMax	Event	The number of processes is approaching the maximum number of PIDs that are available per the current `kernel.pid_max` setting, after which no more processes can be launched.
ApproachingMaxOpenFiles	Event	The number of open files is approaching the maximum number of possible open files given the current kernel settings, after which opening new files will fail.
ConntrackExceededKernel	Event	Connection tracking exceeded the maximum for the kernel and new connections could not be established, which can result in packet loss.
ExcessiveZombieProcesses	Event	Processes which can’t be fully reclaimed are accumulating in large numbers, which indicates application issues and may lead to reaching system process limits.
ForkFailedOutOfPIDs	Condition	A fork or exec call has failed due to the system being out of process IDs or memory, which may be caused by zombie processes or physical memory exhaustion.
KernelBug	Event	A kernel bug was detected and reported by the Linux kernel itself, though this may sometimes be caused by nodes with high CPU or memory usage leading to delayed event processing.
LargeEnvironment	Event	The number of environment variables for this process is larger than expected, potentially caused by many services with `enableServiceLinks` set to true, which may cause performance issues.
RapidCron	Event	A cron job is running faster than every five minutes on this node, which may impact performance if the job consumes significant resources.
SoftLockup	Event	The CPU stalled for a given amount of time.

Networking node health issues

The monitoring condition is NetworkingReady for issues in the following table that have a severity of “Condition”.

Name	Severity	Description
BandwidthInExceeded	Event	Packets have been queued or dropped because the inbound aggregate bandwidth exceeded the maximum for the instance.
BandwidthOutExceeded	Event	Packets have been queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance.
ConntrackExceeded	Event	Connection tracking exceeded the maximum for the instance and new connections could not be established, which can result in packet loss.
IPAMDInconsistentState	Event	The state of the IPAMD checkpoint on disk does not reflect the IPs in the container runtime.
IPAMDNoIPs	Event	IPAMD is out of IP addresses.
IPAMDNotReady	Condition	IPAMD fails to connect to the API server.
IPAMDNotRunning	Condition	The Amazon VPC CNI process was not found to be running.
IPAMDRepeatedlyRestart	Event	Multiple restarts in the IPAMD service have occurred.
InterfaceNotRunning	Condition	This interface appears to not be running or there are network issues.
InterfaceNotUp	Condition	This interface appears to not be up or there are network issues.
KubeProxyNotReady	Event	Kube-proxy failed to watch or list resources.
LinkLocalExceeded	Event	Packets were dropped because the PPS of traffic to local proxy services exceeded the network interface maximum.
MACAddressPolicyMisconfigured	Event	The systemd-networkd link configuration has the incorrect `MACAddressPolicy` value.
MissingDefaultRoutes	Event	There are missing default route rules.
MissingIPRoutes	Event	There are missing routes for Pod IPs.
MissingIPRules	Event	There are missing rules for Pod IPs.
MissingLoopbackInterface	Condition	The loopback interface is missing from this instance, causing failure of services depending on local connectivity.
NetworkSysctl	Event	This node’s network `sysctl` settings are potentially incorrect.
PPSExceeded	Event	Packets have been queued or dropped because the bidirectional PPS exceeded the maximum for the instance.
PortConflict	Event	If a Pod uses hostPort, it can write `iptables` rules that override the host’s already bound ports, potentially preventing API server access to `kubelet`.
UnexpectedRejectRule	Event	An unexpected `REJECT` or `DROP` rule was found in the `iptables`, potentially blocking expected traffic.

Storage node health issues

The monitoring condition is StorageReady for issues in the following table that have a severity of “Condition”.

Name	Severity	Description
EBSInstanceIOPSExceeded	Event	Maximum IOPS for the instance was exceeded.
EBSInstanceThroughputExceeded	Event	Maximum Throughput for the instance was exceeded.
EBSVolumeIOPSExceeded	Event	Maximum IOPS to a particular EBS Volume was exceeded.
EBSVolumeThroughputExceeded	Event	Maximum Throughput to a particular Amazon EBS volume was exceeded.
EtcHostsMountFailed	Event	Mounting of the kubelet generated `/etc/hosts` failed due to userdata remounting `/var/lib/kubelet/pods` during `kubelet-container` operation.
IODelays	Event	Input or output delay detected in a process, potentially indicating insufficient input-output provisioning if excessive.
KubeletDiskUsageSlow	Event	The `kubelet` is reporting slow disk usage while trying to access the filesystem. This potentially indicates insufficient disk input-output or filesystem issues.
XFSSmallAverageClusterSize	Event	The XFS Average Cluster size is small, indicating excessive free space fragmentation. This can prevent file creation despite available inodes or free space.

Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Custom builds

View node health