Amazon EKS and Kubernetes Container Insights metrics - Amazon CloudWatch
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Amazon EKS and Kubernetes Container Insights metrics

The following tables list the metrics and dimensions that Container Insights collects for Amazon EKS and Kubernetes. These metrics are in the ContainerInsights namespace. For more information, see Metrics.

If you do not see any Container Insights metrics in your console, be sure that you have completed the setup of Container Insights. Metrics do not appear before Container Insights has been set up completely. For more information, see Setting up Container Insights.

Metric name Dimensions Description

cluster_failed_node_count

ClusterName

The number of failed worker nodes in the cluster. A node is considered failed if it is suffering from any node conditions. For more information, see Conditions in the Kubernetes documentation.

cluster_node_count

ClusterName

The total number of worker nodes in the cluster.

namespace_number_of_running_pods

Namespace ClusterName

ClusterName

The number of pods running per namespace in the resource that is specified by the dimensions that you're using.

node_cpu_limit

ClusterName

The maximum number of CPU units that can be assigned to a single node in this cluster.

node_cpu_reserved_capacity

NodeName, ClusterName, InstanceId

ClusterName

The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker.

Formula: node_cpu_request / node_cpu_limit

Note

node_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_cpu_usage_total

ClusterName

The number of CPU units being used on the nodes in the cluster.

node_cpu_utilization

NodeName, ClusterName, InstanceId

ClusterName

The total percentage of CPU units being used on the nodes in the cluster.

Formula: node_cpu_usage_total / node_cpu_limit

node_gpu_limit

ClusterName

ClusterName, InstanceId, NodeName

The total number of GPU(s) available on the node.

node_gpu_usage_total

ClusterName

ClusterName, InstanceId, NodeName

The number of GPU(s) being used by the running pods on the node.

node_gpu_reserved_capacity

ClusterName

ClusterName, InstanceId, NodeName

The percentage of GPU currently being reserved on the node. The formula is, node_gpu_request / node_gpu_limit.

Note

node_gpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_filesystem_utilization

NodeName, ClusterName, InstanceId

ClusterName

The total percentage of file system capacity being used on nodes in the cluster.

Formula: node_filesystem_usage / node_filesystem_capacity

Note

node_filesystem_usage and node_filesystem_capacity are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_memory_limit

ClusterName

The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster.

node_memory_reserved_capacity

NodeName, ClusterName, InstanceId

ClusterName

The percentage of memory currently being used on the nodes in the cluster.

Formula: node_memory_request / node_memory_limit

Note

node_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_memory_utilization

NodeName, ClusterName, InstanceId

ClusterName

The percentage of memory currently being used by the node or nodes. It is the percentage of node memory usage divided by the node memory limitation.

Formula: node_memory_working_set / node_memory_limit.

node_memory_working_set

ClusterName

The amount of memory, in bytes, being used in the working set of the nodes in the cluster.

node_network_total_bytes

NodeName, ClusterName, InstanceId

ClusterName

The total number of bytes per second transmitted and received over the network per node in a cluster.

Formula: node_network_rx_bytes + node_network_tx_bytes

Note

node_network_rx_bytes and node_network_tx_bytes are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_number_of_running_containers

NodeName, ClusterName, InstanceId

ClusterName

The number of running containers per node in a cluster.

node_number_of_running_pods

NodeName, ClusterName, InstanceId

ClusterName

The number of running pods per node in a cluster.

pod_cpu_reserved_capacity

PodName, Namespace, ClusterName

ClusterName

The CPU capacity that is reserved per pod in a cluster.

Formula: pod_cpu_request / node_cpu_limit

Note

pod_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_utilization

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

The percentage of CPU units being used by pods.

Formula: pod_cpu_usage_total / node_cpu_limit

pod_cpu_utilization_over_pod_limit

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

The percentage of CPU units being used by pods relative to the pod limit.

Formula: pod_cpu_usage_total / pod_cpu_limit

pod_gpu_request

ClusterName

ClusterName, Namespace, PodName

ClusterName, FullPodName, Namespace, PodName

The GPU requests for the pod. This value must always be equal to pod_gpu_limit.

pod_gpu_limit

ClusterName

ClusterName, Namespace, PodName

ClusterName, FullPodName, Namespace, PodName

The maximum number of GPU(s) that can be assigned to the pod in a node.

pod_gpu_usage_total

ClusterName

ClusterName, Namespace, PodName

ClusterName, FullPodName, Namespace, PodName

The number of GPU(s) being allocated on the pod.

pod_gpu_reserved_capacity

ClusterName

ClusterName, Namespace, PodName

ClusterName, FullPodName, Namespace, PodName

The percentage of GPU currently being reserved for the pod. The formula is - pod_gpu_request / node_gpu_reserved_capacity.

pod_memory_reserved_capacity

PodName, Namespace, ClusterName

ClusterName

The percentage of memory that is reserved for pods.

Formula: pod_memory_request / node_memory_limit

Note

pod_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_utilization

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

The percentage of memory currently being used by the pod or pods.

Formula: pod_memory_working_set / node_memory_limit

pod_memory_utilization_over_pod_limit

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear.

Formula: pod_memory_working_set / pod_memory_limit

pod_network_rx_bytes

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

The number of bytes per second being received over the network by the pod.

Formula: sum(pod_interface_network_rx_bytes)

Note

pod_interface_network_rx_bytes is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_network_tx_bytes

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

The number of bytes per second being transmitted over the network by the pod.

Formula: sum(pod_interface_network_tx_bytes)

Note

pod_interface_network_tx_bytes is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_number_of_container_restarts

PodName, Namespace, ClusterName

The total number of container restarts in a pod.

service_number_of_running_pods

Service, Namespace, ClusterName

ClusterName

The number of pods running the service or services in the cluster.

Kueue metrics

Beginning with version v2.4.0-eksbuild.1 of the the CloudWatch Observability EKS add-on, Container Insights for Amazon EKS supports collecting Kueue metrics from Amazon EKS clusters. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart.

For information about enabling the metrics, see Enable Kueue metrics to enable them.

The Kueue metrics that are collected are listed in the following table. These metrics are published into the ContainerInsights/Prometheus namespace in CloudWatch. Some of these metrics use the following dimensions:

  • ClusterQueue is the name of the ClusterQueue

  • The possible values of Status are active and inadmissible

  • The possible values of Reason are Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped, and InactiveWorkload

  • Flavor is the referenced flavor.

  • Resource refers to cluster computer resources, such as cpu, memory, gpu, and so on.

Metric name Dimensions Description

kueue_pending_workloads

ClusterName, ClusterQueue, Status

ClusterName, ClusterQueue

ClusterName, Status

ClusterName

The number of pending workloads.

kueue_evicted_workloads_total

ClusterName, ClusterQueue, Reason

ClusterName, ClusterQueue

ClusterName, Reason

ClusterName

The total number of evicted workloads.

kueue_admitted_active_workloads

ClusterName, ClusterQueue

ClusterName

The number of admitted workloads that are active (unsuspended and not finished).

kueue_cluster_queue_resource_usage

ClusterName, ClusterQueue, Resource, Flavor

ClusterName, ClusterQueue, Resource

ClusterName, ClusterQueue, Flavor

ClusterName, ClusterQueue

ClusterName

Reports the total resource usage of the ClusterQueue.

kueue_cluster_queue_nominal_quota

ClusterName, ClusterQueue, Resource, Flavor

ClusterName, ClusterQueue, Resource

ClusterName, ClusterQueue, Flavor

ClusterName, ClusterQueue

ClusterName

Reports the resource quota of the ClusterQueue.