Amazon EKS and Kubernetes Container Insights metrics - Amazon CloudWatch

Amazon EKS and Kubernetes Container Insights metrics

The following tables list the metrics and dimensions that Container Insights collects for Amazon EKS and Kubernetes. These metrics are in the ContainerInsights namespace. For more information, see Metrics.
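As a rough illustration of how the namespace and dimensions fit together, the following sketch builds the parameters for a CloudWatch GetMetricStatistics-style request. The helper name `build_metric_request`, the cluster name `my-cluster`, and the one-hour window are assumptions made for illustration. The resulting dictionary could be passed to an SDK call such as boto3's `get_metric_statistics`, which is not invoked here.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "ContainerInsights"  # namespace named in this document


def build_metric_request(metric_name, dimensions, end_time, minutes=60, period=300):
    """Build keyword arguments for a GetMetricStatistics-style request.

    `dimensions` maps dimension names (for example ClusterName or NodeName)
    to values. The combination should match one of the dimension sets
    listed for the metric in the tables.
    """
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": period,  # seconds per data point
        "Statistics": ["Average"],
    }


# Hypothetical cluster name and time window, used only for illustration.
request = build_metric_request(
    "cluster_node_count",
    {"ClusterName": "my-cluster"},
    end_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(request["Namespace"], request["MetricName"], request["Dimensions"])
```

A metric is only returned when the supplied dimensions exactly match a dimension set that was published, so a request that uses `ClusterName` alone will not match data points published with `NodeName, ClusterName, InstanceId`.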

If you do not see any Container Insights metrics in your console, be sure that you have completed the setup of Container Insights. Metrics do not appear before Container Insights has been set up completely. For more information, see Setting up Container Insights.

If you are using version 1.5.0 or later of the Amazon EKS add-on or version 1.300035.0 of the CloudWatch agent, most metrics listed in the following table are collected for both Linux and Windows nodes. The Metric name column of the table indicates which metrics are not collected for Windows.

With the original version of Container Insights, the metrics are charged as custom metrics. With Container Insights with enhanced observability for Amazon EKS, Container Insights metrics are charged per observation instead of being charged per metric stored or log ingested. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing.

Note

On Windows, network metrics such as pod_network_rx_bytes and pod_network_tx_bytes are not collected for host process containers.

Metric name | Dimensions with any version of Container Insights | Additional dimensions with Container Insights with enhanced observability for Amazon EKS | Description

cluster_failed_node_count

ClusterName

The number of failed worker nodes in the cluster. A node is considered failed if it is suffering from any node conditions. For more information, see Conditions in the Kubernetes documentation.

cluster_node_count

ClusterName

The total number of worker nodes in the cluster.

namespace_number_of_running_pods

Namespace ClusterName

ClusterName

The number of pods running per namespace in the resource that is specified by the dimensions that you're using.

node_cpu_limit

ClusterName

ClusterName, InstanceId, NodeName

The maximum number of CPU units that can be assigned to a single node in this cluster.

node_cpu_reserved_capacity

NodeName, ClusterName, InstanceId

ClusterName

The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker.

Formula: node_cpu_request / node_cpu_limit

Note

node_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.
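The formula above can be worked through against a single performance log event. This is a minimal sketch: the event values are invented, the helper name `percent_of` is hypothetical, and multiplying the ratio by 100 is an assumption based on the description calling the metric a percentage.

```python
def percent_of(numerator, denominator):
    """Return numerator / denominator as a percentage, or None if undefined."""
    if not denominator:
        return None
    return 100.0 * numerator / denominator


# Hypothetical performance log event; the values are illustrative only.
event = {"node_cpu_request": 1500, "node_cpu_limit": 4000}

node_cpu_reserved_capacity = percent_of(
    event["node_cpu_request"], event["node_cpu_limit"]
)
print(node_cpu_reserved_capacity)  # 37.5
```

The same helper applies to the other ratio formulas in this table, such as node_cpu_usage_total / node_cpu_limit.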

node_cpu_usage_total

ClusterName

ClusterName, InstanceId, NodeName

The number of CPU units being used on the nodes in the cluster.

node_cpu_utilization

NodeName, ClusterName, InstanceId

ClusterName

The total percentage of CPU units being used on the nodes in the cluster.

Formula: node_cpu_usage_total / node_cpu_limit

node_filesystem_utilization

NodeName, ClusterName, InstanceId

ClusterName

The total percentage of file system capacity being used on nodes in the cluster.

Formula: node_filesystem_usage / node_filesystem_capacity

Note

node_filesystem_usage and node_filesystem_capacity are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_memory_limit

ClusterName

ClusterName, InstanceId, NodeName

The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster.

node_filesystem_inodes

This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The total number of inodes (used and unused) on a node.

node_filesystem_inodes_free

This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The number of unused inodes on a node.

node_memory_reserved_capacity

NodeName, ClusterName, InstanceId

ClusterName

The percentage of memory that is reserved (requested) on the nodes in the cluster.

Formula: node_memory_request / node_memory_limit

Note

node_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_memory_utilization

NodeName, ClusterName, InstanceId

ClusterName

The percentage of memory currently being used by the node or nodes. It is the percentage of node memory usage divided by the node memory limitation.

Formula: node_memory_working_set / node_memory_limit

node_memory_working_set

ClusterName

ClusterName, InstanceId, NodeName

The amount of memory, in bytes, being used in the working set of the nodes in the cluster.

node_network_total_bytes

NodeName, ClusterName, InstanceId

ClusterName

The total number of bytes per second transmitted and received over the network per node in a cluster.

Formula: node_network_rx_bytes + node_network_tx_bytes

Note

node_network_rx_bytes and node_network_tx_bytes are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.
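Because the two inputs are available only as log fields, the total can be reproduced from performance log events. The sketch below assumes each event carries a `NodeName` field alongside the two rate fields. The node names and values are invented for illustration.

```python
def node_network_total_bytes(event):
    """Sum the receive and transmit rates from one performance log event."""
    return event["node_network_rx_bytes"] + event["node_network_tx_bytes"]


# Hypothetical events, one per node; the values are bytes per second.
events = [
    {"NodeName": "node-a", "node_network_rx_bytes": 1200.0, "node_network_tx_bytes": 800.0},
    {"NodeName": "node-b", "node_network_rx_bytes": 300.5, "node_network_tx_bytes": 99.5},
]

totals = {e["NodeName"]: node_network_total_bytes(e) for e in events}
print(totals)  # {'node-a': 2000.0, 'node-b': 400.0}
```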

node_number_of_running_containers

NodeName, ClusterName, InstanceId

ClusterName

The number of running containers per node in a cluster.

node_number_of_running_pods

NodeName, ClusterName, InstanceId

ClusterName

The number of running pods per node in a cluster.

node_status_allocatable_pods

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

The number of pods that can be assigned to a node based on its allocatable resources, which is defined as the remainder of a node's capacity after accounting for system daemons reservations and hard eviction thresholds.

node_status_capacity_pods

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

The number of pods that can be assigned to a node based on its capacity.

node_status_condition_ready

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition Ready is true.

node_status_condition_memory_pressure

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition MemoryPressure is true.

node_status_condition_pid_pressure

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition PIDPressure is true.

node_status_condition_disk_pressure

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition DiskPressure is true.

node_status_condition_unknown

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether any of the node status conditions are Unknown.

node_interface_network_rx_dropped

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

The number of packets which were received and subsequently dropped by a network interface on the node.

node_interface_network_tx_dropped

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, InstanceId, NodeName

The number of packets which were due to be transmitted but were dropped by a network interface on the node.

node_diskio_io_service_bytes_total

This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The total number of bytes transferred by all I/O operations on the node.

node_diskio_io_serviced_total

This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The total number of I/O operations on the node.

pod_cpu_reserved_capacity

PodName, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, Service

The CPU capacity that is reserved per pod in a cluster.

Formula: pod_cpu_request / node_cpu_limit

Note

pod_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_utilization

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of CPU units being used by pods.

Formula: pod_cpu_usage_total / node_cpu_limit

Note

pod_cpu_usage_total is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_utilization_over_pod_limit

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of CPU units being used by pods relative to the pod limit.

Formula: pod_cpu_usage_total / pod_cpu_limit

Note

pod_cpu_usage_total and pod_cpu_limit are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.
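pod_cpu_utilization and pod_cpu_utilization_over_pod_limit share a numerator and differ only in the denominator, which the following sketch makes explicit. It assumes, for illustration, that a single event carries all three fields. The function name and the values are hypothetical, and scaling by 100 is an assumption based on the descriptions calling these metrics percentages.

```python
def pod_cpu_percentages(event):
    """Compute both pod CPU percentages from one performance log event."""
    usage = event["pod_cpu_usage_total"]
    result = {"pod_cpu_utilization": 100.0 * usage / event["node_cpu_limit"]}
    pod_limit = event.get("pod_cpu_limit")
    if pod_limit:  # skipped when the pod has no CPU limit
        result["pod_cpu_utilization_over_pod_limit"] = 100.0 * usage / pod_limit
    return result


# Hypothetical event values, for illustration only.
event = {"pod_cpu_usage_total": 250, "node_cpu_limit": 4000, "pod_cpu_limit": 500}
percentages = pod_cpu_percentages(event)
print(percentages)
# {'pod_cpu_utilization': 6.25, 'pod_cpu_utilization_over_pod_limit': 50.0}
```

In this example the pod uses a small share of the node but half of its own limit, which is why the two metrics can tell very different stories for the same pod.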

pod_memory_reserved_capacity

PodName, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, Service

The percentage of memory that is reserved for pods.

Formula: pod_memory_request / node_memory_limit

Note

pod_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_utilization

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of memory currently being used by the pod or pods.

Formula: pod_memory_working_set / node_memory_limit

Note

pod_memory_working_set is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_utilization_over_pod_limit

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear.

Formula: pod_memory_working_set / pod_memory_limit

Note

pod_memory_working_set is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_network_rx_bytes

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The number of bytes per second being received over the network by the pod.

Formula: sum(pod_interface_network_rx_bytes)

Note

pod_interface_network_rx_bytes is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_network_tx_bytes

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The number of bytes per second being transmitted over the network by the pod.

Formula: sum(pod_interface_network_tx_bytes)

Note

pod_interface_network_tx_bytes is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_request

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The CPU requests for the pod.

Formula: sum(container_cpu_request)

Note

pod_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_request

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The memory requests for the pod.

Formula: sum(container_memory_request)

Note

pod_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_limit

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The CPU limit defined for the containers in the pod. If any containers in the pod don't have a CPU limit defined, this metric doesn't appear.

Formula: sum(container_cpu_limit)

Note

pod_cpu_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.
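The rule that the metric doesn't appear when any container lacks a limit can be sketched as a sum that is defined only when every term is present. The function below is a hypothetical illustration and not the agent's implementation. A missing limit is represented as `None`.

```python
def pod_cpu_limit(container_cpu_limits):
    """Sum per-container CPU limits.

    Returns None (no metric) if the pod has no containers or if any
    container has no CPU limit defined.
    """
    if not container_cpu_limits:
        return None
    if any(limit is None for limit in container_cpu_limits):
        return None
    return sum(container_cpu_limits)


print(pod_cpu_limit([500, 250]))   # 750
print(pod_cpu_limit([500, None]))  # None
```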

pod_memory_limit

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The memory limit defined for the containers in the pod. If any containers in the pod don't have a memory limit defined, this metric doesn't appear.

Formula: sum(container_memory_limit)

Note

pod_memory_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_status_failed

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod have terminated, and at least one container has terminated with a non-zero status or was terminated by the system.

pod_status_ready

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod are ready, having reached the condition of ContainersReady.

pod_status_running

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod are running.

pod_status_scheduled

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that the pod has been scheduled to a node.

pod_status_unknown

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that the status of the pod can't be obtained.

pod_status_pending

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that the pod has been accepted by the cluster but one or more of the containers has not become ready yet.

pod_status_succeeded

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod have successfully terminated and will not be restarted.

pod_number_of_containers

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers defined in the pod specification.

pod_number_of_running_containers

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are currently in the Running state.

pod_container_status_terminated

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Terminated state.

pod_container_status_running

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Running state.

pod_container_status_waiting

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Waiting state.

pod_interface_network_rx_dropped

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The number of packets which were received and subsequently dropped by a network interface for the pod.

pod_interface_network_tx_dropped

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The number of packets which were due to be transmitted but were dropped for the pod.

container_cpu_utilization

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of CPU units being used by the container.

Formula: container_cpu_usage_total / node_cpu_limit

Note

container_cpu_utilization is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_cpu_utilization_over_container_limit

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of CPU units being used by the container relative to the container limit. If the container doesn't have a CPU limit defined, this metric doesn't appear.

Formula: container_cpu_usage_total / container_cpu_limit

Note

container_cpu_utilization_over_container_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_memory_utilization

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of memory units being used by the container.

Formula: container_memory_working_set / node_memory_limit

Note

container_memory_utilization is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_memory_utilization_over_container_limit

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of memory units being used by the container relative to the container limit. If the container doesn't have a memory limit defined, this metric doesn't appear.

Formula: container_memory_working_set / container_memory_limit

Note

container_memory_utilization_over_container_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_memory_failures_total

This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows.

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The number of memory allocation failures experienced by the container.

pod_number_of_container_restarts

PodName, Namespace, ClusterName

The total number of container restarts in a pod.

service_number_of_running_pods

Service, Namespace, ClusterName

ClusterName

The number of pods running the service or services in the cluster.

replicas_desired

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

The number of pods desired for a workload as defined in the workload specification.

replicas_ready

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

The number of pods for a workload that have reached the ready status.

status_replicas_available

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

The number of pods for a workload which are available. A pod is available when it has been ready for the minReadySeconds defined in the workload specification.

status_replicas_unavailable

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

PodName, Namespace, ClusterName

The number of pods for a workload which are unavailable. A pod is available when it has been ready for the minReadySeconds defined in the workload specification. Pods are unavailable if they have not met this criterion.

apiserver_storage_objects

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, resource

The number of objects stored in etcd at the time of the last check.

apiserver_request_total

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, code, verb

The total number of API requests to the Kubernetes API server.

apiserver_request_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, verb

Response latency for API requests to the Kubernetes API server.

apiserver_admission_controller_admission_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, operation

Admission controller latency in seconds. An admission controller is code which intercepts requests to the Kubernetes API server.

rest_client_request_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, operation

Response latency experienced by clients calling the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes.

rest_client_requests_total

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, code, method

The total number of API requests to the Kubernetes API server made by clients. This metric is experimental and may change in future releases of Kubernetes.

etcd_request_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, operation

Response latency of API calls to etcd. This metric is experimental and may change in future releases of Kubernetes.

apiserver_storage_size_bytes

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, endpoint

Size of the storage database file physically allocated in bytes. This metric is experimental and may change in future releases of Kubernetes.

apiserver_longrunning_requests

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, resource

The number of active long-running requests to the Kubernetes API server.

apiserver_current_inflight_requests

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, request_kind

The number of requests that are being processed by the Kubernetes API server.

apiserver_admission_webhook_admission_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, name

Admission webhook latency in seconds. Admission webhooks are HTTP callbacks that receive admission requests and act on them, for example by validating or mutating the request.

apiserver_admission_step_admission_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, operation

Admission sub-step latency in seconds.

apiserver_requested_deprecated_apis

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, group

Number of requests to deprecated APIs on the Kubernetes API server.

apiserver_request_total_5XX

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, code, verb

Number of requests to the Kubernetes API server which were responded to with a 5XX HTTP response code.

apiserver_storage_list_duration_seconds

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, resource

Response latency of listing objects from etcd. This metric is experimental and may change in future releases of Kubernetes.

apiserver_current_inqueue_requests

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, request_kind

The number of requests queued by the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes.

apiserver_flowcontrol_rejected_requests_total

This metric is available only with Container Insights with enhanced observability for Amazon EKS

ClusterName

ClusterName, reason

Number of requests rejected by the API Priority and Fairness subsystem. This metric is experimental and may change in future releases of Kubernetes.

NVIDIA GPU metrics

Beginning with version 1.300034.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects NVIDIA GPU metrics from EKS workloads by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later. For more information, see Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on. The NVIDIA GPU metrics that are collected are listed in the table in this section.

For Container Insights to collect NVIDIA GPU metrics, you must meet the following prerequisites:

  • You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later.

  • The NVIDIA device plugin for Kubernetes must be installed in the cluster.

  • The NVIDIA container toolkit must be installed on the nodes of the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.

You can opt out of collecting NVIDIA GPU metrics by setting the accelerated_compute_metrics option in the CloudWatch agent configuration file to false. For more information and an example opt-out configuration, see (Optional) Additional configuration.
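As a rough sketch of what such an opt-out might look like, the following Python snippet builds a configuration fragment and prints it as JSON. Only the accelerated_compute_metrics option name comes from this page. The nesting under `logs.metrics_collected.kubernetes` and the `enhanced_container_insights` key are assumptions and should be checked against the example in (Optional) Additional configuration.

```python
import json

# Assumed nesting: the option is placed under logs.metrics_collected.kubernetes.
# Verify the exact structure against the agent configuration documentation.
config = {
    "logs": {
        "metrics_collected": {
            "kubernetes": {
                "enhanced_container_insights": True,
                "accelerated_compute_metrics": False,  # opt out of GPU metrics
            }
        }
    }
}

rendered = json.dumps(config, indent=2)
print(rendered)
```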

Metric name | Dimensions | Description

container_gpu_memory_total

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The total frame buffer size, in bytes, on the GPU(s) allocated to the container.

container_gpu_memory_used

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The bytes of frame buffer used on the GPU(s) allocated to the container.

container_gpu_memory_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The percentage of frame buffer used on the GPU(s) allocated to the container.

container_gpu_power_draw

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The power usage in watts of the GPU(s) allocated to the container.

container_gpu_temperature

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The temperature in degrees Celsius of the GPU(s) allocated to the container.

container_gpu_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The percentage utilization of the GPU(s) allocated to the container.

node_gpu_memory_total

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The total frame buffer size, in bytes, on the GPU(s) allocated to the node.

node_gpu_memory_used

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The bytes of frame buffer used on the GPU(s) allocated to the node.

node_gpu_memory_utilization

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The percentage of frame buffer used on the GPU(s) allocated to the node.

node_gpu_power_draw

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The power usage in watts of the GPU(s) allocated to the node.

node_gpu_temperature

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The temperature in degrees Celsius of the GPU(s) allocated to the node.

node_gpu_utilization

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The percentage utilization of the GPU(s) allocated to the node.

pod_gpu_memory_total

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The total frame buffer size, in bytes, on the GPU(s) allocated to the pod.

pod_gpu_memory_used

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The bytes of frame buffer used on the GPU(s) allocated to the pod.

pod_gpu_memory_utilization

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The percentage of frame buffer used on the GPU(s) allocated to the pod.

pod_gpu_power_draw

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The power usage in watts of the GPU(s) allocated to the pod.

pod_gpu_temperature

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The temperature in degrees Celsius of the GPU(s) allocated to the pod.

pod_gpu_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The percentage utilization of the GPU(s) allocated to the pod.
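CloudWatch identifies a metric by its exact combination of dimensions, so a query for one of the GPU metrics must use one of the dimension sets listed in the table. The following Python sketch builds a query entry for a node-level GPU metric and rejects dimension combinations that the table does not list. The function name build_node_gpu_query and the sample cluster, instance, and node values are hypothetical. The sketch only builds a dictionary and does not call any AWS API.

```python
# Dimension sets listed in the table above for the node_gpu_* metrics.
NODE_GPU_DIMENSION_SETS = [
    {"ClusterName"},
    {"ClusterName", "InstanceId", "NodeName"},
    {"ClusterName", "InstanceId", "InstanceType", "NodeName", "GpuDevice"},
]


def build_node_gpu_query(metric_name, dimensions, period=60, stat="Average"):
    """Build one metric query entry for a node-level GPU metric.

    dimensions maps dimension names to values, for example
    {"ClusterName": "demo"}. The names must match a listed set exactly.
    """
    if set(dimensions) not in NODE_GPU_DIMENSION_SETS:
        raise ValueError(
            f"{sorted(dimensions)} is not a dimension set published for node GPU metrics"
        )
    return {
        "Id": "q1",
        "MetricStat": {
            "Metric": {
                "Namespace": "ContainerInsights",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in sorted(dimensions.items())
                ],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }


if __name__ == "__main__":
    query = build_node_gpu_query(
        "node_gpu_utilization",
        {"ClusterName": "demo", "InstanceId": "i-0123456789abcdef0", "NodeName": "node-a"},
    )
    print(query)
```

The returned dictionary has the shape of one entry in the MetricDataQueries parameter of the CloudWatch GetMetricData API.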

Amazon Neuron metrics for Amazon Trainium and Amazon Inferentia

Beginning with version 1.300036.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects accelerated computing metrics from Amazon Trainium and Amazon Inferentia accelerators by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later. For more information about the add-on, see Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on. For more information about Amazon Trainium, see Amazon Trainium. For more information about Amazon Inferentia, see Amazon Inferentia.

For Container Insights to collect Amazon Neuron metrics, you must meet the following prerequisites:

  • You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later.

  • The Neuron driver must be installed on the nodes of the cluster.

  • The Neuron device plugin must be installed on the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.

The metrics that are collected are listed in the table in this section. The metrics are collected for Amazon Trainium, Amazon Inferentia, and Amazon Inferentia2.

The CloudWatch agent collects these metrics from the Neuron monitor and does the necessary Kubernetes resource correlation to deliver metrics at the pod and container levels.

Metric name Dimensions Description

container_neuroncore_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The NeuronCore utilization during the captured period of the NeuronCore allocated to the container.

Unit: Percent

container_neuroncore_memory_usage_constants

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_model_code

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_model_shared_scratchpad

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the container. This memory region is reserved for the models.

Unit: Bytes

container_neuroncore_memory_usage_runtime_memory

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_tensors

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for tensors by the NeuronCore allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_total

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The total amount of memory used by the NeuronCore allocated to the container.

Unit: Bytes

container_neurondevice_hw_ecc_events_total

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice

The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to the container.

Unit: Count

pod_neuroncore_utilization

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The NeuronCore utilization during the captured period of the NeuronCore allocated to the pod.

Unit: Percent

pod_neuroncore_memory_usage_constants

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_model_code

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_model_shared_scratchpad

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the pod. This memory region is reserved for the models.

Unit: Bytes

pod_neuroncore_memory_usage_runtime_memory

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_tensors

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for tensors by the NeuronCore allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_total

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The total amount of memory used by the NeuronCore allocated to the pod.

Unit: Bytes

pod_neurondevice_hw_ecc_events_total

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice

The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to a pod.

Unit: Count

node_neuroncore_utilization

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The NeuronCore utilization during the captured period of the NeuronCore allocated to the node.

Unit: Percent

node_neuroncore_memory_usage_constants

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_model_code

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for models' executable code by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_model_shared_scratchpad

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the node. This memory region is reserved for the models.

Unit: Bytes

node_neuroncore_memory_usage_runtime_memory

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for the Neuron runtime by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_tensors

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for tensors by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_total

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The total amount of memory used by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuron_execution_errors_total

ClusterName

ClusterName, InstanceId, NodeName

The total number of execution errors on the node. This is calculated by the CloudWatch agent by aggregating the errors of the following types: generic, numerical, transient, model, runtime, and hardware.

Unit: Count

node_neurondevice_runtime_memory_used_bytes

ClusterName

ClusterName, InstanceId, NodeName

The total Neuron device memory usage in bytes on the node.

Unit: Bytes

node_neuron_execution_latency

ClusterName

ClusterName, InstanceId, NodeName

The latency, in seconds, for an execution on the node as measured by the Neuron runtime.

Unit: Seconds

node_neurondevice_hw_ecc_events_total

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, NodeName, NeuronDevice

The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node.

Unit: Count
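The description of node_neuron_execution_errors_total states that the value is an aggregate over six error types. The following Python sketch shows that aggregation as a plain sum. The input dictionary shape and the function name total_execution_errors are hypothetical and chosen for illustration only. They are not the Neuron monitor's output format or the CloudWatch agent's actual implementation.

```python
# Error types named in the node_neuron_execution_errors_total description.
ERROR_TYPES = ("generic", "numerical", "transient", "model", "runtime", "hardware")


def total_execution_errors(errors_by_type):
    """Sum per-type execution error counts into a single total.

    errors_by_type maps an error type name to its count. Types that are
    missing from the mapping count as zero. Unknown type names are rejected
    so that a typo does not silently drop errors from the total.
    """
    unknown = set(errors_by_type) - set(ERROR_TYPES)
    if unknown:
        raise ValueError(f"unknown error types: {sorted(unknown)}")
    return sum(errors_by_type.get(error_type, 0) for error_type in ERROR_TYPES)


if __name__ == "__main__":
    sample = {"generic": 2, "transient": 1, "hardware": 1}
    print(total_execution_errors(sample))
```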

Amazon Elastic Fabric Adapter (EFA) metrics

Beginning with version 1.300037.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects Amazon Elastic Fabric Adapter (EFA) metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later. For more information about the add-on, see Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on. For more information about Amazon Elastic Fabric Adapter, see Elastic Fabric Adapter.

For Container Insights to collect Amazon Elastic Fabric Adapter metrics, you must meet the following prerequisites:

  • You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later.

  • The EFA device plugin must be installed on the cluster. For more information, see aws-efa-k8s-device-plugin on GitHub.

The metrics that are collected are listed in the following table.

Metric name Dimensions Description

container_efa_rx_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second received by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_tx_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second transmitted by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_rx_dropped

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of packets that were received and then dropped by the EFA device(s) allocated to the container.

Unit: Count/Second

container_efa_rdma_read_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_rdma_write_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second transmitted using remote direct memory access write operations by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_rdma_write_recv_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the container.

Unit: Bytes/Second

pod_efa_rx_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second received by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_tx_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second transmitted by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_rx_dropped

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of packets that were received and then dropped by the EFA device(s) allocated to the pod.

Unit: Count/Second

pod_efa_rdma_read_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_rdma_write_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second transmitted using remote direct memory access write operations by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_rdma_write_recv_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

node_efa_rx_bytes

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, EfaDevice

The number of bytes per second received by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_tx_bytes

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, EfaDevice

The number of bytes per second transmitted by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_rx_dropped

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, EfaDevice

The number of packets that were received and then dropped by the EFA device(s) allocated to the node.

Unit: Count/Second

node_efa_rdma_read_bytes

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, EfaDevice

The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_rdma_write_bytes

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, EfaDevice

The number of bytes per second transmitted using remote direct memory access write operations by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_rdma_write_recv_bytes

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, EfaDevice

The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the node.

Unit: Bytes/Second
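The EFA byte metrics are reported in Bytes/Second, while network bandwidth is often discussed in gigabits per second. The following Python sketch converts a Bytes/Second value to gigabits per second and combines per-device values into a node-level figure. The function names and the device names in the example are hypothetical. Summing per-device values is a simplification chosen for illustration, not a description of how Container Insights aggregates the EfaDevice dimension.

```python
def bytes_per_second_to_gbps(bytes_per_second):
    """Convert a Bytes/Second value to gigabits per second (10**9 bits)."""
    return bytes_per_second * 8 / 1e9


def combined_gbps(bytes_per_second_by_device):
    """Sum per-device Bytes/Second values and convert the total to Gbit/s.

    bytes_per_second_by_device maps a device name to a Bytes/Second value,
    for example a datapoint of node_efa_rx_bytes for each EfaDevice.
    """
    return bytes_per_second_to_gbps(sum(bytes_per_second_by_device.values()))


if __name__ == "__main__":
    # Hypothetical device names and values, for illustration only.
    readings = {"device-a": 1.25e9, "device-b": 2.5e9}
    print(combined_gbps(readings))
```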