SageMaker AI Insights OpenTelemetry metrics reference - Amazon CloudWatch
This section provides a comprehensive list of OpenTelemetry metrics emitted by SageMaker AI detailed observability. Detailed observability is built on OpenTelemetry (OTel) and collects fine‐grained operational metrics from the GPU, node, and inference framework layers, publishing them to CloudWatch with rich labels.

Note

Detailed observability publishes OpenTelemetry (OTel) metrics to CloudWatch via OTLP. These are not Prometheus metrics. The metrics are natively stored in CloudWatch as OTel metric data and are queryable using PromQL syntax.
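
Because the metrics are queryable with PromQL syntax, query strings can be assembled programmatically. The following Python sketch builds a selector and wraps it in a range function. The label name `endpoint_name` and the helper names are hypothetical and do not come from this reference; check the labels actually attached to your metrics before relying on them.

```python
from typing import Mapping, Optional


def _escape(value: str) -> str:
    """Escape a label value for use inside a double-quoted PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def promql_selector(metric: str, labels: Optional[Mapping[str, str]] = None) -> str:
    """Build a PromQL instant-vector selector such as metric{key="value"}."""
    if not labels:
        return metric
    matchers = ",".join(
        f'{key}="{_escape(value)}"' for key, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}"


def avg_over_time(selector: str, window: str = "5m") -> str:
    """Wrap a selector in the standard PromQL avg_over_time range function."""
    return f"avg_over_time({selector}[{window}])"


# "endpoint_name" is a hypothetical label used only for illustration.
query = avg_over_time(
    promql_selector("DCGM_FI_DEV_GPU_UTIL", {"endpoint_name": "my-endpoint"})
)
print(query)
```

Sorting the label matchers keeps the generated query text stable, which helps when queries are cached or compared.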

Account-level aggregate metrics

| Metric | Type | Description |
| --- | --- | --- |
| TotalEndpoints | Aggregate | Count of all endpoints in the account |
| TotalICs | Aggregate | Count of all inference components |
| TotalGPUs | Aggregate | Count of all GPUs (derived from DCGM_FI_DEV_GPU_UTIL) |

Endpoint-level metrics

| Metric | Type | Description |
| --- | --- | --- |
| InstanceCountPerEndpoint | Aggregate | Instance count serving traffic |
| InstancesPerAZ | Aggregate | Instances per availability zone |
| EmptyInstanceCount | Aggregate | Instances with no ICs placed |
| ScalingAction | Raw | Scaling event (in/out) |
| ICECount | Raw | Insufficient Capacity Error count |
| CPUUtilization (EP) | Aggregate | Endpoint-level CPU utilization |
| MemoryUtilization (EP) | Aggregate | Endpoint-level memory utilization |
| DiskUtilization (EP) | Aggregate | Endpoint-level disk utilization |
| E2EScalingLatency | Raw | End-to-end scaling duration |
| RebalancingType | Raw | Type of rebalancing event |
| RebalancingDuration | Raw | Rebalancing duration (seconds) |
| RebalancingCopiesMoved | Raw | IC copies moved during rebalancing |
| AZSkewDelta | Raw | AZ distribution imbalance |
| InstancesReleased | Raw | Instances freed by rebalancing |

Inference framework metrics (vLLM / SGLang)

| Metric | Type | Description |
| --- | --- | --- |
| InputTokensPerSecond | Aggregate | Input token rate |
| OutputTokensPerSecond | Aggregate | Output token rate |
| TotalTPS | Aggregate | Total tokens per second |
| TPSUtilization | Aggregate | TPS utilization percentage |
| TTFT | Raw | Time to First Token (histogram) |
| InterTokenLatency | Raw | Inter-Token Latency (histogram) |
| KVCacheUtilization | Raw | KV cache usage percentage |
| QueueDepth | Raw | Waiting requests (queue) |
| BatchSize | Raw | Running requests (in-flight) |
| TPSPerInstance | Aggregate | TPS aggregated per instance |
| ConcurrentReqsPerCopy | Raw | Per-copy in-flight requests |
| FirstChunkLatencyPerIC | Raw | TTFT per IC |
| 4XXErrorRatePerIC | Aggregate | Client errors per IC |
| 5XXErrorRatePerIC | Aggregate | Server errors per IC |
| TTFTPerIC | Raw | TTFT filtered by IC |
| ITLPerIC | Raw | ITL filtered by IC |
| InputTPSPerIC | Aggregate | Input TPS per IC |
| OutputTPSPerIC | Aggregate | Output TPS per IC |
| KVCachePerIC | Raw | KV cache per IC |
| ModelErrorBreakdown | Aggregate | Error breakdown by type |

Note

TTFT and InterTokenLatency are framework‐dependent and only supported for vLLM and SGLang.
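
TTFT and InterTokenLatency are listed as histograms, so a percentile has to be estimated from bucket counts rather than read directly. The following Python sketch shows one way to do that with linear interpolation inside the matching bucket, similar in spirit to what PromQL's `histogram_quantile` does. It assumes an explicit-bucket layout with per-bucket (non-cumulative) counts and a lower bound of 0 for the first bucket; the function name and the sample bounds are illustrative, not part of this reference.

```python
from typing import Sequence


def estimate_quantile(q: float, bounds: Sequence[float], counts: Sequence[int]) -> float:
    """Estimate the q-quantile from an explicit-bucket histogram.

    bounds: ascending upper bounds of the finite buckets.
    counts: per-bucket counts, one more entry than bounds; the last entry
            is the overflow bucket for values above bounds[-1].
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if not bounds or len(counts) != len(bounds) + 1:
        raise ValueError("expected at least one bound and len(bounds) + 1 counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")

    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        previous = cumulative
        cumulative += count
        if count > 0 and cumulative >= rank:
            if i == len(bounds):
                # Overflow bucket has no upper bound: cap at the highest finite bound.
                return bounds[-1]
            # Assumption: the first bucket starts at 0 (reasonable for latencies).
            lower = bounds[i - 1] if i > 0 else 0.0
            upper = bounds[i]
            return lower + (upper - lower) * (rank - previous) / count
    return bounds[-1]


# Illustrative bucket layout in seconds; real bounds depend on the framework.
ttft_bounds = [0.1, 0.5, 1.0]
ttft_counts = [10, 30, 50, 10]
p50 = estimate_quantile(0.5, ttft_bounds, ttft_counts)
```

With these sample values the median estimate is 0.6 seconds. The estimate is only as precise as the bucket boundaries, so percentiles that fall in the overflow bucket are reported as the highest finite bound.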

GPU (DCGM) metrics

These metrics are collected from the NVIDIA Data Center GPU Manager (DCGM) exporter. They are available on all GPU endpoints regardless of inference framework.

| Metric | Type | Description |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Raw | GPU utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Raw | Memory copy utilization (%) |
| DCGM_FI_DEV_GPU_TEMP | Raw | GPU temperature (°C) |
| DCGM_FI_DEV_MEMORY_TEMP | Raw | Memory temperature (°C) |
| DCGM_FI_DEV_FB_FREE | Raw | Framebuffer memory free (bytes) |
| DCGM_FI_DEV_FB_USED | Raw | Framebuffer memory used (bytes) |
| DCGM_FI_DEV_SM_ACTIVE | Raw | Streaming multiprocessor active (%) |
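
The framebuffer metrics are reported as separate used and free values, so a memory utilization percentage has to be derived. The following Python sketch shows the arithmetic, assuming that used plus free approximates total framebuffer memory (driver-reserved memory, if any, is not counted). The function name is illustrative.

```python
def framebuffer_utilization_pct(fb_used: float, fb_free: float) -> float:
    """Approximate GPU framebuffer utilization as a percentage.

    fb_used: sample of DCGM_FI_DEV_FB_USED
    fb_free: sample of DCGM_FI_DEV_FB_FREE
    Both samples must use the same unit; the unit cancels out in the ratio.
    """
    if fb_used < 0 or fb_free < 0:
        raise ValueError("framebuffer samples must be non-negative")
    total = fb_used + fb_free  # assumption: reserved memory is ignored
    if total == 0:
        raise ValueError("used + free is zero; no framebuffer data")
    return 100.0 * fb_used / total


# Example: 30 units used and 10 units free give 75% utilization.
utilization = framebuffer_utilization_pct(30.0, 10.0)
```

Because the result is a ratio, the calculation works regardless of the unit in which the two samples are expressed, as long as both use the same one.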

Node (instance) metrics

These metrics are collected from the node exporter. They provide instance‐level visibility into CPU, memory, and disk utilization.

| Metric | Type | Description |
| --- | --- | --- |
| node_cpu_seconds_total | Raw | CPU time per mode per core (seconds) |
| node_memory_MemTotal_bytes | Raw | Total memory (bytes) |
| node_memory_MemAvailable_bytes | Raw | Available memory (bytes) |
| node_filesystem_size_bytes | Raw | Filesystem total size (bytes) |
| node_filesystem_avail_bytes | Raw | Filesystem available space (bytes) |
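
The node metrics are raw totals and counters, so utilization percentages are derived from pairs of values. The following Python sketch shows the usual arithmetic: memory and disk utilization from total and available bytes, and CPU utilization from the change in the per-mode `node_cpu_seconds_total` counters between two samples. The function names are illustrative, and treating only the `idle` mode as unused time is an assumption (some dashboards also exclude `iowait`).

```python
from typing import Mapping


def used_pct(total_bytes: float, available_bytes: float) -> float:
    """Percentage used, given a total and an available amount in bytes.

    Memory: node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes.
    Disk:   node_filesystem_size_bytes and node_filesystem_avail_bytes.
    """
    if total_bytes <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total_bytes - available_bytes) / total_bytes


def cpu_utilization_pct(before: Mapping[str, float], after: Mapping[str, float]) -> float:
    """CPU utilization between two samples of node_cpu_seconds_total.

    before/after map a CPU mode (for example "idle" or "user") to cumulative
    seconds, already summed over the cores of interest.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    elapsed = sum(deltas.values())
    if elapsed <= 0:
        raise ValueError("no CPU time elapsed between samples")
    # Assumption: only the "idle" mode counts as unused CPU time.
    return 100.0 * (elapsed - deltas.get("idle", 0.0)) / elapsed


memory_pct = used_pct(16e9, 4e9)
cpu_pct = cpu_utilization_pct(
    {"idle": 100.0, "user": 20.0, "system": 10.0},
    {"idle": 160.0, "user": 50.0, "system": 20.0},
)
```

In this example, 4 GB available out of 16 GB gives 75% memory utilization, and 40 non-idle seconds out of 100 elapsed CPU-seconds gives 40% CPU utilization.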

HA and IC placement metrics

| Metric | Type | Description |
| --- | --- | --- |
| ICCopyCountPerAZ | Aggregate | IC copies per AZ |
| ICCopiesPerAZ | Aggregate | IC copies per AZ (alternative view) |
| AZSkewScore | Aggregate | AZ imbalance score |
| AZBalanceScore | Aggregate | Endpoint-level balance |
| ICDensityPerAZ | Aggregate | IC density per AZ |
| ICCopiesPerInstance | Aggregate | IC copies per instance |
| ICListPerInstance | Aggregate | Which ICs on which instance |

Lifecycle and provisioning metrics

| Metric | Type | Description |
| --- | --- | --- |
| ModelDownloadTime | Raw | Time to download the model from Amazon S3 to the instance |
| GPULoadTime | Raw | Time to load model weights into GPU memory |
| ContainerStartDuration | Raw | Time from container start to health check |
| ColdStartDuration | Raw | End-to-end time from creation to first invocation |

Metrics available only as CloudWatch classic metrics

The following metrics are available only as CloudWatch classic metrics (namespace and dimension model) and require OTel enrichment to appear as OpenTelemetry metrics.

| Metric | Notes |
| --- | --- |
| InvocationsPerIC | Blocked from direct OpenTelemetry emission |
| InvocationsPerCopy | Blocked |
| ModelLatencyPerIC | Blocked; use vllm:e2e_request_latency_seconds as the OTel alternative |
| OverheadLatencyPerIC | Blocked |
| MidStreamErrorsPerIC | Blocked (bidirectional streaming only) |