SageMaker AI Insights dashboard - Amazon CloudWatch

SageMaker AI Insights dashboard

The SageMaker AI Insights dashboard is organized into three tabs: Performance, Capacity, and Reliability. Each tab provides a focused view of your inference endpoint health.

Dashboard layout

Summary bar

The summary bar at the top of the dashboard shows fleet‐wide totals.

Figure: SageMaker AI Insights dashboard header with summary bar showing Invocations, Instances, Inference Components, and Avg AZ Skew.

| Metric | Description |
|---|---|
| Invocations | Total invocations across all endpoints in the selected time range |
| Instances | Total number of instances currently serving traffic |
| Inference Components | Total number of inference components across all endpoints |
| Avg AZ Skew | Average availability zone distribution imbalance (0% = perfectly balanced) |
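To make the Avg AZ Skew metric concrete, here is a minimal Python sketch of one way such an imbalance figure could be computed. The formula (spread between the fullest and emptiest zone, relative to the fullest) and the function names are assumptions for illustration; this page does not state how the dashboard derives the value.

```python
from statistics import mean


def az_skew_percent(instance_azs, zones):
    """Assumed skew formula: 0.0 when every zone hosts the same number of instances."""
    counts = [instance_azs.count(zone) for zone in zones]
    if not counts or max(counts) == 0:
        return 0.0
    return 100.0 * (max(counts) - min(counts)) / max(counts)


def average_az_skew(endpoints, zones):
    """Mean of the per-endpoint skew across the fleet."""
    if not endpoints:
        return 0.0
    return mean(az_skew_percent(azs, zones) for azs in endpoints.values())


# Hypothetical fleet: endpoint name -> AZ of each instance.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
fleet = {
    "endpoint-a": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "endpoint-b": ["us-east-1a", "us-east-1a", "us-east-1a", "us-east-1b"],
}
print(average_az_skew(fleet, zones))  # 50.0
```

Passing the expected zones explicitly lets an endpoint that sits entirely in one zone register as fully skewed instead of perfectly balanced.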

Filters panel

| Filter | Description |
|---|---|
| Endpoint | Select a specific endpoint or All endpoints for the fleet view |
| Instance | Select a specific instance (requires an endpoint selected first) |
| Inference component | Select a specific inference component (IC) for granular filtering |

Drill-down

SageMaker AI Insights supports progressive drill‐down.

  1. Fleet level (default)—all endpoints visible, summary metrics aggregated

  2. Endpoint level—select an endpoint from the filter panel or choose an endpoint link in any table

  3. IC level—select an inference component from the filter panel

Cross-linking with the SageMaker AI console

  • Choose an endpoint name to open the endpoint detail page in the SageMaker AI console.

  • Choose View logs to open CloudWatch Logs filtered to that endpoint or IC.

  • From the SageMaker AI console, choose View in SageMaker AI Insights, or choose Metrics in a per‐IC row, to open the dashboard.

Accessing the dashboard

You can access the SageMaker AI Insights dashboard from multiple locations in the console.

  • From the Endpoints list page: Choose View in SageMaker AI Insights to open the dashboard at the fleet level with no filters applied.

    Figure: Endpoints list page showing the View in SageMaker AI Insights button.

  • From the endpoint detail page: Choose View in SageMaker AI Insights to open the dashboard filtered to that endpoint.

    Figure: Endpoint detail page showing the View in SageMaker AI Insights button.

  • From the inference component detail page: Choose View in SageMaker AI Insights to open the dashboard filtered to that endpoint and inference component.

    Figure: Inference component detail page showing the View in SageMaker AI Insights button.

  • Direct navigation: CloudWatch console → Infrastructure monitoring → SageMaker AI Insights.

Performance tab

The Performance tab answers "is it healthy?" and "why is it slow?"—flowing from fleet‐wide health at the top to per‐IC diagnostics at the bottom.

Performance health

Honeycombs grouped by AZ showing alarm state for instances, IC copies, and endpoints at a glance.

Figure: Performance health honeycombs showing alarm state by AZ.

Instance performance table

Per‐instance breakdown showing time to first token (TTFT, P50/P99), output tokens per second (TPS, avg/max), concurrent requests (live/max), and KV cache utilization.

Figure: Instance performance table showing TTFT, output TPS, concurrent requests, and KV cache.

Token streaming

Time‐series chart showing TTFT and inter‐token latency (ITL) with a P50/P99 toggle, broken down by framework.
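As a rough illustration of what the two streaming metrics measure, the Python sketch below derives TTFT and ITL from token arrival timestamps and then takes P50/P99 over a set of samples. The function names and the millisecond timestamps are hypothetical, and the percentile method is an assumption; the dashboard's own aggregation may differ.

```python
from statistics import quantiles


def ttft_and_itl(request_start, token_times):
    """Return (TTFT, list of inter-token gaps) for one streamed response."""
    ttft = token_times[0] - request_start
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


def percentile(values, p):
    """p-th percentile (1-99) by inclusive linear interpolation; needs at least 2 values."""
    return quantiles(values, n=100, method="inclusive")[p - 1]


# Hypothetical timestamps in milliseconds: request sent at 1000, tokens arrive later.
ttft, itl = ttft_and_itl(1000, [1250, 1300, 1360])
print(ttft, itl)  # 250 [50, 60]

# Hypothetical TTFT samples (ms) collected across many requests.
samples = [10, 20, 30, 40, 50]
print(percentile(samples, 50), percentile(samples, 99))  # 30.0 49.6
```

The gap between P50 and P99 is what the toggle exposes: a healthy median with a high P99 points to tail latency affecting a minority of requests.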

Figure: Token streaming chart showing TTFT and ITL P50/P99 over time.

Token throughput

Input and output tokens per second by framework, with toggles for Input/Output, Percentiles, and By instance views.

Figure: Token throughput chart showing input and output tokens per second.

Engine and request pressure

KV cache utilization (%), running requests, and waiting requests over time—key saturation signals for the inference engine.
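One way to read these three signals together is sketched below in Python. The heuristic (requests waiting while the KV cache is nearly full) and the 90% threshold are assumptions chosen for illustration, not values taken from the dashboard or its alarms.

```python
from dataclasses import dataclass


@dataclass
class EngineSample:
    kv_cache_pct: float  # KV cache utilization, 0-100
    running: int         # requests currently being processed
    waiting: int         # requests queued for a slot


def is_saturated(sample, kv_threshold_pct=90.0):
    """Assumed heuristic: requests are queuing while the KV cache is nearly full."""
    return sample.waiting > 0 and sample.kv_cache_pct >= kv_threshold_pct


def saturated_fraction(samples, kv_threshold_pct=90.0):
    """Share of samples in the window that meet the saturation heuristic."""
    if not samples:
        return 0.0
    return sum(is_saturated(s, kv_threshold_pct) for s in samples) / len(samples)


# Hypothetical samples over a time window.
window = [
    EngineSample(55.0, 8, 0),
    EngineSample(93.0, 16, 4),
    EngineSample(97.5, 16, 9),
    EngineSample(92.0, 12, 0),
]
print(saturated_fraction(window))  # 0.5
```

A high KV cache reading with no waiting requests (the last sample) is not counted, since the engine is busy but not yet turning work away.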

Figure: Engine and request pressure panel showing KV cache, running requests, and waiting requests.

Traffic distribution

Per‐instance or per‐IC‐copy table showing invocations per minute, 4XX rate, and 5XX rate to identify routing imbalances.
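The Python sketch below shows the arithmetic behind such a table under stated assumptions: error rates as a percentage of invocations, and a routing imbalance flagged when an instance's invocation rate deviates from the fleet mean by more than a tolerance. The function names, the instance identifiers, and the 50% tolerance are hypothetical.

```python
def error_rates(invocations, errors_4xx, errors_5xx):
    """Return (4XX rate, 5XX rate) as percentages of total invocations."""
    if invocations == 0:
        return 0.0, 0.0
    return 100.0 * errors_4xx / invocations, 100.0 * errors_5xx / invocations


def traffic_outliers(invocations_per_minute, tolerance=0.5):
    """Names whose rate differs from the fleet mean by more than tolerance * mean."""
    if not invocations_per_minute:
        return []
    fleet_mean = sum(invocations_per_minute.values()) / len(invocations_per_minute)
    return sorted(
        name
        for name, rate in invocations_per_minute.items()
        if abs(rate - fleet_mean) > tolerance * fleet_mean
    )


print(error_rates(2000, 30, 4))  # (1.5, 0.2)

# Hypothetical per-instance invocation rates.
rates = {"i-1": 100, "i-2": 100, "i-3": 10}
print(traffic_outliers(rates))  # ['i-3']
```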

Figure: Traffic distribution table showing invocations per minute, 4XX rate, and 5XX rate by instance.

Error mix over time

Line chart showing 4XX, 5XX, and mid‐stream error rates over time per IC.

Figure: Error mix over time line chart showing 4XX, 5XX, and mid-stream errors.

Latency breakdown over time

Stacked area chart with tabs for Invoke (model latency + overhead latency) and Streaming (first chunk model + first chunk overhead), with a P50/P90 toggle.
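For readers who want the same two Invoke components outside the dashboard, the Python sketch below builds CloudWatch `get_metric_data` queries for the `ModelLatency` and `OverheadLatency` metrics in the `AWS/SageMaker` namespace. It assumes an endpoint variant (not an inference component) and that these are the metrics behind the Invoke tab, which this page does not confirm; the endpoint and variant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def latency_queries(endpoint_name, variant_name, stat="p50", period=60):
    """Build one CloudWatch query per latency component for a single endpoint variant."""
    dimensions = [
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ]
    return [
        {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SageMaker",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period,  # seconds per data point
                "Stat": stat,      # for example "p50" or "p90"
            },
            "ReturnData": True,
        }
        for query_id, metric_name in (
            ("model", "ModelLatency"),
            ("overhead", "OverheadLatency"),
        )
    ]


def fetch_latency(endpoint_name, variant_name, stat="p50", hours=1):
    """Fetch both series; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the query builder works without AWS access

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=latency_queries(endpoint_name, variant_name, stat),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return {
        result["Id"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }


# Hypothetical names; only the query structure is built here.
queries = latency_queries("my-endpoint", "AllTraffic", stat="p90")
print([q["MetricStat"]["Metric"]["MetricName"] for q in queries])
```

Both metrics are reported in microseconds, so stacking the two series gives the per-request time attributable to the model and to SageMaker AI overhead.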

Figure: Latency breakdown over time stacked area chart showing model latency and overhead latency.

Capacity tab

The Capacity tab answers "do I have headroom?" and "is my hardware healthy?"—showing actual utilization compared to reserved capacity.

Capacity health

The same honeycomb visualization as the Performance tab, showing alarm state for instances, IC copies, and endpoints.

Figure: Capacity health honeycombs showing alarm state by AZ.

Instance capacity table

Per‐instance utilization bars for GPU, GPU memory, CPU, memory, and disk.
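A simple way to turn these utilization bars into a headroom answer is to find the resource closest to its limit. The Python sketch below does that under the assumption that headroom is 100% minus utilization; the function name and the sample readings are hypothetical, and the dashboard itself compares against reserved capacity.

```python
def tightest_resource(utilization_pct):
    """Return (resource name, remaining headroom in percentage points) for the busiest resource."""
    name = max(utilization_pct, key=utilization_pct.get)
    return name, 100.0 - utilization_pct[name]


# Hypothetical readings for one instance.
instance = {
    "gpu": 72.0,
    "gpu_memory": 91.5,
    "cpu": 35.0,
    "memory": 48.0,
    "disk": 20.0,
}
print(tightest_resource(instance))  # ('gpu_memory', 8.5)
```

In this example GPU compute still has room, but GPU memory is the binding constraint, which is the kind of distinction the per-resource bars are meant to surface.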

Figure: Instance capacity table showing GPU, GPU memory, CPU, memory, and disk utilization.

Fleet utilization

Time‐series showing CPU, GPU, GPU memory, memory, and disk utilization per instance, with toggles for Instance, IC copies, and Endpoint views.

Figure: Fleet utilization chart showing CPU, GPU, GPU memory, memory, and disk utilization per instance over time.

Reliability tab

The Reliability tab answers "is it resilient?" and "why did scaling fail?"—covering AZ distribution, scaling behavior, and provisioning events.

Availability Zone distribution

Bar chart showing instance or IC copy count per AZ to validate high availability (HA) compliance.

Figure: Availability Zone distribution bar chart showing instance count per AZ.

Cold start anatomy

Horizontal stacked bar showing the breakdown of provisioning time into model download, GPU load, container start, and platform overhead (IC endpoints only).
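The stacked bar is effectively a share-of-total calculation over the four phases. The Python sketch below shows that arithmetic; the phase keys and the durations are invented for illustration and are not measurements from the service.

```python
def phase_shares(phase_seconds):
    """Return each provisioning phase as a percentage of total cold start time."""
    total = sum(phase_seconds.values())
    if total == 0:
        return {phase: 0.0 for phase in phase_seconds}
    return {phase: 100.0 * seconds / total for phase, seconds in phase_seconds.items()}


# Hypothetical durations in seconds for one IC cold start.
cold_start = {
    "model_download": 120,
    "gpu_load": 45,
    "container_start": 25,
    "platform_overhead": 10,
}
print(phase_shares(cold_start))
# {'model_download': 60.0, 'gpu_load': 22.5, 'container_start': 12.5, 'platform_overhead': 5.0}
```

Seeing which phase dominates indicates where to look first: a large model download share suggests artifact size or storage throughput, while a large container start share points at the image.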

Figure: Cold start anatomy showing model download, GPU load, container start, and platform overhead.

ICE diagnostics

Insufficient Capacity Error (ICE) count over time with an event table showing time, endpoint, failed instance type, and failed AZ. Non‐zero values indicate capacity constraints.
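If the event table is exported, the same two views (count over time, and which instance type and AZ fail most often) can be reproduced with a few lines of Python. The record fields below mirror the table columns named above, but the field names, the hourly bucket size, and the sample events are assumptions.

```python
from collections import Counter
from datetime import datetime


def ice_counts_per_hour(events):
    """Count ICE events in each clock hour."""
    return Counter(
        event["time"].replace(minute=0, second=0, microsecond=0) for event in events
    )


def ice_hot_spots(events):
    """Count ICE events per (instance type, AZ) pair, most frequent first."""
    return Counter((event["instance_type"], event["az"]) for event in events).most_common()


# Hypothetical events shaped like the rows of the event table.
events = [
    {"time": datetime(2025, 1, 6, 9, 5), "endpoint": "endpoint-a",
     "instance_type": "ml.g5.2xlarge", "az": "us-east-1a"},
    {"time": datetime(2025, 1, 6, 9, 40), "endpoint": "endpoint-b",
     "instance_type": "ml.g5.2xlarge", "az": "us-east-1a"},
    {"time": datetime(2025, 1, 6, 11, 15), "endpoint": "endpoint-a",
     "instance_type": "ml.g5.2xlarge", "az": "us-east-1b"},
]
print(ice_counts_per_hour(events)[datetime(2025, 1, 6, 9)])  # 2
print(ice_hot_spots(events)[0])  # (('ml.g5.2xlarge', 'us-east-1a'), 2)
```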

Figure: ICE diagnostics panel showing ICE count over time and an event table.