Running real-time online inference workloads on Amazon EKS

This section helps you deploy and operate real-time online inference workloads on Amazon Elastic Kubernetes Service (Amazon EKS). It covers building optimized clusters with GPU-accelerated nodes, integrating Amazon services for storage and autoscaling, and deploying sample models for validation. It also discusses key architectural considerations, such as decoupling CPU and GPU tasks, selecting appropriate AMIs and instance types, and exposing inference endpoints with low latency.
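As a minimal sketch of what such a workload can look like, the following Kubernetes manifest deploys a hypothetical inference server onto GPU nodes and exposes it inside the cluster. All names, the container image, the instance type, and the port numbers are placeholder assumptions; the `nvidia.com/gpu` resource request assumes the NVIDIA device plugin is installed on the GPU nodes.

```yaml
# Hypothetical example -- names, image, instance type, and ports are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      nodeSelector:
        # Schedule onto GPU-accelerated nodes (example instance type).
        node.kubernetes.io/instance-type: g5.xlarge
      containers:
        - name: model
          image: <your-inference-image>   # placeholder for your model-serving image
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1   # requires the NVIDIA device plugin on the node
---
apiVersion: v1
kind: Service
metadata:
  name: inference-endpoint
spec:
  type: ClusterIP
  selector:
    app: inference-server
  ports:
    - port: 80
      targetPort: 8080
```

For production traffic you would typically front this Service with a load balancer or ingress controller rather than a `ClusterIP` Service, but the sketch shows the basic shape: GPU resource requests keep inference pods on accelerated nodes, while a Service provides a stable, low-latency endpoint for clients inside the cluster.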