
Using Amazon ElastiCache for Valkey for semantic caching

Large language models (LLMs) are the foundation for generative AI and agentic AI applications that power use cases from chatbots and search assistants to code generation tools and recommendation engines. As the use of AI applications in production grows, customers seek ways to optimize cost and performance. Most AI applications invoke the LLM for every user query, even when queries are repeated or semantically similar. Semantic caching reduces cost and latency in generative AI applications by using vector embeddings to detect identical or semantically similar requests and serve a previously generated response instead of invoking the LLM again.
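The core idea can be sketched in a few lines. The example below is a minimal, self-contained illustration only: it uses a toy hashed bag-of-words embedding and an in-memory list in place of a real embedding model and ElastiCache for Valkey's vector search, and the class and threshold are hypothetical names chosen for this sketch.

```python
import hashlib
import math


def embed(text):
    """Toy embedding: a normalized, hashed bag-of-words vector.
    A real application would call an embedding model instead."""
    vec = [0.0] * 64
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % 64
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class SemanticCache:
    """In-memory stand-in for a vector index. A production system would
    store the embeddings in ElastiCache for Valkey and query them with
    vector search rather than scanning a Python list."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, cached response)

    def get(self, query):
        q = embed(query)
        best_response, best_score = None, 0.0
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_response, best_score = response, score
        # Only reuse the response when similarity clears the threshold;
        # otherwise the caller falls back to invoking the LLM.
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

With this sketch, a repeated or reworded query that lands above the similarity threshold returns the cached response without an LLM call, while a dissimilar query misses and would fall through to the model.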

This topic explains how to implement a semantic cache using vector search on Amazon ElastiCache for Valkey, covering the underlying concepts, architecture, implementation, benchmarks, and best practices.