Overview of semantic caching
Unlike traditional caches that rely on exact string matches, a semantic cache retrieves data based on semantic similarity. It uses vector embeddings produced by models such as Amazon Titan Text Embeddings to capture semantic meaning in a high-dimensional vector space.
In generative AI applications, a semantic cache stores vector representations of queries and their corresponding responses. The system compares the vector embedding of each new query against cached vectors of prior queries to determine if a similar query has been answered before. If the cache contains a similar query above a configured similarity threshold, the system returns the previously generated response instead of invoking the LLM. Otherwise, the system invokes the LLM to generate a response and caches the query embedding and response together for future reuse.
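The lookup-then-store flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `embed_fn` stands in for a real embedding model (such as Amazon Titan Text Embeddings), the linear scan would be replaced by a vector index in practice, and the class and parameter names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Stores (query embedding, response) pairs and serves a cached
    response when a new query's embedding is similar enough to one
    of them."""

    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn    # maps text -> embedding vector
        self.threshold = threshold  # similarity cutoff for a cache hit
        self.entries = []           # list of (embedding, response)

    def lookup(self, query):
        """Return the response cached for the most similar prior
        query, or None on a cache miss."""
        vec = self.embed_fn(query)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine_similarity(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        """Cache the query embedding together with the response."""
        self.entries.append((self.embed_fn(query), response))
```

An application would call `lookup` first and invoke the LLM only on a miss, then `store` the new query and response for future reuse.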
Why semantic, not exact match?
Consider an IT help chatbot where thousands of users ask the same question. The following queries are different strings but carry the same meaning:
"How do I install the VPN app on my laptop?"
"Can you guide me through setting up the company VPN?"
"Steps to get VPN working on my computer"
An exact-match cache treats each query as unique and invokes the LLM three times. A semantic cache recognizes these queries as semantically equivalent and returns the cached response for all three, invoking the LLM only once.
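The difference in LLM invocation counts can be simulated directly. In this sketch, keyword-set overlap is a deliberately crude stand-in for embedding similarity (a real system would compare embedding vectors as described earlier), and the `STOPWORDS` set and `THRESHOLD` value are illustrative assumptions.

```python
queries = [
    "How do I install the VPN app on my laptop?",
    "Can you guide me through setting up the company VPN?",
    "Steps to get VPN working on my computer",
]

STOPWORDS = {"how", "do", "i", "the", "on", "my", "can", "you", "me",
             "through", "up", "to", "get"}

def keywords(text):
    """Normalize a query to a set of content words."""
    return {w.strip("?.,").lower() for w in text.split()} - STOPWORDS

def overlap(a, b):
    """Fraction of the smaller keyword set shared with the other."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

# Exact-match cache: every distinct string is a miss.
exact_cache = {}
exact_llm_calls = 0
for q in queries:
    if q not in exact_cache:
        exact_llm_calls += 1                 # miss -> invoke the LLM
        exact_cache[q] = "VPN setup instructions"

# Toy semantic cache: a query is a hit if its keyword overlap with
# any cached query clears the threshold.
THRESHOLD = 0.2
semantic_cache = []                          # (keyword set, response)
semantic_llm_calls = 0
for q in queries:
    kws = keywords(q)
    if not any(overlap(kws, cached) >= THRESHOLD
               for cached, _ in semantic_cache):
        semantic_llm_calls += 1              # miss -> invoke the LLM
        semantic_cache.append((kws, "VPN setup instructions"))

print(exact_llm_calls, semantic_llm_calls)   # prints "3 1"
```

All three queries share the keyword "vpn", so the toy semantic cache serves two of them from the entry created by the first, while the exact-match cache misses on all three.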
Key benefits
Semantic caching provides the following benefits for generative AI and agentic AI applications:
Reduced costs – Reusing answers for similar questions reduces the number of LLM calls and overall inference spend. In benchmarks, semantic caching reduced LLM inference cost by up to 86%.
Lower latency – Serving answers from the cache provides faster responses than running LLM inference. Cache hits return responses in milliseconds rather than seconds, achieving up to 88% latency reduction.
Improved scalability – Reducing LLM calls for similar or repeated queries enables you to serve more requests within the same model throughput limits without increasing capacity.
Improved consistency – Using the same cached response for semantically similar requests helps deliver a consistent answer for the same underlying question.
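The cost benefit follows directly from the cache hit rate. The following arithmetic sketch is illustrative only: the request volume, per-call price, and hit rate are assumed numbers, and `effective_cost` is a hypothetical helper, not part of any library.

```python
def effective_cost(requests, hit_rate, llm_cost, cache_cost=0.0):
    """Blended cost of serving `requests` queries when a fraction
    `hit_rate` is answered from the cache instead of the LLM."""
    misses = requests * (1 - hit_rate)
    hits = requests * hit_rate
    return misses * llm_cost + hits * cache_cost

# Assumed numbers: 10,000 requests at $0.01 per LLM call, with a
# negligible per-hit cache cost.
baseline = effective_cost(10_000, hit_rate=0.0, llm_cost=0.01)
with_cache = effective_cost(10_000, hit_rate=0.86, llm_cost=0.01)

reduction = 1 - with_cache / baseline
```

When the per-hit cache cost is negligible, the cost reduction approximately equals the hit rate, so an 86% cost reduction corresponds to roughly 86% of queries being served from the cache. The latency story is analogous: only misses pay the seconds-scale LLM inference time.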
Where semantic caching is effective
Semantic caching is particularly effective for the following types of applications:
| Application type | Description | Example |
|---|---|---|
| RAG-based assistants and copilots | Many queries are duplicate requests from different users against a shared knowledge base | IT help chatbot, product FAQ bot, documentation assistant |
| Agentic AI applications | Agents break tasks into multiple small steps that may repeatedly look up similar information | Compliance agent reusing policy lookups, research agent reusing prior findings |
| Multimodal applications | Matching similar audio segments, images, or video queries | Automated phone systems reusing guidance for repeated requests like store hours |