Open Source
Semantic Cache
Meaning-based caching for LLM queries
01
The Problem
Traditional caching matches on exact strings. Ask an LLM “What is the capital of France?” and the response gets cached — but “What’s France’s capital city?” is a cache miss, triggering another expensive API call for the same answer.
In production LLM applications this adds up fast: duplicate API costs, unnecessary latency, and wasted compute for semantically identical queries.
02
How It Works
Every incoming query is embedded into a 1024-dimensional vector using VoyageAI. MongoDB Atlas Vector Search then compares that vector against the embeddings of previously cached queries to find ones with similar meaning. If the similarity exceeds a configurable threshold, the cached response is returned instantly — no LLM call needed.
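For concreteness, a minimal setup sketch: the cache collection needs an Atlas Vector Search index over the stored embeddings, sized to the 1024 dimensions that voyage-3.5 produces. The database, collection, and index names below are illustrative assumptions, not prescribed by the project.

```ts
import { MongoClient } from "mongodb";

// Illustrative names: database "semantic_cache", collection "entries",
// index "vector_index". The only hard requirement from the description is a
// 1024-dimension vector field matching the VoyageAI embeddings.
const client = new MongoClient(process.env.MONGODB_URI!);
const cache = client.db("semantic_cache").collection("entries");

// Create the Atlas Vector Search index that lookups run against.
await cache.createSearchIndex({
  name: "vector_index",
  type: "vectorSearch",
  definition: {
    fields: [
      {
        type: "vector",
        path: "embedding",    // where each cached query's vector is stored
        numDimensions: 1024,  // voyage-3.5 embedding size
        similarity: "cosine",
      },
    ],
  },
});
```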
On a cache miss, the query goes to the LLM provider, and the response is stored alongside its embedding for future lookups. The result is a cache that understands meaning, not just characters.
- Embed — Convert the query to a vector via VoyageAI (voyage-3.5).
- Search — Find semantically similar cached queries with MongoDB Atlas Vector Search.
- Hit or miss — Above the threshold, return the cached response. Below it, query the LLM and cache the result.
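Putting the three steps together, here is a rough TypeScript sketch of the lookup path. The cachedCompletion helper, the 0.92 threshold, the collection and index names, and the gpt-4o-mini fallback model are illustrative assumptions rather than the project's actual API; the sketch calls the VoyageAI embeddings REST endpoint directly and uses the Vercel AI SDK's generateText on a miss.

```ts
import { MongoClient } from "mongodb";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Illustrative values: same collection and index as in the setup sketch above.
const THRESHOLD = 0.92;
const client = new MongoClient(process.env.MONGODB_URI!);
const cache = client.db("semantic_cache").collection("entries");

// Step 1: embed the query with VoyageAI (voyage-3.5 returns 1024-dim vectors).
async function embedQuery(query: string): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    body: JSON.stringify({ input: [query], model: "voyage-3.5" }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}

// Steps 2 and 3: search for a semantically similar cached query,
// then either return the hit or fall back to the LLM and cache the result.
async function cachedCompletion(query: string): Promise<string> {
  const embedding = await embedQuery(query);

  // Atlas Vector Search: nearest cached query by cosine similarity.
  const [hit] = await cache
    .aggregate([
      {
        $vectorSearch: {
          index: "vector_index",
          path: "embedding",
          queryVector: embedding,
          numCandidates: 100,
          limit: 1,
        },
      },
      { $project: { response: 1, score: { $meta: "vectorSearchScore" } } },
    ])
    .toArray();

  // Cache hit: similar enough, return the stored response without an LLM call.
  if (hit && hit.score >= THRESHOLD) return hit.response;

  // Cache miss: ask the LLM, then store the response alongside its embedding.
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: query,
  });
  await cache.insertOne({ query, embedding, response: text, createdAt: new Date() });
  return text;
}

// Usage: once the first call is cached, the second resolves from the cache.
console.log(await cachedCompletion("What is the capital of France?"));
console.log(await cachedCompletion("What's France's capital city?"));
```

The threshold is the main tuning knob: raise it and only near-duplicates match; lower it and loosely related queries may start sharing an answer.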
03
Multi-Provider & Structured Output
Built on the Vercel AI SDK, the cache works with any supported LLM provider — OpenAI, Anthropic, Google Gemini, Mistral, AWS Bedrock, or Azure. Swap providers without changing your caching logic.
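As an illustration, the miss path only needs an AI SDK LanguageModel, so a provider swap is a one-argument change. The answerWithModel helper below is hypothetical; the embed, search, and threshold steps from the earlier sketch would sit in front of it unchanged.

```ts
import { generateText, type LanguageModel } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

// Hypothetical helper: the caching logic never inspects the provider,
// it just hands the model to the AI SDK on a cache miss.
async function answerWithModel(model: LanguageModel, prompt: string): Promise<string> {
  const { text } = await generateText({ model, prompt });
  return text;
}

// Same caching logic, different providers: only the model argument changes.
await answerWithModel(openai("gpt-4o-mini"), "What is the capital of France?");
await answerWithModel(anthropic("claude-3-5-haiku-latest"), "What's France's capital city?");
```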
Structured output is first-class. Define response schemas with Zod, and the cache stores and returns typed objects, not raw strings. This makes it straightforward to integrate with downstream systems that expect validated data shapes.
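A hedged sketch of what that can look like with the AI SDK's generateObject: the CapitalAnswer schema below is invented for illustration, and in a real cache the validated object, not a raw string, is what gets stored alongside the query's embedding and returned on a hit.

```ts
import { z } from "zod";
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical schema describing the shape downstream code expects.
const CapitalAnswer = z.object({
  country: z.string(),
  capital: z.string(),
  population: z.number().optional(),
});

// On a miss, generate a typed object; on a hit, the cached document already
// conforms to the schema, so callers see the same validated shape either way.
const { object } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: CapitalAnswer,
  prompt: "What is the capital of France?",
});

// object is typed as { country: string; capital: string; population?: number }
console.log(object.capital);
```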