Open Source
Semantic Cache
Meaning-based caching for LLM queries
01
The Problem
Traditional caching matches on exact strings. Ask an LLM “What is the capital of France?” and the response gets cached — but “What’s France’s capital city?” is a cache miss, triggering another expensive API call for the same answer.
In production LLM applications this adds up fast: duplicate API costs, unnecessary latency, and wasted compute for semantically identical queries.
02
How It Works
Every incoming query is embedded into a 1024-dimensional vector using VoyageAI. MongoDB Atlas Vector Search then compares that vector against the embeddings of previously cached queries to find ones with similar meaning. If the similarity exceeds a configurable threshold, the cached response is returned instantly — no LLM call needed.
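For concreteness, a minimal setup sketch: the cache collection needs an Atlas Vector Search index over the stored embeddings, sized to the 1024 dimensions that voyage-3.5 produces. The database, collection, and index names below are illustrative assumptions, not prescribed by the project.

```ts
import { MongoClient } from "mongodb";

// Illustrative names: database "semantic_cache", collection "entries",
// index "vector_index". The only hard requirement from the description is a
// 1024-dimension vector field matching the VoyageAI embeddings.
const client = new MongoClient(process.env.MONGODB_URI!);
const cache = client.db("semantic_cache").collection("entries");

// Create the Atlas Vector Search index that lookups run against.
await cache.createSearchIndex({
  name: "vector_index",
  type: "vectorSearch",
  definition: {
    fields: [
      {
        type: "vector",
        path: "embedding",    // where each cached query's vector is stored
        numDimensions: 1024,  // voyage-3.5 embedding size
        similarity: "cosine",
      },
    ],
  },
});
```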
On a cache miss, the query goes to the LLM provider, and the response is stored alongside its embedding for future lookups. The result is a cache that understands meaning, not just characters.
- Embed — Convert the query to a vector via VoyageAI (voyage-3.5).
- Search — Find semantically similar cached queries with MongoDB Atlas Vector Search.
- Hit or miss — Above the threshold, return the cached response. Below it, query the LLM and cache the result.
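Putting the three steps together, here is a rough TypeScript sketch of the lookup path. The cachedCompletion helper, the 0.92 threshold, the collection and index names, and the gpt-4o-mini fallback model are illustrative assumptions rather than the project's actual API; the sketch calls the VoyageAI embeddings REST endpoint directly and uses the Vercel AI SDK's generateText on a miss.

```ts
import { MongoClient } from "mongodb";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Illustrative values: same collection and index as in the setup sketch above.
const THRESHOLD = 0.92;
const client = new MongoClient(process.env.MONGODB_URI!);
const cache = client.db("semantic_cache").collection("entries");

// Step 1: embed the query with VoyageAI (voyage-3.5 returns 1024-dim vectors).
async function embedQuery(query: string): Promise<number[]> {
  const res = await fetch("https://api.voyageai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    body: JSON.stringify({ input: [query], model: "voyage-3.5" }),
  });
  const json = (await res.json()) as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}

// Steps 2 and 3: search for a semantically similar cached query,
// then either return the hit or fall back to the LLM and cache the result.
async function cachedCompletion(query: string): Promise<string> {
  const embedding = await embedQuery(query);

  // Atlas Vector Search: nearest cached query by cosine similarity.
  const [hit] = await cache
    .aggregate([
      {
        $vectorSearch: {
          index: "vector_index",
          path: "embedding",
          queryVector: embedding,
          numCandidates: 100,
          limit: 1,
        },
      },
      { $project: { response: 1, score: { $meta: "vectorSearchScore" } } },
    ])
    .toArray();

  // Cache hit: similar enough, return the stored response without an LLM call.
  if (hit && hit.score >= THRESHOLD) return hit.response;

  // Cache miss: ask the LLM, then store the response alongside its embedding.
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: query,
  });
  await cache.insertOne({ query, embedding, response: text, createdAt: new Date() });
  return text;
}

// Usage: once the first call is cached, the second resolves from the cache.
console.log(await cachedCompletion("What is the capital of France?"));
console.log(await cachedCompletion("What's France's capital city?"));
```

The threshold is the main tuning knob: raise it and only near-duplicates match; lower it and loosely related queries may start sharing an answer.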
03
Multi-Provider & Structured Output
Built on the Vercel AI SDK, the cache works with any supported LLM provider — OpenAI, Anthropic, Google Gemini, Mistral, AWS Bedrock, or Azure. Swap providers without changing your caching logic.
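As an illustration, the miss path only needs an AI SDK LanguageModel, so a provider swap is a one-argument change. The answerWithModel helper below is hypothetical; the embed, search, and threshold steps from the earlier sketch would sit in front of it unchanged.

```ts
import { generateText, type LanguageModel } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

// Hypothetical helper: the caching logic never inspects the provider,
// it just hands the model to the AI SDK on a cache miss.
async function answerWithModel(model: LanguageModel, prompt: string): Promise<string> {
  const { text } = await generateText({ model, prompt });
  return text;
}

// Same caching logic, different providers: only the model argument changes.
await answerWithModel(openai("gpt-4o-mini"), "What is the capital of France?");
await answerWithModel(anthropic("claude-3-5-haiku-latest"), "What's France's capital city?");
```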
Structured output is first-class. Define response schemas with Zod, and the cache stores and returns typed objects, not raw strings. This makes it straightforward to integrate with downstream systems that expect validated data shapes.
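A hedged sketch of what that can look like with the AI SDK's generateObject: the CapitalAnswer schema below is invented for illustration, and in a real cache the validated object, not a raw string, is what gets stored alongside the query's embedding and returned on a hit.

```ts
import { z } from "zod";
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical schema describing the shape downstream code expects.
const CapitalAnswer = z.object({
  country: z.string(),
  capital: z.string(),
  population: z.number().optional(),
});

// On a miss, generate a typed object; on a hit, the cached document already
// conforms to the schema, so callers see the same validated shape either way.
const { object } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: CapitalAnswer,
  prompt: "What is the capital of France?",
});

// object is typed as { country: string; capital: string; population?: number }
console.log(object.capital);
```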