RAG From the Ground Up
How retrieval-augmented generation turns your data into context an LLM can actually use
01
The Problem with Vanilla LLMs
Ask a large language model about last quarter’s earnings, and it will confidently recite numbers — numbers it made up. Ask it to summarize your company’s internal design spec, and it will apologize: the document simply does not exist in its world. These are not edge cases. They are the baseline behavior of every general-purpose LLM in production today.
Three limitations sit at the root of the problem.
Knowledge cutoff. Every LLM has a training freeze date — a line in the calendar after which the model knows nothing. It does not know that it does not know. It will still answer, drawing on patterns that may no longer hold.
Hallucination. When an LLM lacks the information to answer correctly, it does not say “I don’t know.” It generates plausible-sounding text that has no grounding in fact. A 2024 study published in JMIR found that GPT-3.5 hallucinated 39.6% of generated references, and GPT-4 hallucinated 28.6%.
No access to private data. Your proprietary databases, internal wikis, customer records, legal contracts — none of it was part of the training corpus. The LLM cannot query it, reason over it, or even acknowledge it exists.
These are not bugs to be patched in the next model release. Knowledge cutoffs are structural — retraining is expensive and always lags reality. Hallucination is an emergent property of how language models generate text. And private data stays private by design.
The fix is not a better model. It is a better architecture.
In 2020, Lewis et al. at Meta AI published a paper that gave this architecture its name: Retrieval-Augmented Generation — RAG. The core idea is deceptively simple: instead of relying solely on what the model memorized during training, you retrieve relevant information from an external knowledge base at query time and inject it into the prompt as context. The LLM generates its response grounded in real, current, verified data — not its own parametric memory.
RAG does not replace the LLM. It gives the LLM something it never had: a source of truth.
02
The RAG Pipeline, End to End
A RAG system has two distinct phases that operate at different times, with different performance profiles and different failure modes.
Phase one runs offline. Before any user ever asks a question, you prepare your knowledge base. Raw documents — PDFs, database records, API responses, web pages, internal wikis — get loaded, split into smaller pieces called chunks, converted into vector embeddings, and stored in a vector database alongside the original text and any relevant metadata. This is the ingestion pipeline.
Phase two runs online. When a user submits a question, the system embeds the query using the same embedding model, searches the vector database for the most semantically similar chunks, injects those chunks into the LLM’s prompt as context, and generates a response. Under the hood, four distinct operations run in sequence — embed, retrieve, augment, generate.
The critical constraint between these two phases: the embedding model must be identical for both ingestion and query. Embeddings from different models live in incompatible vector spaces. Same model, same dimensions, same space — always.
Offline ingestion:
1. Documents: raw data sources — PDFs, databases, APIs, wikis.
2. Chunking: split into semantically meaningful segments (400–512 tokens).
3. Embedding: convert chunks into high-dimensional vectors.
4. Vector DB: store embeddings with original text and metadata.

Online query (same embedding model as ingestion):
1. User query: a natural-language question from the user.
2. Embedding: embed the query with the same model used during ingestion.
3. Vector search: find the top-k similar chunks via HNSW nearest-neighbor search.
4. Prompt assembly: inject retrieved chunks into the LLM prompt as context.
5. LLM response: generate a response grounded in the retrieved context.
That pipeline is the skeleton of every RAG system in production — from a weekend prototype to a platform serving millions of queries. The remaining sections zoom into each step: how to chunk your documents intelligently, how to retrieve the right chunks efficiently, how to structure the prompt so the LLM actually uses the context, and how to push beyond the basics when simple RAG is not enough.
One thing to notice: RAG does not require a specific LLM, a specific embedding model, or a specific vector database. It is a pattern, not a product. The architecture stays the same.
03
Chunking: Breaking Documents into Pieces
Why not just embed the entire document and search against it? Two reasons.
First, meaning gets diluted in large documents. A 50-page report about MongoDB performance tuning also contains a table of contents, legal disclaimers, author bios, and an appendix on licensing. Embed the whole thing, and that single vector becomes a blurry average of every topic in the document.
Second, context windows have limits. Even models with 128K-token windows perform better with focused, relevant context than with everything-and-the-kitchen-sink dumps.
The challenge is deciding where to cut. Three strategies dominate.
Fixed-size chunking splits text at rigid character or token boundaries — every 512 tokens, full stop. It is fast, predictable, and dead simple. It also has a nasty habit of slicing sentences in half.
Recursive chunking is smarter. It tries to split on paragraph boundaries first. If a paragraph is too long, it falls back to sentence boundaries. Then characters. This hierarchical approach — implemented as RecursiveCharacterTextSplitter in LangChain — preserves semantic coherence far better and is the recommended default for most use cases.
Semantic chunking takes a fundamentally different approach. Instead of counting characters, it uses an embedding model to measure similarity between consecutive sentences. When similarity drops below a threshold — indicating a topic shift — it cuts. This can improve retrieval accuracy by up to 70% over fixed-size chunking, at the cost of computational overhead.
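The mechanism is easy to illustrate. In this sketch, toy two-dimensional vectors stand in for real sentence embeddings, and the 0.5 threshold is illustrative:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    # Cut between consecutive sentences whenever their embedding
    # similarity drops below the threshold (a likely topic shift).
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

In a real pipeline the embeddings come from the same model used for retrieval, and the threshold is tuned on your own corpus rather than fixed.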
For chunk size, the consensus is tight: 400–512 tokens with 10–20% overlap. The overlap matters because if a key sentence straddles a chunk boundary, both chunks retain the complete thought.
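The overlap mechanics can be sketched in a few lines of Python. Whitespace words stand in for tokens here; a real pipeline would count with the embedding model's tokenizer:

```python
def chunk_words(text, chunk_size=512, overlap=50):
    # Fixed-size chunking with overlap: consecutive chunks share
    # `overlap` words, so a thought straddling a boundary survives
    # intact in at least one chunk.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk's last 50 words reappear at the start of the next chunk, which is exactly the boundary insurance described above.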
The chunk-size trade-off is worth internalizing. Too large and you retrieve noisy context — the LLM gets irrelevant information mixed with the answer. Too small and you lose context — a chunk might contain the answer to “what” but not the surrounding sentences that explain “why.”
A pragmatic hybrid exists: parent-child chunking. You index small chunks (256 tokens) for precise retrieval, but when a small chunk matches, you return its parent chunk (1024 tokens) to the LLM. The small chunk gave you high-precision matching; the parent gives the LLM enough surrounding context to generate a coherent answer.
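A sketch of the lookup, using a toy keyword-overlap score in place of vector similarity (the helper names are illustrative):

```python
def build_index(parents, child_size=4):
    # Split each parent into small child chunks; every child
    # remembers which parent it came from.
    index = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            child = " ".join(words[i:i + child_size])
            index.append({"text": child, "parent_id": pid})
    return index

def retrieve_parent(index, parents, query):
    # Match against the small chunks (precision), but hand the
    # LLM the full parent chunk (context).
    def score(child):
        return len(set(child["text"].lower().split()) & set(query.lower().split()))
    best = max(index, key=score)
    return parents[best["parent_id"]]
```

In production the child chunks are embedded and searched with the vector index; only the parent-pointer bookkeeping shown here stays the same.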
If you are starting from scratch, begin with RecursiveCharacterTextSplitter at 512 tokens and 50-token overlap. Measure your retrieval quality. Then experiment with semantic chunking or parent-child strategies only if the metrics justify the complexity.
04
Retrieval: Finding the Right Chunks
You have a vector database full of chunk embeddings. A user asks a question. Now you need to find the chunks that contain the answer.
The process starts by embedding the user’s query with the same model you used during ingestion. That query embedding is a point in the same high-dimensional space as your chunks. If you have read What Are Vector Embeddings?, you already know the key property: semantically similar text lands near each other in embedding space.
The search itself uses an algorithm called HNSW — Hierarchical Navigable Small World graphs. It is an approximate nearest neighbor (ANN) algorithm that trades a tiny amount of accuracy for massive speed gains. Instead of comparing the query to every single chunk (O(n)), HNSW builds a multi-layered graph. Search time scales as O(log n), which means doubling your dataset barely affects latency.
For most RAG pipelines using normalized embeddings, cosine similarity or dot product is the right choice — direction captures meaning, magnitude is noise.
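This is easy to verify directly: for unit-length vectors, cosine similarity and dot product coincide, and rescaling a vector changes the dot product but not the cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity: direction only, magnitude normalized away.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

This is why most embedding providers ship normalized vectors: the cheaper dot product then gives the same ranking as cosine similarity.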
Example: for the query “How does the HNSW algorithm work?”, the retriever returns its top matched chunks (best similarity score 0.94) — passages covering the HNSW layer structure, creating a vector search index, Atlas Vector Search’s use of HNSW, and pre-filtering.
A critical performance knob in production: retrieve more candidates than you return. In MongoDB Atlas Vector Search, the $vectorSearch stage exposes this with two parameters — numCandidates (how many HNSW graph nodes to evaluate) and limit (how many results to return). Setting numCandidates: 150 with limit: 5 means the algorithm considers 150 candidates and returns only the best 5.
Here is what that looks like in practice:
Atlas Vector Search — aggregation pipeline:

```javascript
db.knowledge_base.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",          // name of the vector search index
      path: "embedding",              // field holding the chunk embeddings
      queryVector: queryEmbedding,    // the embedded user query
      numCandidates: 150,             // HNSW graph nodes to evaluate
      limit: 5,                       // results to return
      filter: { department: "engineering" }
    }
  },
  {
    $project: {
      content: 1,
      source: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])
```

Notice the filter parameter. This is pre-filtering — before the vector search even begins, it narrows the candidate set using standard MongoDB query expressions. If a user asks about engineering documentation, there is no reason to search the entire corpus. Pre-filtering by metadata dramatically improves both relevance and speed.
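The pre-filter-then-search flow can be sketched in plain Python, with brute-force cosine similarity standing in for HNSW (the document shapes mirror the pipeline above, but the code is an illustration, not the Atlas implementation):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def filtered_search(docs, query_vec, department, limit=5):
    # Pre-filter: drop documents outside the metadata scope
    # *before* any similarity math runs.
    candidates = [d for d in docs if d["department"] == department]
    # Then rank only the survivors by vector similarity.
    ranked = sorted(candidates,
                    key=lambda d: cosine(d["embedding"], query_vec),
                    reverse=True)
    return ranked[:limit]
```

The ordering matters: filtering first means the similarity work is proportional to the filtered subset, not the whole corpus.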
05
Prompt Augmentation: Putting It All Together
You have your top-k chunks. Now you need to assemble a prompt that turns raw retrieved context into a grounded, useful LLM response.
A well-structured RAG prompt has three distinct parts.
The system instruction tells the LLM how to behave. “Answer using only the provided context. If the context does not contain enough information to answer, say so explicitly.” Without this guardrail, the LLM will happily fill gaps with its own parametric memory — which is exactly the hallucination problem RAG was supposed to solve.
The retrieved context is the chunks you fetched, clearly delimited so the LLM can distinguish between sources. Label them explicitly. This also enables the LLM to cite its sources in the response, which is essential for trust and auditability.
The user’s question comes last. It grounds the entire prompt in a specific intent.
But the order in which you arrange chunks matters more than most developers realize. In 2023, Liu et al. published “Lost in the Middle,” showing that LLM performance degrades by more than 30% when the most relevant information sits in the middle of the context window. Models attend most strongly to the beginning and end of their input.
Prompt assembly, concretely:

System instruction: “You are a helpful assistant. Answer the user's question using ONLY the provided context. If the context doesn't contain enough information, say so. For each claim, cite the source number in square brackets.”

Source 1 (score 0.94) — MongoDB Docs, Vector Search: “Atlas Vector Search uses the HNSW algorithm for approximate nearest neighbor search. Search time scales as O(log n).”

Source 2 (score 0.89) — MongoDB Docs, $vectorSearch: “The $vectorSearch stage supports numCandidates (graph nodes to evaluate) and limit (results to return) parameters.”

Source 3 (score 0.82) — MongoDB Docs, Pre-filtering: “Pre-filtering narrows candidates using standard MongoDB query expressions before vector search begins.”

User question: “How does Atlas Vector Search find similar documents?”
Limit to 3–5 chunks. More context is not always better context. Each additional chunk dilutes attention and increases the risk that the LLM latches onto an irrelevant passage.
Place the most relevant chunk first. The Lost in the Middle finding makes this non-negotiable. If ordering five chunks, put the highest-scoring chunk at position one and the second-highest at the end — a sandwich pattern that exploits the model’s attention bias toward both boundaries.
Use explicit grounding instructions. Tell the LLM: “Base your answer only on the sources above. For each claim, cite the source number in square brackets.”
Handle the “no relevant context” case. If retrieval returns low-confidence chunks (scores below 0.7), it is better to respond with “I do not have enough information to answer that” than to force the LLM to generate from weak context. A simple similarity threshold check before prompt assembly pays dividends in trust.
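These last rules (a chunk cap, sandwich ordering, and a similarity threshold) combine naturally into one assembly step. A sketch, with chunk dicts and the 0.7 cutoff following the conventions in this section:

```python
def assemble_context(chunks, min_score=0.7, max_chunks=5):
    # Drop low-confidence chunks; the caller should answer
    # "I don't have enough information" when nothing survives.
    kept = [c for c in chunks if c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    kept = kept[:max_chunks]
    if not kept:
        return None
    # Sandwich ordering: best chunk first, second-best last,
    # exploiting attention bias toward both ends of the context.
    if len(kept) > 2:
        kept = [kept[0]] + kept[2:] + [kept[1]]
    return "\n\n".join(
        f"Source {i + 1} (score {c['score']:.2f}): {c['text']}"
        for i, c in enumerate(kept))
```

The `None` return is the important branch: it is the hook for the refusal path, so weak retrievals never reach the LLM at all.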
06
Beyond Naive RAG
The pipeline described in Sections 02 through 05 — chunk, embed, retrieve, augment, generate — is naive RAG. And for a surprising number of use cases, naive RAG works well enough. It is simple, debuggable, and fast to build.
But when it falls short, you have three levers to pull.
Hybrid search combines vector search with traditional keyword search (BM25) and merges the results using Reciprocal Rank Fusion (RRF). Semantic search excels at understanding meaning, but stumbles on exact matches: product codes, proper nouns, acronyms. MongoDB supports hybrid search natively by combining $vectorSearch with $search in a single aggregation pipeline.
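RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k conventionally set to 60.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # `rankings` is a list of ranked doc-id lists, e.g. one from
    # vector search and one from BM25. A document ranked high in
    # several lists accumulates the largest fused score.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it never has to reconcile BM25 scores with cosine similarities, which live on incomparable scales.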
Reranking adds a second-stage relevance model after initial retrieval. A cross-encoder reranker scores each query-document pair together through the same transformer, achieving much deeper semantic understanding. Anthropic’s research found that adding reranking to contextual retrieval reduced retrieval failures by 67%, while academic benchmarks show NDCG@10 gains of 5–25% depending on the reranker model and baseline quality.
Query transformation rewrites the user’s question before retrieval. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer first, then embeds that answer for retrieval — converting query-to-document matching into document-to-document matching. Multi-query retrieval generates three to five rephrased versions of the query, each retrieving independently, with results merged and de-duplicated.
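The merge-and-deduplicate step of multi-query retrieval can be sketched as follows (doc ids and scores are illustrative):

```python
def merge_multi_query(results_per_query, limit=5):
    # Each element is one rewrite's result list of (doc_id, score)
    # pairs. De-duplicate by id, keeping a document's best score.
    best = {}
    for results in results_per_query:
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```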
| Pattern | When to Use | Complexity | Impact |
|---|---|---|---|
| Naive RAG (start here) | Starting point for all implementations | Low | Good enough for ~70% of use cases |
| Hybrid search | Keyword queries failing semantic search | Medium | Covers keyword + semantic retrieval |
| Reranking | Top-k results not precise enough | Medium | 20–35% accuracy gain |
| Query transformation | Ambiguous or complex user queries | High | 27pp gain on multi-hop questions |
| Agentic RAG | Dynamic multi-step reasoning needed | Very high | Self-correcting pipeline |
Rule of thumb
Start with naive RAG. Add complexity only when metrics tell you where the bottleneck is.
Low recall? Try hybrid search. High recall but low faithfulness? Try reranking. Use RAGAS for reference-free evaluation — no ground-truth annotations required.
The temptation is to build the most advanced version from day one. Resist it.
Start with naive RAG. Measure retrieval quality using Precision@K and Recall@K — are the right chunks showing up in the top 5? Measure answer quality using faithfulness — is the LLM’s response actually grounded in the retrieved context?
Only add complexity when the metrics tell you where the bottleneck is. Low recall? Try hybrid search or query transformation. High recall but low faithfulness? Try reranking and better prompt engineering. Good metrics but high latency? Tune your chunk size and numCandidates.
RAG is not a single algorithm. It is a design space. The naive version is the foundation; every advanced pattern is an optimization applied to a specific failure mode. Build the simplest thing that works, instrument it ruthlessly, and evolve from there.
The pieces of this pipeline — vector embeddings that capture meaning, distance metrics that measure similarity, chunking strategies that preserve context, and retrieval patterns that surface the right information — are now part of your toolkit. What you build with them is up to you.