An LLM trained in 2024 doesn't know your internal documentation, your latest product specs, or what happened yesterday. Retrieval-Augmented Generation (RAG) solves this without retraining: at query time, fetch the relevant documents from a corpus, stuff them into the model's context window, and let the model answer with that grounded information.
The RAG Architecture
User query
│
▼
[Embed query] ──▶ [Vector DB search] ──▶ Top-k relevant chunks
│
▼
[Build prompt: instructions + chunks + query]
│
▼
[Call LLM]
│
▼
Final answer (with citations)
Two pipelines run independently:
- Indexing (offline): Process documents → chunks → embeddings → vector DB
- Querying (online): Embed query → retrieve → prompt → generate
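The split between the two pipelines can be sketched in a few lines. This is a toy illustration, not a production design: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, and the "vector DB" is just a Python list.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing (offline): embed every chunk once and store the result."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Querying (online): embed the query, rank stored chunks, keep the top k."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


index = build_index([
    "Refunds are processed within five business days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available on weekdays.",
])
top = retrieve(index, "what is the api rate limit", k=1)
```

The point of the structure is that `build_index` runs once per document change, while `retrieve` runs on every user query and never touches the raw documents.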
Step 1: Document Loading and Chunking
Documents come in many formats (PDF, HTML, Markdown, Confluence, Notion, SharePoint). Use loaders (LangChain, LlamaIndex, or your own) to normalise them to text + metadata.
Then chunk: split long documents into pieces small enough that several fit into the model's context at once. Chunking is often one of the most impactful design choices in RAG, because the retriever can only return what the chunker produced.
Chunking strategies
| Strategy | Description | Best for |
|---|---|---|
| Fixed-size | N tokens per chunk, optional overlap | Simple corpora; baseline |
| Recursive character | Split on \n\n, then \n, then sentence boundary | General prose; preserves paragraphs |
| Semantic | Split where embedding similarity changes | Heterogeneous content |
| Structure-aware | Split on headings, sections, code blocks | Technical docs, Markdown |
| Sliding window | Fixed-size with overlap (e.g., 1024 tokens, 128 overlap) | Default for most cases |
Typical chunk size: 256-1024 tokens. Too small and individual chunks lack context; too large and each retrieved chunk spends context-window budget on irrelevant material.
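As a rough illustration of the sliding-window strategy, the sketch below splits a token list into overlapping windows. It assumes the text has already been tokenised; a real pipeline would use the tokenizer of its embedding model, and the function name is hypothetical.

```python
def sliding_window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks


# Placeholder tokens; in practice these come from a tokenizer.
tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk starts with the last token of the previous one, so a sentence cut at a boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.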
Metadata
Attach metadata to each chunk: source URL, document title, section, last_updated, author, permissions tag. Metadata enables filtering ("only search docs the user can access") and citation.
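One possible shape for a chunk record and a permission filter is sketched below. The field names mirror the list above, but the `Chunk` class, the group names, and the `visible_to` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str
    title: str
    section: str
    last_updated: str            # ISO 8601 date, e.g. "2024-11-02"
    author: str
    permissions: frozenset[str]  # groups allowed to read this chunk


def visible_to(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.permissions & user_groups]


corpus = [
    Chunk("Q3 board notes", "https://example.com/board", "Board notes", "Q3",
          "2024-10-01", "vp", frozenset({"exec"})),
    Chunk("How to file expenses", "https://example.com/expenses", "Handbook", "Finance",
          "2024-09-15", "hr", frozenset({"all-staff"})),
]
searchable = visible_to(corpus, {"all-staff"})
```

Applying the filter before ranking, rather than after generation, means restricted text never reaches the prompt. Most vector databases expose an equivalent filter at query time.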
Step 2: Embeddings
An embedding model converts text to a dense vector (768, 1024, 1536, and 3072 dimensions are common) in which semantically similar texts lie close together, usually measured by cosine similarity.
| Embedding model | Dim | Notes |
|---|---|---|
| OpenAI text-embedding-3-small / -large | 1536 / 3072 | Strong general-purpose |
| Cohere embed-v4 | 1536 (default; 256-1536 configurable) | Excellent multilingual |
| Voyage-3 | 1024 | Strong on technical content |
| BGE-large / E5-large | 1024 | Strong open-weight options |
| NV-Embed-v2 | 4096 | Ranked first on MTEB at release; leaderboard positions change often |
Use the same embedding model for indexing and querying — vectors from different models aren't comparable.
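Cosine similarity, the usual closeness measure for embeddings, can be written out directly. The sketch below uses plain Python lists as vectors; the dimension check is an illustrative guard, not a complete safeguard.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        # Different dimensions usually mean the vectors came from different models.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Matching dimensions is necessary but not sufficient: two models with the same output size still place texts in unrelated spaces, so the score between their vectors is meaningless even though the function returns a number.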
Step 3: Vector Database
A vector DB stores embeddings + metadata and does fast approximate nearest-neighbour (ANN) search via HNSW, IVF, or DiskANN indexes.
| Vector DB | Type |
|---|---|
| Pinecone | Hosted, managed |
| Weaviate | Open-source + hosted |
| Qdrant | Open-source + hosted; written in Rust |
| Milvus | Open-source; scales to billions |
| pgvector (Postgres) | Add-on extension; "good enough" for many apps |
| Elasticsearch / OpenSearch | Vector + BM25 hybrid out of the box |
| Vertex AI Vector Search / Bedrock Knowledge Bases / Azure AI Search | Cloud-managed |
For most teams, start with pgvector or Azure AI Search / Bedrock Knowledge Bases. Migrate to specialist DBs only if you outgrow them.
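To see what an ANN index approximates, it helps to write the exact search it replaces. The sketch below scans every stored vector, which is fine for a few thousand chunks and useful as ground truth when measuring an ANN index's recall. It assumes the vectors are unit-normalised, so the dot product equals cosine similarity; the function name and data are hypothetical.

```python
import heapq


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force nearest neighbours: score every vector, keep the k best ids."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return heapq.nlargest(k, vectors, key=lambda chunk_id: dot(query, vectors[chunk_id]))


# Toy 2-dimensional, unit-length vectors keyed by chunk id.
store = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.6, 0.8],
}
nearest = exact_top_k([1.0, 0.0], store, k=2)
```

The cost of this scan grows linearly with the corpus. HNSW, IVF, and DiskANN trade a small chance of missing a true neighbour for much lower query cost.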
Step 4: Hybrid Search
Vector similarity captures meaning but misses exact-keyword matches (product codes, error IDs, names). Hybrid search combines:
- BM25 (sparse, keyword-based) — strong on exact tokens
- Vector (dense, semantic) — strong on paraphrase
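Common fusion: Reciprocal Rank Fusion (RRF) scores each document by summing 1/(k + rank) over the two ranked lists, so it works on ranks alone and needs no score normalisation. Many production RAG systems use hybrid search; pure vector search is usually adequate only for prototypes.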
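A minimal sketch of Reciprocal Rank Fusion follows. Each document earns 1/(k + rank) from every list it appears in, and the sums decide the final order. The constant `k = 60` comes from the original RRF paper and is treated here as a default assumption; the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["d1", "d2", "d3"]    # keyword results, best first
vector_ranking = ["d3", "d1", "d4"]  # semantic results, best first
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents that both retrievers rank highly (`d1`, `d3`) rise above documents that only one retriever found, without comparing BM25 scores to cosine scores directly.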
Step 5: Reranking
Initial retrieval returns ~20-50 candidates. A reranker — a cross-encoder model that scores (query, chunk) pairs directly — reorders them by true relevance. The top 3-10 then go to the LLM.
Rerankers (Cohere Rerank, BGE Reranker, voyage-rerank) can lift retrieval quality by as much as 10-30 points on standard metrics, depending on the corpus and the first-stage retriever, at the cost of some added latency per query.
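The retrieve-then-rerank step can be sketched with the scorer passed in as a function. In a real system `score_pair` would call a cross-encoder model; `overlap_score` below is a deliberately crude, hypothetical stand-in so the example runs without any model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair and return the top_n chunks, best first."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words that also occur in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


candidates = [
    "billing cycles explained",
    "how to reset your password",
    "password policy overview",
]
best = rerank("reset password", candidates, overlap_score, top_n=2)
```

Because the reranker scores each pair separately, its cost grows with the number of candidates, which is why it runs on the 20-50 retrieved chunks rather than on the whole corpus.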
Step 6: Building the Prompt
You are a documentation assistant. Answer the user's question using ONLY the
context provided. If the context doesn't contain the answer, say so.
<context>
[1] {{ chunk_1.text }} (source: {{ chunk_1.source }})
[2] {{ chunk_2.text }} (source: {{ chunk_2.source }})
[3] {{ chunk_3.text }} (source: {{ chunk_3.source }})
</context>
Question: {{ user_query }}
Cite sources by number [1], [2], etc.
Key elements: an instruction to ground the answer in the context only, the retrieved chunks with citation tags, the user's question, and an instruction to cite. Without these, the model tends to mix the context with its parametric knowledge, which raises the risk of hallucination.
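The template above can be filled in with a small helper. This sketch assumes each retrieved chunk is a dict with `text` and `source` keys, which is an illustrative choice rather than a fixed schema.

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble instructions, numbered context chunks, and the question."""
    context_lines = [
        f"[{i}] {chunk['text']} (source: {chunk['source']})"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join([
        "You are a documentation assistant. Answer the user's question using ONLY the",
        "context provided. If the context doesn't contain the answer, say so.",
        "<context>",
        *context_lines,
        "</context>",
        f"Question: {query}",
        "Cite sources by number [1], [2], etc.",
    ])


prompt = build_prompt(
    "How long do refunds take?",
    [{"text": "Refunds take five business days.", "source": "refunds.md"}],
)
```

Numbering the chunks at prompt-build time keeps the citation tags in the answer aligned with the order in which chunks were sent, so `[1]` can be mapped back to its source afterwards.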
Evaluation: Four Metrics
| Metric | What it measures |
|---|---|
| Recall@k | What fraction of the relevant chunks appear in the top k results? |
| Precision@k | What fraction of the top k results are relevant? |
| Faithfulness | Does the answer stay grounded in the provided context? |
| Answer relevance | Does the answer actually address the user's question? |
Separate retrieval evaluation (recall, precision) from generation evaluation (faithfulness, relevance) — you fix each layer with different techniques. Frameworks like RAGAS, TruLens, and DeepEval automate the scoring.
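The two retrieval metrics are simple enough to compute by hand on a labelled test set. The sketch below assumes each query comes with a set of chunk ids judged relevant, and it divides precision by `k` even when fewer than `k` results were returned, which is one common convention.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the first k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the first k result slots filled by relevant ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


retrieved = ["a", "b", "c", "d"]  # ranked output of the retriever
relevant = {"a", "d", "e"}        # human-labelled ground truth
recall_3 = recall_at_k(retrieved, relevant, k=3)
precision_4 = precision_at_k(retrieved, relevant, k=4)
```

Faithfulness and answer relevance need a judge, human or model, and so are better left to the evaluation frameworks named above.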
Common Pitfalls
- Chunks too small or too large — sweet spot is workload-specific; A/B test
- No metadata filters — your VP's notes shouldn't surface for end-user queries
- Ignoring updates — re-index changed docs; orphaned embeddings poison retrieval
- Pure vector search — always add BM25 for keyword queries
- No reranker — usually 1-2 weeks of work for outsized gains
- Putting too many chunks in the prompt — lost-in-the-middle hurts; 3-8 is usually enough
- No citations — users can't verify; trust collapses on the first hallucination
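For the update pitfall, one common approach is to store a content hash next to each indexed document and compare hashes on every sync. The sketch below only plans the work; the function name and the shape of the two dicts are assumptions, and the actual upsert and delete calls depend on the vector DB in use.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str],    # doc id -> current text
    indexed_hashes: dict[str, str],  # doc id -> hash stored at last indexing
) -> tuple[list[str], list[str]]:
    """Return (ids to re-embed, ids whose embeddings should be deleted)."""
    to_upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_upsert), sorted(to_delete)


current = {"a": "version 2", "b": "unchanged", "c": "brand new"}
indexed = {
    "a": content_hash("version 1"),
    "b": content_hash("unchanged"),
    "z": content_hash("deleted upstream"),
}
upserts, deletes = plan_reindex(current, indexed)
```

Here `a` changed and `c` is new, so both are re-embedded; `z` no longer exists upstream, so its embeddings are the orphans that would otherwise keep surfacing in results.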
Advanced RAG Patterns
- HyDE: Hypothetical Document Embeddings — ask the LLM to draft a plausible answer, embed that, retrieve against it. Helps when queries are short or jargon-heavy.
- Query rewriting: Use an LLM to expand or rephrase the query into multiple variants before retrieval
- Multi-hop RAG: Retrieve, generate intermediate result, retrieve again with refined query
- Agentic RAG: Give the agent a "retrieve" tool; it decides when and what to fetch
- GraphRAG (Microsoft): Build a knowledge graph from the corpus; combine graph traversal with vector search
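As one example of these patterns, HyDE changes only what gets embedded. The sketch below takes the LLM call, the embedding call, and the vector search as injected functions; all three signatures are hypothetical, and the stubs exist only so the flow can be run without any external service.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    draft_answer: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    k: int = 5,
) -> list[str]:
    """Embed a drafted hypothetical answer instead of the raw query."""
    hypothetical = draft_answer(query)      # LLM writes a plausible answer
    return search(embed(hypothetical), k)   # retrieve against that answer


# Stubs standing in for a real LLM, embedding model, and vector DB.
embedded_texts: list[str] = []


def stub_draft(query: str) -> str:
    return "DRAFT: " + query


def stub_embed(text: str) -> list[float]:
    embedded_texts.append(text)
    return [1.0]


def stub_search(vector: Sequence[float], k: int) -> list[str]:
    return ["c1", "c2", "c3"][:k]


results = hyde_retrieve("key rotation?", stub_draft, stub_embed, stub_search, k=2)
```

The drafted answer may be factually wrong; it is used only as a search key that resembles the target documents more closely than a terse query does, and the final answer is still generated from the retrieved chunks.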
When NOT to Use RAG
| Scenario | Better fit |
|---|---|
| The whole knowledge base fits in context | Stuff it all in — simpler, no embedding pipeline |
| You need the model to behave differently, not know more | Fine-tuning (next lesson) |
| Highly structured queries against a DB | Text-to-SQL or direct query, not RAG |
| The data is small and rarely changes | Include directly in system prompt |
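The first row can be checked with a back-of-the-envelope estimate. The sketch below assumes roughly four characters per token, a coarse rule of thumb for English text, and reserves some of the window for instructions and the answer; both numbers are assumptions to tune, and a real check would use the model's tokenizer.

```python
def fits_in_context(
    documents: list[str],
    context_window_tokens: int,
    reserved_tokens: int = 2000,   # assumed budget for instructions and output
    chars_per_token: float = 4.0,  # rough heuristic, not an exact count
) -> bool:
    """Estimate whether all documents fit in the context window at once."""
    estimated_tokens = sum(len(doc) for doc in documents) / chars_per_token
    return estimated_tokens <= context_window_tokens - reserved_tokens


docs = ["a" * 4000, "b" * 4000]  # about 2,000 estimated tokens in total
small_corpus_ok = fits_in_context(docs, context_window_tokens=8000)
too_tight = fits_in_context(docs, context_window_tokens=3000)
```

If the estimate sits close to the limit, treat it as a "no": the heuristic can be off by a wide margin for code, tables, or non-English text.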
RAG is the default for "answer questions about my private corpus" — but it isn't a silver bullet. The next lesson covers the strategic decision between RAG, fine-tuning, and pure prompt engineering.