8 min read·Lesson 5 of 8

Retrieval-Augmented Generation (RAG)

Ground LLM answers in your data — chunking, embeddings, vector databases, hybrid search, reranking, and the RAG control flow.

An LLM trained in 2024 doesn't know your internal documentation, your latest product specs, or what happened yesterday. Retrieval-Augmented Generation (RAG) solves this without retraining: at query time, fetch the relevant documents from a corpus, stuff them into the model's context window, and let the model answer with that grounded information.

The RAG Architecture

User query
   │
   ▼
[Embed query] ──▶ [Vector DB search] ──▶ Top-k relevant chunks
                                              │
                                              ▼
                                       [Build prompt: instructions + chunks + query]
                                              │
                                              ▼
                                          [Call LLM]
                                              │
                                              ▼
                                          Final answer (with citations)

Two pipelines run independently:

  1. Indexing (offline): Process documents → chunks → embeddings → vector DB
  2. Querying (online): Embed query → retrieve → prompt → generate
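The two pipelines can be sketched in a few lines. This is a toy illustration, not a production design: `TinyRAG`, `toy_embed`, and `toy_similarity` are hypothetical names, the "embedding" is a bag-of-words count rather than a real model, and `generate` is a stub where an LLM call would go.

```python
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def toy_similarity(a, b):
    """Stand-in similarity (assumption): number of shared words."""
    return sum((a & b).values())


class TinyRAG:
    def __init__(self, embed, similarity, generate):
        self.embed = embed
        self.similarity = similarity
        self.generate = generate
        self.store = []  # in-memory "vector DB": list of (vector, chunk_text)

    def index(self, chunks):
        """Indexing pipeline (offline): chunks -> embeddings -> store."""
        for chunk in chunks:
            self.store.append((self.embed(chunk), chunk))

    def query(self, question, k=2):
        """Querying pipeline (online): embed -> retrieve -> prompt -> generate."""
        q_vec = self.embed(question)
        ranked = sorted(
            self.store,
            key=lambda item: self.similarity(q_vec, item[0]),
            reverse=True,
        )
        top_chunks = [text for _, text in ranked[:k]]
        prompt = "Context:\n" + "\n".join(top_chunks) + "\n\nQuestion: " + question
        return self.generate(prompt), top_chunks


rag = TinyRAG(toy_embed, toy_similarity, generate=lambda prompt: "stub answer")
rag.index([
    "the vpn client requires version 5",
    "lunch menu for friday",
    "vpn setup guide for laptops",
])
answer, used_chunks = rag.query("which vpn client version", k=2)
```

Because `index` and `query` only share the store, the indexing job can run on its own schedule while queries are served from whatever has been indexed so far.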

Step 1: Document Loading and Chunking

Documents come in many formats (PDF, HTML, Markdown, Confluence, Notion, SharePoint). Use loaders (LangChain, LlamaIndex, or your own) to normalise them to text + metadata.

Then chunk — split long documents into pieces small enough to fit several into the model's context. Chunking is the most impactful design choice in RAG.

Chunking strategies

| Strategy | Description | Best for |
| --- | --- | --- |
| Fixed-size | N tokens per chunk, optional overlap | Simple corpora; baseline |
| Recursive character | Split on \n\n, then \n, then sentence boundary | General prose; preserves paragraphs |
| Semantic | Split where embedding similarity changes | Heterogeneous content |
| Structure-aware | Split on headings, sections, code blocks | Technical docs, Markdown |
| Sliding window | Fixed-size with overlap (e.g., 1024 tokens, 128 overlap) | Default for most cases |

Typical chunk size: 256-1024 tokens. Too small and individual chunks lack context; too large and each chunk carries irrelevant material that wastes the context window.
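As a rough sketch of the sliding-window strategy, the function below cuts a token list into fixed-size windows that share a configurable overlap. The name `sliding_window_chunks` is hypothetical, and whitespace splitting stands in for a real tokenizer, so the sizes here are word counts rather than model tokens.

```python
def sliding_window_chunks(tokens, size=1024, overlap=128):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption standing in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
chunks = sliding_window_chunks(words, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk starts with the last word of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.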

Metadata

Attach metadata to each chunk: source URL, document title, section, last_updated, author, permissions tag. Metadata enables filtering ("only search docs the user can access") and citation.
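One possible shape for a chunk record and a permission filter is sketched below. The `Chunk` fields mirror the metadata listed above, but the class, the `visible_to` helper, the group names, and the example URLs are all hypothetical. The filter is deny-by-default: a chunk with no permission tags is visible to nobody.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source_url: str
    title: str
    section: str
    last_updated: str  # ISO date string, e.g. "2024-06-01"
    author: str
    permissions: frozenset = field(default_factory=frozenset)


def visible_to(chunks, user_groups):
    """Keep only chunks whose permission tags intersect the user's groups."""
    groups = set(user_groups)
    return [c for c in chunks if c.permissions & groups]


chunks = [
    Chunk("Q3 roadmap draft", "https://example.com/roadmap", "Roadmap",
          "Q3", "2024-06-01", "vp", frozenset({"leadership"})),
    Chunk("How to reset your password", "https://example.com/help/reset", "Help",
          "Accounts", "2024-05-12", "support", frozenset({"everyone", "leadership"})),
]
visible = visible_to(chunks, ["everyone"])
```

In a real system the same filter would normally be pushed down into the vector DB query, so that restricted chunks are never retrieved in the first place.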

Step 2: Embeddings

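An embedding model converts text to a dense vector (768, 1024, 1536, and 3072 dimensions are common) in which semantically similar texts sit close together, so semantic similarity can be measured as vector similarity (typically cosine similarity).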

| Embedding model | Dim | Notes |
| --- | --- | --- |
| OpenAI text-embedding-3-small / -large | 1536 / 3072 | Strong general-purpose |
| Cohere embed-v4 | 1024 | Excellent multilingual |
| Voyage-3 | 1024 | Strong on technical content |
| BGE-large / E5-large | 1024 | Best open-weight option |
| NV-Embed-v2 | 4096 | Topped the MTEB leaderboard at release (2024) |

Use the same embedding model for indexing and querying — vectors from different models aren't comparable.
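Cosine similarity is a common way to compare two embeddings. The sketch below uses hand-written three-dimensional vectors as stand-ins for real model output (an assumption for illustration only). The length check catches one obvious mismatch, but note that two models with the same dimension still produce vectors that are not comparable.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Toy vectors (assumption): real embeddings have hundreds or thousands of dims.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

sim_close = cosine_similarity(query_vec, close_vec)
sim_far = cosine_similarity(query_vec, far_vec)
```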

Step 3: Vector Database

A vector DB stores embeddings + metadata and does fast approximate nearest-neighbour (ANN) search via HNSW, IVF, or DiskANN indexes.
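To make concrete what an ANN index approximates, here is a sketch of exact (brute-force) top-k search over an in-memory dictionary. The function names and the toy two-dimensional vectors are hypothetical. This scan is linear in the number of stored vectors per query, which is the cost that HNSW, IVF, and DiskANN indexes trade a little accuracy to avoid.

```python
import heapq
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def exact_top_k(query_vec, index, k=3):
    """index maps chunk_id -> vector; returns [(chunk_id, score)], best first."""
    scored = ((chunk_id, _cosine(query_vec, vec)) for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy index (assumption): three chunks with 2-dimensional vectors.
index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
top = exact_top_k([1.0, 0.1], index, k=2)
```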

| Vector DB | Type |
| --- | --- |
| Pinecone | Hosted, managed |
| Weaviate | Open-source + hosted |
| Qdrant | Open-source + hosted; written in Rust |
| Milvus | Open-source; scales to billions |
| pgvector (Postgres) | Add-on extension; "good enough" for many apps |
| Elasticsearch / OpenSearch | Vector + BM25 hybrid out of the box |
| Vertex AI Vector Search / Bedrock Knowledge Bases / Azure AI Search | Cloud-managed |

For most teams, start with pgvector or Azure AI Search / Bedrock Knowledge Bases. Migrate to specialist DBs only if you outgrow them.

Step 4: Hybrid Search

Vector similarity captures meaning but misses exact-keyword matches (product codes, error IDs, names). Hybrid search combines:

  • BM25 (sparse, keyword-based) — strong on exact tokens
  • Vector (dense, semantic) — strong on paraphrase
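For the sparse side, a minimal BM25 scorer might look like the sketch below. It uses the common Okapi formulation with a Lucene-style IDF; the function name, the `k1` and `b` defaults, the whitespace tokenization, and the sample documents are all assumptions for illustration. Production systems would use the search engine's built-in implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "error E1042 occurs when the disk is full",
    "how to free disk space on the server",
    "billing and invoices overview",
]
scores = bm25_scores("E1042 disk", docs)
```

The exact error code only appears in the first document, so it outscores the second, which matches on "disk" alone. That exact-token behaviour is what dense retrieval tends to miss.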

Common fusion: Reciprocal Rank Fusion (RRF) blends the two ranked lists. Most production RAG systems use hybrid search; pure vector is for prototypes.
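RRF itself is short enough to sketch directly: each document scores the sum of 1 / (k + rank) over every list it appears in. The constant k = 60 is the value commonly used with RRF; the function name and the document IDs below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; returns doc IDs ordered best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF only looks at ranks, never at raw scores, so BM25 scores and cosine similarities do not need to be normalised onto a common scale before fusing.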

Step 5: Reranking

Initial retrieval returns ~20-50 candidates. A reranker — a cross-encoder model that scores (query, chunk) pairs directly — reorders them by true relevance. The top 3-10 then go to the LLM.

Rerankers (Cohere Rerank, BGE Reranker, voyage-rerank) typically lift retrieval quality 10-30 points on standard metrics, at the cost of a small amount of added per-query latency.
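The retrieve-then-rerank step can be sketched with a pluggable scoring function. In the example, `toy_overlap_score` is a deliberately crude placeholder (word overlap), not a cross-encoder; in practice `score_pair` would wrap a call to a reranker model. All names here are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score every (query, chunk) pair and keep the top_n chunks, best first."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query, text):
    """Placeholder scorer (assumption): fraction of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    "Resetting a password requires admin rights",
    "Our office is closed on public holidays",
    "To reset your password open account settings",
]
best = rerank("reset your password", candidates, toy_overlap_score, top_n=2)
```

The scorer sees the query and the chunk together, which is the structural difference from first-stage retrieval, where each side is embedded independently.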

Step 6: Building the Prompt

You are a documentation assistant. Answer the user's question using ONLY the
context provided. If the context doesn't contain the answer, say so.

<context>
[1] {{ chunk_1.text }}    (source: {{ chunk_1.source }})
[2] {{ chunk_2.text }}    (source: {{ chunk_2.source }})
[3] {{ chunk_3.text }}    (source: {{ chunk_3.source }})
</context>

Question: {{ user_query }}

Cite sources by number [1], [2], etc.

Key elements: instruction to ground in context only, the retrieved chunks with citation tags, the user's question, instruction to cite. Without these, the model will mix context with its parametric knowledge — risking hallucination.
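A small helper can render that template from retrieved chunks. This is one possible sketch: the function name `build_prompt` and the dict keys `text` and `source` are assumptions, and the sample chunks are made up.

```python
def build_prompt(user_query, chunks):
    """chunks: list of dicts with 'text' and 'source' keys, already reranked."""
    context_lines = [
        f"[{i}] {chunk['text']}    (source: {chunk['source']})"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join([
        "You are a documentation assistant. Answer the user's question using ONLY the",
        "context provided. If the context doesn't contain the answer, say so.",
        "",
        "<context>",
        *context_lines,
        "</context>",
        "",
        f"Question: {user_query}",
        "",
        "Cite sources by number [1], [2], etc.",
    ])


prompt = build_prompt(
    "How often do API keys rotate?",
    [
        {"text": "API keys rotate every 90 days.", "source": "security.md"},
        {"text": "Rotation can be triggered manually.", "source": "admin-guide.md"},
    ],
)
```

Numbering the chunks in the prompt is what lets the application map a cited `[2]` in the answer back to a source URL for display.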

Evaluation: Four Metrics

| Metric | What it measures |
| --- | --- |
| Recall@k | Did the retrieval pull the right chunks? |
| Precision@k | Are the retrieved chunks relevant? |
| Faithfulness | Does the answer stay grounded in the provided context? |
| Answer relevance | Does the answer actually address the user's question? |

Separate retrieval evaluation (recall, precision) from generation evaluation (faithfulness, relevance) — you fix each layer with different techniques. Frameworks like RAGAS, TruLens, and DeepEval automate the scoring.
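The two retrieval metrics are simple to compute once you have a labelled set of relevant chunk IDs per query. The sketch below assumes such labels exist; the function names and chunk IDs are hypothetical. Faithfulness and answer relevance need an LLM or human judge and are not shown.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant_set)
    return hits / k


retrieved = ["c1", "c7", "c3", "c9", "c4"]  # ranked output of the retriever
relevant = {"c1", "c3", "c5"}               # ground-truth labels for this query

recall = recall_at_k(retrieved, relevant, k=5)        # 2 of 3 relevant found
precision = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 results relevant
```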

Common Pitfalls

  • Chunks too small or too large — sweet spot is workload-specific; A/B test
  • No metadata filters — your VP's notes shouldn't surface for end-user queries
  • Ignoring updates — re-index changed docs; orphaned embeddings poison retrieval
  • Pure vector search — always add BM25 for keyword queries
  • No reranker — usually 1-2 weeks of work for outsized gains
  • Putting too many chunks in the prompt — lost-in-the-middle hurts; 3-8 is usually enough
  • No citations — users can't verify; trust collapses on the first hallucination
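For the "ignoring updates" pitfall, one simple approach is to store a content hash per document at index time and compare it on each sync. The sketch below assumes that scheme; `plan_reindex` and the document IDs are hypothetical, and the actual delete and upsert calls depend on the vector DB in use.

```python
import hashlib


def plan_reindex(current_docs, indexed_hashes):
    """Decide which documents to re-embed and which to purge.

    current_docs: doc_id -> current text.
    indexed_hashes: doc_id -> SHA-256 hex digest stored when last indexed.
    Returns (to_upsert, to_delete) as sorted lists of doc IDs.
    """
    to_upsert = []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            to_upsert.append(doc_id)  # new or changed since last index
    # Documents that no longer exist leave orphaned embeddings behind.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(to_upsert), sorted(to_delete)


indexed = {
    "faq": hashlib.sha256(b"old faq text").hexdigest(),
    "pricing": hashlib.sha256(b"pricing v1").hexdigest(),
    "retired": hashlib.sha256(b"legacy page").hexdigest(),
}
current = {"faq": "new faq text", "pricing": "pricing v1", "setup": "setup guide"}
to_upsert, to_delete = plan_reindex(current, indexed)
```

When a changed document is re-indexed, its old chunks should be deleted before the new ones are written, since the chunk boundaries may have shifted.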

Advanced RAG Patterns

  • HyDE: Hypothetical Document Embeddings — ask the LLM to draft a plausible answer, embed that, retrieve against it. Helps when queries are short or jargon-heavy.
  • Query rewriting: Use an LLM to expand or rephrase the query into multiple variants before retrieval
  • Multi-hop RAG: Retrieve, generate intermediate result, retrieve again with refined query
  • Agentic RAG: Give the agent a "retrieve" tool; it decides when and what to fetch
  • GraphRAG (Microsoft): Build a knowledge graph from the corpus; combine graph traversal with vector search
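As an illustration of one of these patterns, the HyDE control flow fits in a few lines when the LLM, the embedder, and the search are passed in as functions. Everything concrete below is a stand-in: `fake_llm` returns a canned draft, `bag_of_words` replaces a real embedding model, and `overlap_search` replaces a vector DB query.

```python
from collections import Counter


def hyde_retrieve(query, draft_answer, embed, search, k=3):
    """HyDE: retrieve against an LLM-drafted answer instead of the raw query."""
    hypothetical_doc = draft_answer(query)     # an LLM call in a real system
    return search(embed(hypothetical_doc), k)  # search with the draft's vector


# Stand-ins (assumptions) so the sketch runs without external services.
corpus = {
    "kb-1": "rotate the api key from the security settings page",
    "kb-2": "holiday schedule for the support team",
}


def fake_llm(query):
    return "You can rotate the api key in security settings"


def bag_of_words(text):
    return Counter(text.lower().split())


def overlap_search(query_vec, k):
    ranked = sorted(
        corpus,
        key=lambda doc_id: sum((bag_of_words(corpus[doc_id]) & query_vec).values()),
        reverse=True,
    )
    return ranked[:k]


results = hyde_retrieve("key rotation?", fake_llm, bag_of_words, overlap_search, k=1)
```

The short query shares only one word with the target article, while the drafted answer shares several. The draft does not need to be correct; it only needs to resemble the documents worth retrieving.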

When NOT to Use RAG

| Scenario | Better fit |
| --- | --- |
| The whole knowledge base fits in context | Stuff it all in — simpler, no embedding pipeline |
| You need the model to behave differently, not know more | Fine-tuning (next lesson) |
| Highly structured queries against a DB | Text-to-SQL or direct query, not RAG |
| The data is small and rarely changes | Include directly in system prompt |

RAG is the default for "answer questions about my private corpus" — but it isn't a silver bullet. The next lesson covers the strategic decision between RAG, fine-tuning, and pure prompt engineering.

Key Takeaways

  • RAG fetches relevant documents at query time and passes them to the LLM as context — no fine-tuning needed.
  • Embedding models convert text to vectors; vector DBs do fast approximate nearest-neighbour search.
  • Chunking strategy (size, overlap, semantic boundaries) is the most impactful design choice.
  • Hybrid search (vector + BM25) plus reranking dramatically improves retrieval quality.
  • Track recall, precision, faithfulness, and answer relevance as separate evaluation metrics.
