Retrieval-Augmented Generation (RAG)

An LLM trained in 2024 doesn't know your internal documentation, your latest product specs, or what happened yesterday. Retrieval-Augmented Generation (RAG) solves this without retraining: at query time, fetch the relevant documents from a corpus, stuff them into the model's context window, and let the model answer with that grounded information.

The RAG Architecture

User query
   │
   ▼
[Embed query] ──▶ [Vector DB search] ──▶ Top-k relevant chunks
                                              │
                                              ▼
                                       [Build prompt: instructions + chunks + query]
                                              │
                                              ▼
                                          [Call LLM]
                                              │
                                              ▼
                                          Final answer (with citations)

Two pipelines run independently:

Indexing (offline): Process documents → chunks → embeddings → vector DB
Querying (online): Embed query → retrieve → prompt → generate

Step 1: Document Loading and Chunking

Documents come in many formats (PDF, HTML, Markdown, Confluence, Notion, SharePoint). Use loaders (LangChain, LlamaIndex, or your own) to normalise them to text + metadata.

Then chunk: split long documents into pieces small enough that several fit into the model's context at once. Chunking is often one of the most impactful design choices in a RAG system, because it determines what the retriever can return.

Chunking strategies

Strategy	Description	Best for
Fixed-size	N tokens per chunk, optional overlap	Simple corpora; baseline
Recursive character	Split on \n\n, then \n, then sentence boundary	General prose; preserves paragraphs
Semantic	Split where embedding similarity changes	Heterogeneous content
Structure-aware	Split on headings, sections, code blocks	Technical docs, Markdown
Sliding window	Fixed-size with overlap (e.g., 1024 tokens, 128 overlap)	Default for most cases

Typical chunk size: 256-1024 tokens. Too small and individual chunks lack context; too large and you waste context window with irrelevant material per chunk.
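
As a rough illustration of the sliding-window strategy from the table, here is a minimal sketch. It assumes the text is already tokenised; whitespace splitting stands in for a real tokenizer, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=128):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

Each window repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.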

Metadata

Attach metadata to each chunk: source URL, document title, section, last_updated, author, permissions tag. Metadata enables filtering ("only search docs the user can access") and citation.
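
One way to picture metadata-based filtering is the sketch below. The `Chunk` class, its field names, and the `visible_to` helper are illustrative assumptions, not any particular library's schema; real vector DBs usually apply such filters inside the search call.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str          # source URL
    title: str           # document title
    last_updated: str    # ISO date string
    permissions: set = field(default_factory=set)  # groups allowed to read it


def visible_to(chunks, user_groups):
    """Keep only chunks whose permissions tag overlaps the user's groups."""
    groups = set(user_groups)
    return [c for c in chunks if c.permissions & groups]


# Hypothetical corpus with placeholder URLs.
corpus = [
    Chunk("Q3 roadmap draft", "https://example.com/roadmap", "Roadmap",
          "2025-01-10", {"exec"}),
    Chunk("How to reset a password", "https://example.com/kb/reset", "KB",
          "2024-11-02", {"exec", "staff"}),
]
allowed = visible_to(corpus, ["staff"])
```

The same `source` and `title` fields can later be used to render citations next to the answer.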

Step 2: Embeddings

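An embedding model converts text to a dense vector (768, 1024, 1536, and 3072 dimensions are common) in which semantically similar texts lie close together, so semantic similarity can be measured as vector similarity (typically cosine similarity).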

Embedding model	Dim	Notes
OpenAI text-embedding-3-small / -large	1536 / 3072	Strong general-purpose
Cohere embed-v4	1536 (default; smaller sizes selectable)	Excellent multilingual
Voyage-3	1024	Strong on technical content
BGE-large / E5-large	1024	Best open-weight option
NV-Embed-v2	4096	Topped the MTEB leaderboard in 2024; rankings change often

Use the same embedding model for indexing and querying — vectors from different models aren't comparable.
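
To make "similar meaning, nearby vectors" concrete, here is a small sketch of cosine similarity on toy vectors. The three-dimensional vectors are made up for illustration; real embeddings come from one of the models above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real model output).
query = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]
doc_far = [0.0, 1.0, 0.0]
```

Here `doc_close` scores higher against `query` than `doc_far` does. A dimension mismatch fails loudly, but two different models with the same dimension would fail silently, which is why the index and the query must share one model.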

Step 3: Vector Database

A vector DB stores embeddings + metadata and does fast approximate nearest-neighbour (ANN) search via HNSW, IVF, or DiskANN indexes.

Vector DB	Type
Pinecone	Hosted, managed
Weaviate	Open-source + hosted
Qdrant	Open-source + hosted; written in Rust
Milvus	Open-source; scales to billions
pgvector (Postgres)	Add-on extension; "good enough" for many apps
Elasticsearch / OpenSearch	Vector + BM25 hybrid out of the box
Vertex AI Vector Search / Bedrock Knowledge Bases / Azure AI Search	Cloud-managed

For most teams, start with pgvector or Azure AI Search / Bedrock Knowledge Bases. Migrate to specialist DBs only if you outgrow them.
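
For intuition about what a vector DB does, the sketch below performs the exact, brute-force version of the search that HNSW, IVF, or DiskANN indexes approximate. It assumes unit-normalised vectors (so the dot product equals cosine similarity); `exact_top_k` and the tuple layout of `index` are hypothetical.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query_vec, index, k=3, where=None):
    """Exact O(n) scan over `index`, a list of (chunk_id, vector, metadata).

    `where` is an optional predicate on the metadata dict, mimicking a
    metadata filter. Returns chunk ids, best match first.
    """
    candidates = (
        (dot(query_vec, vec), chunk_id)
        for chunk_id, vec, meta in index
        if where is None or where(meta)
    )
    return [chunk_id for _, chunk_id in heapq.nlargest(k, candidates)]


# Toy index with unit-length 2-d vectors (assumed values).
index = [
    ("a", [1.0, 0.0], {"lang": "en"}),
    ("b", [0.8, 0.6], {"lang": "en"}),
    ("c", [0.0, 1.0], {"lang": "de"}),
]
hits = exact_top_k([1.0, 0.0], index, k=2)
```

A linear scan like this is fine for a few thousand chunks; ANN indexes trade a little recall for much lower latency at larger scale.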

Step 4: Hybrid Search

Vector similarity captures meaning but misses exact-keyword matches (product codes, error IDs, names). Hybrid search combines:

BM25 (sparse, keyword-based) — strong on exact tokens
Vector (dense, semantic) — strong on paraphrase

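A common fusion method is Reciprocal Rank Fusion (RRF): each document scores the sum of 1/(k + rank) over the ranked lists it appears in (k is a smoothing constant, commonly 60), and the merged list is sorted by that total. Most production RAG systems use hybrid search; pure vector search is mainly for prototypes.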
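
A minimal sketch of RRF over two ranked lists follows. The document ids are invented, and k=60 is the commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids using summed 1 / (k + rank) scores."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the sparse and dense retrievers.
bm25_hits = ["doc_err42", "doc_faq", "doc_intro"]
vector_hits = ["doc_faq", "doc_howto", "doc_err42"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because only ranks are used, the BM25 and vector scores never need to be put on a common scale. In this toy input `doc_faq` comes out first because it sits near the top of both lists.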

Step 5: Reranking

Initial retrieval returns ~20-50 candidates. A reranker — a cross-encoder model that scores (query, chunk) pairs directly — reorders them by true relevance. The top 3-10 then go to the LLM.

Rerankers (Cohere Rerank, BGE Reranker, voyage-rerank) typically lift retrieval quality by 10-30 points on standard metrics, at the cost of a small amount of added latency per query.
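
The two-stage shape can be sketched as below. The scorer is injected, and `token_overlap` is only a toy stand-in for a cross-encoder so that the example runs without a model; in practice `score_pair` would call a real reranker.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Reorder candidates by score_pair(query, chunk), highest first."""
    scored = [
        (score_pair(query, chunk), position, chunk)
        for position, chunk in enumerate(candidates)
    ]
    # Sort by score descending; the original position breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_n]]


def token_overlap(query, chunk):
    """Toy scorer (NOT a cross-encoder): share of query words found in chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Hypothetical first-stage candidates.
candidates = [
    "Billing FAQ for enterprise plans",
    "How to rotate an API key safely",
    "API key rotation schedule and policy",
]
top = rerank("rotate api key", candidates, token_overlap, top_n=2)
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be swapped or evaluated independently.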

Step 6: Building the Prompt

You are a documentation assistant. Answer the user's question using ONLY the
context provided. If the context doesn't contain the answer, say so.

<context>
[1] {{ chunk_1.text }}    (source: {{ chunk_1.source }})
[2] {{ chunk_2.text }}    (source: {{ chunk_2.source }})
[3] {{ chunk_3.text }}    (source: {{ chunk_3.source }})
</context>

Question: {{ user_query }}

Cite sources by number [1], [2], etc.

Key elements: an instruction to ground the answer in the context only, the retrieved chunks with citation tags, the user's question, and an instruction to cite. Without these, the model may mix the context with its parametric knowledge, which risks hallucination.
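
Filling the template programmatically might look like the following sketch. It assumes each chunk is a dict with `text` and `source` keys; the function name `build_prompt` is hypothetical.

```python
def build_prompt(user_query, chunks):
    """Render the grounding prompt from retrieved chunks.

    `chunks` is a list of dicts with 'text' and 'source' keys (assumed shape).
    """
    context_lines = [
        f"[{number}] {chunk['text']}    (source: {chunk['source']})"
        for number, chunk in enumerate(chunks, start=1)
    ]
    return "\n".join([
        "You are a documentation assistant. Answer the user's question using ONLY the",
        "context provided. If the context doesn't contain the answer, say so.",
        "",
        "<context>",
        *context_lines,
        "</context>",
        "",
        f"Question: {user_query}",
        "",
        "Cite sources by number [1], [2], etc.",
    ])


prompt = build_prompt(
    "How do I reset my password?",
    [{"text": "Go to Settings and choose Reset.", "source": "kb/reset"}],
)
```

Numbering the chunks at render time keeps the citation numbers in the answer aligned with the order in which chunks were actually sent.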

Evaluation: Four Metrics

Metric	What it measures
Recall@k	Did the retrieval pull the right chunks?
Precision@k	Are the retrieved chunks relevant?
Faithfulness	Does the answer stay grounded in the provided context?
Answer relevance	Does the answer actually address the user's question?

Separate retrieval evaluation (recall, precision) from generation evaluation (faithfulness, relevance) — you fix each layer with different techniques. Frameworks like RAGAS, TruLens, and DeepEval automate the scoring.
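
The two retrieval metrics are simple enough to compute by hand. The sketch below assumes you have, per query, the ranked list of retrieved chunk ids and a labelled set of relevant ids; the ids shown are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant chunks that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here when nothing is labelled relevant
    return len(relevant & set(retrieved[:k])) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Share of the top k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


# Hypothetical labelled example for one query.
retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c3", "c5", "c7"}
recall = recall_at_k(retrieved, relevant, k=4)
precision = precision_at_k(retrieved, relevant, k=4)
```

Faithfulness and answer relevance need a judgement about generated text, which is why they are usually scored with an LLM judge or one of the frameworks above rather than with set arithmetic.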

Common Pitfalls

Chunks too small or too large — sweet spot is workload-specific; A/B test
No metadata filters — your VP's notes shouldn't surface for end-user queries
Ignoring updates — re-index changed docs; orphaned embeddings poison retrieval
Pure vector search — always add BM25 for keyword queries
No reranker — usually 1-2 weeks of work for outsized gains
Putting too many chunks in the prompt — lost-in-the-middle hurts; 3-8 is usually enough
No citations — users can't verify; trust collapses on the first hallucination
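
For the "ignoring updates" pitfall, one possible approach is to store a content hash per document at indexing time and compare on each sync. The sketch below assumes documents are keyed by a stable id; `plan_reindex` is a hypothetical helper, and the actual upsert and delete calls depend on your vector DB.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Decide which documents to re-embed and which to remove.

    current_docs: {doc_id: text} as the source looks today.
    indexed_hashes: {doc_id: hash} recorded at the last indexing run.
    Returns (to_upsert, to_delete) as sorted lists of doc ids.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Anything indexed but no longer in the source is an orphan.
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in current_docs
    )
    return to_upsert, to_delete


# Hypothetical state: "a" changed, "b" unchanged, "c" is new, "gone" was deleted.
indexed = {
    "a": content_hash("old text"),
    "b": content_hash("same text"),
    "gone": content_hash("removed"),
}
current = {"a": "new text", "b": "same text", "c": "fresh doc"}
to_upsert, to_delete = plan_reindex(current, indexed)
```

When a document is re-embedded, its old chunks should be deleted first, since a changed document may produce a different number of chunks.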

Advanced RAG Patterns

HyDE: Hypothetical Document Embeddings — ask the LLM to draft a plausible answer, embed that, retrieve against it. Helps when queries are short or jargon-heavy.
Query rewriting: Use an LLM to expand or rephrase the query into multiple variants before retrieval
Multi-hop RAG: Retrieve, generate intermediate result, retrieve again with refined query
Agentic RAG: Give the agent a "retrieve" tool; it decides when and what to fetch
GraphRAG (Microsoft): Build a knowledge graph from the corpus; combine graph traversal with vector search
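
As one example of these patterns, HyDE reduces to a few lines once the model, embedder, and search are injected. All three callables below are hypothetical placeholders for whatever LLM client, embedding model, and vector DB you use, and the prompt wording is an assumption.

```python
def hyde_retrieve(query, generate, embed, search, k=5):
    """HyDE sketch: retrieve with the embedding of a drafted answer.

    generate(prompt) -> str        hypothetical LLM call
    embed(text) -> list[float]     hypothetical embedding call
    search(vector, k) -> list      hypothetical vector DB search
    """
    draft = generate(f"Write a short passage that plausibly answers: {query}")
    # The draft may be factually wrong; it is used only as a search key.
    return search(embed(draft), k)


# Stub callables so the sketch runs without any external service.
def stub_generate(prompt):
    return "A plausible but unverified answer."


def stub_embed(text):
    return [float(len(text)), 0.0]


def stub_search(vector, k):
    return ["chunk-1", "chunk-2", "chunk-3"][:k]


results = hyde_retrieve("What is X?", stub_generate, stub_embed, stub_search, k=2)
```

The drafted passage tends to share vocabulary and length with real documents, which is why it can match better than a short or jargon-heavy query.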

When NOT to Use RAG

Scenario	Better fit
The whole knowledge base fits in context	Stuff it all in — simpler, no embedding pipeline
You need the model to behave differently, not know more	Fine-tuning (next lesson)
Highly structured queries against a DB	Text-to-SQL or direct query, not RAG
The data is small and rarely changes	Include directly in system prompt

RAG is the default for "answer questions about my private corpus" — but it isn't a silver bullet. The next lesson covers the strategic decision between RAG, fine-tuning, and pure prompt engineering.