6 min read·Lesson 6 of 8

Fine-Tuning vs RAG vs Prompt Engineering

Three ways to customise an LLM — when to fine-tune, when to use RAG, when prompts alone are enough, and how they combine.

You have three levers to customise an LLM's behaviour to your domain. Each sits at a different point on the cost/effort/durability curve, and choosing the wrong one can waste months.

The Three Levers

Lever | What changes | Lasts for
Prompt engineering | The single API call | That request
RAG | What the model "knows" at query time | Per-query (updates whenever your corpus updates)
Fine-tuning | The model's parameters | Permanently (until you retrain)

When Prompt Engineering Is Enough

Most "I need to customise the model" requests turn out to be prompt engineering problems. Try this first if:

  • The model already has the knowledge — you just need to direct it
  • The customisation is about format, tone, or workflow
  • You can fit needed context (instructions + examples + data) in the context window
  • You're iterating fast and don't yet know the final requirements
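
A minimal sketch of the "fits in the context window" check from the list above, assuming a crude rule of thumb of about 4 characters per token. The names `Example`, `estimate_tokens`, and `build_prompt` are hypothetical helpers, and a real system would count tokens with the provider's own tokenizer.

```python
from dataclasses import dataclass

# Assumption: roughly 4 characters per token for English text.
# Use the provider's tokenizer for a real count.
CHARS_PER_TOKEN = 4


@dataclass
class Example:
    input: str
    output: str


def estimate_tokens(text: str) -> int:
    """Crude token estimate; rounds up so the budget check errs on the safe side."""
    return -(-len(text) // CHARS_PER_TOKEN)


def build_prompt(instructions: str, examples: list, data: str,
                 context_window: int, reserve_for_output: int = 1000) -> str:
    """Assemble instructions + few-shot examples + data, or raise if it will not fit."""
    shots = "\n\n".join(f"Input: {e.input}\nOutput: {e.output}" for e in examples)
    prompt = f"{instructions}\n\n{shots}\n\nInput: {data}\nOutput:"
    needed = estimate_tokens(prompt) + reserve_for_output
    if needed > context_window:
        raise ValueError(f"needs ~{needed} tokens, window is {context_window}")
    return prompt


prompt = build_prompt(
    "Classify the support ticket as 'billing' or 'technical'.",
    [Example("I was charged twice.", "billing")],
    "The app crashes on login.",
    context_window=8000,  # hypothetical window size
)
```

If `build_prompt` raises, the needed context no longer fits, and that is a signal to look at RAG rather than a longer prompt.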

When RAG Is the Right Tool

Use RAG when:

  • You have a large, structured corpus the model doesn't know (product docs, internal wiki, ticket history)
  • That corpus changes frequently (daily or weekly)
  • You need citations and traceability — users must verify claims
  • Permissions matter — different users see different data
  • The knowledge is large enough you can't fit it all in context

RAG is by far the most common production pattern for "AI on my data" — and it should be the default starting point for any knowledge-grounded application.
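
The RAG criteria above (permissions, citations, a corpus too large for the prompt) can be sketched in a few lines. This is an illustration only: it swaps in bag-of-words cosine similarity where a production pipeline would use an embedding model and a vector store, and `Doc`, `allowed_groups`, `retrieve`, and `build_grounded_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical permission model


def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list, user_groups: set, k: int = 2) -> list:
    """Filter by permission first, then rank the visible documents by similarity."""
    visible = [d for d in corpus if d.allowed_groups & user_groups]
    q = tokenize(query)
    ranked = sorted(visible, key=lambda d: cosine(q, tokenize(d.text)), reverse=True)
    return ranked[:k]


def build_grounded_prompt(query: str, docs: list) -> str:
    """Tag each passage with its id so the answer can cite its sources."""
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return ("Answer using only the sources below and cite ids in brackets.\n\n"
            f"{context}\n\nQuestion: {query}")


corpus = [
    Doc("kb-1", "Refunds are issued within 14 days of purchase.",
        frozenset({"support", "public"})),
    Doc("kb-2", "Internal escalation path for refund disputes.",
        frozenset({"support"})),
    Doc("kb-3", "Office parking rules for employees.", frozenset({"staff"})),
]
hits = retrieve("How long do refunds take?", corpus, user_groups={"public"})
prompt = build_grounded_prompt("How long do refunds take?", hits)
```

Filtering on permissions before ranking means a document the user may not see never reaches the prompt, which is harder to guarantee once knowledge is baked into model weights.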

When to Fine-Tune

Fine-tuning is justified when:

  • You need the model to behave differently (specific style, format, decision pattern) — not just know different things
  • You have hundreds or thousands of high-quality (input, desired output) pairs
  • Prompt engineering hits a ceiling — even with great prompts, the model keeps drifting
  • Cost or latency matters and you can run a smaller model that does the job after fine-tuning
  • The task is narrow and well-defined (classification, extraction, structured generation)

Common fine-tuning use cases:

  • Specialised classification (spam vs ham, intent recognition, ticket routing)
  • Structured extraction from text (invoice fields, medical entities)
  • Style/tone matching (your brand voice, customer support phrasing)
  • Code generation in a domain-specific language
  • Compressing a larger model's behaviour into a cheaper one (distillation)

How Fine-Tuning Works

Full fine-tuning updates every parameter of the model, often billions of them. It is expensive, requires high-memory GPUs, and can erase unrelated knowledge the model previously had ("catastrophic forgetting").

Parameter-Efficient Fine-Tuning (PEFT) updates a small fraction of parameters via adapters:

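  • LoRA (Low-Rank Adaptation): Freeze the original weights and train two small matrices whose product is added to them to produce the adapted behaviour. The trainable parameters are often 0.1-1% of the base model's.
  • QLoRA: LoRA on a quantised (4-bit) base model, which cuts memory enough to fine-tune very large models on a single GPU (the QLoRA paper reports a 65B model on one 48 GB GPU).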
  • Prefix tuning / prompt tuning: Learn a "virtual" prefix of tokens. Even smaller.
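
The LoRA arithmetic can be checked directly. The sketch below assumes a single 4096 × 4096 weight matrix and rank 8, both illustrative values not taken from any particular model, and uses the standard LoRA update W' = W + (alpha / r) · B · A. The overall fraction for a whole model depends on which matrices receive adapters.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, lora


def matmul(x, y):
    """Plain-Python matrix product, enough for a tiny worked example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha: float, rank: int):
    """W' = W + (alpha / r) * B @ A; W stays frozen, only B and A are trained."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
fraction = lora / full  # 65,536 / 16,777,216, about 0.39%

# Tiny worked example: a 2x2 weight adapted with rank 1.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # 2 x 1
a = [[0.5, 0.0]]     # 1 x 2
adapted = lora_effective_weight(w, b, a, alpha=1.0, rank=1)
# adapted == [[1.5, 0.0], [1.0, 1.0]]
```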

LoRA is the dominant choice in 2026. Workflow: prepare a JSONL file of input/output pairs → train a LoRA adapter (Hugging Face Transformers with PEFT, Unsloth, Axolotl) → deploy the base model + adapter. Hosted fine-tuning APIs (for example OpenAI's) take a similar JSONL file and handle training and serving for you.
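
The first step of that workflow, preparing the JSONL file, might look like the sketch below. The field names `input` and `output` are an assumption: each training tool and hosted API defines its own schema, so check the one you use. `write_jsonl` is a hypothetical helper that also drops empty and duplicate inputs.

```python
import io
import json


def write_jsonl(pairs, fh) -> int:
    """Write one JSON object per line; skip empty or duplicate inputs."""
    seen = set()
    written = 0
    for inp, out in pairs:
        key = inp.strip()
        if not key or not out.strip() or key in seen:
            continue
        seen.add(key)
        record = {"input": key, "output": out.strip()}  # assumed schema
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written


pairs = [
    ("Reset my password", "account"),
    ("Reset my password", "account"),   # duplicate, skipped
    ("   ", "billing"),                 # empty input, skipped
    ("Invoice is wrong", "billing"),
]
buffer = io.StringIO()  # stands in for open("train.jsonl", "w")
count = write_jsonl(pairs, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```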

How much data do you need?

Goal | Examples needed
Style/format adjustment | 50-500
Domain classification (5-10 classes) | 500-2,000
Extraction or transformation | 1,000-10,000
Major behavioural change | 10,000+

Quality matters more than quantity. 500 carefully labelled examples often beat 5,000 noisy ones.

Distillation

A specialised fine-tune: use a large frontier model (GPT-5, Claude Opus) to label examples, then train a small model (GPT-4o-mini, Llama 3B) to imitate the labels. The result: a model 10-100× cheaper that performs nearly as well on your specific task.

This is the dominant pattern for putting LLMs in cost-sensitive production paths: high-volume classification, routing, simple extraction. Frontier models are too expensive to call millions of times daily; a distilled small model is not.
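
A sketch of the labelling half of distillation, under stated assumptions: `fake_teacher` stands in for a frontier-model API call, and the confidence threshold is an added filter for illustration, not something the workflow above requires. The resulting records would then be used as training data for the small model.

```python
from typing import Callable, Iterable


def build_distillation_set(
    inputs: Iterable[str],
    teacher: Callable[[str], tuple],
    min_confidence: float = 0.8,  # assumed threshold; tune on your own data
) -> list:
    """Label inputs with the teacher and keep only confident labels."""
    dataset = []
    for text in inputs:
        label, confidence = teacher(text)
        if confidence >= min_confidence:
            dataset.append({"input": text, "output": label})
    return dataset


def fake_teacher(text: str) -> tuple:
    """Stand-in for a frontier-model call; a real one would hit an API."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "billing", 0.95
    if "crash" in lowered:
        return "technical", 0.9
    return "other", 0.5


training_set = build_distillation_set(
    ["My invoice is wrong", "App crashes on start", "hello?"], fake_teacher
)
```

Dropping low-confidence labels trades dataset size for label quality, in line with the point that carefully labelled examples often beat noisy ones.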

The Decision Tree

Need to customise an LLM?
├── Is it about behaviour/style/format that prompts can't pin down?
│   └── YES → Consider fine-tuning
├── Is it about giving the model new knowledge?
│   ├── Knowledge changes often or is large?
│   │   └── YES → RAG
│   └── Knowledge is small and stable?
│       └── YES → Put it in the system prompt
└── Is the issue cost/latency?
    └── YES → Distil a frontier model into a small one
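
The tree can also be written as a small function. This is one possible encoding, with hypothetical parameter names; because the three top-level questions are not mutually exclusive it returns a list, and falling back to prompt engineering when no branch applies is an assumption consistent with the "try this first" advice earlier.

```python
def choose_lever(*, behaviour_gap: bool = False,
                 needs_new_knowledge: bool = False,
                 knowledge_large_or_changing: bool = False,
                 cost_or_latency_issue: bool = False) -> list:
    """Return every recommendation whose branch of the tree applies."""
    recommendations = []
    if behaviour_gap:
        recommendations.append("consider fine-tuning")
    if needs_new_knowledge:
        if knowledge_large_or_changing:
            recommendations.append("RAG")
        else:
            recommendations.append("put it in the system prompt")
    if cost_or_latency_issue:
        recommendations.append("distil a frontier model into a small one")
    # Assumption: with no gap identified, stay with prompt engineering.
    return recommendations or ["prompt engineering"]


plan = choose_lever(needs_new_knowledge=True,
                    knowledge_large_or_changing=True,
                    cost_or_latency_issue=True)
# plan == ["RAG", "distil a frontier model into a small one"]
```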

They Stack

Production GenAI systems rarely use just one approach:

  • A fine-tuned small model handles routing/classification cheaply
  • RAG injects fresh corpus knowledge into the prompt
  • Carefully engineered prompts steer the model's response style and constraints
  • Tool use lets the model fetch the data RAG missed

Knowing how to mix these is the real skill. There is no single right answer — only the right combination for the problem and the budget.
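
As a rough illustration of how the pieces compose, the sketch below wires the four layers into one function. Every callable is a stub with a hypothetical name: `route` stands in for a fine-tuned small classifier, `retrieve` for a RAG lookup, `fetch_tool` for a tool call, and `generate` for the main model.

```python
from typing import Callable


def answer(question: str,
           route: Callable[[str], str],
           retrieve: Callable[[str, str], list],
           fetch_tool: Callable[[str], list],
           generate: Callable[[str], str]) -> str:
    topic = route(question)                # fine-tuned small model: cheap routing
    passages = retrieve(question, topic)   # RAG: fresh corpus knowledge
    if not passages:
        passages = fetch_tool(question)    # tool use: fetch what retrieval missed
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (                             # engineered prompt: style and constraints
        f"You are a concise {topic} assistant. Use only the context.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )
    return generate(prompt)


# Stubs so the sketch runs end to end.
def stub_route(q: str) -> str:
    return "billing" if "invoice" in q.lower() else "general"


def stub_retrieve(q: str, topic: str) -> list:
    return {"billing": ["Invoices are sent on the 1st."]}.get(topic, [])


def stub_tool(q: str) -> list:
    return ["(live lookup result)"]


def echo_model(prompt: str) -> str:
    return prompt  # a real model would generate an answer here


billing_reply = answer("When is my invoice sent?", stub_route, stub_retrieve,
                       stub_tool, echo_model)
general_reply = answer("Is the site down?", stub_route, stub_retrieve,
                       stub_tool, echo_model)
```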

The Cost Math

Approach | One-time cost | Per-query cost
Prompt engineering | Engineer time (days) | Standard model cost
RAG | Pipeline build (weeks); embedding storage | Standard model cost + small embedding cost; retrieval adds milliseconds of latency
Fine-tuning (LoRA) | $10-$1,000 training; data labelling | Standard model cost (often for a cheaper model)
Full fine-tuning | $5K-$100K+ | Often hosting costs (custom model)
Distillation | Frontier API for labels + fine-tune cost | Small-model cost, usually 10-100× cheaper at runtime
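
The table implies a break-even point: a one-time cost pays off once the per-query savings exceed it. The sketch below works that out with made-up figures ($500 one-time, $0.01 per frontier query, $0.0005 per distilled query); substitute your own numbers.

```python
import math
from typing import Optional


def break_even_queries(one_time_cost: float, baseline_per_query: float,
                       new_per_query: float) -> Optional[int]:
    """Queries needed before the one-time cost is repaid, or None if it never is."""
    saving = baseline_per_query - new_per_query
    if saving <= 0:
        return None
    return math.ceil(one_time_cost / saving)


# Hypothetical figures: a 20x cheaper distilled model, $500 to label and train.
queries = break_even_queries(one_time_cost=500.0,
                             baseline_per_query=0.01,
                             new_per_query=0.0005)
# queries == 52632
```

At a million queries a day that threshold is crossed within the first day; at a few hundred queries a day it takes months, which is when staying on the frontier model is the cheaper choice.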

A Pragmatic Path

  1. Build with prompts only on the best frontier model
  2. Measure failure modes carefully
  3. If the gaps are missing knowledge → add RAG
  4. If the gaps are behavioural → try few-shot, then CoT, then a reasoning model
  5. If the cost is too high → distil into a cheaper model
  6. If quality is still capped → fine-tune for the specific behaviour gap

This path keeps you out of expensive early commitments and lets evidence guide each decision.
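
Step 2, measuring failure modes, can be as simple as tagging each failed evaluation case and counting. In the sketch below the tag names and the `NEXT_STEP` mapping are hypothetical; the mapping simply restates steps 3 to 5 of the path.

```python
from collections import Counter

# Hypothetical failure tags mapped to the step the path suggests.
NEXT_STEP = {
    "missing_knowledge": "add RAG",
    "behaviour": "try few-shot, then CoT, then a reasoning model",
    "cost": "distil into a cheaper model",
}


def next_step(failure_tags: list) -> str:
    """Recommend the step that addresses the most common failure tag."""
    if not failure_tags:
        return "keep the prompt-only baseline"
    top_tag, _ = Counter(failure_tags).most_common(1)[0]
    return NEXT_STEP.get(top_tag, "investigate: unrecognised failure tag")


tags = ["missing_knowledge", "behaviour", "missing_knowledge"]
recommendation = next_step(tags)
# recommendation == "add RAG"
```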

Key Takeaways

  • Prompting changes behaviour for a single call; RAG injects fresh knowledge; fine-tuning changes the model itself.
  • Use RAG when knowledge is large, dynamic, or auditable; use fine-tuning when behaviour/style/format needs to change permanently.
  • LoRA / QLoRA adapters make fine-tuning affordable — train millions of parameters, not billions.
  • Distillation lets you teach a small model to mimic a large one — major cost savings.
  • These approaches stack: a fine-tuned small model + RAG + good prompts usually beats any single technique.
