Fine-Tuning vs RAG vs Prompt Engineering — Generative AI & Prompt Engineering | CertQnA

You have three levers to customise an LLM's behaviour to your domain. Each is a different point on the cost/effort/durability curve. Choosing the wrong one wastes months.

The Three Levers

Lever	What changes	Lasts for
Prompt engineering	The single API call	That request
RAG	What the model "knows" at query time	Per-query (updates whenever your corpus updates)
Fine-tuning	The model's parameters	Permanently (until you retrain)

When Prompt Engineering Is Enough

Most requests to "customise" a model turn out to be prompt engineering problems. Try this first if:

The model already has the knowledge — you just need to direct it
The customisation is about format, tone, or workflow
You can fit the needed context (instructions + examples + data) in the context window
You're iterating fast and don't yet know the final requirements
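The third condition above can be checked mechanically before any call is made. The following is a minimal sketch, assuming a rough heuristic of about four characters per token (real tokenizers differ) and hypothetical helper names `build_prompt`, `estimate_tokens`, and `fits_context`; the product name and window sizes are illustrative only.

```python
def build_prompt(instructions, examples, data, question):
    """Assemble instructions, few-shot examples, and data into one prompt string."""
    parts = [instructions.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    parts.append("Context:")
    parts.append(data.strip())
    parts.append("")
    parts.append(f"Input: {question}")
    parts.append("Output:")
    return "\n".join(parts)


def estimate_tokens(text, chars_per_token=4):
    """Rough estimate only (assumption: ~4 characters per token)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_context(prompt, context_window, reserved_for_output):
    """True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window


prompt = build_prompt(
    instructions="Classify the support ticket as 'billing' or 'technical'.",
    examples=[("I was charged twice", "billing"), ("App crashes on login", "technical")],
    data="Product: ExampleApp (hypothetical).",
    question="My invoice shows the wrong amount",
)
print(prompt)
print(fits_context(prompt, context_window=8000, reserved_for_output=500))
```

If the check fails even after trimming examples, that is a signal the knowledge no longer fits in the prompt and another lever may be needed.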

When RAG Is the Right Tool

Use RAG when:

You have a large corpus the model doesn't know (product docs, internal wiki, ticket history)
That corpus changes frequently (daily or weekly)
You need citations and traceability — users must verify claims
Permissions matter — different users see different data
The knowledge is large enough you can't fit it all in context

RAG is by far the most common production pattern for "AI on my data" — and it should be the default starting point for any knowledge-grounded application.
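The sketch below illustrates three of the points above at toy scale: retrieval at query time, permission filtering, and citations. It is only an illustration. The corpus, document ids, and group names are hypothetical, and word-count cosine similarity stands in for the embedding search a real pipeline would typically use.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: id (used as the citation), groups allowed to read it, text.
CORPUS = [
    {"id": "wiki/vpn", "groups": {"staff"},
     "text": "To reset the VPN token open the security portal"},
    {"id": "docs/billing", "groups": {"staff", "support"},
     "text": "Refunds are issued within five business days"},
    {"id": "hr/salaries", "groups": {"hr"},
     "text": "Salary bands and refunds of expenses are reviewed yearly"},
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, user_groups, k=2):
    """Return the top-k documents this user may read, best match first."""
    query_vec = Counter(tokenize(query))
    scored = []
    for doc in CORPUS:
        if not (doc["groups"] & user_groups):
            continue  # permission filter happens before ranking
        score = cosine(query_vec, Counter(tokenize(doc["text"])))
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_grounded_prompt(query, docs):
    """Put the retrieved text in the prompt and ask for bracketed citations."""
    sources = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in docs)
    return (
        "Answer using only the sources below and cite their ids in brackets.\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


docs = retrieve("How long do refunds take?", user_groups={"support"})
print(build_grounded_prompt("How long do refunds take?", docs))
```

Filtering by permission before ranking means a restricted document never reaches the prompt, whatever its similarity score.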

When to Fine-Tune

Fine-tuning is justified when:

You need the model to behave differently (specific style, format, decision pattern) — not just know different things
You have hundreds or thousands of high-quality (input, desired output) pairs
Prompt engineering hits a ceiling — even with great prompts, the model keeps drifting
Cost or latency matters and you can run a smaller model that does the job after fine-tuning
The task is narrow and well-defined (classification, extraction, structured generation)

Common fine-tuning use cases:

Specialised classification (spam vs ham, intent recognition, ticket routing)
Structured extraction from text (invoice fields, medical entities)
Style/tone matching (your brand voice, customer support phrasing)
Code generation in a domain-specific language
Compressing a larger model's behaviour into a cheaper one (distillation)

How Fine-Tuning Works

Full fine-tuning updates every parameter of the model, often billions of them. It is expensive, requires high-memory GPUs, and can erase unrelated knowledge ("catastrophic forgetting").

Parameter-Efficient Fine-Tuning (PEFT) updates a small fraction of parameters via adapters:

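LoRA (Low-Rank Adaptation): Freeze the original weights and train two small low-rank matrices whose product is added to them to produce the adapted behaviour. Trainable parameters are often 0.1-1% of the base model's.
QLoRA: LoRA on a quantised (4-bit) base model. This cuts memory enough to fine-tune a 65B-class model on a single 48 GB GPU, and smaller models on consumer GPUs.
Prefix tuning / prompt tuning: Learn a small "virtual" prefix of token embeddings while the model itself stays frozen. Even fewer trainable parameters than LoRA.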
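The arithmetic behind these adapters is simple enough to sketch in plain Python. This is an illustrative calculation, not a training loop: the layer size (4096 × 4096), the rank (8), and the function names are assumptions chosen for the example, and `scale` stands in for the scaling factor that LoRA implementations apply to the update.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable share when a d_out x d_in weight gets a rank-r LoRA update."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return lora / full


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A are trained."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone (no activations or optimiser state)."""
    return n_params * bits_per_param / 8 / 1e9


print(f"{lora_trainable_fraction(4096, 4096, 8):.4%}")  # share of one layer that is trained

w = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (toy 2 x 2)
b = [[1.0], [0.0]]            # 2 x 1
a = [[0.5, 0.5]]              # 1 x 2
print(lora_effective_weight(w, b, a))

print(weight_memory_gb(70e9, 16), weight_memory_gb(70e9, 4))  # 16-bit vs 4-bit weights
```

For this layer the adapter trains roughly 0.39% of the weights, inside the 0.1-1% range quoted above. Quantising a 70B model from 16 bits to 4 bits shrinks the weights alone from about 140 GB to about 35 GB, which is why QLoRA changes what hardware is feasible.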

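LoRA is the dominant choice in 2026. Workflow: prepare a JSONL of input/output pairs → train a LoRA adapter (Hugging Face Transformers with PEFT, Unsloth, Axolotl) → deploy the base model + adapter. Hosted fine-tuning services, such as OpenAI's fine-tuning API, take a similar dataset but run the training for you and return a hosted model rather than an adapter you deploy yourself.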
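The first step of that workflow, preparing the JSONL, might look like the sketch below. The field names `input` and `output` are assumptions for illustration; each training framework or hosted API defines its own schema (for example chat-style message lists), so check the target format before writing real data.

```python
import json


def to_jsonl(pairs):
    """Serialise (input, output) pairs as one JSON object per line."""
    lines = []
    for index, (source, target) in enumerate(pairs):
        if not source.strip() or not target.strip():
            raise ValueError(f"pair {index} has an empty input or output")
        # Hypothetical schema: adapt the keys to what your trainer expects.
        lines.append(json.dumps({"input": source, "output": target}))
    return "\n".join(lines) + "\n"


def from_jsonl(text):
    """Parse JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


pairs = [
    ("Invoice 1 total: 40 EUR", '{"total": 40, "currency": "EUR"}'),
    ("Invoice 2 total: 15 USD", '{"total": 15, "currency": "USD"}'),
]
jsonl_text = to_jsonl(pairs)
print(jsonl_text, end="")
```

Round-tripping the file through `from_jsonl` before training is a cheap way to catch malformed lines early.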

How much data do you need?

Goal	Examples needed
Style/format adjustment	50-500
Domain classification (5-10 classes)	500-2,000
Extraction or transformation	1,000-10,000
Major behavioural change	10,000+

Quality matters more than quantity. 500 carefully labelled examples often beat 5,000 noisy ones.
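One inexpensive way to act on this is to clean the dataset before training. The sketch below assumes a simple classification dataset of (text, label) tuples; it drops exact duplicates after normalising case and whitespace, and discards any input that appears with conflicting labels. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def normalise(text):
    """Lowercase and collapse whitespace so near-identical inputs compare equal."""
    return " ".join(text.lower().split())


def clean_examples(examples):
    """Drop duplicate inputs and inputs that carry conflicting labels."""
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[normalise(text)].add(label)

    cleaned, seen = [], set()
    for text, label in examples:
        key = normalise(text)
        if len(labels_by_input[key]) > 1:
            continue  # labellers disagreed: noisy, so leave it out
        if key in seen:
            continue  # duplicate of an example already kept
        seen.add(key)
        cleaned.append((text, label))
    return cleaned


raw = [
    ("Win a prize now", "spam"),
    ("win a  prize NOW", "spam"),   # duplicate after normalisation
    ("Meeting at 3pm", "ham"),
    ("Meeting at 3pm", "spam"),     # conflicting label
    ("Lunch tomorrow?", "ham"),
]
print(clean_examples(raw))
```

Conflicting items are dropped here rather than resolved; in practice they are often worth sending back for relabelling.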

Distillation

A specialised form of fine-tuning: use a large frontier model (GPT-5, Claude Opus) to label examples, then train a small model (GPT-4o-mini, a 3B-parameter Llama model) to imitate the labels. The result can be a model 10-100× cheaper that performs nearly as well on your specific task.

This is the dominant pattern for putting LLMs in cost-sensitive production paths: high-volume classification, routing, simple extraction. Frontier models are too expensive to call millions of times a day; a distilled small model is not.
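The labelling half of distillation can be sketched as follows. Here `teacher_label` is a hypothetical stand-in for a call to a frontier model (a keyword rule is used so the example runs offline), and the label set is invented for illustration. The resulting records would then feed a fine-tuning job for the small model.

```python
def teacher_label(text):
    """Hypothetical stand-in for a frontier-model API call."""
    lowered = text.lower()
    return "billing" if "invoice" in lowered or "charge" in lowered else "technical"


def build_distillation_set(unlabelled_texts, teacher, allowed_labels):
    """Label raw inputs with the teacher and keep only well-formed labels."""
    dataset = []
    for text in unlabelled_texts:
        label = teacher(text)
        if label not in allowed_labels:
            continue  # discard teacher outputs outside the label set
        dataset.append({"input": text, "output": label})
    return dataset


texts = ["My invoice is wrong", "App crashes on start", "I was charged twice"]
train_set = build_distillation_set(texts, teacher_label, {"billing", "technical"})
for record in train_set:
    print(record)
```

Filtering the teacher's outputs matters because the student will imitate whatever it is given, including malformed labels.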

The Decision Tree

Need to customise an LLM?
├── Is it about behaviour/style/format that prompts can't pin down?
│   └── YES → Consider fine-tuning
├── Is it about giving the model new knowledge?
│   ├── Knowledge changes often or is large?
│   │   └── YES → RAG
│   └── Knowledge is small and stable?
│       └── YES → Put in system prompt
└── Is the issue cost/latency?
    └── YES → Distil a frontier model into a small one
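The tree above can also be written as a small function, which makes it plain that the branches are not mutually exclusive. This is a sketch with hypothetical parameter names; it returns every recommendation whose condition holds and an empty list when none does.

```python
def recommend(behaviour_gap=False, new_knowledge=False,
              knowledge_large_or_changing=False, cost_or_latency_issue=False):
    """Encode the decision tree; several branches can apply at once."""
    recommendations = []
    if behaviour_gap:
        recommendations.append("consider fine-tuning")
    if new_knowledge:
        if knowledge_large_or_changing:
            recommendations.append("RAG")
        else:
            recommendations.append("put it in the system prompt")
    if cost_or_latency_issue:
        recommendations.append("distil a frontier model into a small one")
    return recommendations


print(recommend(new_knowledge=True, knowledge_large_or_changing=True))
print(recommend(behaviour_gap=True, cost_or_latency_issue=True))
```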

They Stack

Production GenAI systems rarely use just one approach:

A fine-tuned small model handles routing/classification cheaply
RAG injects fresh corpus knowledge into the prompt
Carefully engineered prompts steer the model's response style and constraints
Tool use lets the model fetch the data RAG missed

Knowing how to mix these is the real skill. There is no single right answer — only the right combination for the problem and the budget.

The Cost Math

Approach	One-time cost	Per-query cost
Prompt engineering	Engineer time (days)	Standard model cost
RAG	Pipeline build (weeks); embeddings storage	Standard model cost + small embedding cost; retrieval adds milliseconds of latency
Fine-tuning (LoRA)	$10-$1000 training; data labelling	Standard model cost (often for a cheaper model)
Full fine-tuning	$5K-$100K+	Often hosting costs (custom model)
Distillation	Frontier API for labels + fine-tune cost	Small-model cost — usually 10-100× cheaper at runtime
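To compare a one-time cost against a per-query saving, a break-even calculation helps. The figures below are hypothetical placeholders, not measured prices: a 500-dollar one-time cost for labelling and training, 0.01 dollars per query for the current model, and 0.001 dollars per query afterwards (a 10× reduction, the low end of the range in the table).

```python
import math


def break_even_queries(one_time_cost, current_cost_per_query, new_cost_per_query):
    """Queries needed before the one-time cost is repaid by per-query savings."""
    saving = current_cost_per_query - new_cost_per_query
    if saving <= 0:
        return None  # the new setup is not cheaper per query, so it never pays back
    return math.ceil(one_time_cost / saving)


# Hypothetical figures for illustration only.
queries = break_even_queries(one_time_cost=500, current_cost_per_query=0.01,
                             new_cost_per_query=0.001)
print(queries)
```

With these placeholder numbers the investment pays back after roughly 56,000 queries. A system handling millions of queries a day gets there within hours, while a low-volume internal tool might take months.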

A Pragmatic Path

Build with prompts only on the best frontier model
Measure failure modes carefully
If the gaps are missing knowledge → add RAG
If the gaps are behavioural → try few-shot examples, then chain-of-thought (CoT) prompting, then a reasoning model
If the cost is too high → distil into a cheaper model
If quality is still capped → fine-tune for the specific behaviour gap

This path keeps you out of expensive early commitments and lets evidence guide each decision.
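The "measure failure modes" step can be as simple as tagging each failed evaluation case and counting the tags. The sketch below assumes a hypothetical tagging scheme (`missing_knowledge`, `behaviour`, `cost`) and maps each tag to the next step named in the path above.

```python
from collections import Counter

# Hypothetical failure tags mapped to the next step from the path above.
NEXT_STEP = {
    "missing_knowledge": "add RAG",
    "behaviour": "try few-shot, then CoT, then a reasoning model",
    "cost": "distil into a cheaper model",
}


def prioritise(failure_tags):
    """Count tagged failures and pair each tag with its suggested next step."""
    counts = Counter(failure_tags)
    return [(tag, n, NEXT_STEP.get(tag, "investigate manually"))
            for tag, n in counts.most_common()]


tags = ["missing_knowledge"] * 3 + ["behaviour"] * 2 + ["cost"]
for tag, count, step in prioritise(tags):
    print(f"{tag}: {count} failures -> {step}")
```

Working from the most frequent failure tag first keeps each investment tied to evidence rather than to whichever technique is most fashionable.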