You have three levers to customise an LLM's behaviour to your domain. Each is a different point on the cost/effort/durability curve. Choosing the wrong one wastes months.
The Three Levers
| Lever | What changes | Lasts for |
|---|---|---|
| Prompt engineering | The single API call | That request |
| RAG | What the model "knows" at query time | Per-query (updates whenever your corpus updates) |
| Fine-tuning | The model's parameters | Permanently (until you retrain) |
When Prompt Engineering Is Enough
Most "I need to customise the model" requests turn out to be prompt engineering problems. Try this first if:
- The model already has the knowledge — you just need to direct it
- The customisation is about format, tone, or workflow
- You can fit the needed context (instructions + examples + data) in the context window
- You're iterating fast and don't yet know the final requirements
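As a rough illustration of the "fit it in the context window" check above, the sketch below assembles instructions, few-shot examples, and data into one prompt and estimates whether it fits. The `Example` class, the prompt layout, and the 4-characters-per-token heuristic are assumptions made for this example; a real check would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One few-shot demonstration (hypothetical structure for this sketch)."""
    user_input: str
    ideal_output: str


def build_prompt(instructions: str, examples: list[Example], data: str, query: str) -> str:
    """Assemble instructions, few-shot examples, reference data, and the query."""
    parts = [instructions.strip(), ""]
    for ex in examples:
        parts.append(f"Input: {ex.user_input}")
        parts.append(f"Output: {ex.ideal_output}")
        parts.append("")
    parts.append("Reference data:")
    parts.append(data.strip())
    parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


def fits_in_context(prompt: str, context_window_tokens: int,
                    chars_per_token: float = 4.0) -> bool:
    """Rough estimate only: 4 characters per token is an assumed heuristic."""
    estimated_tokens = len(prompt) / chars_per_token
    return estimated_tokens <= context_window_tokens
```

If `fits_in_context` starts returning `False` as the data grows, that is one signal that the problem has moved from prompt engineering towards retrieval.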
When RAG Is the Right Tool
Use RAG when:
- You have a large corpus of documents the model doesn't know (product docs, internal wiki, ticket history)
- That corpus changes frequently (daily or weekly)
- You need citations and traceability — users must verify claims
- Permissions matter — different users see different data
- The knowledge is large enough you can't fit it all in context
RAG is by far the most common production pattern for "AI on my data" — and it should be the default starting point for any knowledge-grounded application.
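To make the retrieval, permissions, and citation points concrete, here is a minimal sketch of the retrieve-then-prompt step. It swaps in bag-of-words cosine similarity for a real embedding model and vector store, purely to stay self-contained; `Doc`, `retrieve`, and `build_grounded_prompt` are hypothetical names, not part of any library.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str                # used as the citation handle
    text: str
    allowed_groups: frozenset  # groups permitted to see this document


def _vector(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list, user_groups: frozenset, k: int = 2) -> list:
    """Filter by permission first, then rank the visible documents by similarity."""
    visible = [d for d in corpus if d.allowed_groups & user_groups]
    q = _vector(query)
    scored = [(_cosine(q, _vector(d.text)), d) for d in visible]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def build_grounded_prompt(query: str, docs: list) -> str:
    """Inject retrieved text with ids so the answer can cite its sources."""
    sources = "\n".join(f"[{d.doc_id}] {d.text}" for d in docs)
    return (
        "Answer using only the sources below and cite their ids in brackets.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )
```

Filtering by permission before ranking, rather than after, means a document the user may not see can never reach the prompt.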
When to Fine-Tune
Fine-tuning is justified when:
- You need the model to behave differently (specific style, format, decision pattern) — not just know different things
- You have hundreds or thousands of high-quality (input, desired output) pairs
- Prompt engineering hits a ceiling — even with great prompts, the model keeps drifting
- Cost or latency matters and you can run a smaller model that does the job after fine-tuning
- The task is narrow and well-defined (classification, extraction, structured generation)
Common fine-tuning use cases:
- Specialised classification (spam vs ham, intent recognition, ticket routing)
- Structured extraction from text (invoice fields, medical entities)
- Style/tone matching (your brand voice, customer support phrasing)
- Code generation in a domain-specific language
- Compressing a larger model's behaviour into a cheaper one (distillation)
How Fine-Tuning Works
Full fine-tuning updates every parameter of the model, often billions of them. It is expensive, requires high-memory GPUs, and can erase unrelated knowledge ("catastrophic forgetting").
Parameter-Efficient Fine-Tuning (PEFT) updates a small fraction of parameters via adapters:
- LoRA (Low-Rank Adaptation): Freeze the original weights and train two small matrices whose product is added to them to produce the adapted behaviour. Often 0.1-1% of base parameters.
- QLoRA: LoRA on a quantised (4-bit) base model. This cuts memory enough to fine-tune models in the 65-70B range on a single 48 GB GPU; 24 GB consumer cards are typically limited to smaller models.
- Prefix tuning / prompt tuning: Learn a short "virtual" prefix of token embeddings while the base weights stay frozen. Usually even fewer trainable parameters than LoRA.
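The LoRA idea can be sketched in a few lines of plain Python. The sketch assumes the usual formulation from the LoRA paper, where the effective weight is W + (alpha / r) · B·A with W frozen; the function names are made up for this example, and real training would use a tensor library rather than nested lists.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A). W stays frozen; only A and B are trained."""
    scale = alpha / rank
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of one full d_out x d_in matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


# A 4096 x 4096 projection adapted at rank 8 trains about 0.39% of that matrix.
print(lora_trainable_fraction(4096, 4096, 8))  # 0.00390625
```

The fraction is per adapted matrix; the overall share for a model depends on which layers receive adapters and on the chosen rank.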
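LoRA is the dominant choice in 2026. Workflow: prepare a JSONL of input/output pairs → train a LoRA adapter (Hugging Face PEFT with Transformers, Unsloth, Axolotl) or submit the data to a hosted fine-tuning service such as OpenAI's fine-tuning API, which manages the training method for you → deploy the base model + adapter (or the hosted fine-tuned model).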
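The first step of that workflow, preparing the JSONL file, might look like the sketch below. The `input` and `output` field names are an assumption for illustration; each trainer or hosted API defines its own schema (chat-style message lists are common), so check the format your tool expects.

```python
import json


def write_jsonl(pairs, path):
    """Write (input, output) pairs as one JSON object per line; skip empty rows."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in pairs:
            if not prompt.strip() or not completion.strip():
                continue  # drop incomplete examples rather than train on them
            # Field names are hypothetical; match them to your trainer's schema.
            record = {"input": prompt, "output": completion}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


def read_jsonl(path):
    """Load the file back, one dict per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```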
How much data do you need?
| Goal | Examples needed |
|---|---|
| Style/format adjustment | 50-500 |
| Domain classification (5-10 classes) | 500-2,000 |
| Extraction or transformation | 1,000-10,000 |
| Major behavioural change | 10,000+ |
Quality matters more than quantity. 500 carefully labelled examples often beat 5,000 noisy ones.
Distillation
A specialised fine-tune: use a large frontier model (GPT-5, Claude Opus) to label examples, then train a small model (GPT-4o-mini, Llama 3B) to imitate the labels. The result: a model 10-100× cheaper that performs nearly as well on your specific task.
This is the dominant pattern for putting LLMs in cost-sensitive production paths: high-volume classification, routing, simple extraction. Frontier models are too expensive to call millions of times daily; a distilled small model is not.
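The labelling half of distillation can be sketched as follows. The `teacher` argument stands in for a call to a frontier model; `fake_teacher` is a hypothetical keyword rule used only so the example runs without network access, and the label set is invented for illustration.

```python
def build_distillation_set(inputs, teacher, allowed_labels):
    """Label raw inputs with the teacher and keep only well-formed labels."""
    rows = []
    for text in inputs:
        label = teacher(text).strip().lower()
        if label in allowed_labels:  # discard anything outside the label set
            rows.append({"input": text, "output": label})
    return rows


def fake_teacher(text):
    """Hypothetical stand-in for a frontier-model API call."""
    return "billing" if "invoice" in text.lower() else "technical"


rows = build_distillation_set(
    ["Invoice is wrong", "App crashes on login"],
    fake_teacher,
    allowed_labels={"billing", "technical"},
)
```

The resulting rows become the fine-tuning set for the small model. Filtering malformed teacher outputs at this stage is one cheap way to act on the earlier point that quality matters more than quantity.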
The Decision Tree
Need to customise an LLM?
├── Is it about behaviour/style/format that prompts can't pin down?
│   └── YES → Consider fine-tuning
├── Is it about giving the model new knowledge?
│   ├── Knowledge changes often or is large?
│   │   └── YES → RAG
│   └── Knowledge is small and stable?
│       └── YES → Put it in the system prompt
└── Is the issue cost/latency?
    └── YES → Distil a frontier model into a small one
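The same tree can be written as a small function, which makes it easy to see that the branches are not mutually exclusive. The `Need` fields and the fallback to prompt engineering when no branch fires are assumptions of this sketch, not part of the tree itself.

```python
from dataclasses import dataclass


@dataclass
class Need:
    """Answers to the tree's questions (field names are hypothetical)."""
    behaviour_gap_prompts_cannot_fix: bool = False
    needs_new_knowledge: bool = False
    knowledge_large_or_changing: bool = False
    cost_or_latency_problem: bool = False


def recommend(need: Need) -> list[str]:
    """Walk every top-level branch; more than one recommendation may apply."""
    recs = []
    if need.behaviour_gap_prompts_cannot_fix:
        recs.append("consider fine-tuning")
    if need.needs_new_knowledge:
        if need.knowledge_large_or_changing:
            recs.append("RAG")
        else:
            recs.append("put it in the system prompt")
    if need.cost_or_latency_problem:
        recs.append("distil a frontier model into a small one")
    # Assumed fallback: with no branch triggered, stay with prompt engineering.
    return recs or ["prompt engineering"]
```

For example, `recommend(Need(needs_new_knowledge=True, knowledge_large_or_changing=True, cost_or_latency_problem=True))` returns both the RAG and the distillation recommendations.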
They Stack
Production GenAI systems rarely use just one approach:
- A fine-tuned small model handles routing/classification cheaply
- RAG injects fresh corpus knowledge into the prompt
- Carefully engineered prompts steer the model's response style and constraints
- Tool use lets the model fetch the data RAG missed
Knowing how to mix these is the real skill. There is no single right answer — only the right combination for the problem and the budget.
The Cost Math
| Approach | One-time cost | Per-query cost |
|---|---|---|
| Prompt engineering | Engineer time (days) | Standard model cost |
| RAG | Pipeline build (weeks); embedding storage | Standard model cost + small embedding cost; retrieval adds milliseconds of latency |
| Fine-tuning (LoRA) | $10-$1,000 training; data labelling | Standard model cost (often on a cheaper model) |
| Full fine-tuning | $5K-$100K+ | Often self-hosting costs (custom model) |
| Distillation | Frontier API for labels + fine-tune cost | Small-model cost — usually 10-100× cheaper at runtime |
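One way to use this table is a break-even calculation: how many queries it takes for a one-time spend to pay for itself through lower per-query cost. The figures in the usage line are illustrative placeholders, not measured prices.

```python
import math


def break_even_queries(one_time_cost, baseline_cost_per_query, new_cost_per_query):
    """Queries needed before per-query savings repay the one-time cost.

    Returns None when the new approach is not cheaper per query.
    """
    saving = baseline_cost_per_query - new_cost_per_query
    if saving <= 0:
        return None
    return math.ceil(one_time_cost / saving)


# Hypothetical numbers: $2,000 to distil, $0.01 vs $0.0005 per query (20x cheaper).
print(break_even_queries(2000, 0.01, 0.0005))  # 210527
```

At a few million queries a day that break-even arrives within hours; at a few hundred a day it takes years, which is why volume decides whether distillation is worth it.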
A Pragmatic Path
1. Build with prompts only on the best frontier model
2. Measure failure modes carefully
3. If the gaps are missing knowledge → add RAG
4. If the gaps are behavioural → try few-shot examples, then chain-of-thought (CoT) prompting, then a reasoning model
5. If the cost is too high → distil into a cheaper model
6. If quality is still capped → fine-tune for the specific behaviour gap
This path keeps you out of expensive early commitments and lets evidence guide each decision.
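"Letting evidence guide each decision" can be as simple as tagging each failed evaluation case and acting on the most common tag. The tag names and the mapping below are hypothetical, chosen to mirror the path above; a real evaluation set would need its own taxonomy.

```python
from collections import Counter

# Hypothetical failure tags mapped to the next step on the path.
NEXT_STEP = {
    "missing_knowledge": "add RAG",
    "behavioural": "try few-shot, then chain-of-thought, then a reasoning model",
    "too_expensive": "distil into a cheaper model",
}


def next_step(failure_tags):
    """Pick the next step from the most frequent failure category."""
    counts = Counter(failure_tags)
    if not counts:
        return "no failures recorded: keep the prompt-only baseline"
    top_tag, _ = counts.most_common(1)[0]
    return NEXT_STEP.get(top_tag, "investigate: unrecognised failure category")
```

Calling `next_step(["missing_knowledge", "behavioural", "missing_knowledge"])` returns `"add RAG"`, since missing knowledge is the dominant failure in that sample.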