You've prompted, retrieved, fine-tuned, and evaluated. Now the application has to run reliably, cheaply, and safely for thousands of users. This lesson covers the production architecture that holds it all together.
Reference Architecture
   [User]
      │
      ▼
[App backend] ──▶ [Prompt builder] ──▶ [Cache check] ──▶ [hit: return]
                                             │
                                             ▼ (miss)
                                     [Retriever (RAG)]
                                             │
                                             ▼
                                      [Model router]
                                             │
                                             ▼
                              [LLM provider call (streaming)]
                                             │
                                             ▼
                                    [Output validator]
                                             │
                                             ▼
                                       [Cache write]
                                             │
                                             ▼
                                      [Log + metrics]
                                             │
                                             ▼
                                          [User] (streamed response)
Model Routing
Different queries deserve different models. A cheap, fast model can often handle the bulk of traffic (70-90% in many workloads); route to a frontier model only when the query needs it.
| Query type | Best fit |
|---|---|
| Classification, simple Q&A | Small/fast (Claude Haiku, GPT-4o-mini, Gemini Flash) |
| Code generation | Strong coding models (Claude Sonnet, GPT-5-Codex) |
| Long-document reasoning | Frontier or long-context variants |
| Complex multi-step logic | Reasoning model (o3, Claude Sonnet 4 with extended thinking) |
| Multimodal (image, audio) | Native multimodal (GPT-5, Gemini 2.x, Claude Opus 4); check per-modality support, since not every model accepts audio |
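Implement routing as code: a small classifier (often itself a small LLM) tags the query, and the router dispatches to the matching model. Because small models are often several times cheaper per token than frontier ones, this can cut cost by 5-10×; confirm on your evaluation set that quality holds for the routed traffic.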
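A minimal sketch of routing as code follows. The tier names, model IDs, and keyword heuristics are all hypothetical placeholders; in practice the classifier would more likely be a small LLM call or a trained model, as described above.

```javascript
// Hypothetical route table: tier names and model IDs are placeholders.
const ROUTES = {
  simple: "small-fast-model",
  code: "code-model",
  long_context: "long-context-model",
  reasoning: "reasoning-model",
};

// Stand-in classifier using cheap heuristics. In production this would
// typically be a small LLM call or a trained classifier.
function classifyQuery(query) {
  if (query.length > 20000) return "long_context";
  if (/\b(function|class|stack trace|compile|refactor)\b/i.test(query)) return "code";
  if (/\b(prove|step by step|plan|trade-?offs?)\b/i.test(query)) return "reasoning";
  return "simple";
}

// Dispatch: unknown tags fall back to the cheap default model.
function routeModel(query, classify = classifyQuery) {
  const tag = classify(query);
  return { tag, model: ROUTES[tag] ?? ROUTES.simple };
}

console.log(routeModel("What are your opening hours?"));
console.log(routeModel("Refactor this function to be async"));
```

Keeping the route table as plain data makes the per-feature mapping explicit and easy to review when models change.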
Caching
Exact-match cache
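Hash the full prompt (together with the model name and sampling parameters); if the hash has been seen, return the cached response. Redis or Memcached works well as the store. The hit rate falls to near zero when every prompt embeds user-specific data, so keep per-user details out of the cached portion where you can.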
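The idea can be sketched as follows, assuming an in-memory Map as a stand-in for Redis or Memcached. The helper names (cacheKey, getCached, putCached) are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// In-memory stand-in for Redis or Memcached; swap in a real client in production.
const store = new Map();

// Key on everything that affects the output: model, parameters, and the full prompt.
// Note: JSON.stringify is sensitive to key order, so build params consistently.
function cacheKey(model, prompt, params = {}) {
  const payload = JSON.stringify({ model, prompt, params });
  return createHash("sha256").update(payload).digest("hex");
}

function getCached(model, prompt, params) {
  return store.get(cacheKey(model, prompt, params)) ?? null;
}

function putCached(model, prompt, params, response) {
  store.set(cacheKey(model, prompt, params), response);
}

putCached("small-fast-model", "What is your refund policy?", { temperature: 0 }, "30 days.");
// Same bytes: hit.
console.log(getCached("small-fast-model", "What is your refund policy?", { temperature: 0 }));
// Different casing and punctuation: miss, which is the limitation of exact matching.
console.log(getCached("small-fast-model", "what is your refund policy", { temperature: 0 }));
```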
Semantic cache
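Embed the query and compare it against the embeddings of cached queries; if the closest one is within a similarity threshold, return its response. This is powerful for FAQ-style traffic, where different users ask the same thing in different words. Tune the threshold on real traffic: set it too loose and users get answers to questions they did not ask. Existing options include GPTCache and Redis-based semantic caches (for example, RedisVL's SemanticCache).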
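The lookup logic can be sketched without any external service. The embed function below is a toy word-count vector over a tiny fixed vocabulary, used purely so the example runs; a real system would call an embedding model there. The SemanticCache class here is a hypothetical illustration, not the API of any library named above.

```javascript
// Toy embedding: word counts over a tiny fixed vocabulary. A stand-in only;
// a real system would call an embedding model here.
const VOCAB = ["reset", "password", "forgot", "refund", "order", "cancel", "login"];
function embed(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) ?? [];
  return VOCAB.map((term) => words.filter((w) => w === term).length);
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class SemanticCache {
  constructor(threshold = 0.8) {
    this.threshold = threshold;
    this.entries = []; // linear scan; use a vector index at scale
  }
  set(query, response) {
    this.entries.push({ vector: embed(query), response });
  }
  // Returns the response of the most similar cached query above the threshold, else null.
  get(query) {
    const v = embed(query);
    let best = null;
    for (const e of this.entries) {
      const score = cosine(v, e.vector);
      if (score >= this.threshold && (best === null || score > best.score)) {
        best = { score, response: e.response };
      }
    }
    return best ? best.response : null;
  }
}

const cache = new SemanticCache(0.8);
cache.set("How do I reset my password?", "Use the reset link on the login page.");
console.log(cache.get("I forgot my password, how can I reset it?")); // similar wording: hit
console.log(cache.get("Can I cancel my order?")); // unrelated: miss
```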
Prompt-prefix cache (provider-side)
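Most frontier providers offer prompt caching for a large, stable prefix (system prompt, tool definitions, retrieved docs). Some require you to mark the prefix explicitly (Anthropic's cache_control), while others cache automatically once the prompt is long enough (OpenAI). Requests that reuse the same prefix are billed at a discount on the cached input tokens, typically 50-90% depending on the provider, and see lower latency. Any change inside the prefix invalidates the cache, so put stable content first and per-request content last. Use this aggressively in RAG and tool-heavy workloads.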
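As an illustration of explicit marking, the sketch below builds a request body in the shape of Anthropic's Messages API, with a cache_control marker on the last block of the stable prefix. The helper name, its parameters, and the model ID passed in are hypothetical; check the provider's documentation for minimum cacheable lengths and current pricing before relying on it.

```javascript
// Builds a Messages API request body with the stable prefix marked cacheable.
// buildCachedRequest is a hypothetical helper; it only constructs the payload
// and does not call any API.
function buildCachedRequest({ model, systemPrompt, referenceDocs, userQuestion }) {
  return {
    model,
    max_tokens: 512,
    system: [
      { type: "text", text: systemPrompt },
      {
        type: "text",
        text: referenceDocs,
        // Everything up to and including this block forms the cached prefix.
        cache_control: { type: "ephemeral" },
      },
    ],
    // Per-request content goes after the prefix so it never invalidates the cache.
    messages: [{ role: "user", content: userQuestion }],
  };
}

const body = buildCachedRequest({
  model: "placeholder-model-id",
  systemPrompt: "You are a support assistant.",
  referenceDocs: "(large, rarely changing product documentation goes here)",
  userQuestion: "How do I export my data?",
});
console.log(JSON.stringify(body, null, 2));
```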
Streaming
For any chat or generation UI, stream the response. Time-to-first-token is typically 200-1000ms; without streaming, the user waits the full generation time (5-30s for long outputs). Server-Sent Events (SSE) is the standard transport.
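// Example: relay an Anthropic / OpenAI SDK stream to the browser over SSE.
// Assumes `stream` is the SDK's async iterable and `res` is a Node.js HTTP response.
res.writeHead(200, { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" });
for await (const chunk of stream) {
  // Each SSE event is a "data:" line followed by a blank line.
  res.write(`data: ${JSON.stringify(chunk)}\n\n`);
}
res.end();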
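Plan for connections that drop mid-stream, which providers occasionally do. Provider APIs generally cannot resume a stream from where it stopped, so recovery means re-issuing the request: retry transparently if no tokens have reached the user yet, and otherwise restart the response or offer a retry rather than splicing two different generations together.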
Retries and Fallbacks
| Error | Strategy |
|---|---|
| 429 rate limit | Exponential backoff; respect Retry-After header |
| 5xx server error | Retry 2-3× with backoff |
| Timeout (no response) | Retry with shorter max_tokens |
| Context length exceeded | Truncate or summarise context, then retry |
| Provider down | Failover to a different provider with comparable model |
Multi-provider failover (OpenAI primary with Anthropic as backup, or a gateway such as LiteLLM, OpenRouter, or AWS Bedrock) is the standard resilience pattern. Test failover regularly: providers do have multi-hour outages, and a backup path that has never been exercised tends to fail when you need it.
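The retry and failover strategies in the table can be combined in a small wrapper. This is a sketch under stated assumptions: errors are assumed to carry a numeric status and an optional retryAfterSeconds field, and each provider is assumed to expose a complete(request) method. Real SDKs shape their errors and clients differently.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retryable: rate limits and server errors. Other 4xx errors are request bugs.
// Assumes errors expose a numeric `status` (hypothetical shape).
function isRetryable(err) {
  return err.status === 429 || (err.status >= 500 && err.status < 600);
}

// Retry one provider with exponential backoff plus jitter, honouring Retry-After.
async function withRetry(call, { retries = 3, baseMs = 200 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRetryable(err) || attempt >= retries) throw err;
      const backoff = baseMs * 2 ** attempt + Math.random() * baseMs;
      const retryAfterMs = (err.retryAfterSeconds ?? 0) * 1000;
      await sleep(Math.max(backoff, retryAfterMs));
    }
  }
}

// Try providers in order; move on only after retries on the current one are exhausted.
async function withFailover(providers, request, retryOpts) {
  let lastError = new Error("no providers configured");
  for (const provider of providers) {
    try {
      return await withRetry(() => provider.complete(request), retryOpts);
    } catch (err) {
      // A bad request will fail everywhere, so do not fail over on it.
      if (!isRetryable(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

A call such as `withFailover([primary, backup], request, { retries: 2 })` would try the primary three times in total before moving to the backup. Timeouts and context-length errors need their own handling, as the table notes.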
Cost Control
LLM bills can explode silently. Controls every production system needs:
- Per-user / per-org budget caps — hard stop at $X/month
- Per-request token cap — set max_tokens conservatively
- Aggregated cost dashboard — daily by feature/model
- Anomaly alerts — Slack on 2× day-over-day spend
- Cheap model defaults — make expensive opt-in
- Cache hit rate metric — target 30%+ on chatty FAQ workloads
- Token-counting before send (reject oversized prompts in your own code, before the provider call is made and billed)
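Two of the controls above, the per-user budget cap and the pre-send token check, can be sketched together. The BudgetGuard class is hypothetical, and the four-characters-per-token estimate is only a rough heuristic for English text; use the provider's tokenizer or token-counting endpoint for real enforcement.

```javascript
// Rough heuristic (about 4 characters per token for English text).
// An assumption for illustration; use a real tokenizer in production.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

class BudgetGuard {
  constructor({ monthlyCapUsd, maxPromptTokens, usdPerInputToken }) {
    this.monthlyCapUsd = monthlyCapUsd;
    this.maxPromptTokens = maxPromptTokens;
    this.usdPerInputToken = usdPerInputToken;
    // In-memory for the sketch; persist and reset monthly in a real system.
    this.spentByUser = new Map();
  }

  // Call before sending. Returns { allowed, reason }.
  check(userId, prompt) {
    const tokens = estimateTokens(prompt);
    if (tokens > this.maxPromptTokens) {
      return { allowed: false, reason: "prompt_too_large" };
    }
    const spent = this.spentByUser.get(userId) ?? 0;
    if (spent + tokens * this.usdPerInputToken > this.monthlyCapUsd) {
      return { allowed: false, reason: "budget_exceeded" };
    }
    return { allowed: true, reason: null };
  }

  // Call after the response with the actual billed cost.
  record(userId, costUsd) {
    this.spentByUser.set(userId, (this.spentByUser.get(userId) ?? 0) + costUsd);
  }
}

const guard = new BudgetGuard({ monthlyCapUsd: 1.0, maxPromptTokens: 10, usdPerInputToken: 0.000003 });
console.log(guard.check("u1", "hello"));         // within both limits
console.log(guard.check("u1", "x".repeat(100))); // too many tokens
```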
Observability
Trace every LLM call. For each call, log a record like this:
{
  "trace_id": "...",
  "user_id": "...",
  "feature": "support-bot",
  "model": "claude-opus-4-20250514",
  "prompt_tokens": 1247,
  "completion_tokens": 318,
  "total_tokens": 1565,
  "latency_ms": 2840,
  "ttft_ms": 410,
  "cost_usd": 0.0231,
  "cache_hit": false,
  "tool_calls": ["search_docs", "create_ticket"],
  "user_feedback": "thumbs_up"
}
This single record powers cost reports, quality dashboards, latency SLO tracking, and debugging.
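A record like the one above can be assembled at the end of each call. In this sketch the price table, the model name, and the shape of the call object are hypothetical; real per-token prices vary by provider and change over time, so they should come from configuration.

```javascript
// Hypothetical price table in USD per million tokens.
const PRICES_PER_MTOK = {
  "example-model": { input: 3, output: 15 },
};

function costUsd(model, promptTokens, completionTokens) {
  const price = PRICES_PER_MTOK[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);
  return (promptTokens * price.input + completionTokens * price.output) / 1e6;
}

// Timestamps are epoch milliseconds captured by the caller.
function buildTraceRecord(call) {
  const { promptTokens, completionTokens } = call.usage;
  return {
    trace_id: call.traceId,
    user_id: call.userId,
    feature: call.feature,
    model: call.model,
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    total_tokens: promptTokens + completionTokens,
    latency_ms: call.finishedAt - call.startedAt,
    ttft_ms: call.firstTokenAt - call.startedAt,
    cost_usd: Number(costUsd(call.model, promptTokens, completionTokens).toFixed(6)),
    cache_hit: call.cacheHit,
  };
}

const record = buildTraceRecord({
  traceId: "t-1",
  userId: "u-1",
  feature: "support-bot",
  model: "example-model",
  usage: { promptTokens: 1247, completionTokens: 318 },
  startedAt: 0,
  firstTokenAt: 410,
  finishedAt: 2840,
  cacheHit: false,
});
console.log(JSON.stringify(record));
```

Computing cost at log time, rather than reconstructing it later from invoices, is what makes per-feature and per-user cost reports cheap to build.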
Feature Flags and Kill Switches
Wrap every GenAI feature in a feature flag. When (not if) a model regresses, a provider has an outage, or a customer complains about output, you can disable instantly without a deploy.
Build a graceful degradation path: if GenAI is off, fall back to keyword search, a rule-based response, or "this feature is temporarily unavailable" — never an opaque 500.
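The degradation path can be sketched as a single function. The flag name, the in-memory flag store, and the injected generate and keywordSearch functions are all hypothetical; a real system would read flags from a flag service or config store so they can be flipped without a deploy.

```javascript
// In-memory flag store as a stand-in for a real feature-flag service.
const flags = new Map([["support-bot.genai", true]]);

// Answers a query, degrading step by step instead of surfacing an opaque 500.
// `generate` and `keywordSearch` are injected async functions (hypothetical).
async function answer(query, { generate, keywordSearch }) {
  if (flags.get("support-bot.genai")) {
    try {
      return { source: "genai", text: await generate(query) };
    } catch (err) {
      // Log in a real system, then fall through to the degraded path.
    }
  }
  const hits = await keywordSearch(query);
  if (hits.length > 0) return { source: "keyword", text: hits[0] };
  return { source: "unavailable", text: "This feature is temporarily unavailable." };
}
```

Setting `flags.set("support-bot.genai", false)` acts as the kill switch: every request then takes the keyword path without touching the model provider.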
Latency Budget
Set explicit budgets per user journey:
| Pattern | Target |
|---|---|
| Interactive chat (with streaming) | TTFT < 1s |
| Background classification | P95 < 5s |
| Heavy reasoning task | P95 < 30s; show progress UI |
| Async / batch generation | Minutes acceptable |
If you can't meet the budget, restructure: use a smaller model, parallelise retrieval, cache aggressively, or move to async with notification.
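Checking a budget requires computing the percentile from recorded latencies. The sketch below uses the nearest-rank method; the budget values mirror two rows of the table above, while the pattern keys and function names are hypothetical.

```javascript
// Nearest-rank percentile over recorded latencies (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point surprises in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// P95 budgets in milliseconds, taken from the table above.
const BUDGETS_MS = {
  background_classification: 5000,
  heavy_reasoning: 30000,
};

function checkBudget(pattern, latenciesMs) {
  const p95 = percentile(latenciesMs, 95);
  return { pattern, p95, withinBudget: p95 < BUDGETS_MS[pattern] };
}

// Twenty samples from 100 ms to 2000 ms.
const samples = Array.from({ length: 20 }, (_, i) => (i + 1) * 100);
console.log(checkBudget("background_classification", samples));
```

In practice the samples would come from the latency_ms field of the per-call trace records, grouped by feature.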
Versioning Prompts
- Store prompts in version-controlled files or a prompt registry; never leave them as inline string literals scattered through scripts
- Version every change; tag deployments with the prompt version
- Roll out new prompts behind a feature flag with A/B testing — measure quality before full rollout
- Snapshot the (model, prompt) pair that produced each output for reproducibility
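A minimal in-memory registry illustrates the versioning and snapshot ideas. The PromptRegistry class and snapshot helper are hypothetical; a real registry would persist versions in a database or in version control.

```javascript
const { createHash } = require("node:crypto");

class PromptRegistry {
  constructor() {
    this.versions = new Map(); // prompt name -> array of versions, oldest first
  }

  // Registers a template; an unchanged template does not create a new version.
  register(name, template) {
    const list = this.versions.get(name) ?? [];
    const hash = createHash("sha256").update(template).digest("hex").slice(0, 12);
    const latest = list[list.length - 1];
    if (latest && latest.hash === hash) return latest;
    const entry = { name, version: list.length + 1, hash, template };
    this.versions.set(name, [...list, entry]);
    return entry;
  }

  // Returns a specific version, or the latest when no version is given.
  get(name, version) {
    const list = this.versions.get(name) ?? [];
    const entry = version ? list[version - 1] : list[list.length - 1];
    if (!entry) throw new Error(`Unknown prompt ${name}@${version ?? "latest"}`);
    return entry;
  }
}

// The (model, prompt) pair to store alongside each output for reproducibility.
function snapshot(model, promptEntry) {
  return {
    model,
    prompt_name: promptEntry.name,
    prompt_version: promptEntry.version,
    prompt_hash: promptEntry.hash,
  };
}

const registry = new PromptRegistry();
registry.register("support-greeting", "You are a helpful support agent.");
registry.register("support-greeting", "You are a concise, helpful support agent.");
console.log(snapshot("example-model", registry.get("support-greeting")));
```

Attaching the snapshot to each trace record lets you tie a quality regression back to the exact prompt version that produced it.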
The Production Checklist
- Multi-model router with explicit per-feature mapping
- Exact + semantic cache; prompt-prefix caching enabled
- Streaming for any user-facing chat
- Retry with exponential backoff; cross-provider failover for critical paths
- Per-user / per-org cost caps and alerts
- Full request/response tracing with token + cost metrics
- Golden-set evaluation in CI before deploys
- Online quality sampling + LLM-as-a-judge
- Safety filters on both input (pre-processing) and output (post-processing)
- Feature flags with graceful fallback
- Documented incident runbook for: model regression, provider outage, cost spike
The Path From Here
You now have a complete production playbook: the landscape, the mechanics, prompting, advanced reasoning, RAG, customisation, evaluation/safety, and architecture. Next steps to deepen:
- Take the AWS Certified AI Practitioner (AIF-C01) or Azure AI Engineer (AI-102) exam — both formalise this knowledge in a cloud-specific context
- Build a small RAG app over your own docs — there is no substitute for the end-to-end experience
- Read the OpenAI Cookbook, Anthropic's prompt engineering interactive tutorial, and the Hugging Face documentation
- Follow the eval / RAGAS / promptfoo communities — evaluation is where the cutting edge of the practitioner field lives
GenAI is a moving target — the principles in this course are durable, but the specific models and tools will keep changing. The skill you've built — to reason about what to use when and how to measure it — transfers across every revision the field will bring.