Building Production GenAI Applications

You've prompted, retrieved, fine-tuned, and evaluated. Now the application has to run reliably, cheaply, and safely for thousands of users. This lesson covers the production architecture that holds it all together.

Reference Architecture

[User]
   │
   ▼
[App backend] ──▶ [Prompt builder] ──▶ [Cache check] ──▶ [hit: return]
                                              │
                                              ▼ (miss)
                                       [Retriever (RAG)]
                                              │
                                              ▼
                                       [Model router]
                                              │
                                              ▼
                              [LLM provider call (streaming)]
                                              │
                                              ▼
                                       [Output validator]
                                              │
                                              ▼
                                       [Cache write]
                                              │
                                              ▼
                                       [Log + metrics]
                                              │
                                              ▼
[User] (streamed response)

Model Routing

Different queries deserve different models. A cheap, fast model can typically handle 70-90% of traffic; route to a frontier model only when the query needs it.

Query type	Best fit
Classification, simple Q&A	Small/fast (Haiku, GPT-4o-mini, Gemini Flash)
Code generation	Code-tuned or strong general-purpose models (Claude Sonnet, GPT-5-codex)
Long-document reasoning	Frontier or long-context variants
Complex multi-step logic	Reasoning model (o3, Claude Sonnet 4 with extended thinking)
Multimodal (image, audio)	Native multimodal (GPT-5, Gemini 2.x, Claude Opus 4); check that the model accepts the modality you need

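Implement routing as code: a small classifier (often itself a small LLM) tags the query, and the router dispatches it to the right model. Depending on the traffic mix, this can cut cost by 5-10× with little loss in quality, provided you verify the routed output against your evaluation set.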
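The tag-then-dispatch idea can be sketched as follows. This is a minimal illustration, not a recommended classifier: the model names in ROUTES are placeholders, and classifyQuery uses keyword rules only so the example runs offline, where a real system would more likely call a small LLM.

```javascript
// Hypothetical model tiers; substitute the model IDs your providers expose.
const ROUTES = {
  simple: "small-fast-model",
  code: "code-model",
  long_document: "long-context-model",
  reasoning: "reasoning-model",
  multimodal: "multimodal-model",
};

// Stand-in classifier. In production this is often a small LLM call;
// keyword rules are used here only to keep the sketch self-contained.
function classifyQuery(query) {
  if (query.attachments && query.attachments.length > 0) return "multimodal";
  const text = query.text.toLowerCase();
  if (text.length > 20000) return "long_document";
  if (/\b(function|stack trace|refactor|compile|bug)\b/.test(text)) return "code";
  if (/\b(step by step|prove|plan|trade-?offs?)\b/.test(text)) return "reasoning";
  return "simple";
}

// The router only maps a tag to a model, so the mapping stays explicit and testable.
function routeQuery(query) {
  const tag = classifyQuery(query);
  return { tag, model: ROUTES[tag] };
}
```

Keeping the classifier and the tag-to-model table separate means either can change without touching the other, and the table doubles as the per-feature mapping you review when prices or models change.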

Caching

Exact-match cache

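Hash the full prompt together with the model name and sampling parameters; if the hash has been seen before, return the cached response. Use Redis or Memcached as the store. This saves nothing if every prompt contains user-specific data, because no two requests will produce the same key.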
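A minimal sketch of the key derivation, assuming Node.js and using an in-memory Map as a stand-in for Redis or Memcached. The function names are illustrative, and including the model and parameters in the key is a design choice made here so that a change to either never returns a stale answer.

```javascript
const { createHash } = require("node:crypto");

// In-memory Map as a stand-in for Redis or Memcached.
const cache = new Map();

// Key on everything that changes the output: model, parameters, and the full prompt.
// JSON.stringify is sensitive to property order, so build these objects consistently.
function cacheKey(model, params, messages) {
  const payload = JSON.stringify({ model, params, messages });
  return createHash("sha256").update(payload).digest("hex");
}

function getCached(model, params, messages) {
  return cache.get(cacheKey(model, params, messages));
}

function putCached(model, params, messages, response) {
  cache.set(cacheKey(model, params, messages), response);
}
```

In a shared store you would also set a time-to-live on each entry so that answers expire when the underlying data or prompt changes.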

Semantic cache

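Embed the query and check whether any cached query's embedding is within a similarity threshold; if so, return that query's response. This is powerful for FAQ-style traffic, where different users ask the same thing in different words. Set the threshold carefully: too loose, and users receive answers to questions they did not ask. Common options include GPTCache and Redis-based semantic caches (for example RedisVL's SemanticCache).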
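The lookup can be sketched as a nearest-neighbour search with a cutoff. This is an illustration only: embeddings are passed in as plain arrays (producing them is left to whatever embedding model you use), the 0.9 default threshold is an arbitrary assumption, and a linear scan stands in for a vector index.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  constructor(threshold = 0.9) {
    this.threshold = threshold; // assumed value; tune against real traffic
    this.entries = []; // each entry: { embedding, response }
  }

  add(embedding, response) {
    this.entries.push({ embedding, response });
  }

  // Return the response of the most similar cached query, or null on a miss.
  lookup(embedding) {
    let best = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }
}
```

Logging the similarity score of every hit makes it possible to audit false positives later and adjust the threshold with evidence.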

Prompt-prefix cache (provider-side)

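Frontier providers support prompt caching: a large shared prefix (system prompt, tool definitions, retrieved docs) is cached on the provider side, either automatically or after you mark it as cacheable, depending on the provider. Subsequent requests that start with the same prefix get a 50-90% discount on the cached input tokens and lower latency. Put stable content first and per-request content last, and use this aggressively in RAG and tool-heavy workloads.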
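As one illustration of marking a prefix, the sketch below builds a request body in the shape of the Anthropic Messages API, where a cache_control field on a content block marks the end of the cacheable prefix. Treat the details as assumptions to check against the current provider documentation; the function name, the max_tokens value, and the model argument are placeholders, and other providers cache automatically or use different mechanisms.

```javascript
// Builds a request body only; sending it is left to the provider SDK or HTTP client.
function buildCachedRequest(model, systemPrompt, retrievedDocs, userQuestion) {
  return {
    model,
    max_tokens: 1024, // placeholder cap
    system: [
      { type: "text", text: systemPrompt },
      {
        type: "text",
        text: retrievedDocs,
        // Marks the end of the prefix to cache: everything up to and
        // including this block is eligible for reuse on later requests.
        cache_control: { type: "ephemeral" },
      },
    ],
    // Per-request content goes after the cached prefix.
    messages: [{ role: "user", content: userQuestion }],
  };
}
```

The ordering matters more than the exact field names: anything that varies per request must come after the marked prefix, or the prefix will not match on the next call.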

Streaming

For any chat or generation UI, stream the response. Time-to-first-token is typically 200-1000ms; without streaming, the user waits the full generation time (5-30s for long outputs). Server-Sent Events (SSE) is the standard transport.

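// Example: relaying a stream from the Anthropic / OpenAI SDK to the browser as SSE.
// Assumes `stream` is the SDK's async iterable, `res` is a Node.js HTTP response,
// and this runs inside an async request handler.
res.writeHead(200, { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" });
for await (const chunk of stream) {
  res.write(`data: ${JSON.stringify(chunk)}\n\n`);
}
res.end();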

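Providers occasionally drop connections mid-stream, so build retry logic for transient errors. Most APIs cannot resume a dropped stream in place; the usual options are to restart the request or to send the partial output back and ask the model to continue from it.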
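One way to handle a dropped stream is to keep the text received so far and ask for a continuation. The sketch below assumes a hypothetical startStream(partial) function that opens a provider stream of text chunks and, when partial is non-empty, prompts the model to continue from that text; how you phrase that continuation is provider-specific and not shown.

```javascript
// Wraps a text stream so that a mid-stream failure triggers a new request
// that continues from the text already delivered to the caller.
// `startStream` is hypothetical: (partialText) => async iterable of strings.
async function* streamWithResume(startStream, maxRetries = 2) {
  let partial = "";
  let attempt = 0;
  while (true) {
    try {
      for await (const text of await startStream(partial)) {
        partial += text; // remember what the user has already seen
        yield text;
      }
      return; // stream completed normally
    } catch (err) {
      attempt += 1;
      if (attempt > maxRetries) throw err; // give up after repeated failures
    }
  }
}
```

A production version would add a backoff delay between attempts and retry only on errors known to be transient; both are omitted here for brevity.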

Retries and Fallbacks

Error	Strategy
429 rate limit	Exponential backoff with jitter; respect the Retry-After header
5xx server error	Retry 2-3× with backoff
Timeout (no response)	Retry with a shorter max_tokens
Context length exceeded	Truncate or summarise context, then retry
Provider down	Failover to a different provider with comparable model

Multi-provider failover (for example OpenAI as primary and Anthropic as backup, either directly or through a gateway such as LiteLLM, OpenRouter, or AWS Bedrock) is the standard resilience pattern. Test failover regularly, because providers do have multi-hour outages.
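The retry and failover rows above can be combined into two small helpers. This is a sketch under stated assumptions: errors are expected to carry a status property (the HTTP code) and optionally retryAfterSeconds (parsed from the Retry-After header), and each provider object exposes a complete(request) method. All of these names are hypothetical, not part of any SDK.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Rate limits and server errors are worth retrying; other HTTP errors are not.
function isRetryable(err) {
  return err.status === 429 || (err.status >= 500 && err.status < 600);
}

async function withRetry(call, { maxRetries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxRetries) throw err;
      const backoff = baseDelayMs * 2 ** attempt; // exponential backoff
      const delay =
        err.retryAfterSeconds != null
          ? err.retryAfterSeconds * 1000 // the server's hint wins
          : backoff + Math.random() * backoff; // add jitter
      await sleep(delay);
    }
  }
}

// Try each provider in order; move on only for retryable or network-level failures.
async function withFailover(providers, request, retryOptions) {
  let lastError;
  for (const provider of providers) {
    try {
      return await withRetry(() => provider.complete(request), retryOptions);
    } catch (err) {
      // A 4xx other than 429 is a problem with the request itself, so do not fail over.
      if (err.status !== undefined && !isRetryable(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

Failing over on a bad request would only repeat the same error against the backup, which is why non-retryable HTTP errors are rethrown immediately.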

Cost Control

LLM bills can explode silently. Controls every production system needs:

Per-user / per-org budget caps: hard stop at $X/month
Per-request token cap: set max_tokens conservatively
Aggregated cost dashboard: daily spend broken down by feature and model
Anomaly alerts: notify the team (for example in Slack) when spend doubles day over day
Cheap model defaults: make expensive models opt-in
Cache hit rate metric: target 30%+ on chatty FAQ workloads
Token counting before send: reject oversized prompts in your own code before they reach the provider
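Two of the controls above, the budget cap and the pre-send token check, can be sketched as a single gate in front of the provider call. The limits are hypothetical, the in-memory Map stands in for a real usage store, and the four-characters-per-token estimate is a rough heuristic assumed here; use the provider's tokenizer when accuracy matters.

```javascript
// Hypothetical limits; set these per plan or per organisation.
const LIMITS = { maxPromptTokens: 8000, monthlyBudgetUsd: 50 };

// Rough heuristic (about 4 characters per token for English text).
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Month-to-date spend per user; a real system would persist this.
const spendByUser = new Map();

// Decide whether a request may be sent, before any tokens are billed.
function checkRequest(userId, prompt, estimatedCostUsd) {
  if (estimateTokens(prompt) > LIMITS.maxPromptTokens) {
    return { allowed: false, reason: "prompt_too_large" };
  }
  const spent = spendByUser.get(userId) ?? 0;
  if (spent + estimatedCostUsd > LIMITS.monthlyBudgetUsd) {
    return { allowed: false, reason: "budget_exceeded" };
  }
  return { allowed: true };
}

// Call after each completed request with the actual cost.
function recordSpend(userId, costUsd) {
  spendByUser.set(userId, (spendByUser.get(userId) ?? 0) + costUsd);
}
```

Returning a reason rather than throwing lets the caller show a specific message, such as an upgrade prompt for a budget stop versus a request to shorten the input.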

Observability

Trace every LLM call. For each call, log a record such as:

{
  "trace_id": "...",
  "user_id": "...",
  "feature": "support-bot",
  "model": "claude-opus-4-20251030",
  "prompt_tokens": 1247,
  "completion_tokens": 318,
  "total_tokens": 1565,
  "latency_ms": 2840,
  "ttft_ms": 410,
  "cost_usd": 0.0231,
  "cache_hit": false,
  "tool_calls": ["search_docs", "create_ticket"],
  "user_feedback": "thumbs_up"
}

This single record powers cost reports, quality dashboards, latency SLO tracking, and debugging.
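A record like the one above can be assembled from the timestamps and token counts available at the end of each call. In this sketch the price table is hypothetical (USD per million tokens, with a placeholder model name), and the input shape of buildCallRecord is an assumption; only the output field names follow the record shown.

```javascript
// Hypothetical prices in USD per million tokens; load real ones from config.
const PRICES = { "example-model": { input: 3, output: 15 } };

function buildCallRecord({ traceId, userId, feature, model, usage, startedAt, firstTokenAt, finishedAt, cacheHit }) {
  const price = PRICES[model];
  const costUsd =
    (usage.promptTokens * price.input + usage.completionTokens * price.output) / 1e6;
  return {
    trace_id: traceId,
    user_id: userId,
    feature,
    model,
    prompt_tokens: usage.promptTokens,
    completion_tokens: usage.completionTokens,
    total_tokens: usage.promptTokens + usage.completionTokens,
    latency_ms: finishedAt - startedAt, // full request duration
    ttft_ms: firstTokenAt - startedAt, // time to first token
    cost_usd: Number(costUsd.toFixed(6)),
    cache_hit: cacheHit,
  };
}
```

Computing cost at write time, from the prices in force on that day, keeps historical reports stable when providers change their pricing.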

Feature Flags and Kill Switches

Wrap every GenAI feature in a feature flag. When (not if) a model regresses, a provider has an outage, or a customer complains about output, you can disable the feature instantly without a deploy.

Build a graceful degradation path: if GenAI is off, fall back to keyword search, a rule-based response, or "this feature is temporarily unavailable" — never an opaque 500.
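The flag check and the degradation path can live in one wrapper. This sketch assumes a plain object as a stand-in for a feature-flag service, and the generate and keywordSearch functions are hypothetical dependencies passed in by the caller.

```javascript
// Stand-in for a feature-flag service; the flag name is illustrative.
const flags = { "support-bot-genai": true };

// Answer with GenAI when enabled and healthy, otherwise degrade to keyword search.
async function answerWithFallback(question, { generate, keywordSearch }) {
  if (!flags["support-bot-genai"]) {
    // Kill switch is on: skip the model entirely.
    return { source: "fallback", text: await keywordSearch(question) };
  }
  try {
    return { source: "genai", text: await generate(question) };
  } catch (err) {
    // Provider failure: degrade instead of surfacing an opaque 500.
    return { source: "fallback", text: await keywordSearch(question) };
  }
}
```

Returning the source alongside the text lets the UI label degraded answers and lets dashboards track how often the fallback path is taken.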

Latency Budget

Set explicit budgets per user journey:

Pattern	Target
Interactive chat (with streaming)	TTFT < 1s
Background classification	P95 < 5s
Heavy reasoning task	P95 < 30s; show progress UI
Async / batch generation	Minutes acceptable

If you can't meet the budget, restructure: use a smaller model, parallelise retrieval, cache aggressively, or move to async with notification.
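Checking a budget means computing the percentile from measured latencies and comparing it with the target. The sketch below uses the nearest-rank definition of a percentile, one of several in common use, and the journey names are illustrative keys for the targets in the table. Applying P95 to the chat TTFT target is an assumption made here, since the table does not name a percentile for it.

```javascript
// Targets from the table above, in milliseconds; journey names are illustrative.
const BUDGETS_MS = {
  chat_ttft: 1000,
  background_classification: 5000,
  heavy_reasoning: 30000,
};

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Compare the measured P95 for a journey with its budget.
function checkBudget(journey, latenciesMs) {
  const p95 = percentile(latenciesMs, 95);
  return { journey, p95, withinBudget: p95 <= BUDGETS_MS[journey] };
}
```

Run a check like this over a rolling window of production latencies, so a breach triggers the restructuring options above before users notice.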

Versioning Prompts

Store prompts as code or in a prompt registry; never hardcode them inline in application logic
Version every change; tag deployments with the prompt version
Roll out new prompts behind a feature flag with A/B testing — measure quality before full rollout
Snapshot the (model, prompt) pair that produced each output for reproducibility
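A small in-code registry is enough to illustrate the practices above. Everything here is hypothetical: the prompt ID, the version strings, the template text, and the double-brace placeholder syntax are all choices made for the sketch rather than any standard.

```javascript
// Versioned templates keyed by prompt ID, then by version.
const PROMPTS = {
  "support-answer": {
    "1.0.0": "Answer using only the context.\n\nContext:\n{{context}}\n\nQuestion: {{question}}",
    "1.1.0": "Answer using only the context and cite the passage you used.\n\nContext:\n{{context}}\n\nQuestion: {{question}}",
  },
};

// Fill a specific prompt version and return it with a snapshot for the trace log.
function renderPrompt(id, version, vars, model) {
  const template = PROMPTS[id]?.[version];
  if (template === undefined) throw new Error(`Unknown prompt ${id}@${version}`);
  const text = template.replace(/\{\{(\w+)\}\}/g, (_, name) => {
    if (!(name in vars)) throw new Error(`Missing variable: ${name}`);
    return String(vars[name]);
  });
  // Store the (model, prompt) pair with every output for reproducibility.
  return { text, snapshot: { model, prompt_id: id, prompt_version: version } };
}
```

Selecting the version from a feature flag at the call site gives the A/B rollout described above, and the snapshot ties each logged output back to the exact prompt that produced it.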

The Production Checklist

Multi-model router with explicit per-feature mapping
Exact + semantic cache; prompt-prefix caching enabled
Streaming for any user-facing chat
Retry with exponential backoff; cross-provider failover for critical paths
Per-user / per-org cost caps and alerts
Full request/response tracing with token + cost metrics
Golden-set evaluation in CI before deploys
Online quality sampling + LLM-as-a-judge
Safety filters on both input (pre-processing) and output (post-processing)
Feature flags with graceful fallback
Documented incident runbook for: model regression, provider outage, cost spike

The Path From Here

You now have a complete production playbook: the landscape, the mechanics, prompting, advanced reasoning, RAG, customisation, evaluation/safety, and architecture. Next steps to deepen:

Take the AWS Certified AI Practitioner (AIF-C01) or Azure AI Engineer (AI-102) exam — both formalise this knowledge in a cloud-specific context
Build a small RAG app over your own docs — there is no substitute for the end-to-end experience
Read the OpenAI Cookbook, Anthropic's prompt engineering interactive tutorial, and the Hugging Face documentation
Follow the eval / RAGAS / promptfoo communities — evaluation is where the cutting edge of the practitioner field lives

GenAI is a moving target — the principles in this course are durable, but the specific models and tools will keep changing. The skill you've built — to reason about what to use when and how to measure it — transfers across every revision the field will bring.