7 min read·Lesson 8 of 8

Building Production GenAI Applications

Architecture for shipping GenAI to real users — model routing, caching, streaming, retries, cost control, and observability.

You've prompted, retrieved, fine-tuned, and evaluated. Now the application has to run reliably, cheaply, and safely for thousands of users. This lesson covers the production architecture that holds it all together.

Reference Architecture

[User]
   │
   ▼
[App backend] ──▶ [Prompt builder] ──▶ [Cache check] ──▶ [hit: return]
                                              │
                                              ▼ (miss)
                                       [Retriever (RAG)]
                                              │
                                              ▼
                                       [Model router]
                                              │
                                              ▼
                              [LLM provider call (streaming)]
                                              │
                                              ▼
                                       [Output validator]
                                              │
                                              ▼
                                       [Cache write]
                                              │
                                              ▼
                                       [Log + metrics]
                                              │
                                              ▼
[User] (streamed response)
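
One way to read the diagram is as a single request handler that calls each stage in order. The sketch below is a minimal illustration under that reading, not a reference implementation: every stage (`buildPrompt`, `cache`, `retrieve`, `route`, `callModel`, `validate`, `log`) is a hypothetical function injected by the caller, and streaming is left out to keep the control flow visible.

```javascript
// All stage names are hypothetical; each one is injected so the sketch
// stays independent of any particular provider or cache.
async function handleRequest(userQuery, stages) {
  const { buildPrompt, cache, retrieve, route, callModel, validate, log } = stages;
  const prompt = buildPrompt(userQuery);

  // Cache check: on a hit, return without calling the model.
  const cached = await cache.get(prompt);
  if (cached !== undefined) {
    log({ cache_hit: true }); // logging hits too is a choice made for this sketch
    return cached;
  }

  const context = await retrieve(userQuery);              // retriever (RAG)
  const model = route(userQuery);                         // model router
  const output = await callModel(model, prompt, context); // provider call
  const validated = validate(output);                     // output validator, may throw
  await cache.set(prompt, validated);                     // cache write
  log({ cache_hit: false, model });                       // log + metrics
  return validated;
}
```

Injecting the stages keeps each box in the diagram replaceable and lets the handler be tested with stubs.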

Model Routing

Different queries deserve different models. A cheap, fast model can usually handle the bulk of traffic (often 70-90%); route to a frontier model only when the query needs it.

Query type                  | Best fit
Classification, simple Q&A  | Small/fast (Haiku, GPT-4o-mini, Gemini Flash)
Code generation             | Code-specialised (Claude Sonnet, GPT-5-codex)
Long-document reasoning     | Frontier or long-context variants
Complex multi-step logic    | Reasoning model (o3, Claude Sonnet 4 thinking)
Multimodal (image, audio)   | Native multimodal (GPT-5, Gemini 2.x, Claude Opus 4)

Implement routing as code: a small classifier (often itself a small LLM) tags each query, and the router dispatches it to the matching model. Done well, this can cut cost by roughly 5-10× with little or no loss in quality.
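
A minimal sketch of the classify-then-dispatch idea. The tag names, the model labels in `ROUTES`, and the keyword heuristics in `classifyQuery` are all assumptions made for illustration; in practice the classifier would often be a call to a small LLM.

```javascript
// Hypothetical tags and model labels; substitute the models you actually use.
const ROUTES = {
  simple: "small-fast-model",
  code: "code-model",
  long_document: "long-context-model",
  reasoning: "reasoning-model",
  multimodal: "multimodal-model",
};

// Stand-in classifier based on keyword heuristics. A production system
// would more likely ask a small LLM to return one of the tags above.
function classifyQuery(query, { hasAttachment = false } = {}) {
  if (hasAttachment) return "multimodal";
  if (query.length > 8000) return "long_document";
  if (/\b(function|class|bug|stack trace|refactor)\b/i.test(query)) return "code";
  if (/\b(step by step|prove|plan|trade-?offs?)\b/i.test(query)) return "reasoning";
  return "simple";
}

// Dispatch: map the tag to a model, defaulting to the cheap tier.
function routeModel(query, options) {
  const tag = classifyQuery(query, options);
  return { tag, model: ROUTES[tag] ?? ROUTES.simple };
}
```

Defaulting to the cheap tier means an unrecognised tag falls back to the cheap model instead of failing the request.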

Caching

Exact-match cache

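Hash the full prompt; if the hash has been seen before, return the cached response. Redis or Memcached works well as the store. An exact-match cache saves nothing if every prompt contains user-specific data, because no two prompts produce the same key.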
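
A minimal sketch of an exact-match cache. An in-memory `Map` stands in for Redis or Memcached, and the helper names (`cacheKey`, `getCached`, `putCached`) are hypothetical. The key covers the model and sampling parameters as well as the prompt, on the assumption that any of them can change the output.

```javascript
const { createHash } = require("node:crypto");

// In-memory Map as a stand-in for Redis or Memcached (assumption for this sketch).
const cache = new Map();

// Hash everything that affects the output: model, parameters, and full prompt.
// Note: JSON.stringify is sensitive to key order inside `params`.
function cacheKey(model, prompt, params = {}) {
  const payload = JSON.stringify({ model, prompt, params });
  return createHash("sha256").update(payload).digest("hex");
}

// Returns the cached response, or null on a miss.
function getCached(model, prompt, params) {
  return cache.get(cacheKey(model, prompt, params)) ?? null;
}

function putCached(model, prompt, params, response) {
  cache.set(cacheKey(model, prompt, params), response);
}
```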

Semantic cache

Embed the query; check if any cached query's embedding is within a similarity threshold; if so, return its response. Powerful for FAQ-style traffic — different users ask the same thing in different words. Standard libraries: GPTCache, redis-semantic-cache.
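
The lookup can be sketched as a nearest-neighbour search with a cutoff. This is an illustration only: embeddings are plain arrays supplied by the caller (no embedding model is called here), the `SemanticCache` class is hypothetical, and the 0.9 default threshold is an arbitrary starting point that would need tuning against real traffic.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  constructor(threshold = 0.9) {
    this.threshold = threshold;
    this.entries = []; // each entry: { embedding, response }
  }

  // Linear scan for the most similar cached query; a vector index
  // would replace this loop at scale.
  lookup(embedding) {
    let best = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosine(embedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best && bestScore >= this.threshold ? best.response : null;
  }

  store(embedding, response) {
    this.entries.push({ embedding, response });
  }
}
```

A threshold set too low returns answers to questions the user did not ask, so it is worth logging the similarity score of every hit while tuning.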

Prompt-prefix cache (provider-side)

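Several frontier providers support prompt caching: a large, stable prefix (system prompt, retrieved docs) is cached on the provider side, either because you mark it as cacheable or because the provider detects the repeated prefix automatically. Subsequent requests with the same prefix get a discount on the cached input tokens (roughly 50-90%, depending on the provider) and lower latency. Use this aggressively in RAG and tool-heavy workloads.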
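
As an illustration of marking a prefix as cacheable, the sketch below builds a request body in the shape of Anthropic's Messages API, where a `cache_control` field on a content block marks the end of the cacheable prefix. The function name, parameter names, and `max_tokens` value are assumptions; other providers use different mechanisms, so check the current documentation before relying on this shape.

```javascript
// Builds a request body with the stable prefix marked as cacheable.
// Hypothetical helper; only the body shape follows the provider's API.
function buildCachedRequest({ model, systemPrompt, retrievedDocs, userQuestion }) {
  return {
    model,
    max_tokens: 1024,
    system: [
      { type: "text", text: systemPrompt },
      // The cache marker goes on the last block of the stable prefix;
      // everything up to and including this block is eligible for caching.
      { type: "text", text: retrievedDocs, cache_control: { type: "ephemeral" } },
    ],
    // The per-request part stays outside the cached prefix.
    messages: [{ role: "user", content: userQuestion }],
  };
}
```

Caching only helps when the prefix is byte-for-byte identical across requests, so keep timestamps, user names, and other volatile values out of it.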

Streaming

For any chat or generation UI, stream the response. Time-to-first-token is typically 200-1000ms; without streaming, the user waits the full generation time (5-30s for long outputs). Server-Sent Events (SSE) is the standard transport.

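// Example: relaying a provider SDK stream (Anthropic or OpenAI) to the browser over SSE.
// Assumes `stream` is the SDK's async-iterable stream and `res` is a Node.js
// http.ServerResponse (Express responses qualify), inside an async handler.
res.writeHead(200, { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" });
for await (const chunk of stream) {
  res.write(`data: ${JSON.stringify(chunk)}\n\n`); // one SSE event per chunk
}
res.end();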

Build retry logic that resumes streaming on transient errors — providers occasionally drop connections mid-stream.

Retries and Fallbacks

Error                    | Strategy
429 rate limit           | Exponential backoff; respect Retry-After header
5xx server error         | Retry 2-3× with backoff
Timeout (no response)    | Retry with shorter max_tokens
Context length exceeded  | Truncate or summarise context, then retry
Provider down            | Failover to a different provider with comparable model

Multi-provider failover (for example OpenAI as primary with Anthropic as backup, or routing through a gateway such as LiteLLM, OpenRouter, or Amazon Bedrock) is the standard resilience pattern. Test failover regularly: providers do have multi-hour outages.
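
The backoff and failover rows can be combined into one wrapper. The sketch below is a minimal illustration: `providers` is an ordered list of hypothetical async functions wrapping each SDK call, and the `status` and `retryAfterSeconds` properties on thrown errors are assumptions about how those wrappers report failures.

```javascript
// Delay before the next attempt: honour Retry-After when the wrapper
// supplies it, otherwise exponential backoff with jitter.
function backoffDelayMs(attempt, retryAfterSeconds, baseMs = 500, maxMs = 30000) {
  if (Number.isFinite(retryAfterSeconds)) return retryAfterSeconds * 1000;
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // between exp/2 and exp
}

// Only rate limits and server errors are retried in this sketch.
const isRetryable = (err) => err.status === 429 || (err.status >= 500 && err.status < 600);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Tries each provider in order; retries retryable errors with backoff,
// then fails over to the next provider once retries are exhausted.
async function callWithFailover(providers, request, { maxRetries = 3, sleepFn = sleep } = {}) {
  let lastError;
  for (const callProvider of providers) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await callProvider(request);
      } catch (err) {
        lastError = err;
        if (!isRetryable(err)) throw err;  // e.g. a 400: another provider will not help
        if (attempt === maxRetries) break; // move on to the next provider
        await sleepFn(backoffDelayMs(attempt, err.retryAfterSeconds));
      }
    }
  }
  throw lastError;
}
```

The injectable `sleepFn` exists so tests can skip the real delays; the jitter spreads out retries from many clients that failed at the same moment.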

Cost Control

LLM bills can explode silently. Controls every production system needs:

  • Per-user / per-org budget caps — hard stop at $X/month
  • Per-request token cap — set max_tokens conservatively
  • Aggregated cost dashboard — daily by feature/model
  • Anomaly alerts — Slack on 2× day-over-day spend
  • Cheap model defaults — make expensive opt-in
  • Cache hit rate metric — target 30%+ on chatty FAQ workloads
  • Token-counting before send — refuse oversized prompts client-side
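
Two of the controls above, the budget cap and the pre-send token check, can be sketched as a guard that runs before any provider call. The `BudgetGuard` class and its field names are hypothetical, spend is kept in memory only for illustration, and the four-characters-per-token estimate is a rough heuristic for English text; a real system would use the provider's tokenizer or token-counting endpoint.

```javascript
// Rough token estimate (about 4 characters per token for English text).
// This is a heuristic, not a tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

class BudgetGuard {
  constructor({ monthlyCapUsd, maxPromptTokens }) {
    this.monthlyCapUsd = monthlyCapUsd;
    this.maxPromptTokens = maxPromptTokens;
    this.spendByUser = new Map(); // userId -> USD spent this month
  }

  // Call before sending anything to the provider.
  check(userId, prompt) {
    if (estimateTokens(prompt) > this.maxPromptTokens) {
      return { allowed: false, reason: "prompt_too_large" };
    }
    if ((this.spendByUser.get(userId) ?? 0) >= this.monthlyCapUsd) {
      return { allowed: false, reason: "budget_exceeded" };
    }
    return { allowed: true, reason: null };
  }

  // Call after each completed request with its actual cost.
  record(userId, costUsd) {
    this.spendByUser.set(userId, (this.spendByUser.get(userId) ?? 0) + costUsd);
  }
}
```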

Observability

Trace every LLM call. For each call, log a record such as:

{
  "trace_id": "...",
  "user_id": "...",
  "feature": "support-bot",
  "model": "claude-opus-4-20251030",
  "prompt_tokens": 1247,
  "completion_tokens": 318,
  "total_tokens": 1565,
  "latency_ms": 2840,
  "ttft_ms": 410,
  "cost_usd": 0.0231,
  "cache_hit": false,
  "tool_calls": ["search_docs", "create_ticket"],
  "user_feedback": "thumbs_up"
}

This single record powers cost reports, quality dashboards, latency SLO tracking, and debugging.
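
A minimal sketch of how such a record might be assembled after each call. The `buildCallRecord` helper, its parameter names, and the `PRICES` table are all hypothetical; the prices shown are placeholders, not any provider's actual rates.

```javascript
// Placeholder prices in USD per million tokens; load real values from config.
const PRICES = { "example-model": { input: 3, output: 15 } };

// Timestamps are epoch milliseconds captured by the caller.
function buildCallRecord({ traceId, userId, feature, model, usage, startedAt, firstTokenAt, finishedAt, cacheHit }) {
  const price = PRICES[model];
  const costUsd = price
    ? (usage.promptTokens * price.input + usage.completionTokens * price.output) / 1e6
    : null; // unknown model: record null instead of guessing
  return {
    trace_id: traceId,
    user_id: userId,
    feature,
    model,
    prompt_tokens: usage.promptTokens,
    completion_tokens: usage.completionTokens,
    total_tokens: usage.promptTokens + usage.completionTokens,
    latency_ms: finishedAt - startedAt,
    ttft_ms: firstTokenAt - startedAt,
    cost_usd: costUsd,
    cache_hit: cacheHit,
  };
}
```

Computing cost at log time, from a price table you control, keeps historical records stable even when provider prices change later.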

Feature Flags and Kill Switches

Wrap every GenAI feature in a feature flag. When (not if) a model regresses, a provider has an outage, or a customer complains about output, you can disable the feature instantly without a deploy.

Build a graceful degradation path: if GenAI is off, fall back to keyword search, a rule-based response, or "this feature is temporarily unavailable" — never an opaque 500.
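
The flag check and the degradation path can be sketched together. This is an illustration under stated assumptions: the in-memory `flags` map stands in for a real flag service, the flag name `genai_answers` is made up, and `generate` and `keywordSearch` are hypothetical async functions supplied by the caller.

```javascript
// In-memory flag store as a stand-in for a real flag service.
const flags = new Map([["genai_answers", true]]);
const isEnabled = (name) => flags.get(name) === true;

// Tries GenAI first (if enabled), then keyword search, then a clear message.
async function answer(query, { generate, keywordSearch }) {
  if (isEnabled("genai_answers")) {
    try {
      return { source: "genai", text: await generate(query) };
    } catch {
      // GenAI failed: fall through to the degraded path below.
    }
  }
  try {
    return { source: "keyword_search", text: await keywordSearch(query) };
  } catch {
    return { source: "unavailable", text: "This feature is temporarily unavailable." };
  }
}
```

Returning the `source` alongside the text lets the UI and the logs show which path served the request.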

Latency Budget

Set explicit budgets per user journey:

Pattern                            | Target
Interactive chat (with streaming)  | TTFT < 1s
Background classification          | P95 < 5s
Heavy reasoning task               | P95 < 30s; show progress UI
Async / batch generation           | Minutes acceptable

If you can't meet the budget, restructure: use a smaller model, parallelise retrieval, cache aggressively, or move to async with notification.
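
Checking a budget comes down to computing a percentile over recent latencies and comparing it with the target. The sketch below uses the nearest-rank method; the `BUDGETS_MS` keys and the function names are hypothetical, and the budget values are the P95 targets from the table above.

```javascript
// P95 budgets in milliseconds, taken from the latency table.
const BUDGETS_MS = { background_classification: 5000, heavy_reasoning: 30000 };

// Nearest-rank percentile of an array of numbers (p between 0 and 100).
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Compares the observed P95 with the budget for the given pattern.
function checkBudget(pattern, latenciesMs) {
  const p95 = percentile(latenciesMs, 95);
  return { pattern, p95, withinBudget: p95 <= BUDGETS_MS[pattern] };
}
```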

Versioning Prompts

  • Store prompts as code or in a prompt registry — never hardcode in a script
  • Version every change; tag deployments with the prompt version
  • Roll out new prompts behind a feature flag with A/B testing — measure quality before full rollout
  • Snapshot the (model, prompt) pair that produced each output for reproducibility
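
A prompt registry with per-output snapshots can be sketched in a few lines. The `PromptRegistry` class and the `snapshot` helper are hypothetical, storage is in memory for illustration only, and the 12-character hash prefix is an arbitrary choice.

```javascript
const { createHash } = require("node:crypto");

// Minimal in-memory registry; a database or versioned files would
// replace the Map in a real system.
class PromptRegistry {
  constructor() {
    this.versions = new Map(); // prompt name -> array of templates
  }

  // Appends a new version and returns its 1-based version number.
  register(name, template) {
    const list = this.versions.get(name) ?? [];
    list.push(template);
    this.versions.set(name, list);
    return list.length;
  }

  get(name, version) {
    const template = this.versions.get(name)?.[version - 1];
    if (template === undefined) throw new Error(`Unknown prompt ${name}@v${version}`);
    return template;
  }
}

// Snapshot of the (model, prompt) pair to store next to each output.
function snapshot(model, name, version, template) {
  const promptHash = createHash("sha256").update(template).digest("hex").slice(0, 12);
  return { model, prompt: `${name}@v${version}`, promptHash };
}
```

Storing the hash alongside the version tag makes it possible to detect a template that was edited without a version bump.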

The Production Checklist

  • Multi-model router with explicit per-feature mapping
  • Exact + semantic cache; prompt-prefix caching enabled
  • Streaming for any user-facing chat
  • Retry with exponential backoff; cross-provider failover for critical paths
  • Per-user / per-org cost caps and alerts
  • Full request/response tracing with token + cost metrics
  • Golden-set evaluation in CI before deploys
  • Online quality sampling + LLM-as-a-judge
  • Safety filter pre + post processing
  • Feature flags with graceful fallback
  • Documented incident runbook for: model regression, provider outage, cost spike

The Path From Here

You now have a complete production playbook: the landscape, the mechanics, prompting, advanced reasoning, RAG, customisation, evaluation/safety, and architecture. Next steps to deepen:

  • Take the AWS Certified AI Practitioner (AIF-C01) or Azure AI Engineer (AI-102) exam — both formalise this knowledge in a cloud-specific context
  • Build a small RAG app over your own docs — there is no substitute for the end-to-end experience
  • Read the OpenAI Cookbook, Anthropic's prompt engineering interactive tutorial, and the Hugging Face documentation
  • Follow the eval / RAGAS / promptfoo communities — evaluation is where the cutting edge of the practitioner field lives

GenAI is a moving target — the principles in this course are durable, but the specific models and tools will keep changing. The skill you've built — to reason about what to use when and how to measure it — transfers across every revision the field will bring.

Key Takeaways

  • Route queries to the cheapest model that handles them well — a small/fast model for most, frontier for the hard ones.
  • Cache both exact and semantic — embedding-based caches deduplicate near-identical queries.
  • Stream responses for any user-facing chat — perceived latency drops dramatically.
  • Set per-user and per-org cost budgets; alert before they're hit.
  • Build a "kill switch" — feature flag to disable GenAI if a model regresses or a provider has an outage.

Course Complete!

You've finished Generative AI & Prompt Engineering. Now put your knowledge to the test with real exam-style practice questions.