Performance has a vocabulary. Confusing the terms is the fastest way to design something wrong.
Latency vs Throughput
| Term | Definition | Unit |
|---|---|---|
| Latency | Time to handle one request | ms (or µs, s) |
| Throughput | Requests handled per unit time | req/sec, QPS, RPS |
| Bandwidth | Data per unit time | MB/s, Gbps |
| Concurrency | Requests in flight at once | count |
Little's Law links them: concurrency = throughput × latency. If your service handles 1,000 req/sec at 100 ms latency, you have ~100 in-flight requests on average. Pool sizes, thread counts, and connection limits all flow from this.
Percentiles, Not Averages
"Average latency 50 ms" tells you almost nothing. The user experience lives in the tail.
- p50 (median) — half the requests are faster than this.
- p95 — 5% of requests are slower than this. The "slow but not terrible" range.
- p99 — 1% of requests are slower. This is what your worst customers feel.
- p99.9 / p99.99 — what you watch when uptime matters.
Tail latency compounds: a page that fans out 10 backend calls in parallel feels p99 ≈ the slowest of 10 random p99s, which is much worse than any one service's p99. This is why "the slow tail kills you at scale".
Vertical vs Horizontal Scaling
Vertical (scale up)
Bigger machine — more CPU, RAM, faster disk.
- + Simple, no code changes.
- + Strong consistency naturally (one box, one source of truth).
- − Hard cap: there is a biggest available machine.
- − Single point of failure unless paired with replicas.
- − Cost scales superlinearly past a point.
Horizontal (scale out)
More machines, work spread across them.
- + Effectively unbounded.
- + Failures are localised.
- − Coordination, consistency, partitioning become real problems.
- − Most code that "works on one box" needs rethinking for many.
The general rule: scale vertically until the simplicity benefit runs out, then scale horizontally. Don't pre-shard a database that fits comfortably on one machine.
Stateless vs Stateful
The thing that makes horizontal scaling hard is state:
- Stateless services hold no per-user data between requests; any instance can serve any request. Add or remove instances freely. Most web/app servers should be stateless.
- Stateful services hold data: databases, caches, queues, file servers. They scale via replication and sharding (next lessons).
The classic three-tier pattern — stateless web tier, stateless app tier, stateful data tier — exists precisely so that the easy-to-scale parts can be scaled easily, and complexity is concentrated only where state demands it.
Amdahls Law and the Serial Bottleneck
If a fraction s of your workload is inherently serial, no amount of parallelism makes the system faster than 1/s times the serial portion. A 5%-serial workload caps out at 20× speedup, no matter how many cores or nodes you throw at it.
Practical consequences:
- A single shared lock or counter ruins scaling.
- A single primary database write path caps your write throughput.
- A single coordinator service becomes the bottleneck of the whole platform.
Architects spend much of their time hunting and removing serial bottlenecks.
Capacity Planning
Capacity = (resource per request) × (requests per second) ÷ (resource per server) — with a generous headroom.
Example:
Per request: 50 ms CPU on a 4-core box → 200 ms of CPU per request available
→ 4 cores × 1000 ms = 4000 ms / sec capacity
→ 4000 / 50 = 80 req/sec per box at 100% CPU
Target: 1000 req/sec at peak, headroom 50% (run at 50% CPU) → effective 40 req/sec per box
Boxes needed: 1000 / 40 = 25 boxes
Always plan for headroom. Servers running at 90% CPU fail badly under any spike.
Where Latency Comes From
| Source | Typical |
|---|---|
| L1/L2 CPU cache | ~1 ns |
| Main memory | ~100 ns |
| SSD random read | ~100 µs |
| HDD random read | ~10 ms |
| Same-AZ network | ~0.5–1 ms RTT |
| Cross-AZ network | ~1–2 ms RTT |
| Cross-region network | ~30–150 ms RTT |
| Cross-continent | ~100–250 ms RTT |
"Add a network hop" is never free. Microservices designs that fan out 10 hops per request inherit the cost of 10 hops. Co-locate chatty services or use bigger requests, fewer hops.
Throughput Levers
- Concurrency — more threads / async tasks per box (until contention or context switching dominates).
- Batching — process N items in one round-trip instead of N round-trips.
- Caching — return precomputed answers; the cheapest request is the one that didn't happen.
- Async — accept the work, return 202, process in the background. Reply once the user actually needs it.
- Sharding — partition the workload so independent shards scale horizontally.
The Cheapest Request
The single most important performance principle: the request you don't make is free. CDNs, caches, deduplication, conditional GETs, idempotent retries with stored results — all of these exist to avoid work entirely. Optimising the request you do make matters less than removing requests altogether.
Closing the Loop on Numbers
Every design decision in the rest of this course will come with numbers. When you read about caching, ask: how much of my QPS does it absorb? When you read about sharding, ask: how does throughput scale per shard? Without numbers, system design is just words. With them, it becomes engineering.