Scalability, Latency, and Throughput — System Design Fundamentals | CertQnA

Performance has a vocabulary. Confusing the terms is the fastest way to design something wrong.

Latency vs Throughput

Term	Definition	Unit
Latency	Time to handle one request	ms (or µs, s)
Throughput	Requests handled per unit time	req/sec, QPS, RPS
Bandwidth	Data per unit time	MB/s, Gbps
Concurrency	Requests in flight at once	count

Little's Law links them: concurrency = throughput × latency. If your service handles 1,000 req/sec at 100 ms latency, you have ~100 in-flight requests on average. Pool sizes, thread counts, and connection limits all flow from this.

Percentiles, Not Averages

"Average latency 50 ms" tells you almost nothing. The user experience lives in the tail.

p50 (median) — half the requests are faster than this.
p95 — 5% of requests are slower than this. The "slow but not terrible" range.
p99 — 1% of requests are slower. This is what your worst customers feel.
p99.9 / p99.99 — what you watch when uptime matters.

Tail latency compounds: a page that fans out 10 backend calls in parallel feels p99 ≈ the slowest of 10 random p99s, which is much worse than any one service's p99. This is why "the slow tail kills you at scale".

Vertical vs Horizontal Scaling

Vertical (scale up)

Bigger machine — more CPU, RAM, faster disk.

+ Simple, no code changes.
+ Strong consistency naturally (one box, one source of truth).
− Hard cap: there is a biggest available machine.
− Single point of failure unless paired with replicas.
− Cost scales superlinearly past a point.

Horizontal (scale out)

More machines, work spread across them.

+ Effectively unbounded.
+ Failures are localised.
− Coordination, consistency, partitioning become real problems.
− Most code that "works on one box" needs rethinking for many.

The general rule: scale vertically until the simplicity benefit runs out, then scale horizontally. Don't pre-shard a database that fits comfortably on one machine.

Stateless vs Stateful

The thing that makes horizontal scaling hard is state:

Stateless services hold no per-user data between requests; any instance can serve any request. Add or remove instances freely. Most web/app servers should be stateless.
Stateful services hold data: databases, caches, queues, file servers. They scale via replication and sharding (next lessons).

The classic three-tier pattern — stateless web tier, stateless app tier, stateful data tier — exists precisely so that the easy-to-scale parts can be scaled easily, and complexity is concentrated only where state demands it.

Amdahls Law and the Serial Bottleneck

If a fraction s of your workload is inherently serial, no amount of parallelism makes the system faster than 1/s times the serial portion. A 5%-serial workload caps out at 20× speedup, no matter how many cores or nodes you throw at it.

Practical consequences:

A single shared lock or counter ruins scaling.
A single primary database write path caps your write throughput.
A single coordinator service becomes the bottleneck of the whole platform.

Architects spend much of their time hunting and removing serial bottlenecks.

Capacity Planning

Capacity = (resource per request) × (requests per second) ÷ (resource per server) — with a generous headroom.

Example:
  Per request: 50 ms CPU on a 4-core box → 200 ms of CPU per request available
                                          → 4 cores × 1000 ms = 4000 ms / sec capacity
                                          → 4000 / 50 = 80 req/sec per box at 100% CPU

  Target: 1000 req/sec at peak, headroom 50% (run at 50% CPU) → effective 40 req/sec per box
  Boxes needed: 1000 / 40 = 25 boxes

Always plan for headroom. Servers running at 90% CPU fail badly under any spike.

Where Latency Comes From

Source	Typical
L1/L2 CPU cache	~1 ns
Main memory	~100 ns
SSD random read	~100 µs
HDD random read	~10 ms
Same-AZ network	~0.5–1 ms RTT
Cross-AZ network	~1–2 ms RTT
Cross-region network	~30–150 ms RTT
Cross-continent	~100–250 ms RTT

"Add a network hop" is never free. Microservices designs that fan out 10 hops per request inherit the cost of 10 hops. Co-locate chatty services or use bigger requests, fewer hops.

Throughput Levers

Concurrency — more threads / async tasks per box (until contention or context switching dominates).
Batching — process N items in one round-trip instead of N round-trips.
Caching — return precomputed answers; the cheapest request is the one that didn't happen.
Async — accept the work, return 202, process in the background. Reply once the user actually needs it.
Sharding — partition the workload so independent shards scale horizontally.

The Cheapest Request

The single most important performance principle: the request you don't make is free. CDNs, caches, deduplication, conditional GETs, idempotent retries with stored results — all of these exist to avoid work entirely. Optimising the request you do make matters less than removing requests altogether.

Closing the Loop on Numbers

Every design decision in the rest of this course will come with numbers. When you read about caching, ask: how much of my QPS does it absorb? When you read about sharding, ask: how does throughput scale per shard? Without numbers, system design is just words. With them, it becomes engineering.