One database server eventually runs out: of CPU, of disk, of write capacity, or of "what if it dies?" tolerance. Replication and sharding are the two answers.
Replication
Replication keeps copies of the same data on multiple machines. It buys you:
- Availability — failover when the primary dies.
- Read scalability — distribute read traffic to replicas.
- Geographic locality — put a replica near each region of users.
- Backups / analytics — run heavy queries off the replica without slowing the primary.
Single-leader (primary-replica)
Writes
▼
┌────────┐ replication log ┌─────────┐
│ Leader │ ────────────────────→ │ Replica │
└────────┘ └─────────┘
│ ▲
└────── Reads (consistent) ───────┴── Reads (possibly stale)
The default model in Postgres, MySQL, MongoDB, SQL Server. One node accepts writes; the rest follow.
- + Simple, well-understood.
- + Strong consistency on the leader.
- − Single write bottleneck.
- − Failover is non-trivial; "split brain" is a real risk.
Multi-leader
Multiple nodes accept writes. Used across data centres or for collaborative apps (e.g. Couchbase, Mongo with multi-master, application-level CRDTs).
- + No single write bottleneck; writes survive a partition.
- − Conflict resolution is hard. Two leaders accept conflicting writes — who wins?
- − Most teams who think they want this actually don't.
Leaderless
Any node accepts any write; reads and writes go to a quorum. Pioneered by Dynamo, used by Cassandra, ScyllaDB, Riak.
- Quorum:
R + W > Nensures that a read overlaps with the latest write (where N is replica count, R/W are read/write quorum sizes). - + Highly available, no failover step.
- − Consistency tunable but trickier; needs read repair, anti-entropy, hinted handoff.
Sync vs Async Replication
| Mode | Behaviour | Trade-off |
|---|---|---|
| Asynchronous | Leader acks write before replicas confirm | Fast writes; can lose data on leader crash; replicas can lag seconds |
| Synchronous | Leader waits for at least one replica | Durable; slower writes; replica problem becomes write problem |
| Semi-sync | Wait for one replica, others async | Practical compromise; many production setups use this |
Replica Lag
When async, replicas trail the leader. Symptoms: user updates profile, refreshes, sees old value. Mitigations:
- Read your own writes — route a user's reads to the leader for N seconds after they write.
- Monotonic reads — pin a user to one replica so they don't see time go backwards.
- Check replica lag and route to leader if it exceeds a threshold.
Sharding (Partitioning)
Sharding splits one logical dataset across many physical machines. Each shard owns a subset of rows.
users table
┌────────────────┐
│ user_id %% 4=0 │ → shard 0
│ user_id %% 4=1 │ → shard 1
│ user_id %% 4=2 │ → shard 2
│ user_id %% 4=3 │ → shard 3
└────────────────┘
It buys you:
- Write throughput beyond a single leader.
- Storage beyond a single disk.
- Smaller indexes, faster point queries.
Partition Key
The most important decision in sharding. The key determines which shard each row lives on.
- Choose a key with high cardinality and uniform access —
user_idis usually good. - Avoid keys that concentrate traffic —
tenant_idwhen one tenant is 90% of traffic causes a hot shard. - Avoid time as the only key — yesterday's shard is cold, today's is on fire.
Sharding Strategies
| Strategy | How | Pros / Cons |
|---|---|---|
| Range-based | Shard 0: A–F, Shard 1: G–M, … | Range scans cheap; hot shards if data skewed |
| Hash-based | hash(key) %% N | Even distribution; range scans expensive |
| Consistent hashing | Hash key onto a ring; node adjacent owns it | Adding/removing a node moves only ~1/N of data |
| Directory-based | Lookup service maps key → shard | Most flexible; lookup service is a dependency |
Resharding
Adding a shard with naive hash(key) %% N moves nearly every row. Consistent hashing with virtual nodes minimises this. Some systems (Vitess, MongoDB, Cassandra) handle resharding online; others require downtime — find out before going live.
Replication and Sharding Together
Real systems combine both. A typical large deployment:
┌────────────────────────────┐
│ Routing layer / proxy │
└────────────────────────────┘
│ │ │ │
Shard 0 Shard 1 Shard 2 Shard 3
┌──┐ ┌──┐ ┌──┐ ┌──┐
│L │ │L │ │L │ │L │
├──┤ ├──┤ ├──┤ ├──┤
│R1│ │R1│ │R1│ │R1│
│R2│ │R2│ │R2│ │R2│
└──┘ └──┘ └──┘ └──┘
Each shard is a small replicated cluster. Routing layer sends each query to the right shard.
Cross-Shard Operations
The cost of sharding is that some operations no longer work the way you'd hope:
- Joins across shards — expensive or impossible. Denormalise instead.
- Transactions across shards — require two-phase commit or sagas. Avoid where possible.
- Aggregate queries — must fan out to all shards and merge results.
Design your data model so that the queries you need most often touch one shard. Cross-shard work should be the exception.
Failure Modes
- Hot shard — uneven traffic. Add more nodes for that shard, or change the key.
- Replica lag — async replicas fall behind. Monitor and route around laggards.
- Split brain — both leaders think they're primary after a partition. Use a coordinator (etcd, Zookeeper, Consul) and require leader leases.
- Failover loops — flapping leadership election. Use stable timeouts and quorum.
The Managed Services Cheat Code
Building this yourself is hard. Most teams should use:
- Aurora / Cloud SQL / Azure SQL — managed replication, point-in-time recovery, readers.
- DynamoDB / Bigtable / Cosmos — managed sharding with consistent hashing built in.
- Spanner / CockroachDB / Yugabyte — distributed SQL with replication and sharding under the SQL layer.
You still need to understand the concepts to choose well and debug, but you don't have to operate Cassandra at 3 a.m.
Cert Mapping
| Cert | Coverage |
|---|---|
| AWS SAA / SAP | Aurora replicas, Multi-AZ, DynamoDB partitions, global tables |
| Azure AZ-305 | Geo-replication, Cosmos DB partitioning, multi-region writes |
| GCP PCA | Spanner regional/multi-regional, Bigtable, Cloud SQL HA |
The Default Architecture
- One primary + 2 replicas (Multi-AZ) is enough for most apps.
- Add read replicas only when read traffic actually exceeds primary capacity.
- Don't shard until you must. Sharding is a one-way door.
- When you do shard, pick the partition key carefully and assume you'll regret it eventually.
The next lesson covers the consistency models that all of this trades against — CAP, PACELC, and what your design promises about correctness under partition.