Replication and Sharding — System Design Fundamentals | CertQnA

One database server eventually runs out: of CPU, of disk, of write capacity, or of "what if it dies?" tolerance. Replication and sharding are the two answers.

Replication

Replication keeps copies of the same data on multiple machines. It buys you:

Availability — failover when the primary dies.
Read scalability — distribute read traffic to replicas.
Geographic locality — put a replica near each region of users.
Backups / analytics — run heavy queries off the replica without slowing the primary.

Single-leader (primary-replica)

    Writes
       ▼
   ┌────────┐    replication log    ┌─────────┐
   │ Leader │ ────────────────────→ │ Replica │
   └────────┘                       └─────────┘
       │                                 ▲
       └────── Reads (consistent) ───────┴── Reads (possibly stale)

The default model in Postgres, MySQL, MongoDB, SQL Server. One node accepts writes; the rest follow.

+ Simple, well-understood.
+ Strong consistency on the leader.
− Single write bottleneck.
− Failover is non-trivial; "split brain" is a real risk.

Multi-leader

Multiple nodes accept writes. Used across data centres or for collaborative apps (e.g. Couchbase, Mongo with multi-master, application-level CRDTs).

+ No single write bottleneck; writes survive a partition.
− Conflict resolution is hard. Two leaders accept conflicting writes — who wins?
− Most teams who think they want this actually don't.

Leaderless

Any node accepts any write; reads and writes go to a quorum. Pioneered by Dynamo, used by Cassandra, ScyllaDB, Riak.

Quorum: R + W > N ensures that a read overlaps with the latest write (where N is replica count, R/W are read/write quorum sizes).
+ Highly available, no failover step.
− Consistency tunable but trickier; needs read repair, anti-entropy, hinted handoff.

Sync vs Async Replication

Mode	Behaviour	Trade-off
Asynchronous	Leader acks write before replicas confirm	Fast writes; can lose data on leader crash; replicas can lag seconds
Synchronous	Leader waits for at least one replica	Durable; slower writes; replica problem becomes write problem
Semi-sync	Wait for one replica, others async	Practical compromise; many production setups use this

Replica Lag

When async, replicas trail the leader. Symptoms: user updates profile, refreshes, sees old value. Mitigations:

Read your own writes — route a user's reads to the leader for N seconds after they write.
Monotonic reads — pin a user to one replica so they don't see time go backwards.
Check replica lag and route to leader if it exceeds a threshold.

Sharding (Partitioning)

Sharding splits one logical dataset across many physical machines. Each shard owns a subset of rows.

     users table
   ┌────────────────┐
   │ user_id %% 4=0 │ → shard 0
   │ user_id %% 4=1 │ → shard 1
   │ user_id %% 4=2 │ → shard 2
   │ user_id %% 4=3 │ → shard 3
   └────────────────┘

It buys you:

Write throughput beyond a single leader.
Storage beyond a single disk.
Smaller indexes, faster point queries.

Partition Key

The most important decision in sharding. The key determines which shard each row lives on.

Choose a key with high cardinality and uniform access — user_id is usually good.
Avoid keys that concentrate traffic — tenant_id when one tenant is 90% of traffic causes a hot shard.
Avoid time as the only key — yesterday's shard is cold, today's is on fire.

Sharding Strategies

Strategy	How	Pros / Cons
Range-based	Shard 0: A–F, Shard 1: G–M, …	Range scans cheap; hot shards if data skewed
Hash-based	hash(key) %% N	Even distribution; range scans expensive
Consistent hashing	Hash key onto a ring; node adjacent owns it	Adding/removing a node moves only ~1/N of data
Directory-based	Lookup service maps key → shard	Most flexible; lookup service is a dependency

Resharding

Adding a shard with naive hash(key) %% N moves nearly every row. Consistent hashing with virtual nodes minimises this. Some systems (Vitess, MongoDB, Cassandra) handle resharding online; others require downtime — find out before going live.

Replication and Sharding Together

Real systems combine both. A typical large deployment:

                ┌────────────────────────────┐
                │  Routing layer / proxy     │
                └────────────────────────────┘
                  │       │       │       │
              Shard 0   Shard 1  Shard 2  Shard 3
              ┌──┐      ┌──┐    ┌──┐     ┌──┐
              │L │      │L │    │L │     │L │
              ├──┤      ├──┤    ├──┤     ├──┤
              │R1│      │R1│    │R1│     │R1│
              │R2│      │R2│    │R2│     │R2│
              └──┘      └──┘    └──┘     └──┘

Each shard is a small replicated cluster. Routing layer sends each query to the right shard.

Cross-Shard Operations

The cost of sharding is that some operations no longer work the way you'd hope:

Joins across shards — expensive or impossible. Denormalise instead.
Transactions across shards — require two-phase commit or sagas. Avoid where possible.
Aggregate queries — must fan out to all shards and merge results.

Design your data model so that the queries you need most often touch one shard. Cross-shard work should be the exception.

Failure Modes

Hot shard — uneven traffic. Add more nodes for that shard, or change the key.
Replica lag — async replicas fall behind. Monitor and route around laggards.
Split brain — both leaders think they're primary after a partition. Use a coordinator (etcd, Zookeeper, Consul) and require leader leases.
Failover loops — flapping leadership election. Use stable timeouts and quorum.

The Managed Services Cheat Code

Building this yourself is hard. Most teams should use:

Aurora / Cloud SQL / Azure SQL — managed replication, point-in-time recovery, readers.
DynamoDB / Bigtable / Cosmos — managed sharding with consistent hashing built in.
Spanner / CockroachDB / Yugabyte — distributed SQL with replication and sharding under the SQL layer.

You still need to understand the concepts to choose well and debug, but you don't have to operate Cassandra at 3 a.m.

Cert Mapping

Cert	Coverage
AWS SAA / SAP	Aurora replicas, Multi-AZ, DynamoDB partitions, global tables
Azure AZ-305	Geo-replication, Cosmos DB partitioning, multi-region writes
GCP PCA	Spanner regional/multi-regional, Bigtable, Cloud SQL HA

The Default Architecture

One primary + 2 replicas (Multi-AZ) is enough for most apps.
Add read replicas only when read traffic actually exceeds primary capacity.
Don't shard until you must. Sharding is a one-way door.
When you do shard, pick the partition key carefully and assume you'll regret it eventually.

The next lesson covers the consistency models that all of this trades against — CAP, PACELC, and what your design promises about correctness under partition.