Skip to content
5 min read·Lesson 6 of 10

Replication and Sharding

How databases scale beyond one machine: replication for availability and read throughput, sharding for write throughput and storage.

One database server eventually runs out: of CPU, of disk, of write capacity, or of "what if it dies?" tolerance. Replication and sharding are the two answers.

Replication

Replication keeps copies of the same data on multiple machines. It buys you:

  • Availability — failover when the primary dies.
  • Read scalability — distribute read traffic to replicas.
  • Geographic locality — put a replica near each region of users.
  • Backups / analytics — run heavy queries off the replica without slowing the primary.

Single-leader (primary-replica)

    Writes
       ▼
   ┌────────┐    replication log    ┌─────────┐
   │ Leader │ ────────────────────→ │ Replica │
   └────────┘                       └─────────┘
       │                                 ▲
       └────── Reads (consistent) ───────┴── Reads (possibly stale)

The default model in Postgres, MySQL, MongoDB, SQL Server. One node accepts writes; the rest follow.

  • + Simple, well-understood.
  • + Strong consistency on the leader.
  • − Single write bottleneck.
  • − Failover is non-trivial; "split brain" is a real risk.

Multi-leader

Multiple nodes accept writes. Used across data centres or for collaborative apps (e.g. Couchbase, Mongo with multi-master, application-level CRDTs).

  • + No single write bottleneck; writes survive a partition.
  • − Conflict resolution is hard. Two leaders accept conflicting writes — who wins?
  • − Most teams who think they want this actually don't.

Leaderless

Any node accepts any write; reads and writes go to a quorum. Pioneered by Dynamo, used by Cassandra, ScyllaDB, Riak.

  • Quorum: R + W > N ensures that a read overlaps with the latest write (where N is replica count, R/W are read/write quorum sizes).
  • + Highly available, no failover step.
  • − Consistency tunable but trickier; needs read repair, anti-entropy, hinted handoff.

Sync vs Async Replication

ModeBehaviourTrade-off
AsynchronousLeader acks write before replicas confirmFast writes; can lose data on leader crash; replicas can lag seconds
SynchronousLeader waits for at least one replicaDurable; slower writes; replica problem becomes write problem
Semi-syncWait for one replica, others asyncPractical compromise; many production setups use this

Replica Lag

When async, replicas trail the leader. Symptoms: user updates profile, refreshes, sees old value. Mitigations:

  • Read your own writes — route a user's reads to the leader for N seconds after they write.
  • Monotonic reads — pin a user to one replica so they don't see time go backwards.
  • Check replica lag and route to leader if it exceeds a threshold.

Sharding (Partitioning)

Sharding splits one logical dataset across many physical machines. Each shard owns a subset of rows.

     users table
   ┌────────────────┐
   │ user_id %% 4=0 │ → shard 0
   │ user_id %% 4=1 │ → shard 1
   │ user_id %% 4=2 │ → shard 2
   │ user_id %% 4=3 │ → shard 3
   └────────────────┘

It buys you:

  • Write throughput beyond a single leader.
  • Storage beyond a single disk.
  • Smaller indexes, faster point queries.

Partition Key

The most important decision in sharding. The key determines which shard each row lives on.

  • Choose a key with high cardinality and uniform access — user_id is usually good.
  • Avoid keys that concentrate traffic — tenant_id when one tenant is 90% of traffic causes a hot shard.
  • Avoid time as the only key — yesterday's shard is cold, today's is on fire.

Sharding Strategies

StrategyHowPros / Cons
Range-basedShard 0: A–F, Shard 1: G–M, …Range scans cheap; hot shards if data skewed
Hash-basedhash(key) %% NEven distribution; range scans expensive
Consistent hashingHash key onto a ring; node adjacent owns itAdding/removing a node moves only ~1/N of data
Directory-basedLookup service maps key → shardMost flexible; lookup service is a dependency

Resharding

Adding a shard with naive hash(key) %% N moves nearly every row. Consistent hashing with virtual nodes minimises this. Some systems (Vitess, MongoDB, Cassandra) handle resharding online; others require downtime — find out before going live.

Replication and Sharding Together

Real systems combine both. A typical large deployment:

                ┌────────────────────────────┐
                │  Routing layer / proxy     │
                └────────────────────────────┘
                  │       │       │       │
              Shard 0   Shard 1  Shard 2  Shard 3
              ┌──┐      ┌──┐    ┌──┐     ┌──┐
              │L │      │L │    │L │     │L │
              ├──┤      ├──┤    ├──┤     ├──┤
              │R1│      │R1│    │R1│     │R1│
              │R2│      │R2│    │R2│     │R2│
              └──┘      └──┘    └──┘     └──┘

Each shard is a small replicated cluster. Routing layer sends each query to the right shard.

Cross-Shard Operations

The cost of sharding is that some operations no longer work the way you'd hope:

  • Joins across shards — expensive or impossible. Denormalise instead.
  • Transactions across shards — require two-phase commit or sagas. Avoid where possible.
  • Aggregate queries — must fan out to all shards and merge results.

Design your data model so that the queries you need most often touch one shard. Cross-shard work should be the exception.

Failure Modes

  • Hot shard — uneven traffic. Add more nodes for that shard, or change the key.
  • Replica lag — async replicas fall behind. Monitor and route around laggards.
  • Split brain — both leaders think they're primary after a partition. Use a coordinator (etcd, Zookeeper, Consul) and require leader leases.
  • Failover loops — flapping leadership election. Use stable timeouts and quorum.

The Managed Services Cheat Code

Building this yourself is hard. Most teams should use:

  • Aurora / Cloud SQL / Azure SQL — managed replication, point-in-time recovery, readers.
  • DynamoDB / Bigtable / Cosmos — managed sharding with consistent hashing built in.
  • Spanner / CockroachDB / Yugabyte — distributed SQL with replication and sharding under the SQL layer.

You still need to understand the concepts to choose well and debug, but you don't have to operate Cassandra at 3 a.m.

Cert Mapping

CertCoverage
AWS SAA / SAPAurora replicas, Multi-AZ, DynamoDB partitions, global tables
Azure AZ-305Geo-replication, Cosmos DB partitioning, multi-region writes
GCP PCASpanner regional/multi-regional, Bigtable, Cloud SQL HA

The Default Architecture

  1. One primary + 2 replicas (Multi-AZ) is enough for most apps.
  2. Add read replicas only when read traffic actually exceeds primary capacity.
  3. Don't shard until you must. Sharding is a one-way door.
  4. When you do shard, pick the partition key carefully and assume you'll regret it eventually.

The next lesson covers the consistency models that all of this trades against — CAP, PACELC, and what your design promises about correctness under partition.

Key Takeaways

  • Replication copies data; sharding splits data — they solve different problems and are usually combined.
  • Single-leader replication is the most common pattern; multi-leader and leaderless solve specific niches.
  • Asynchronous replication is fast but allows replica lag; synchronous protects consistency at latency cost.
  • Sharding requires choosing a partition key — get this wrong and the system rebalances forever.
  • Consistent hashing minimises data movement when shards are added or removed.

Test your knowledge

Try exam-style practice questions to reinforce what you've learned.

Practice Questions →