Day 5 Hard Database Replication Consistency Failover

Replication — You Think It's for Read Scaling, It's Actually for That One FailureLeader-Follower vs Multi-leader vs Leaderless, Sync vs Async, Replication Lag, Failover

Problem & Constraints

Design a cross-continental collaborative document platform (think Notion / Linear / GitHub). Read/write ratio 100:1, writes ~5k QPS, reads ~500k QPS, users in EU, US, and APAC. European users need read P99 < 50 ms. Plus:

These constraints sound mild but combined they kill the "just add a read replica" approach. Replication is one of those rare domains where any non-complex design simply does not work: a single machine can't scale, primary-follower has lag, cross-region has the speed of light, and failover has split-brain risk. Replication is fundamentally insurance against failure; read scaling is a side effect.

High-Level Architecture

graph TD
    subgraph US["us-east (Primary Region)"]
        APP_US["App us-east"]
        L["Leader (Primary)
all writes, strong read in-region"] F1[("Sync Replica
same AZ, semi-sync")] F2[("Async Replica
cross-AZ")] end subgraph EU["eu-west"] APP_EU["App eu-west"] R_EU[("Read Replica
async, lag 100-500ms")] end subgraph AP["ap-south"] APP_AP["App ap-south"] R_AP[("Read Replica
async, lag 200-800ms")] end CONSUL["Orchestrator + Consul
health check / failover"] APP_US -- write --> L APP_US -- read --> L L == sync ==> F1 L -. async .-> F2 L -. async (WAN) .-> R_EU L -. async (WAN) .-> R_AP APP_EU -- read --> R_EU APP_AP -- read --> R_AP APP_EU -- write --> L APP_AP -- write --> L CONSUL -. monitor .-> L CONSUL -. monitor .-> F1 classDef leader fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef sync fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef async fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef ctl fill:#1a1a30,stroke:#ffb450,color:#e8eef5 class L leader class F1 sync class F2,R_EU,R_AP async class CONSUL ctl

Single leader for writes + same-AZ semi-sync replica (RPO) + cross-AZ async (read capacity) + cross-region async (local reads). Orchestrator handles health checks and automatic failover.

Key Technical Points

1. Topology: Leader-Follower vs Multi-leader vs Leaderless

Principle: replication topology decides where writes go and who arbitrates conflicts. Three basic modes:

Leader-FollowerMulti-leaderLeaderless (Dynamo)
Write pathmust hit leaderlocal leaderclient writes N replicas
Conflictsnoneyes — CRDT / LWW / app-levelyes, in sloppy quorum
Cross-region write latencybad (150ms RTT)good (local)good (local)
Read consistencystrong (leader) or eventual (follower)eventualtunable (R+W>N → strong)
Failoverhard (split-brain risk)easy (other leaders take over)non-existent
ExamplesPostgres, MySQL, MongoDB, RedisBDR Postgres, MySQL Group Repl, CouchDBCassandra, DynamoDB, Riak, ScyllaDB
How to choose:
# The classic LWW (Last-Writer-Wins) trap in multi-leader
# Region A: user.email = "alice@old.com"  at t=100
# Region B: user.email = "alice@new.com"  at t=101 (clock skew +50ms, actually 51)
# After replication both sides take max(timestamp) → "alice@new.com" ✓ looks fine

# But:
# Region A: balance -= 100  (debit)   at t=200
# Region B: balance -= 200  (debit)   at t=200 (concurrent)
# LWW keeps ONLY one debit! The other is silently lost.
# → Numeric / set types MUST use CRDT (g-counter / OR-set) or explicit merge,
#   not LWW.
Real-world cases:

2. Sync vs Async vs Semi-sync: the consistency-latency-availability triangle

Principle: when does the leader ACK the client after a write? Depends on whether replication is sync or async.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant S as Sync Replica
    participant A as Async Replica

    Note over C,A: Semi-sync (Postgres synchronous_commit=on, 1 replica)
    C->>L: write x=1
    L->>L: WAL fsync
    L->>S: stream WAL
    L->>A: stream WAL
    S-->>L: ACK (durable)
    L-->>C: OK (now 2 copies)
    A-->>L: ACK (later, lagged)

    Note over C,A: If S stalls past timeout, degrades to async (availability > consistency)
Boundaries of the three semantics:
# Postgres semi-sync config (example)
# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)'
# Meaning: ACK once any 1 replica has fsync'd
# 'FIRST 2 (...)' = requires 2 replicas in declared order

# ⚠️ synchronous_commit=remote_apply does one more thing than 'on':
# wait until replica has applied the change to memory (visible to read).
# Used for "write then immediately read from replica", but costs 2x RTT.

# ⚠️ Critical trap:
# If ALL synchronous_standby are down, the leader BLOCKS writes forever!
# Must be paired with monitoring + manual fallback / Patroni auto-degrade.
Real-world cases:

3. Replication Lag: consistency is a gradient, not a binary

Principle: async replication has inherent lag (ms to minutes). Three classic anomalies:

  1. Read-your-writes violation: user uploads new avatar, refreshes, reads a lagged follower, sees old avatar, thinks the system is broken.
  2. Monotonic read violation: first read hits follower A (caught up), second hits follower B (lagged), data appears to "go back in time."
  3. Consistent prefix violation: a reply appears on the follower before the question it replies to (causal inversion).
Four mitigations (increasing strength):
  1. "Reads after my own writes hit the leader": user's own content reads the leader for N seconds; everything else reads followers. Simplest, most common. Twitter, Instagram do this.
  2. Session sticky + leader pinning: user session sticks to one follower, with server-side "last write time" comparison.
  3. Read with replica-lag bound: client tracks last write LSN/version; replica with LSN < client either waits or redirects to leader. Aurora, Spanner, TiDB support this.
  4. Causal consistency: every write records causal deps; reads guarantee all deps visible. Clean in theory, complex in practice. MongoDB 5.0+ causal-consistency sessions are the industrial form.
# Strategy 3: client carries LSN
# After write, server returns current LSN
write_response = leader.write(doc_id, content)
client.last_seen_lsn = write_response.lsn  # store in session/cookie

# On read, attach LSN; replica waits (with timeout) or falls back to leader
def read(doc_id, min_lsn):
    replica = pick_least_loaded_replica()
    current = replica.current_lsn()
    if current >= min_lsn:
        return replica.read(doc_id)
    elif current + ACCEPTABLE_GAP >= min_lsn:
        replica.wait_for_lsn(min_lsn, timeout=50)  # ms
        return replica.read(doc_id)
    else:
        return leader.read(doc_id)  # fallback

# Pros: precise, doesn't waste leader capacity
# Cons: client must persist LSN; cross-device inconsistency
#       (write on phone, read on laptop → no shared LSN)
#       → solved by user-level "version vector", but more complex
Real-world cases:

4. Failover: what actually happens when the primary dies

Principle: failover is 4 steps — (a) detect primary failure; (b) elect new primary; (c) reroute traffic; (d) reconcile when old primary returns. Each step can produce an outage.

sequenceDiagram
    participant O as Orchestrator
    participant L as "Old Leader 1a"
    participant R1 as "Replica R1 sync"
    participant R2 as "Replica R2 async"
    participant App as Apps

    Note over L,R2: Normal — L writes, R1 sync-ACKs, R2 lags 800ms
    L--xO: heartbeat timeout 10s
    O->>R1: check R1 LSN
    O->>R2: check R2 LSN
    Note over O: R1 lsn=1000, R2 lsn=880
pick R1 most recent O->>R1: PROMOTE O->>App: reroute to R1 as new leader R1->>R2: re-establish replication Note over L: L network recovers, still thinks it's leader
but fencing token invalidated O->>L: DEMOTE / kill Note over L: writes lsn 1000~1010 on old L if un-ACKed are lost
if already ACKed to client — data loss
Key design decisions:
# Minimum-viable fencing token (à la ZooKeeper / etcd lease)
# Each election bumps epoch; writes carry it
class Leader:
    epoch: int          # issued by coordinator (Consul / etcd)
    def write(self, k, v):
        coordinator.cas("epoch", expected=self.epoch)  # verify still leader
        storage.write(k, v, epoch=self.epoch)

# Routing layer / replica rejects writes with stale epoch
def accept_write(req):
    if req.epoch < current_known_epoch:
        return REJECT  # old leader's zombie writes blocked here
    return APPLY

# Even if the partitioned old leader wakes up, its epoch is stale,
# any write to replicas/storage is rejected. That's fencing.
Real-world cases / lessons:

Extensions & Optimization

Common Pitfalls & Interview Questions

1. Treating a read replica as a backup: DROP TABLE on primary, milliseconds later the replica also DROPs. Backups must be PITR snapshots, not replicas.
2. More replicas ≠ more write throughput: every replica adds IO and bandwidth load to the leader. Postgres with 10+ async replicas hits a ceiling. Need write scale → shard, not replicate.
3. Read-only replicas that secretly accept writes: an admin runs a "quick fix UPDATE" on a replica; next failover that replica is promoted → data divergence. Replicas must enforce read-only (default_transaction_read_only=on).
4. Failover without fencing: old primary GC-pauses for 30s; new primary elected; old primary wakes and continues accepting writes — split-brain, both sides "win," data is unreconcilable.
5. Semi-sync silently blocks when all sync replicas are down: Postgres' default is to block writes; the business thinks it's a slow query but it's actually a complete write stop. Either monitor sync-replica availability and manually degrade, or use Patroni auto-config.

Likely interview follow-ups:

  1. You design a Notion cross-region read replica. A European user edits a doc, refreshes, and doesn't see their change. How do you diagnose and fix?
  2. Explain read-your-writes, monotonic read, and consistent prefix — what user-visible anomaly does each prevent?
  3. Primary died; you have 3 followers: A lsn=1000, B lsn=995, C lsn=1000 but in another region. Pick one. Why?
  4. In a multi-leader setup, how do you handle "same row mutated concurrently in two regions"? When is LWW / CRDT / app-level arbitration appropriate?
  5. What does Postgres semi-sync's synchronous_commit=remote_apply guarantee beyond on? At what cost?

Deep Resources

Deep Thinking

1. Why is Raft majority-write latency in Spanner / CockroachDB more stable than traditional semi-sync primary-follower? What's the connection to tail latency?

On the surface, Raft writes need majority ACK (with N=3, 2 ACKs); semi-sync also waits for 1 replica — should be similar. The difference is who waits for whom:

  • Semi-sync: typically waits for a specific sync replica. If that replica stalls (GC, disk stall, packet loss), the leader stalls. Even if the other two replicas are healthy, it doesn't help. Tail latency = max(leader, sync_replica).
  • Raft majority: waits for the fastest N/2+1. With N=3, 2/3 — automatically skipping the slowest replica. Tail latency ≈ median replica latency, not max.

This is the classic distributed-systems tail tolerance pattern: by saying "majority + don't pin to specific replicas," single-node tail latency is removed from the critical path. Google's Spanner paper discusses this; Aurora's 6 replicas requiring 4/6 ACK works the same way (auto-skip slowest 2).

The cost: (a) at least 3 nodes (2-node Raft degenerates), higher cost; (b) implementation complexity (leader election, log compaction, config changes) far above primary-follower. So small OLTP often stays on semi-sync; massive-scale / global goes Raft. From semi-sync to quorum is a clear upgrade on the availability-complexity curve, not a drop-in replacement.

2. Was GitHub 2018's outage really avoidable by "no cross-region auto-failover," or is the lesson deeper? Why didn't Orchestrator recognize "43s network partition" vs "primary actually dead"?

The surface lesson is "don't auto-failover across regions," but the deeper truth is fault detection in distributed systems is fundamentally imperfect. The FLP impossibility result tells us no asynchronous system can be both correct and complete in detecting failure. A 43s partition and "primary dead" look identical to followers: heartbeats stop.

  • The choice is a trade-off, not a "correct answer": wait longer (say 60s) → worse RTO; don't wait → high false positives. GitHub's threshold was a few seconds.
  • The real root cause is "consistency boundary ≠ fault detection boundary". With cross-region, the detector is in one region, the primary in another. The detector can't see the primary, but the primary is still alive and accepting writes (from isolated clients). This "live partition" is the most dangerous case.
  • The fix (what GitHub did): (a) cross-region failover requires human confirmation; (b) each region elects locally via Raft, cross-region is read-only replica; (c) add quorum-style fencing — new primary must reach majority of regions to accept writes.
  • System design philosophy: "recoverability" beats "high availability." Five more minutes of downtime is far better than 24 hours of data repair. When automating, design for the worst case: "how fast can I roll back if it judges wrong" matters more than "how accurate is it."

Analogy: self-driving in a city. Tesla picks "bold auto + occasional rear-end"; Waymo picks "cautious auto + human handover." GitHub's incident was the former philosophy hitting a wall; they pivoted to the latter. Distributed-systems maturity is not "more automation," it's "knowing where to stop and wait for a human".

3. Capacity estimation: 1M DAU SaaS, primary Postgres at 10k write QPS / 50k read QPS. CFO wants to cut costs: "Do we really need 5 read replicas? Can we trim to 2?" Give a quantitative analysis.

Before cutting, decompose what each replica does. Lump-sum answers are always wrong.

5 distinct roles replicas serve, each with different sizing needs:

  • Read capacity scaling: 50k read QPS / 15k per replica = at least 4 replicas. Cutting to 2 = read-overload.
  • HA / failover standby: at least 1 semi-sync same-AZ; can't share with read traffic (failover would halve read capacity).
  • Analytics isolation: BI / data export on a dedicated replica to avoid hurting online reads. Usually 1.
  • Geo-local reads: 1 per dense user region. EU + US + APAC = 3.
  • Backup / DR: 1 cross-region async for disaster recovery.

Sum: 4 (read) + 1 (HA sync) + 1 (analytics) + several cross-region ≈ 7-8. The current 5 is already compressed.

Real consequences of cutting to 2:

  • Read capacity drops from 60k to 30k QPS (including leader); peak traffic crashes;
  • Analytics queries on OLTP nodes; slow queries drag down the whole DB;
  • Loss of geo-local reads; EU users' P99 goes from 50ms to 200ms;
  • During failover the only remaining replica gets all the traffic — potential cascading failure.

Real cost-cutting alternative:

  1. Cache (Redis) intercepts 70% of reads → replicas can drop from 4 to 2;
  2. Analytics goes through CDC → ClickHouse / BigQuery; replicas no longer carry BI;
  3. Cross-region replicas: keep only EU (largest market); APAC degrades to us-east + edge CDN;
  4. HA replicas use cheaper SKUs (no read traffic, just WAL catch-up).

Answer to CFO: don't cut blindly. But $80k/month (5 × r6i.4xlarge) can become $40k/month — provided you first invest in a Redis cluster, CDC pipeline, and monitoring. That's "moving cost from hardware to engineering" — one-time engineering investment for long-term savings.

"Replica count" is never a standalone config; it's the result of the entire read-path architecture.

4. CRDT sounds like a silver bullet for multi-leader conflicts (auto-merge, no coordination). What are its hard limits? Why do Google Docs / Figma still need a server-side authority even with OT/CRDT?

CRDT (Conflict-free Replicated Data Type) really does mathematize "merge," but several hard constraints prevent it from replacing server authority:

  • (a) Only effective for "structured data". Counter, Set, Map types with algebraic properties (associative, commutative, idempotent) work; complex business rules (inventory must be ≥ 0), or side effects (sending email), can't be CRDT-ified.
  • (b) Metadata bloat. OR-Set must record actor+timestamp per element and keep tombstones for deletion to prevent resurrection; for large docs, CRDT metadata is 10-100x the content itself, and long-term use requires GC — GC needs coordination, back to the coordination problem.
  • (c) Can't express "what happened most recently". CRDT merge is commutative — meaning the real order of events is lost. "First a message, then a reply" can be reversed in CRDT view (commutativity ≠ causal preservation).
  • (d) Permissions, audit, billing need ground truth. Two clients adding a collaborator to the same doc — the CRDT union is "correct," but "who added first" and "when" determine billing — server arbitration needed.
  • (e) Real-time collaboration needs "what others are doing right now" awareness. Cursor position, selection highlight — not CRDT state (these are ephemeral) — needs a hub to relay.

Figma's real architecture: CRDT handles graphical objects (layer z-order, node attributes CRDT-ified), but server handles session management, permissions, version snapshots, cross-user coordination. The server is not the "state authority" but the "coordination and persistence hub." Google Docs is similar — OT runs on the client, server accepts, sequences, broadcasts + saves.

Conclusion: CRDT is a tool for "multi-write conflict-free merge," not an architecture for "no server at all." It compresses the server's role from "arbitrate every write" down to "coordinate metadata events + persist." Understanding this avoids the "rewrite everything in CRDT" trap. Truly leaderless systems barely exist; the leader's role is just compressed to the minimum.

5. Cross-chapter: Sharding (Day 4) + Replication (Day 5) deployed together — how do you reason about failure domains? E.g., 32 shards × 3 replicas = 96 nodes; one AZ goes down, what's the availability?

Sharding and replication are orthogonal but their failure domains must be reasoned about together. Simple cases:

  • Cross-AZ deployment: each shard's 3 replicas distributed across 3 AZs. AZ down → each shard loses 1 replica, 2/3 remain → still writable (majority alive). Availability ≈ 100%.
  • Same-AZ deployment: all 3 replicas in one AZ → AZ down = entire shard down. 32 shards all down = total outage.
  • Mixed (most common): leader + 1 sync replica in same region different AZs; 1 async replica cross-region. AZ down → lose leader or sync replica, failover to remaining.

Key insight: sharding amplifies AZ failure blast radius. Single DB: AZ down affects 1 DB. 32 shards: AZ down triggers 32 concurrent failovers — orchestrator capacity, network capacity, client retries all stress-tested at once. GitHub, Slack have all seen "AZ down → multi-shard concurrent failover → control plane overloaded → cascading failure."

Mitigations:

  • Rate-limit failover (e.g., promote at most 5 shards/sec);
  • Client retry with jitter + exponential backoff to avoid thundering herd;
  • Regular AZ-level chaos engineering to expose hidden serialization points;
  • Capacity planning for "1 AZ down + 1 failure domain unavailable simultaneously" (N-2, not N-1).

Order of magnitude: 96 nodes evenly across 3 AZs, single AZ down — theoretically 100% available, but in practice typically drops to 95-98% (some shards 5-30s unavailable during failover, client retry storms, alert noise). Multi-region (active-active across regions) can drive single-region failure to true 100% availability, at the cost of conflict resolution + cross-region write latency.

Interview punch line: "Sharding is horizontal scaling, replication is failure insurance — they solve different problems and amplify each other's complexity. Any answer that mentions only one is incomplete."