Problem & Constraints
Design a cross-continental collaborative document platform (think Notion / Linear / GitHub). Read/write ratio 100:1, writes ~5k QPS, reads ~500k QPS, users in EU, US, and APAC. European users need read P99 < 50 ms. Plus:
- RPO (max data loss) ≤ 1 second: if an entire region goes down, lose at most 1s of writes.
- RTO (recovery time) ≤ 60 seconds: leader dies, automatic failover must restore writes within a minute.
- Read-your-writes: user edits a doc, hits refresh, must see their own change.
- Monotonic reads: two consecutive refreshes can't go back in time (a comment can't disappear and reappear).
These constraints sound mild but combined they kill the "just add a read replica" approach. Replication is one of those rare domains where any non-complex design simply does not work: a single machine can't scale, primary-follower has lag, cross-region has the speed of light, and failover has split-brain risk. Replication is fundamentally insurance against failure; read scaling is a side effect.
High-Level Architecture
graph TD
subgraph US["us-east (Primary Region)"]
APP_US["App us-east"]
L["Leader (Primary)
all writes, strong read in-region"]
F1[("Sync Replica
same AZ, semi-sync")]
F2[("Async Replica
cross-AZ")]
end
subgraph EU["eu-west"]
APP_EU["App eu-west"]
R_EU[("Read Replica
async, lag 100-500ms")]
end
subgraph AP["ap-south"]
APP_AP["App ap-south"]
R_AP[("Read Replica
async, lag 200-800ms")]
end
CONSUL["Orchestrator + Consul
health check / failover"]
APP_US -- write --> L
APP_US -- read --> L
L == sync ==> F1
L -. async .-> F2
L -. async (WAN) .-> R_EU
L -. async (WAN) .-> R_AP
APP_EU -- read --> R_EU
APP_AP -- read --> R_AP
APP_EU -- write --> L
APP_AP -- write --> L
CONSUL -. monitor .-> L
CONSUL -. monitor .-> F1
classDef leader fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
classDef sync fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef async fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef ctl fill:#1a1a30,stroke:#ffb450,color:#e8eef5
class L leader
class F1 sync
class F2,R_EU,R_AP async
class CONSUL ctl
Single leader for writes + same-AZ semi-sync replica (RPO) + cross-AZ async (read capacity) + cross-region async (local reads). Orchestrator handles health checks and automatic failover.
Key Technical Points
1. Topology: Leader-Follower vs Multi-leader vs Leaderless
Principle: replication topology decides where writes go and who arbitrates conflicts. Three basic modes:
- Leader-Follower (single primary): all writes go through the leader; followers replicate the WAL async or sync. No write conflicts; failover is the hard part.
- Multi-leader: each region accepts writes locally and replicates to peers. Needs conflict resolution (same key changed in two regions).
- Leaderless: client writes to multiple replicas directly; quorum (W+R>N) guarantees reads see latest. No leader role, no failover, but reads are amplified.
| Leader-Follower | Multi-leader | Leaderless (Dynamo) |
| Write path | must hit leader | local leader | client writes N replicas |
| Conflicts | none | yes — CRDT / LWW / app-level | yes, in sloppy quorum |
| Cross-region write latency | bad (150ms RTT) | good (local) | good (local) |
| Read consistency | strong (leader) or eventual (follower) | eventual | tunable (R+W>N → strong) |
| Failover | hard (split-brain risk) | easy (other leaders take over) | non-existent |
| Examples | Postgres, MySQL, MongoDB, Redis | BDR Postgres, MySQL Group Repl, CouchDB | Cassandra, DynamoDB, Riak, ScyllaDB |
How to choose:
- OLTP, single region or weak cross-region requirement → Leader-Follower. The correct answer 99% of the time.
- Truly need low-latency cross-region writes (social likes, geo-distributed collaboration) → Multi-leader or Leaderless. Accept conflict complexity.
- Extreme availability, eventual consistency OK (shopping cart, like counter) → Leaderless + CRDT.
- Anti-pattern: using multi-leader like leader-follower — "they're all primary, just write local." Result: two regions concurrently mutating the same row, only one survives, and it's unpredictable.
# The classic LWW (Last-Writer-Wins) trap in multi-leader
# Region A: user.email = "alice@old.com" at t=100
# Region B: user.email = "alice@new.com" at t=101 (clock skew +50ms, actually 51)
# After replication both sides take max(timestamp) → "alice@new.com" ✓ looks fine
# But:
# Region A: balance -= 100 (debit) at t=200
# Region B: balance -= 200 (debit) at t=200 (concurrent)
# LWW keeps ONLY one debit! The other is silently lost.
# → Numeric / set types MUST use CRDT (g-counter / OR-set) or explicit merge,
# not LWW.
Real-world cases:
- GitHub: MySQL single primary + Orchestrator failover; the 2018 24h outage was cross-region failover with data divergence (see point 4).
- Cassandra / DynamoDB: leaderless with tunable consistency; Discord stores messages in Cassandra with QUORUM reads/writes balancing latency vs consistency.
- Riak / Amazon shopping cart: the classic multi-master + CRDT — during partitions both sides add items, merge takes union (prefer over-counting to losing).
- Apple iCloud Notes / Figma: client-server is fundamentally multi-leader; OT/CRDT resolves concurrent edits.
2. Sync vs Async vs Semi-sync: the consistency-latency-availability triangle
Principle: when does the leader ACK the client after a write? Depends on whether replication is sync or async.
- Sync: wait for all replicas to write before ACK. Zero data loss, but one slow replica = whole cluster slow; one dead replica = cluster can't write.
- Async: leader ACKs immediately, replicas catch up later. Lowest latency, but writes not yet replicated are lost if leader dies.
- Semi-sync: wait for at least one replica ACK. The compromise — default for most production systems.
sequenceDiagram
participant C as Client
participant L as Leader
participant S as Sync Replica
participant A as Async Replica
Note over C,A: Semi-sync (Postgres synchronous_commit=on, 1 replica)
C->>L: write x=1
L->>L: WAL fsync
L->>S: stream WAL
L->>A: stream WAL
S-->>L: ACK (durable)
L-->>C: OK (now 2 copies)
A-->>L: ACK (later, lagged)
Note over C,A: If S stalls past timeout, degrades to async (availability > consistency)
Boundaries of the three semantics:
- Pure Sync (all N replicas ACK): essentially Paxos/Raft majority. Production rarely uses "all ACK" — one slow replica is a tail-latency killer.
- Semi-sync (Postgres / MySQL): "at least K replicas"; K=1 is most common, K=2 for cross-AZ multi-active. Key trap: when ALL sync replicas are unresponsive, Postgres blocks writes indefinitely — it does NOT degrade to async. MySQL defaults to a 10s timeout and silently downgrades (
rpl_semi_sync_master_timeout), risking silent data loss.
- Async: low latency; cross-region read replicas almost always use async (sync across an ocean = 100+ ms latency tax per write).
- Quorum (Raft auto-majority): W=(N/2)+1; with N=3, W=2 — picks the 2 fastest replicas. Fast AND doesn't stall. etcd, CockroachDB, TiKV, Spanner all use this.
# Postgres semi-sync config (example)
# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (replica1, replica2, replica3)'
# Meaning: ACK once any 1 replica has fsync'd
# 'FIRST 2 (...)' = requires 2 replicas in declared order
# ⚠️ synchronous_commit=remote_apply does one more thing than 'on':
# wait until replica has applied the change to memory (visible to read).
# Used for "write then immediately read from replica", but costs 2x RTT.
# ⚠️ Critical trap:
# If ALL synchronous_standby are down, the leader BLOCKS writes forever!
# Must be paired with monitoring + manual fallback / Patroni auto-degrade.
Real-world cases:
- Stripe: Postgres semi-sync, at least 1 cross-AZ replica ACK; financial data RPO=0 is the floor.
- GitHub MySQL: semi-sync + Orchestrator; intra-region sync, cross-region async.
- YouTube Vitess: MySQL semi-sync + replica-lag monitoring; replicas exceeding the lag threshold are auto-removed from read traffic.
- Spanner / CockroachDB: Paxos/Raft majority ACK with no traditional primary/follower; TrueTime / HLC for global ordering, write latency in-region can be ~10 ms.
- Aurora: 6 storage replicas (3 AZ × 2), writes need 4/6 ACK, reads need 3/6; quorum automatically routes around slow replicas.
3. Replication Lag: consistency is a gradient, not a binary
Principle: async replication has inherent lag (ms to minutes). Three classic anomalies:
- Read-your-writes violation: user uploads new avatar, refreshes, reads a lagged follower, sees old avatar, thinks the system is broken.
- Monotonic read violation: first read hits follower A (caught up), second hits follower B (lagged), data appears to "go back in time."
- Consistent prefix violation: a reply appears on the follower before the question it replies to (causal inversion).
Four mitigations (increasing strength):
- "Reads after my own writes hit the leader": user's own content reads the leader for N seconds; everything else reads followers. Simplest, most common. Twitter, Instagram do this.
- Session sticky + leader pinning: user session sticks to one follower, with server-side "last write time" comparison.
- Read with replica-lag bound: client tracks last write LSN/version; replica with LSN < client either waits or redirects to leader. Aurora, Spanner, TiDB support this.
- Causal consistency: every write records causal deps; reads guarantee all deps visible. Clean in theory, complex in practice. MongoDB 5.0+ causal-consistency sessions are the industrial form.
# Strategy 3: client carries LSN
# After write, server returns current LSN
write_response = leader.write(doc_id, content)
client.last_seen_lsn = write_response.lsn # store in session/cookie
# On read, attach LSN; replica waits (with timeout) or falls back to leader
def read(doc_id, min_lsn):
replica = pick_least_loaded_replica()
current = replica.current_lsn()
if current >= min_lsn:
return replica.read(doc_id)
elif current + ACCEPTABLE_GAP >= min_lsn:
replica.wait_for_lsn(min_lsn, timeout=50) # ms
return replica.read(doc_id)
else:
return leader.read(doc_id) # fallback
# Pros: precise, doesn't waste leader capacity
# Cons: client must persist LSN; cross-device inconsistency
# (write on phone, read on laptop → no shared LSN)
# → solved by user-level "version vector", but more complex
Real-world cases:
- Twitter: a user's own newly posted tweet reads the leader for 5 seconds, then drops to followers; other users' tweets always go to followers. Crude but effective.
- LinkedIn Espresso / Databus: replicas are CDC-driven, each exposes a high-watermark; client SDK auto-handles "write LSN, read at-least LSN."
- MongoDB causal consistency: session carries
operationTime; replica reads with afterClusterTime wait for alignment.
- Facebook TAO: social-graph caching layer; writes go to primary region, followers async-invalidate; self-writes route specially.
- Vitess @primary / @replica: app specifies routing hints (
SELECT ... /*vt+ ROUTE=primary */), handing the decision to business logic.
4. Failover: what actually happens when the primary dies
Principle: failover is 4 steps — (a) detect primary failure; (b) elect new primary; (c) reroute traffic; (d) reconcile when old primary returns. Each step can produce an outage.
sequenceDiagram
participant O as Orchestrator
participant L as "Old Leader 1a"
participant R1 as "Replica R1 sync"
participant R2 as "Replica R2 async"
participant App as Apps
Note over L,R2: Normal — L writes, R1 sync-ACKs, R2 lags 800ms
L--xO: heartbeat timeout 10s
O->>R1: check R1 LSN
O->>R2: check R2 LSN
Note over O: R1 lsn=1000, R2 lsn=880
pick R1 most recent
O->>R1: PROMOTE
O->>App: reroute to R1 as new leader
R1->>R2: re-establish replication
Note over L: L network recovers, still thinks it's leader
but fencing token invalidated
O->>L: DEMOTE / kill
Note over L: writes lsn 1000~1010 on old L if un-ACKed are lost
if already ACKed to client — data loss
Key design decisions:
- Auto vs Manual failover: auto = fast but prone to misjudgment (GC pause, network blip); manual = safe but high RTO. Most production setups use "auto + human confirm."
- Election criteria: highest LSN wins (minimizes data loss); on tie, prefer the closest AZ with the most stable network.
- Fencing / STONITH: must guarantee the old primary really cannot serve writes. Tools: IPMI power-off, network ACL cutoff, lease tokens — the new primary gets a higher epoch, the old primary's writes are rejected at the routing layer. Failover without fencing = a split-brain time bomb.
- Majority required: Raft-class systems (etcd / CockroachDB / Patroni defaults) require majority alive to elect — this prevents split-brain but needs ≥3 nodes; 2-node topologies inevitably split-brain on failover.
# Minimum-viable fencing token (à la ZooKeeper / etcd lease)
# Each election bumps epoch; writes carry it
class Leader:
epoch: int # issued by coordinator (Consul / etcd)
def write(self, k, v):
coordinator.cas("epoch", expected=self.epoch) # verify still leader
storage.write(k, v, epoch=self.epoch)
# Routing layer / replica rejects writes with stale epoch
def accept_write(req):
if req.epoch < current_known_epoch:
return REJECT # old leader's zombie writes blocked here
return APPLY
# Even if the partitioned old leader wakes up, its epoch is stale,
# any write to replicas/storage is rejected. That's fencing.
Real-world cases / lessons:
- GitHub 2018-10-21 24h outage: a 43-second cross-region network partition triggered Orchestrator failover; the east coast became the new primary; when the west recovered, the two sides had diverged. GitHub chose to keep the east (sacrificing ~30 min of data) and spent 24 hours reconciling metadata. Lesson: cross-region auto-failover is extremely risky — switched to human confirmation afterward.
- Patroni + etcd: the standard Postgres HA: majority election + leader lock + fencing. Battle-tested at Zalando.
- Redis Sentinel vs Redis Cluster: Sentinel has famous "split-brain write loss" cases (Aphyr/Jepsen 2013); Cluster mode is better but still has edge cases.
- AWS RDS / Aurora Multi-AZ: standby auto-failover at 60-120s RTO; Aurora reuses shared storage so failover is "attach new compute to same storage" — 10-30 seconds.
- etcd / CockroachDB / Spanner: Raft leader election is sub-second; no "traditional failover" — majority alive = writable.
Extensions & Optimization
- Read traffic with lag-threshold cutoff: replica lag > 5 seconds = auto-removed from load balancer (YouTube Vitess, AWS RDS support this). Reading data from minutes ago is worse than returning an error.
- Never sync-replicate across regions: cross-ocean RTT is 100-200 ms; every write pays that tax. Cross-region must be async + local strong reads + remote eventual reads.
- CDC (Change Data Capture) as second-tier replication: Debezium / Maxwell stream binlog/WAL to Kafka; downstream heterogeneous systems (ES, cache, warehouse) subscribe. Replication is no longer just inside the DB — it's the origin of the entire data flow.
- Logical vs Physical replication: physical (Postgres streaming, MySQL ROW binlog) is byte-level identical and fast but version/platform locked; logical (Postgres logical decoding, Debezium) crosses major versions and heterogeneous systems but is slower and has constraints (DDL, sequences, large txns). Use logical for upgrades / migrations / cross-cloud; physical for HA.
- Multi-region writes: CDN / edge cache first, then consider multi-leader. 99% of latency issues are solved by local read replicas + edge caching — no need to enter multi-leader's conflict hell.
Common Pitfalls & Interview Questions
1. Treating a read replica as a backup: DROP TABLE on primary, milliseconds later the replica also DROPs. Backups must be PITR snapshots, not replicas.
2. More replicas ≠ more write throughput: every replica adds IO and bandwidth load to the leader. Postgres with 10+ async replicas hits a ceiling. Need write scale → shard, not replicate.
3. Read-only replicas that secretly accept writes: an admin runs a "quick fix UPDATE" on a replica; next failover that replica is promoted → data divergence. Replicas must enforce read-only (default_transaction_read_only=on).
4. Failover without fencing: old primary GC-pauses for 30s; new primary elected; old primary wakes and continues accepting writes — split-brain, both sides "win," data is unreconcilable.
5. Semi-sync silently blocks when all sync replicas are down: Postgres' default is to block writes; the business thinks it's a slow query but it's actually a complete write stop. Either monitor sync-replica availability and manually degrade, or use Patroni auto-config.
Likely interview follow-ups:
- You design a Notion cross-region read replica. A European user edits a doc, refreshes, and doesn't see their change. How do you diagnose and fix?
- Explain read-your-writes, monotonic read, and consistent prefix — what user-visible anomaly does each prevent?
- Primary died; you have 3 followers: A lsn=1000, B lsn=995, C lsn=1000 but in another region. Pick one. Why?
- In a multi-leader setup, how do you handle "same row mutated concurrently in two regions"? When is LWW / CRDT / app-level arbitration appropriate?
- What does Postgres semi-sync's
synchronous_commit=remote_apply guarantee beyond on? At what cost?
Deep Resources
- 《Designing Data-Intensive Applications》Ch 5 — Replication (Kleppmann): the most authoritative single chapter on this topic. Full spectrum of leader-based / multi-leader / leaderless.
- GitHub 2018-10-21 Incident Report:
github.blog/2018-10-30-oct21-post-incident-analysis. Textbook RCA for cross-region auto-failover disaster.
- Amazon Dynamo (2007) + Aurora (2017) papers: foundational reads on leaderless quorum and storage-compute separated quorum.
- Jepsen test series (Kyle Kingsbury / aphyr.com): real behavior of MongoDB / Redis / etcd / Postgres under network partitions — often debunks official consistency claims.
- Marc Brooker — "It's Time to Replace TCP" series: intuition for the network and timeout substrate under replication.
- Patroni docs + Zalando engineering blog: industrial reference for Postgres production HA.
- Vitess docs — Reparenting: the most complete open-source MySQL failover engineering.
Deep Thinking
1. Why is Raft majority-write latency in Spanner / CockroachDB more stable than traditional semi-sync primary-follower? What's the connection to tail latency?
On the surface, Raft writes need majority ACK (with N=3, 2 ACKs); semi-sync also waits for 1 replica — should be similar. The difference is who waits for whom:
- Semi-sync: typically waits for a specific sync replica. If that replica stalls (GC, disk stall, packet loss), the leader stalls. Even if the other two replicas are healthy, it doesn't help. Tail latency = max(leader, sync_replica).
- Raft majority: waits for the fastest N/2+1. With N=3, 2/3 — automatically skipping the slowest replica. Tail latency ≈ median replica latency, not max.
This is the classic distributed-systems tail tolerance pattern: by saying "majority + don't pin to specific replicas," single-node tail latency is removed from the critical path. Google's Spanner paper discusses this; Aurora's 6 replicas requiring 4/6 ACK works the same way (auto-skip slowest 2).
The cost: (a) at least 3 nodes (2-node Raft degenerates), higher cost; (b) implementation complexity (leader election, log compaction, config changes) far above primary-follower. So small OLTP often stays on semi-sync; massive-scale / global goes Raft. From semi-sync to quorum is a clear upgrade on the availability-complexity curve, not a drop-in replacement.
2. Was GitHub 2018's outage really avoidable by "no cross-region auto-failover," or is the lesson deeper? Why didn't Orchestrator recognize "43s network partition" vs "primary actually dead"?
The surface lesson is "don't auto-failover across regions," but the deeper truth is fault detection in distributed systems is fundamentally imperfect. The FLP impossibility result tells us no asynchronous system can be both correct and complete in detecting failure. A 43s partition and "primary dead" look identical to followers: heartbeats stop.
- The choice is a trade-off, not a "correct answer": wait longer (say 60s) → worse RTO; don't wait → high false positives. GitHub's threshold was a few seconds.
- The real root cause is "consistency boundary ≠ fault detection boundary". With cross-region, the detector is in one region, the primary in another. The detector can't see the primary, but the primary is still alive and accepting writes (from isolated clients). This "live partition" is the most dangerous case.
- The fix (what GitHub did): (a) cross-region failover requires human confirmation; (b) each region elects locally via Raft, cross-region is read-only replica; (c) add quorum-style fencing — new primary must reach majority of regions to accept writes.
- System design philosophy: "recoverability" beats "high availability." Five more minutes of downtime is far better than 24 hours of data repair. When automating, design for the worst case: "how fast can I roll back if it judges wrong" matters more than "how accurate is it."
Analogy: self-driving in a city. Tesla picks "bold auto + occasional rear-end"; Waymo picks "cautious auto + human handover." GitHub's incident was the former philosophy hitting a wall; they pivoted to the latter. Distributed-systems maturity is not "more automation," it's "knowing where to stop and wait for a human".
3. Capacity estimation: 1M DAU SaaS, primary Postgres at 10k write QPS / 50k read QPS. CFO wants to cut costs: "Do we really need 5 read replicas? Can we trim to 2?" Give a quantitative analysis.
Before cutting, decompose what each replica does. Lump-sum answers are always wrong.
5 distinct roles replicas serve, each with different sizing needs:
- Read capacity scaling: 50k read QPS / 15k per replica = at least 4 replicas. Cutting to 2 = read-overload.
- HA / failover standby: at least 1 semi-sync same-AZ; can't share with read traffic (failover would halve read capacity).
- Analytics isolation: BI / data export on a dedicated replica to avoid hurting online reads. Usually 1.
- Geo-local reads: 1 per dense user region. EU + US + APAC = 3.
- Backup / DR: 1 cross-region async for disaster recovery.
Sum: 4 (read) + 1 (HA sync) + 1 (analytics) + several cross-region ≈ 7-8. The current 5 is already compressed.
Real consequences of cutting to 2:
- Read capacity drops from 60k to 30k QPS (including leader); peak traffic crashes;
- Analytics queries on OLTP nodes; slow queries drag down the whole DB;
- Loss of geo-local reads; EU users' P99 goes from 50ms to 200ms;
- During failover the only remaining replica gets all the traffic — potential cascading failure.
Real cost-cutting alternative:
- Cache (Redis) intercepts 70% of reads → replicas can drop from 4 to 2;
- Analytics goes through CDC → ClickHouse / BigQuery; replicas no longer carry BI;
- Cross-region replicas: keep only EU (largest market); APAC degrades to us-east + edge CDN;
- HA replicas use cheaper SKUs (no read traffic, just WAL catch-up).
Answer to CFO: don't cut blindly. But $80k/month (5 × r6i.4xlarge) can become $40k/month — provided you first invest in a Redis cluster, CDC pipeline, and monitoring. That's "moving cost from hardware to engineering" — one-time engineering investment for long-term savings.
"Replica count" is never a standalone config; it's the result of the entire read-path architecture.
4. CRDT sounds like a silver bullet for multi-leader conflicts (auto-merge, no coordination). What are its hard limits? Why do Google Docs / Figma still need a server-side authority even with OT/CRDT?
CRDT (Conflict-free Replicated Data Type) really does mathematize "merge," but several hard constraints prevent it from replacing server authority:
- (a) Only effective for "structured data". Counter, Set, Map types with algebraic properties (associative, commutative, idempotent) work; complex business rules (inventory must be ≥ 0), or side effects (sending email), can't be CRDT-ified.
- (b) Metadata bloat. OR-Set must record actor+timestamp per element and keep tombstones for deletion to prevent resurrection; for large docs, CRDT metadata is 10-100x the content itself, and long-term use requires GC — GC needs coordination, back to the coordination problem.
- (c) Can't express "what happened most recently". CRDT merge is commutative — meaning the real order of events is lost. "First a message, then a reply" can be reversed in CRDT view (commutativity ≠ causal preservation).
- (d) Permissions, audit, billing need ground truth. Two clients adding a collaborator to the same doc — the CRDT union is "correct," but "who added first" and "when" determine billing — server arbitration needed.
- (e) Real-time collaboration needs "what others are doing right now" awareness. Cursor position, selection highlight — not CRDT state (these are ephemeral) — needs a hub to relay.
Figma's real architecture: CRDT handles graphical objects (layer z-order, node attributes CRDT-ified), but server handles session management, permissions, version snapshots, cross-user coordination. The server is not the "state authority" but the "coordination and persistence hub." Google Docs is similar — OT runs on the client, server accepts, sequences, broadcasts + saves.
Conclusion: CRDT is a tool for "multi-write conflict-free merge," not an architecture for "no server at all." It compresses the server's role from "arbitrate every write" down to "coordinate metadata events + persist." Understanding this avoids the "rewrite everything in CRDT" trap. Truly leaderless systems barely exist; the leader's role is just compressed to the minimum.
5. Cross-chapter: Sharding (Day 4) + Replication (Day 5) deployed together — how do you reason about failure domains? E.g., 32 shards × 3 replicas = 96 nodes; one AZ goes down, what's the availability?
Sharding and replication are orthogonal but their failure domains must be reasoned about together. Simple cases:
- Cross-AZ deployment: each shard's 3 replicas distributed across 3 AZs. AZ down → each shard loses 1 replica, 2/3 remain → still writable (majority alive). Availability ≈ 100%.
- Same-AZ deployment: all 3 replicas in one AZ → AZ down = entire shard down. 32 shards all down = total outage.
- Mixed (most common): leader + 1 sync replica in same region different AZs; 1 async replica cross-region. AZ down → lose leader or sync replica, failover to remaining.
Key insight: sharding amplifies AZ failure blast radius. Single DB: AZ down affects 1 DB. 32 shards: AZ down triggers 32 concurrent failovers — orchestrator capacity, network capacity, client retries all stress-tested at once. GitHub, Slack have all seen "AZ down → multi-shard concurrent failover → control plane overloaded → cascading failure."
Mitigations:
- Rate-limit failover (e.g., promote at most 5 shards/sec);
- Client retry with jitter + exponential backoff to avoid thundering herd;
- Regular AZ-level chaos engineering to expose hidden serialization points;
- Capacity planning for "1 AZ down + 1 failure domain unavailable simultaneously" (N-2, not N-1).
Order of magnitude: 96 nodes evenly across 3 AZs, single AZ down — theoretically 100% available, but in practice typically drops to 95-98% (some shards 5-30s unavailable during failover, client retry storms, alert noise). Multi-region (active-active across regions) can drive single-region failure to true 100% availability, at the cost of conflict resolution + cross-region write latency.
Interview punch line: "Sharding is horizontal scaling, replication is failure insurance — they solve different problems and amplify each other's complexity. Any answer that mentions only one is incomplete."