Scenario & Constraints
Design a global e-commerce platform that mixes "flash sale inventory + product details + comments" — three regions (US/EU/AP), 500k DAU, peak 20k orders/sec, 200k product-view QPS, 50k comment R/W QPS. Three radically different consistency requirements live in one system:
- Inventory & checkout (strong / linearizable): 100 items must not oversell, must not undersell; any client read must see monotonically decreasing stock. RPO=0.
- "My orders" view (read-your-writes + monotonic): a user who just placed an order must see it on refresh; state must not regress (paid → pending).
- Product comments (causal + eventual): a reply must appear after the comment it replies to (causal order); but European users can lag 1-2 seconds behind Asian ones.
This forces you to mix multiple consistency models in one system — which is what real systems look like. Day 5 covered how replication works; today is: which consistency guarantees are worth their latency & availability cost, which aren't, and how to implement them in practice.
High-Level Architecture
graph TD
    subgraph Edge["Edge / CDN"]
        CDN["Product Detail Page<br/>eventual, TTL 60s"]
    end
    subgraph Core["Core (us-east, strong zone)"]
        INV["Inventory Service<br/>Spanner / CockroachDB<br/>Linearizable"]
        ORD["Order Service<br/>Postgres semi-sync<br/>RYW via session token"]
    end
    subgraph Comments["Comments Domain (Multi-Region)"]
        CMT_US["Comments us-east"]
        CMT_EU["Comments eu-west"]
        CMT_AP["Comments ap-south"]
        HLC[("HLC<br/>Hybrid Logical Clock")]
    end
    Client["Client<br/>carries session LSN / HLC"]
    Client -- "view product" --> CDN
    CDN -. miss .-> INV
    Client -- "order / decrement stock" --> INV
    Client -- "list my orders" --> ORD
    Client -- "post / read comments" --> CMT_US
    Client -- "post / read comments" --> CMT_EU
    Client -- "post / read comments" --> CMT_AP
    INV -. CDC .-> CDN
    ORD -. CDC .-> CDN
    CMT_US <-. async + HLC .-> CMT_EU
    CMT_EU <-. async + HLC .-> CMT_AP
    CMT_AP <-. async + HLC .-> CMT_US
    classDef strong fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef ryw fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef causal fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class INV strong
    class ORD ryw
    class CMT_US,CMT_EU,CMT_AP,HLC causal
Three consistency domains: inventory uses global linearizable (Spanner-class quorum), orders use single-region primary + client LSN, comments use multi-region async + HLC for causal order. Edge CDN absorbs 95% of pure display traffic.
Key Technical Points
1. The Consistency Spectrum: 6 Models, From "Absolutely Right" to "Eventually Aligned"
Principle: A consistency model defines "how far the order seen by clients can deviate from some ideal order". Strong to weak (each step weaker buys you performance/availability):
| Model | Promise | Implementation Cost | Typical |
| --- | --- | --- | --- |
| Linearizable | All ops behave as if executed in wall-clock order at a single point; reads see the latest committed write | Quorum sync + global clock / consensus | etcd, ZooKeeper, Spanner |
| Sequential | All replicas see the same op order (but not necessarily matching real time) | Global serialization (leader / total order broadcast) | Kafka single partition, Redis |
| Causal | Causally-related ops are ordered; concurrent ops are not constrained | HLC / version vector + dep tracking | MongoDB causal session, COPS |
| Read-your-writes | Read your own write immediately | Session sticky or client LSN token | Minimum bar for most web apps |
| Monotonic Read | Reads don't "go back in time" | Client pinning to one replica / high-watermark | Twitter timeline |
| Eventual | All replicas converge after writes stop | Async replication + convergence function (LWW / CRDT) | DNS, S3 before its Dec 2020 strong-consistency change, Cassandra default |
Key insight: "strong consistency" is not the default — it must be justified to be chosen.
- Linearizable: only when business correctness depends on "latest value". Inventory, leader election, distributed locks, uniqueness constraints. Cost per op: cross-replica RTT + one round of consensus.
- Causal: chat, comments, social feed — "order matters" but global order doesn't. Far cheaper than linearizable; users can't tell the difference.
- RYW + Monotonic: 90% of "users feel the system works" comes from these two. Client carries a token — almost zero server cost.
- Eventual: pure display, counters (approximate OK), caches. Start with "everything eventual by default", upgrade when needed.
# Differences users can't see vs differences they immediately complain about
# Can't see (yet engineers fight for):
# linearizable vs sequential in a single datacenter is imperceptible
# Will complain instantly:
# - "I just ordered, can't see the order" → violates RYW
# - "I saw 5 comments, now only 3" → violates monotonic read
# - "Reply appears before the question" → violates causal (consistent prefix)
# - "100 items sold 102 times" → violates linearizable
# Engineering priority (must-have → optional):
# linearizable[critical resources] ≫ RYW + monotonic + causal[UX] ≫ eventual[fallback]
Real cases:
- Google Spanner: global linearizable (external consistency) via TrueTime API (GPS+atomic clocks) bounding clock uncertainty to ~7ms; commits wait until safe before becoming visible. Original paper, 2012.
- CockroachDB / YugabyteDB: open-source homages to Spanner, using HLC (no TrueTime) for serializable + bounded staleness.
- Amazon DynamoDB: default eventual, with per-request ConsistentRead=true for strong (cost: read capacity ×2, and typically somewhat higher latency).
- Facebook TAO: social graph cache — writes go to primary region, reads are eventual; only a few critical ops (privacy settings) use strong path.
- Discord messages: Cassandra QUORUM R/W (W+R>N), giving per-channel ordering that is good enough in practice (not strictly sequential; see the quorum pitfall below); cross-channel unordered, which is fine for the business.
2. CAP and PACELC: "Pick Two" Is Misleading; "Two of Four" Is Closer to Truth
Principle: The CAP theorem (Brewer 2000, proved by Gilbert & Lynch 2002) says: during a network partition (P), you must choose between consistency (C) and availability (A). But this only describes behavior during a partition — and partitions are rare. It says nothing about the rest of the time.
PACELC (Abadi 2012) fills the other half: during Partition pick A or C; Else (normal time) pick Latency or Consistency. Four full combinations:
| Type | Partition | Normal | Typical |
| --- | --- | --- | --- |
| PA / EL | Keep availability | Low latency (weak consistency) | Cassandra, DynamoDB, Riak |
| PA / EC | Keep availability | Strong consistency | MongoDB default (writeConcern majority) |
| PC / EL | Keep consistency (unavailable) | Low latency | Rare — if you want C in partition you usually want it normally too |
| PC / EC | Keep consistency | Strong consistency | Spanner, CockroachDB, etcd, HBase |
Interview soundbite: "CAP is the partition-time pick; PACELC tells you what 99% of the time should look like."
Real production picks are two:
- PA/EL (most common): "During partition keep writing, converge after; in normal time, latency first". Social, e-commerce browsing, IoT ingestion. Cost: write conflicts converge via LWW/CRDT, business must be designed mergeable.
- PC/EC: "Consistency is the floor; rather unavailable than wrong; in normal time spend latency for accuracy". Finance, inventory, leader election, config. Cost: one side unwritable during partition; 100+ms cross-region writes.
Anti-pattern: defaulting to PC/EC for non-critical data. Storing "social comments" in etcd is an engineering disaster.
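The partition-time half of PACELC can be sketched in a few lines. This is a toy model, not any database's actual logic: `handle_write`, `WriteResult`, and the `mode` strings are hypothetical names, and "reachable" simply counts the replicas on the caller's side of the partition.

```python
from dataclasses import dataclass


@dataclass
class WriteResult:
    accepted: bool
    reason: str


def handle_write(mode: str, reachable: int, total: int) -> WriteResult:
    """Decide whether one side of a partition accepts a write.

    mode "PC": refuse unless a majority of replicas is reachable.
    mode "PA": accept on any reachable replica and reconcile later.
    """
    majority = total // 2 + 1
    if mode == "PC":
        if reachable >= majority:
            return WriteResult(True, "majority reachable, write stays consistent")
        return WriteResult(False, "no majority, fail-stop to protect consistency")
    if mode == "PA":
        if reachable >= 1:
            return WriteResult(True, "accepted locally, must be merged after healing")
        return WriteResult(False, "no replica reachable")
    raise ValueError(f"unknown mode: {mode}")


# Five replicas split 3 / 2 by a partition; look at the minority side.
print(handle_write("PC", reachable=2, total=5).accepted)  # False
print(handle_write("PA", reachable=2, total=5).accepted)  # True
```

The PA branch is where the "business must be designed mergeable" cost shows up: every write it accepts on the minority side needs a convergence rule later.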
# Common CAP misconceptions, debunked
# ❌ "CAP forces me to give up one" — wrong. P is not optional, it's physical fact.
# Real choice: when P happens, pick A or C.
# ❌ "We chose AP" — too coarse. AP further splits:
# - per-request read consistency (DynamoDB ConsistentRead=true vs the eventual default)
# - per-query configurable (Cassandra ONE / LOCAL_QUORUM / QUORUM)
# ❌ "CAP is 2 of 3" — wrong, it's 1 of 2 during P; otherwise there's also E (latency vs C).
# ❌ "Strong = high latency" — only cross-region. Single-region Raft quorum is often < 5ms.
# ❌ "Eventual = data is lost" — wrong. Eventual means "temporarily inconsistent, eventually converges".
# Data loss is durability, not consistency.
Real cases:
- Jepsen reports overturn vendor claims: aphyr.com/jepsen has tested MongoDB, Redis Sentinel, Galera, CockroachDB and repeatedly disproved official docs — best reminder that PACELC must be verified, not trusted.
- Amazon DynamoDB: tunable per-request — purchase goes strong, product list goes eventual.
- Stripe: accounts & billing PC/EC (Postgres primary + multi-AZ semi-sync); event log & audit PA/EL (Kafka + ClickHouse). Same company, different subsystems, different quadrants.
- Cloudflare KV vs Durable Objects: KV is PA/EL (eventual, ~60s global convergence); Durable Objects is PC/EC (single-point strong + cross-region routing). Cloudflare's docs point workloads that need strong consistency to Durable Objects rather than KV.
3. Causal Consistency: HLC and Version Vectors
Principle: Causal consistency guarantees that causally related operations are seen in the same order by all replicas. The formal basis is Lamport's 1978 happens-before relation: a → b if a and b are in the same process and a precedes b, or a is the sending of a message and b is its receipt, or there is some c with a → c and c → b (transitivity). Operations not related by → are concurrent.
Three main implementation tools:
- Lamport timestamp: single integer, increments on each op; on receive L = max(local, msg)+1. Provides a total order (with a node-ID tie-break) that is consistent with causality, but cannot distinguish causal from concurrent.
- Version Vector / Vector Clock: each node tracks "last version seen from every other node". Can distinguish causal vs concurrent, but size = O(node count), which is hard to maintain with dynamic membership.
- Hybrid Logical Clock (HLC): combines physical time + logical counter, (phys_time, logical). Preserves happens-before like a Lamport clock (it cannot detect concurrency the way a vector clock can), stays close to wall-clock time, and has constant size. Used by CockroachDB, YugabyteDB, and MongoDB (cluster time, since 3.6).
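The difference between "total order" and "can tell causal from concurrent" is easiest to see side by side. The sketch below is a minimal illustration with hypothetical helper names (`vc_tick`, `vc_merge`, `vc_compare`); node names are just strings.

```python
from typing import Dict

VectorClock = Dict[str, int]


def lamport_on_receive(local: int, msg: int) -> int:
    """Lamport rule on message receipt: L = max(local, msg) + 1."""
    return max(local, msg) + 1


def vc_tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's entry incremented."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, applied when a node learns of a remote event."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def vc_compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two events as before / after / equal / concurrent."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Alice posts on "us"; Bob reads it on "eu" and replies; Carol posts on "ap" independently.
question = vc_tick({}, "us")                    # {"us": 1}
reply = vc_tick(vc_merge({}, question), "eu")   # {"us": 1, "eu": 1}
unrelated = vc_tick({}, "ap")                   # {"ap": 1}

print(vc_compare(question, reply))    # before
print(vc_compare(reply, unrelated))   # concurrent

# Lamport timestamps for the same events: the question and Carol's post both get 1,
# the reply gets 2. The integers order Carol's post before the reply (1 < 2)
# even though neither caused the other.
reply_lamport = lamport_on_receive(local=0, msg=1)
print(reply_lamport)                  # 2
```

An HLC behaves like the Lamport integer here: it guarantees the reply sorts after the question, but it cannot report that Carol's post was concurrent.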
sequenceDiagram
    participant U as Alice (us-east)
    participant US as Comments US
    participant EU as Comments EU
    participant V as Bob (eu-west)
    Note over U,V: Causal scenario: Bob replies to Alice
    U->>US: post "Looks good?" HLC=(t1, 0)
    US->>EU: replicate (t1, 0)
    EU->>V: read shows "Looks good?"
    V->>EU: reply "Yes!" with deps=[(t1,0)] HLC=(t2, 0)
    Note over EU: t2 > t1 preserves causality
    EU->>US: replicate reply with deps
    Note over US: receive reply, check deps<br/>if (t1,0) not yet arrived → buffer
    US-->>U: now shows "Looks good?" first, then "Yes!" ✓
    Note over U,V: Wrong path (no HLC, just async):
    Note over US: reply may arrive before question<br/>users see reply before question ✗
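The "check deps, buffer if missing" step in the diagram can be sketched as a small inbox on the receiving replica. `Comment` and `CausalInbox` are hypothetical names for this illustration, and timestamps are plain (physical, logical) tuples standing in for HLC values.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Timestamp = Tuple[int, int]  # (physical_ms, logical), as in the diagram


@dataclass
class Comment:
    id: Timestamp
    text: str
    deps: List[Timestamp] = field(default_factory=list)


class CausalInbox:
    """Makes a replicated comment visible only after all of its deps are visible."""

    def __init__(self) -> None:
        self.visible: List[Comment] = []
        self._seen: Set[Timestamp] = set()
        self._pending: List[Comment] = []

    def receive(self, comment: Comment) -> None:
        self._pending.append(comment)
        self._drain()

    def _drain(self) -> None:
        # Keep releasing buffered comments until a full pass makes no progress.
        progressed = True
        while progressed:
            progressed = False
            for c in list(self._pending):
                if all(dep in self._seen for dep in c.deps):
                    self._pending.remove(c)
                    self.visible.append(c)
                    self._seen.add(c.id)
                    progressed = True


question = Comment(id=(1000, 0), text="Looks good?")
reply = Comment(id=(1005, 0), text="Yes!", deps=[(1000, 0)])

inbox = CausalInbox()
inbox.receive(reply)                     # arrives first over the async link
print([c.text for c in inbox.visible])   # []
inbox.receive(question)
print([c.text for c in inbox.visible])   # ['Looks good?', 'Yes!']
```

A production version would also bound the buffer and time out on dependencies that never arrive; both are omitted here.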
# Hybrid Logical Clock (HLC), simplified sketch
import time


def physical_time_ms():
    """Local wall clock in milliseconds (may jump; HLC absorbs that)."""
    return int(time.time() * 1000)


class HLC:
    def __init__(self):
        self.l = 0  # logical (max wall time seen so far)
        self.c = 0  # counter to break ties

    def now(self):  # local event (e.g. write)
        pt = physical_time_ms()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote_l, remote_c):  # receiving msg
        pt = physical_time_ms()
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

# Properties:
# - Monotonic (no rollback on local or remote events)
# - Preserves happens-before: a→b ⇒ HLC(a) < HLC(b)
# - Close to wall clock (diff bounded by max clock skew)
# - Constant size (8 bytes), not O(N) like vector clocks
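The "constant size (8 bytes)" property can be made concrete by packing the pair into one 64-bit integer. The 48/16 bit split below is an assumption chosen for illustration (real systems pick their own layouts), and `pack_hlc` / `unpack_hlc` are hypothetical helpers.

```python
PHYS_BITS = 48      # assumed: milliseconds since epoch
LOGICAL_BITS = 16   # assumed: tie-break counter
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1


def pack_hlc(l: int, c: int) -> int:
    """Pack (physical_ms, counter) into one integer that sorts like the tuple."""
    if not 0 <= l < (1 << PHYS_BITS):
        raise ValueError("physical component out of range")
    if not 0 <= c <= LOGICAL_MASK:
        raise ValueError("logical counter overflow")
    return (l << LOGICAL_BITS) | c


def unpack_hlc(packed: int) -> tuple:
    """Inverse of pack_hlc."""
    return packed >> LOGICAL_BITS, packed & LOGICAL_MASK


a = pack_hlc(1_700_000_000_000, 3)
b = pack_hlc(1_700_000_000_001, 0)

print(a < b)                    # True: integer order matches (l, c) tuple order
print(unpack_hlc(a))            # (1700000000000, 3)
print(a.bit_length() <= 64)     # True
```

Because the physical part occupies the high bits, comparing two packed values as plain integers gives the same result as comparing the (l, c) tuples.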
Real cases:
- MongoDB 3.6+ causal consistency sessions: the client session carries operationTime (effectively an HLC); reads use afterClusterTime to wait until the node has caught up. With majority read and write concern this provides RYW, monotonic reads, monotonic writes, and writes-follow-reads.
- CockroachDB: every transaction carries an HLC timestamp; comparison across nodes gives serializable + causal order, no GPS clock needed.
- Riak: dotted version vectors (DVV), which merge sibling values correctly even under dynamic node changes; the original Dynamo paper used plain vector clocks.
- COPS (Clusters of Order-Preserving Servers): Lloyd et al., SOSP 2011. Introduced the causal+ model, an academic baseline for geo-replicated social workloads.
- Git: the commit DAG records causal history explicitly (the information a version vector summarizes); merge detects diverged branches and asks for human resolution when changes conflict.
4. Engineering: Pick Consistency per Data Type Within One System
Principle: Real systems never use a single consistency model. Layer by "cost of violation" is the core methodology.
Layered decision (driven by SLO/business loss):
- Violation → data corruption / financial loss (inventory, balance, uniqueness) → Linearizable / Serializable. Spanner / CockroachDB / Postgres strong constraints.
- Violation → user complaints + support cost (can't see my order, my comment vanished) → RYW + Monotonic. Client LSN/HLC token, zero extra server cost.
- Violation → weird UX but no error (comment order scrambled) → Causal. HLC + dependency tracking.
- Violation → user can't tell (product view counter, recommendation ranking) → Eventual. Redis / CDN / async index.
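The layered decision above is essentially a lookup table, which can be worth writing down so that each new data domain gets an explicit answer. The enum and table names below are hypothetical, and the domain list simply mirrors this scenario.

```python
from enum import Enum


class ViolationCost(Enum):
    DATA_CORRUPTION = "data corruption / financial loss"
    USER_COMPLAINT = "user complaints + support cost"
    WEIRD_UX = "weird UX but no error"
    INVISIBLE = "user can't tell"


class Consistency(Enum):
    LINEARIZABLE = "linearizable"
    RYW_MONOTONIC = "read-your-writes + monotonic"
    CAUSAL = "causal"
    EVENTUAL = "eventual"


# Cost of a violation decides the model, not the database that happens to be in use.
POLICY = {
    ViolationCost.DATA_CORRUPTION: Consistency.LINEARIZABLE,
    ViolationCost.USER_COMPLAINT: Consistency.RYW_MONOTONIC,
    ViolationCost.WEIRD_UX: Consistency.CAUSAL,
    ViolationCost.INVISIBLE: Consistency.EVENTUAL,
}

# Assumed classification of this scenario's data domains.
DATA_DOMAINS = {
    "inventory": ViolationCost.DATA_CORRUPTION,
    "my_orders": ViolationCost.USER_COMPLAINT,
    "comments": ViolationCost.WEIRD_UX,
    "view_counter": ViolationCost.INVISIBLE,
}


def consistency_for(domain: str) -> Consistency:
    """Look up the weakest model that still avoids the domain's violation cost."""
    return POLICY[DATA_DOMAINS[domain]]


for name in DATA_DOMAINS:
    print(f"{name}: {consistency_for(name).value}")
```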
# Mixing consistency per business field on the same request path (pseudocode)
# Scenario: user clicks "Buy now" on a limited-edition product
# inventory_db, orders_db, session and the helper functions are assumed collaborators.
def place_order(user, sku, qty):
    # Step 1 (strong): decrement stock. Must be linearizable.
    # Spanner / CockroachDB, global quorum, ~30ms is acceptable.
    reserved = inventory_db.atomic_decrement(sku, qty)  # CP
    if not reserved:
        return "Sold out"

    # Step 2 (RYW): write order. Session token so the user sees it immediately.
    order = orders_db.insert(user, sku, qty)  # primary write
    session.set_lsn(order.lsn)                # client carries it back

    # Step 3 (eventual): recommendations & popularity counter
    fan_out_async("user.purchased", {"user": user, "sku": sku})  # Kafka → reco/search
    increment_counter_eventual(sku + ":popular", 1)              # Redis, weak is fine

    # Step 4 (causal): notifications (downstream actions must preserve causality)
    notify_with_hlc(user, "order_placed", deps=[order.hlc])
    return order

# Key insight: ONE action, "Buy now", has FOUR consistency modes
# because business needs differ.
# All-Spanner: slow. All-Cassandra: oversells.
Real cases:
- Stripe payments: transactions (CP/EC, Postgres + idempotency key) + event stream (PA/EL, Kafka + Outbox) + reports (eventual, ClickHouse) — three clear layers.
- Shopify Storefront: inventory decrement strong (MySQL row lock + global counter); product page eventual (Fastly CDN, 60s TTL); order state RYW (cookie with last-update token).
- Twitter Timeline: user's own recent tweet → forced read leader for 5s (RYW); others' tweets → fan-out async (eventual); like counts → eventual + periodic recompute.
- Notion / Linear: document content CRDT collaboration (causal+eventual); permissions/billing strong (Postgres); comments partition by doc — within doc causal, across docs eventual.
Extensions & Optimizations
- Bounded Staleness: "read data at most X seconds old". Stricter than pure eventual, cheaper than strong. Supported by Cosmos DB / Aurora Global Database / CockroachDB. Great for dashboards, analytics, geo-local reads.
- Client SDK with built-in token routing: MongoDB, Vitess, Aurora let the SDK auto-manage session LSN; apps get RYW transparently. The trend: move consistency from infra into SDK.
- Tunable Consistency: Cassandra ONE/QUORUM/ALL, DynamoDB ConsistentRead per request. The business decides per query, not per database.
- Follower reads: CockroachDB / TiDB allow "5-seconds-stale follower reads", bypassing quorum RTT. Great for BI and geo-local reads.
- Fallback when linearizable unreachable: when quorum is lost, fail-stop (recommended) or fail-open (dangerous)? etcd and ZooKeeper choose fail-stop; that's the "C" in PC/EC. If the business can't tolerate that unavailability, don't pick PC; pick PA.
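Bounded staleness and follower reads both reduce to one routing rule: serve from a follower if its lag is within the caller's budget, otherwise go to the primary. The sketch below assumes replication lag is already measured per replica; `Replica` and `choose_replica` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    lag_seconds: float
    is_primary: bool = False


def choose_replica(replicas: List[Replica], max_staleness_s: float) -> Replica:
    """Pick the freshest follower within the staleness bound, else the primary."""
    fresh_enough = [
        r for r in replicas
        if not r.is_primary and r.lag_seconds <= max_staleness_s
    ]
    if fresh_enough:
        return min(fresh_enough, key=lambda r: r.lag_seconds)
    return next(r for r in replicas if r.is_primary)


cluster = [
    Replica("us-east-primary", 0.0, is_primary=True),
    Replica("eu-west-follower", 1.2),
    Replica("ap-south-follower", 7.5),
]

print(choose_replica(cluster, max_staleness_s=5.0).name)   # eu-west-follower
print(choose_replica(cluster, max_staleness_s=0.5).name)   # us-east-primary
```

A real router would usually prefer the nearest qualifying follower rather than the freshest one; the selection key is the only line that changes.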
Common Pitfalls & Interview Questions
1. Strong consistency as default: jumping straight to Spanner / Raft, eating 200ms cross-region writes. Ask first: "what's the cost of inconsistency?", then pick.
2. Confusing ACID with strong consistency: ACID is a single-db transaction guarantee; distributed "C" is a cross-node guarantee. Postgres has ACID on one machine but may be eventual across replicas — don't mix them.
3. Wall-clock ordering: clocks drift 1-100ms in distributed systems; NTP doesn't guarantee monotonicity. Use HLC / Lamport / version vector, never time.time().
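4. Thinking quorum (W+R>N) = linearizable: Dynamo-style quorum with sloppy quorum + hinted handoff can break linearizability, and even strict quorums can return stale or flip-flopping reads while a write is still in flight. True linearizability needs synchronous read repair plus a leader, or consensus (Raft/Paxos).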
5. Client cache breaks monotonic read: browser cached an older page, refresh shows stale data — looks like the system went backward. The frontend is part of the consistency domain too; mind cache busting / ETag.
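Pitfall 3 is easy to reproduce on paper. The sketch below uses made-up numbers (a 50 ms fast clock on node A) and a hypothetical `lww_merge` helper to show last-writer-wins dropping the later write, then applies the Lamport / HLC receive rule to restore the order.

```python
from typing import Tuple

Stamped = Tuple[int, str]  # (timestamp_ms, value)


def lww_merge(a: Stamped, b: Stamped) -> Stamped:
    """Last-writer-wins: keep the value with the larger timestamp."""
    return a if a[0] >= b[0] else b


REAL_MS = 1_000_000   # true time of the first write
SKEW_A_MS = 50        # node A's clock runs 50 ms fast

first = (REAL_MS + SKEW_A_MS, "v1")   # written on A at real time T
second_wall = (REAL_MS + 10, "v2")    # written on B at T+10 ms, after reading v1

print(lww_merge(first, second_wall)[1])      # v1: the later write is silently dropped

# Logical fix: B saw first[0] when it read v1, so it stamps its own write
# strictly after that timestamp instead of trusting its wall clock alone.
second_logical = (max(REAL_MS + 10, first[0]) + 1, "v2")
print(lww_merge(first, second_logical)[1])   # v2
```

The fix only applies because B's write causally follows A's; for truly concurrent writes any timestamp scheme still has to pick a winner or keep both.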
Likely interview follow-ups:
- In 30 seconds, explain the difference between linearizable, sequential, and causal, with a violation scenario for each.
- Your system needs "no oversold inventory + ordered comments + fast recommendations" — sketch the consistency choice for each subsystem and why.
- CAP is often called "outdated". What does PACELC solve? Why is 99% of time spent in E, not P?
- Why is Hybrid Logical Clock better than Lamport timestamps and Vector Clocks? Size, readability, causal detection?
- How can Spanner claim "global linearizable"? How does TrueTime's ε affect commit wait? What if you don't have GPS clocks (CockroachDB)?
Deep Resources
- 《Designing Data-Intensive Applications》Ch 5 (Replication, §Problems with Replication Lag) + Ch 9 (Consistency and Consensus) (Kleppmann): the most thorough industrial discussion of this topic.
- Daniel Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design" (2012, IEEE Computer): the original PACELC paper, 6 pages.
- Spanner: Google's Globally-Distributed Database (OSDI 2012) — TrueTime + external consistency foundation.
- Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases (Kulkarni et al. 2014) — original HLC paper.
- Jepsen — aphyr.com/jepsen: real partition behavior of MongoDB / etcd / Galera / CockroachDB, debunking vendor claims.
- Peter Bailis — "Highly Available Transactions: Virtues and Limitations" (VLDB 2014) — what isolation/consistency is achievable under partition.
- Marc Brooker's blog — brooker.co.za: senior AWS engineer's short pieces on consistency/availability with strong engineering taste.
- Werner Vogels — "Eventually Consistent" (ACM Queue 2008): Amazon CTO on why eventual isn't scary.
Deep Reflection
1. For a single "checkout", inventory is Linearizable, order is RYW, recommendation is Eventual. If a PM says "let's also make orders Linearizable, just to be safe", how do you argue back? Give numbers.
Classic "consistency as comfort blanket" anti-pattern. You need to quantify the cost — not just say "it's engineering complexity", which never lands.
Step 1: What does "order" actually need?
- User writes order, must see it immediately → RYW (session token suffices)
- Order state must not regress → Monotonic Read (same)
- Payment, shipping, refund for one order stay causal → Causal (HLC suffices)
- Must "all orders across users" be globally ordered? No. User A sees A's orders in order, B sees B's, mutually independent.
So Linearizable is over-promising — it requires "a global observer to see all orders in wall-clock order", a guarantee the business does not consume.
Step 2: Quantify cost. Today: Postgres semi-sync primary (PA/EC, cross-AZ ~5ms commit), write P50=8ms / P99=25ms. Switch to Spanner-style global linearizable:
- Each write needs cross-region majority + TrueTime commit wait (~7ms). P50 → ~80ms (cross-ocean RTT), P99 → ~200ms.
- 20k orders/sec at peak → user perception goes from "instant" to "laggy". By Amazon's often-quoted rule of thumb ("100ms = 1% revenue"), the extra ~70ms at P50 and ~175ms at P99 is worth roughly 1-2% of checkout revenue.
- Infra cost: a Spanner-class DB + global presence is 5-10× a single-region Postgres.
- R&D: cross-region strong writes + idempotency + retry design is far more complex; incident diagnosis harder.
Step 3: Real solution to PM's "safety" concern:
- Order writes to primary Postgres, semi-sync with ≥1 replica ACK → RPO ≈ 0.
- Client carries LSN token → RYW + monotonic.
- State transitions (pending→paid) use idempotency keys; event stream via Outbox → 0 data loss.
- Reconciliation job: order totals = inventory out × price ± refunds. Alert on any drift.
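Two of the guarantees in Step 3 (idempotent transitions and "state must not regress") need no special database. A minimal in-memory sketch follows, with a hypothetical `OrderStore` and an assumed linear lifecycle pending → paid → shipped.

```python
from typing import Dict, Set

RANK = {"pending": 0, "paid": 1, "shipped": 2}  # assumed linear lifecycle


class OrderStore:
    """In-memory stand-in for the orders table plus an idempotency-key table."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.applied_keys: Set[str] = set()

    def create(self, order_id: str) -> None:
        self.state[order_id] = "pending"

    def transition(self, order_id: str, new_state: str, idempotency_key: str) -> str:
        if idempotency_key in self.applied_keys:
            return "duplicate"   # retried delivery: safe no-op
        if RANK[new_state] <= RANK[self.state[order_id]]:
            return "rejected"    # would regress or repeat the current state
        self.state[order_id] = new_state
        self.applied_keys.add(idempotency_key)
        return "applied"


store = OrderStore()
store.create("o-1")
print(store.transition("o-1", "paid", "evt-42"))      # applied
print(store.transition("o-1", "paid", "evt-42"))      # duplicate
print(store.transition("o-1", "pending", "evt-43"))   # rejected
print(store.state["o-1"])                             # paid
```

In Postgres the same two checks would typically be a unique constraint on the idempotency key and a conditional UPDATE, executed in one transaction.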
Punchline: "Order system safety doesn't come from a stronger consistency model — it comes from a correct failure model + reconciliation. Linearizable doesn't solve 'lost orders' — that's a durability problem, not a consistency problem." Once you separate the terms, PMs usually get it.
2. Spanner's TrueTime ε is ~7ms. If GPS is jammed and ε spikes to 200ms, what happens? Why is this "safe degradation" rather than "wrong"?
This tests understanding of the commit-wait mechanism. Spanner's external consistency relies on: after assigning commit timestamp t to a transaction, the leader waits until TrueTime guarantees that t is in the past on every node (TT.now().earliest > t, a wait on the order of 2ε) before making the result visible. That guarantees "all external observers see transactions in real-time order".
When ε spikes to 200ms:
- Every transaction waits hundreds of milliseconds longer. Throughput drops sharply, P99 latency spikes.
- But correctness is preserved — the key point. Spanner picks "slow but right" (PC/EC); it does not relax linearizability when ε grows.
- Monitoring fires, ops intervene; once GPS recovers, ε returns to 7ms and performance recovers naturally.
Why this is safe degradation: ε is "upper bound on clock uncertainty"; Spanner always assumes worst case. Larger ε is just "we acknowledge clocks are off", commit wait backstops semantics. This is totally different from "clock is wrong and system doesn't know" — the latter would break consistency.
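Commit wait can be simulated with a fake clock to see how the wait scales with ε. `FakeTrueTime` is a hypothetical stand-in, not Spanner's API; the only behavior borrowed from the Spanner paper is that the commit timestamp is taken at the top of the uncertainty interval and the result is released once the bottom of the interval has passed it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TTInterval:
    earliest: float
    latest: float


class FakeTrueTime:
    """Manual clock that reports real time as an interval of width 2 * epsilon."""

    def __init__(self, epsilon_ms: float) -> None:
        self.epsilon_ms = epsilon_ms
        self.real_ms = 0.0

    def now(self) -> TTInterval:
        return TTInterval(self.real_ms - self.epsilon_ms,
                          self.real_ms + self.epsilon_ms)

    def advance(self, ms: float) -> None:
        self.real_ms += ms


def commit(tt: FakeTrueTime, step_ms: float = 1.0) -> Tuple[float, float]:
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest             # no correct clock can be ahead of this
    waited = 0.0
    while tt.now().earliest <= commit_ts:   # still possibly in the future somewhere
        tt.advance(step_ms)
        waited += step_ms
    return commit_ts, waited


_, wait_normal = commit(FakeTrueTime(epsilon_ms=7))
_, wait_jammed = commit(FakeTrueTime(epsilon_ms=200))
print(wait_normal)   # 15.0
print(wait_jammed)   # 401.0
```

The wait grows to about 2ε in both cases, and nothing else changes: the visibility rule is identical, which is why a larger ε costs latency rather than correctness.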
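Compare CockroachDB (no TrueTime): uses HLC + max_offset (default 500ms) as the bound. A read that encounters a value inside its uncertainty window (up to max_offset ahead of its own timestamp) restarts at a higher timestamp, and a node whose clock drifts too far from its peers shuts itself down. No GPS needed, but max_offset is set conservatively, so transactions can pay for extra restarts. Classic "engineering compromise vs hardware investment" trade-off.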
Deeper insight: Spanner converts "clock uncertainty" into "wait time" — transforming a seemingly unsolvable physics problem (clock sync) into a quantifiable engineering resource (waiting). That's distributed systems design aesthetics at its peak: don't fight physics, fold it into the abstraction. HLC is the cheaper version of the same idea (logical counter compensates for clock skew).
So the "right answer" isn't "Spanner is broken"; it's "Spanner trades speed for correctness, which validates its philosophy".
3. Cross-chapter: Day 4 (Sharding) + Day 5 (Replication) + Day 6 (Consistency). For a sharded multi-replica OLTP, how do you get strong consistency across shards? Why is 2PC inefficient here and why is deterministic transaction trending?
The core problem of distributed databases. Stack the three dimensions:
- Within a shard: multi-replica using Raft/Paxos for linearizable (CockroachDB, TiKV, Spanner).
- Across shards: a transaction may touch multiple shards (e.g., "transfer $100 from A to B" — A, B in different shards).
Classical: 2PC (Two-Phase Commit)
- Phase 1: coordinator asks all shards to prepare (lock rows, write prepare log)
- Phase 2: all yes → commit, any no → abort
- Problems: (a) coordinator is a SPOF; (b) prepared rows are locked, blocking other ops; (c) if coordinator dies between phases → participants blocked; (d) two sequential round trips, each gated by the slowest of the N shards = high latency.
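The two phases and the blocking problem (c) fit in a short in-memory sketch. `Participant` and `two_phase_commit` are hypothetical names; there is no networking, logging, or timeout handling, and the `coordinator_crashes` flag exists only to show the in-doubt state.

```python
from typing import List


class Participant:
    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # A real participant would lock rows and write a durable prepare record here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant],
                     coordinator_crashes: bool = False) -> str:
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if coordinator_crashes:
        return "in-doubt"                         # prepared shards keep their locks
    if all(votes):                                # phase 2: unanimous yes → commit
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


ok = [Participant("shard-A"), Participant("shard-B")]
print(two_phase_commit(ok), [p.state for p in ok])
# committed ['committed', 'committed']

stuck = [Participant("shard-A"), Participant("shard-B")]
print(two_phase_commit(stuck, coordinator_crashes=True), [p.state for p in stuck])
# in-doubt ['prepared', 'prepared']
```

In the second run both shards are stuck in "prepared": they cannot commit or abort on their own, which is exactly the blocking window that replicating the coordinator is meant to close.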
Modern improvements:
- Spanner / CockroachDB: 2PC + consensus (Paxos in Spanner, Raft in CockroachDB). The coordinator role is itself a replicated group (HA); each shard is also a replicated group. Still 2PC, but coordinator and participants are HA: slow but not blocked.
- Calvin / FaunaDB: deterministic transactions. Pre-order all transactions via a global sequencer (log); every node executes in the same order → no distributed commit protocol and no deadlocks (Calvin still takes locks, but grants them in log order). Pros: extreme throughput. Cons: writes wait for global log fan-in; transactions must be deterministic (no randomness, no external I/O).
- Saga (microservices favorite): split into local transactions + compensations. Not ACID, but "eventual + compensation". Cost: business must design compensation logic. Day 7 expands.
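The core idea behind deterministic transactions is that agreeing on the input order is enough: if every replica applies the same log with the same deterministic logic, they reach the same state without a commit protocol. A toy sketch follows, where the log format and `apply_log` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Transfer = Tuple[str, str, int]  # (source account, destination account, amount)


def apply_log(log: List[Transfer], initial: Dict[str, int]) -> Dict[str, int]:
    """Apply transfers strictly in log order; logic is deterministic (no clock, no RNG)."""
    balances = dict(initial)
    for src, dst, amount in log:
        if balances.get(src, 0) >= amount:   # insufficient funds → deterministic skip
            balances[src] -= amount
            balances[dst] = balances.get(dst, 0) + amount
    return balances


# The sequencer has already agreed on this order; A and B may live on different shards.
log = [("A", "B", 100), ("A", "B", 50), ("B", "A", 30)]
initial = {"A": 100, "B": 0}

replica_1 = apply_log(log, initial)
replica_2 = apply_log(log, initial)

print(replica_1)                # {'A': 30, 'B': 70}
print(replica_1 == replica_2)   # True
```

The second transfer is skipped on every replica for the same reason, so even "aborts" need no coordination. This is also why replay works as a recovery mechanism.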
Why deterministic is the trend:
- Cloud-era "global sequencers" got cheap (Kafka, Raft log); single-coordinator is no longer the bottleneck.
- 2PC's root issue (locking + blocking) cannot be optimized away — only displaced by a new paradigm.
- Determinism gives free replay / recovery / cross-region replication (replay=correct).
Soundbite: "2PC isn't wrong — it's not good enough. The future of distributed transactions is either deterministic (strong semantics, high throughput) or admitted-as-Saga (weak semantics, business compensation). Sprinkling 2PC is tech debt."
4. "Read-your-writes" sounds simple, but breaks when "user writes on phone, reads on laptop seconds later" — session stickiness fails. Give 3 engineering solutions + their costs.
Cross-device consistency is a classic gotcha — session sticky usually routes by IP/cookie and doesn't survive a device switch.
Solution 1: Account-level last-write LSN (recommended)
- On every write, server updates users.last_write_lsn = max(current, new_lsn) in Redis / Postgres.
- Any device's read first looks up user.last_write_lsn; if replica lags behind, wait or route to primary.
- Cost: extra Redis lookup per read (<1ms). Degrades to "read primary" if Redis is down.
- Upside: clean cross-device semantics; account-level matches user mental model better than session-level.
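Solution 1 can be sketched as a small router. The dictionary below stands in for the Redis / Postgres lookup, and `ReadRouter`, `record_write`, and `route_read` are hypothetical names; LSNs are plain integers.

```python
from typing import Dict


class ReadRouter:
    """Routes reads using an account-level high-water mark of the user's writes."""

    def __init__(self) -> None:
        self.last_write_lsn: Dict[str, int] = {}   # stand-in for Redis

    def record_write(self, user_id: str, lsn: int) -> None:
        # max() keeps the mark monotonic even if acknowledgements arrive out of order.
        self.last_write_lsn[user_id] = max(self.last_write_lsn.get(user_id, 0), lsn)

    def route_read(self, user_id: str, replica_applied_lsn: int) -> str:
        required = self.last_write_lsn.get(user_id, 0)
        return "replica" if replica_applied_lsn >= required else "primary"


router = ReadRouter()
router.record_write("alice", lsn=120)   # order placed from the phone

# Seconds later the laptop reads; session state is irrelevant, only the account matters.
print(router.route_read("alice", replica_applied_lsn=100))   # primary
print(router.route_read("alice", replica_applied_lsn=125))   # replica
print(router.route_read("bob", replica_applied_lsn=100))     # replica
```

Waiting briefly for the replica to reach the required LSN is the alternative to the "primary" branch; which one to prefer depends on how much read load the primary can absorb.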
Solution 2: Client token + server wait (for high-real-time apps)
- Server returns LSN/HLC token on write; client stores locally.
- Reads carry token via SDK; cross-device with no token → fall back to read primary.
- Cost: cross-device degrades to strong read, primary pressure.
- Fit: mobile apps where single-device is primary use, cross-device is secondary.
Solution 3: Acknowledge lag in UX (product-side fix)
- Accept 1-2s lag across devices; cover with UI — "syncing…", pull-to-refresh, push notification to the other device.
- Turn inconsistency into a UX cue rather than a bug.
- Cost: UX surface; not acceptable for some domains (finance, medical).
- Fit: Notion / Apple Notes / Dropbox all do this — "eventual + sync indicator".
Combined: account-level LSN + UX fallback. The LSN handles 95% of cases (<1s readable); the remaining 5% (Redis slow, severe replica lag) gets the "syncing" UI. Pure technical 100% consistency is uneconomical.
Deeper insight: Cross-device RYW is really "how user identity spans sessions" — a user-identity-modeling problem, not a distributed-systems problem. The best solutions usually treat "user's current state" as a first-class entity (user.last_write_lsn) rather than letting each session manage its own. This generalizes to multi-tenant, cross-device sync, real-time collaboration.
5. Counter-intuitive: between "5 nines availability" and "strong consistency", why should the business often pick availability? Walk through a real "picked strong, got disaster" scenario.
Engineers instinctively pick strong — "data must be right". From the product view: user-perceived SLA = system availability × data correctness. If consistency causes frequent unavailability, the product is worse.
Real walkthrough: a social app's globally linearizable comments
- Pick: comments on Spanner-style global linearizable. Reason: "comment order can't be wrong".
- Normal operation: EU/Asia user post latency P99 = 300ms (cross-ocean quorum). Users feel "posting is slow", engagement -8%.
- Once or twice a year, us-east ↔ eu-west partitions (5-30 min):
- EU loses majority → EU users can't post.
- Users see error pages / loading spinners. Complaints flood.
- Support cost spikes, social-media backlash, DAU dips.
- Postmortem: "comment order mustn't be wrong" — users don't actually feel it. Timelines are already roughly time-sorted; 1-2 swapped causal events are imperceptible.
- Re-pick: causal+eventual. Both sides writable during partition; merge by HLC. Annual downtime drops from 60 minutes to 0; UX is fluid; occasional reorder is well within user tolerance.
Counter-intuitive logic:
- Consistency violations are usually soft harm (slightly wrong order, slightly stale data); unavailability is hard harm (nothing works).
- Users tolerate "consistency bugs" with massive patience ("refresh fixes it"); unavailability tolerance is near zero ("this app sucks").
- Strong consistency is the most expensive thing in distributed systems — it demands quorum sync, cross-region RTT, coordination overhead. Routinely costs an order of magnitude in throughput and latency.
When strong is still required:
- Money (accounts, billing, pricing);
- Uniqueness (usernames, order IDs, inventory);
- Critical state-machine transitions (paid vs unpaid);
- Config / permissions (errors here affect global behavior).
Soundbite: "Strong consistency pays for hard correctness constraints, not for engineers' peace of mind. Spanner and Cassandra aren't good vs bad — they're different fits. A senior architect picks per data domain, not 'whole system strong' or 'whole system eventual'."
This is also "be a super-individual" expressed at the system-design layer: replace "safe by default" with "precise by choice" — the mark of mature technical judgment.