Day 6 Hard Distributed Systems Consistency CAP / PACELC Causal

Consistency — It's Not a Switch, It's a Whole SpectrumLinearizable / Sequential / Causal / RYW / Monotonic / Eventual · CAP & PACELC · HLC & Version Vectors

Scenario & Constraints

Design a global e-commerce platform that mixes "flash sale inventory + product details + comments" — three regions (US/EU/AP), 500k DAU, peak 20k orders/sec, 200k product-view QPS, 50k comment R/W QPS. Three radically different consistency requirements live in one system:

This forces you to mix multiple consistency models in one system — which is what real systems look like. Day 5 covered how replication works; today is: which consistency guarantees are worth their latency & availability cost, which aren't, and how to implement them in practice.

High-Level Architecture

graph TD
    subgraph Edge["Edge / CDN"]
        CDN["Product Detail Page
eventual, TTL 60s"] end subgraph Core["Core (us-east, strong zone)"] INV["Inventory Service
Spanner / CockroachDB
Linearizable
"] ORD["Order Service
Postgres semi-sync
RYW via session token
"] end subgraph Comments["Comments Domain (Multi-Region)"] CMT_US["Comments us-east"] CMT_EU["Comments eu-west"] CMT_AP["Comments ap-south"] HLC[("HLC
Hybrid Logical Clock")] end Client["Client
carries session LSN / HLC"] Client -- "view product" --> CDN CDN -. miss .-> INV Client -- "order / decrement stock" --> INV Client -- "list my orders" --> ORD Client -- "post / read comments" --> CMT_US Client -- "post / read comments" --> CMT_EU Client -- "post / read comments" --> CMT_AP INV -. CDC .-> CDN ORD -. CDC .-> CDN CMT_US <-. async + HLC .-> CMT_EU CMT_EU <-. async + HLC .-> CMT_AP CMT_AP <-. async + HLC .-> CMT_US classDef strong fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef ryw fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef causal fill:#0e2030,stroke:#5eead4,color:#e8eef5 class INV strong class ORD ryw class CMT_US,CMT_EU,CMT_AP,HLC causal

Three consistency domains: inventory uses global linearizable (Spanner-class quorum), orders use single-region primary + client LSN, comments use multi-region async + HLC for causal order. Edge CDN absorbs 95% of pure display traffic.

Key Technical Points

1. The Consistency Spectrum: 6 Models, From "Absolutely Right" to "Eventually Aligned"

Principle: A consistency model defines "how far the order seen by clients can deviate from some ideal order". Strong to weak (each step weaker buys you performance/availability):

ModelPromiseImplementation CostTypical
LinearizableAll ops behave as if executed in wall-clock order at a single point; reads see the latest committed writeQuorum sync + global clock / consensusetcd, ZooKeeper, Spanner
SequentialAll replicas see the same op order (but not necessarily matching real time)Global serialization (leader / total order broadcast)Kafka single partition, Redis
CausalCausally-related ops are ordered; concurrent ops are not constrainedHLC / version vector + dep trackingMongoDB causal session, COPS
Read-your-writesRead your own write immediatelySession sticky or client LSN tokenMinimum bar for most web apps
Monotonic ReadReads don't "go back in time"Client pinning to one replica / high-watermarkTwitter timeline
EventualAll replicas converge after writes stopAsync replication + convergence function (LWW / CRDT)DNS, S3 list, Cassandra default
Key insight: "strong consistency" is not the default — it must be justified to be chosen.
# Differences users can't see vs differences they immediately complain about
# Can't see (yet engineers fight for):
#   linearizable vs sequential in a single datacenter is imperceptible
# Will complain instantly:
#   - "I just ordered, can't see the order"  → violates RYW
#   - "I saw 5 comments, now only 3"          → violates monotonic read
#   - "Reply appears before the question"     → violates causal (consistent prefix)
#   - "100 items sold 102 times"              → violates linearizable

# Engineering priority (must-have → optional):
#   linearizable[critical resources]  ≫  RYW + monotonic + causal[UX]  ≫  eventual[fallback]
Real cases:

2. CAP and PACELC: "Pick Two" Is Misleading; "Two of Four" Is Closer to Truth

Principle: The CAP theorem (Brewer 2000, proved by Gilbert & Lynch 2002) says: during a network partition (P), you must choose between consistency (C) and availability (A). But this only describes behavior during a partition — and partitions are rare. It says nothing about the rest of the time.

PACELC (Abadi 2012) fills the other half: during Partition pick A or C; Else (normal time) pick Latency or Consistency. Four full combinations:

TypePartitionNormalTypical
PA / ELKeep availabilityLow latency (weak consistency)Cassandra, DynamoDB, Riak
PA / ECKeep availabilityStrong consistencyMongoDB default (writeConcern majority)
PC / ELKeep consistency (unavailable)Low latencyRare — if you want C in partition you usually want it normally too
PC / ECKeep consistencyStrong consistencySpanner, CockroachDB, etcd, HBase
Interview soundbite: "CAP is the partition-time pick; PACELC tells you what 99% of the time should look like."

Real production picks are two: Anti-pattern: defaulting to PC/EC for non-critical data. Storing "social comments" in etcd is engineering disaster.
# Common CAP misconceptions, debunked
# ❌ "CAP forces me to give up one"  — wrong. P is not optional, it's physical fact.
#    Real choice: when P happens, pick A or C.
# ❌ "We chose AP"  — too coarse. AP further splits:
#    - read AP, write CP (DynamoDB ConsistentWrite + EventualRead)
#    - per-key configurable (Cassandra LOCAL_QUORUM)
# ❌ "CAP is 2 of 3"  — wrong, it's 1 of 2 during P; otherwise there's also E (latency vs C).
# ❌ "Strong = high latency"  — only cross-region. Single-region Raft quorum is often < 5ms.
# ❌ "Eventual = data is lost"  — wrong. Eventual means "temporarily inconsistent, eventually converges".
#    Data loss is durability, not consistency.
Real cases:

3. Causal Consistency: HLC and Version Vectors

Principle: Causal consistency guarantees that "causally related operations" are seen in the same order by all replicas. Formal definition from Lamport's 1978 happens-before relation (a → b iff a and b are in the same process and a precedes b, or a is a message b received).

Two main implementation tools:

sequenceDiagram
    participant U as Alice (us-east)
    participant US as Comments US
    participant EU as Comments EU
    participant V as Bob (eu-west)

    Note over U,V: Causal scenario: Bob replies to Alice
    U->>US: post "Looks good?" HLC=(t1, 0)
    US->>EU: replicate (t1, 0)
    EU->>V: read shows "Looks good?"
    V->>EU: reply "Yes!" with deps=[(t1,0)] HLC=(t2, 0)
    Note over EU: t2 > t1 preserves causality
    EU->>US: replicate reply with deps
    Note over US: receive reply, check deps
if (t1,0) not yet arrived → buffer US-->>U: now shows — "Looks good?" first, then "Yes!" ✓ Note over U,V: Wrong path (no HLC, just async): Note over US: reply may arrive before question
users see reply before question ✗
# Hybrid Logical Clock (HLC) — simplified pseudocode
class HLC:
    def __init__(self):
        self.l = 0      # logical (max wall time seen so far)
        self.c = 0      # counter to break ties

    def now(self):     # local event (e.g. write)
        pt = physical_time_ms()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote_l, remote_c):  # receiving msg
        pt = physical_time_ms()
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

# Properties:
#  - Monotonic (no rollback on local or remote events)
#  - Preserves happens-before: a→b ⇒ HLC(a) < HLC(b)
#  - Close to wall clock (diff bounded by max clock skew)
#  - Constant size (8 bytes), not O(N) like vector clocks
Real cases:

4. Engineering: Pick Consistency per Data Type Within One System

Principle: Real systems never use a single consistency model. Layer by "cost of violation" is the core methodology.

Layered decision (driven by SLO/business loss):
# Mixing consistency per business field on the same request path (pseudocode)

# Scenario: user clicks "Buy now" on a limited-edition product
def place_order(user, sku, qty):
    # Step 1: Strong — decrement stock. Must be linearizable.
    #   Spanner / CockroachDB, global quorum, ~30ms is acceptable.
    reserved = inventory_db.atomic_decrement(sku, qty)   # CP
    if not reserved:
        return "Sold out"

    # Step 2: RYW — write order. Session token so user sees it immediately.
    order = orders_db.insert(user, sku, qty)             # primary write
    session.set_lsn(order.lsn)                            # client carries it back

    # Step 3: Eventual — recommendations & popularity counter
    fan_out_async("user.purchased", {user, sku})         # Kafka → reco/search
    increment_counter_eventual(sku + ":popular", 1)      # Redis, weak is fine

    # Step 4: Causal — notifications (downstream actions must preserve causality)
    notify_with_hlc(user, "order_placed", deps=[order.hlc])

    return order

# Key insight: ONE action — "Buy now" — has FOUR consistency modes
# because business needs differ.
# All-Spanner: slow. All-Cassandra: oversells.
Real cases:

Extensions & Optimizations

Common Pitfalls & Interview Questions

1. Strong consistency as default: jumping straight to Spanner / Raft, eating 200ms cross-region writes. Ask first: "what's the cost of inconsistency?", then pick.
2. Confusing ACID with strong consistency: ACID is a single-db transaction guarantee; distributed "C" is a cross-node guarantee. Postgres has ACID on one machine but may be eventual across replicas — don't mix them.
3. Wall-clock ordering: clocks drift 1-100ms in distributed systems; NTP doesn't guarantee monotonicity. Use HLC / Lamport / version vector, never time.time().
4. Thinking quorum (W+R>N) = linearizable: Dynamo-style quorum with sloppy quorum + hinted handoff can break linearizability. True linearizability needs read-repair + leader or consensus (Raft/Paxos).
5. Client cache breaks monotonic read: browser cached an older page, refresh shows stale data — looks like the system went backward. The frontend is part of the consistency domain too; mind cache busting / ETag.

Likely interview follow-ups:

  1. In 30 seconds, explain the difference between linearizable, sequential, and causal, with a violation scenario for each.
  2. Your system needs "no oversold inventory + ordered comments + fast recommendations" — sketch the consistency choice for each subsystem and why.
  3. CAP is often called "outdated". What does PACELC solve? Why is 99% of time spent in E, not P?
  4. Why is Hybrid Logical Clock better than Lamport timestamps and Vector Clocks? Size, readability, causal detection?
  5. How can Spanner claim "global linearizable"? How does TrueTime's ε affect commit wait? What if you don't have GPS clocks (CockroachDB)?

Deep Resources

Deep Reflection

1. For a single "checkout", inventory is Linearizable, order is RYW, recommendation is Eventual. If a PM says "let's also make orders Linearizable, just to be safe", how do you argue back? Give numbers.

Classic "consistency as comfort blanket" anti-pattern. You need to quantify the cost — not just say "it's engineering complexity", which never lands.

Step 1: What does "order" actually need?

  • User writes order, must see it immediately → RYW (session token suffices)
  • Order state must not regress → Monotonic Read (same)
  • Payment, shipping, refund for one order stay causal → Causal (HLC suffices)
  • Must "all orders across users" be globally ordered? No. User A sees A's orders in order, B sees B's, mutually independent.

So Linearizable is over-promising — it requires "a global observer to see all orders in wall-clock order", a guarantee the business does not consume.

Step 2: Quantify cost. Today: Postgres semi-sync primary (PA/EC, cross-AZ ~5ms commit), write P50=8ms / P99=25ms. Switch to Spanner-style global linearizable:

  • Each write needs cross-region majority + TrueTime commit wait (~7ms). P50 → ~80ms (cross-ocean RTT), P99 → ~200ms.
  • 20k QPS at peak → user perception goes from "instant" to "laggy"; checkout conversion historically drops 5-15% (Amazon's "100ms = 1% revenue").
  • Infra cost: a Spanner-class DB + global presence is 5-10× a single-region Postgres.
  • R&D: cross-region strong writes + idempotency + retry design is far more complex; incident diagnosis harder.

Step 3: Real solution to PM's "safety" concern:

  1. Order writes to primary Postgres, semi-sync with ≥1 replica ACK → RPO ≈ 0.
  2. Client carries LSN token → RYW + monotonic.
  3. State transitions (pending→paid) use idempotency keys; event stream via Outbox → 0 data loss.
  4. Reconciliation job: order totals = inventory out × price ± refunds. Alert on any drift.

Punchline: "Order system safety doesn't come from a stronger consistency model — it comes from a correct failure model + reconciliation. Linearizable doesn't solve 'lost orders' — that's a durability problem, not a consistency problem." Once you separate the terms, PMs usually get it.

2. Spanner's TrueTime ε is ~7ms. If GPS is jammed and ε spikes to 200ms, what happens? Why is this "safe degradation" rather than "wrong"?

This tests understanding of the commit-wait mechanism. Spanner's external consistency relies on: after assigning commit timestamp t to a transaction, wait until real time has passed t+ε before making the result visible. That guarantees "all external observers see transactions in real-time order".

When ε spikes to 200ms:

  • Every transaction waits ~200ms more. Throughput drops sharply, P99 latency spikes.
  • But correctness is preserved — the key point. Spanner picks "slow but right" (PC/EC); it does not relax linearizability when ε grows.
  • Monitoring fires, ops intervene; once GPS recovers, ε returns to 7ms and performance recovers naturally.

Why this is safe degradation: ε is "upper bound on clock uncertainty"; Spanner always assumes worst case. Larger ε is just "we acknowledge clocks are off", commit wait backstops semantics. This is totally different from "clock is wrong and system doesn't know" — the latter would break consistency.

Compare CockroachDB (no TrueTime): uses HLC + max_offset (default 500ms) as bound; transactions beyond max_offset are aborted and retried. No GPS needed but max_offset is set conservatively, so cross-region latency is higher. Classic "engineering compromise vs hardware investment" trade-off.

Deeper insight: Spanner converts "clock uncertainty" into "wait time" — transforming a seemingly unsolvable physics problem (clock sync) into a quantifiable engineering resource (waiting). That's distributed systems design aesthetics at its peak: don't fight physics, fold it into the abstraction. HLC is the cheaper version of the same idea (logical counter compensates for clock skew).

So the "right answer" isn't "Spanner is broken"; it's "Spanner trades speed for correctness, which validates its philosophy".

3. Cross-chapter: Day 4 (Sharding) + Day 5 (Replication) + Day 6 (Consistency). For a sharded multi-replica OLTP, how do you get strong consistency across shards? Why is 2PC inefficient here and why is deterministic transaction trending?

The core problem of distributed databases. Stack the three dimensions:

  • Within a shard: multi-replica using Raft/Paxos for linearizable (CockroachDB, TiKV, Spanner).
  • Across shards: a transaction may touch multiple shards (e.g., "transfer $100 from A to B" — A, B in different shards).

Classical: 2PC (Two-Phase Commit)

  • Phase 1: coordinator asks all shards to prepare (lock rows, write prepare log)
  • Phase 2: all yes → commit, any no → abort
  • Problems: (a) coordinator is a SPOF; (b) prepared rows are locked, blocking other ops; (c) if coordinator dies between phases → participants blocked; (d) 2 RTT × N shards = high latency.

Modern improvements:

  1. Spanner / CockroachDB: 2PC + Paxos — coordinator role is itself a Paxos group (HA); each shard is also a Paxos group. Still 2PC, but participants are HA — slow but not blocked.
  2. Calvin / FaunaDB: deterministic transactions — pre-order all transactions via a global sequencer (log); every node executes in the same order → no locks, no 2PC. Pros: extreme throughput. Cons: writes wait for global log fan-in; transactions must be deterministic (no randomness, no external I/O).
  3. Saga (microservices favorite): split into local transactions + compensations. Not ACID, but "eventual + compensation". Cost: business must design compensation logic. Day 7 expands.

Why deterministic is the trend:

  • Cloud-era "global sequencers" got cheap (Kafka, Raft log); single-coordinator is no longer the bottleneck.
  • 2PC's root issue (locking + blocking) cannot be optimized away — only displaced by a new paradigm.
  • Determinism gives free replay / recovery / cross-region replication (replay=correct).

Soundbite: "2PC isn't wrong — it's not good enough. The future of distributed transactions is either deterministic (strong semantics, high throughput) or admitted-as-Saga (weak semantics, business compensation). Sprinkling 2PC is tech debt."

4. "Read-your-writes" sounds simple, but breaks when "user writes on phone, reads on laptop seconds later" — session stickiness fails. Give 3 engineering solutions + their costs.

Cross-device consistency is a classic gotcha — session sticky usually routes by IP/cookie and doesn't survive a device switch.

Solution 1: Account-level last-write LSN (recommended)

  • On every write, server updates users.last_write_lsn = max(current, new_lsn) in Redis / Postgres.
  • Any device's read first looks up user.last_write_lsn; if replica lags behind, wait or route to primary.
  • Cost: extra Redis lookup per read (<1ms). Degrades to "read primary" if Redis is down.
  • Upside: clean cross-device semantics; account-level matches user mental model better than session-level.

Solution 2: Client token + server wait (for high-real-time apps)

  • Server returns LSN/HLC token on write; client stores locally.
  • Reads carry token via SDK; cross-device with no token → fall back to read primary.
  • Cost: cross-device degrades to strong read, primary pressure.
  • Fit: mobile apps where single-device is primary use, cross-device is secondary.

Solution 3: Acknowledge lag in UX (product-side fix)

  • Accept 1-2s lag across devices; cover with UI — "syncing…", pull-to-refresh, push notification to the other device.
  • Turn inconsistency into a UX cue rather than a bug.
  • Cost: UX surface; not acceptable for some domains (finance, medical).
  • Fit: Notion / Apple Notes / Dropbox all do this — "eventual + sync indicator".

Combined: account-level LSN + UX fallback. The LSN handles 95% of cases (<1s readable); the remaining 5% (Redis slow, severe replica lag) gets the "syncing" UI. Pure technical 100% consistency is uneconomical.

Deeper insight: Cross-device RYW is really "how user identity spans sessions" — a user-identity-modeling problem, not a distributed-systems problem. The best solutions usually treat "user's current state" as a first-class entity (user.last_write_lsn) rather than letting each session manage its own. This generalizes to multi-tenant, cross-device sync, real-time collaboration.

5. Counter-intuitive: between "5 nines availability" and "strong consistency", why should the business often pick availability? Walk through a real "picked strong, got disaster" scenario.

Engineers instinctively pick strong — "data must be right". From the product view: user-perceived SLA = system availability × data correctness. If consistency causes frequent unavailability, the product is worse.

Real walkthrough: a social app's globally linearizable comments

  • Pick: comments on Spanner-style global linearizable. Reason: "comment order can't be wrong".
  • Normal operation: EU/Asia user post latency P99 = 300ms (cross-ocean quorum). Users feel "posting is slow", engagement -8%.
  • Once or twice a year, us-east ↔ eu-west partitions (5-30 min):
    • EU loses majority → EU users can't post.
    • Users see error pages / loading spinners. Complaints flood.
    • Support cost spikes, social-media backlash, DAU dips.
  • Postmortem: "comment order mustn't be wrong" — users don't actually feel it. Timelines are already roughly time-sorted; 1-2 swapped causal events are imperceptible.
  • Re-pick: causal+eventual. Both sides writable during partition; merge by HLC. Annual downtime drops from 60 minutes to 0; UX is fluid; occasional reorder is well within user tolerance.

Counter-intuitive logic:

  1. Consistency violations are usually soft harm (slightly wrong order, slightly stale data); unavailability is hard harm (nothing works).
  2. Users tolerate "consistency bugs" with massive patience ("refresh fixes it"); unavailability tolerance is near zero ("this app sucks").
  3. Strong consistency is the most expensive thing in distributed systems — it demands quorum sync, cross-region RTT, coordination overhead. Routinely costs an order of magnitude in throughput and latency.

When strong is still required:

  • Money (accounts, billing, pricing);
  • Uniqueness (usernames, order IDs, inventory);
  • Critical state-machine transitions (paid vs unpaid);
  • Config / permissions (errors here affect global behavior).

Soundbite: "Strong consistency pays for hard correctness constraints, not for engineers' peace of mind. Spanner and Cassandra aren't good vs bad — they're different fits. A senior architect picks per data domain, not 'whole system strong' or 'whole system eventual'."

This is also "be a super-individual" expressed at the system-design layer: replace "safe by default" with "precise by choice" — the mark of mature technical judgment.