Mental Models: Distributed Systems Thinking

CAP Theorem

"Of Consistency, Availability, and Partition-tolerance, you can have at most two." — Eric Brewer, 2000

In Depth

When nodes lose contact with each other (a "partition," P), a system must choose between Consistency (every node sees the same latest data) and Availability (every request gets an immediate response). Reading CAP as "pick 2 of 3" is the novice error — partitions are physically inevitable in any system spread across space (networks will fail), so P isn't something you can give up. The expert reading is one sentence: when a partition happens, do you pick C or A?

Non-trivial: (1) this isn't an engineering quirk — it's a fundamental constraint of information propagating at finite speed, governing any "coordination across space," including human organizations, cross-timezone teams, even signal-delayed neural systems. (2) CP and AP aren't labels for a whole system but per-operation tradeoffs: a bank balance picks C (better to refuse than be wrong), a like-count picks A (a few seconds of lag is fine). (3) Nearly every coordination disaster comes from picking the wrong side — forcing C where A was needed (everything stalls the moment something disconnects), or grabbing A where C was needed (corrupted data).

Practice: for any coordination problem, split two questions first — "What's the partition here, who might lose contact with whom?" and "Does this operation need correctness, or does it need a response?" Classify first, then design.

CAP: P is a premise, not an option — what's left is the C-vs-A trade under partition

Classic example

An ATM: when it's cut off from the bank's core network (partition), it picks A over C — it still lets you withdraw, just with a cap, and reconciles later. That's a business decision: "the revenue from availability > the loss from an occasional overdraft." Here CAP isn't a technical dogma but a value tradeoff with an explicit price tag.

BigCat scenario

(1) Designing an AI agent system with shared state: globally locking and syncing on every step (strong consistency) gets too slow to be usable; letting agents run and reconcile periodically (eventual consistency) is fast but risks brief conflicts. Split by operation: money or irreversible actions pick C, drafts and exploration pick A. (2) Family decisions are isomorphic — "everything must be agreed in real time by both parents" runs the home as a CP system; when one parent is absent (a partition), the kid just waits. Healthier: high-stakes decisions pick C (wait to align), daily small stuff picks A (whoever's present decides, sync later). Design the family as a partition-tolerant system.

AI Prompt

English Prompt

I face a coordination problem: [describe the system/team/decision]. Use CAP to break it down: 1. What exactly is the "partition" here — who can lose contact with whom, and when does information fall out of sync? 2. List 3 key operations; for each, decide Consistency vs Availability and justify it. 3. Point out where I'm most likely picking the wrong side (forcing C where A fits, or vice versa) and propose a fix.

Eventual Consistency

"In the absence of new writes, all replicas eventually converge to the same state."

In Depth

Give up "all replicas agree at every instant" (strong consistency) for "once writes stop, the system eventually converges." This is exactly how you cash in the A choice from CAP. Its essence isn't abandoning consistency; it's relaxing consistency from a point-in-time guarantee to an interval guarantee — buying enormous availability and scalability in return.

Non-trivial: (1) "eventually" doesn't happen by itself — you must design a convergence mechanism: how conflicts are resolved (last-write-wins / version vectors / conflict-free data types, CRDTs). "Eventual consistency" without a resolution rule is just "never consistent." (2) The cost is exposed inconsistency during the convergence window (a post you just made isn't visible to friends yet) — acceptability depends on business semantics. (3) The deep point: strong consistency is itself an expensive illusion. Reality is already eventually consistent — light takes time to travel, so the starlight you see now is from the past. Insisting on instant consistency everywhere is fighting physics. This is isomorphic to the Buddhist view of impermanence and dependent origination: there's no "global truth of the present moment," only local states that propagate and eventually converge.

Practice: stop chasing "real-time sync for the whole team or family" — that obsession grinds everyone down. Instead design convergence points (a weekly review, a nightly family sync) plus a clear tie-breaker rule (who decides), and deliberately tolerate temporary inconsistency inside the window.

Classic example

DNS: change one resolution record and global caches won't update instantly — it takes minutes to tens of hours to become "eventually consistent." The entire Internet's naming system is built on eventual consistency, because a strongly-consistent global DNS simply couldn't scale. Available and scalable, at the cost of a brief inconsistency window — a trade the whole Internet accepts.

BigCat scenario

(1) Note-taking across devices (phone + laptop + cloud): don't demand every edit sync to all devices in real time (strong consistency throws constant conflict errors); accept eventual consistency with a clear conflict rule (e.g., "last edit wins"). (2) The most common family arguments are, at root, "assuming you're running a strongly-consistent system when you're actually eventually consistent, with no convergence mechanism installed" — you and your partner don't need real-time agreement on "who's picking up the kid today," but you do need a convergence point like "sync tomorrow's plan at 9pm every night." Install the convergence point and most coordination conflicts vanish.

AI Prompt

English Prompt

Coordinating [team/family/multi-device/multi-agent] keeps causing conflicts from being out of sync. Use eventual consistency to design a fix: 1. Which states don't actually need strong consistency and can be relaxed to eventual? 2. Specify one concrete convergence mechanism: when to sync, and what rule resolves conflicts. 3. Name the inconsistencies exposed during the convergence window, and how to make them acceptable in practice.

Idempotency

"f(f(x)) = f(x) — applying it twice is the same as once."

In Depth

Idempotent: applying an operation once or many times has the identical effect on system state. "Set x to 5" is idempotent; "increment x by 1" is not. It's the bedrock of distributed fault tolerance — over an unreliable network you can never be sure a request actually arrived (the response itself can be lost), so the only safe retry strategy is to make retrying itself harmless.

Non-trivial: (1) rather than chase "exactly-once" (near-impossible in distributed systems), design "at-least-once + idempotent" — together they're equivalent to exactly-once, and far simpler. (2) The mechanism is an idempotency key: tag each operation with a unique ID and ignore duplicates. (3) The deepest layer: idempotency is a design philosophy of converting uncertainty into safety — you don't eliminate duplication, you make duplication not matter. This is the fundamental stance for dealing with an unreliable world, isomorphic to biological robustness (immune memory and DNA repair don't compound damage from repeated stimuli).

Practice: for any process that might be triggered more than once (reminders, transfers, deployments, even the rules you set for a child), ask first: "what happens if this repeats?" Designing key actions to be idempotent makes fault-tolerance cost plummet.

Classic example

Online payment: you hit "pay," the network stalls, you hit it again. Idempotent design (using the order number as an idempotency key) guarantees you're charged only once. A payment system without idempotency is a double-charge disaster machine — which is why every serious payment API (e.g., Stripe's idempotency key) mandates idempotency.

BigCat scenario

(1) When an AI agent retries an external API call, if that API isn't idempotent (e.g., "create a new record"), the retry produces duplicate data; the fix is to attach an idempotency key, or rewrite the operation as an upsert (update if exists, else create). (2) Parenting rules should be idempotent too — "remind the child to put toys away" should drive toward the same state (toys away) whether said once or three times, rather than escalating emotion with each repeat. A non-idempotent rule accumulates conflict and erodes the relationship on repeat; an idempotent rule keeps its meaning no matter how many times you say it.

AI Prompt

English Prompt

I have a process that may be triggered more than once: [describe the operation/API/rule]. Run an idempotency audit: 1. Is this operation naturally idempotent? If not, what breaks when it repeats? 2. Give one concrete way to make it idempotent (idempotency key / rewrite as upsert / state check). 3. Combine it with at-least-once retries and explain why the pair is equivalent to exactly-once.

Backpressure

"When the downstream can't keep up, push 'slow down' back upstream — don't drop, don't crash."

In Depth

Backpressure: when the downstream (consumer) can't process as fast as the upstream (producer), the downstream signals "slow down" so the producer throttles, instead of letting data pile up without bound until the system collapses. It's fundamentally a negative-feedback control loop (the same family as feedback loops in systems thinking): downstream congestion is fed back to constrain the upstream rate.

Non-trivial: (1) a system without backpressure under overload doesn't slow down — it collapses catastrophically: the queue grows unbounded → memory is exhausted → the whole system dies, far worse than merely being slow. Backpressure's value is converting catastrophic failure into graceful degradation. (2) The counterintuitive core: deliberately slowing down (queuing, throttling, rejecting) actually sustains higher throughput — greedily accepting everything drives throughput to zero after the crash. (3) Three tactics for overload: buffer (treats the symptom), drop (when acceptable), backpressure (the cure). The nervous system uses backpressure too — synaptic inhibition and sensory gating keep the brain from being flooded by input.

Practice: spot the "unbounded queues" in your life and work — the ever-growing to-do list, the inbox, the child's activity schedule. An unbounded queue will eventually avalanche. Install backpressure: set an explicit intake limit, and when it's full, push back (reject or defer) rather than silently absorbing until you break.

Backpressure = a negative-feedback loop carrying downstream congestion back upstream

Classic example

TCP flow control (the sliding window): the receiver continuously tells the sender "this much buffer remains," and the sender paces itself accordingly. This is why the Internet doesn't collapse when one fast server overwhelms a slow client — backpressure is welded directly into the protocol. Toyota's Kanban is the physical version of the same idea: if downstream sends no card, upstream doesn't produce.

BigCat scenario

(1) AI agent pipeline: an upstream agent floods a downstream executor with tasks; if the downstream is slow, its queue explodes. The fix is to bound the downstream queue and, when full, have the upstream pause production or drop low-priority tasks. (2) As someone pursuing the "AI super-individual," AI makes your "upstream" (things you could do) nearly infinite — without backpressure your to-do list grows unbounded and eventually avalanches as burnout. Install backpressure: set a daily/weekly intake capacity and, when full, push back explicitly (say "no," or defer to next week). Saying "no" isn't a character flaw — it's flow control.

AI Prompt

English Prompt

I have a component that overloads: [describe the system/process/my to-do list or attention]. Use backpressure to design a fix: 1. Identify the "unbounded queue" — what piles up here with no cap? 2. Design a concrete backpressure mechanism: where to set the intake limit, and how to push back when full (reject / defer / drop low-priority). 3. Compare "greedily accept everything" vs "deliberately slow down" and estimate the sustainable throughput of each.