Mental Models: Information-Theoretic Thinking

Shannon Entropy

"Information is the resolution of uncertainty." — Claude Shannon

In Depth

Entropy isn't "disorder"; it's the average surprise of a probability distribution: the less certain you are about an outcome, the higher the entropy. Shannon's H = −Σ p·log₂ p quantifies, on average, how much uncertainty is resolved when the outcome is revealed. With a base-2 logarithm the unit is the bit: 1 bit = the information needed to cut a set of equally likely possibilities in half.
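The formula above can be sketched in a few lines of Python. The distributions below are made-up illustrations, and `entropy_bits` is a hypothetical helper name, not something from the source.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)) in bits; zero-probability terms contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]      # illustrative distributions, not data from the text
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6

print(entropy_bits(fair_coin))    # 1.0 bit
print(entropy_bits(biased_coin))  # about 0.469 bits
print(entropy_bits(fair_die))     # about 2.585 bits, i.e. log2(6)
```

The biased coin is more predictable than the fair one, so revealing its outcome resolves less uncertainty on average.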

Non-trivial: (1) information ≠ meaning. Shannon deliberately stripped away semantics: a coin flip and a life-or-death decision can both be exactly 1 bit. Entropy measures "how much uncertainty was removed," not "how important the content is." (2) Surprise = −log p: rarer events carry more information. "The sun rose again" has near-zero information; "there was a solar eclipse today" has a lot. This is also why news naturally favors low-probability events, and why high information ≠ high value. (3) For a fixed set of outcomes, entropy is maximal under a uniform distribution: you're most uncertain when you know nothing about the outcome. (4) It has the same mathematical form as thermodynamic entropy: the Gibbs entropy −k·Σ p·ln p is Shannon's H up to a constant factor, and Boltzmann's S = k·ln W is the special case of W equally likely microstates. (5) It's closely tied to what predictive-processing accounts say the brain minimizes: perception constantly predicts the next input, and the "surprise" of an input under the brain's model is −log p.
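Point (2) above, surprise = −log p, is easy to check numerically. The probabilities below are invented stand-ins for "the sun rose" and "there was an eclipse", chosen only to show the scale of the effect.

```python
import math

def surprisal_bits(p):
    """Information content of a single event with probability p, in bits."""
    return -math.log2(p)

print(surprisal_bits(0.5))    # 1.0 bit: a fair coin flip
print(surprisal_bits(0.999))  # about 0.0014 bits: a near-certain event (hypothetical p)
print(surprisal_bits(0.001))  # about 9.97 bits: a rare event (hypothetical p)
```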

Practical test: to judge whether a piece of information has value, ask "by how much did it lower my uncertainty about something?" If your internal probability estimate didn't budge after reading a report, its information content was zero — no matter how long it was.
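One way to put a number on "did my estimate budge?" is the Kullback-Leibler divergence between the belief after and the belief before. This is a sketch under that assumption, not the only possible measure, and the belief vectors are hypothetical.

```python
import math

def kl_bits(posterior, prior):
    """KL divergence D(posterior || prior) in bits: how far a belief moved."""
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)

prior = [0.5, 0.5]      # hypothetical belief before reading the report
unchanged = [0.5, 0.5]  # belief did not move
shifted = [0.9, 0.1]    # belief moved substantially

print(kl_bits(unchanged, prior))  # 0.0: the report carried no information for you
print(kl_bits(shifted, prior))    # about 0.531 bits
```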

Binary entropy: zero when fully predictable (p=0 or 1), maximal when most uncertain (p=0.5)

Classic example

The redundancy of English text. Letter frequencies are wildly uneven (e is very frequent, z very rare), and once spelling and grammar constraints are taken into account, each English letter carries only about 1 bit (Shannon's own estimates ran from roughly 0.6 to 1.3 bits), far below the log₂ 26 ≈ 4.7 bits of 26 equiprobable letters. That's exactly why you can read a mutilated sentence ("the weather today is ___"): language is inherently redundant, and redundancy is noise resistance.
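The arithmetic behind those figures can be sketched as follows, taking the rough 1-bit-per-letter value from the paragraph above as given.

```python
import math

max_bits = math.log2(26)   # 26 equiprobable letters
actual_bits = 1.0          # rough per-letter figure quoted in the text
redundancy = 1 - actual_bits / max_bits

print(round(max_bits, 2))    # 4.7
print(round(redundancy, 2))  # 0.79
```

On these numbers, roughly four fifths of each letter is redundancy that a reader can use to repair damage.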

BigCat scenario

(1) Evaluating meetings and documents: if after a two-hour meeting your probability judgment about the decision hasn't changed, the meeting's information contribution was zero: pure ritual. (2) AI: an LLM's perplexity is the exponential of its cross-entropy on the text, i.e. the model's "average surprise" at the next token; low perplexity = low uncertainty about language. (3) Parenting: a child's "I don't know" may be an honest high-entropy signal; forcing a confident, low-entropy answer manufactures false information instead. See how large the uncertainty is first, then decide whether to rush to remove it.
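The perplexity relation in item (2) can be made concrete with a minimal sketch. The token probabilities are invented; in practice they would be the probabilities a model assigned to the tokens that actually occurred.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

confident = [0.5, 0.5, 0.5, 0.5]      # hypothetical: model gave each true token p = 0.5
uncertain = [0.01, 0.01, 0.01, 0.01]  # hypothetical: model gave each true token p = 0.01

print(perplexity(confident))  # about 2: like choosing between 2 equally likely tokens
print(perplexity(uncertain))  # about 100: like choosing among 100
```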

AI Prompt

English Prompt

Help me assess the information value of [a report / meeting / data source] through the lens of Shannon entropy: 1. Before encountering it, what was my belief distribution over [the key question]? After? 2. Estimate how much uncertainty it actually removed (high / medium / zero). 3. If the information content is near zero, say whether it's pure ritual or mere confirmation of the known, and propose one higher-information alternative source.

Channel Capacity

Shannon's 1948 noisy-channel coding theorem — reliable communication can be approached, but there's a hard wall

In Depth

Every noisy channel has a capacity ceiling C: transmit below it and coding can drive the error rate arbitrarily low; cross it and no code, however clever, can remove the errors. This is Shannon's most counterintuitive result — near-perfect communication over an unreliable channel is possible, but there's a hard wall.
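As one concrete case of such a ceiling, the binary symmetric channel (each bit flipped independently with some probability) has capacity C = 1 − H(p) bits per transmitted bit, where H is the binary entropy. The channel model is chosen here purely for illustration; the text does not commit to one.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity of a binary symmetric channel, in bits per channel use."""
    return 1 - binary_entropy(flip_prob)

print(bsc_capacity(0.0))   # 1.0: a noiseless channel
print(bsc_capacity(0.11))  # about 0.5: half of every transmitted bit must be redundancy
print(bsc_capacity(0.5))   # 0.0: output is independent of input, nothing gets through
```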

Non-trivial: (1) it's a sharp phase transition, not gradual decay. Below C, the error rate can be driven toward 0; above C, it's bounded away from zero no matter the code. Many engineering "can we squeeze a bit more?" questions actually have black-and-white answers. (2) Reliability is bought with redundancy + latency: to approach C with vanishing error, code blocks must be long, so latency rises. You cannot have a rate near capacity, near-zero error, and zero latency at once. (3) For a band-limited channel with Gaussian noise, the Shannon–Hartley formula C = B·log₂(1 + S/N) gives two knobs, bandwidth B and signal-to-noise ratio S/N, but the gain from SNR is only logarithmic, so brute-forcing power has fast-diminishing returns. (4) "Channel" transfers to any noise-limited transport: teaching, organizational communication, and human-AI collaboration each have their own capacity wall.
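The diminishing returns in point (3) show up directly when the formula is evaluated with a base-2 logarithm. The 1 MHz bandwidth and the SNR values are arbitrary illustrative choices.

```python
import math

def capacity_bps(bandwidth_hz, snr_linear):
    """Capacity C = B * log2(1 + S/N) in bits per second (SNR as a plain ratio, not dB)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

B = 1_000_000  # hypothetical 1 MHz channel
for snr in (1, 3, 7, 15):
    print(snr, capacity_bps(B, snr))
# SNR 1 -> 1 Mbit/s, 3 -> 2 Mbit/s, 7 -> 3 Mbit/s, 15 -> 4 Mbit/s
```

Each extra megabit per second costs roughly a doubling of signal power, which is why "more power" runs out of steam quickly.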

Practice: when communication keeps failing, first separate "is the content wrong, or has the channel exceeded capacity?" Cramming 10 decision points into one meeting = far over the capacity of human working memory, so packets get dropped. The fix isn't speaking louder (more power) but lowering the rate (one thing at a time) or adding redundancy (re-explain from another angle).

Crossing capacity C is a sharp wall: inside, error can be driven arbitrarily low; outside, it's locked in

Classic example

The Voyager probes, billions of kilometers out with a signal nearly buried in noise, still returned crisp photos, thanks to error-correcting codes (data bits wrapped in carefully designed redundancy). The channel is terrible, yet as long as the rate stays below capacity, the image can be recovered almost perfectly. That's the "inside-the-wall" miracle.
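The trade between redundancy and error rate can be illustrated with the simplest possible code: send each bit n times and take a majority vote. This is not the code Voyager used, and it is far from capacity-achieving (its rate 1/n shrinks toward zero), but it shows the direction of the effect. The 10% flip probability is an assumption.

```python
from math import comb

def majority_error(n, p):
    """Probability that a majority vote over n noisy copies of one bit decodes wrongly (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

p = 0.1  # hypothetical per-bit flip probability
for n in (1, 3, 5, 9):
    print(n, 1 / n, majority_error(n, p))  # copies, rate, residual error
# n=1 -> 0.1, n=3 -> 0.028, n=5 -> about 0.0086; error falls as the rate falls
```

Better codes reach low error without letting the rate collapse, which is exactly what the capacity theorem promises is possible below C.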

BigCat scenario

(1) Teaching a child: her attention + vocabulary form a narrow channel. Three new concepts at once = over capacity, all lost. The way to approach capacity is redundancy: repeat the same point via story, drawing, and hands-on practice, not by talking faster. (2) Human-AI collaboration: the context window is the channel, and prompt engineering is essentially "channel coding": encoding your intent so it decodes correctly under the model's noise. (3) Teams: two teams that "only talk in meetings" form a low-bandwidth channel (echoing Conway's Law, D33); forcing heavy collaboration through it inevitably grows thick defensive interfaces. Estimate capacity first, then decide how much to push through.

AI Prompt

English Prompt

I keep failing at [a communication / teaching / collaboration setting]. Diagnose it via channel capacity: 1. What limits this "channel's" capacity (attention / working memory / bandwidth / context window)? 2. Am I transmitting over capacity (too much at once)? Give one "lower the rate" and one "add redundancy" fix. 3. Am I throwing "more power" (louder / more often) at what is really a capacity problem? Explain why the returns diminish.

Coding & Compression

"Compression is comprehension." — the source-coding theorem equates compression with understanding

In Depth

Compression is fundamentally the removal of redundancy: short codes for frequent symbols, long codes for rare ones, with the average code length approaching the entropy, which is the theoretical floor. Shannon's source-coding theorem sets the hard limit: no lossless code can achieve an average length per symbol below the source's entropy, and truly random data is incompressible.
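Huffman coding is one standard way to build such a code, used here as an illustration rather than as the method the text has in mind. The symbol probabilities are a made-up source chosen so that the average length lands exactly on the entropy.

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from symbol probabilities."""
    # Heap entries: (total probability, tie-breaker, {symbol: depth so far})
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)  # the two least likely subtrees...
        p2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # ...move one level deeper
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # hypothetical source
lengths = huffman_code_lengths(freqs)
avg_len = sum(freqs[s] * lengths[s] for s in freqs)
entropy = -sum(p * math.log2(p) for p in freqs.values())

print(lengths)           # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(avg_len, entropy)  # 1.75 1.75
```

When probabilities are not exact powers of one half, the Huffman average sits slightly above the entropy but never below it.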

Non-trivial: (1) compression = understanding, information theory's deepest equation. A model that compresses data into something short has found its structure and regularities; rote memorization is zero compression (storing it verbatim), understanding is high compression (capturing the generative rule). Kolmogorov complexity takes this to the limit: an object's complexity = the length of the shortest program that generates it — more regularity, shorter program, more compressible. (2) Abstraction is lossy compression: maps, models, and concepts actively discard detail to gain tractability. The question isn't "compress or not" but "is what you threw away the part you didn't need?" — exactly the line between good and bad abstraction (see D33). (3) Isomorphic to science and Occam's razor: a good theory is the shortest encoding of observations, covering the most phenomena with the fewest assumptions. (4) Same family as generalization in learning: a model that merely memorizes the training set hasn't compressed and will overfit; one that compresses has captured transferable structure.
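Point (1) can be felt with an off-the-shelf compressor. Kolmogorov complexity itself is uncomputable, so `zlib` serves here only as a practical stand-in that gives an upper bound; the two inputs are contrived extremes.

```python
import os
import zlib

structured = b"abc" * 10_000       # 30,000 bytes generated by a tiny rule
random_bytes = os.urandom(30_000)  # 30,000 bytes with no structure to exploit

print(len(zlib.compress(structured)))    # a small fraction of the original size
print(len(zlib.compress(random_bytes)))  # roughly 30,000: essentially no savings
```

The compressor "understands" the first input in the sense that it has found the repeating rule; for the second there is no rule to find.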

Practice: to test whether you truly understand something, see if you can compress it without distortion — state its core generative rule in one sentence. If you can't shorten it, you usually haven't understood it; you're just moving details around.

Classic example

Morse code, as far back as 1838, intuitively used the core idea of efficient coding: the most frequent letter, E, gets the shortest symbol, a single dot "·", while the rare Q gets a long string. This anticipates the "short codes for frequent symbols" principle that Shannon's source-coding theorem formalized a century later: good coding makes the common things cheaper.

BigCat scenario

(1) Notes: good notes aren't a full transcription but a lossy compression. The act of writing a summary forces you to find structure, and if there's none to find, you can't compress; failure to compress is itself an honest signal of "I don't get it yet." (2) AI: an LLM can be viewed as a lossy compression of a vast body of internet text, and "understanding" in the information-theoretic sense just is compression ability; this is also why "can paraphrase" and "can compress to one sentence" are two different skills. (3) Knowledge systems: after reading 50 papers, being able to draw one map (a few principles generating most conclusions) = you've compressed the field; only being able to recite paper by paper = still at zero compression. Only what you can compress do you truly own.

AI Prompt

English Prompt

I want to test whether I really understand [a concept / field / system]. Use the "compression = comprehension" frame: 1. Have me compress it to a one-sentence generative rule, then judge whether what I dropped was real redundancy or a distortion. 2. If I can only list details and can't compress, diagnose whether I'm stuck at "zero-compression memorization" or grasping the wrong core. 3. Offer a shorter "encoding" that loses no essential structure.

Mutual Information

I(X;Y) — how much knowing one variable reduces your uncertainty about another

In Depth

Mutual information I(X;Y) measures how much knowing X reduces your uncertainty about Y: I(X;Y) = H(Y) − H(Y|X). It's symmetric, non-negative, and zero iff X and Y are independent.
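For a small discrete case this can be computed directly from a joint probability table, using the equivalent form I(X;Y) = Σ p(x,y)·log₂[p(x,y) / (p(x)·p(y))]. The two tables are toy examples at the extremes.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # knowing X says nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X

print(mutual_information_bits(independent))  # 0.0
print(mutual_information_bits(identical))    # 1.0
```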

Non-trivial: (1) mutual information is a thorough upgrade over correlation. Pearson correlation captures only linear relationships; two variables can have zero correlation yet be highly dependent (e.g. Y=X² with X distributed symmetrically around zero). Mutual information captures dependence of any form, making it the ultimate test of "does X contain information about Y?" (2) The value of a signal or metric = its mutual information with the outcome you actually care about. KPIs fail (Goodhart's Law, see D50) precisely because optimizing the proxy destroys its mutual information with the real goal. (3) Channel capacity is, mathematically, the maximum of the mutual information between channel input and output, taken over all input distributions. This stitches all four models into one: communication, coding, uncertainty, and dependence are the same language. (4) Same family as representation learning: the Information Bottleneck principle says good learning = compressing away the irrelevant detail in input X while preserving as much mutual information with target Y as possible; on this view, brains and neural networks both build internal representations with maximal mutual information with "what's useful for the future."
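The Y=X² case from point (1) can be checked on a tiny example. X is assumed uniform on five integer values symmetric about zero; that symmetry is what makes the correlation vanish.

```python
import math
from collections import Counter

xs = [-2, -1, 0, 1, 2]    # X uniform on a range symmetric about zero (assumption)
ys = [x * x for x in xs]  # Y is fully determined by X

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def entropy_bits(values):
    """Entropy in bits of the empirical distribution of equally weighted values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

# I(X;Y) = H(X) + H(Y) - H(X,Y), computed from the equally likely (x, y) pairs
mi = entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))

print(pearson(xs, ys))  # 0.0: no linear relationship at all
print(mi)               # about 1.52 bits: X tells you Y completely
```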

Practice: before collecting some data, asking some question, or watching some dashboard, ask "how high is its mutual information with what I must decide?" High mutual information = looking at it can change your judgment; zero = it's noise, however precise. Much of the "data" people collect has near-zero mutual information with the real decision.

Mutual information = the overlap of two variables' uncertainty; it captures any (incl. nonlinear) dependence

Classic example

The value of a medical test lies not in how precise it is but in the mutual information between its result and the true condition. A test that comes back positive for everyone has zero mutual information with the disease: however "sensitive" it looks (it catches every true case), it's diagnostically worthless. This is also why rare-disease screening must beware false positives: when the base rate is tiny, a single positive result raises the probability of disease far less than intuition suggests.
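The base-rate effect follows from Bayes' rule. The prevalence, sensitivity, and specificity below are invented round numbers, not figures for any real test.

```python
def posterior_given_positive(base_rate, sensitivity, specificity):
    """P(disease | positive result) via Bayes' rule."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical rare disease (1 in 1000) and a test that is 99% sensitive and 99% specific
print(posterior_given_positive(0.001, 0.99, 0.99))  # about 0.09

# A "test" that is positive for everyone: sensitivity 1.0, specificity 0.0
print(posterior_given_positive(0.001, 1.0, 0.0))    # about 0.001, unchanged from the base rate
```

Even the seemingly excellent test leaves a positive patient with only about a 9% chance of having the disease, and the always-positive test moves the estimate not at all.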

BigCat scenario

(1) Choosing metrics: fixating on a proxy with low mutual information with the real outcome (e.g. "lines of code" vs "software value") drifts further off the more you optimize it — the information-theoretic root of Goodhart's Law. (2) Attention allocation: in an age of information overload, the true scarcity isn't information but "information with high mutual information to your decision"; filtering is essentially ranking information by mutual information. (3) Asking questions: a good question is the one whose answer has maximal mutual information with what you truly need to know; most ineffective communication is spent asking questions with near-zero mutual information. It's not more information that's better, but higher mutual information.

AI Prompt

English Prompt

I'm using [a metric / data source / question] to support [a decision]. Audit it via mutual information: 1. Is this signal's mutual information with the outcome I truly care about high, medium, or near zero? 2. Is there a "looks uncorrelated but is actually strongly dependent" mismatch (a nonlinear relationship that correlation misses), or the reverse: a correlation that looks strong but tells me little about the outcome? 3. If it's low, tell me whether I've fallen into a Goodhart trap, and propose a substitute signal with higher mutual information with the goal.