Meta Knowledge: Information Theory

May 25, 2026 · Meta Knowledge

DAY 10

Information Theory · Communication Engineering · Complexity Theory · ML Foundations

Shannon Entropy

Information = uncertainty removed

Why surprise is the real unit of information

CORE INSIGHT

Information isn't "content"; it's uncertainty removed. "The sun rises tomorrow" carries almost 0 bits; "a major earthquake hits your city tomorrow" carries far more, precisely because it is so unlikely. (An earthquake "somewhere" happens nearly every day, so that version would carry almost nothing.) The more predictable an event, the less information it holds; the more surprising, the more worth encoding and sharing.

BACKGROUND & MECHANISM

In 1948 Shannon proposed a formula for how "surprising" a message is on average: entropy, the expected value of log2(1/p) over all possible outcomes. When all outcomes are equally likely and hardest to guess, entropy is at its maximum; when the outcome is certain, entropy is 0. It is closely related to the "entropy" of statistical physics, which has the same mathematical form and is often described as a measure of disorder.
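As a minimal sketch of the formula described above, the snippet below computes entropy in bits for a few hand-picked distributions. The function name entropy_bits and the example probabilities are illustrative choices, not part of the original post.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: sum of p * log2(1/p), skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

fair_coin = entropy_bits([0.5, 0.5])      # two equally likely outcomes
biased_coin = entropy_bits([0.9, 0.1])    # mostly predictable
certain = entropy_bits([1.0])             # outcome known in advance
uniform_8 = entropy_bits([1 / 8] * 8)     # eight equally likely outcomes

print(fair_coin, round(biased_coin, 3), certain, uniform_8)
```

The fair coin reaches the 1-bit maximum for two outcomes, the biased coin falls below it, and the certain outcome carries nothing.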

COUNTER-INTUITIVE EXAMPLE

English has 26 letters, so at first glance each needs log2(26) ≈ 4.7 bits. But because letters and words follow heavy regularities (a "q" is almost always followed by "u"), the real information content is only about 1 bit per letter (Shannon's own guessing experiments put it at roughly 0.6 to 1.3 bits). That means English text can in principle be compressed to under a quarter of its naive size. This is a large part of why large language models can "autocomplete" so accurately: natural language is highly compressible.
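The direction of this effect can be sketched on a toy sample. The snippet below compares the naive log2(26) figure with the entropy of letter frequencies and with the entropy of the next letter once the previous one is known. The sample sentence is a made-up assumption, and a text this short badly underestimates the conditional entropy, so the numbers illustrate the trend only, not Shannon's roughly 1 bit per letter.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Entropy in bits of the empirical distribution described by a Counter."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical sample text; a real estimate needs a far larger corpus.
sample = (
    "the quick brown fox jumps over the lazy dog while the queen quietly "
    "questions the quality of the quilt and the quartet quits the quiz"
)
letters = [ch for ch in sample if ch.isalpha()]
pairs = list(zip(letters, letters[1:]))          # (previous letter, next letter)

uniform_bits = math.log2(26)                     # every letter equally likely
h_next = entropy_bits(Counter(nxt for _, nxt in pairs))
h_prev = entropy_bits(Counter(prev for prev, _ in pairs))
h_joint = entropy_bits(Counter(pairs))
h_next_given_prev = h_joint - h_prev             # H(next | previous)

after_q = Counter(nxt for prev, nxt in pairs if prev == "q")

print(round(uniform_bits, 2), round(h_next, 2), round(h_next_given_prev, 2))
print(dict(after_q))
```

Each extra piece of context (letter frequencies, then the previous letter) lowers the bits needed per letter, and after "q" the next letter in this sample is never in doubt.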

CROSS-DOMAIN TRANSFER

In physics it's the disorder of a system; in machine learning it's the "cross-entropy loss" you minimize during training — making the model's predictions match reality; in ecology it measures species diversity; in cognitive science it maps to "surprise" — one theory holds that the brain is a machine that constantly works to reduce surprise.
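For the machine-learning case, here is a small sketch of cross-entropy under assumed toy distributions (reality, good_model and poor_model are invented for illustration). It uses base-2 logarithms to stay in bits; most training code uses natural logarithms, which only changes the unit.

```python
import math

def cross_entropy_bits(true_probs, predicted_probs):
    """Average bits to encode outcomes drawn from true_probs with a code tuned to predicted_probs."""
    return sum(
        p * math.log2(1 / q)
        for p, q in zip(true_probs, predicted_probs)
        if p > 0
    )

reality = [0.7, 0.2, 0.1]
good_model = [0.7, 0.2, 0.1]     # predictions match reality
poor_model = [0.1, 0.2, 0.7]     # confident about the wrong outcome

entropy = cross_entropy_bits(reality, reality)
loss_good = cross_entropy_bits(reality, good_model)
loss_poor = cross_entropy_bits(reality, poor_model)

print(round(entropy, 3), round(loss_good, 3), round(loss_poor, 3))
```

The loss can never drop below the entropy of reality itself; it reaches that floor only when the predicted distribution matches the true one.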

BIGCAT APPLICATION + REFLECTION

The way to judge a prompt, a meeting, or a status update changes: not "how many words," but how much uncertainty it removed. "The project is going well" carries almost 0 bits; "blocked on subtask X, missing resource Y, expect a 3-day slip" is genuinely high-information intel. Same in parenting: "How was today?" almost always returns a low-information "fine" — better to ask something specific and unpredictable.

▸ Reflection: of the 5 longest messages you sent or received this week, how much uncertainty did each really remove? Which was the shortest yet most informative?

Mutual Information

A more universal measure of dependence than correlation

Statistical dependence beyond straight lines

CORE INSIGHT

A correlation coefficient only tells you whether two things rise and fall together. Mutual information measures "how much knowing X tells me about Y" — whether the relationship is a straight line, a curve, or seemingly chaotic. It's a far stronger tool for spotting hidden dependencies.

BACKGROUND & MECHANISM

Mutual information answers an intuitive question: once I know Y, how much does X's uncertainty shrink? It's 0 when the two are completely independent, and maximal when one fully determines the other. It doesn't care about the shape of the distribution or demand a straight-line relationship, so it's far more universal than the everyday correlation coefficient.
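A minimal sketch of that definition for two binary variables, assuming the joint distribution is given as a small table. The three example tables are illustrative assumptions chosen to show the two extremes and one case in between.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) in bits from a dict mapping (x, y) to its joint probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
determined = {(0, 0): 0.5, (1, 1): 0.5}                      # Y always equals X
noisy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # Y equals X 80% of the time

mi_independent = mutual_information_bits(independent)
mi_determined = mutual_information_bits(determined)
mi_noisy = mutual_information_bits(noisy)

print(mi_independent, mi_determined, round(mi_noisy, 3))
```

Independent variables share 0 bits, a fully determined pair shares the whole 1 bit of X's entropy, and the noisy case lands in between.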

COUNTER-INTUITIVE EXAMPLE

Let X be uniform between −1 and 1, and let Y = X². Their correlation coefficient is exactly 0, so by correlation they look "unrelated." Yet Y is completely determined by X, so their mutual information is as high as it can be. Any finance or risk model that relies only on correlation misses this kind of dependence. It is a commonly cited weakness of the risk models used before the 2008 crisis, which leaned on correlation-style measures and underestimated how strongly losses could move together. "Zero correlation" does not mean "independent."
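The same effect shows up in a discrete stand-in for this example. The sketch below assumes X takes the five values −2 to 2 with equal probability (an assumption made so everything can be computed exactly); in the continuous case the mutual information of a deterministic relationship is unbounded rather than a finite number.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Entropy in bits of the empirical distribution of a list of outcomes."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]            # discrete, symmetric around zero
ys = [x * x for x in xs]          # fully determined by x

r = pearson(xs, ys)
# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))

print(r, round(mi, 3))
```

The correlation is zero because the positive and negative halves cancel, while the mutual information equals the full entropy of Y: knowing X leaves nothing about Y uncertain.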

CROSS-DOMAIN TRANSFER

In neuroscience it measures how much information is shared between two brain regions; in AI representation learning, contrastive learning (e.g. CLIP aligning images and text) can be read as maximizing a lower bound on the mutual information between different views; in genetics it helps identify regulatory relationships between genes; in cryptography, "perfect secrecy" is defined as zero mutual information between ciphertext and plaintext.

BIGCAT APPLICATION + REFLECTION

When diagnosing a team or project, ask "which early signal has the highest mutual information with the final outcome," not "which metric has the highest correlation." You'll often find the surface KPI (like weekly-report word count) has almost no predictive power, while a neglected weak signal (the length of comments in code review, the duration of silence in standups) is what truly forecasts delivery quality. When picking features for an AI system, mutual information often surfaces unexpectedly strong predictors.

▸ Reflection: is there a "neglected weak signal" in your work whose predictive power is actually far higher than the core KPI your team keeps staring at?

Channel Capacity

Noise is a resource you can trade away with coding

Why slow-but-accurate always wins

CORE INSIGHT

Shannon proved something counter-intuitive: over any noisy channel, as long as your transmission rate stays below a certain ceiling (the channel capacity), there must exist an encoding that drives the error rate toward 0. Noise isn't an unbeatable fate — it's a resource you can trade away, in proportion, with redundant coding. The entire digital civilization rests on this claim.

BACKGROUND & MECHANISM

The capacity formula, C = B · log2(1 + S/N), tells us: at a fixed signal-to-noise ratio, double the bandwidth and you double the capacity; but once the signal-to-noise ratio is already high, doubling it adds only about one extra bit per second per hertz. That is one reason 5G fights so hard for wider spectrum rather than just cranking up power. Shannon only proved "such a good code must exist," not how to build one; it took engineers more than four decades (until turbo codes in 1993 and the revival of LDPC codes) to construct practical codes that approach this theoretical limit.
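A quick numeric sketch of the bandwidth-versus-power trade-off, using the Shannon-Hartley formula C = B · log2(1 + S/N). The 20 MHz bandwidth and the SNR values are illustrative assumptions, and the bandwidth comparison holds SNR fixed, which in practice requires more transmit power because a wider band also admits more noise.

```python
import math

def capacity_bps(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bits per second for a given bandwidth and linear SNR."""
    return bandwidth_hz * math.log2(1 + snr_linear)

base = capacity_bps(20e6, 1000)              # 20 MHz at 30 dB SNR
double_bandwidth = capacity_bps(40e6, 1000)  # same SNR, twice the spectrum
double_snr = capacity_bps(20e6, 2000)        # same spectrum, twice the SNR

bandwidth_gain = double_bandwidth / base
snr_gain = double_snr / base

# At very low SNR the picture changes: capacity is nearly linear in SNR.
low_snr_gain = capacity_bps(20e6, 0.02) / capacity_bps(20e6, 0.01)

print(round(bandwidth_gain, 2), round(snr_gain, 2), round(low_snr_gain, 2))
```

At 30 dB, doubling the SNR buys roughly 10% more capacity while doubling the bandwidth buys 100%. In the low-SNR regime the two levers are almost equally effective, so "power barely helps" is a high-SNR statement.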

▸ How a message crosses a noisy channel

Source → Encoder → Channel (with noise) → Decoder → Receiver

Below capacity, an encoding exists that makes errors arbitrarily small; once you exceed capacity, the error rate inevitably surges

COUNTER-INTUITIVE EXAMPLE

A deep-space probe billions of kilometers away sends back a signal that arrives far weaker than the noise around it, which by intuition should be unreadable. Yet with carefully designed error-correcting codes plus an extremely slow transmission rate, it still returns its data intact. That's the promise of the capacity theorem: however strong the noise, as long as the capacity is above zero you can slow the rate down and buy back accuracy. Slow, but with errors driven as low as you like.
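The "slow down to buy accuracy" trade can be sketched with the crudest possible code: repeat each bit n times over a binary symmetric channel and take a majority vote. This is an illustrative assumption, not what deep-space missions use; real systems rely on far more efficient codes. The 20% flip probability is also made up.

```python
import math

def bsc_capacity(p):
    """Capacity in bits per channel use of a binary symmetric channel with flip probability p."""
    if p in (0.0, 1.0):
        return 1.0
    h = p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))
    return 1.0 - h

def repetition_error(p, n):
    """Probability that a majority vote over n copies (n odd) decodes the wrong bit."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

flip = 0.2                                   # one bit in five gets corrupted
capacity = bsc_capacity(flip)
errors = {n: repetition_error(flip, n) for n in (1, 3, 9, 27)}

for n, err in errors.items():
    print(f"rate 1/{n}: error probability {err:.6f}")
print(f"capacity: {capacity:.3f} bits per channel use")
```

Repetition only reaches low error by pushing the rate toward zero. Shannon's theorem promises something stronger: on this channel, codes exist that keep the rate near 0.28 bits per use and still make errors arbitrarily rare.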

CROSS-DOMAIN TRANSFER

In cognitive science it's the bandwidth of attention — the eye takes in far more per second than consciousness can process, so "seeing" isn't "taking in"; in organizational communication, information gets re-narrated and degraded at every layer it passes through; in AI, a large model's context window is effectively its channel capacity; in biology, a synapse passing a signal is itself a piece of elegant coding under heavy noise.

BIGCAT APPLICATION + REFLECTION

Remote collaboration is fundamentally channel-capacity management. Text is narrow-bandwidth but high signal-to-noise, ideal for asynchronous, deep questions; video is wide-bandwidth but noisy, ideal for building trust and resolving disagreements; emotional topics often need the full bandwidth of being face to face. Mismatching the medium (layoffs over text, routine updates over video) is capacity waste at the organizational level. Same in parenting: a sticky note, a hug, and a long talk are three channels with completely different capacities.

▸ Reflection: in your key conversations this week, did a "small matter use a high-bandwidth medium" or a "big matter use a low-bandwidth one"? Would switching the medium have been better?

Kolmogorov Complexity

True information = length of the shortest program

"Looks random" isn't "is random"

CORE INSIGHT

A thing's true information content equals the length of the shortest program that can generate it. "Looks random" isn't the same as "is random": the first trillion digits of π look like gibberish, yet a few-hundred-character program can compute them — so its complexity is actually tiny.
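Kolmogorov complexity itself cannot be computed, but the output size of any general-purpose compressor gives an upper bound on it (up to a constant). The sketch below uses zlib from the Python standard library as that stand-in; the two test strings and the fixed seed are illustrative assumptions.

```python
import random
import zlib

n = 10_000
patterned = b"ab" * (n // 2)                 # obviously regular

rng = random.Random(0)                       # fixed seed, so the bytes are reproducible
noisy_looking = bytes(rng.getrandbits(8) for _ in range(n))

patterned_size = len(zlib.compress(patterned, 9))
noisy_size = len(zlib.compress(noisy_looking, 9))

print(patterned_size, noisy_size)
```

The patterned string shrinks to a tiny fraction of its length, while the pseudo-random bytes barely shrink at all. Yet those bytes were produced by the two short lines above, so their true complexity is small too, just like the digits of π. zlib simply cannot find that short program, which is why compression gives only an upper bound.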

BACKGROUND & MECHANISM

In the 1960s, several mathematicians (Solomonoff, Kolmogorov, and Chaitin, with Kolmogorov as the figurehead) independently proposed: an object's complexity is the length of the shortest program that generates it. Strikingly, that length can't actually be computed (it's as unsolvable as the "halting problem"), yet it gives Occam's razor, the idea that simpler explanations are better, a rigorous mathematical form. On this view, learning is the search for the shortest explanation of the data.

COUNTER-INTUITIVE EXAMPLE

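There's a theorem (Chaitin's incompleteness theorem) that says: any consistent formal system strong enough to do arithmetic has a complexity ceiling, set roughly by the size of the system's own description. Above that ceiling it cannot prove that any particular string is incompressible, even though almost all strings are. So there are endlessly many true statements of the form "this string is random" that the system can never prove. In other words, something truly random can't be compressed by any theory. And science is only possible because the universe is far more compressible than it "looks": a handful of simple laws explain a vast number of phenomena.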

CROSS-DOMAIN TRANSFER

In machine learning, the "best model" is the one whose "model + the error it fails to explain" has the shortest total encoding; in physics, E=mc² is great precisely because it's tiny yet enormously explanatory; in biology, the many repeated segments in a genome show it's highly compressible; in design, "less is more" is the same idea — the more concise, the purer the signal.

BIGCAT APPLICATION + REFLECTION

To judge the "real depth" of an insight, plan, or proposal, ask: how short can it be compressed and still make sense? Genuinely deep insights usually fit in one sentence ("markets fail when externalities exist"); the ones that need 2,000 words are often noise disguised as ideas. This standard is especially brutal for senior engineers: a complex architecture diagram isn't high value — a design you can capture in one diagram or one line is the real skill. Apply it to your own writing, demos, and reports: after cutting everything you can, does the remaining core still hold up?

▸ Reflection: compress your most important recent work into one sentence. Strip away all the jargon and build-up — is the "core" still powerful? Or was the build-up the whole thing?