AI/ML Deep Dive: Probability & Information Theory

Day 33 · 2026-06-19 · Difficulty: Math-heavy
For: engineers with coding experience but not from an AI background

Almost every ML loss function, generative-model objective, and alignment derivation rests on the same information-theoretic language underneath. Today nails down four atomic concepts: Entropy (quantifies uncertainty), KL Divergence (quantifies how far apart two distributions are), Mutual Information (quantifies how much two variables share), and ELBO (makes intractable Bayesian inference computable). Grasp these and cross-entropy loss, VAEs, and contrastive learning stop being black boxes.

Entropy

Info Measure · Compression Limit
One-line analogy

Entropy is a dataset's theoretical optimal-compression lower bound (in bits). The gzip and Huffman coding you know — how small can it possibly squeeze? — is governed by entropy. A file of all AAAA... has entropy near 0 (gzip crushes it to almost nothing); a truly random byte stream has maximal entropy (incompressible). Entropy = the "true information content" of the data, independent of which codec you use.

Problem it solves + mechanism

Shannon, in 1948, faced an engineering question: over a communication channel, what is the minimum number of bits needed to transmit a message losslessly? That requires quantifying "amount of information." The core intuition is surprise (self-information):

information of one event = −log p(x)

The rarer the event (smaller p), the more information it carries — "the sun rises tomorrow" has p≈1, information≈0, nobody shares it; "a meteor lands tomorrow" has tiny p, enormous information, front-page news. We use log because information must be additive: two independent events together should carry summed information, while probabilities multiply — log turns multiplication into addition. With base 2, the unit is bits.
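The additivity argument can be checked in a few lines. This is a minimal sketch; the two event probabilities are made up for illustration and the events are assumed independent.

```python
import math

def self_information(p):
    """Surprise of an event with probability p, in bits (log base 2)."""
    return -math.log2(p)

# Hypothetical probabilities, chosen only for illustration.
p_rain, p_late_train = 0.25, 0.125

# Independent events: probabilities multiply, information adds.
both = self_information(p_rain * p_late_train)
summed = self_information(p_rain) + self_information(p_late_train)

print(self_information(p_rain))        # 2.0 bits
print(self_information(p_late_train))  # 3.0 bits
print(both, summed)                    # 5.0 5.0
```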

Entropy is the expectation of surprise (probability-weighted average) — "average information per event":

H(X) = −Σ p(x) log p(x) = E[ −log p(x) ]

The more uniform the distribution (every outcome possible), the higher the entropy; the sharper it is (one outcome near-certain), the lower. A fair coin H=1 bit; a coin loaded to 99% heads H≈0.08 bit.

Uniform (high entropy, hard to compress)    → H large

Sharp (low entropy, compressible)    → H small
Code
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # log(0) undefined, drop zero-prob terms
    return -np.sum(p * np.log(p) / np.log(base))

print(entropy([0.5, 0.5]))            # 1.0  fair coin = 1 bit
print(entropy([0.99, 0.01]))          # 0.08 near-certain, very low info
print(entropy([0.25, 0.25, 0.25, 0.25])) # 2.0  4 equal outcomes = 2 bit
# Sanity check: 4 equiprobable outcomes need 2 bits (00/01/10/11), entropy = 2
Pitfall + use case
Pitfall: "higher entropy = more chaotic = worse." In information theory entropy is a neutral measure, carrying no value judgment. High entropy means unpredictable, information-dense — a good password, a redundancy-free compressed blob both have high entropy, which is a virtue. Equating entropy with "bad chaos" misleads your intuition about output diversity and sampling temperature (high temperature = raising the entropy of the output distribution).
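The temperature remark can be made concrete with a small sketch. The logits below are hypothetical next-token scores, not taken from any real model; the point is only that dividing logits by a larger temperature flattens the softmax and raises its entropy.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

logits = [2.0, 1.0, 0.5, 0.0]        # hypothetical next-token logits
temperatures = (0.1, 0.5, 1.0, 2.0, 10.0)
entropies = [entropy_bits(softmax(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    print(f"T={t:<5} H={h:.3f} bits")   # H rises toward log2(4) = 2 bits as T grows
```

Temperature exactly 0 is not evaluated here because it would divide by zero; in practice it is treated as a plain argmax.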
📌 Your scenario: assess dedup quality of your "AI super-individual" knowledge base. Compute the entropy of each note's word-frequency distribution; abnormally low-entropy docs are often template copies or auto-generated filler (low info content). Only high-entropy docs with low mutual information to existing content are genuine increments. Entropy as an information-density filter beats word-count filtering by far.
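A rough sketch of that filter, assuming whitespace tokenization and two invented example notes. Raw word entropy also grows with document length, so a real pipeline would probably normalize or compare notes of similar size.

```python
import math
from collections import Counter

def word_entropy_bits(text):
    """Entropy (bits) of the word-frequency distribution of one note."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical notes: one templated, one information-dense.
template_note = "todo update status todo update status todo update status"
dense_note = "entropy bounds lossless compression while kl measures coding overhead"

h_template = word_entropy_bits(template_note)
h_dense = word_entropy_bits(dense_note)
print(f"template: {h_template:.2f} bits, dense: {h_dense:.2f} bits")
```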
Takeaway + question
💡 Entropy = uncertainty = optimal-compression bound = average information — three names for one thing. It's the measuring unit for the next three concepts.
🤔 An LLM's generation "temperature" directly tunes the entropy of the output distribution. At temperature 0, output entropy≈0 (always picks the highest-probability token). What does "certainty = zero information" mean philosophically?

KL Divergence

Distribution Gap · Cross-Entropy · Asymmetric
One-line analogy

KL divergence = the cost of using the wrong histogram. A database query optimizer relies on a column's statistical distribution (histogram) to estimate cardinality and pick a plan. If the true distribution is P but the optimizer decides from a stale, biased Q, the wasted row scans and extra I/O play the role of KL(P‖Q). The further off the estimate, the higher the cost; a perfect estimate costs 0.

Problem it solves + mechanism

Goal: quantify how far the "true distribution P" is from "my approximation Q." Information theory's answer is elegant — the average extra bits per symbol you pay when you compress data actually drawn from P using a code designed for Q:

KL(P‖Q) = Σ p(x) log( p(x) / q(x) )

Decomposed, it equals exactly cross-entropy minus the true entropy:

H(P,Q) = −Σ p(x) log q(x) = H(P) + KL(P‖Q),  with KL(P‖Q) ≥ 0
cross-entropy = true entropy + KL
"total cost of coding P with Q" = "P's irreducible compression floor" + "penalty for using the wrong distribution"

This is the bedrock logic of most classification training: the model minimizes cross-entropy loss, and since H(P) is a constant fixed by the data, minimizing cross-entropy = minimizing KL(P‖Q) = pushing the model's predicted distribution Q toward the true label distribution P. The cross-entropy loss you use daily is, at heart, squeezing the KL term toward zero.
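The decomposition is easy to verify numerically. A minimal sketch with made-up distributions (natural log, so values are in nats):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

p_soft = [0.7, 0.2, 0.1]      # hypothetical "true" distribution
q_model = [0.5, 0.3, 0.2]     # hypothetical model prediction
one_hot = [0.0, 1.0, 0.0]     # a hard classification label

# H(P,Q) = H(P) + KL(P||Q)
print(cross_entropy(p_soft, q_model), entropy(p_soft) + kl(p_soft, q_model))

# With a one-hot label H(P) = 0, so cross-entropy and KL coincide.
print(cross_entropy(one_hot, q_model), kl(one_hot, q_model))
```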

Three key properties: (1) KL ≥ 0, zero iff P=Q; (2) asymmetric: KL(P‖Q) ≠ KL(Q‖P), so it is not a distance; (3) the direction has meaning: forward KL (minimizing KL(P‖Q)) tends to "cover all modes of P" (mean-seeking, blurs them together), while reverse KL (minimizing KL(Q‖P)) tends to "lock onto one mode" (mode-seeking, drops other peaks). Variational inference, VAEs included, uses reverse KL between the approximate posterior q and the true posterior, so the mode-seeking happens in latent space; the "conservative, blurry" look of VAE images is more commonly attributed to the maximum-likelihood (forward-KL) reconstruction side of the objective.
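The mean-seeking versus mode-seeking behavior shows up even in a toy setting. The sketch below assumes a bimodal target P (two narrow Gaussians at ±2) and fits a single Gaussian Q by brute-force grid search under each direction of KL; the grid ranges are arbitrary choices for illustration.

```python
import numpy as np

x = np.linspace(-6, 6, 1201)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def normalize(w):
    return w / w.sum()               # discretize onto the grid

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Bimodal target P: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2).
p = normalize(0.5 * gaussian(x, -2, 0.5) + 0.5 * gaussian(x, 2, 0.5))

best_fwd = best_rev = None           # each holds (kl value, mu, sigma)
for mu in np.linspace(-3, 3, 61):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = normalize(gaussian(x, mu, sigma))
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL(P||Q) picks mu=%.1f sigma=%.1f" % best_fwd[1:])  # wide, centered between the peaks
print("reverse KL(Q||P) picks mu=%.1f sigma=%.1f" % best_rev[1:])  # narrow, sitting on one peak
```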

Code
import numpy as np

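def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p=0 contribute 0 by convention
    # Masking first avoids evaluating log(0) (np.where would still compute it and warn).
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))  # nats; infinite if q=0 where p>0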

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q))   # 0.51  coding P with biased Q costs 0.51 nat extra
print(kl(q, p))   # 0.37  reverse ≠ forward, proving asymmetry
# Engineering meaning: the "true label P first, model prediction Q second"
# direction must not be flipped — flip it and the objective changes
# (mean-seeking vs mode-seeking)
Pitfall + use case
Pitfall: "KL divergence is a distance between two distributions." It is not a distance in the mathematical sense — neither symmetric (KL(P‖Q)≠KL(Q‖P)) nor obeying the triangle inequality. Treating it as a distance plants bugs in your derivations. For a true symmetric distance, use Jensen-Shannon divergence or Wasserstein distance.
📌 Your scenario: monitor production data drift. Store the feature distribution at deploy time as P, compute the live daily distribution Q, and use KL(P‖Q) as the drift alarm — a sudden spike means the input distribution has drifted from training assumptions and the model may degrade. This is a direct transfer of your distributed-systems monitoring instinct: KL is "diff at the distribution level."
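One way such a drift alarm might look, as a sketch rather than a production recipe. The function name `drift_kl`, the bin count, and the pseudo-count smoothing are all assumptions added here; smoothing is included because an empty live bin would otherwise make KL(P‖Q) infinite.

```python
import numpy as np

def drift_kl(reference, live, bins=20, pseudo_count=1.0):
    """KL(P||Q) in nats between binned reference (P) and live (Q) feature values."""
    edges = np.histogram_bin_edges(reference, bins=bins)   # bins fixed at deploy time
    live = np.clip(live, edges[0], edges[-1])              # out-of-range values land in the end bins
    p_counts, _ = np.histogram(reference, bins=edges)
    q_counts, _ = np.histogram(live, bins=edges)
    p = (p_counts + pseudo_count) / (p_counts + pseudo_count).sum()
    q = (q_counts + pseudo_count) / (q_counts + pseudo_count).sum()
    return float(np.sum(p * np.log(p / q)))

# Synthetic data standing in for one numeric feature.
rng = np.random.default_rng(0)
deploy = rng.normal(0.0, 1.0, 5000)
same_day = rng.normal(0.0, 1.0, 5000)       # no drift
shifted_day = rng.normal(1.5, 1.0, 5000)    # mean has drifted

drift_same = drift_kl(deploy, same_day)
drift_shifted = drift_kl(deploy, shifted_day)
print(f"no drift: {drift_same:.4f}  shifted: {drift_shifted:.4f}")
```

The alarm threshold itself would have to be calibrated on historical days; nothing in the source fixes a value.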
Takeaway + question
💡 KL divergence = cross-entropy − true entropy = the "coding penalty" for using the wrong distribution. Minimizing cross-entropy loss is, at heart, minimizing KL.
🤔 The RLHF / DPO objective contains a "KL penalty" pinning the aligned model so it doesn't stray too far from the original. Why KL rather than some other divergence? What happens if you drop that KL constraint (hint: reward hacking)?

Mutual Information

Variable Dependence · Nonlinear Correlation
One-line analogy

Mutual information = the functional-dependency strength between two columns. In a database, if knowing zip_code almost determines city, the two columns are highly redundant: high mutual information. If two columns are fully independent (knowing one tells you nothing about the other), mutual information is 0. Mutual information quantifies "how much knowing X eliminates of the uncertainty about Y."

Problem it solves + mechanism

Pain point: the Pearson correlation coefficient captures only linear relationships. A strong dependence like Y = X² can have a correlation of 0 and fool you into "unrelated." Mutual information captures any form of statistical dependence. It has two equivalent views:

I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

First view: X's original uncertainty, minus X's remaining uncertainty "once Y is known" — the difference is exactly the uncertainty Y removed for you. Fully independent: knowing Y is useless, difference 0. Fully determined: knowing Y locks X, difference = H(X).

I(X;Y) = KL( p(x,y) ‖ p(x)p(y) )

The second view is deeper: mutual information is the KL divergence of the "true joint distribution" from "pretending the two are independent." If truly independent, p(x,y)=p(x)p(y), KL=0. The further the deviation, the stronger the dependence. In a phrase: mutual information = the degree to which independence is violated.
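Both views can be computed from a small joint table and compared. A minimal sketch with a hypothetical 2×2 joint distribution (values in bits):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information_bits(joint):
    """I(X;Y) as KL( p(x,y) || p(x)p(y) ); rows index X, columns index Y."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    indep = px * py                                  # "pretend independent" distribution
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / indep[nz])))

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                       # hypothetical dependent pair

h_x = entropy_bits(joint.sum(axis=1))
h_x_given_y = entropy_bits(joint) - entropy_bits(joint.sum(axis=0))  # H(X|Y) = H(X,Y) - H(Y)

mi_kl = mutual_information_bits(joint)
mi_entropy = h_x - h_x_given_y
print(mi_kl, mi_entropy)                             # both about 0.278 bits

independent = np.outer([0.5, 0.5], [0.3, 0.7])       # p(x,y) = p(x)p(y)
mi_indep = mutual_information_bits(independent)
print(mi_indep)                                      # 0 up to floating-point error
```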

Classic information diagram (think of entropy as set area):

|  H(X|Y)  |  I(X;Y)  |  H(Y|X)  |
└─────── H(X) ────────┘
           └─────── H(Y) ────────┘
overlap of the two circles = mutual information (shared info); each exclusive part = conditional entropy (private info)
Code
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from scipy.stats import pearsonr

np.random.seed(0)
x = np.random.uniform(-3, 3, 2000)
y = x**2 + np.random.normal(0, 0.1, 2000)   # strong nonlinear dependence

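print(pearsonr(x, y)[0])                      # close to 0: correlation lies, "unrelated"
# random_state makes the k-NN based estimate reproducible; the result is in nats
print(mutual_info_regression(x.reshape(-1,1), y, random_state=0)[0])  # clearly > 0: MI catches the strong dependence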
# Conclusion: zero correlation ≠ independence; MI is the reliable detector of true independence
Pitfall + use case
Pitfall: "correlation 0 = the two variables are independent." Badly wrong. Zero correlation only means no linear relationship; the two may still have strong nonlinear dependence (like Y=X² above). True independence requires mutual information to be 0 — a condition far stronger than "uncorrelated." In feature engineering, looking only at correlation misses a wealth of nonlinear signal.
📌 Your scenario: the contrastive-learning objective InfoNCE (cf. Day 30) is essentially maximizing a lower bound on the mutual information between "two views of the same sample" — teaching the model "which information is the shared, essential feature across views." Grasp this and you'll see why CLIP and SimCLR work: they all maximize mutual information.
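A NumPy sketch of a one-directional InfoNCE loss, to show where the mutual-information bound comes from. The embeddings are random stand-ins for encoder outputs, and the temperature of 0.1 is an arbitrary choice; real SimCLR or CLIP losses differ in details such as symmetrization.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Mean cross-entropy of picking the matching row of z2 for each row of z1."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                    # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True) # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # positives sit on the diagonal

rng = np.random.default_rng(0)
n, d = 64, 16
anchor = rng.normal(size=(n, d))
positive = anchor + 0.1 * rng.normal(size=(n, d))       # "second view" of the same samples
unrelated = rng.normal(size=(n, d))                     # views that share nothing

loss_aligned = info_nce(anchor, positive)
loss_unrelated = info_nce(anchor, unrelated)

# I(view1; view2) >= log(N) - loss, so the estimate can never exceed log(N).
print("bound, aligned views:  ", np.log(n) - loss_aligned)
print("bound, unrelated views:", np.log(n) - loss_unrelated)
```

Lower loss means a higher bound, which is why minimizing InfoNCE is read as maximizing a mutual-information lower bound.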
Takeaway + question
💡 Mutual information = how much knowing one variable removes the other's uncertainty = the KL divergence of the joint from independence. It captures any dependence and is the gold standard for "true independence."
🤔 Mutual information is symmetric: I(X;Y)=I(Y;X) — "what X tells you about Y" exactly equals "what Y tells you about X." Is that symmetry an asset or a flaw in causal inference (hint: correlation ≠ causation; can MI distinguish causal direction)?

ELBO · Evidence Lower Bound

Variational Inference · VAE Objective · Approximation
One-line analogy

ELBO = using an approximate query instead of an exact full-table aggregate. You want an exact value, but it requires summing/integrating over a vast number of hidden states, which is computationally infeasible (like running an exact COUNT(DISTINCT) over a petabyte-scale table; a full scan blows up). In practice you fall back on a computable approximation like HyperLogLog or sampling. Those estimators can err in either direction; ELBO offers something stronger: it is guaranteed never to overestimate the true value. ELBO is that "guaranteed-not-to-overestimate lower-bound estimate" in Bayesian inference, and you can keep pushing the bound up toward the true value.

Problem it solves + mechanism

The core obstacle in Bayesian inference: to compute the evidence p(x), the marginal likelihood of data x, you must integrate over all latent variables z:

p(x) = ∫ p(x,z) dz   ← usually intractable (z's dimensionality explodes)

It won't compute. So introduce a computable approximate posterior q(z) and make a neat identity decomposition:

log p(x) = ELBO(q) + KL( q(z) ‖ p(z|x) )
where log p(x) is the target (constant with respect to q), ELBO(q) is the part we can compute, and KL ≥ 0 is the approximation error.
Since KL ≥ 0, ELBO ≤ log p(x): ELBO is always a lower bound on the evidence.
Maximizing ELBO ⇒ simultaneously raises the bound + shrinks the KL (pushing q toward the true posterior p(z|x)) — two birds, one stone

ELBO itself splits into two intuitive terms:

ELBO = E_q[ log p(x|z) ] − KL( q(z) ‖ p(z) )

First term = reconstruction: how well decoding x back from the latent z restores the data (higher when more faithful). Second term = regularizer: keep the approximate posterior q(z) from straying too far from the prior p(z) (usually a standard normal). Negated, this is exactly the loss function of the VAE (Variational Autoencoder), from Kingma & Welling 2013's Auto-Encoding Variational Bayes: the encoder produces q(z|x), which plays the role of q(z) above, the decoder computes p(x|z), and training maximizes the ELBO. Diffusion models and variational inference all build on this framework.
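With a tiny discrete latent variable the evidence is computable exactly, so both ELBO identities can be checked directly. This is a toy sketch; the prior, likelihood values, and the candidate q are invented for illustration.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # p(z) over three latent states
likelihood = np.array([0.9, 0.2, 0.05])    # p(x | z) for one fixed observation x

joint = prior * likelihood                 # p(x, z)
evidence = joint.sum()                     # p(x): tractable only because z is tiny
posterior = joint / evidence               # p(z | x)
log_evidence = float(np.log(evidence))

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def elbo(q):
    return float(np.sum(q * (np.log(joint) - np.log(q))))   # E_q[ log p(x,z) - log q(z) ]

q_rough = np.array([1 / 3, 1 / 3, 1 / 3])  # a deliberately poor approximate posterior

# Identity 1: log p(x) = ELBO(q) + KL(q || p(z|x))
print(log_evidence, elbo(q_rough) + kl(q_rough, posterior))

# Identity 2: ELBO = E_q[log p(x|z)] - KL(q || p(z))
recon_minus_kl = float(np.sum(q_rough * np.log(likelihood))) - kl(q_rough, prior)
print(elbo(q_rough), recon_minus_kl)

# The gap closes when q equals the true posterior.
print(elbo(posterior), log_evidence)
```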

Code
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Term 1: reconstruction (negated because maximizing ELBO = minimizing -ELBO)
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # Term 2: KL( q(z|x)=N(mu,σ²) ‖ p(z)=N(0,1) ), closed form for Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl          # = -ELBO; minimizing it = maximizing ELBO

# Intuition: recon forces the model to "reconstruct the data,"
# kl forces the latent space to "hug a standard normal, not sprawl."
# Their tug-of-war is why VAEs are both continuous and slightly blurry.
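The closed-form KL term used in the loss can be sanity-checked against a Monte Carlo estimate. The sketch below uses NumPy rather than PyTorch to stay dependency-light; the `mu` and `logvar` values are arbitrary, and sampling via z = mu + sigma * eps is the standard reparameterization, assumed here rather than taken from the text.

```python
import numpy as np

def gaussian_kl_closed_form(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), same formula as in vae_loss."""
    return float(-0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar)))

def gaussian_kl_monte_carlo(mu, logvar, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps                                   # samples from q
    log_q = -0.5 * (eps**2 + logvar + np.log(2 * np.pi))   # log density of q at z, per dimension
    log_p = -0.5 * (z**2 + np.log(2 * np.pi))              # log density of N(0, 1) at z
    return float(np.mean(np.sum(log_q - log_p, axis=1)))   # E_q[ log q - log p ]

mu = np.array([1.0, -0.5])                 # hypothetical encoder outputs
logvar = np.log(np.array([0.25, 2.0]))

kl_exact = gaussian_kl_closed_form(mu, logvar)
kl_sampled = gaussian_kl_monte_carlo(mu, logvar)
print(kl_exact, kl_sampled)                # should agree to about two decimal places
```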
Pitfall + use case
Pitfall: "maximizing ELBO directly computes the evidence." No. ELBO and the true log p(x) are always separated by a gap = KL(q‖p(z|x)). Only when the approximate posterior q perfectly equals the true posterior does the gap reach 0. So ELBO is a lower-bound estimate, not the exact value; q's expressiveness (its family) limits how close you can get. That is the origin of the word "variational" — searching within a family of q's for the best approximation.
📌 Your scenario: transfer the ELBO mindset to decision support — when a quantity is "uncomputable exactly," find a computable proxy with a guaranteed direction (lower/upper bound), then keep optimizing it toward the true value. This is the same engineering philosophy as your distributed-systems habit of replacing "strong-consistency exact solutions" with "approximate consistency + monotone convergence": give up exactness, guarantee the direction, iterate toward the target.
Takeaway + question
💡 ELBO turns the intractable evidence into an optimizable lower bound: log p(x) = ELBO + KL; maximizing ELBO simultaneously approximates the true posterior. It = reconstruction − KL regularizer, the shared foundation of VAEs / variational inference / diffusion.
🤔 Inside ELBO the reconstruction and KL terms tug against each other: raise the KL weight and the latent space gets tidier but reconstruction blurs (β-VAE tunes exactly this weight). Is this "fidelity vs regularity" trade-off structurally the same tension as "consistency vs availability" (CAP) you know well?

Deep Questions

1. Why do almost all classification models use cross-entropy loss rather than the more intuitive "error rate" or MSE?
Three layers. Information-theoretic: minimizing cross-entropy = minimizing KL(true‖predicted), equivalent to pushing the model distribution toward the data distribution — a theoretically grounded "learn the true distribution," whereas error rate only counts right/wrong and discards "how badly wrong." Gradient: cross-entropy with softmax gives a clean (prediction − label) gradient, never vanishing on extreme errors the way MSE+sigmoid saturates (gradient trends to zero exactly when the prediction is most wrong, stalling training). Probabilistic: cross-entropy is the direct embodiment of maximum likelihood — minimizing it = maximizing the log-likelihood of the training data, which is asymptotically optimal statistically. Error rate is non-differentiable, no gradient descent; MSE assumes Gaussian noise, mismatched to discrete class structure. So cross-entropy isn't an engineering preference — it's the "correct answer" pointed to jointly from information theory, maximum likelihood, and optimization.
2. KL's asymmetry KL(P‖Q)≠KL(Q‖P) isn't a "mathematical flaw" but carries deep meaning. What behaviors do forward and reverse KL each induce in generative models?
This is the key to a generative model's "style." Forward KL, minimizing KL(P‖Q) (P true, Q model): wherever P has probability mass, Q must too, or log(p/q) with q→0 makes the cost explode — so Q is forced to "cover all modes of P," i.e. mean-seeking / mode-covering. The cost: when P is multimodal, Q blurs into the low-probability valleys between peaks, producing "neither-fish-nor-fowl" samples. Maximum-likelihood training is forward KL, one reason early VAE outputs were blurry. Reverse KL, minimizing KL(Q‖P): it suffices for Q to place mass where P is high, safely ignoring P's other peaks — yielding mode-seeking / mode-collapse: lock onto one peak, sharp but low-diversity. Variational inference and some GAN training lean reverse KL, hence mode collapse. In a phrase: forward KL "would rather blur than miss," reverse KL "would rather miss than blur." Choosing the direction is declaring a stance on "diversity vs fidelity."
3. Entropy, KL, mutual information, ELBO look like four concepts — can we string them into a tree with "KL as the sole atom"?
Yes, and the tree is elegant. KL is the base atom. (1) Entropy is a degenerate special case: cross-entropy H(P,Q)=H(P)+KL(P‖Q), and with Q uniform, KL(P‖uniform) = log N − H(P), so entropy = max-entropy minus "how far P is from uniform" as KL. (2) Mutual information is directly KL: I(X;Y)=KL(joint p(x,y) ‖ independent p(x)p(y)) — measuring "how far the joint deviates from independence." (3) ELBO is born from KL: log p(x)=ELBO+KL(q‖true posterior), so ELBO is "evidence minus the approximation's KL gap." So all four are essentially "using KL to measure the gap between some two distributions" applied to different objects: entropy = KL to uniform (negated/adjusted), mutual information = joint vs independent KL, ELBO = approximate vs true-posterior KL. Master the single intuition "KL measures distribution gaps" and all four connect — which is why information theory can be the unifying language of deep learning.
4. Is information theory's "entropy" the same thing as physics' (thermodynamics / statistical mechanics) "entropy," or just a name coincidence? What does this suggest about intelligence / life?
Not a coincidence: it is the same mathematical object in different contexts. Boltzmann's statistical-mechanics entropy S=k·ln W (W = number of microstates) and Shannon entropy H=−Σp log p coincide, up to the constant k and the choice of log base, under the uniform-distribution assumption; Shannon adopted the name "entropy" on von Neumann's suggestion. The deep link: both measure "beneath the macro-observable, how many micro arrangements are possible" = uncertainty = missing information. This gives a fascinating view of intelligence/life: life is a "local resistance to entropy increase," a dissipative structure; Schrödinger's What Is Life? says life "feeds on negative entropy." And what an intelligent agent (including an LLM) does is essentially reduce uncertainty about the environment through observation (lowering entropy / acquiring mutual information) and act on it. Friston's free-energy principle goes further: the brain minimizes "the gap between prediction and reality," that is, it minimizes a variational free energy, which is the negative of an ELBO. For anyone pursuing "consciousness × complexity science × distributed systems," this thread weaves information theory, thermodynamics, neuroscience, and Buddhist "impermanence / uncertainty" into one web: intelligence may just be the universe locally turning uncertainty into structure.
5. ELBO's "give up the exact solution, optimize a guaranteed lower bound" approach — what isomorphic engineering patterns exist in your distributed-systems experience?
This is a cross-domain "approximation + monotone guarantee + iterative convergence" meta-pattern, isomorphic all over distributed systems. (1) Eventual consistency: give up the exact, instantaneous strong-consistency solution, guarantee the "eventually converges" direction, iterate (gossip / anti-entropy protocols — the name literally contains entropy) toward a consistent state. (2) Approximate aggregation: HyperLogLog / Count-Min Sketch trade exact dedup/count for a bounded-error approximation, just as ELBO trades exact evidence for a computable lower bound. (3) Bounded relaxation: many NP-hard scheduling problems use LP relaxation for a lower bound, then tighten it — structurally identical to "maximizing ELBO to raise the bound." (4) Gradient descent itself: give up the analytic optimum, iterate via computable local gradients. The shared philosophy: when the exact solution is intractable, step one isn't to give up but to construct a "direction-guaranteed (monotone / bounded) + computable" proxy, then pour all engineering effort into iteratively optimizing it. ELBO writes this philosophy as math; distributed systems write it as protocols. Recognize the isomorphism and you'll have a ready engineering anchor for every new ML objective you meet.
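The gradient claim in question 1 can be checked numerically. The sketch below compares the analytic softmax cross-entropy gradient (prediction − label) with a finite-difference estimate, then contrasts sigmoid+MSE with sigmoid+cross-entropy on a badly wrong prediction; the logits and labels are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):
    return float(-np.sum(y * np.log(softmax(z))))

def numeric_grad(f, z, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        g[i] = (f(z + step) - f(z - step)) / (2 * h)
    return g

z = np.array([2.0, -1.0, 0.5])       # hypothetical logits
y = np.array([0.0, 1.0, 0.0])        # one-hot label

analytic = softmax(z) - y            # the "(prediction - label)" gradient
numeric = numeric_grad(lambda v: ce_loss(v, y), z)
print(analytic, numeric)

# Binary case, label 1, prediction confidently wrong (pre-activation a = -10).
a = -10.0
s = 1.0 / (1.0 + np.exp(-a))         # sigmoid output, about 4.5e-5
grad_mse = 2 * (s - 1.0) * s * (1 - s)   # d/da of (s - 1)^2: saturates toward 0
grad_ce = s - 1.0                        # d/da of -log(s): stays near -1
print(grad_mse, grad_ce)
```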