Almost every ML loss function, generative-model objective, and alignment derivation rests on the same information-theoretic language underneath. Today nails down four atomic concepts: Entropy (quantifies uncertainty), KL Divergence (quantifies how far apart two distributions are), Mutual Information (quantifies how much two variables share), and ELBO (makes intractable Bayesian inference computable). Grasp these and cross-entropy loss, VAEs, and contrastive learning stop being black boxes.
Entropy is the theoretical lower bound on lossless compression, measured in bits per symbol. How small gzip or Huffman coding can possibly squeeze a file is governed by the entropy of the source that produced it. A file of all AAAA... has entropy near 0 (gzip crushes it to almost nothing); a truly random byte stream has maximal entropy (incompressible). Entropy = the "true information content" of the data, independent of which codec you use.
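The compression claim can be checked empirically. The sketch below is only an illustration, assuming zlib (the DEFLATE algorithm that gzip also uses) as a stand-in for a practical codec; the variable names and the 100,000-byte size are arbitrary choices, not part of the original text.

```python
import os
import zlib

n = 100_000
repetitive = b"A" * n          # one symbol repeated: entropy near 0
random_bytes = os.urandom(n)   # uniformly random bytes: about 8 bits of entropy per byte

# Compress both inputs at the highest zlib level and compare output sizes.
len_repetitive = len(zlib.compress(repetitive, 9))
len_random = len(zlib.compress(random_bytes, 9))

print("repetitive:", len_repetitive, "bytes")
print("random:    ", len_random, "bytes")
```

The repetitive input shrinks to a tiny fraction of its size, while the random input stays at roughly its original length, since no codec can beat the entropy bound on average.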
Shannon, in 1948, faced an engineering question: over a communication channel, what is the minimum number of bits needed to transmit a message losslessly? That requires quantifying "amount of information." The core intuition is surprise (self-information), defined for an outcome x with probability p(x) as:
$$I(x) = -\log_2 p(x)$$
The rarer the event (smaller p), the more information it carries — "the sun rises tomorrow" has p≈1, information≈0, nobody shares it; "a meteor lands tomorrow" has tiny p, enormous information, front-page news. We use log because information must be additive: two independent events together should carry summed information, while probabilities multiply — log turns multiplication into addition. With base 2, the unit is bits.
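A quick numeric check of the additivity argument, as a minimal sketch: the helper name `surprise_bits` and the two probabilities are made up for illustration.

```python
import math

def surprise_bits(p):
    """Self-information -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

p_a, p_b = 0.5, 0.25        # two independent events (illustrative values)
p_joint = p_a * p_b         # independence: probabilities multiply

sum_of_parts = surprise_bits(p_a) + surprise_bits(p_b)
joint_surprise = surprise_bits(p_joint)

print(sum_of_parts, joint_surprise)  # the two values agree
```

Because the log turns the product p_a * p_b into a sum, the surprise of the joint event equals the sum of the individual surprises.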
Entropy is the expectation of surprise (probability-weighted average), the "average information per event":
$$H(X) = \mathbb{E}[I(x)] = -\sum_x p(x) \log_2 p(x)$$
The more uniform the distribution (every outcome possible), the higher the entropy; the sharper it is (one outcome near-certain), the lower. A fair coin H=1 bit; a coin loaded to 99% heads H≈0.08 bit.
    import numpy as np

    def entropy(p, base=2):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # log(0) undefined, drop zero-prob terms
        return -np.sum(p * np.log(p) / np.log(base))

    print(entropy([0.5, 0.5]))                # 1.0   fair coin = 1 bit
    print(entropy([0.99, 0.01]))              # ≈ 0.08  near-certain, very low info
    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0   4 equal outcomes = 2 bit
    # Sanity check: 4 equiprobable outcomes need 2 bits (00/01/10/11), entropy = 2
KL divergence = the cost of using the wrong histogram. A database query optimizer relies on a column's statistical distribution (histogram) to estimate cardinality and pick a plan. If the true distribution is P but the optimizer decides with a stale, biased Q, the wasted row scans and extra I/O play the role of KL(P‖Q). The further off the estimate, the higher the cost; a perfect estimate costs 0.
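Goal: quantify how far the "true distribution P" is from "my approximation Q." Information theory's answer is elegant: the average extra cost per symbol you pay when you compress data actually drawn from P using a code designed for Q (in bits with a base-2 log, in nats with the natural log):
$$\mathrm{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$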
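Decomposed, it equals exactly cross-entropy minus the true entropy:
$$\mathrm{KL}(P \,\|\, Q) = -\sum_x P(x) \log Q(x) - \Big(-\sum_x P(x) \log P(x)\Big) = H(P, Q) - H(P)$$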
This is the bedrock logic of classification training: the model minimizes cross-entropy loss, and since H(P) is a constant fixed by the data, minimizing cross-entropy = minimizing KL(P‖Q) = pushing the model's predicted distribution Q toward the true label distribution P. The cross-entropy loss you use daily is, at heart, compressing KL away.
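The claim that cross-entropy and KL differ only by the constant H(P) can be verified numerically. The following is a minimal sketch with made-up distributions; the helper names are hypothetical, and all probabilities are assumed strictly positive so no log(0) handling is needed.

```python
import numpy as np

def entropy_nats(p):
    """H(P) in nats; assumes every entry of p is > 0."""
    return -np.sum(p * np.log(p))

def cross_entropy_nats(p, q):
    """H(P, Q) in nats; assumes every entry of q is > 0."""
    return -np.sum(p * np.log(q))

def kl_nats(p, q):
    """KL(P || Q) in nats; assumes every entry of p and q is > 0."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])            # "true" label distribution (illustrative)
q_bad = np.array([0.5, 0.3, 0.2])        # a poor model prediction
q_good = np.array([0.65, 0.25, 0.10])    # a closer model prediction

for name, q in [("bad", q_bad), ("good", q_good)]:
    ce, kl = cross_entropy_nats(p, q), kl_nats(p, q)
    # ce - kl is the same for every q: it is H(P)
    print(name, "CE =", ce, "KL =", kl, "CE - KL =", ce - kl)

print("H(P) =", entropy_nats(p))
```

Since the gap CE - KL does not depend on Q, any change to Q that lowers cross-entropy lowers KL(P‖Q) by the same amount.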
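Three key properties: (1) KL ≥ 0, zero iff P=Q; (2) asymmetric: KL(P‖Q) ≠ KL(Q‖P) in general, so it is not a distance! (3) the direction has meaning: forward KL (minimizing KL(P‖Q) over Q) tends to "cover all modes of P" (mean-seeking, blurs them together), while reverse KL (minimizing KL(Q‖P) over Q) tends to "lock onto one mode" (mode-seeking, drops other peaks). Variational inference, including the VAE's encoder, uses reverse KL to fit the approximate posterior. The "conservative, blurry" look of VAE samples, by contrast, is usually attributed to the mode-covering, maximum-likelihood side of the objective combined with a simple decoder likelihood, not to the reverse KL term.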
    import numpy as np

    def kl(p, q):
        """KL(P‖Q) in nats; assumes q > 0 wherever p > 0."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0  # terms with p = 0 contribute 0; masking avoids log(0) warnings
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl(p, q))  # ≈ 0.51  coding P with biased Q costs 0.51 nat extra
    print(kl(q, p))  # ≈ 0.37  reverse ≠ forward, demonstrating asymmetry
    # Engineering meaning: the "true label P first, model prediction Q second"
    # direction must not be flipped; flip it and the objective changes
    # (mean-seeking vs mode-seeking)
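The mean-seeking versus mode-seeking behavior can be made concrete by fitting a single Gaussian Q to a two-peaked target P under each direction. This is a rough sketch under stated assumptions: the target mixture, the integration grid, and the brute-force parameter search are all illustrative choices, and densities are handled in log space to avoid underflow.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)   # integration grid
dx = x[1] - x[0]

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Target P: equal mixture of two narrow Gaussians centered at -3 and +3.
log_p = np.logaddexp(np.log(0.5) + log_normal(x, -3.0, 0.5),
                     np.log(0.5) + log_normal(x, 3.0, 0.5))
p = np.exp(log_p)

def forward_kl(mu, sigma):
    """KL(P || Q) for Q = N(mu, sigma^2), by numerical integration."""
    log_q = log_normal(x, mu, sigma)
    return np.sum(p * (log_p - log_q)) * dx

def reverse_kl(mu, sigma):
    """KL(Q || P) for Q = N(mu, sigma^2), by numerical integration."""
    log_q = log_normal(x, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

mus = np.linspace(-4.0, 4.0, 81)
sigmas = np.linspace(0.3, 4.0, 75)

def best_fit(objective):
    """Brute-force search for the (mu, sigma) that minimizes the objective."""
    scores = [(objective(m, s), m, s) for m in mus for s in sigmas]
    return min(scores)[1:]

fwd_mu, fwd_sigma = best_fit(forward_kl)
rev_mu, rev_sigma = best_fit(reverse_kl)
print("forward KL fit:", fwd_mu, fwd_sigma)   # wide Gaussian centered between the peaks
print("reverse KL fit:", rev_mu, rev_sigma)   # narrow Gaussian sitting on one peak
```

The forward-KL fit spreads out to cover both peaks and puts its mean in the empty valley between them, while the reverse-KL fit collapses onto a single peak and ignores the other.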
Mutual information = the functional-dependency strength between two columns. In a database, if knowing zip_code almost determines city, the two columns are highly redundant, which means high mutual information. If two columns are fully independent (knowing one tells you nothing about the other), mutual information is 0. Mutual information quantifies "how much knowing X eliminates of the uncertainty about Y."
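Pain point: the Pearson correlation coefficient captures only linear relationships. A strong dependence like Y = X² can have a correlation of 0 (when X is symmetric around 0) and fool you into "unrelated." Mutual information captures any form of statistical dependence. It has two equivalent views:
$$I(X;Y) = H(X) - H(X \mid Y)$$
$$I(X;Y) = \mathrm{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$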
First view: X's original uncertainty, minus X's remaining uncertainty "once Y is known" — the difference is exactly the uncertainty Y removed for you. Fully independent: knowing Y is useless, difference 0. Fully determined: knowing Y locks X, difference = H(X).
The second view is deeper: mutual information is the KL divergence of the "true joint distribution" from "pretending the two are independent." If truly independent, p(x,y)=p(x)p(y), KL=0. The further the deviation, the stronger the dependence. In a phrase: mutual information = the degree to which independence is violated.
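Both views can be computed side by side on a small discrete joint distribution. The sketch below uses a made-up 2×2 joint table, and the first view relies on the standard chain rule H(X|Y) = H(X,Y) − H(Y); all names are illustrative.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
px = joint.sum(axis=1)   # marginal p(x)
py = joint.sum(axis=0)   # marginal p(y)

def entropy_bits(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# View 1: uncertainty about X removed by knowing Y.
h_x_given_y = entropy_bits(joint) - entropy_bits(py)   # H(X|Y) = H(X,Y) - H(Y)
mi_view1 = entropy_bits(px) - h_x_given_y

# View 2: KL divergence of the joint from the "pretend independent" product.
product = np.outer(px, py)
mi_view2 = np.sum(joint * np.log2(joint / product))

# Control: a joint that really is independent has zero mutual information.
independent = np.outer(px, py)
ind_product = np.outer(independent.sum(axis=1), independent.sum(axis=0))
mi_independent = np.sum(independent * np.log2(independent / ind_product))

print(mi_view1, mi_view2, mi_independent)
```

The two views give the same number for the dependent table, and the independent control comes out at zero up to floating-point error.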
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.feature_selection import mutual_info_regression

    np.random.seed(0)
    x = np.random.uniform(-3, 3, 2000)
    y = x**2 + np.random.normal(0, 0.1, 2000)  # strong nonlinear dependence

    print(pearsonr(x, y)[0])  # close to 0: correlation says "unrelated"
    # Nearest-neighbor MI estimate in nats; the exact value depends on estimator settings
    print(mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0])  # clearly above 0: MI catches the dependence
    # Conclusion: zero correlation ≠ independence; MI is zero only under true
    # independence, though estimates from finite samples are noisy
ELBO = using an approximate query instead of an exact full-table aggregate. You want to compute an exact value, but it requires summing/integrating over a vast number of hidden states, which is computationally infeasible (like running an exact COUNT(DISTINCT) over a petabyte-scale table; a full scan blows up). What do you do in practice? Use a computable approximation such as HyperLogLog or sampling. The analogy is loose in one respect: HyperLogLog can err in either direction, whereas ELBO comes with a one-sided guarantee. ELBO is a "guaranteed-not-to-overestimate" lower bound on the log evidence in Bayesian inference, and you can keep pushing the bound up toward the true value.
The core obstacle in Bayesian inference: to compute the evidence p(x), the marginal likelihood of data x, you must integrate over all latent variables z:
$$p(x) = \int p(x, z)\, dz = \int p(x \mid z)\, p(z)\, dz$$
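For any interesting model this integral is intractable. So introduce a computable approximate posterior q(z) and make a neat identity decomposition:
$$\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$
Because the KL term is never negative, ELBO(q) ≤ log p(x), and raising the ELBO pulls q(z) toward the true posterior p(z|x).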
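ELBO itself splits into two intuitive terms:
$$\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z) \,\|\, p(z)\big)$$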
First term = reconstruction: how well decoding x back from the latent z restores the data (higher when more faithful). Second term = regularizer: keep the approximate posterior q(z) from straying too far from the prior p(z) (usually a standard normal). The negative of this quantity is exactly the training loss of the VAE (Variational Autoencoder), introduced in Kingma & Welling 2013's Auto-Encoding Variational Bayes: the encoder produces q(z|x), the decoder computes p(x|z), and training maximizes the ELBO. Diffusion models are trained on a variational bound of the same kind.
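With a discrete latent variable the evidence can be summed exactly, which makes the lower-bound property easy to check. The sketch below is a toy setup, not a VAE: the three-state latent, the prior, the likelihood values, and the trial q are all made-up numbers for one fixed observation x.

```python
import numpy as np

# Hypothetical toy model: latent z in {0, 1, 2}, one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])            # p(z)
likelihood = np.array([0.10, 0.60, 0.30])    # p(x | z) for the observed x

evidence = np.sum(prior * likelihood)        # p(x), tractable here because z is tiny
posterior = prior * likelihood / evidence    # exact p(z | x) via Bayes' rule

def kl(a, b):
    """KL(a || b) in nats for strictly positive discrete distributions."""
    return np.sum(a * np.log(a / b))

def elbo(q):
    """Reconstruction term minus KL to the prior."""
    return np.sum(q * np.log(likelihood)) - kl(q, prior)

q = np.array([0.2, 0.5, 0.3])                # an arbitrary approximate posterior
gap = np.log(evidence) - elbo(q)

print("log p(x)        =", np.log(evidence))
print("ELBO(q)         =", elbo(q))
print("gap             =", gap)
print("KL(q || post)   =", kl(q, posterior))
print("ELBO(posterior) =", elbo(posterior))
```

The gap between log p(x) and the ELBO equals KL(q‖p(z|x)), so the bound is tight only when q is the exact posterior.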
    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_recon, mu, logvar):
        # Term 1: reconstruction (negated because maximizing ELBO = minimizing -ELBO)
        # binary_cross_entropy expects both x and x_recon to lie in [0, 1]
        recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
        # Term 2: KL( q(z|x)=N(mu,σ²) ‖ p(z)=N(0,1) ), closed form for Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl  # = -ELBO; minimizing it = maximizing ELBO

    # Intuition: recon forces the model to "reconstruct the data,"
    # kl forces the latent space to "hug a standard normal, not sprawl."
    # Their tug-of-war is why VAEs are both continuous and slightly blurry.
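The closed-form KL term in the loss can be sanity-checked against a sampling estimate. This is a minimal NumPy sketch, assuming a diagonal Gaussian q and a standard normal prior as in the code above; the function names, the example `mu` and `logvar` values, and the sample count are illustrative.

```python
import numpy as np

def gaussian_kl_closed_form(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), same formula as the VAE loss."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

def gaussian_kl_monte_carlo(mu, logvar, n_samples=200_000, seed=0):
    """Estimate the same KL as the sample average of log q(z) - log p(z), z ~ q."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * logvar)
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    log_q = -0.5 * (((z - mu) / sigma) ** 2 + logvar + np.log(2 * np.pi))
    log_p = -0.5 * (z**2 + np.log(2 * np.pi))
    return np.mean(np.sum(log_q - log_p, axis=1))

mu = np.array([0.5, -1.0])        # illustrative encoder outputs
logvar = np.array([-0.5, 0.3])

kl_exact = gaussian_kl_closed_form(mu, logvar)
kl_sampled = gaussian_kl_monte_carlo(mu, logvar)
print(kl_exact, kl_sampled)       # the two should agree up to sampling noise
```

The closed form is preferred in training because it is exact and adds no sampling variance to the gradient of the regularizer.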