Every model so far hands you a point estimate—"the answer is 0.87." But is 0.87 "I'm highly certain" or "I'm just guessing"? A vanilla neural net can't tell the difference. The core ambition of Bayesian methods: don't output a number, output a distribution—"around 0.87, but plausibly anywhere from 0.6 to 0.95." Today we make sense of the two computational engines behind this idea (MCMC sampling, variational inference), the capability it brings to deep learning (uncertainty quantification), and the probabilistic programming languages that package this math into usable tools.
You want to know the shape of some aggregate distribution over a massive database, but you can't do a full table scan (the integral is intractable). MCMC dispatches a random-walking crawler through the state space with one rule: linger more where probability density is high, less where it's low. After it wanders long enough, the crawler's "visit-frequency histogram" approximates the distribution you wanted. It's the same idea as the Monte Carlo sampling you use to estimate p99—replace the full population with samples.
Bayesian inference revolves around Bayes' rule: posterior ∝ likelihood × prior, written as p(θ|D) = p(D|θ)·p(θ) / p(D). The numerator is easy (your model gives it directly). The killer is the denominator p(D)—an integral over all possible parameters θ, ∫ p(D|θ)p(θ)dθ, simply uncomputable in high dimensions (the "normalizing constant" problem).
MCMC's trick: it never needs that denominator—each step only looks at "the density ratio between the new and old position," and the denominator cancels in the ratio. The classic Metropolis-Hastings algorithm is just four steps:
Naive jittering is hopelessly inefficient in high dimensions (like a drunk staggering around). Modern samplers use Hamiltonian Monte Carlo (HMC)—borrowing physics' "a ball rolling on an energy surface" to make proposals follow the gradient, stepping far while rarely getting rejected. But HMC requires hand-tuning "how many steps to roll"—tune it wrong and it breaks. NUTS (No-U-Turn Sampler, Hoffman & Gelman 2011) lets the algorithm decide automatically when to stop (stop once the ball starts doubling back), and that's the origin of the default sampler in PyMC / Stan today.
import numpy as np # Sample from an unnormalized target (bimodal); shows MCMC needs no denominator def target(x): # only the "shape" of the density, no integral constant return np.exp(-(x-2)**2/0.5) + np.exp(-(x+2)**2/0.5) samples, x = [], 0.0 for _ in range(50000): x_new = x + np.random.normal(0, 1) # ② propose nearby a = min(1, target(x_new) / target(x)) # ③ ratio, constant cancels if np.random.rand() < a: # ④ roll the die to accept or not x = x_new samples.append(x) # Drop the early un-converged "burn-in"; the rest approximates the target print(np.mean(samples[5000:]), np.std(samples[5000:]))
MCMC is slow and hard to scale—a million samples is a disaster on large datasets. Variational inference (VI) flips the approach: instead of sampling that complex posterior exactly, find a simple distribution (e.g. a Gaussian) that "fits" it. It's like using a precomputed materialized view / cache to approximate an expensive real-time aggregate—give up a little accuracy for a huge speedup. VI turns an "inference problem" into the optimization problem you know best: tune parameters so the approximate distribution hugs the true posterior.
Let the true posterior be p(θ|D) (uncomputable). We introduce a simple distribution qφ(θ) with parameters φ (say, a Gaussian with unknown mean and variance). Goal: tune φ so the "distance" between q and p is minimized. Distance is measured by KL divergence (Day 33: the information gap between two distributions).
But KL(q‖p) again hides that uncomputable posterior. With one algebraic rearrangement, minimizing the KL is equivalent to maximizing a quantity called the ELBO (Evidence Lower BOund):
Intuition: term 1 forces q to explain the observed data well (high likelihood); term 2 is a spring pulling back toward the prior, preventing overfitting. The two tug against each other—this is exactly the Bayesian version of "fit vs regularize", the same tension you saw on Day 10. It's a "lower bound" because the ELBO is always ≤ the true evidence log p(D), so pushing it up pushes q toward p.
Crucially, the ELBO is differentiable, so you can run it straight through Adam. Rewrite the Gaussian's "sample" as "mean + std × standard noise" and gradients can flow back through the random sampling—this reparameterization trick is the heart of Kingma & Welling's VAE paper (2013), and the same building block behind VAEs and diffusion models. The automated version in PyMC / NumPyro is called ADVI.
import pymc as pm import numpy as np y = np.random.normal(5, 2, size=200) # fake data: true mean 5 with pm.Model() as model: mu = pm.Normal("mu", 0, 10) # prior: mean unknown sigma = pm.HalfNormal("sigma", 5) pm.Normal("obs", mu, sigma, observed=y) # likelihood # No MCMC sampling; fit with variational inference — orders faster on big data approx = pm.fit(20000, method="advi") # maximize the ELBO idata = approx.sample(1000) # draw from the fitted q print(idata.posterior["mu"].mean().item()) # ≈ 5, with a posterior interval
A vanilla neural net is forever brimming with confidence—feed it a garbage image it's never seen and it'll flatly declare "that's a cat, 99%." That's like a service that never returns an error code, always 200 OK, even when the backend crashed long ago. Uncertainty quantification (UQ) installs an honest "I don't know" signal—distinguishing "I've seen data like this, I'm confident" from "this is outside what I know, don't trust me."
The key is distinguishing two fundamentally different kinds of uncertainty, the bedrock of the whole field:
How do you get an epistemic signal out of a neural net? Core idea: don't trust one model, look at the disagreement among "a crowd of models". Where training data is dense everyone agrees (low uncertainty); in unseen regions models guess wildly and clash (high uncertainty). The two most practical implementations:
import torch # MC Dropout: keep dropout on at inference, run many times, read the disagreement def predict_with_uncertainty(model, x, n=30): model.train() # key: train mode keeps dropout active preds = torch.stack([model(x) for _ in range(n)]) # N random forward passes mean = preds.mean(0) # prediction std = preds.std(0) # uncertainty: high spread = model unsure return mean, std mean, std = predict_with_uncertainty(my_net, x_test) # high-std samples → model is "guessing", route to human review or abstain mask = std.squeeze() > 0.3 print(f"samples needing human review: {mask.sum().item()}")
The math behind the previous three concepts is hard. A probabilistic programming language (PPL) packages it into a declarative language: you only "declare" the probabilistic model (what the prior is, how the data is generated), and the inference engine automatically runs MCMC or VI for you. This is exactly SQL's relationship to a database—you write "what you want," and the query planner decides "how to compute it." PyMC / Stan are the SQL of the probabilistic world, freeing you from hand-writing samplers.
Before PPLs, every new model meant rewriting Metropolis acceptance rates, gradients, and convergence diagnostics—horribly error-prone. The core mechanism of a PPL is to decouple model definition from inference algorithm:
Main choices: Stan—the statistics community's gold standard, most rigorous, a standalone DSL; PyMC—pure Python, fastest to pick up; NumPyro—built on JAX, GPU-accelerated, fastest, great for big-model Bayes. They share the same engine core: auto-diff + NUTS sampling + variational inference, differing only in syntax and speed.
import pymc as pm import numpy as np # Bayesian linear regression: note we only "declare" the model, no sampling logic x = np.linspace(0, 10, 50) y = 2.5 * x + 1.0 + np.random.normal(0, 1, 50) # true slope 2.5 with pm.Model() as model: slope = pm.Normal("slope", 0, 5) # ① prior inter = pm.Normal("inter", 0, 5) noise = pm.HalfNormal("noise", 2) # ② likelihood + ③ bind data pm.Normal("y", slope*x + inter, noise, observed=y) idata = pm.sample(1000) # ④ engine auto-runs NUTS + diagnostics # What you get is not one slope, but the slope's entire posterior distribution pm.summary(idata) # outputs mean, 94% interval, r_hat, ess