AI/ML Explained: Probabilistic Programming & Bayesian Deep Learning

Day 38 · 2026-06-24 · Difficulty ★★★★☆

For: engineers with coding experience, not from an AI background

Every model so far hands you a point estimate—"the answer is 0.87." But is 0.87 "I'm highly certain" or "I'm just guessing"? A vanilla neural net can't tell the difference. The core ambition of Bayesian methods: don't output a number, output a distribution—"around 0.87, but plausibly anywhere from 0.6 to 0.95." Today we make sense of the two computational engines behind this idea (MCMC sampling, variational inference), the capability it brings to deep learning (uncertainty quantification), and the probabilistic programming languages that package this math into usable tools.

Markov Chain Monte Carlo (MCMC)

sampling · bayesian inference

One-line analogy

You want to know the shape of some aggregate distribution over a massive database, but you can't do a full table scan (the integral is intractable). MCMC dispatches a random-walking crawler through the state space with one rule: linger more where probability density is high, less where it's low. After it wanders long enough, the crawler's "visit-frequency histogram" approximates the distribution you wanted. It's the same idea as the Monte Carlo sampling you use to estimate p99—replace the full population with samples.

What it solves + how it works

Bayesian inference revolves around Bayes' rule: posterior ∝ likelihood × prior, written as p(θ|D) = p(D|θ)·p(θ) / p(D). The numerator is easy (your model gives it directly). The killer is the denominator p(D)—an integral over all possible parameters θ, ∫ p(D|θ)p(θ)dθ, simply uncomputable in high dimensions (the "normalizing constant" problem).

MCMC's trick: it never needs that denominator—each step only looks at "the density ratio between the new and old position," and the denominator cancels in the ratio. The classic Metropolis-Hastings algorithm is just four steps:

Metropolis-Hastings one-step loop

① current point θ → ② propose new point θ' (jitter nearby)

③ acceptance a = min(1, p(θ')/p(θ)) ← denominator cancels! (assumes a symmetric proposal; otherwise multiply the ratio by q(θ|θ')/q(θ'|θ))

④ roll a die: move to θ' with prob a, else stay at θ

repeat a million times → distribution of resting points ≈ true posterior

Naive jittering is hopelessly inefficient in high dimensions (like a drunk staggering around). Modern samplers use Hamiltonian Monte Carlo (HMC)—borrowing physics' "a ball rolling on an energy surface" to make proposals follow the gradient, stepping far while rarely getting rejected. But HMC requires hand-tuning "how many steps to roll"—tune it wrong and it breaks. NUTS (No-U-Turn Sampler, Hoffman & Gelman 2011) lets the algorithm decide automatically when to stop (stop once the ball starts doubling back), and that's the origin of the default sampler in PyMC / Stan today.
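To make the "rolling ball" picture concrete, here is a minimal sketch of HMC on a 1-D standard normal target, using only the standard library. The step size `eps`, the step count `n_steps`, and the function names are illustrative assumptions, not PyMC or Stan internals; NUTS would choose the number of steps automatically instead of fixing it.

```python
import math
import random

def grad_U(x):
    # Potential energy U(x) = x^2 / 2 is the negative log density of N(0, 1),
    # so its gradient is simply x.
    return x

def hmc_step(x, rng, eps=0.2, n_steps=10):
    p = rng.gauss(0.0, 1.0)                      # give the ball a random kick
    h_old = 0.5 * x * x + 0.5 * p * p            # total energy before the roll

    # Leapfrog integration: half momentum step, full position steps, half momentum step
    x_new, p_new = x, p
    p_new -= 0.5 * eps * grad_U(x_new)
    for i in range(n_steps):
        x_new += eps * p_new
        if i < n_steps - 1:
            p_new -= eps * grad_U(x_new)
    p_new -= 0.5 * eps * grad_U(x_new)

    h_new = 0.5 * x_new * x_new + 0.5 * p_new * p_new
    # Metropolis correction for integration error; accepted most of the time
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x

rng = random.Random(0)
x, hmc_samples = 0.0, []
for _ in range(5000):
    x = hmc_step(x, rng)
    hmc_samples.append(x)

hmc_mean = sum(hmc_samples) / len(hmc_samples)
hmc_var = sum((s - hmc_mean) ** 2 for s in hmc_samples) / len(hmc_samples)
print(hmc_mean, hmc_var)   # should land near 0 and 1 for this target
```

Each proposal travels `eps * n_steps` along the energy surface rather than taking one small jitter, which is why consecutive samples are far less correlated than in a random walk.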

Code example

import numpy as np

# Sample from an unnormalized target (bimodal); shows MCMC needs no denominator
def target(x):           # only the "shape" of the density, no integral constant
    return np.exp(-(x-2)**2/0.5) + np.exp(-(x+2)**2/0.5)

samples, x = [], 0.0
for _ in range(50000):
    x_new = x + np.random.normal(0, 1)      # ② propose nearby
    a = min(1, target(x_new) / target(x))  # ③ ratio, constant cancels
    if np.random.rand() < a:                 # ④ roll the die to accept or not
        x = x_new
    samples.append(x)

# Drop the early un-converged "burn-in"; the rest approximates the target
print(np.mean(samples[5000:]), np.std(samples[5000:]))
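The loop above runs a single chain, and a single chain cannot reveal that it is stuck. A common check is to run several chains and compare them with the R-hat statistic. The sketch below implements the classic (non-split) Gelman-Rubin formula with the standard library. The function name `r_hat` and the synthetic "chains" (independent Gaussian draws standing in for real sampler output) are assumptions for illustration; ArviZ ships a more robust version.

```python
import random
import statistics

def r_hat(chains):
    """Classic (non-split) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(1)

# Four chains exploring the same distribution
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Four chains where two are stuck in each mode of a bimodal target
stuck = [[rng.gauss(mode, 0.5) for _ in range(1000)] for mode in (-2, -2, 2, 2)]

r_mixed = r_hat(mixed)
r_stuck = r_hat(stuck)
print(f"agreeing chains: R-hat = {r_mixed:.3f}")
print(f"stuck chains:    R-hat = {r_stuck:.3f}")
```

Chains that agree give a value very close to 1, while chains trapped in different modes give a value far above it, even though each stuck chain looks perfectly healthy on its own.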

Common pitfall + use case

"The sampler finished, so it must be right"—wrong. MCMC's biggest trap is thinking you converged when you didn't: a chain can get stuck in one mode, or samples can be highly autocorrelated (so the effective sample count is tiny). You must read the diagnostics—R-hat (do multiple chains agree? should be ≈ 1.0) and ESS (effective sample size; too low means redundant samples). A Bayesian result without diagnostics is no result at all.

📌 Super-individual scenario: for a personal A/B decision (which of two blurbs converts better), sample sizes are often tiny (30 clicks each). The frequentist gives one point estimate "3.2% vs 4.1%," and you can't tell if the gap is trustworthy. Run a Bayesian proportion model with MCMC and get directly "the probability B beats A is 73%"—on small samples, Bayes honestly hands the uncertainty back to you, far better for a call than a falsely precise point estimate.
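A minimal sketch of that A/B calculation follows. For two simple proportions the posterior is available in closed form (a Beta distribution), so this version samples it directly rather than running MCMC. The counts are hypothetical and are not the figures quoted above; the uniform Beta(1, 1) prior is also an assumption.

```python
import random

rng = random.Random(0)

# Hypothetical data: (conversions, visitors) for each blurb
conv_a, n_a = 3, 100
conv_b, n_b = 5, 100

draws = 20_000
wins = 0
for _ in range(draws):
    # Under a uniform Beta(1, 1) prior, the posterior of a conversion rate
    # is Beta(1 + successes, 1 + failures)
    p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
    p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
    wins += p_b > p_a

prob_b_beats_a = wins / draws
print(f"P(B beats A) is about {prob_b_beats_a:.2f}")
```

The output is a single decision-ready probability, and it moves toward 0.5 as the data gets thinner, which is the "honest uncertainty" the scenario describes.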

Takeaway + question

💡 MCMC trades "an uncomputable integral" for "obtainable samples"—using the resting frequency of a random walk to sidestep the hardest part of Bayes' rule, the normalizing constant.
🤔 Which of your everyday "estimate-a-number-from-experience" judgments are secretly distributions? If you treated them as distributions, would your decisions change?

Variational Inference (VI)

approximate inference · optimization · ELBO

One-line analogy

MCMC is slow and hard to scale—a million samples is a disaster on large datasets. Variational inference (VI) flips the approach: instead of sampling that complex posterior exactly, find a simple distribution (e.g. a Gaussian) that "fits" it. It's like using a precomputed materialized view / cache to approximate an expensive real-time aggregate—give up a little accuracy for a huge speedup. VI turns an "inference problem" into the optimization problem you know best: tune parameters so the approximate distribution hugs the true posterior.

What it solves + how it works

Let the true posterior be p(θ|D) (uncomputable). We introduce a simple distribution q_φ(θ) with parameters φ (say, a Gaussian with unknown mean and variance). Goal: tune φ so the "distance" between q and p is minimized. Distance is measured by KL divergence (Day 33: the information gap between two distributions).

But KL(q‖p) again hides that uncomputable posterior. With one algebraic rearrangement, minimizing the KL is equivalent to maximizing a quantity called the ELBO (Evidence Lower BOund):

ELBO = E_q[ log p(D|θ) ] − KL( q_φ(θ) ‖ p(θ) )
　　　　term 1: fit the data　　　　term 2: don't stray too far from the prior (regularizer)

Intuition: term 1 forces q to explain the observed data well (high likelihood); term 2 is a spring pulling back toward the prior, preventing overfitting. The two tug against each other—this is exactly the Bayesian version of "fit vs regularize", the same tension you saw on Day 10. It's a "lower bound" because the ELBO is always ≤ the true evidence log p(D), so pushing it up pushes q toward p.
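The "one algebraic rearrangement" can be spelled out as follows, using the same symbols as above. This is the standard identity, written as a sketch: expand the posterior with Bayes' rule and collect terms.

```latex
\begin{aligned}
\mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta\mid D)\big)
  &= \mathbb{E}_q\big[\log q_\phi(\theta) - \log p(\theta\mid D)\big] \\
  &= \mathbb{E}_q\big[\log q_\phi(\theta) - \log p(D\mid\theta) - \log p(\theta)\big] + \log p(D) \\
  &= -\underbrace{\Big(\mathbb{E}_q\big[\log p(D\mid\theta)\big]
       - \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big)\Big)}_{\text{ELBO}} + \log p(D)
\end{aligned}
```

Rearranged, log p(D) = ELBO + KL(q‖posterior). Since the KL term is never negative, the ELBO can never exceed log p(D); and since log p(D) does not depend on φ, every gain in the ELBO is an equal reduction in the KL to the true posterior.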

Crucially, the ELBO is differentiable, so you can run it straight through Adam. Rewrite the Gaussian's "sample" as "mean + std × standard noise" and gradients can flow back through the random sampling. This reparameterization trick is the heart of Kingma & Welling's VAE paper (2013), and the same building block behind VAEs and diffusion models. The automated version is called ADVI (automatic differentiation variational inference); PyMC exposes it directly, and NumPyro offers the same idea through its automatic guides.
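The trick can be checked numerically without any autodiff library. The sketch below estimates the gradient of E[z²] for z ~ N(μ, σ²) by pushing derivatives through z = μ + σ·ε, then compares against the analytic answer. The test function f(z) = z² and the parameter values are assumptions chosen because the true gradients (2μ and 2σ) are known.

```python
import random

rng = random.Random(0)
mu, sigma = 1.5, 0.5            # parameters of q = N(mu, sigma^2)
n_draws = 100_000

grad_mu_sum = 0.0
grad_sigma_sum = 0.0
for _ in range(n_draws):
    eps = rng.gauss(0.0, 1.0)       # noise, independent of mu and sigma
    z = mu + sigma * eps            # reparameterized sample
    df_dz = 2.0 * z                 # derivative of f(z) = z^2
    grad_mu_sum += df_dz * 1.0      # chain rule: dz/dmu = 1
    grad_sigma_sum += df_dz * eps   # chain rule: dz/dsigma = eps

grad_mu = grad_mu_sum / n_draws
grad_sigma = grad_sigma_sum / n_draws

# Analytic check: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

An autodiff framework does the same chain-rule bookkeeping automatically, which is what lets an optimizer such as Adam update μ and σ directly.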

Code example

import pymc as pm
import numpy as np

y = np.random.normal(5, 2, size=200)   # fake data: true mean 5

with pm.Model() as model:
    mu = pm.Normal("mu", 0, 10)        # prior: mean unknown
    sigma = pm.HalfNormal("sigma", 5)
    pm.Normal("obs", mu, sigma, observed=y) # likelihood

    # No MCMC sampling; fit with variational inference, often orders of magnitude faster on big data
    approx = pm.fit(20000, method="advi")   # maximize the ELBO
    idata = approx.sample(1000)            # draw from the fitted q

print(idata.posterior["mu"].mean().item())  # posterior mean of mu, should land near 5

Common pitfall + use case

"VI gives the same result as MCMC, just faster": not quite. VI has a systematic bias: using KL(q‖p) as the objective tends to make q underestimate uncertainty (it "shrinks into" a single mode of the true posterior, with too-small variance). So the speed advantage has a cost: its credible intervals are often too narrow, i.e. overconfident. Use MCMC when the problem is small and you need rigorous conclusions; use VI when the problem is large and you need to scale (e.g. deep models). Know this trade-off, and a deceptively narrow interval won't fool you.

📌 Super-individual scenario: many models you train with stochastic components (latent variables, and dropout under the MC Dropout interpretation) can be read as using VI's ideas under the hood. Understanding the ELBO lets you see why VAE / diffusion loss functions look the way they do (reconstruction term + KL regularizer): they are variants of the ELBO, and such papers stop reading like hieroglyphs.
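The variance underestimate described in the pitfall above can be seen in closed form for one textbook case. Assume the true posterior is a zero-mean 2-D Gaussian with unit variances and correlation `rho`, and q is a fully factorized (mean-field) Gaussian fitted by minimizing KL(q‖p). The optimal q then has per-coordinate variance equal to the inverse of the diagonal of the precision matrix. The function name and the chosen correlations are illustrative.

```python
def mean_field_variance(rho):
    """Variance of each factor of the KL(q||p)-optimal mean-field Gaussian
    when p is a 2-D Gaussian with covariance [[1, rho], [rho, 1]]."""
    # Inverting the covariance gives a precision matrix whose diagonal
    # entries are 1 / (1 - rho^2).
    precision_diag = 1.0 / (1.0 - rho ** 2)
    return 1.0 / precision_diag

true_marginal_var = 1.0
for rho in (0.0, 0.5, 0.9):
    print(f"rho={rho}: true variance {true_marginal_var}, "
          f"mean-field variance {mean_field_variance(rho):.2f}")

mf_var = mean_field_variance(0.9)
```

With no correlation the approximation is exact, but as the parameters become more correlated the fitted variance falls well below the true marginal variance, which is the overconfidence the pitfall warns about.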

Takeaway + question

💡 Variational inference swaps "sampling" for "optimization"—find a simple distribution to hug the complex posterior and solve Bayes with gradient descent, at the cost of underestimating uncertainty.
🤔 "Approximate a complex truth with a simple model to buy computability"—where else does this trade-off show up in your engineering experience (caches? dimensionality reduction? indexes?)

Uncertainty Quantification (UQ)

bayesian deep learning · reliability

One-line analogy

A vanilla neural net is forever brimming with confidence—feed it a garbage image it's never seen and it'll flatly declare "that's a cat, 99%." That's like a service that never returns an error code, always 200 OK, even when the backend crashed long ago. Uncertainty quantification (UQ) installs an honest "I don't know" signal—distinguishing "I've seen data like this, I'm confident" from "this is outside what I know, don't trust me."

What it solves + how it works

The key is distinguishing two fundamentally different kinds of uncertainty, the bedrock of the whole field:

Aleatoric uncertainty — noise inherent in the data, irreducible
　e.g. a coin flip, sensor jitter. More data won't help, still 50/50

Epistemic uncertainty — the model's ignorance, reducible
　e.g. a region the training set never covered. More data / training lowers it

Distributed-debugging analogy: aleatoric ≈ inherent network jitter (accept it);
epistemic ≈ a link I never monitored (just add monitoring)

How do you get an epistemic signal out of a neural net? Core idea: don't trust one model, look at the disagreement among "a crowd of models". Where training data is dense everyone agrees (low uncertainty); in unseen regions models guess wildly and clash (high uncertainty). The two most practical implementations:

MC Dropout (Gal & Ghahramani 2015): at inference keep dropout on and run the same input N times. Each pass drops different neurons → N different predictions, and their variance is the uncertainty. The paper shows this can be interpreted as a cheap form of approximate variational inference, so it adds Bayesian power to an existing model at near-zero extra training cost (inference costs N forward passes).
Deep Ensembles (Lakshminarayanan et al. 2016): independently train several models (5 is a common choice) with different random seeds, and look at their disagreement at prediction time. Crude, but empirically often the best-calibrated, and a strong industry baseline.
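For classifiers, the disagreement among ensemble members (or MC Dropout passes) can be separated from inherent noise with one common entropy-based decomposition: total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of each member, and the epistemic part is the difference. The function names and the toy probabilities below are assumptions for illustration.

```python
import math

def entropy(probs):
    # Shannon entropy in nats; zero-probability classes contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose(member_probs):
    """member_probs: one class-probability list per ensemble member."""
    n = len(member_probs)
    k = len(member_probs[0])
    mean_probs = [sum(m[c] for m in member_probs) / n for c in range(k)]
    total = entropy(mean_probs)                              # entropy of the averaged prediction
    aleatoric = sum(entropy(m) for m in member_probs) / n    # average per-member entropy
    epistemic = total - aleatoric                            # what is left is disagreement
    return total, aleatoric, epistemic

# Every member agrees the input is a coin flip: pure data noise
coin_flip = [[0.5, 0.5]] * 4
# Members are individually confident but contradict each other: model ignorance
clash = [[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]]

total_a, alea_a, epi_a = decompose(coin_flip)
total_b, alea_b, epi_b = decompose(clash)
print(f"coin flip: total={total_a:.3f} aleatoric={alea_a:.3f} epistemic={epi_a:.3f}")
print(f"clash:     total={total_b:.3f} aleatoric={alea_b:.3f} epistemic={epi_b:.3f}")
```

Both inputs have the same total uncertainty (the averaged prediction is 50/50 in each case), yet only the second one is dominated by the epistemic part, so only the second one would benefit from more training data.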

Code example

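import torch
import torch.nn as nn

# MC Dropout: keep dropout on at inference, run many times, read the disagreement
def predict_with_uncertainty(model, x, n=30):
    model.eval()                          # BatchNorm etc. stay in inference mode
    for m in model.modules():
        if isinstance(m, nn.Dropout):     # key: re-enable only the dropout layers
            m.train()                     # (extend the check if you use Dropout2d etc.)
    with torch.no_grad():                 # prediction only, no gradients needed
        preds = torch.stack([model(x) for _ in range(n)])  # N random forward passes
    mean = preds.mean(0)      # prediction
    std  = preds.std(0)       # uncertainty: high spread = model unsure
    return mean, std

# my_net and x_test are placeholders for your trained model and an input batch
mean, std = predict_with_uncertainty(my_net, x_test)
# high-std samples → model is "guessing", route to human review or abstain
mask = std.squeeze() > 0.3    # 0.3 is an example threshold; tune it on validation data
print(f"samples needing human review: {mask.sum().item()}")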

Common pitfall + use case

"A softmax output of 0.99 means the model is very certain"—badly wrong. Softmax probabilities are relative rankings between classes, not calibrated confidence. Modern neural nets are notoriously overconfident, handing 99% even to inputs they've never seen. True epistemic uncertainty must come from model disagreement (MC Dropout / Ensemble); you can't read it off softmax—treating softmax as confidence is the most common and most dangerous misjudgment when deploying AI.

📌 Super-individual scenario: when wiring up an AI workflow, add an "uncertainty gate" at each model node—high-uncertainty outputs auto-route to a human or trigger a second check, low-uncertainty ones flow through automation. This makes "human-AI collaboration" a quantifiable switch: full autonomy when the model is sure, your attention only when it isn't. UQ turns "how much should I trust the AI" from mysticism into a number.
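A minimal sketch of such a gate is shown below. The lane names, the threshold of 0.3, and the example batch are all hypothetical; in practice the uncertainty score would come from something like the MC Dropout spread, and the threshold would be tuned on held-out data.

```python
def route(prediction, uncertainty, threshold=0.3):
    """Send confident outputs down the automated lane, the rest to a person."""
    if uncertainty > threshold:
        return ("human_review", prediction)
    return ("auto", prediction)

# Hypothetical (prediction, uncertainty) pairs from an upstream model node
batch = [("invoice", 0.05), ("contract", 0.42), ("receipt", 0.12)]

decisions = [route(pred, unc) for pred, unc in batch]
needs_review = [pred for lane, pred in decisions if lane == "human_review"]
print(decisions)
print("needs review:", needs_review)
```

The gate itself is trivial; the value comes from feeding it an uncertainty score that reflects model disagreement rather than a raw softmax probability.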

Takeaway + question

💡 A vanilla model forever returns 200 OK; uncertainty quantification installs an honest "I don't know"—and insists on separating "data noise (accept it)" from "I've never seen this (fixable)".
🤔 If your AI assistant could reliably say "I'm not sure about this," how would your way of collaborating change? Is trust premised on it "knowing what it doesn't know"?

Probabilistic Programming Languages (PyMC / Stan)

tooling · declarative

One-line analogy

The math behind the previous three concepts is hard. A probabilistic programming language (PPL) packages it into a declarative language: you only "declare" the probabilistic model (what the prior is, how the data is generated), and the inference engine automatically runs MCMC or VI for you. This is exactly SQL's relationship to a database—you write "what you want," and the query planner decides "how to compute it." PyMC / Stan are the SQL of the probabilistic world, freeing you from hand-writing samplers.

What it solves + how it works

Before PPLs, every new model meant rewriting Metropolis acceptance rates, gradients, and convergence diagnostics—horribly error-prone. The core mechanism of a PPL is to decouple model definition from inference algorithm:

Declarative vs imperative

Imperative (by hand): write your own sampling loop, acceptance, gradients, diagnostics → hundreds of lines, error-prone

Declarative (PPL):
  ① write prior + ② write likelihood + ③ bind observed data
  ↓ engine auto-diff + NUTS/ADVI
  ④ get posterior + R-hat/ESS diagnostics

you just describe "how the world generates data"; inference is the engine's job

Main choices: Stan—the statistics community's gold standard, most rigorous, a standalone DSL; PyMC—pure Python, fastest to pick up; NumPyro—built on JAX, GPU-accelerated, fastest, great for big-model Bayes. They share the same engine core: auto-diff + NUTS sampling + variational inference, differing only in syntax and speed.

Code example

import pymc as pm
import numpy as np

# Bayesian linear regression: note we only "declare" the model, no sampling logic
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + np.random.normal(0, 1, 50)   # true slope 2.5

with pm.Model() as model:
    slope = pm.Normal("slope", 0, 5)     # ① prior
    inter = pm.Normal("inter", 0, 5)
    noise = pm.HalfNormal("noise", 2)
    # ② likelihood + ③ bind data
    pm.Normal("y", slope*x + inter, noise, observed=y)

    idata = pm.sample(1000)   # ④ engine auto-runs NUTS + diagnostics

# What you get is not one slope, but the slope's entire posterior distribution
print(pm.summary(idata))  # table with mean, 94% interval (HDI), r_hat, ess

Common pitfall + use case

"PPLs are a statistician's toy, useless for deep learning": outdated. NumPyro / Pyro are built with Bayesian deep learning in mind: you can set neural net weights as distributions and train with variational inference, getting a network that ships its own uncertainty. And being able to write a probabilistic model ≠ building the right one: pick an absurd prior or the wrong structure and the engine will still spit out a pile of beautiful but wrong posteriors. A PPL lowers the computational barrier, not the modeling-judgment barrier.

📌 Super-individual scenario: use PyMC for personal decision modeling—write your prior belief about something ("this project has maybe a 60% shot") as a distribution, and update the posterior with each new piece of evidence. It's a thinking discipline for turning Bayesian updating into executable code: it forces you to spell out "what I believed, how strong the new evidence is, what I believe now"—far more reliable than gut feel.
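Before reaching for PyMC, the same discipline can be sketched in plain Python with Bayes' rule in odds form: posterior odds = prior odds × likelihood ratio. The 60% prior comes from the example above; the likelihood ratios (how much more likely each piece of evidence is if the project succeeds than if it fails) are hypothetical numbers you would have to judge yourself.

```python
def update(prior_prob, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

belief = 0.60                      # "this project has maybe a 60% shot"
evidence = [3.0, 0.5, 2.0]         # hypothetical likelihood ratios, in order of arrival

history = [belief]
for lr in evidence:
    belief = update(belief, lr)    # what I believe now becomes the next prior
    history.append(belief)

print([round(b, 3) for b in history])
```

A ratio above 1 pushes the belief up and a ratio below 1 pulls it down, so the list records exactly "what I believed, how strong the new evidence was, what I believe now" at each step.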

Takeaway + question

💡 A probabilistic programming language is "the SQL of the Bayesian world"—you declare the model, the engine infers; it lowers the computational barrier, but the modeling judgment is still yours.
🤔 SQL let people who don't understand B+ trees query databases. PPLs let people who don't understand MCMC do Bayes—what's the next field this kind of "declarative abstraction" will upend?

Deep Questions

1. Frequentists give a point estimate + confidence interval; Bayesians give a posterior distribution. For an engineer making real decisions, what's the essential difference between these frameworks, and when should you use which?

The essential difference is what "probability" refers to. Frequentist: the parameter is a fixed true value, and probability describes "the fluctuation of data under repeated experiments"—the true meaning of a "95% confidence interval" is "run this procedure 100 times and about 95 of the intervals will cover the true value." It is not "the true value has a 95% chance of being in this interval" (a near-universal misreading). Bayesian: the parameter itself is a random variable, so a "94% interval" means literally that—"given the data and prior, the parameter is in here with 94% probability"—and you can take it straight to a decision. Engineering trade-off: (a) small sample + a prior → Bayesian, you can write domain knowledge into the prior and it's steadier on little data; (b) large sample + standardized reporting → frequentist, faster, the industry default, no arguing about priors; (c) making a downstream decision → Bayesian, the full posterior directly computes "the probability A beats B." For your super-individual personal decisions (naturally small samples, strong priors), Bayesian thinking usually fits better—it forces you to write beliefs down explicitly and update them with evidence.
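The frequentist reading of a 95% interval (the procedure covers the true value about 95% of the time) can be checked by simulation. This sketch assumes a normal population with a known standard deviation so that the textbook z-interval applies exactly; the parameter values and trial count are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)
true_mu, sigma, n, trials = 5.0, 2.0, 50, 2000

# 95% z-interval half width when sigma is known
half_width = 1.96 * sigma / n ** 0.5

covered = 0
for _ in range(trials):
    sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    # The interval is random; the true mean is fixed
    if xbar - half_width <= true_mu <= xbar + half_width:
        covered += 1

coverage = covered / trials
print(f"fraction of intervals covering the true mean: {coverage:.3f}")
```

The 95% describes the long-run behavior of the interval-building procedure across repeated experiments. It says nothing about any single interval, which is the gap a Bayesian posterior interval is designed to fill.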

2. MCMC is exact but slow; VI is fast but underestimates uncertainty. This "accuracy vs speed" trade-off—which trade-offs in distributed systems is it isomorphic to? Is there a "best of both worlds" middle path?

This is strong vs eventual consistency transplanted into the inference world. MCMC is like a strongly consistent read: asymptotically correct, but slow and hard to scale horizontally. VI is like a cached read: fast, scalable, but you may read a "too narrow" approximation (analogous to a slightly stale replica). And VI's bias is systematic (always underestimating variance), just as cache staleness is directional—predictable and compensable. There are middle paths: (a) normalizing flows—give VI a more flexible q distribution (beyond Gaussian), greatly boosting approximation power; (b) SVGD / particle variational—use a swarm of particles to balance sampling's flexibility with optimization's speed; (c) VI warm-up + MCMC refinement—use VI to quickly locate the posterior's rough position, then a few MCMC steps to refine, like "cache prefetch + strong-consistency verification on the critical path." Same engineering philosophy: locate fast with an approximation, then sharpen locally on demand.

3. "Epistemic uncertainty can be eliminated with more data; aleatoric cannot." What does this distinction directly guide in designing an AI system's "active learning" and "abstention" strategies?

This distinction is the theoretical core of active learning. Active learning asks "which samples are worth paying to label"—answer: pick the high-epistemic ones. High epistemic = the model hasn't seen this kind of data = labeling it genuinely removes ignorance; conversely high-aleatoric samples (inherently 50/50 noise) gain nothing from labeling. So "rank by total uncertainty and label" is wrong—you must first decompose out the epistemic part. Equally key for abstention / human-handoff: (a) high epistemic → "this is beyond my knowledge" → should abstain or route to a human, and it's worth feeding back into the training set to close the loop; (b) high aleatoric → "this is inherently uncertain" → abstaining is pointless (any model is the same), instead give the probability honestly and let the human decide carrying that uncertainty. Confusing the two makes a system either over-abstain (treating noise as ignorance) or be blindly confident (treating ignorance as noise)—a mature AI should behave differently for these two kinds of "I don't know."

4. The reparameterization trick lets gradients "flow through" random sampling. Why does this one small trick simultaneously ignite VAEs, diffusion models, and variational inference? What's its deeper meaning?

Deeper meaning: it reconciles "randomness" with "differentiability," the linchpin of the whole deep generative-model field. Background: neural nets train via backprop (gradients), but "sampling from a distribution" is itself non-differentiable; you can't differentiate "rolling a die." The trick's elegance: rewrite "sample from N(μ,σ²)" as "z = μ + σ·ε, where ε~N(0,1)." Randomness is outsourced to ε, which is independent of the parameters, while z becomes a deterministic, differentiable function of μ and σ, so gradients flow around ε and back to μ/σ. Why it ignites three fields: they're all fundamentally "learning a generative process with random latent variables." VAEs learn encode-decode, diffusion learns step-by-step denoising, Bayesian neural nets learn weight distributions, and all were stuck on "how to differentiate through sampling." The reparameterization breaks through at a stroke, letting them all train end-to-end with SGD / Adam. That's why Day 20's diffusion models, today's VI, and generative models share the same building block: one seemingly trivial algebraic rewrite decides whether a whole generation of generative AI can be trained by gradients, and it is the key interface by which "differentiable programming" swallows "probabilistic modeling."

5. The heart of Bayes is the "prior + evidence → posterior" update loop. If you view human cognition, scientific progress, even Buddhism's "dependent origination" as Bayesian updating, what does this lens illuminate—and what does it obscure?

What it illuminates: (a) cognition—"see the world through a prior, then update with evidence" precisely captures learning: an infant's prior is weak, an expert's is strong yet harder to shake with new evidence—exactly the Bayesian account of "expert bias" (a strong prior needs stronger evidence to flip). (b) philosophy of science—progress is the updating of a collective posterior, and a paradigm shift (Kuhn) is the "phase transition" where accumulated anomalous evidence finally overwhelms the old prior. (c) Eastern thought—Buddhism's "dependent origination" holds that all things arise and cease by conditions with no fixed self-nature, a fascinating echo of Bayes' "beliefs flow with evidence, nothing absolutely certain"; "attachment" can be likened to a degenerate state of a too-strong prior refusing to update. What it obscures: (i) Bayes assumes a stable hypothesis space, but real cognitive breakthroughs often create a new concept that wasn't in the hypothesis space—a "creation from nothing" the framework can't hold (you can't assign a prior to a hypothesis not yet conceived); (ii) it reduces cognition to information processing, possibly obscuring the embodied, emotional, intuitive—non-propositional ways of knowing; (iii) Buddhism's "awakening" points to an experience beyond conceptual distinction, while Bayes is a refined machine of conceptual distinction—using it to explain awakening may miss exactly what awakening seeks to dissolve. So this is a useful lens, but don't mistake the lens for the eye—which is itself very Bayesian: toward the prior "Bayes is the universal lens," you should keep an updatable humility.
