Day 2 · 2026.05.22

Probability and Uncertainty

How to make slightly-less-wrong judgments in a world you can't quite see

"The theory of probabilities is at bottom nothing but common sense reduced to calculus." — Pierre-Simon Laplace, Essai philosophique sur les probabilités (1814)

Bayes' Theorem: How Beliefs Update

Inference / Probability

Inference

Intuition

You take a cancer screening test and it comes back positive. The prevalence of this cancer in your age group is 1%. The test has 90% sensitivity (catches the disease when it's there) and 90% specificity (correctly clears you when it isn't). What's the probability you actually have cancer? Most people blurt out 90%. The real answer is about 8.3%.

Picture 1,000 people just like you. About 10 truly have the disease, and the test flags 9 of them. Of the 990 healthy people, 10% are false positives — 99 of them. Total positives: 9 + 99 = 108. Truly sick among them: only 9. So $9/108 \approx 8.3\%$. That's Bayes' theorem in action — it fuses a prior belief (the base rate) with new evidence (the test result) into a posterior belief. In one sentence: new belief = old belief × how strongly this evidence supports it.

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$
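The screening arithmetic above can be checked in a few lines. This is a minimal sketch, not a clinical tool; the function name `posterior_given_positive` and its parameter names are made up for illustration, and $P(E)$ is expanded with the law of total probability.

```python
def posterior_given_positive(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Return P(disease | positive test) using Bayes' theorem."""
    # P(E | H) * P(H): sick people who test positive
    true_positive = sensitivity * prevalence
    # P(E | not H) * P(not H): healthy people who test positive anyway
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    # P(E): everyone who tests positive
    evidence = true_positive + false_positive
    return true_positive / evidence


if __name__ == "__main__":
    p = posterior_given_positive(prevalence=0.01, sensitivity=0.90, specificity=0.90)
    print(f"P(cancer | positive) = {p:.3f}")  # about 0.083, i.e. 9/108
```

Raising the prevalence to 0.10 with the same test pushes the posterior to 0.5, which shows how strongly the base rate drives the answer.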

Why it's beautiful

Bayes' formula lifts "rational updating" out of intuition and into algebra. It shows that ignoring the prior, the so-called base-rate neglect, is one of the most systematic errors human cognition makes. Kahneman devotes a good part of Thinking, Fast and Slow to documenting it. To the Bayesian, probability is not a property of the world; it is the strength of an observer's belief. That stance drove one of the great schisms of 20th-century statistics and its philosophy (frequentists vs Bayesians). E.T. Jaynes builds Probability Theory: The Logic of Science on the idea that probability theory is an extension of logic: it promotes Boolean algebra from {true, false} to a continuum of plausibilities on [0, 1].

Applications

Spam filtering (each word is evidence; the posterior "is this email spam?" gets updated word by word); medical diagnosis (reading a positive test result without the base rate is essentially guaranteed to mislead); Bayesian networks, MCMC, variational inference, Bayesian deep learning (entire branches of modern AI); courtroom reasoning (treating $P(\text{evidence} \mid \text{innocent})$ as if it were $P(\text{innocent} \mid \text{evidence})$ is now known as the prosecutor's fallacy; in the O.J. Simpson trial the defense made a mirror-image base-rate error, citing how few abusive husbands go on to kill their wives while ignoring that the wife in question had already been murdered).

History

Thomas Bayes (1701–1761) was an English Presbyterian minister and amateur mathematician. He never published his theorem; his friend Richard Price read it aloud to the Royal Society in 1763, two years after Bayes died. Pierre-Simon Laplace rediscovered and systematized it independently, and did most of the work of spreading it. In the early 20th century, R.A. Fisher and other frequentists pushed the Bayesian school into the wilderness for decades. It returned in force only in the 1980s and 1990s, thanks to cheap computing and MCMC algorithms, and has since become one of the dominant frameworks in statistics and AI.

Going deeper

• 3Blue1Brown — The medical test paradox
• E.T. Jaynes — Probability Theory: The Logic of Science (the Bayesian bible)

English Insight: prior — your belief before seeing the evidence; posterior — your belief after; likelihood — $P(E \mid H)$, the probability of the evidence given the hypothesis (constantly confused with $P(H \mid E)$, the other direction); base-rate neglect; prosecutor's fallacy — confusing those two conditional probabilities in court.

Something to chew on: If an AI classifier always predicts the most common class, its accuracy can look high — but is it a good classifier? Is that the same trap as "a positive cancer test usually means you're healthy"? Why is "accuracy" so misleading for rare-class problems?

Monte Carlo: Cracking Problems with Randomness

Computation / Simulation

Computation

Intuition

You want to compute $\pi$, but you don't know calculus. Draw a $2 \times 2$ square, inscribe a unit circle. The circle has area $\pi$, the square has area 4, so the ratio is $\pi/4$. Now close your eyes and throw beans at the square at random. After 100,000 throws, take the fraction that landed inside the circle, multiply by 4, and you have an approximation of $\pi$.

That's Monte Carlo: when a problem resists analytic solution, simulate it many times and let the law of large numbers compute the answer for you. It flips uncertainty into a tool. You stop fighting randomness and start using it as a hammer.
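The bean-throwing experiment translates almost word for word into code. The sketch below assumes a pseudo-random generator is a good enough stand-in for closed-eyes throwing; `estimate_pi` and the `seed` parameter are illustrative names, and the seed is there only to make runs repeatable.

```python
import random


def estimate_pi(n_throws: int, seed: int = 0) -> float:
    """Estimate pi by throwing points uniformly at the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_throws):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # The bean landed inside the unit circle
        if x * x + y * y <= 1.0:
            inside += 1
    # Fraction inside approximates pi / 4
    return 4.0 * inside / n_throws


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, estimate_pi(n))
```

With 100,000 throws the standard error is roughly 0.005, so the estimate is typically good to about two decimal places.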

Formally

To estimate $\mathbb{E}[f(X)]$, draw independent samples $X_1, X_2, \ldots, X_n$ from the distribution of $X$ and approximate with $\frac{1}{n}\sum_{i=1}^n f(X_i)$. The error decays as $O(1/\sqrt{n})$: slow, but at a rate that does not depend on dimension. That dimension-independent rate is the killer feature.
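As a rough illustration of the estimator and its $1/\sqrt{n}$ error, the sketch below averages $f(X_i)$ and reports the usual standard-error estimate (sample standard deviation divided by $\sqrt{n}$). The helper name `mc_expectation` and the choice of test problem, $\mathbb{E}[X^2] = 1/3$ for $X$ uniform on $[0, 1]$, are assumptions made for the example.

```python
import math
import random
from typing import Callable, Tuple


def mc_expectation(
    f: Callable[[float], float],
    sampler: Callable[[random.Random], float],
    n: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (estimate of E[f(X)], estimated standard error) from n samples."""
    rng = random.Random(seed)
    values = [f(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    # Unbiased sample variance of f(X)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    for n in (10_000, 40_000, 160_000):
        est, se = mc_expectation(lambda x: x * x, lambda rng: rng.random(), n)
        print(f"n={n:>7}  estimate={est:.4f}  std.err={se:.5f}")
```

Quadrupling `n` should roughly halve the reported standard error, which is the $O(1/\sqrt{n})$ behaviour in practice.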

Why it's beautiful

Classical numerical methods (grids, finite elements) cost $O(N^d)$ in dimension $d$. Three dimensions are fine; thirty dimensions are catastrophic. Monte Carlo's convergence rate depends only on the number of samples, so the curse of dimensionality largely passes it by (the variance of $f$ can still grow with dimension, but the $1/\sqrt{n}$ rate does not). That's why it's the go-to tool for pricing financial derivatives (dozens of asset dimensions), Bayesian inference (hundreds of parameters), and tree search in AlphaGo (an astronomical state space). The deepest beauty: Monte Carlo doesn't solve the problem; it sidesteps it. That maneuver, refusing to fight the battle on the enemy's terms, is one of the purest expressions of mathematical spirit.

Applications

Los Alamos weapons research: simulating neutron diffusion through fissile material, the birthplace of the method. Finance: option pricing, value-at-risk. Physics: the Ising model, lattice QCD. AI: AlphaGo's Monte Carlo Tree Search, policy evaluation in reinforcement learning. Film CGI: modern path tracing is essentially Monte Carlo integration in a very high-dimensional space, so Pixar is throwing beans for every frame.

History

In 1946 the physicist Stanislaw Ulam was recovering from an illness and playing solitaire to pass the time. He started wondering how to compute the probability of winning a hand. Rather than enumerate every possibility, he realized, he could just simulate many hands and take the frequency. He and John von Neumann applied the idea to neutron simulation for the hydrogen bomb. Nicholas Metropolis named it "Monte Carlo" because Ulam's uncle had a habit of gambling in Monaco. The 1953 Metropolis algorithm, generalized by W.K. Hastings in 1970 into what is now called Metropolis–Hastings, laid the foundation for modern MCMC.

Going deeper

• 3Blue1Brown — But what is the Central Limit Theorem?
• Christian Robert & George Casella — Monte Carlo Statistical Methods

English Insight: Monte Carlo method — usually left untranslated; importance sampling — a basic variance-reduction trick; rejection sampling — a basic way to draw samples from an awkward distribution; MCMC = Markov Chain Monte Carlo; curse of dimensionality — exponential cost in dimension, the thing Monte Carlo largely escapes.

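Something to chew on: 100,000 bean-throws gets you only about two decimal places of $\pi$, which looks terrible. But what if you need an integral over 30 dimensions? A grid with just 10 points per axis needs $10^{30}$ evaluations; Monte Carlo's error still shrinks like $1/\sqrt{n}$, so tens of thousands of samples can already give a usable answer for a well-behaved integrand. One caveat worth pondering: naive bean-throwing fails for the volume of a 30-dimensional ball, because the ball fills only about $2 \times 10^{-14}$ of its bounding cube and almost every bean misses. What intuition does that give you about high-dimensional space? Why is "high-dimensional" in some sense friendlier to randomized methods, and why does where you throw the beans start to matter?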

Expected Value: The Center of Mass of Randomness

Foundations of Probability

Expectation

Intuition

Roll a fair die. What do you "expect" to see? The answer is 3.5. But no die has a 3.5 face — you will never, on a single roll, get 3.5. So "expected value" is a slightly misleading phrase. It doesn't mean "what I expect to happen." It means "if I played this game ten thousand times, what would my average outcome be?"

A better metaphor: imagine the probability distribution as a thin wire, with a little ball hanging at each possible outcome, weighted by its probability. The expected value is the wire's center of mass — the single point where one finger would balance the entire distribution. This is why mathematicians sometimes call it the first moment — etymologically the same "moment" as "mass times distance" in mechanics.

$$\mathbb{E}[X] = \sum_i x_i \cdot P(X = x_i) \quad \text{or} \quad \int x \, f(x)\, dx$$
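For a discrete distribution the sum is short enough to compute exactly. The sketch below uses `fractions.Fraction` so the die's answer comes out as exactly 7/2 rather than a rounded float; `expected_value` and the dictionary layout (outcome mapped to probability) are illustrative choices, not anything prescribed by the text.

```python
from fractions import Fraction
from typing import Dict


def expected_value(distribution: Dict[int, Fraction]) -> Fraction:
    """Return sum of x * P(X = x) for a discrete distribution {x: P(X = x)}."""
    if sum(distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum((Fraction(x) * p for x, p in distribution.items()), Fraction(0))


if __name__ == "__main__":
    fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
    print(expected_value(fair_die))  # 7/2
```

The result is the balance point of the six equally weighted faces, a value no single roll can produce.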

Why it's beautiful

Expectation has one stunning property: linearity. $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$, even when $X$ and $Y$ are not independent — they can be arbitrarily correlated and the equation still holds. It is one of the few unconditional free lunches in probability theory. Countless hard problems collapse with one line of it: draw $n$ cards at random; what is the expected number of pairs? Define a 0/1 indicator variable for each pair, take its expectation, sum — no need to worry about whether the indicators are independent. This "decompose into indicators, then sum" trick is nearly unbeatable in combinatorial probability.
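To make the indicator trick concrete, here is one possible reading of the card example, stated as an assumption: draw $n$ cards from a standard 52-card deck and count unordered pairs of drawn cards that share a rank. Each of the $\binom{n}{2}$ pairs matches with probability $3/51$, so linearity gives $\binom{n}{2} \cdot 3/51$ with no independence argument. The simulation is only a sanity check, and both function names are hypothetical.

```python
import random
from itertools import combinations
from math import comb


def expected_pairs_exact(n: int) -> float:
    """Linearity of expectation: one 0/1 indicator per unordered pair of drawn cards."""
    return comb(n, 2) * 3 / 51


def simulate_pairs(n: int, trials: int, seed: int = 0) -> float:
    """Average number of same-rank pairs over many random n-card hands."""
    rng = random.Random(seed)
    deck = [rank for rank in range(13) for _ in range(4)]  # ranks only; suits don't matter here
    total = 0
    for _ in range(trials):
        hand = rng.sample(deck, n)
        total += sum(1 for a, b in combinations(hand, 2) if a == b)
    return total / trials


if __name__ == "__main__":
    print("exact:    ", expected_pairs_exact(5))
    print("simulated:", simulate_pairs(5, 20_000))
```

The indicators here are clearly dependent (three of a kind creates three pairs at once), yet the simple sum still matches the simulation.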

But expectation also has its traps. In 1713 Nicolas Bernoulli posed the St. Petersburg paradox, and in 1738 his cousin Daniel Bernoulli published the classic resolution: a game whose mathematical expectation is infinite, yet almost no one will pay even $1,000 to play. That contradiction forced the invention of utility theory: each extra dollar is worth less to you than the one before, and people dislike risk, so they do not simply maximize expected money. It's a seed of decision theory and, much later, of behavioral economics.

Applications

Insurance pricing: premium ≈ expected payout + operating cost + profit. The Kelly criterion: the optimal bet size in gambling or investing. Reinforcement learning's Bellman equation $V(s) = \mathbb{E}[r + \gamma V(s')]$ is just expectation written recursively. Nearly every theoretical framework that defines "rationality" as "maximizing expected utility" starts here.

History

In 1654 the French nobleman Chevalier de Méré asked Pascal a gambling question: "If two players are playing a series to a fixed score and have to stop early, how should the pot be divided?" Pascal and Fermat exchanged letters on this Problem of Points. That correspondence is widely regarded as the birth of modern probability theory, and expected value was their main tool. Christiaan Huygens' 1657 treatise De Ratiociniis in Ludo Aleae gave the first published formal treatment of expectation.

Going deeper

• Leonard Mlodinow — The Drunkard's Walk: How Randomness Rules Our Lives
• Steven Strogatz — The Joy of x, "Chances Are" chapter

English Insight: expectation / expected value / mean — used interchangeably; variance, standard deviation, moment; linearity of expectation — possibly the single most powerful one-liner in combinatorial probability.

Something to chew on: The St. Petersburg game: flip a coin until you see the first tails; if it appears on flip $n$, you win $2^n$ dollars. The expectation is $\sum_{n=1}^\infty 2^n \cdot 2^{-n} = \infty$. So how much would you pay to play? Most people answer somewhere between 10 and 30 dollars. The gap between "infinite expectation" and "what I'd pay" — what does it tell us about rationality? Is the mathematics wrong, or does the very concept of "rational" need rewriting?
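One way to get a feel for the divergence is to cap the game, an assumption added here for illustration: if no tails has appeared after `max_flips` flips, the game ends and pays nothing. Each allowed round then contributes exactly one dollar of expectation, so the capped game is worth exactly `max_flips` dollars. The function name is hypothetical.

```python
from fractions import Fraction


def capped_expectation(max_flips: int) -> Fraction:
    """Expected payout when the game is cut off (paying 0) after max_flips flips."""
    return sum(
        (Fraction(2) ** n * Fraction(1, 2) ** n for n in range(1, max_flips + 1)),
        Fraction(0),
    )


if __name__ == "__main__":
    for cap in (10, 20, 30):
        print(cap, capped_expectation(cap))
```

A cap of 30 flips already implies a top prize above a billion dollars, yet the expectation is only 30; the infinite figure comes entirely from ever larger, ever less likely payouts.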

Law of Large Numbers vs the Law of Small Numbers

Limit Theorems / Cognitive Bias

Limit Theorems

Intuition

The Law of Large Numbers (LLN): repeat a random experiment enough times, and the sample average converges to the expected value. Flip a fair coin a million times and the proportion of heads will be essentially 50%. This sounds obvious, but proving it rigorously is harder than it looks: the first proof covered only coin-flip-style trials, and the strongest versions had to wait for 20th-century measure theory.

The trouble is, humans have a strange cognitive bug: we treat small samples as if they already represent the whole. Tversky and Kahneman's 1971 paper called this the Law of Small Numbers. Flip a coin four times and get four heads, and many people feel "tails are due." They are not. The coin has no memory. That's the gambler's fallacy. The opposite belief: a basketball player hits five shots in a row and the crowd declares he has "the hot hand," long labeled the hot-hand fallacy. Gilovich, Vallone and Tversky argued in 1985 that this was an illusion, but Miller & Sanjurjo's 2018 reanalysis found that hot streaks are, in fact, sometimes real; the original analysis, it turned out, suffered from a subtle selection bias of its own. Human intuition is easy to fool, but so are the studies that critique it.

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \;\xrightarrow{P}\; \mu \quad (n \to \infty)$$
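The convergence is easy to watch in a simulation. The sketch below flips a simulated fair coin and records the running proportion of heads at a few checkpoints; `running_proportion_of_heads` and the checkpoint values are illustrative, and a seeded pseudo-random generator stands in for a physical coin.

```python
import random
from typing import Dict, Iterable


def running_proportion_of_heads(
    n_flips: int, checkpoints: Iterable[int], seed: int = 0
) -> Dict[int, float]:
    """Return {number of flips so far: proportion of heads} at each checkpoint."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    heads = 0
    result = {}
    for i in range(1, n_flips + 1):
        if rng.random() < 0.5:  # heads with probability 1/2
            heads += 1
        if i in wanted:
            result[i] = heads / i
    return result


if __name__ == "__main__":
    for n, share in running_proportion_of_heads(1_000_000, [10, 100, 10_000, 1_000_000]).items():
        print(f"{n:>9} flips: {share:.4f}")
```

Early checkpoints can sit far from 0.5, while late ones hug it; nothing "corrects" an early streak, it is simply diluted by later flips.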

Why it's beautiful — and why we keep tripping over it

LLN is the bridge between "probability" and "reality": without it, expectation would be an empty concept. Jacob Bernoulli proved it rigorously for the first time in his posthumously published Ars Conjectandi (1713). But LLN only kicks in when $n$ is genuinely large — and humans' sense of "large enough" is wildly out of sync with mathematics'. Five out of your five stock picks beat the market? That tells you nothing about your skill. An A/B test with only 100 visitors? Not trustworthy. A small-sample epidemiological study? Likely to flip on replication. Half the work of statistics is dragging "interesting small-sample observations" back to ground.

Applications / cautionary tales

Finance: a fund manager beats the market five years running. Skill or luck? Studies of mutual-fund performance, such as Fama and French's 2010 "Luck versus Skill" paper, suggest most of it is luck; survivorship bias means we only ever notice the winners. Medicine: the early COVID flood of small studies claiming "drug X works," many of which failed to replicate. AI: a model hits 95% on a 100-sample hold-out set, then craters in production, a case of distribution shift compounded by tiny-sample evaluation noise. Hiring, investing, evaluating a founder: judging someone from a single interview or a single business plan is, at root, fighting the law of large numbers with a sample size of one.
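The "95% on 100 samples" case can be quantified. The sketch below uses the Wilson score interval, which is one common choice for a binomial proportion and is not something the text itself prescribes; the function name and the default `z = 1.96` (roughly 95% coverage) are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for correct, total in ((95, 100), (9_500, 10_000)):
        low, high = wilson_interval(correct, total)
        print(f"{correct}/{total}: accuracy plausibly between {low:.3f} and {high:.3f}")
```

On 100 samples the interval runs from roughly 89% to 98%, so a measured 95% is compatible with a noticeably weaker model; on 10,000 samples the same point estimate is pinned down to under a percentage point.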

History

Jacob Bernoulli (1654–1705) gave the first rigorous proof of LLN and called it the "golden theorem" (theorema aureum). In paraphrase, he wrote that even the dullest person knows by instinct that more observations bring us closer to the truth, but that turning that instinct into a rigorous mathematical theorem is far harder than it seems; by his own account he had pondered it for twenty years. Tversky and Kahneman's 1971 paper Belief in the Law of Small Numbers formally diagnosed the cognitive bias and helped found behavioral economics. Nassim Taleb pushed it further: under fat-tailed distributions, the sample mean itself becomes unreliable, and a single black swan can blow up any previously "known" average.

Going deeper

• Daniel Kahneman — Thinking, Fast and Slow, Chapter 10 "The Law of Small Numbers"
• Nassim Taleb — Fooled by Randomness

English Insight: Law of Large Numbers (LLN); Central Limit Theorem (CLT); regression to the mean; survivorship bias; gambler's fallacy / hot-hand fallacy; statistical significance; fat-tailed distribution.

Something to chew on: If "small samples cannot be trusted," why do the singular anecdotes of the Bible, of Sima Qian's Records of the Grand Historian, of every ancient text, still shape entire worldviews generation after generation? Perhaps the human brain isn't built for the law of large numbers at all — it is built for "the representativeness of a single vivid story." What does that mean for what we should, and shouldn't, believe today?