Meta Knowledge: Statistics & Probability

May 26, 2026 · Meta Knowledge
DAY 12
Statistical Inference · Computational Statistics · Data Bias · Causal Inference

Bayesian Inference

Probability = degree of belief, updated by evidence
CORE INSIGHT

In the Bayesian view, probability isn't an objective property of the world; it's how strongly you believe something. New evidence shouldn't erase your old belief; it should reweight it, with the updated belief proportional to "prior × likelihood." Ignoring the "base rate" (how common a thing is in the first place) is the systematic error nearly everyone makes when judging probabilities by gut, doctors, judges, and investors included.

BACKGROUND & MECHANISM

The idea is named after Thomas Bayes, an 18th-century English minister and mathematician. The formula itself is simple: prior (what you believed before) × likelihood (how well the new evidence fits each hypothesis), rescaled so the probabilities sum to 1, gives the posterior (your updated belief). The hard part is an easily forgotten principle: every probability is relative to "the information you already have," so there's no such thing as an information-free probability. Bayesian methods only became mainstream in the 1990s, once computation had matured enough to carry out the updates numerically.
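
Written out, the update rule described above takes the following standard form. The symbols H (a hypothesis) and E (the new evidence) are notation introduced here for illustration; they are not used elsewhere in this piece.

```latex
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)},
\qquad
P(E) \;=\; P(E \mid H)\,P(H) \;+\; P(E \mid \neg H)\,P(\neg H)
```

The numerator is "prior × likelihood"; the denominator P(E) is only there so that the posteriors for H and not-H add up to 1.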

▸ Rare-disease test: 1% prevalence, 95% accuracy, yet only 16% of positives are truly sick

| | Tests positive | Tests negative | Total |
|---|---|---|---|
| Actually sick | 95 | 5 | 100 |
| Actually healthy | 495 | 9405 | 9900 |
| Total | 590 | 9410 | 10000 |

Here "95% accurate" means the test is right for 95% of sick people and for 95% of healthy people. Of 590 positives, only 95 are truly sick → about 16%. The healthy population is so large that even a low false-positive rate produces far more false alarms than real cases.
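
The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, assuming "95% accuracy" means both sensitivity and specificity are 95% (which is what the table's counts imply); the function name `posterior_given_positive` is made up for illustration.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Return P(sick | positive test) using Bayes' rule."""
    true_positive = prevalence * sensitivity                    # sick and flagged
    false_positive = (1.0 - prevalence) * (1.0 - specificity)   # healthy but flagged
    return true_positive / (true_positive + false_positive)


# Same test, three different base rates (the 10% and 50% rows are
# hypothetical, added only to show how much the prior matters).
for prevalence in (0.01, 0.10, 0.50):
    p = posterior_given_positive(prevalence, sensitivity=0.95, specificity=0.95)
    print(f"prevalence {prevalence:.0%} -> P(sick | positive) = {p:.3f}")
# The first line prints 0.161, matching 95 / 590 from the table.
```

Changing only the prevalence while keeping the test fixed shows that the answer is driven by the base rate at least as much as by the test's accuracy.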
COUNTER-INTUITIVE EXAMPLE

The "rare-disease test" above was once posed to medical-school faculty and students: a disease with 1% prevalence, a test that's 95% accurate. What's the chance a person who tests positive is actually sick? Most blurted out 95%, but the correct answer is only about 16%, because the healthy population is so large that false alarms vastly outnumber real cases. Even doctors ignore the base rate. A related error in court (multiplying small probabilities together as if they were independent and treating the result as ironclad proof) has even produced wrongful convictions.

CROSS-DOMAIN TRANSFER

In machine learning, most ways of handling uncertainty are Bayesian in spirit; in cognitive science, one theory holds the brain itself is a prediction machine doing layered Bayesian inference; in medicine, "update the probability of disease after seeing symptoms" is the standard diagnostic mindset; in investing, people who "suddenly find religion" after a black swan are really just violently updating a prior.

BIGCAT APPLICATION + REFLECTION

"That person always procrastinates" is almost always a base-rate error: if most projects slip anyway (base rate 80%), the handful of times they slipped tells you almost nothing about them. Conversely, "this new feature will definitely take off" usually ignores the prior that "90% of launches in the industry are unremarkable." Build a habit: before looking at the evidence, ask "what was my prior belief? Does this evidence raise the probability 2× or 20×?" It blocks most hasty conclusions. Same in parenting: three wrong problems shouldn't update you to "she's bad at math." Against the hundreds of problems in a semester, that evidence barely moves the prior.

▸ Reflection: take your most recent "I knew it all along" judgment and write it in Bayesian form — what was your prior? Was the evidence really strong enough to support that conclusion?
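
The "2× or 20×" question from the application paragraph can be made concrete with the odds form of Bayes' rule, where the multiplier (the likelihood ratio) applies to the odds rather than to the probability itself. The sketch below assumes a 10% prior, taken from the "90% of launches are unremarkable" figure; `update_odds` is a hypothetical helper name.

```python
def update_odds(prior_prob, likelihood_ratio):
    """Posterior probability after evidence with the given likelihood ratio.

    likelihood_ratio = P(evidence | hypothesis) / P(evidence | not hypothesis)
    """
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


prior = 0.10  # assumed prior: 1 launch in 10 takes off
for ratio in (2, 20):
    print(f"evidence {ratio}x more likely under success -> "
          f"posterior {update_odds(prior, ratio):.3f}")
# prints 0.182 for the 2x evidence and 0.690 for the 20x evidence
```

The rare-disease test fits the same mold: its likelihood ratio is 0.95 / 0.05 = 19, and `update_odds(0.01, 19)` gives the same 16% as the table.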

Monte Carlo Methods

Can't compute it? Simulate it, over and over
CORE INSIGHT

When a problem can't be solved by formula but can be simulated, enough random sampling yields an approximation to any precision you want — turning "thinking" into "number of computations." This shattered the old notion that "complex = unsolvable": many things once seen as hard math problems are today just a question of compute.

BACKGROUND & MECHANISM

Its origin is vivid: in the 1940s the mathematician Stanislaw Ulam, playing solitaire while recovering from an illness, wanted to compute the odds of winning a certain layout, found the formula nearly impossible, but realized that simulating many hands and counting the wins was easy. The principle behind it is the "law of large numbers": turn a sum, average, or area problem into "sample a lot, then take the average." It later grew into a whole family of methods, all sharing one trait: using randomness to approximate a definite answer, with an error that shrinks as the number of samples grows.
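
The "simulate many hands and count" idea can be sketched with a simpler card question than solitaire. The example below is a stand-in chosen for illustration, not the historical problem: shuffle a 52-card deck and ask how often at least one card ends up in its original position. The exact answer is known to be close to 1 - 1/e ≈ 0.632, which makes the simulation easy to check.

```python
import random


def has_fixed_point(deck):
    """True if any card sits at the position matching its own label."""
    return any(card == position for position, card in enumerate(deck))


def estimate_match_probability(trials, seed=0):
    """Monte Carlo estimate: fraction of shuffles with at least one match."""
    rng = random.Random(seed)   # seeded so runs are repeatable
    deck = list(range(52))      # card labels 0..51
    hits = 0
    for _ in range(trials):
        rng.shuffle(deck)       # a fresh uniformly random ordering
        if has_fixed_point(deck):
            hits += 1
    return hits / trials


for trials in (100, 10_000, 100_000):
    print(trials, estimate_match_probability(trials))
```

Each estimate is just an average of 0/1 outcomes, so the law of large numbers is what pulls it toward the true probability as the number of trials grows.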

COUNTER-INTUITIVE EXAMPLE

Estimating π: scatter a huge number of random points in a square, count the fraction landing inside the inscribed circle, multiply by 4, and you get π. The error shrinks roughly as 1/√N: ten thousand points typically land within a few hundredths of π, a million within a few thousandths, with nothing beyond an area ratio and repeated "dice rolls." The same idea underpins pricing complex financial products, modern Bayesian computation, and the tens of thousands of simulated games behind each of AlphaGo's moves. Even every token a large model emits is, when sampling is enabled, essentially one random draw from a probability distribution.
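
The π estimate is short enough to write out in full. This is a minimal sketch using only the standard library; `estimate_pi` and the seed value are illustrative choices.

```python
import math
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the share of random points inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # Uniform point in the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # circle area / square area = pi / 4
    return 4.0 * inside / n_points


for n in (100, 10_000, 1_000_000):
    estimate = estimate_pi(n)
    print(f"{n:>9} points: {estimate:.5f} (error {abs(estimate - math.pi):.5f})")
```

Because the error falls only as 1/√N, each extra correct digit costs roughly 100 times more points. Monte Carlo is rarely the fastest route to π; its value is that the same recipe still works when no formula exists.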

CROSS-DOMAIN TRANSFER

In physics it handles many-particle problems; in finance it prices derivatives and estimates risk; in game AI it's the core of players like AlphaGo; in climate, perturbing initial conditions "samples" the many possible futures; in robotics it powers localization and tracking.

BIGCAT APPLICATION + REFLECTION

Facing a complex decision (change jobs? will the project ship on time?), the most common mistake is chasing "one precise number" instead of "sampling many possibilities." Sketching three or four scenarios to reassure yourself "60% likely" is far weaker than seriously imagining 20 concrete execution paths (including extreme cases) and seeing which kind of failure recurs. The "distribution" of outcomes usually matters far more than the "average." Parenting plans too — don't make a "perfect plan"; first simulate ten versions of "Wednesday night, the kid is sick, and you're still in a meeting at 8" — that's the real pressure point.
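
The "distribution versus average" point can be sketched for the "will the project ship on time?" question. Everything numeric below is hypothetical: four tasks with best-case, most-likely, and worst-case durations in days, a 16-day deadline, and a triangular distribution chosen only because it is easy to specify from three guesses.

```python
import random
import statistics

# Hypothetical tasks: (best case, most likely, worst case) in days.
TASKS = [(2, 3, 8), (4, 5, 15), (1, 2, 4), (3, 4, 12)]


def simulate_total_days(rng):
    """One simulated execution path: draw a duration per task and add them up."""
    return sum(rng.triangular(low, high, mode) for low, mode, high in TASKS)


def simulate_schedule(n_paths=10_000, deadline=16, seed=0):
    """Summarize many simulated paths instead of one point estimate."""
    rng = random.Random(seed)
    totals = sorted(simulate_total_days(rng) for _ in range(n_paths))
    return {
        "mean": statistics.fmean(totals),
        "p90": totals[int(0.9 * n_paths)],  # 90% of paths finish by this day
        "late_share": sum(t > deadline for t in totals) / n_paths,
    }


most_likely_total = sum(mode for _, mode, _ in TASKS)
print("sum of most-likely guesses:", most_likely_total)  # 14
print(simulate_schedule())
```

Adding up the most-likely guesses gives 14 days, which looks comfortable against a 16-day deadline, yet the long worst-case tails push most simulated paths past it. That gap is what a single "precise number" hides.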

▸ Reflection: for a recent important decision, could you "sample" 10 concrete paths instead of just computing an expected value? Which path unsettles you most? Have you prepared a contingency for it?

Survivorship Bias

Silent failures distort everything you see
CORE INSIGHT

When you only see the "samples that survived," your conclusions are almost guaranteed to be wrong. Failures are silent and can't speak up, so the "success factors" you think you see may just be random noise on the survivors, not real causes. The deadliest bias isn't in the data itself — it's in "what kind of sample even gets to be seen by you."

BACKGROUND & MECHANISM

The classic WWII case: the military wanted to add armor to returning bombers where the bullet holes clustered most densely. The statistician Abraham Wald pointed out that this was backwards: the parts to reinforce were exactly those with the fewest holes (like the engines), because planes hit there never made it back, so their holes never show up on the survivors. Any data that "counts only survivors" overestimates the upside and underestimates the risk.
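
The same mechanism can be reproduced with simulated investment funds that have no skill at all. The setup is entirely hypothetical: yearly returns are pure noise with mean zero, and a fund counts as "closed" if its value ever drops below 70% of where it started. Averaging only the funds that stayed open still produces a positive-looking track record.

```python
import random
import statistics


def simulate_funds(n_funds=5_000, years=10, sigma=0.15, close_below=0.7, seed=0):
    """Compare average yearly returns of all funds vs. survivors only."""
    rng = random.Random(seed)
    everyone, survivors = [], []
    for _ in range(n_funds):
        # Zero true skill: every yearly return is noise centered on 0.
        returns = [rng.gauss(0.0, sigma) for _ in range(years)]
        value, closed = 1.0, False
        for r in returns:
            value *= 1.0 + r
            if value < close_below:
                closed = True   # this fund would have dropped out of the data
        average = statistics.fmean(returns)
        everyone.append(average)
        if not closed:
            survivors.append(average)
    return {
        "all_funds": statistics.fmean(everyone),
        "survivors_only": statistics.fmean(survivors),
        "survival_share": len(survivors) / n_funds,
    }


print(simulate_funds())
```

The "all_funds" average sits near zero, as it must, while "survivors_only" is clearly positive even though no fund had any edge. The filter alone manufactures the apparent performance.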

COUNTER-INTUITIVE EXAMPLE

"Success" books conclude that wealthy CEOs share traits like "early rising, reading, grit" — but those traits may be just as common, or more so, among bankrupt CEOs; no one counts the failures. The "great companies" studied in such books often fade into mediocrity within a few years of publication — after-the-fact attribution, when you only look at survivors, is all just rationalized storytelling. Hedge funds' average returns look glossy partly because the losers shut down early and dropped out of the dataset.

CROSS-DOMAIN TRANSFER

In history, "old buildings were all sturdy" — really, the fragile ones collapsed long ago; in research, publishing only positive results makes the literature overly optimistic, a root cause of the "replication crisis"; in machine learning, training data is itself a filtered sample ("great offline metrics, crashes in production" often comes from this); in education, "elite-school graduates earn high salaries" ignores that "getting admitted" was already a powerful signal.

BIGCAT APPLICATION + REFLECTION

The trap engineers fall into most easily is studying only successful open-source projects, startups, and AI products and treating their methods as gospel. An anti-bias drill: for every success story you read, force yourself to find 3 cases that used a similar strategy and failed. If you can't, either the failures are silent or the "strategy" isn't actually identifiable — in both cases you shouldn't trust the attribution. Same in parenting: before following "so-and-so mom's parenting secrets," ask "where are the moms who used the same methods but whose kids didn't do well" — their silence doesn't mean they don't exist.

▸ Reflection: a "success path" you're imitating — can you list 3 cases that used the same method and failed? If you can't, are you following a pattern, or following noise?

Simpson's Paradox

Aggregation can flip the conclusion entirely
CORE INSIGHT

Overall, A beats B — yet split into groups, B can beat A in every single group. Aggregation can flip the conclusion outright. This means the direction of a correlation depends on whether you've controlled for the right "confounding variable." The data isn't lying — data without a causal judgment simply has no direction. This is the watershed between "describing data" and "understanding cause."

BACKGROUND & MECHANISM

The most famous case: graduate admissions at UC Berkeley in 1973 showed men admitted at a clearly higher rate than women overall, which looked like sex discrimination. But broken down department by department, women's admission rate was no lower than men's in almost every one. The truth: women applied disproportionately to competitive departments that were hard to get into. In other words, the hidden variable "which department you applied to" manufactured the overall illusion. Whether or not you control for a variable can yield opposite conclusions.

▸ Simpson reversal: women's admission rate is higher in every department, yet lower overall

| Dept. | Men applied / admitted | Men's rate | Women applied / admitted | Women's rate |
|---|---|---|---|---|
| A (easy) | 800 / 480 | 60% | 100 / 70 | 70% |
| B (hard) | 200 / 40 | 20% | 900 / 225 | 25% |
| Overall | 1000 / 520 | 52% | 1000 / 295 | 29.5% |

Women's rate is higher in every department, yet lower overall, because women concentrated their applications in the hard-to-enter department B (hidden variable = choice of department).
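
The reversal in the table can be checked mechanically. The sketch below uses exactly the table's counts; the dictionary layout and function names are illustrative.

```python
# (applied, admitted) per department and group, copied from the table.
applications = {
    "A (easy)": {"men": (800, 480), "women": (100, 70)},
    "B (hard)": {"men": (200, 40), "women": (900, 225)},
}


def rate(applied, admitted):
    return admitted / applied


def overall_rate(group):
    """Pool all departments first, then divide: this is the aggregate view."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admitted / applied


for name, groups in applications.items():
    print(f"{name}: men {rate(*groups['men']):.1%}, women {rate(*groups['women']):.1%}")
print(f"Overall: men {overall_rate('men'):.1%}, women {overall_rate('women'):.1%}")
```

The overall rate is a weighted average of the department rates, and the weights differ by group: 90% of women's applications go to department B versus 20% of men's. The differing weights, not the rates, produce the flip.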
COUNTER-INTUITIVE EXAMPLE

Two kidney-stone treatments, A and B: overall data show B has a higher success rate. But once you group by stone size, A is better for both large and small stones. The reason: doctors assigned the harder-to-treat large stones disproportionately to A, dragging down A's overall number. With the same data, the answer to "which treatment is better" depends on whether you're asking "the effect of the treatment itself" or "the average effect given doctors' assignment habits" — two different questions.
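
The two questions can be separated numerically. The text above gives no counts, so the figures below are an assumption: they are the numbers usually quoted for this example from the 1986 study it comes from. The "standardized" rate asks what each treatment's success rate would be if both had been given the same mix of small and large stones, which answers the first question only if stone size is the sole confounder.

```python
# (successes, patients) by treatment and stone size; assumed figures, see lead-in.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
SIZES = ("small", "large")


def crude_rate(treatment):
    """Success rate ignoring stone size (reflects doctors' assignment habits)."""
    successes = sum(outcomes[treatment][size][0] for size in SIZES)
    patients = sum(outcomes[treatment][size][1] for size in SIZES)
    return successes / patients


def standardized_rate(treatment):
    """Success rate if the treatment had faced the overall mix of stone sizes."""
    stratum_totals = {
        size: sum(outcomes[t][size][1] for t in outcomes) for size in SIZES
    }
    everyone = sum(stratum_totals.values())
    return sum(
        (outcomes[treatment][size][0] / outcomes[treatment][size][1])
        * stratum_totals[size] / everyone
        for size in SIZES
    )


for treatment in outcomes:
    print(f"{treatment}: crude {crude_rate(treatment):.1%}, "
          f"standardized {standardized_rate(treatment):.1%}")
```

On these figures B wins the crude comparison and A wins the standardized one, which is the same data answering two different questions.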

CROSS-DOMAIN TRANSFER

In A/B testing, this is the trap analysts most often fall into — differences between user groups mask the real effect; in algorithmic fairness, "higher accuracy in every subgroup yet lower overall" does happen; in epidemiology, vaccine effectiveness must be viewed stratified by age; in education, "school X has a high college-admission rate" may come entirely from the students it admits, not its teaching.

BIGCAT APPLICATION + REFLECTION

The biggest trap in data-driven decisions often isn't "no data"; it's looking only at aggregate data. Team productivity up 10% might be a new high-output member lifting the average while every veteran is actually declining. DAU up 20% might come from new users who are less active than the ones who churned: it looks like growth but is really dilution. For any key metric, immediately ask "does this still hold after I group it differently?" That habit is the fundamental difference between engineering judgment and chasing a "vanity metric." Same in parenting: "the kid did problems faster this month" could be real progress, or just easier problems this month.

▸ Reflection: a metric in your team, product, or family that recently "clearly improved/worsened" — does it still hold after grouping it 3 different ways? Which grouping flips the conclusion?