Mental Models: The Science of Superforecasting

Calibration vs Resolution

A good forecast has two independent virtues — don't chase only one

In Depth

Forecast quality isn't one dimension but two distinct abilities. Calibration: when you say "70%," events of that kind really happen 70% of the time, so your probabilities tell the truth. Resolution (sharpness): you dare to leave the base rate and commit to decisive probabilities (90% or 10%) when the evidence separates cases, rather than timidly hugging the base rate or 50%.

Non-trivial: (1) In practice the two pull against each other. A weatherman in a city where it rains on 30% of days who always reports "30% rain" is perfectly calibrated yet carries zero information value; conversely, shouting 95% everywhere to look decisive destroys calibration. (2) The real expert achieves both: decisive only when the evidence truly warrants, honestly parked in the uncertain zone otherwise. (3) A deep mathematical fact: the average squared forecast error (the Brier score) decomposes into a calibration term, minus a resolution term, plus an uncertainty term fixed by the base rate that no forecaster controls. Miscalibration adds to the error and resolution subtracts from it, so a debrief shouldn't just ask "did I get it right" but separately "were my probabilities honest" and "did I dare to leave the middle." Two ailments, two cures: poor calibration is treated with feedback; poor resolution with domain knowledge.
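
The decomposition can be checked numerically. The sketch below is illustrative only: it uses the standard form in which mean squared error equals reliability (miscalibration) minus resolution plus uncertainty, groups forecasts by their exact stated probability, and runs on made-up data for the "always 30%" weatherman and a sharper, equally calibrated colleague in the same city. The function and variable names are assumptions, not part of the original text.

```python
from collections import defaultdict


def brier(forecasts):
    """Mean squared gap between stated probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


def murphy_decomposition(forecasts):
    """Return (reliability, resolution, uncertainty) for (probability, outcome) pairs.

    Forecasts are grouped by their exact stated probability, so
    brier == reliability - resolution + uncertainty holds up to rounding.
    """
    n = len(forecasts)
    base_rate = sum(o for _, o in forecasts) / n
    groups = defaultdict(list)
    for p, o in forecasts:
        groups[p].append(o)
    # Reliability (miscalibration): gap between stated p and realized rate per group.
    reliability = sum(
        len(outs) * (p - sum(outs) / len(outs)) ** 2 for p, outs in groups.items()
    ) / n
    # Resolution: how far each group's realized rate sits from the overall base rate.
    resolution = sum(
        len(outs) * (sum(outs) / len(outs) - base_rate) ** 2 for outs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# Hypothetical data: the "always 30%" weatherman, 10 days with 3 rainy ones...
always_base = [(0.3, 1)] * 3 + [(0.3, 0)] * 7
# ...and a forecaster in the same 30%-rain city who commits to 0.9 or 0.1
# and is calibrated at both levels.
sharp = [(0.9, 1)] * 9 + [(0.9, 0)] * 1 + [(0.1, 1)] * 3 + [(0.1, 0)] * 27

for name, data in [("always_base", always_base), ("sharp", sharp)]:
    rel, res, unc = murphy_decomposition(data)
    print(name, round(brier(data), 4), round(rel, 4), round(res, 4), round(unc, 4))
```

Both forecasters show near-zero miscalibration; the whole difference in their error comes from the resolution term.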

Diagnostic: pull out everything you once called "80% sure" and check the realized rate. Far below 80% → overconfidence (a calibration ailment); never daring to say above 80% → insufficient resolution (you left information on the table).
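
A minimal sketch of that diagnostic, assuming the forecasts are stored as (probability, outcome) pairs with outcome 1 or 0. The function name, the ten equal-width buckets, and the sample data are illustrative choices.

```python
from collections import defaultdict


def calibration_table(forecasts, n_buckets=10):
    """Group (probability, outcome) pairs into equal-width probability buckets.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, mean_stated_probability, realized_rate, count).
    """
    buckets = defaultdict(list)
    for p, outcome in forecasts:
        # The small epsilon guards against float round-off at bucket edges;
        # min(...) keeps p == 1.0 inside the top bucket.
        idx = min(int(p * n_buckets + 1e-9), n_buckets - 1)
        buckets[idx].append((p, outcome))
    rows = []
    for idx in sorted(buckets):
        items = buckets[idx]
        mean_stated = sum(p for p, _ in items) / len(items)
        realized = sum(o for _, o in items) / len(items)
        rows.append((idx / n_buckets, (idx + 1) / n_buckets, mean_stated, realized, len(items)))
    return rows


# Hypothetical ledger: ten "80% sure" calls, of which six came true.
ledger = [(0.8, 1)] * 6 + [(0.8, 0)] * 4
for low, high, stated, realized, count in calibration_table(ledger):
    verdict = "overconfident" if realized < stated else "not overconfident"
    print(f"{low:.1f}-{high:.1f}: stated {stated:.2f}, realized {realized:.2f}, n={count}, {verdict}")
```

With only a handful of forecasts per bucket the realized rate is noisy, so a gap of this kind is a prompt to keep collecting data rather than a verdict.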

Reliability diagram: plot stated probability (x-axis) against realized frequency (y-axis). High-probability calls that keep failing pull the curve below the diagonal, which signals overconfidence.

Classic example

Weather forecasting is one of humanity's best-calibrated fields. On days called "70% rain," it really rains about 70% over the long run — because forecasters get next-day ground truth and are forced to be honest. This reveals the source of calibration: high frequency plus immediate feedback. Most professions (economists, strategists) are badly calibrated precisely because their horizons are long, feedback is sparse, and no one keeps score.

BigCat scenario

"Model calibration" in machine learning is the same concept: a classifier outputting 0.9 confidence should be correct 90% of the time. But modern deep networks are systematically overconfident — output 0.99 yet right only 80% of the time, needing temperature scaling to recalibrate. Likewise, when an LLM sounds certain, its "linguistic confidence" and true accuracy often don't align. Turn it on yourself: tag every estimate in your project plan with a probability, then review hit rates each quarter — you'll find you're also an "uncalibrated network."

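Temperature scaling can be sketched in a few lines: divide the logits by a single constant T before the softmax and pick the T that minimizes log loss on held-out data. The grid search, function names, and toy logits below are assumptions made for illustration; practical implementations usually fit T with a gradient-based optimizer.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    return -sum(
        math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)
    ) / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature from a coarse grid (0.5 to 5.0) that minimizes NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Hypothetical held-out set: the model is ~99% confident in class 0 every time,
# but class 0 is correct in only 8 of 10 cases.
val_logits = [[5.0, 0.0]] * 10
val_labels = [0] * 8 + [1] * 2

t = fit_temperature(val_logits, val_labels)
print("raw confidence:", round(softmax([5.0, 0.0])[0], 3))
print("fitted T:", round(t, 1), "recalibrated:", round(softmax([5.0, 0.0], t)[0], 3))
```

Because dividing by T never changes which class has the largest logit, this adjusts confidence without changing any predicted label.
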
AI Prompt

English Prompt

Here is a batch of my recent probabilistic forecasts about [domain/project]: [list "event + my probability + actual outcome"]. Please: 1. Assess my calibration — bucket forecasts by probability, compare each bucket's stated probability to its realized hit rate, and judge whether I'm overconfident or underconfident. 2. Assess my resolution — are most of my forecasts clustered in 40%–60%, afraid to commit? 3. Give one concrete action to improve calibration and one to improve resolution.

Brier Score

Score a probabilistic forecast with one number — and that number forces you to be honest

In Depth

The Brier score compresses how good a probabilistic forecast was into one number: error = (your probability − outcome)², where outcome is 1 if it happened, 0 if not. Average over many forecasts; lower is better: 0 is godlike, 1 is the worst possible, and always answering 50% earns exactly 0.25. It rewards both "being accurate" and "being honest about uncertainty."

Non-trivial: (1) It's a proper scoring rule: it can be proven that the only strategy that optimizes your expected score (here, minimizes the expected penalty) is to report the probability you truly believe. Any misreport (inflating to look decisive, shrinking to play safe) worsens your long-run score. This is profound: the rule itself makes honesty optimal and ungameable. (2) The penalty grows with the square of the miss, so confident errors cost disproportionately: swearing 95% and being wrong costs 0.9025, while honestly saying 60% and being wrong costs 0.36. So it naturally suppresses overconfidence. (3) It replaces binary right/wrong grading, and binary grading is exactly what rewards the loud and punishes the nuanced, the root of why big voices win in public discourse.
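
The "honesty is optimal" property is easy to check by brute force. The sketch below assumes an event whose true probability is 0.7 (an arbitrary illustrative value) and scans every reported probability from 0.00 to 1.00 for the one with the lowest expected penalty; the function name is hypothetical.

```python
def expected_brier(reported, true_p):
    """Expected squared-error penalty when the event occurs with probability true_p."""
    return true_p * (reported - 1) ** 2 + (1 - true_p) * reported ** 2


true_p = 0.7  # what you actually believe (illustrative)
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda r: expected_brier(r, true_p))

print("best report:", best)
print("honest 0.70 :", round(expected_brier(0.70, true_p), 4))
print("inflated 0.95:", round(expected_brier(0.95, true_p), 4))
print("hedged 0.50  :", round(expected_brier(0.50, true_p), 4))
```

Both inflating and hedging raise the expected penalty; the minimum sits at the reported value equal to the true belief.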

Practice: keep a "forecast ledger" — event, probability, deadline. After resolution, compute the Brier score and compare it to a dumb baseline (always report the base rate). If you can't beat the baseline, your "insight" is noise.
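
A minimal version of that ledger check, assuming resolved entries are stored as (probability, outcome) pairs. The ledger contents are invented, and the skill figure is the usual "1 minus my score over the baseline score" ratio; note that estimating the base rate from the same ledger flatters the baseline slightly.

```python
def brier_score(pairs):
    """Average of (probability - outcome)^2 over resolved forecasts."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def compare_to_base_rate(ledger, base_rate=None):
    """Return (my_score, baseline_score, skill).

    If base_rate is None it is estimated from the ledger's own outcomes.
    skill > 0 means the forecasts beat "always report the base rate";
    skill is None when the baseline is already perfect.
    """
    outcomes = [o for _, o in ledger]
    if base_rate is None:
        base_rate = sum(outcomes) / len(outcomes)
    mine = brier_score(ledger)
    baseline = brier_score([(base_rate, o) for o in outcomes])
    skill = None if baseline == 0 else 1 - mine / baseline
    return mine, baseline, skill


# Hypothetical resolved ledger: (stated probability, outcome).
ledger = [(0.9, 1), (0.8, 1), (0.7, 0), (0.3, 0), (0.2, 0), (0.6, 1)]
mine, baseline, skill = compare_to_base_rate(ledger)
print(f"mine={mine:.3f} baseline={baseline:.3f} skill={skill:.2f}")
```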

Classic example

A large forecasting tournament used the Brier score to grade thousands of volunteers judging world events, and surfaced a small band of "superforecasters" whose scores beat professional analysts with access to classified intelligence. The point isn't that they were oracles — it's that this scoring leaves real skill nowhere to hide: long-run, cumulative, impossible to fake with one lucky call.

BigCat scenario

Log loss (cross-entropy), the loss function for training classifiers, is itself a proper scoring rule, a close relative of Brier that punishes confident errors even more harshly. It forces a model not just to predict the class but to tune its confidence honestly, which is the mathematical reason models can be "trained to be calibrated." Applied to personal decisions, it dovetails with the Decision Journal (Day 42): the journal records the probability and reasoning at the time, and the Brier score gives that journal a quantifiable report card. Do it for a year and your sense of your own judgment shifts from "feels fine" to "documented."

AI Prompt

English Prompt

I'm forecasting for [decision/project] and want to evaluate with the Brier score. Here are my forecasts: [event + probability + known outcome]. Please: 1. Compute my average Brier score. 2. Compare it against a dumb baseline that always reports the base rate. Did I beat it? 3. Identify the "confidently wrong" forecasts (high probability that failed), how much penalty they contributed, and how I should rein in my confidence next time.

Fox vs Hedgehog

"The fox knows many things; the hedgehog knows one big thing." And the fox forecasts better

In Depth

This metaphor splits cognitive styles into two. The hedgehog holds one grand theory, crams everything into it, is confident, makes bold claims, and is a media darling. The fox is omnivorous, holds many competing small models at once, self-doubts, updates often, and is at peace with uncertainty. A landmark 20-year study found that foxes' long-run accuracy far exceeds hedgehogs' — and the more famous and telegenic the expert, the worse they tend to be.

Non-trivial: (1) The most counter-intuitive part is that fame correlates negatively with accuracy: being on TV demands simplicity, certainty, and drama, which are exactly hedgehog traits and exactly forecasting poison. (2) The fox's advantage is cleanly explained by machine learning: a fox is essentially an ensemble model. A random forest beats a single decision tree because averaging many high-variance trees whose errors are only partly correlated cancels much of that variance. A fox runs several lenses in parallel and weights them, doing the same thing. (3) The hedgehog fails not from stupidity but because a strong prior refuses to be updated by data: every counter-example is explained away as "a temporary exception," so no amount of evidence can correct course.
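
The variance-cancelling effect behind the ensemble analogy can be seen in a toy simulation. The sketch assumes each "lens" is an unbiased estimate of the true probability with independent Gaussian noise, which is the best case: real lenses share blind spots, so their errors are correlated and the gain is smaller, and averaging does nothing for a bias they all share. All parameter values are arbitrary.

```python
import random
import statistics


def mse_single_vs_ensemble(truth=0.3, noise=0.2, k=10, trials=5000, seed=0):
    """Compare squared error of one noisy lens with the average of k such lenses."""
    rng = random.Random(seed)
    single_sq, ensemble_sq = [], []
    for _ in range(trials):
        # k independent, unbiased, noisy estimates of the same quantity.
        estimates = [truth + rng.gauss(0, noise) for _ in range(k)]
        single_sq.append((estimates[0] - truth) ** 2)
        ensemble_sq.append((statistics.fmean(estimates) - truth) ** 2)
    return statistics.fmean(single_sq), statistics.fmean(ensemble_sq)


single, ensemble = mse_single_vs_ensemble()
print(f"single lens MSE: {single:.4f}, ensemble of 10 MSE: {ensemble:.4f}")
```

Under these assumptions the ensemble's error shrinks roughly in proportion to the number of lenses.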

Practice: whenever you notice one theory explaining everything in front of you, an alarm should ring — that's usually not depth of insight but you turning into a hedgehog. Forcing yourself to write down "if I'm wrong, where am I most likely wrong" installs the fox's second eye.

Classic example

That long-running study showed famous experts who made bold claims about international politics were barely better than random guessing, some even worse; meanwhile obscure, fox-type analysts full of "on the one hand… on the other" came out ahead. The harsh lesson: in complex, open systems, decisive simplicity is a systematic bias, not a mark of skill.

BigCat scenario

The ensemble-learning analogy comes naturally to a technologist: don't worship "one elegant grand theory explaining a whole field" (e.g., "scaling laws explain everything in AI") — that's the hedgehog trap; robust judgment comes from weighting and ensembling many lenses. Same with parenting — don't convert to any single school (attachment, tiger, Montessori); take a bit from each and keep re-weighting based on the child's actual response, the fox-parent way. The more certain you are that one framework explains it all, the more you should suspect you're overfitting.

AI Prompt

English Prompt

My judgment on [issue] currently rests mainly on this core theory/framework: [describe]. Run a "foxification" stress test: 1. Surface 3 different, even conflicting, explanatory lenses and how each would forecast the outcome. 2. Point out whether I'm explaining away counter-examples as "exceptions" (the classic hedgehog symptom). 3. Give a combined judgment that weighs and ensembles these lenses, rather than a single confident claim.

Outside View & Base Rate

First ask "how do cases like this usually go," then "what's special about mine" — never reverse the order

In Depth

The inside view reasons from a case's specifics — "my project is special, three weeks will do." The outside view first finds a reference class (a set of similar cases) and looks at its base rate — "of the last 10 similar projects, how many finished on time?" The superforecaster's first commandment: anchor on the base rate, then adjust for this case's specifics, never starting from a blank slate.

Non-trivial: (1) We systematically ignore base rates because the inside story is vivid while the reference class is dull. This is the root of the planning fallacy: nearly everyone underestimates how long a renovation, a paper, or a project will take, because each time it feels "different this time." (2) This maps directly onto Bayesian updating: the base rate is the prior, the case-specific evidence enters through the likelihood, and the right move is to start from the prior and adjust in proportion to evidence strength. Starting from the inside view = discarding the prior and estimating from scratch, with absurdly high variance. (3) The key skill is choosing the reference class: too narrow ("a project exactly like mine") collapses back to the inside view; too broad loses information. A good reference class is similar on the structural features that decide success.
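
The "prior first, then adjust" move is easiest to write in odds form: posterior odds equal prior odds times a likelihood ratio. The numbers below are invented for illustration (a 20% on-time base rate, and evidence judged three times as likely for projects that finish on time as for those that slip), and the function name is hypothetical.

```python
def update(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


base_rate = 0.20        # outside view: share of similar projects finishing on time
evidence_strength = 3   # how much likelier this evidence is if on time vs. late

print(round(update(base_rate, evidence_strength), 3))  # adjusted estimate
print(round(update(base_rate, 1), 3))                   # uninformative evidence: prior unchanged
```

Even fairly strong case-specific evidence moves a 20% prior to well under 50%, which is the discipline the inside view skips.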

Practice: before any estimate leaves your mouth, ask "what class of thing is this, and what's that class's track record." Write that number down as the anchor, then discuss what's special about yours.

Classic example

Large infrastructure almost always runs late and over budget — because every project team uses the inside view to tell a "we'll manage better this time" story, ignoring the cold reference-class base rate. "Reference-class forecasting" was later written into some public-investment guidance: it mandates pulling the real overrun distribution of similar projects as the starting point, then adjusting. Just changing the starting point made forecasts dramatically more reliable.

BigCat scenario

Scheduling an AI feature, the team's inside view is "the spec is clear, two weeks is enough." Outside view: dig out the actual delivery of the last 10 sprints; the base rate might be "two-week claims took a median of five weeks" — that's your prior, then fine-tune for what's genuinely special this time. This dovetails with Bayesian Thinking (Day 7): the base rate is the prior. In distributed systems, estimating a node class's failure rate works the same — don't reason about "this one server" in a vacuum; look at the historical failure distribution of a whole fleet of the same model, a far more reliable starting point. The outside view isn't pessimism — it's putting the prior back where it belongs.
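
The sprint example reduces to a small calculation: take past (estimated, actual) durations as the reference class, compute the typical overrun ratio, and scale the new inside-view estimate by it. The history below is made up to match the "two weeks claimed, five weeks median" illustration, and the function name is an assumption.

```python
import statistics


def outside_view_estimate(history, inside_view_estimate):
    """Scale an inside-view estimate by the median overrun ratio of past work.

    history: list of (estimated_weeks, actual_weeks) from the reference class.
    Returns (median_ratio, adjusted_estimate).
    """
    ratios = [actual / estimated for estimated, actual in history]
    median_ratio = statistics.median(ratios)
    return median_ratio, inside_view_estimate * median_ratio


# Hypothetical last 10 sprints, each originally estimated at 2 weeks.
history = [(2, 3), (2, 4), (2, 5), (2, 5), (2, 6), (2, 4), (2, 5), (2, 7), (2, 5), (2, 6)]

ratio, anchor = outside_view_estimate(history, inside_view_estimate=2)
print(f"median overrun x{ratio}, anchor = {anchor} weeks")
```

The result is the prior to start from, not the final answer; case-specific adjustments come afterwards and should be argued explicitly.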

AI Prompt

English Prompt

I need to forecast/estimate [specific event: outcome, duration, or cost]. My inside-view take is [my gut estimate + reasoning]. Help me switch to the outside view: 1. Propose 2–3 suitable reference classes and explain the relevant similarity of each. 2. Give the historical base rate of each as a prior starting point. 3. Starting from the base rate, make a disciplined adjustment for this case's genuine specifics, give a final probability/range, and warn me if I'm sliding back into the inside view.