Forecast quality isn't one dimension but two orthogonal abilities. Calibration: when you say "70%," events of that kind really happen 70% of the time — your probabilities tell the truth. Resolution (sharpness): you dare to leave the base rate and commit to decisive probabilities (90% or 10%) rather than timidly hugging 50%.
Non-trivial: (1) In practice the two pull against each other. A weatherman in a city where it rains on 30% of days who always reports "30% rain" is perfectly calibrated yet carries zero information value; conversely, shouting 95% everywhere to look decisive destroys calibration. (2) The real expert achieves both: decisive only when the evidence truly warrants, honestly parked in the uncertain zone otherwise. (3) A deep mathematical fact: mean squared forecast error (the Brier score) decomposes into a calibration (reliability) term, minus a resolution term, plus an uncertainty term that depends only on the base rate and that no forecaster can change. So a debrief shouldn't just ask "did I get it right" but separately "were my probabilities honest" and "did I dare to leave the middle." Two ailments, two cures: poor calibration is treated with feedback; poor resolution with domain knowledge.
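A minimal sketch of that decomposition, assuming forecasts are grouped by their exact stated probability. In its standard form the mean squared error equals reliability minus resolution plus uncertainty. The function name and the sample data are illustrative only. It contrasts the always-30% weatherman with a sharper forecaster on the same ten days:

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Return (brier, reliability, resolution, uncertainty).

    Forecasts are grouped by exact stated probability, so
    brier == reliability - resolution + uncertainty holds up to rounding.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0  # calibration term: gap between stated and realized rate
    resolution = 0.0   # how far each group's realized rate sits from the base rate
    for p, obs in groups.items():
        hit_rate = sum(obs) / len(obs)
        reliability += len(obs) * (p - hit_rate) ** 2
        resolution += len(obs) * (hit_rate - base_rate) ** 2
    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty

# Hypothetical data: ten days, rain on three of them
rain = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
always_30 = [0.3] * 10             # calibrated, but never leaves the base rate
sharp = [0.9] * 3 + [0.1] * 7      # commits, and happens to be right

print("always 30%:", brier_decomposition(always_30, rain))
print("sharp:     ", brier_decomposition(sharp, rain))
```

The first forecaster's reliability and resolution terms are both zero, so the score is pure uncertainty; the second pays a small calibration cost and recovers nearly all of it through resolution.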
Diagnostic: pull out everything you once called "80% sure" and check the realized rate. Far below 80% → overconfidence (a calibration ailment); never daring to say above 80% → insufficient resolution (you left information on the table).
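The diagnostic can be sketched in a few lines, assuming you have kept a list of (stated probability, outcome) pairs. The ledger below is made up for illustration:

```python
from collections import defaultdict

def calibration_table(ledger):
    """Map each stated probability to (number of forecasts, realized hit rate)."""
    buckets = defaultdict(list)
    for stated, happened in ledger:
        buckets[stated].append(happened)
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(buckets.items())}

# Hypothetical ledger: (stated probability, 1 if it happened else 0)
ledger = [(0.8, 1), (0.8, 0), (0.8, 1), (0.8, 0), (0.8, 1),
          (0.6, 1), (0.6, 0), (0.6, 1)]

table = calibration_table(ledger)
for stated, (count, realized) in table.items():
    print(f"said {stated:.0%}: {count} forecasts, {realized:.0%} came true")
```

With only a handful of forecasts per bucket the realized rate is noisy, so treat a gap as a signal only once the counts are reasonably large.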
Weather forecasting is one of humanity's best-calibrated fields. On days called "70% rain," it really does rain about 70% of the time over the long run, because forecasters get next-day ground truth and are forced to be honest. This reveals the source of calibration: high frequency plus immediate feedback. Most professions (economists, strategists) are badly calibrated precisely because their horizons are long, feedback is sparse, and no one keeps score.
"Model calibration" in machine learning is the same concept: a classifier outputting 0.9 confidence should be correct 90% of the time. But modern deep networks are systematically overconfident — output 0.99 yet right only 80% of the time, needing temperature scaling to recalibrate. Likewise, when an LLM sounds certain, its "linguistic confidence" and true accuracy often don't align. Turn it on yourself: tag every estimate in your project plan with a probability, then review hit rates each quarter — you'll find you're also an "uncalibrated network."
The Brier score compresses how good a probabilistic forecast was into one number: error = (your probability − outcome)², where outcome is 1 if it happened, 0 if not. Average over many forecasts; lower is better, 0 is godlike. It rewards both "being accurate" and "being honest about uncertainty."
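As a small worked example of the formula, with forecasts invented for illustration:

```python
def brier_score(forecasts):
    """Mean squared gap between stated probability and the 0/1 outcome."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# Hypothetical forecasts: (stated probability, 1 if it happened else 0)
forecasts = [(0.9, 1), (0.7, 1), (0.2, 0), (0.6, 0)]
always_fifty = [(0.5, outcome) for _, outcome in forecasts]

print(brier_score(forecasts))     # about 0.125
print(brier_score(always_fifty))  # 0.25, the score of permanent fence-sitting
```

The 0.25 from always saying 50% is a handy reference point: a score above it means the forecasts were worse than refusing to commit at all.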
Non-trivial: (1) It's a proper scoring rule: it can be proven that the only strategy that optimizes your expected score (here, minimizes it, since lower is better) is to report the probability you truly believe. Any misreport (inflating to look decisive, shrinking to play safe) worsens your long-run score. This is profound: the rule itself makes honesty optimal and ungameable. (2) The quadratic penalty is convex, so it grows faster than the size of the miss: swearing 95% and being wrong costs 0.9025, while honestly saying 60% and being wrong costs only 0.36. So it naturally suppresses overconfidence. (3) It replaces binary right/wrong grading, and binary grading is exactly what rewards the loud and punishes the nuanced, the root of why big voices win in public discourse.
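The honesty property can be checked numerically. This sketch assumes a forecaster whose true belief is 70% (an arbitrary choice) and searches a grid of possible reports for the one with the best expected score:

```python
def expected_brier(true_p, reported_q):
    """Expected squared error when the event occurs with probability true_p
    but the forecaster reports reported_q."""
    return true_p * (1 - reported_q) ** 2 + (1 - true_p) * reported_q ** 2

true_p = 0.7  # hypothetical true belief
grid = [i / 100 for i in range(101)]
best_report = min(grid, key=lambda q: expected_brier(true_p, q))

print(best_report)                   # the report with the lowest expected penalty
print(expected_brier(true_p, 0.7))   # honest report
print(expected_brier(true_p, 0.95))  # inflated to look decisive
print(expected_brier(true_p, 0.5))   # shrunk to play safe
```

Algebraically the expected penalty is (q − p)² + p(1 − p), so any distance between the report q and the belief p adds cost, in either direction.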
Practice: keep a "forecast ledger" — event, probability, deadline. After resolution, compute the Brier score and compare it to a dumb baseline (always report the base rate). If you can't beat the baseline, your "insight" is noise.
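One possible shape for such a ledger, as a sketch. The `Forecast` fields mirror the three columns named above plus an outcome; the events and numbers are invented. Note the baseline here uses the base rate of the resolved events themselves, which is hindsight and therefore a slightly generous baseline:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Forecast:
    event: str
    probability: float
    deadline: date
    outcome: Optional[int] = None  # 1 = happened, 0 = did not, None = unresolved

def brier(probabilities, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def score_against_baseline(ledger):
    """Return (my Brier score, Brier score of always reporting the base rate)."""
    resolved = [f for f in ledger if f.outcome is not None]
    outcomes = [f.outcome for f in resolved]
    base_rate = sum(outcomes) / len(outcomes)
    mine = brier([f.probability for f in resolved], outcomes)
    baseline = brier([base_rate] * len(outcomes), outcomes)
    return mine, baseline

# Hypothetical ledger entries
ledger = [
    Forecast("Vendor ships the promised update", 0.8, date(2024, 3, 31), 1),
    Forecast("Competitor cuts prices", 0.3, date(2024, 3, 31), 0),
    Forecast("Hire closes this quarter", 0.6, date(2024, 6, 30), 1),
    Forecast("Audit finds a major issue", 0.2, date(2024, 6, 30), 0),
    Forecast("Feature lands by year end", 0.7, date(2024, 12, 31)),  # unresolved
]

mine, baseline = score_against_baseline(ledger)
print(f"mine = {mine:.4f}, baseline = {baseline:.4f}")
```

If `mine` is not reliably below `baseline` over many resolved entries, the ledger is telling you the forecasts add nothing beyond the base rate.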
A large forecasting tournament used the Brier score to grade thousands of volunteers judging world events, and surfaced a small band of "superforecasters" whose scores beat professional analysts with access to classified intelligence. The point isn't that they were oracles — it's that this scoring leaves real skill nowhere to hide: long-run, cumulative, impossible to fake with one lucky call.
Log loss (cross-entropy), the standard loss function for training classifiers, is itself a proper scoring rule and a close cousin of Brier. It pushes a model not just to predict the class but to tune its confidence honestly, which is the mathematical reason models can in principle be "trained to be calibrated." Applied to personal decisions, it dovetails with the Decision Journal (Day 42): the journal records the probability and reasoning at the time, and the Brier score gives that journal a quantifiable report card. Do it for a year and your sense of your own judgment shifts from "feels fine" to "documented."
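To see how the two rules differ in temperament, this small sketch (probabilities chosen arbitrarily) scores the same confident-but-wrong forecasts under both:

```python
import math

def log_loss(p, outcome):
    """Negative log of the probability assigned to what actually happened."""
    return -math.log(p if outcome == 1 else 1 - p)

def brier(p, outcome):
    """Squared gap between the stated probability and the 0/1 outcome."""
    return (p - outcome) ** 2

# Each forecast says "it will happen" with probability p, and it does not happen
for p in (0.6, 0.95, 0.999):
    print(f"p = {p}: brier = {brier(p, 0):.4f}, log loss = {log_loss(p, 0):.4f}")
```

Brier's penalty is capped at 1, while log loss grows without bound as a wrong forecast approaches certainty, so log loss is the harsher judge of extreme overconfidence.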
This metaphor splits cognitive styles into two. The hedgehog holds one grand theory, crams everything into it, is confident, makes bold claims, and is a media darling. The fox is omnivorous, holds many competing small models at once, self-doubts, updates often, and is at peace with uncertainty. A landmark 20-year study found that foxes' long-run accuracy far exceeds hedgehogs' — and the more famous and telegenic the expert, the worse they tend to be.
Non-trivial: (1) The most counter-intuitive part is that fame correlates negatively with accuracy: being on TV demands simplicity, certainty, and drama, which are exactly hedgehog traits and exactly forecasting poison. (2) The fox's advantage is cleanly explained by machine learning: a fox is essentially an ensemble model. A random forest beats a single decision tree because averaging many noisy trees whose errors are only partly correlated cancels much of the variance. A fox runs several lenses in parallel and weights them, doing the same thing. (3) The hedgehog fails not from stupidity but because a strong prior refuses to be updated by data: every counter-example is explained away as "a temporary exception," so no amount of evidence can correct course.
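The variance-cancelling effect can be simulated under a strong simplifying assumption: each "lens" is an unbiased estimate with independent Gaussian noise. Real lenses share blind spots, so their errors are correlated and the gain is smaller. All parameters below are illustrative:

```python
import random
import statistics

def simulate(n_models, n_trials, noise_sd, seed=0):
    """Compare the spread of one noisy estimate with the spread of an
    average over n_models independent noisy estimates."""
    rng = random.Random(seed)
    truth = 0.3  # hypothetical true probability being estimated
    singles, ensembles = [], []
    for _ in range(n_trials):
        estimates = [truth + rng.gauss(0, noise_sd) for _ in range(n_models)]
        singles.append(estimates[0])                   # the hedgehog: one model
        ensembles.append(statistics.fmean(estimates))  # the fox: average of many
    return statistics.pstdev(singles), statistics.pstdev(ensembles)

single_sd, ensemble_sd = simulate(n_models=10, n_trials=2000, noise_sd=0.1)
print(f"single model spread:   {single_sd:.3f}")
print(f"ensemble of 10 spread: {ensemble_sd:.3f}")
```

With independent errors the spread shrinks roughly with the square root of the number of models, which is why adding a genuinely different lens helps more than adding a near-copy of one you already hold.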
Practice: whenever you notice one theory explaining everything in front of you, an alarm should ring — that's usually not depth of insight but you turning into a hedgehog. Forcing yourself to write down "if I'm wrong, where am I most likely wrong" installs the fox's second eye.
That long-running study showed famous experts who made bold claims about international politics were barely better than random guessing, some even worse; meanwhile obscure, fox-type analysts full of "on the one hand… on the other" came out ahead. The harsh lesson: in complex, open systems, decisive simplicity is a systematic bias, not a mark of skill.
The ensemble-learning analogy comes naturally to a technologist: don't worship "one elegant grand theory explaining a whole field" (e.g., "scaling laws explain everything in AI") — that's the hedgehog trap; robust judgment comes from weighting and ensembling many lenses. Same with parenting — don't convert to any single school (attachment, tiger, Montessori); take a bit from each and keep re-weighting based on the child's actual response, the fox-parent way. The more certain you are that one framework explains it all, the more you should suspect you're overfitting.
The inside view reasons from a case's specifics — "my project is special, three weeks will do." The outside view first finds a reference class (a set of similar cases) and looks at its base rate — "of the last 10 similar projects, how many finished on time?" The superforecaster's first commandment: anchor on the base rate, then adjust for this case's specifics, never starting from a blank slate.
Non-trivial: (1) We systematically ignore base rates because the inside story is vivid while the reference class is dull. This is the root of the planning fallacy: nearly everyone underestimates how long a renovation, a paper, or a project will take, because each time it feels "different this time." (2) This is Bayesian updating under another name: the base rate is the prior, the case-specific evidence enters through the likelihood, and the right move is to start from the prior and adjust by evidence strength. Starting from the inside view means discarding the prior and estimating from scratch, with absurdly high variance. (3) The key skill is choosing the reference class: too narrow ("a project exactly like mine") collapses back to the inside view; too broad loses information. A good reference class is similar on the structural features that decide success.
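The "anchor, then adjust" move can be written as Bayes' rule in odds form. In this sketch the base rate and the likelihood ratio are invented numbers; in practice the likelihood ratio is a judgment about how much more likely the case-specific evidence is if the project succeeds than if it does not:

```python
def update(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio
    (odds form of Bayes' rule)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

base_rate = 0.3         # hypothetical: 3 of the last 10 similar projects finished on time
likelihood_ratio = 2.0  # hypothetical: the evidence is twice as likely if this one will too

posterior = update(base_rate, likelihood_ratio)
print(f"outside view {base_rate:.0%} -> adjusted {posterior:.0%}")
```

Even fairly favorable evidence moves a 30% base rate only to about 46%, which is the arithmetic behind "adjust from the anchor" rather than "start from the story."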
Practice: before any estimate leaves your mouth, ask "what class of thing is this, and what's that class's track record." Write that number down as the anchor, then discuss what's special about yours.
Large infrastructure almost always runs late and over budget — because every project team uses the inside view to tell a "we'll manage better this time" story, ignoring the cold reference-class base rate. "Reference-class forecasting" was later written into some public-investment guidance: it mandates pulling the real overrun distribution of similar projects as the starting point, then adjusting. Just changing the starting point made forecasts dramatically more reliable.
Scheduling an AI feature, the team's inside view is "the spec is clear, two weeks is enough." Outside view: dig out the actual delivery of the last 10 sprints; the base rate might be "two-week claims took a median of five weeks" — that's your prior, then fine-tune for what's genuinely special this time. This dovetails with Bayesian Thinking (Day 7): the base rate is the prior. In distributed systems, estimating a node class's failure rate works the same — don't reason about "this one server" in a vacuum; look at the historical failure distribution of a whole fleet of the same model, a far more reliable starting point. The outside view isn't pessimism — it's putting the prior back where it belongs.
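A sketch of the sprint example, using a made-up history of ten deliveries. The median overrun ratio plays the role of the prior, and the range hints at how wide the distribution is:

```python
import statistics

# Hypothetical history: (estimated weeks, actual weeks) for the last 10 deliveries
history = [(2, 5), (2, 4), (1, 2), (3, 8), (2, 6),
           (2, 5), (1, 3), (2, 4), (4, 10), (2, 5)]

def overrun_ratios(history):
    """Actual duration divided by estimated duration, sorted ascending."""
    return sorted(actual / estimate for estimate, actual in history)

def outside_view(estimate_weeks, history):
    """Scale a fresh inside-view estimate by the median historical overrun."""
    return estimate_weeks * statistics.median(overrun_ratios(history))

ratios = overrun_ratios(history)
anchor = outside_view(2, history)
print(f"median overrun x{statistics.median(ratios):.2f}, "
      f"range x{ratios[0]:.2f} to x{ratios[-1]:.2f}")
print(f"inside view: 2 weeks, outside-view anchor: {anchor:.1f} weeks")
```

The anchor is a starting point, not the final answer: the adjustment for what is genuinely special about this feature happens after this number is written down.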