Mental Models: Explore vs Exploit

The Multi-Armed Bandit

Every choice is either gathering information or cashing in known value — that is explore vs exploit

In Depth

Picture a row of slot machines (a "multi-armed bandit"), each with a different, unknown payout rate. Every pull is the same fundamental choice: pull the one that currently looks best (exploit — cash in known value) or pull an uncertain one (explore — spend a turn to gather information). Almost every repeated choice in life is a bandit: which restaurant to eat at, which books to read, which direction to bet on.

Non-trivial: (1) The value of exploration lies not in the current turn but in the future: what you learn now sharpens every later choice. So the optimal amount of exploration depends on how many turns you have left, your time horizon. Many turns left → exploring more is rational; few left → harvest the known best. (2) This turns a seeming personality question ("do I like novelty?") into math: explore young, exploit late, not because you grow timid but because the horizon shrinks. (3) The yardstick is regret: the total payoff you lost compared with playing the best option from the start. A strategy that keeps picking worse options a fixed fraction of the time piles up regret linearly; good strategies make cumulative regret grow sublinearly (roughly logarithmically in the number of turns for the best-known algorithms), so the average loss per turn shrinks toward zero.
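The regret yardstick can be made concrete with a few lines of Python. This is only a minimal sketch: the payout rates and the two pull sequences are made-up illustrations, and the function name cumulative_regret is hypothetical. It measures expected regret (best mean minus chosen mean, summed over turns), which assumes we know the true rates, something a real player never does.

```python
def cumulative_regret(true_means, choices):
    """Running total of (best mean - chosen arm's mean) after each turn."""
    best = max(true_means)
    total = 0.0
    curve = []
    for arm in choices:
        total += best - true_means[arm]
        curve.append(total)
    return curve


# Hypothetical payout rates for three machines (illustrative numbers only).
means = [0.2, 0.5, 0.7]

# Pure exploit of an arm that merely looked fine: regret grows every turn.
always_middle = [1] * 10

# Try each arm once, then stay on the best one (assumes the single pulls
# happened to reveal the true ranking): regret stops growing after turn 3.
explore_then_commit = [0, 1, 2] + [2] * 7

print(cumulative_regret(means, always_middle)[-1])
print(cumulative_regret(means, explore_then_commit)[-1])
```

The first sequence loses a little on every turn forever; the second pays a one-time exploration cost and then stops losing, which is why a longer horizon makes early exploration cheaper in relative terms.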

Practice: before choosing, ask "how many more of this kind of choice will I make?" Long horizon → weight exploration higher; the occasional dud is tuition. Short horizon → stop trying new things and bet on the proven best.

Classic example

A clinical trial is the most agonizing bandit: assign more patients to the treatment that currently looks more effective (exploit — good for present patients), or assign some to test a new treatment (explore — good for future patients)? Each assignment gambles certain present good against greater future good — which is exactly why trial ethics are so hard.

BigCat scenario

Recommender systems and A/B tests are the engineering version: always serving only the highest-known click-through content (pure exploit) makes the system ossify and miss new hits; reserving some traffic to explore new content keeps it evolving. RLHF is the same — a model that only exploits known high-scoring answers never learns a better policy. Applied to a career: keep an annual exploration budget for unfamiliar fields; while the horizon is long, that's not distraction but the highest-compounding investment.

AI Prompt

English Prompt

In [context: career direction / product strategy / learning area] I keep facing the choice "keep deepening the known best vs try new options." My current options and their known performance: [list]. Using the explore–exploit framework, please: 1. Estimate how much "time horizon" I likely have left. 2. Judge whether I should now lean toward exploration or exploitation. 3. Give a concrete exploration-budget ratio and a next action.

Optimal Stopping (the Secretary Problem)

When options come one at a time with no going back, when do you stop and commit?

In Depth

Some choices must be made sequentially, decided on the spot, with no going back: renting an apartment, hiring, finding a partner. Look at too few and you fear missing something better later; look at too many and the good ones slip away. Math gives a startlingly clean answer — the 37% rule: observe the first 37% of candidates without picking, use them to set a bar, then take the first one that beats the bar.

Non-trivial: (1) This 37% (precisely 1/e ≈ 36.8%, in the limit of many candidates) is both how much to look at and the probability of landing the best: even the optimal strategy lands the top candidate only ~37% of the time, yet that is an unbeatable ceiling; the universe is that stingy. (2) The crucial precondition is no recall, no backtracking: you can't change your pick, nor recall someone you passed on. The rule also assumes candidates arrive in random order, that you know roughly how many there will be, and that only landing the single best counts as success. Change the conditions (returns allowed, repeated bids, "good enough" counting as a win) and the threshold shifts, so confirm your situation is truly irreversible first. (3) It precisely captures the "have I looked enough?" dilemma: phase one only calibrates judgment and never acts; phase two only honors the bar and never hesitates, splitting explore and exploit cleanly along the time axis.

Practice: for sequential, no-going-back choices, estimate the total count, treat roughly the first 37% (a little over a third) as look-only, noting the best as your bar, then commit to the first one that beats the bar. Stop fantasizing that "later might be better."
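To check where the 37% figure comes from, the sketch below computes the exact success probability of a "look at the first r, then take the first one that beats them" strategy and cross-checks it by simulation. It assumes the textbook setting: n candidates in random order, no recall, and success means picking the single best. The function names and the choice of n = 100 are illustrative, not part of any library.

```python
import random


def success_probability(n, cutoff):
    """Exact chance of picking the best of n when the first `cutoff` are look-only."""
    if cutoff == 0:
        return 1.0 / n  # no calibration phase: you simply take the first candidate
    # Standard result: P(r) = (r / n) * sum_{j=r}^{n-1} 1 / j
    return (cutoff / n) * sum(1.0 / j for j in range(cutoff, n))


def best_cutoff(n):
    """Look-only count that maximizes the exact success probability."""
    return max(range(n), key=lambda r: success_probability(n, r))


def simulate(n, cutoff, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        scores = list(range(n))
        rng.shuffle(scores)  # random arrival order; score n - 1 is the best
        bar = max(scores[:cutoff], default=-1)
        # Take the first later candidate above the bar; if none, you are stuck with the last.
        pick = next((s for s in scores[cutoff:] if s > bar), scores[-1])
        wins += pick == n - 1
    return wins / trials


n = 100
r = best_cutoff(n)
print(r, success_probability(n, r), simulate(n, r))
```

For n = 100 the best look-only count comes out at 37 with a success probability of about 0.37, so the fraction to observe and the chance of success coincide, as the text says. Changing the assumptions (recall allowed, "top 10% is good enough") would need a different formula.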

Split explore (set the bar) and exploit (commit) along the time axis

Classic example

The classic version is the "secretary problem": interview applicants one by one, decide on the spot after each, and a rejection is final. Intuition makes us look too long and miss strong early candidates; the 37% rule says use roughly the first 37% of applicants to set the bar, then commit to the first one who beats it. Apartment hunting is the same: the first few days of looking without committing build the judgment for later decisiveness.

BigCat scenario

Hiring for a key role is sequential: strong candidates won't wait forever. Rather than endlessly "seeing a few more," declare that roughly the first third (the 37% rule) only calibrates "what good looks like," then send an offer to the first one clearly above that bar. Tech selection is a variant: when evaluating architectures, scan a batch quickly to set a baseline, then lock in the first significantly better option, avoiding the infinite "is there something even better?" comparison.

AI Prompt

English Prompt

I must choose from a stream of "decide-on-the-spot, no going back" options: [describe: hiring / apartment / picking a design, and estimate the total number]. Using optimal stopping, please: 1. Confirm whether my situation is truly irreversible (does the 37% rule apply). 2. Compute how many I should observe to set the bar. 3. Give me one actionable criterion for "when to stop and commit immediately."

Optimism Under Uncertainty: UCB & Thompson Sampling

Smarter than gut feel: treat uncertainty itself as value worth betting on

In Depth

Is there a smarter bandit solution than gut feel? Two elegant algorithms. Upper Confidence Bound (UCB): give each option an "optimistic estimate" — its average performance plus a bonus that grows with uncertainty — then always pick the highest optimistic value. Thompson sampling: draw one sample from your probabilistic belief about each option, and pick whichever sample is highest.
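The "average plus uncertainty bonus" idea can be sketched with the classic UCB1 rule, where the bonus for an arm pulled n_i times by turn t is sqrt(2 ln t / n_i). This is one common variant, chosen here for illustration; the payout rates, the Bernoulli (win or lose) rewards, and the function name ucb1 are assumptions, not something the text prescribes.

```python
import math
import random


def ucb1(true_rates, turns, seed=0):
    """Play a Bernoulli bandit with UCB1 and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_rates)
    pulls = [0] * k
    wins = [0] * k
    for t in range(1, turns + 1):
        if t <= k:
            arm = t - 1  # pull every arm once so each has an average
        else:
            # Optimistic estimate: observed average plus a bonus that shrinks
            # as the arm accumulates pulls.
            arm = max(
                range(k),
                key=lambda i: wins[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return pulls


# Hypothetical payout rates; the player never sees these directly.
print(ucb1([0.2, 0.5, 0.8], turns=5000))
```

Weak arms still get revisited occasionally because their bonus slowly grows while they sit idle, but most pulls end up on the arm with the highest true rate.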

Non-trivial: (1) Both share a deep motto: "optimism in the face of uncertainty." Why is optimism right? Because uncertainty itself has value: try an unsure option and at worst you confirm it's bad and stop (bounded downside); at best you find a gold mine (unbounded upside). This asymmetry makes betting on uncertainty mathematically favorable. (2) UCB is explicit optimism: the more uncertain, the bigger the bonus, forcing you to try untested options; as you try them, uncertainty shrinks and the bonus fades. Thompson is implicit optimism: random sampling makes each option's chance of being selected equal to the probability, under your current belief, that it is actually the best, elegantly allocating exploration by likelihood. (3) This is Bayesian thinking in action: belief is the posterior distribution, and acting means sampling from it (Thompson) or taking its optimistic upper bound (UCB).
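Thompson sampling is short enough to sketch in full. The version below assumes win-or-lose (Bernoulli) rewards and a uniform Beta(1, 1) starting belief for every arm, which is a common textbook setup rather than the only option; the payout rates and the function name thompson are illustrative.

```python
import random


def thompson(true_rates, turns, seed=0):
    """Play a Bernoulli bandit with Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [0] * k
    losses = [0] * k
    pulls = [0] * k
    for _ in range(turns):
        # Belief about each arm is Beta(wins + 1, losses + 1); draw one sample from each.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
    return pulls


# Hypothetical payout rates; the player only ever sees wins and losses.
print(thompson([0.2, 0.5, 0.8], turns=2000))
```

An arm with little data has a wide belief, so its sample is sometimes very high and it gets tried; as evidence accumulates the belief narrows and a clearly worse arm is sampled highest less and less often.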

Practice: don't pick by "current average best" alone — give under-sampled, uncertain options an explicit bonus; when a new option might be great and you can't lose much, that's exactly when to bet optimistically.

Classic example

These two algorithms power modern A/B testing, ad serving, and news recommendation: rather than wasting traffic evenly on a known-bad variant, the system automatically concentrates exploration on the "maybe better but under-tested" variant — converging fast while not missing a dark horse.

BigCat scenario

A Go AI's Monte Carlo Tree Search uses exactly the UCB idea (UCT) — preferring to explore moves whose win rate is uncertain but promising, rather than only the currently best-looking move. Production recommenders heavily use Thompson sampling for online learning. Transfer it to research and career judgment: when betting across directions, don't only pick the proven safe ones — keep stakes on the "highly uncertain but with sky-high ceiling" directions, because the downside of being wrong is bounded and the upside of being right is unbounded.

AI Prompt

English Prompt

In [context] I have several options to bet on: [list each one's known performance + how sure I am]. Using "optimism in the face of uncertainty" (UCB / Thompson sampling), please: 1. Tag each option with its mean performance and an uncertainty bonus. 2. Identify which is most worth exploring now (high potential, under-tested). 3. Flag which options have limited downside and large upside, worth an optimistic bet.

ε-Greedy & Annealing

The simplest balance knob — and a never-closing slit of curiosity

In Depth

The simplest and most common balance knob is ε-greedy: most of the time (with probability 1−ε) pick the known best (greedily exploit), and with a small probability ε try something random (explore). Simple and robust, so it's everywhere.
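As a minimal sketch of that knob, assuming win-or-lose rewards and made-up payout rates (the function name epsilon_greedy and the default ε = 0.1 are illustrative choices):

```python
import random


def epsilon_greedy(true_rates, turns, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with fixed-epsilon greedy; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    pulls = [0] * k
    wins = [0] * k
    for _ in range(turns):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: any arm, chosen uniformly at random
        else:
            # Exploit: highest observed average (untried arms count as 0.0).
            arm = max(
                range(k),
                key=lambda i: wins[i] / pulls[i] if pulls[i] else 0.0,
            )
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        pulls[arm] += 1
    return pulls


# Hypothetical payout rates for three options.
print(epsilon_greedy([0.2, 0.5, 0.8], turns=5000))
```

Note that the exploration branch picks uniformly, so even the worst arm keeps receiving about ε / k of all pulls no matter how much evidence has piled up against it.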

Non-trivial: (1) A fixed ε has a flaw: even when you're already sure an option is terrible, you keep trying it forever at a fixed rate, which is pure waste. The elegant fix is annealing: decay ε over time, exploring boldly early, then converging to exploitation as evidence and confidence grow. That decay curve is the same idea as the cooling schedule in simulated annealing and the decay in learning-rate schedules. (2) Versus UCB/Thompson, ε-greedy explores indiscriminately, wasting precious tries on clearly bad options, whereas the smarter algorithms concentrate their exploration on the promising ones. So ε-greedy is a cheap baseline, trading a little inefficiency for utter simplicity. (3) It maps precisely onto a life rhythm: explore widely when young, then focus with experience. But the key is to never anneal ε to zero; keep a never-closing slit. The world changes, and fully stopping exploration is slow ossification.

Practice: set an explicit exploration budget (say 10–20% of time/resources) dedicated to trying new things; let it decay as the field matures, but never bring it to zero.
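One way to write down "decay, but never to zero" is an exponential schedule with a floor. The sketch below is only one possible shape; the starting rate, the floor, and the half-life are hypothetical parameters you would tune to your own situation.

```python
def annealed_epsilon(turn, start=0.5, floor=0.05, half_life=200):
    """Exploration rate that halves every `half_life` turns but never drops below `floor`."""
    return max(floor, start * 0.5 ** (turn / half_life))


# Print the exploration budget at a few points in time.
for turn in (0, 200, 400, 1000, 5000):
    print(turn, annealed_epsilon(turn))
```

The max(floor, ...) is the never-closing slit: once the decaying term falls under the floor, the exploration rate stays at that minimum instead of vanishing.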

Annealing: ε decays with experience, but keeps a never-closing slit of curiosity

Classic example

ε-greedy is standard in reinforcement-learning training: early on, a high ε lets the agent try widely to learn the environment; later, a lowered ε lets it steadily harvest the best policy it has learned — nearly every textbook agent grows up this way.

BigCat scenario

Deep learning's learning-rate and sampling-temperature schedules follow the same logic: take big exploratory steps across the parameter space first, then cool down to refine. Ported to time management, it's the famous "reserve 10% of time for exploratory projects" — a fixed budget hedging against ossification. Parenting and self-growth fit too: keep a space for your child (and yourself) to experiment without KPIs, while the bulk consolidates proven habits. The wisdom of annealing: converge with maturity, but never fully switch off curiosity.

AI Prompt

English Prompt

I want to set an explore–exploit budget for [area: learning / work / where I invest] and adjust it over time. My current state is [describe: novice vs mature, what the known best is]. Using ε-greedy with annealing, please: 1. Suggest how high my exploration rate ε should be now. 2. Give an annealing schedule that decays with experience. 3. Remind me how large to keep the "never-zero" minimum exploration slit.