The Multi-Armed Bandit: a row of slot machines with unknown payout rates frames every repeated choice. Pull the arm that currently looks best (exploit: cash in known value) or pull an uncertain one (explore: spend a turn to gather information). The value of exploration lies not in the current turn but in the future: what you learn sharpens every later choice, so the optimal amount of exploration depends on your time horizon. With a long horizon, explore more; with a short one, harvest the known best. This turns a seeming personality trait ("do I like novelty?") into math: explore young, exploit late, not because you grow timid but because the horizon shrinks. The yardstick is regret: the total reward you gave up compared with having pulled the best arm from the start.
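To make regret and the horizon effect concrete, here is a minimal sketch. It assumes Bernoulli (win/lose) arms and a simple "explore-then-commit" strategy, which is only one of many possible strategies; the function names and payout rates are hypothetical, chosen for illustration.

```python
import random


def pseudo_regret(true_means, choices):
    """Expected reward given up versus pulling the best arm on every turn."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in choices)


def explore_then_commit(true_means, horizon, pulls_per_arm, rng):
    """Sample each arm `pulls_per_arm` times, then commit to the best-looking one.

    Hypothetical helper for illustration; rewards are Bernoulli(true_means[arm]).
    Returns the list of arms pulled, one entry per turn.
    """
    if pulls_per_arm < 1:
        raise ValueError("pulls_per_arm must be at least 1")
    n_arms = len(true_means)
    explore_turns = pulls_per_arm * n_arms
    wins = [0] * n_arms
    counts = [0] * n_arms
    committed = None
    choices = []
    for t in range(horizon):
        if t < explore_turns:
            arm = t % n_arms  # explore: round-robin over all arms
        else:
            if committed is None:
                # Commit once, to the arm with the best observed win rate.
                committed = max(range(n_arms), key=lambda a: wins[a] / counts[a])
            arm = committed  # exploit
        if rng.random() < true_means[arm]:
            wins[arm] += 1
        counts[arm] += 1
        choices.append(arm)
    return choices


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.45, 0.55]  # made-up payout rates
    for horizon in (50, 5000):
        for pulls in (2, 50):
            runs = [
                pseudo_regret(means, explore_then_commit(means, horizon, pulls, rng))
                for _ in range(100)
            ]
            print(f"horizon={horizon} pulls_per_arm={pulls} "
                  f"avg_regret={sum(runs) / len(runs):.1f}")
```

Comparing the printed averages for the short and long horizon gives a rough feel for how the right exploration budget depends on how many turns remain; one would expect the larger budget to pay for itself mainly when the horizon is long.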
AI Prompts
中文提示词
我在 [情境:职业方向 / 产品策略 / 学习领域] 面临"继续深耕已知最优 vs 尝试新选项"的反复抉择。我目前的选项和已知表现是 [列出]。请用探索-利用框架帮我:
① 估计我大致还剩多长的"时间视界";
② 据此判断现在该偏探索还是偏利用;
③ 给出一个具体的探索预算比例和下一步动作。
English Prompt
In [context: career direction / product strategy / learning area] I keep facing the choice "keep deepening the known best vs try new options." My current options and their known performance: [list]. Using the explore–exploit framework, please:
1. Estimate how much "time horizon" I likely have left.
2. Judge whether I should now lean toward exploration or exploitation.
3. Give a concrete exploration-budget ratio and a next action.
Optimal Stopping (the Secretary Problem): when choices arrive one at a time, must be decided on the spot, and can't be recalled, math gives a clean answer, the 37% rule. Observe the first 37% of candidates without picking, set the bar at the best of them, then take the first later candidate that beats the bar (if none does, you are left with the last one). The 37% (precisely 1/e ≈ 0.368) is both the fraction to observe and, as the number of candidates grows, the probability of landing the single best one: even the optimal strategy succeeds only about 37% of the time, and no strategy does better under these rules. The crucial precondition is no recall and no backtracking; change that (allow returns or repeated bids) and the threshold shifts, so first confirm that your situation is truly irreversible. The rule splits explore (calibrate, never act) and exploit (act only on the bar, never hesitate) cleanly along the time axis.
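The look-then-leap rule described above can be sketched and checked by simulation. This is an illustrative sketch, not a definitive implementation: the function names are hypothetical, candidates are assumed to arrive in uniformly random order, and the cutoff is simply the candidate count divided by e, rounded down, which approximates the exact optimum for small lists.

```python
import math
import random


def secretary_pick(scores, look_fraction=1 / math.e):
    """Return the index chosen by the look-then-leap rule.

    `scores` are seen in order, higher is better, and a rejected
    candidate can never be recalled.
    """
    if not scores:
        raise ValueError("need at least one candidate")
    cutoff = int(len(scores) * look_fraction)  # size of the observe-only phase
    # The bar is the best score seen while only observing.
    bar = max(scores[:cutoff], default=float("-inf"))
    for i in range(cutoff, len(scores)):
        if scores[i] > bar:
            return i  # first candidate to beat the bar: commit at once
    return len(scores) - 1  # nobody beat the bar, so the last one is forced


def success_rate(n_candidates, trials, rng, look_fraction=1 / math.e):
    """Fraction of random orderings in which the rule lands the single best."""
    hits = 0
    for _ in range(trials):
        scores = list(range(n_candidates))
        rng.shuffle(scores)
        if scores[secretary_pick(scores, look_fraction)] == n_candidates - 1:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for fraction in (0.1, 1 / math.e, 0.7):
        rate = success_rate(100, 5000, rng, fraction)
        print(f"look_fraction={fraction:.2f} success_rate={rate:.3f}")
```

Running the script compares a few observation fractions; the simulated success rate for the 1/e cutoff should land in the neighborhood of the 37% ceiling, with the other fractions doing worse.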
AI Prompts
中文提示词
我要在一串"看一个决定一个、错过不回头"的选项里做选择:[描述:招人 / 租房 / 选方案,并估计候选总数]。请帮我用最优停止法则:
① 确认我的处境是否真的不可回头(37% 法则是否适用);
② 算出我该先观察多少个来设门槛;
③ 给我一句可执行的"何时该停、立刻拍板"的判据。
English Prompt
I must choose from a stream of "decide-on-the-spot, no going back" options: [describe: hiring / apartment / picking a design, and estimate the total number]. Using optimal stopping, please:
1. Confirm whether my situation is truly irreversible (that is, whether the 37% rule applies).
2. Compute how many I should observe to set the bar.
3. Give me one actionable criterion for "when to stop and commit immediately."
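The Monte Carlo tree search in Go AIs uses exactly the UCB idea (UCT): prioritize moves whose win rate is uncertain but promising, rather than only playing the move that currently looks best. Production recommender systems make heavy use of Thompson sampling for online learning. Carried over to research and career judgment: when betting across several directions, don't pick only the proven, safe ones; keep some stake on directions that are highly uncertain but have a very high ceiling, because the loss from a wrong bet is limited while the payoff from a right one is unbounded.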
English Summary
UCB & Thompson Sampling: two elegant bandit algorithms sharing one motto, optimism in the face of uncertainty. UCB scores each option as its estimated mean plus a bonus that grows with uncertainty, then picks the highest score (explicit optimism), so it keeps trying options it is unsure about until their bonus shrinks. Thompson sampling draws one sample from your belief about each option and picks the highest draw (implicit optimism), so each option is chosen with probability equal to the posterior probability that it is actually the best. Optimism is justified because uncertainty has value: trying an unsure option has a bounded downside (confirm it's bad, stop) but a far larger upside (find a gold mine and keep mining it), an asymmetry that makes betting on uncertainty mathematically favorable. It is Bayesian thinking in action: belief is the posterior, and acting means either sampling from it or taking its optimistic upper bound.
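Both selection rules fit in a few lines. The sketch below assumes win/lose (Bernoulli) rewards, uses the common UCB1 bonus sqrt(2 ln t / n) as one specific choice of "uncertainty bonus", and gives Thompson sampling a uniform Beta(1, 1) prior; these are illustrative assumptions, and the function names are hypothetical.

```python
import math
import random


def ucb1_choose(counts, totals, t):
    """Explicit optimism: estimated mean plus an uncertainty bonus.

    counts[a] is how often arm a was pulled, totals[a] its summed reward,
    and t the total number of pulls so far.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # an untried arm is maximally uncertain: try it first
    return max(
        range(len(counts)),
        key=lambda a: totals[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )


def thompson_choose(wins, losses, rng):
    """Implicit optimism: draw one sample per arm from its Beta posterior.

    Assumes a uniform Beta(1, 1) prior over each arm's win rate.
    """
    samples = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=samples.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    true_rates = [0.3, 0.5, 0.7]  # made-up payout rates
    wins = [0, 0, 0]
    losses = [0, 0, 0]
    for _ in range(2000):
        arm = thompson_choose(wins, losses, rng)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    print("pulls per arm:", [w + l for w, l in zip(wins, losses)])
```

In UCB the bonus shrinks as an arm's pull count grows, so a well-tested arm has to win on its mean alone. In Thompson sampling a wide posterior occasionally produces a high draw, which is what gives under-tested arms their chance.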
AI Prompts
中文提示词
我在 [情境] 有几个可下注的选项:[列出每个的已知表现 + 我有多确定]。请用"面对不确定时保持乐观"(UCB / 汤普森采样)的思路帮我:
① 给每个选项标出"平均表现"和"不确定性加成";
② 指出哪个是当前最值得探索的(高潜力但还没测够);
③ 提醒我哪些选项的下行有限、上行很大,值得乐观下注。
English Prompt
In [context] I have several options to bet on: [list each one's known performance + how sure I am]. Using "optimism in the face of uncertainty" (UCB / Thompson sampling), please:
1. Tag each option with its mean performance and an uncertainty bonus.
2. Identify which is most worth exploring now (high potential, under-tested).
3. Flag which options have limited downside and large upside, worth an optimistic bet.
ε-Greedy & Annealing: the simplest balance knob. Most of the time (probability 1−ε) pick the known best (greedy exploitation); with small probability ε try something at random (exploration). It is simple and robust, so it's everywhere. A fixed ε wastes effort forever on options you already know are bad; the elegant fix is annealing, that is, decaying ε over time: explore boldly early, then converge to exploitation as evidence and confidence grow. That decay curve is the same idea behind simulated annealing and learning-rate schedules. Compared with UCB and Thompson sampling, ε-greedy explores indiscriminately (wasting tries on clearly bad options) rather than targeting the promising ones, so it's a cheap baseline that trades a little efficiency for simplicity. The life lesson: converge with maturity, but never anneal ε all the way to zero. Keep a sliver of curiosity permanently open, because the world changes and stopping exploration entirely is slow ossification.
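A minimal sketch of ε-greedy with an annealed, never-zero exploration rate follows. The inverse-time decay and the default values (start at 1.0, floor at 0.05) are assumptions picked for illustration, not recommended settings; the function names are hypothetical.

```python
import random


def annealed_epsilon(step, start=1.0, floor=0.05, decay=0.01):
    """Exploration rate that decays with experience but never drops below `floor`."""
    return max(floor, start / (1.0 + decay * step))


def epsilon_greedy_choose(estimates, epsilon, rng):
    """With probability epsilon pick a random option, otherwise the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: uniform over all options
    return max(range(len(estimates)), key=estimates.__getitem__)  # exploit


if __name__ == "__main__":
    rng = random.Random(0)
    true_rates = [0.3, 0.5, 0.7]  # made-up payout rates
    counts = [0, 0, 0]
    estimates = [0.0, 0.0, 0.0]
    for step in range(5000):
        arm = epsilon_greedy_choose(estimates, annealed_epsilon(step), rng)
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    print("pulls per arm:", counts)
    print("final epsilon:", annealed_epsilon(4999))
```

The `floor` parameter is the "never zero" part: however long the run, a small fraction of turns still goes to random exploration, which is what lets the estimates recover if the payout rates drift.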
AI Prompts
中文提示词
我想给 [领域:学习 / 工作 / 投入方向] 设一个探索-利用的预算并随时间调整。我目前的状态是 [描述:新手还是成熟、已知最优是什么]。请用 ε-贪心与退火的思路帮我:
① 建议我现在的探索率 ε 该设多高;
② 给一条随经验递减的退火节奏;
③ 提醒我那条"永不归零"的最低探索缝隙该留多大。
English Prompt
I want to set an explore–exploit budget for [area: learning / work / where I invest] and adjust it over time. My current state is [describe: novice vs mature, what the known best is]. Using ε-greedy with annealing, please:
1. Suggest how high my exploration rate ε should be now.
2. Give an annealing schedule that decays with experience.
3. Remind me how large a "never-zero" minimum exploration rate I should keep.