A normal LLM is like an O(1) table lookup—no matter how hard the question, it's one forward pass that immediately spits out an answer, with fixed compute. A reasoning model (o1 / DeepSeek-R1) is like a database query optimizer: trivial queries take the fast path, complex queries are willing to spend more CPU at runtime planning, trying, backtracking before returning. The essence: shift part of the "compute budget" from a one-time investment at training to per-request, on-demand investment at runtime.
Pain point: however large the model, the compute depth of a single forward pass is fixed. That's fine for questions you answer at a glance, but for math or coding problems requiring multi-step deduction, it's like forcing someone to blurt out the answer in 0.5 seconds—errors are inevitable. OpenAI's o1 (2024) answered this: let the model generate a long internal chain of thought before answering, turning "thinking" into a process that can spend more tokens and more time.
The core mechanism is not a prompt trick but a training method: o1 uses reinforcement learning (RL) to teach the model to "learn how to think"—the reward comes from whether the final answer is correct (in auto-verifiable domains like math and code). Through trial and error the model learns to break down steps, recognize and correct its own mistakes, and switch approaches when one fails. In 2025 DeepSeek-R1 went further, showing that pure RL (without human-annotated reasoning traces) can spontaneously elicit reflection, verification, and backtracking.
Key counterintuition: in reasoning models, "smarter" can be bought with "thinking longer"—this is the second scaling path beyond "make the model bigger" that has dominated since 2017. The cost is a steep rise in latency and token spend.
from openai import OpenAI client = OpenAI() # needs OPENAI_API_KEY # The key knob for reasoning models: reasoning_effort = "how long to think" resp = client.chat.completions.create( model="o1", reasoning_effort="high", # low/medium/high: high = more test-time compute messages=[{"role": "user", "content": "A three-digit number: digits sum to 12, " "hundreds = twice the units, tens = units + 1. Find it."}] ) print(resp.choices[0].message.content) # Note: internal reasoning tokens are hidden by OpenAI, but you pay for them print(resp.usage.completion_tokens_details.reasoning_tokens) # monitor cost
CoT is adding print statements to a black-box function—externalizing the hidden computation that the model would otherwise do in one shot inside its weights into a series of visible intermediate tokens. Those tokens get fed back in as input for the next step, so the model uses its own output as a scratchpad. Without a scratchpad, complex computation must be crammed into one forward pass; with it, serial computation unrolls into more steps.
The root cause is architectural: a Transformer's compute depth per forward pass is fixed (determined by layer count). A problem needing 20 steps of deduction crammed into fixed depth simply runs out of compute. CoT's mechanism is elegant: autoregressive generation is serial—each token the model emits lets it re-read all generated tokens before computing the next. So having it write out reasoning steps effectively spreads computation over the sequence length, trading sequence length for effective compute depth.
In 2022 Wei et al. found that simply showing the reasoning process in few-shot examples (rather than just the answer) sharply raised large models' accuracy on math and commonsense reasoning—and this ability only emerges in sufficiently large models. This is the famous "Let's think step by step."
Distinguish two layers: CoT prompting (the prompting technique from Day 3, induced by the prompt) and the CoT built into reasoning models (o1/R1, trained via RL, generated spontaneously, far longer than human-written, and self-correcting). The latter upgrades the former from a "trick" into the model's native ability.
from anthropic import Anthropic client = Anthropic() # Use <thinking> tags to give a normal model an explicit scratchpad (manual CoT) resp = client.messages.create( model="claude-opus-4-8", max_tokens=1024, messages=[{"role": "user", "content": "First reason step by step in <thinking>, then give the final answer in <answer>.\n" "Problem: a warehouse takes in 50 units/day, ships out 80/day, starts at 600. When empty?"}] ) # The model writes the reasoning first (net -30/day -> 600/30=20), then the answer print(resp.content[0].text) # In production: strip the <thinking> block before showing the user, keep <answer>
Generating an answer is the writer; verifying it is the reviewer—like writing code vs code review, or writing a test vs running it. The key insight: verification is often easier than generation (understanding a proof is easier than inventing it; reproducing a bug is easier than finding the fix). Reasoning models exploit this asymmetry—using a verifier (reward model) to check the generator's output, partly converting "hard generation" into "easy verification."
Pain point: when a reasoning chain is long, one wrong step dooms all the rest—an early miscalculation gets "taken for granted" by later steps, ending in a confident but wrong answer. How do you know which step is wrong? Two kinds of verifiers:
OpenAI's 2023 "Let's Verify Step by Step" gave the key empirical result: on hard math (the MATH dataset), process supervision (PRM) significantly outperforms outcome supervision (ORM). Intuition: telling the model "step 3 is wrong" carries far more information than just "the final answer is wrong"—the former resolves the credit-assignment problem down to the step level.
Self-verification has a lighter form too: have the model check itself ("re-examine the reasoning above—any miscalculation?"). But beware—the verifier is itself a model with blind spots. A model's ability to self-correct its own errors is limited; an independently trained verifier is usually more reliable than "checking yourself."
def verify_steps(problem, solution_steps): # Use a separate call as the "verifier", checking step by step (lightweight PRM) prompt = f"""Problem: {problem} Steps: {chr(10).join(f'{i+1}. {s}' for i, s in enumerate(solution_steps))} Check each step. Output only the number of the FIRST wrong step; if all correct, output OK.""" r = client.messages.create( model="claude-opus-4-8", max_tokens=256, messages=[{"role": "user", "content": prompt}]) return r.content[0].text # "OK" or "Step 3 wrong: ..." # Usage: generate -> verify -> if wrong, retry with the "step N is wrong" signal verdict = verify_steps("Compute 17x23", ["17x20=340", "17x3=51", "340+51=391"])
Best-of-N is the distributed-systems scatter-gather + ranking: sample N candidate answers to the same question in parallel (scatter), then pick the best with a verifier (gather + rank). It's of the same family as hedged requests—send a few, take the best, trading redundancy for quality. And self-consistency is its special case: majority-vote over N answers, like a Raft quorum.
Pain point: sampling has randomness (temperature > 0), so one generation may happen to go down the wrong path. But "occasionally right" ≠ "can't get it right"—if the correct answer appears even a few times in 100 samples, then as long as we can pick it out, accuracy jumps. Two ways to "pick":
This is the most direct embodiment of test-time compute: larger N = more compute spent = higher accuracy (log relationship, diminishing). o1-style models go further—internalizing "sample + verify + backtrack" into a single long reasoning chain instead of running N external passes.
from collections import Counter def self_consistency(problem, n=10): answers = [] for _ in range(n): r = client.messages.create( model="claude-opus-4-8", max_tokens=1024, temperature=0.8, # must be >0 for diversity messages=[{"role": "user", "content": f"Reason step by step, then end with 'Answer: X'.\n{problem}"}]) ans = extract_final_answer(r.content[0].text) # parse last line answers.append(ans) # Majority vote on the final answer (marginalize over reasoning paths) return Counter(answers).most_common(1)[0][0] # N independent reasoning paths converging -> much higher odds of correctness best = self_consistency("A pen costs $3; buy 7 with a 10% discount. Total?", n=10)