AI/ML Explained: Reasoning Models

Day 28 · 2026-06-14

For: engineers with coding experience, not specialized in AI

Engineering counterpart → super-individual D5: Agent Design (how to put a "thinking" model into an agent workflow)

Test-time Computeo1 Architecture

paradigm shiftreasoning

One-line analogy

A normal LLM is like an O(1) table lookup—no matter how hard the question, it's one forward pass that immediately spits out an answer, with fixed compute. A reasoning model (o1 / DeepSeek-R1) is like a database query optimizer: trivial queries take the fast path, complex queries are willing to spend more CPU at runtime planning, trying, backtracking before returning. The essence: shift part of the "compute budget" from a one-time investment at training to per-request, on-demand investment at runtime.

What it solves + how it works

Pain point: however large the model, the compute depth of a single forward pass is fixed. That's fine for questions you answer at a glance, but for math or coding problems requiring multi-step deduction, it's like forcing someone to blurt out the answer in 0.5 seconds—errors are inevitable. OpenAI's o1 (2024) answered this: let the model generate a long internal chain of thought before answering, turning "thinking" into a process that can spend more tokens and more time.

The core mechanism is not a prompt trick but a training method: o1 uses reinforcement learning (RL) to teach the model to "learn how to think"—the reward comes from whether the final answer is correct (in auto-verifiable domains like math and code). Through trial and error the model learns to break down steps, recognize and correct its own mistakes, and switch approaches when one fails. In 2025 DeepSeek-R1 went further, showing that pure RL (without human-annotated reasoning traces) can spontaneously elicit reflection, verification, and backtracking.

Two scaling axes (the key insight of reasoning models)

Train-time compute invest once, all requests benefit (classic route)
Test-time compute invest per request, think longer on hard ones

accuracy ∝ log(test-time tokens):
think 1K tokens
think 8K tokens
think 32K tokens ↑ diminishing but rising

Key counterintuition: in reasoning models, "smarter" can be bought with "thinking longer"—this is the second scaling path beyond "make the model bigger" that has dominated since 2017. The cost is a steep rise in latency and token spend.

Code example

from openai import OpenAI
client = OpenAI()  # needs OPENAI_API_KEY

# The key knob for reasoning models: reasoning_effort = "how long to think"
resp = client.chat.completions.create(
    model="o1",
    reasoning_effort="high",  # low/medium/high: high = more test-time compute
    messages=[{"role": "user",
               "content": "A three-digit number: digits sum to 12, "
                          "hundreds = twice the units, tens = units + 1. Find it."}]
)
print(resp.choices[0].message.content)
# Note: internal reasoning tokens are hidden by OpenAI, but you pay for them
print(resp.usage.completion_tokens_details.reasoning_tokens)  # monitor cost

Common pitfall + practical scenario

"Reasoning models are stronger, so use them for everything"—wrong. For tasks that don't need multi-step deduction—summarizing, rewriting an email, looking up a fact—a reasoning model is just slower and more expensive, not necessarily better, and sometimes "overthinks" a simple problem into complexity. The test: does this question need scratch work? Yes → reasoning model; can blurt it out → normal model.

📌 Real scenario: use a reasoning model for decision support that requires rigorous multi-step work—e.g. "given these three homes' price, commute, and school-rank data, do a weighted comparison and flag hidden risks." Such problems need "scratch work," where o1/R1's step-by-step deduction clearly beats a normal model.

Takeaway + question

💡 The essence of reasoning models: buy accuracy with runtime compute—turning "thinking" from an instantaneous act into a tunable resource.
🤔 In your workflow, which tasks are worth 10× the token cost for "the model thinks 30s longer"? Which are a waste?

Chain-of-ThoughtCoT

mechanismintermediate state

One-line analogy

CoT is adding print statements to a black-box function—externalizing the hidden computation that the model would otherwise do in one shot inside its weights into a series of visible intermediate tokens. Those tokens get fed back in as input for the next step, so the model uses its own output as a scratchpad. Without a scratchpad, complex computation must be crammed into one forward pass; with it, serial computation unrolls into more steps.

What it solves + how it works

The root cause is architectural: a Transformer's compute depth per forward pass is fixed (determined by layer count). A problem needing 20 steps of deduction crammed into fixed depth simply runs out of compute. CoT's mechanism is elegant: autoregressive generation is serial—each token the model emits lets it re-read all generated tokens before computing the next. So having it write out reasoning steps effectively spreads computation over the sequence length, trading sequence length for effective compute depth.

In 2022 Wei et al. found that simply showing the reasoning process in few-shot examples (rather than just the answer) sharply raised large models' accuracy on math and commonsense reasoning—and this ability only emerges in sufficiently large models. This is the famous "Let's think step by step."

Why CoT works: spread fixed depth over sequence length

direct → question → [one pass, fixed depth] → answer ✗ error-prone

CoT → question → step1 → step2 → step3 → ... → answer
　　　　　└─ each step re-reads all history → effective depth ∝ #steps ─┘ ✓

Distinguish two layers: CoT prompting (the prompting technique from Day 3, induced by the prompt) and the CoT built into reasoning models (o1/R1, trained via RL, generated spontaneously, far longer than human-written, and self-correcting). The latter upgrades the former from a "trick" into the model's native ability.

Code example

from anthropic import Anthropic
client = Anthropic()

# Use <thinking> tags to give a normal model an explicit scratchpad (manual CoT)
resp = client.messages.create(
    model="claude-opus-4-8", max_tokens=1024,
    messages=[{"role": "user", "content":
        "First reason step by step in <thinking>, then give the final answer in <answer>.\n"
        "Problem: a warehouse takes in 50 units/day, ships out 80/day, starts at 600. When empty?"}]
)
# The model writes the reasoning first (net -30/day -> 600/30=20), then the answer
print(resp.content[0].text)
# In production: strip the <thinking> block before showing the user, keep <answer>

Common pitfall + practical scenario

"The reasoning the model writes = its actual thought process"—not necessarily. Research repeatedly shows CoT can be post-hoc rationalization: the model already leans one way, then fabricates plausible-looking steps. So don't treat CoT text as evidence of the model's reliability—coherent steps don't mean the answer is right, nor that it "really thought that way."

📌 Real scenario: in cross-disciplinary thinking, have the model explicitly write out the reasoning chain before concluding—e.g. "analyze this organizational problem through a complexity-science lens." A visible chain lets you audit step by step which analogy doesn't hold, instead of accepting/rejecting a black-box conclusion.

Takeaway + question

💡 CoT trades "sequence length" for "compute depth"—it doesn't make the model smart, it gives it room for scratch work.
🤔 If a model's reasoning chain might be fabricated after the fact, how would you design a verification mechanism to judge whether the answer is truly trustworthy? (The next section is half the answer.)

Self-VerificationProcess Reward

verificationreward model

One-line analogy

Generating an answer is the writer; verifying it is the reviewer—like writing code vs code review, or writing a test vs running it. The key insight: verification is often easier than generation (understanding a proof is easier than inventing it; reproducing a bug is easier than finding the fix). Reasoning models exploit this asymmetry—using a verifier (reward model) to check the generator's output, partly converting "hard generation" into "easy verification."

What it solves + how it works

Pain point: when a reasoning chain is long, one wrong step dooms all the rest—an early miscalculation gets "taken for granted" by later steps, ending in a confident but wrong answer. How do you know which step is wrong? Two kinds of verifiers:

ORM (Outcome Reward Model)—judges only the final answer, like an end-to-end test. Simple but sparse feedback: you know it's wrong, not where;
PRM (Process Reward Model)—scores every step of the chain, like line-by-line lint / single-step breakpoints. Dense feedback that precisely locates the first wrong step.

OpenAI's 2023 "Let's Verify Step by Step" gave the key empirical result: on hard math (the MATH dataset), process supervision (PRM) significantly outperforms outcome supervision (ORM). Intuition: telling the model "step 3 is wrong" carries far more information than just "the final answer is wrong"—the former resolves the credit-assignment problem down to the step level.

ORM vs PRM: granularity of feedback

step1 ✓ step2 ✓ step3 ✗ step4 ✓ step5 → answer
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ORM looks at the end only → "answer wrong" (where?)
PRM scores each step → "step 3 wrong" (located → targeted fix/retry)

Self-verification has a lighter form too: have the model check itself ("re-examine the reasoning above—any miscalculation?"). But beware—the verifier is itself a model with blind spots. A model's ability to self-correct its own errors is limited; an independently trained verifier is usually more reliable than "checking yourself."

Code example

def verify_steps(problem, solution_steps):
    # Use a separate call as the "verifier", checking step by step (lightweight PRM)
    prompt = f"""Problem: {problem}
Steps:
{chr(10).join(f'{i+1}. {s}' for i, s in enumerate(solution_steps))}

Check each step. Output only the number of the FIRST wrong step;
if all correct, output OK."""
    r = client.messages.create(
        model="claude-opus-4-8", max_tokens=256,
        messages=[{"role": "user", "content": prompt}])
    return r.content[0].text  # "OK" or "Step 3 wrong: ..."

# Usage: generate -> verify -> if wrong, retry with the "step N is wrong" signal
verdict = verify_steps("Compute 17x23", ["17x20=340", "17x3=51", "340+51=391"])

Common pitfall + practical scenario

"If the model says 'I checked it, looks fine', trust it"—wrong. Models often confidently confirm wrong answers (verifier and generator share the same blind spots). Effective self-verification uses either an independently trained verifier or an external verifiable signal (run the code, plug numbers back in)—anchoring verification to truth outside the model.

📌 Real scenario: for critical calculations/decisions, explicitly require a second independent verification—"re-compute it a different way and cross-check." For ROI, first use the compound-interest formula, then have it verify by year-over-year accumulation; if the two disagree, there's an error.

Takeaway + question

💡 Verification being easier than generation is the lever of reasoning models—but "self-checking" has blind spots; independent/external verification is what's trustworthy.
🤔 In your field, which problems are far easier to "verify" than to "solve"? Those are exactly where AI reasoning + verification shines.

Best-of-N SamplingBest-of-N

sampling strategyspeculative

One-line analogy

Best-of-N is the distributed-systems scatter-gather + ranking: sample N candidate answers to the same question in parallel (scatter), then pick the best with a verifier (gather + rank). It's of the same family as hedged requests—send a few, take the best, trading redundancy for quality. And self-consistency is its special case: majority-vote over N answers, like a Raft quorum.

What it solves + how it works

Pain point: sampling has randomness (temperature > 0), so one generation may happen to go down the wrong path. But "occasionally right" ≠ "can't get it right"—if the correct answer appears even a few times in 100 samples, then as long as we can pick it out, accuracy jumps. Two ways to "pick":

Self-Consistency (majority vote)—sample N reasoning chains, vote on the final answer, take the most frequent. Mathematically it marginalizes over reasoning paths: answer = argmax over the count of paths reaching that answer. No extra verifier needed—a free boost;
Verifier-guided—score the N candidates with a trained verifier, pick the highest. Cobbe et al. 2021 (the GSM8K paper) showed verifier selection is more compute-efficient than just scaling the model, and keeps improving with more samples.

Best-of-N: scatter → gather → rank

　　　　　　┌→ cand 1 → answer A
question → sample×N ┼→ cand 2 → answer B　┐
　　(temp>0)　├→ cand 3 → answer A　├→ select → A
　　　　　　└→ cand 4 → answer A　┘
　　vote: A appears 3× → wins / or verifier picks highest score

This is the most direct embodiment of test-time compute: larger N = more compute spent = higher accuracy (log relationship, diminishing). o1-style models go further—internalizing "sample + verify + backtrack" into a single long reasoning chain instead of running N external passes.

Code example

from collections import Counter

def self_consistency(problem, n=10):
    answers = []
    for _ in range(n):
        r = client.messages.create(
            model="claude-opus-4-8", max_tokens=1024,
            temperature=0.8,  # must be >0 for diversity
            messages=[{"role": "user",
                       "content": f"Reason step by step, then end with 'Answer: X'.\n{problem}"}])
        ans = extract_final_answer(r.content[0].text)  # parse last line
        answers.append(ans)
    # Majority vote on the final answer (marginalize over reasoning paths)
    return Counter(answers).most_common(1)[0][0]

# N independent reasoning paths converging -> much higher odds of correctness
best = self_consistency("A pen costs $3; buy 7 with a 10% discount. Total?", n=10)

Common pitfall + practical scenario

"Bigger N is always better"—there's a ceiling. (1) Cost grows linearly in N while accuracy grows only logarithmically—ROI drops fast; (2) majority vote only works for problems with a unique answer (math, multiple choice)—open-ended writing can't be voted on; (3) if the model systematically makes the same mistake, 100 candidates agree on being wrong—voting amplifies consensus, not truth.

📌 Real scenario: for high-stakes calculations with a ground truth (taxes, dosage, finance), run self-consistency manually—ask the same question 3-5 times; only trust an answer if they agree. Disagreement is the danger signal telling you to bring in a human to cross-check.

Takeaway + question

💡 Best-of-N turns randomness into an advantage via "sampling redundancy + selection"—but it amplifies consensus, and is helpless against systematic error.
🤔 Voting amplifies consensus, not truth—where else have you seen this trap, in group decisions or scientific consensus?
Engineering counterpart → super-individual D5: Agent Design

Deep Questions

1. "Test-time compute" and "making bigger models" are two scaling paths—do they substitute or complement? How to weigh them in 2026?

They are complementary and multiplicative, not substitutes. Bigger models (train-time compute) raise the model's capability ceiling—how much it "knows," how accurate its single-step reasoning is; test-time compute raises how fully you extract that capability—given a fixed-tier model, thinking longer approaches its ceiling. The relationship is like CPU clock speed vs how long you let a program run: a slow CPU can't solve an over-its-head problem no matter how long it runs, but once the clock is fast enough, more time solves harder problems. Practical trade-offs: (a) basic tasks → small model + low effort, save money; (b) hard problems → strong model + high effort; (c) the key insight is that test-time compute has diminishing returns (log), so blindly raising effort stops being worth it past a point—better to switch to a stronger base model. The 2024-2025 industry pivot: as pretraining data nears exhaustion and train-time scaling's marginal gains drop, test-time compute became the new growth curve—the root cause of the o1/R1 boom. A distributed-systems analogy: it's like "vertical scaling (stronger box)" vs "giving each request a larger timeout budget"—two knobs, tuned by workload.

2. CoT may be "post-hoc rationalization"—so does a reasoning model's long thinking chain have cognitive value? What should we trust it for, and not?

Split CoT into two functions and judge each separately. Function one: as a computation carrier (functional role)—CoT tokens really do give the model extra serial compute space; this part is genuinely effective—remove CoT and accuracy drops, proving it's not just decoration. Function two: as an interpretability window (faithfulness)—"does this text faithfully reflect the model's true reasoning?" This part is not fully trustworthy: studies find a model can be swayed by a bias in the prompt yet never mention it in the CoT, or reach a conclusion first and backfill steps. So the right stance: trust CoT's "compute value," doubt its "confession value." It's a scratchpad, not a polygraph. Practical implications: (1) using CoT to improve answer quality—safe; (2) using CoT to audit "where the model went wrong"—useful but skeptical, since visible steps may not be the true causes; (3) treating CoT as evidence of the model's confidence—dangerous, since coherent reasoning often wraps a wrong answer. This is why verification (section 3) can't rely solely on reading the model's own chain—it must anchor to external signals. Deeper: it forces us to rethink what "understanding" is—a system that produces useful intermediate steps that don't faithfully reflect its internals, does it "reason"? Neuroscience has a mirror: much of human "self-explanation" is likewise post-hoc rationalization.

3. Section 3 (verifier selection) and section 4 (Best-of-N) are two faces of one thing—how do they relate to o1's "single long chain"?

They are external-parallel vs internal-serial implementations of the same idea: "explore multiple paths + select/correct." Best-of-N (external): run N independent samples, paths unaware of each other, then pick the best by vote or verifier at the result level—pros: simple, parallel, controllable; cons: N paths are fully redundant, can't learn from each other, and selection happens only at the end. o1/R1 (internal): compress the search into one long chain—the model "tries a path → finds it wrong → backtracks → switches," with verification and exploration interwoven in one sequence, so later steps can use lessons from earlier failures. This is closer to how humans solve problems: not doing 10 parallel exam sheets and picking the best, but repeatedly editing one sheet. The cost is that this ability must be specifically trained via RL (normal models don't spontaneously backtrack). The trend: o1-style "internalized search" proved more efficient—information is reused within the chain, nothing wasted. But they're not mutually exclusive: you can stack Best-of-N on top of o1 (run a reasoning model N times and vote), squeezing accuracy further at multiplied cost. From a search-algorithm view: Best-of-N is like independent random restarts, while o1's internal search is more like DFS/beam search with backtracking—the latter uses state memory to avoid re-exploring.

4. Reasoning models excel at "verifiable domains" (math, code). Does the paradigm still hold in domains with no objective right/wrong (writing, strategy, interpersonal judgment)?

The core constraint is nailed: this wave's breakthrough depends heavily on "auto-verifiable" reward signals—math answers can be checked, code can be tested, so RL has a clean reward. Once you enter domains with no ground truth, all three mechanisms loosen: (1) RL training lacks a reward—"is this essay good" has no auto-grader, only human preference (expensive, slow, biased) or another model as judge (introducing the judge's own bias); (2) verification is no longer easier than generation—judging whether a strategic decision is correct can be as hard as making it, sometimes only knowable years later, so the verification lever vanishes; (3) Best-of-N's vote fails—open problems have no "unique answer" to vote on. So currently reasoning models' gains in soft domains are far smaller than in hard ones. But there's partial transfer: (a) soft problems often contain verifiable sub-structures—market sizing and financial models inside a strategy analysis are hard, can use a reasoning model, leaving the soft judgment to humans; (b) process soundness is still partly checkable—logical consistency, self-contradiction, missing key factors are easier to judge than "final correctness." A deeper lesson for the "AI super-individual": humans' irreplaceable value is concentrating in "judgments that can't be auto-verified"—taste, value ordering, betting under uncertainty. AI commoditized "thinking with a known answer," which paradoxically highlights how human-exclusive "defining what a good answer is" remains.

5. If "accuracy ∝ log(test-time compute)," is there a problem—given infinite compute, can AI solve anything? Where's the boundary?

No, the boundary is real and multi-layered. First, the log relationship is itself a ceiling signal: raising accuracy from 90% to 99% may take 10× compute; 99% to 99.9% another 10×—exponential cost for linear gain, hitting an economic wall fast. Second, base capability is a hard ceiling: test-time compute can only approach the model's existing ceiling, it can't conjure knowledge or reasoning patterns the model lacks. A model that never learned group theory can't prove a theorem needing it no matter how long it thinks—thinking only explores "what it can reach" more thoroughly, not creates from nothing. Third, the presence of a verification signal (see previous question): in domains without verifiable reward, more compute has no reliable "selection" basis—thinking longer may just refine errors. Fourth, some problems are computationally irreducible: certain problems have no shortcut, you must actually compute step by step, and no amount of clever "thinking" compresses it—echoing Wolfram's insight in complexity science. So the more accurate picture: test-time compute is a powerful but bounded knob—it makes "problems within the model's reach but requiring effort" solvable, but can't touch "beyond the model's ceiling" or "intrinsically irreducible" problems. For the super-individual: AI's leverage is largest on "known method, needs patient execution" problems; genuine creative breakthroughs (inventing new methods, posing new questions) remain scarce—and that's exactly where humans should invest.

Test-time Computeo1 Architecture

Chain-of-ThoughtCoT

Self-VerificationProcess Reward

Best-of-N SamplingBest-of-N

Further Reading

Deep Questions