DAY 49 / PHASE 6 · ENGINEERING THE NEW FRONTIER

Reasoning Model Engineering

Routing · Thinking Budget · Agentic Reasoning · Using the Trace

2026-06-29 · BigCat

Reasoning isn't a "smarter model" — it's a slower, pricier knob that also overthinks.

Prerequisite → ai-ml-daily Day 28 (Reasoning Models)

// WHY THIS MATTERS

o1 / o3 / DeepSeek-R1 / Claude extended thinking turn "reasoning ability" into a tunable parameter. The engineering reality: most teams either turn thinking on for everything (cost and latency multiply, and simple tasks get less accurate) or never turn it on (forcing a standard model through tasks that genuinely need reasoning). This issue isn't about what a reasoning model is or how RL trains the CoT — that's ai-ml-daily Day 28. It's about four engineering decisions: which requests should route to reasoning, how to set the thinking budget (and why "longer = more accurate" is false), how to use reasoning inside an agentic loop without breaking the protocol, and how far you can trust a reasoning trace. The core counter-intuition: longer chains of thought are often less accurate — and you paid double the tokens for them.

// 01

Routing: Which Requests Actually Deserve Reasoning

Claim: a reasoning model isn't the default tier — it's a high-cost tier triggered by "verifiable + steps not compressible up front."

Background & Principle

Reasoning models trade test-time compute for accuracy: more tokens spent on hidden thinking. Snell et al. 2024, Scaling LLM Test-Time Compute Optimally (arXiv 2408.03314), gives the key engineering result: the optimal strategy depends on task difficulty. Where the base model already has a non-trivial success rate, extra test-time compute can beat simply scaling model parameters; where the problem is far beyond the model, the paper reports little benefit; and a task the standard model already solves in one shot has no headroom left to buy. So the routing test isn't "important vs not," it's three conditions that must all hold to escalate: (1) the answer is verifiable (math/code/logic with objective truth); (2) it needs multi-step planning whose step count can't be pre-compressed into a workflow; (3) the cost of error exceeds the extra money and latency. Conversely, one-shot tasks (retrieval QA, formatting, classification, rewriting, extraction) routed to reasoning just burn money and often lose accuracy to overthinking.

        request in
            │
            ▼
 ┌─────────────────────┐   no
 │ objectively         ├────────▶ standard model (fast/cheap)
 │ verifiable answer?  │
 └──────────┬──────────┘
            │ yes
            ▼
 ┌─────────────────────┐   no
 │ multi-step & step   ├────────▶ workflow / standard model
 │ count unknown?      │
 └──────────┬──────────┘
            │ yes
            ▼
 ┌─────────────────────┐   low
 │ error cost vs 2-5×  ├────────▶ standard + self-check
 │ overhead?           │
 └──────────┬──────────┘
            │ high
            ▼
 reasoning model + tiered thinking budget

Hands-on

Write routing as a code-layer pre-classifier, not "send everything to o3":

def route(task):
    # cheap classifier (standard model/rules) picks a difficulty tier
    if task.kind in ("lookup", "format", "classify"):
        return ("claude-haiku-4-5", None)            # no thinking
    if task.kind in ("math", "plan", "debug") and task.verifiable:
        return ("o3", "medium")                     # escalate + medium budget
    return ("claude-sonnet-4-6", None)

An underrated fact: within the same reasoning family, a smaller model on high effort often loses to a larger model on low effort — but is far cheaper. So the routing table must tune two dimensions, model size and reasoning tier, not just one.
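The two-dimensional routing table can be sketched as a plain lookup keyed on task kind and difficulty. This is a minimal sketch: the model names (`small-standard`, `large-reasoning`, and so on), the difficulty labels, and every cell in the table are placeholders assumed for illustration, to be replaced with whatever your own eval sweep supports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str             # which model to call
    effort: Optional[str]  # reasoning tier, or None for no thinking


# Hypothetical table: keys are (task kind, difficulty) produced by your own
# cheap classifier. Model names are placeholders, not real identifiers.
ROUTES = {
    ("lookup", "easy"): Route("small-standard", None),
    ("debug", "easy"): Route("small-reasoning", "low"),
    ("debug", "hard"): Route("large-reasoning", "low"),
    ("math", "hard"): Route("large-reasoning", "medium"),
}
DEFAULT = Route("mid-standard", None)


def pick_route(kind: str, difficulty: str) -> Route:
    """Choose model size and reasoning tier together; fall back to a standard model."""
    return ROUTES.get((kind, difficulty), DEFAULT)
```

Keeping the table as data rather than branching code means a budget sweep can change a cell without touching the routing logic.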

Failure mode: treating reasoning as "quality insurance" and turning it on everywhere. Result: (1) latency goes from 1s to 10s+, killing streaming first-token UX; (2) easy tasks get less accurate (see §2 overthinking); (3) bills grow 3-5×. Reasoning is a guided weapon, not a default shield.

Resources · Snell et al. Scaling LLM Test-Time Compute Optimally, arXiv 2408.03314 · OpenAI Reasoning best practices, developers.openai.com

// 02

Thinking Budget: Longer ≠ More Accurate

Claim: the thinking budget is a knob with a sweet spot — too low underthinks, too high overthinks, both lose accuracy.

Background & Principle

OpenAI and Anthropic expose different control surfaces. OpenAI uses reasoning_effort (low / medium / high); Anthropic extended thinking historically used budget_tokens (must be < max_tokens), but on Opus 4.6 / Sonnet 4.6 budget_tokens is deprecated and replaced by adaptive thinking: depth is steered by effort and the model decides how long to think. The key counter-intuition comes from empirical work on both failure directions. On the Underthinking of o1-like LLMs (arXiv 2501.18585) finds models flip between ideas without digging into any of them; the opposite failure, overthinking, is just as real: longer chains of thought can lower average accuracy as the model loops and self-revises. In short, more budget isn't better: each task type has its own sweet spot, found via eval, not guessed.

Hands-on

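from openai import OpenAI
import anthropic

openai_client = OpenAI()          # reads OPENAI_API_KEY
claude = anthropic.Anthropic()    # reads ANTHROPIC_API_KEY
# `task` (a string) and `msgs` (a message list) come from your application

# OpenAI o-series: effort is the main knob (Responses API)
resp = openai_client.responses.create(
    model="o3", reasoning={"effort": "medium"}, input=task)

# Anthropic extended thinking (Claude 3.7 through 4.5): explicit budget_tokens
resp = claude.messages.create(
    model="claude-sonnet-4-5", max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4000},  # must be < max_tokens
    messages=msgs)

# Claude 4.6+: budget_tokens deprecated, use adaptive thinking; depth via effort
thinking={"type": "adaptive"}   # model adapts; pair with the effort setting (check current docs for the exact field)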

Engineering practice: set three budget tiers per task class (low/medium/high, or ~1k/4k/16k tokens), then sweep a set of ground-truth questions and plot "budget vs accuracy," taking the knee, not the peak. Most tasks top out at medium; high only adds cost, not score.
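"Take the knee, not the peak" can be made mechanical. The sketch below picks the cheapest tier whose accuracy is within a tolerance of the best tier; the 0.02 tolerance and the accuracy numbers in the usage example are invented for illustration, not measurements.

```python
def pick_knee(curve, tolerance=0.02):
    """Return the cheapest tier whose accuracy is within `tolerance` of the best.

    curve: list of (tier_name, accuracy) pairs ordered from cheapest to most
    expensive, as produced by a budget sweep over ground-truth questions.
    """
    if not curve:
        raise ValueError("curve must contain at least one tier")
    best = max(acc for _, acc in curve)
    for tier, acc in curve:
        if best - acc <= tolerance:
            return tier  # first (cheapest) tier that is close enough to the peak


# Illustrative sweep: high barely beats medium, so medium is the knee.
sweep = [("low", 0.71), ("medium", 0.84), ("high", 0.85)]
print(pick_knee(sweep))
```

Because the rule compares against the best tier rather than the most expensive one, it also handles the overthinking case where accuracy falls as the budget grows.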

Failure mode: (1) handing a simple classification task a high budget — triggers overthinking, accuracy drops, latency spikes. (2) setting budget_tokens ≥ max_tokens errors out; thinking tokens draw from the same output budget. (3) shipping old budget_tokens code on 4.6+ — deprecated; migrate to effort.
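Failure (2) is cheap to catch before the request leaves your process. A minimal sketch of a pre-flight check for the explicit-budget (non-interleaved) case follows; the function name is hypothetical, and the 1024-token floor is an assumption to confirm against the vendor documentation for your model.

```python
def check_thinking_budget(max_tokens: int, budget_tokens: int,
                          min_budget: int = 1024) -> int:
    """Validate an explicit thinking budget; return tokens left for the visible answer.

    min_budget is an assumed floor; confirm the real minimum in the vendor docs.
    """
    if budget_tokens < min_budget:
        raise ValueError(f"budget_tokens={budget_tokens} is below {min_budget}")
    if budget_tokens >= max_tokens:
        raise ValueError(
            f"budget_tokens={budget_tokens} must be < max_tokens={max_tokens}")
    # Thinking draws from the same output budget, so this is what remains.
    return max_tokens - budget_tokens


print(check_thinking_budget(max_tokens=8000, budget_tokens=4000))
```

The return value is worth logging: a budget that technically passes but leaves only a few hundred tokens for the answer tends to produce truncated output.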

Resources · Anthropic Extended thinking docs, docs.claude.com/.../extended-thinking · Wang et al. Underthinking of o1-like LLMs, arXiv 2501.18585

// 03

Reasoning Inside the Agentic Loop: Interleaved Thinking & the Protocol Trap

Claim: the biggest trap with reasoning in an agent isn't when to turn it on — it's preserving thinking blocks verbatim between tool calls.

Background & Principle

Standard reasoning is "think, then answer." But an agent needs "think a bit → call tool → see result → think again → call again." That's interleaved thinking (Anthropic enables it with the beta header interleaved-thinking-2025-05-14): the model can insert thinking around each tool call and adjust the next step from the tool_result. OpenAI's o3/o4-mini instead train tool calling natively into the CoT, so the model decides when to use tools while thinking. The protocol trap is easy to fall into: across turns, you must pass back the previous assistant turn's thinking blocks verbatim (including their signatures). Many harnesses strip thinking as "filler" to save tokens, and the result is either an error or a severed reasoning chain and a suddenly worse agent. In interleaved mode the thinking tokens also accumulate across blocks, so budget semantics differ from a single turn.

single-turn reasoning:   [think ........] → answer

agentic interleaved:     [think]→tool_use──▶ tool_result ──▶[think]→tool_use──▶ ... → answer
                            ▲                                  ▲
                            └──────────────────────────────────┘
                            these thinking blocks must be passed back verbatim (w/ signature)
                            stripping them = protocol error / severed reasoning chain

Hands-on

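import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY
MAX_STEPS = 20                   # backstop so a confused agent cannot loop forever

# `task`, `schemas`, and `run_tools` come from your own harness;
# run_tools must return a list of tool_result blocks, one per tool_use block.
msgs = [{"role": "user", "content": task}]
for _ in range(MAX_STEPS):
    r = client.beta.messages.create(
        model="claude-sonnet-4-5", max_tokens=8000,
        betas=["interleaved-thinking-2025-05-14"],
        thinking={"type": "enabled", "budget_tokens": 4000},
        tools=schemas, messages=msgs)
    # critical: append r.content unmodified; do NOT delete thinking blocks from it
    msgs.append({"role": "assistant", "content": r.content})
    if r.stop_reason != "tool_use":
        break
    msgs.append({"role": "user", "content": run_tools(r.content)})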

Another engineering point: reasoning is most valuable in an agent for planning and recovery — let it think hard about the overall plan up front, and about why a tool failed afterward. Purely mechanical tool calls in between don't need high reasoning every step; tune the tier on demand.
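Tuning the tier on demand can be as small as one function in the harness. This is a sketch under assumptions: the tier names mirror an effort-style knob, the "first step is planning" heuristic is a simplification, and whether a given API lets you change the tier between calls inside one loop is something to verify for your vendor.

```python
def effort_for_step(step_index: int, last_tool_failed: bool) -> str:
    """Pick a reasoning tier for the next model call in an agent loop."""
    if step_index == 0:
        return "high"  # planning: think hard about the overall approach once
    if last_tool_failed:
        return "high"  # recovery: work out why the tool call failed
    return "low"       # mechanical follow-up call, no deep reasoning needed


# Example: plan, two routine calls, then a failure that needs diagnosis.
history = [(0, False), (1, False), (2, False), (3, True)]
print([effort_for_step(i, failed) for i, failed in history])
```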

Failure mode: (1) compaction/history trimming deletes thinking blocks — in interleaved mode this errors or severs reasoning; if you keep a turn, keep all of it. (2) jamming a reasoning model into a fixed-step pipeline as an "agent" — that's a workflow; a standard model + code control flow is faster and more stable (see Day 3).
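"If you keep a turn, keep all of it" translates into trimming by whole exchanges instead of by block. The sketch below assumes the history is the task message followed by strictly alternating assistant and tool-result messages; `trim_history` is a hypothetical helper, and how much earlier history your provider lets you drop is worth checking separately.

```python
def trim_history(msgs, max_pairs):
    """Keep msgs[0] (the task) plus the newest `max_pairs` assistant/tool-result exchanges.

    Exchanges are dropped whole, so a kept assistant turn still carries its
    thinking blocks and signatures exactly as the API returned them.
    """
    if max_pairs < 0:
        raise ValueError("max_pairs must be >= 0")
    head, tail = msgs[:1], msgs[1:]
    # tail alternates assistant, user(tool_result), ...; an odd length means
    # it ends on an assistant turn that has no tool_result yet, so keep it too.
    keep = 2 * max_pairs + len(tail) % 2
    if keep >= len(tail):
        return list(msgs)
    return head + tail[len(tail) - keep:]
```

The function never looks inside `content`, which is the point: a trimmer that cannot see blocks cannot strip them by accident.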

Resources · Anthropic Extended thinking · interleaved thinking, docs.claude.com/.../extended-thinking · OpenAI o3/o4-mini function calling guide, developers.openai.com/cookbook

// 04

Prompt Anti-Patterns & the Honest Use of the Trace

Claim: a CoT prompt tuned for a standard model often hurts a reasoning model; and the reasoning trace is not a faithful explanation.

Background & Principle

A reasoning model already thinks internally, so telling it to "think step by step" is redundant and sometimes harmful. OpenAI's official guide is explicit: these models prefer concise, direct prompts, and forced CoT scaffolding plus heavy few-shot can lower performance (Simon Willison's 2025 prompting notes summarize this well). So the "let's think step by step + 5 examples" templates carried over from the GPT-4 era should be torn out. The second trap is the trustworthiness of the reasoning trace: a chain of thought reads coherently, but it's not a faithful record of how the model reached its decision. Interpretability research repeatedly shows backward rationalization: the model reaches a conclusion, then narrates reasons for it. So don't parse CoT text for hard downstream decisions, and don't believe a conclusion just because "the reasoning looks right." Trust the verifiable final output (run the tests, reconcile the numbers, check the citations), not the narrated process.

Hands-on

# BAD: drop a GPT-4 CoT template straight onto o3 / extended thinking
"Let's think step by step. Here are 6 examples... Now reason carefully:"
# → redundant scaffolding + heavy few-shot, often lowers reasoning-model perf

# GOOD: concise and direct, state the constraints, let it think
"Fix this concurrency bug. Constraints: don't change the public API, keep it backward compatible."
# When you need markdown (OpenAI o-series), put on the first line of the developer message:
"Formatting re-enabled"

An eval rule of thumb that runs against intuition: do not assume a longer trace = more trustworthy. When evaluating reasoning output, watch objective accuracy and don't be dazzled by a long, meticulous trace; "thought thoroughly yet answered wrong" is common on hard problems. Picking answers by how good the trace looks is essentially using an imperfect verifier to choose among imperfect chains, which launders errors into more confident wrongness.
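One way to keep an eval honest is to make the scorer structurally unable to see the trace's quality. In this sketch, exact string match stands in for a real verifier (running tests, reconciling numbers); the function names and sample data are illustrative assumptions.

```python
def score_run(final_answer: str, expected: str, trace: str = "") -> int:
    """Return 1 if the final answer matches ground truth, else 0.

    `trace` is accepted so call sites can pass it, but it is deliberately
    ignored: neither its length nor its polish can move the score.
    """
    return int(final_answer.strip() == expected.strip())


def accuracy(results) -> float:
    """results: list of (final_answer, expected, trace) tuples."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(score_run(a, e, t) for a, e, t in results) / len(results)


# A long, careful-looking trace with a wrong answer still scores 0.
runs = [("42", "42", "short"), ("7", "8", "let me double-check... " * 500)]
print(accuracy(runs))
```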

Failure mode: (1) treating the reasoning model as one that "explains itself" and using the CoT as audit evidence — it's a performance, not a log. (2) piling on few-shot CoT templates — spend more, perform worse. (3) an LLM-judge that rewards "thorough reasoning" — that trains the model to overthink.
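Failure (2) can be caught with a small lint over prompt templates before they reach a reasoning model. The pattern list below is a hypothetical starting point taken from the phrases quoted in this section; it is a sketch, not a complete detector.

```python
import re

# Hypothetical phrase list; extend it with the scaffolding your old templates used.
SCAFFOLD_PATTERNS = [
    r"let'?s think step by step",
    r"think step by step",
    r"reason carefully",
]


def find_cot_scaffolding(prompt: str) -> list[str]:
    """Return the scaffolding patterns that match the prompt (case-insensitive)."""
    return [p for p in SCAFFOLD_PATTERNS
            if re.search(p, prompt, flags=re.IGNORECASE)]


bad = "Let's think step by step. Here are 6 examples... Now reason carefully:"
good = "Fix this concurrency bug. Constraints: don't change the public API."
print(find_cot_scaffolding(bad))
print(find_cot_scaffolding(good))
```

A non-empty result is a prompt to review the template, not proof it hurts; the A/B comparison still decides.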

Resources · Simon Willison OpenAI reasoning models: advice on prompting, simonwillison.net · DeepSeek-R1 (RL-incentivized reasoning), arXiv 2501.12948

// Capstone · Add a "reasoning router + budget" layer to your agent

String the four points into a shippable weekend upgrade: give an existing agent tiered reasoning instead of all-on or all-off.

Routing layer (§1): add a cheap classifier before the LLM call — push lookup/format to a small standard model, and math/plan/debug that's verifiable to reasoning. Log the hit rate per class.
Budget sweep (§2): for each escalated task class, prepare 10-20 ground-truth questions, sweep low/medium/high, plot the accuracy curve, and take the knee. You'll usually find medium already tops out.
Agentic wiring (§3): if the agent alternates thinking and tool calls, enable interleaved thinking and audit your message assembly: confirm thinking blocks survive verbatim across turns and compaction doesn't drop them.
Prompt slimming (§4): delete all "step by step" scaffolding and redundant few-shot; switch to concise-and-direct + explicit constraints. A/B before vs after.
Honest eval: judge only verifiable final accuracy, p95 latency, and per-request cost; never reward trace length. Reconciling these gives you an "accuracy / cost / latency" table — the entire decision basis for reasoning engineering.
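The "accuracy / cost / latency" table reduces to one summary row per configuration. A minimal sketch follows, assuming each run is logged as a dict with the hypothetical keys `correct`, `latency_s`, and `cost_usd`, and using the nearest-rank definition of p95.

```python
import math
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(runs):
    """Collapse logged runs for one (model, tier) config into a table row."""
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in runs),
        "p95_latency_s": p95([r["latency_s"] for r in runs]),
        "cost_per_request_usd": mean(r["cost_usd"] for r in runs),
    }


# Illustrative log for one config; the numbers are made up.
runs = [
    {"correct": True, "latency_s": 1.0, "cost_usd": 0.01},
    {"correct": True, "latency_s": 2.0, "cost_usd": 0.01},
    {"correct": True, "latency_s": 3.0, "cost_usd": 0.02},
    {"correct": False, "latency_s": 10.0, "cost_usd": 0.04},
]
print(summarize(runs))
```

Trace length is absent from the row on purpose, so nothing downstream can reward it.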

Once done, "should we use reasoning?" stops being a gut call — you have a routing table keyed by task type with tiers and budgets: cheaper, lower latency, and often more accurate.

// KEY TERMS

Reasoning Model: A model that trades test-time compute (hidden thinking tokens) for accuracy: o1/o3, DeepSeek-R1, Claude extended thinking.
Test-Time Compute: Extra compute spent at inference (not training) time. Snell 2024: the optimal allocation depends on task difficulty.
Thinking Budget: Limit on internal thinking. Anthropic budget_tokens (an explicit token cap) / OpenAI reasoning_effort (a coarse tier rather than a token count).
Adaptive Thinking: The Opus/Sonnet 4.6+ replacement for fixed budget_tokens; depth adapts via effort.
reasoning_effort: OpenAI o-series reasoning-depth knob: low / medium / high.
Interleaved Thinking: Inserting thinking between tool calls (Anthropic beta header). Requires keeping thinking blocks across turns.
Overthinking / Underthinking: Overthinking (verbose, self-revising, less accurate) vs underthinking (jumping between ideas without depth).
Backward Rationalization: The model reaches a conclusion then invents reasons. So a CoT trace is not a faithful decision log.

// DEEPER QUESTIONS

If the reasoning trace isn't faithful, why does passing it back to the model (interleaved thinking) still improve agent performance?

Distinguish "external explanation" from "internal working memory." As a causal explanation for humans the trace is untrustworthy (post-hoc rationalization); but as conditioning input for the model's own next step it's genuinely useful — it caches intermediate conclusions into context, and later tokens compute on top of them. So stripping it severs the chain (loses working memory), while using it as audit evidence fools you (it's not a faithful log). Same text, two uses, completely different trustworthiness.

Given a "budget sweet spot," why don't vendors just let the model decide how long to think instead of exposing budget/effort knobs?

Claude 4.6's adaptive thinking is exactly that move. But knobs still matter: (1) adaptation is the model's subjective difficulty estimate, with systematic bias (it overthinks easy items); (2) production needs predictable cost and latency ceilings, and pure adaptation leaves p95 uncontrollable; (3) different businesses sit at different accuracy-vs-latency trade-offs and need external enforcement. So the trend is "adaptive by default + effort as a ceiling," not pure auto.

The routing classifier itself costs an LLM call. When does this routing layer become a net loss?

When the request distribution is highly homogeneous (almost all one difficulty tier), routing just adds a hop of latency and cost — fixing the tier is better. Routing pays off when the difficulty distribution is wide and the high tier is much pricier: the saved big-model/high-reasoning calls dwarf the classification cost. Optimizations: classify with rules/a small model/cache (sub-second, near-free), reserve expensive judgment for genuinely ambiguous requests; or route for free on upstream metadata (task type, source).
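The break-even is simple arithmetic. The sketch below compares routing against the baseline of sending everything to the reasoning tier, under the simplifying assumption that the classifier is always right; all prices in the example are invented.

```python
def routing_saving(p_hard: float, cost_reasoning: float,
                   cost_standard: float, cost_classifier: float) -> float:
    """Expected saving per request from routing vs. always using reasoning.

    p_hard is the fraction of requests the classifier escalates. A positive
    result means the routing layer pays for itself.
    """
    routed = (cost_classifier
              + p_hard * cost_reasoning
              + (1 - p_hard) * cost_standard)
    return cost_reasoning - routed


# Wide distribution: 20% hard requests, reasoning 10x the standard price.
print(routing_saving(0.2, 0.05, 0.005, 0.001))
# Homogeneous distribution: everything is hard, so routing only adds cost.
print(routing_saving(1.0, 0.05, 0.005, 0.001))
```

Algebraically the saving is (1 - p_hard) × (cost_reasoning - cost_standard) - cost_classifier, which makes both levers in the paragraph above visible: a wider spread of difficulty and a bigger price gap.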

"Shorter answers are often more accurate" — does that mean we should penalize long output? How do we avoid squashing hard problems too short?

You can't penalize length across the board, or hard problems get underthought. The fix is to tier by difficulty: squash easy items (prevent overthinking), give hard ones enough budget (prevent underthinking), decided jointly by §1 routing and §2 budget sweeps. Statistically, "shorter is more accurate" is largely a confound rather than a causal effect: models tend to write longer when unsure, so long traces cluster on the items they were likely to miss anyway. The real lever is matching "thinking amount ↔ true difficulty," not monotonically rewarding or punishing length.

Reasoning models train tool calls into the CoT (e.g. o3). What does that mean for Day 3 harness design?

It means the "when to call a tool" decision shifts partly from the harness into the model. Upside: the model can weigh tool use while thinking and parallelize calls; cost: the harness loses visibility into control flow, and the permission gate must still physically intercept at execution (the model "deciding" in CoT ≠ being allowed). Design-wise the harness retreats from "orchestrate every step" to "define boundaries + intercept dangerous actions + backstop the loop," handing tactical decisions to the model while it guards the strategic rails.
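"Intercept at execution" can be sketched as a gate that every model-requested tool call passes through, however the model arrived at the request. The tool names, the `DANGEROUS` set, and the `approve` callback are hypothetical stand-ins for your own policy.

```python
from typing import Any, Callable, Dict

DANGEROUS = {"delete_file", "send_payment"}  # hypothetical tool names


class ToolDenied(Exception):
    """Raised when the harness refuses a tool call the model asked for."""


def execute_tool(name: str, args: Dict[str, Any],
                 tools: Dict[str, Callable[..., Any]],
                 approve: Callable[[str, Dict[str, Any]], bool]) -> Any:
    """Run a tool call only after the harness-side gate allows it."""
    if name not in tools:
        raise ToolDenied(f"unknown tool: {name}")
    if name in DANGEROUS and not approve(name, args):
        raise ToolDenied(f"blocked by policy: {name}")
    return tools[name](**args)


# Example wiring: reads pass through, deletes need explicit approval.
tools = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}
deny_all = lambda name, args: False
print(execute_tool("read_file", {"path": "a.txt"}, tools, deny_all))
```

The gate sits between the model's decision and the side effect, so it holds even when the decision to call the tool was made inside a chain of thought the harness never sees.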

// FURTHER READING

Anthropic · Extended thinking — official docs for budget_tokens / adaptive thinking / interleaved thinking
OpenAI · Reasoning best practices — concise prompts, developer message, effort usage
Snell et al. · Scaling LLM Test-Time Compute Optimally — foundational paper on difficulty-based compute allocation
Wang et al. · On the Underthinking of o1-like LLMs — empirical reasoning failure modes
Simon Willison · Advice on prompting reasoning models — why to stop saying "think step by step"