Reasoning isn't a "smarter model" — it's a slower, pricier knob that also overthinks.
o1 / o3 / DeepSeek-R1 / Claude extended thinking turn "reasoning ability" into a tunable parameter. The engineering reality: most teams either turn thinking on for everything (cost and latency multiply, and simple tasks get less accurate) or never turn it on (forcing a standard model through tasks that genuinely need reasoning). This issue isn't about what a reasoning model is or how RL trains the CoT — that's ai-ml-daily Day 28. It's about four engineering decisions: which requests should route to reasoning, how to set the thinking budget (and why "longer = more accurate" is false), how to use reasoning inside an agentic loop without breaking the protocol, and how far you can trust a reasoning trace. The core counter-intuition: longer chains of thought are often less accurate — and you paid double the tokens for them.
Reasoning models trade test-time compute for accuracy: more tokens spent on hidden thinking. Snell et al. 2024, Scaling LLM Test-Time Compute Optimally (arXiv 2408.03314), gives the key engineering result — the optimal strategy depends on task difficulty: adding reasoning to easy questions yields almost nothing, while on hard ones test-time compute can beat simply scaling model parameters. So the routing test isn't "important vs not," it's three conditions that must all hold to escalate: (1) the answer is verifiable (math/code/logic with objective truth); (2) it needs multi-step planning whose step count can't be pre-compressed into a workflow; (3) the cost of error exceeds the extra money and latency. Conversely, retrieval QA, formatting, classification, rewriting, extraction — one-shot tasks — routed to reasoning just burn money and often lose accuracy to overthinking.
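The three-condition test can be written down directly. What follows is a minimal sketch, not a definitive rule: the `TaskProfile` record, its field names, and the idea of expressing error cost and reasoning overhead in one shared unit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical per-task record; every field name here is illustrative."""
    verifiable: bool           # condition 1: the answer has an objective check
    open_ended_planning: bool  # condition 2: step count can't be fixed in a workflow
    error_cost: float          # estimated cost of a wrong answer
    reasoning_cost: float      # extra spend plus latency penalty, same unit


def should_escalate(task: TaskProfile) -> bool:
    """Escalate to a reasoning model only when all three conditions hold."""
    return (
        task.verifiable
        and task.open_ended_planning
        and task.error_cost > task.reasoning_cost
    )
```

A task that fails any single condition stays on the standard model, which is the point of requiring all three at once.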
Write routing as a code-layer pre-classifier, not "send everything to o3":
def route(task):
    # cheap classifier (standard model/rules) picks a difficulty tier
    if task.kind in ("lookup", "format", "classify"):
        return ("claude-haiku-4-5", None)  # no thinking
    if task.kind in ("math", "plan", "debug") and task.verifiable:
        return ("o3", "medium")  # escalate + medium budget
    return ("claude-sonnet-4-6", None)
An underrated fact: within the same reasoning family, a smaller model on high effort often loses to a larger model on low effort — but is far cheaper. So the routing table must tune two dimensions, model size and reasoning tier, not just one.
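One way to keep both dimensions visible is a table where each task kind maps to a model and a reasoning tier together. This is only a sketch: the `Route` type, the `ROUTES` table, and `pick_route` are hypothetical names, and the entries are placeholders meant to be replaced by eval results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str             # dimension 1: which model to call
    effort: Optional[str]  # dimension 2: reasoning tier, None means no thinking


# Hypothetical table; every entry is a placeholder to be tuned from eval data.
ROUTES = {
    "lookup": Route("claude-haiku-4-5", None),
    "format": Route("claude-haiku-4-5", None),
    "classify": Route("claude-haiku-4-5", None),
    "math": Route("o3", "medium"),
    "plan": Route("o3", "medium"),
    "debug": Route("o3", "medium"),
}
DEFAULT = Route("claude-sonnet-4-6", None)


def pick_route(kind: str, verifiable: bool) -> Route:
    """Look up the (model, tier) pair; unverifiable tasks never escalate."""
    route = ROUTES.get(kind, DEFAULT)
    if route.effort is not None and not verifiable:
        return DEFAULT
    return route
```

Keeping the pair in one record makes it harder to change the model without also revisiting the tier.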
The two vendors expose different control surfaces. OpenAI uses reasoning_effort (low / medium / high); Anthropic extended thinking historically used budget_tokens (must be < max_tokens), but on Opus 4.6 / Sonnet 4.6 budget_tokens is deprecated and replaced by adaptive thinking: depth is set through effort, and the model decides how long to think. The key counter-intuition comes from two empirical findings. On the Underthinking of o1-like LLMs (arXiv 2501.18585) finds that models flip between ideas without digging into any of them. The opposite failure, overthinking, is just as real: longer chains of thought can lower average accuracy as the model loops and self-revises. In short, more budget isn't automatically better: each task type has its own sweet spot, found via eval, not guessed.
# OpenAI o-series: effort is the main knob (Responses API)
resp = client.responses.create(
    model="o3", reasoning={"effort": "medium"}, input=task)
# Anthropic extended thinking (Claude 3.7 up to, but not including, 4.6): explicit budget_tokens
resp = client.messages.create(
    model="claude-sonnet-4-5", max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 4000},  # must be < max_tokens
    messages=msgs)
# 4.6+: budget_tokens deprecated, use adaptive thinking; depth via effort
thinking={"type": "adaptive"}  # model adapts, paired with the effort knob (check your SDK version)
Engineering practice: set three budget tiers per task class (low/medium/high, or ~1k/4k/16k tokens), then sweep a set of ground-truth questions and plot "budget vs accuracy," taking the knee, not the peak. Most tasks top out at medium; high only adds cost, not score.
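Picking the knee from a sweep can be automated once the accuracies are measured. The sketch below assumes the eval has already produced (budget, accuracy) pairs; `pick_knee` and the `min_gain` threshold are hypothetical, and the threshold is a judgment call rather than a standard value.

```python
def pick_knee(results, min_gain=0.02):
    """Pick the cheapest budget after which extra tokens stop paying off.

    results: iterable of (budget_tokens, accuracy) pairs from an eval sweep.
    A larger budget is accepted only if it beats the currently accepted
    tier by at least min_gain, so flat or falling accuracy never wins.
    """
    ordered = sorted(results)  # ascending by budget
    best_budget, best_acc = ordered[0]
    for budget, acc in ordered[1:]:
        if acc - best_acc >= min_gain:
            best_budget, best_acc = budget, acc
    return best_budget
```

With made-up sweep numbers such as `[(1000, 0.71), (4000, 0.83), (16000, 0.84)]`, this returns 4000: the jump to 16k buys one point of accuracy, which is below the threshold. If accuracy falls as budget grows (the overthinking case), the lowest tier is kept.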
Common pitfalls: setting budget_tokens ≥ max_tokens errors out, because thinking tokens draw from the same output budget as the answer; and shipping old budget_tokens code on 4.6+, where it is deprecated (migrate to effort).
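The budget constraint is cheap to check before a request goes out. A minimal sketch, assuming a hypothetical helper named `answer_room` that is not part of any SDK:

```python
def answer_room(max_tokens: int, budget_tokens: int) -> int:
    """Return the tokens left for the visible answer.

    Thinking and answer share one output budget, so the thinking budget
    must be strictly smaller than max_tokens.
    """
    if budget_tokens >= max_tokens:
        raise ValueError(
            f"budget_tokens ({budget_tokens}) must be < max_tokens ({max_tokens})"
        )
    return max_tokens - budget_tokens
```

Calling this in the request builder turns a runtime API error into a local failure, and the return value shows how little room a large thinking budget can leave for the answer itself.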
Standard reasoning is "think, then answer." But an agent needs "think a bit → call tool → see result → think again → call again" — that's interleaved thinking (Anthropic enables it with the beta header interleaved-thinking-2025-05-14): the model can insert thinking around each tool call and adjust the next step from the tool_result. OpenAI's o3/o4-mini instead train tool calling natively into the CoT, so the model decides when to use tools while thinking. Here's the easily-tripped protocol trap: across turns, you must pass back the previous assistant turn's thinking blocks verbatim (including their signatures). Many harnesses strip thinking as "filler" to save tokens — the result is either an error, or a severed reasoning chain and a suddenly worse agent. In interleaved mode the thinking tokens also accumulate across blocks, so budget semantics differ from a single turn.
msgs = [{"role": "user", "content": task}]
while True:
    r = client.beta.messages.create(
        model="claude-sonnet-4-5", max_tokens=8000,
        betas=["interleaved-thinking-2025-05-14"],
        thinking={"type": "enabled", "budget_tokens": 4000},
        tools=schemas, messages=msgs)
    msgs.append({"role": "assistant", "content": r.content})  # ← keep thinking+tool_use verbatim
    if r.stop_reason != "tool_use":
        break
    msgs.append({"role": "user", "content": run_tools(r.content)})
    # critical: do NOT delete thinking blocks from r.content here
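If a harness trims or compacts history, a guard can catch accidental stripping before the next request. This sketch assumes content blocks have been converted to plain dicts with a "type" key; `thinking_blocks` and `assert_thinking_preserved` are hypothetical helper names.

```python
THINKING_TYPES = ("thinking", "redacted_thinking")


def thinking_blocks(content):
    """Return the thinking-type blocks from a list of content-block dicts."""
    return [block for block in content if block.get("type") in THINKING_TYPES]


def assert_thinking_preserved(original, stored):
    """Raise if thinking blocks (including signatures) changed before resend."""
    if thinking_blocks(original) != thinking_blocks(stored):
        raise ValueError("thinking blocks were altered or dropped before resend")
```

The comparison is on whole dicts, so a trimmed signature or an edited thinking string fails the check just as a missing block does.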
Another engineering point: reasoning is most valuable in an agent for planning and recovery — let it think hard about the overall plan up front, and about why a tool failed afterward. Purely mechanical tool calls in between don't need high reasoning every step; tune the tier on demand.
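Tuning the tier per step can be as simple as a small policy function. In this sketch the phase labels ("plan", "act") and the function `effort_for_step` are hypothetical, and the tier names follow the low/medium/high convention used above.

```python
def effort_for_step(phase: str, last_tool_failed: bool = False) -> str:
    """Pick a reasoning tier for one agent step."""
    if phase == "plan" or last_tool_failed:
        return "high"  # up-front planning, or diagnosing a failed tool call
    return "low"       # mechanical tool calls in between
```

The harness would call this before each model request and pass the result as the effort setting for that request only.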
A reasoning model already does internal thinking, so telling it externally to "think step by step" is redundant and sometimes harmful — OpenAI's official guide is explicit: these models prefer concise, direct prompts, and forced CoT scaffolding plus heavy few-shot can lower performance (Simon Willison's 2025 prompting notes summarize this well). So the "let's think step by step + 5 examples" templates carried over from the GPT-4 era should be torn out. The second trap is the trustworthiness of the reasoning trace: a chain of thought reads coherent, but it's not a faithful record of the model's decision — interpretability research repeatedly shows post-hoc rationalization (backward rationalization). So don't parse CoT text for hard downstream decisions, and don't believe a conclusion just because "the reasoning looks right." Trust the verifiable final output (run the tests, reconcile the numbers, check the citations), not the narrated process.
# BAD: drop a GPT-4 CoT template straight onto o3 / extended thinking
"Let's think step by step. Here are 6 examples... Now reason carefully:"
# → redundant scaffolding + heavy few-shot, often lowers reasoning-model perf
# GOOD: concise and direct, state the constraints, let it think
"Fix this concurrency bug. Constraints: don't change the public API, keep it backward compatible."
# When you need markdown (OpenAI o-series), put on the first line of the developer message:
"Formatting re-enabled"
An eval rule of thumb that runs against intuition: do not assume a longer trace is more trustworthy. When evaluating reasoning output, measure objective accuracy and don't be dazzled by a long, meticulous trace: "thought thoroughly yet answered wrong" is common on hard problems. Scoring an answer by how its trace reads is essentially using an imperfect verifier to pick among imperfect chains, which launders errors into more confident wrongness.
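In code, this means selection looks only at the final answer and an objective check, never at the trace. A minimal sketch, assuming a hypothetical `Candidate` record and a caller-supplied `verify` function (run the tests, reconcile the numbers, and so on):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    answer: str  # the final output to be checked
    trace: str   # the narrated reasoning; deliberately unused below


def pick_verified(
    candidates: List[Candidate], verify: Callable[[str], bool]
) -> Optional[Candidate]:
    """Return the first candidate whose answer passes the objective check."""
    for candidate in candidates:
        if verify(candidate.answer):
            return candidate
    return None  # nothing verified: escalate or retry, don't pick by trace
```

For a question like 17 * 24, a candidate with a very long trace and the answer "418" loses to a terse one answering "408", because only the answer is checked.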
String the four points into a shippable weekend upgrade: give an existing agent tiered reasoning instead of all-on or all-off.
Route lookup/format to a small standard model, and verifiable math/plan/debug to reasoning. Log the hit rate per class.
Once done, "should we use reasoning?" stops being a gut call. You have a routing table keyed by task type with tiers and budgets: cheaper, lower latency, and often more accurate.
budget_tokens / OpenAI reasoning_effort.