The model is just the CPU; the harness is the OS that actually determines capability.
Reality in 2026: most people fixate on the "model version" when they use Claude Code, Cursor, or Devin, and don't realize that the Sonnet 4.5 inside Claude Code behaves like a different organism from the one you call through a bare API. The difference isn't the model; it's the harness: the invisible operating system that dispatches tools, manages permissions, controls the loop, recycles state, and compacts history. Karpathy put it bluntly: "LLM is the new CPU, the context window is RAM, the harness is the OS." Writing prompts is no longer the entry-level skill; reading, modifying, and designing harnesses is. This issue covers four things: what a harness actually is, where its boundary with prompts + tools lies, how SOTA harnesses like Claude Code are designed, and how to build a minimum viable harness in 100 lines of Python, alongside a close reading of Anthropic's underrated engineering blog Building Effective Agents.
A single messages.create() is stateless inference: you hand it input, it hands you output, and the model neither knows nor cares what happened before or after. Wrap that one inference call into something that autonomously completes multi-step tasks, and everything that happens in between is the harness:
- Parse the tool_use block, route to a local function, and feed tool_result back into the next turn's messages.
- Decide when to stop: stop_reason == "end_turn", max_iters, token budget, or a human checkpoint.
90% of real-world experience comes from how these pieces are engineered, not from the model's raw reasoning. Cognition said it plainly in their 2025 Don't Build Multi-Agents post: their moat is harness design, not smarter prompts. Claude Code's plan mode, Cursor's composer, Aider's git auto-commit: same base model, different OS.
Separate prompt issues from harness issues with a 30-second triage — when your agent misbehaves, ask first:
# —— PROMPT's fault ——
- model picks the wrong tool
- output format unstable
- skips reasoning steps
- violates a stated constraint
# —— HARNESS's fault ——
- spins forever in a loop after 10 turns (missing loop control)
- agent crashes on first tool error (missing recovery)
- gets dumber as context grows (missing compaction)
- no chance for human approval on dangerous calls (missing permission gate)
- same prompt works in Claude Code but fails in your script (different harness)
Pin this triage table at the root of your agent project. Running through it before you start debugging tells you which layer to open first, and that alone can cut debugging time substantially.
Claude Code is the public SOTA coding harness today. Pulling it apart gives you a blueprint you can copy into your own agent. Six core abstractions:
None of these are "nice to have" — every single one maps to a concrete failure mode: no plan mode → the model acts before it thinks; no hooks → you re-run lint manually on every commit; no subagent → one big search pollutes the main context into uselessness. Harness design is making "the things you should do to avoid stepping on rakes" part of the system.
Reading Claude Code's own .claude/settings.json is the fastest harness onramp — directly observe how a SOTA harness exposes configuration surface:
{
  "permissions": {
    "allow": ["Read", "Grep", "Bash(git status:*)", "Bash(npm test:*)"],
    "ask": ["Edit", "Write", "Bash(git push:*)"],
    "deny": ["Bash(rm -rf:*)", "Bash(*--no-verify*)"]
  },
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{"type": "command", "command": "pnpm lint --fix $CLAUDE_FILE_PATHS"}]
    }],
    "Stop": [{"hooks": [{"type": "command", "command": "pnpm test --silent"}]}]
  },
  "env": { "BASH_DEFAULT_TIMEOUT_MS": "120000" }
}
This 15-line config does four things that would otherwise need code: tiered permissions (allow / ask / deny with precise glob patterns), lint after every Edit or Write, tests whenever the agent finishes responding (the Stop hook), and a default Bash timeout. That's the declarative interface a harness gives you: you don't have to fork Claude Code to make its runtime obey your engineering discipline.
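To make the tiered-permission idea concrete, here is a minimal sketch of an allow / ask / deny resolver. It is not Claude Code's actual matcher: the RULES table and the decide function are hypothetical, the patterns use plain fnmatch globs rather than Claude Code's rule syntax, and the choices that deny is checked first and that unmatched calls fall back to "ask" are assumptions of this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table mirroring the allow / ask / deny tiers in the config above.
# Patterns are plain fnmatch globs, not Claude Code's own rule syntax.
RULES = {
    "deny":  ["Bash(rm -rf*)", "Bash(*--no-verify*)"],
    "ask":   ["Edit", "Write", "Bash(git push*)"],
    "allow": ["Read", "Grep", "Bash(git status*)", "Bash(npm test*)"],
}

def decide(tool: str, arg: str = "") -> str:
    """Return 'deny', 'ask', or 'allow' for a tool call.

    Tiers are checked in the order deny, ask, allow, so the strictest match wins.
    Calls that match no rule fall back to 'ask' (an assumption of this sketch).
    """
    call = f"{tool}({arg})" if arg else tool
    for tier in ("deny", "ask", "allow"):
        for pattern in RULES[tier]:
            # A bare tool name such as "Edit" matches every call to that tool.
            if pattern == tool or fnmatchcase(call, pattern):
                return tier
    return "ask"
```

For example, decide("Bash", "git status") resolves to "allow", while decide("Bash", "git commit --no-verify -m x") resolves to "deny" no matter what the prompt says.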
rm -rf, force-push, --no-verify, etc.
The minimal skeleton of a harness has five components: tool registry · loop controller · message buffer · permission gate · recovery. Everything else is a variation. The Python below (well under 100 lines) runs end-to-end, handles multi-turn tool calls, blocks dangerous commands, and feeds tool errors back to the model. Compaction is left out to keep the loop readable; it would hook in at the top of each iteration, before the next messages.create() call. It's not pretty, but every line is something you can read and modify.
import subprocess
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-6"

# 1) TOOL REGISTRY: schema for the model, handler stays local
TOOLS = {
    "read_file": {
        "schema": {"name": "read_file", "description": "Read a file",
                   "input_schema": {"type": "object",
                                    "properties": {"path": {"type": "string"}},
                                    "required": ["path"]}},
        "handler": lambda p: Path(p["path"]).read_text()[:8000],
    },
    "run_bash": {
        "schema": {"name": "run_bash", "description": "Run a shell command (read-only)",
                   "input_schema": {"type": "object",
                                    "properties": {"cmd": {"type": "string"}},
                                    "required": ["cmd"]}},
        "handler": lambda p: subprocess.run(p["cmd"], shell=True, capture_output=True,
                                            text=True, timeout=30).stdout[:8000],
    },
}

# 2) PERMISSION GATE: physical interception, never trust model self-discipline
DENY = ["rm -rf", "--no-verify", "sudo", "curl ", "> /"]

def permit(tool, args):
    if tool == "run_bash":
        cmd = args.get("cmd", "")
        if any(d in cmd for d in DENY):
            return False, f"denied: {cmd}"
    return True, None

# 3) LOOP CONTROLLER + 4) MESSAGE BUFFER + 5) RECOVERY
def agent(task, max_iters=20):
    msgs = [{"role": "user", "content": task}]
    schemas = [t["schema"] for t in TOOLS.values()]
    for _ in range(max_iters):
        r = client.messages.create(model=MODEL, max_tokens=4096, tools=schemas, messages=msgs,
                                   system="You are a careful research agent. Use tools step by step.")
        msgs.append({"role": "assistant", "content": r.content})
        # Any stop reason other than "tool_use" (end_turn, max_tokens, ...) leaves no tool
        # call to answer, and sending back an empty tool_result list would be rejected.
        if r.stop_reason != "tool_use":
            return next((b.text for b in r.content if b.type == "text"), "")
        results = []
        for b in r.content:
            if b.type != "tool_use":
                continue
            ok, reason = permit(b.name, b.input)
            if not ok:
                out = reason
            else:
                try:
                    out = TOOLS[b.name]["handler"](b.input)
                except Exception as e:
                    out = f"ERROR: {type(e).__name__}: {e}"  # critical: errors go back to the model too
            results.append({"type": "tool_result", "tool_use_id": b.id, "content": str(out)})
        msgs.append({"role": "user", "content": results})
    return "hit max_iters"
Four key decisions: (1) schema and handler are decoupled — schema is the contract for the model, handler is your local code, and you never let the model directly eval. (2) Permissions are a physical gate, not a prompt constraint — the deny list runs at the harness layer, and no amount of model persuasion bypasses it. (3) Errors go back to the model rather than raising — this is the foundation of self-recovery; error messages are some of the most valuable signal the model gets. (4) max_iters defaults to 20 — you must have a ceiling, otherwise tool-selection jitter will spin forever.
Three common mistakes: (1) Not sending tool_result under "role":"user": Anthropic's protocol routes tool results through the user role, and getting this wrong returns a 400. (2) Letting the handler raise uncaught, which kills the loop; always convert exceptions to strings and feed them back. (3) Writing "don't run rm" only in the prompt: the model can, and occasionally will, do it anyway. The deny list must live in the harness.
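Compaction, the harness job of shrinking history once it crosses a token threshold, can be sketched as a pure function over the message buffer. This is one possible approach, not how Claude Code does it: estimate_tokens, compact, the 4-characters-per-token heuristic, the budget and keep_last defaults, and the summarize callable (which would typically be another LLM call) are all assumptions of this sketch.

```python
def estimate_tokens(msgs):
    """Very rough size estimate, assuming about 4 characters per token."""
    return sum(len(str(m["content"])) for m in msgs) // 4

def compact(msgs, summarize, budget=50_000, keep_last=6):
    """Fold older turns into the first user message once the buffer exceeds budget.

    msgs      : list of {"role": ..., "content": ...} dicts; msgs[0] is the task (a string)
    summarize : hypothetical callable that takes a list of messages and returns a string
    """
    if estimate_tokens(msgs) <= budget:
        return msgs
    cut = max(1, len(msgs) - keep_last)
    # The kept tail must open with an assistant turn, so that no tool_result is
    # separated from the tool_use that produced it.
    while cut < len(msgs) and msgs[cut]["role"] != "assistant":
        cut += 1
    if cut <= 1 or cut >= len(msgs):
        return msgs  # nothing that can safely be folded away
    summary = summarize(msgs[1:cut])
    head = {
        "role": "user",
        "content": f"{msgs[0]['content']}\n\n[Summary of earlier steps]\n{summary}",
    }
    return [head] + msgs[cut:]
```

In a loop like the one above, a call such as msgs = compact(msgs, summarize) would sit at the top of each iteration, before the next messages.create() call.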
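Anthropic's late-2024 Building Effective Agents (Erik Schluntz & Barry Zhang) is the most important agent-engineering writing of the past two years. The thing most worth citing isn't the code; it's the taxonomy. It forcibly splits all "agentic" applications into two camps:
- Workflows: LLMs and tools are orchestrated through predefined code paths.
- Agents: LLMs dynamically direct their own processes and tool usage, keeping control over how they accomplish the task.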
The payload of this distinction: most production systems should be workflows, not agents. Workflows are more predictable, cheaper, easier to eval, easier to debug. Agents are only the right move when the task is open-ended, the step count can't be planned in advance, and decisions depend on intermediate results (autonomous debugging, deep research, complex codework).
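The blog lists five workflow patterns worth memorizing as an architectural checklist:
- Prompt chaining: a fixed sequence of LLM calls, each processing the previous call's output.
- Routing: classify the input, then send it to a specialized follow-up path.
- Parallelization: run LLM calls side by side (sectioning or voting) and aggregate the outputs in code.
- Orchestrator-workers: a central LLM breaks the task down, delegates to worker LLMs, and synthesizes their results.
- Evaluator-optimizer: one LLM generates, another evaluates and gives feedback, in a loop.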
An actual agent sits one level above these five — you escalate to it only when the task is too open to express as a workflow. Quoted directly: "When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed."
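Two of the simpler patterns from the blog, prompt chaining and routing, fit in a few lines each. The sketch below is illustrative only: the LLM type, chain, and route are hypothetical names, and llm stands for any callable that takes a prompt and returns text (for example a thin wrapper around messages.create()).

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns the completion text

def chain(llm: LLM, text: str) -> str:
    """Prompt chaining: fixed steps, each consuming the previous step's output."""
    points = llm(f"List the key points of this text:\n{text}")
    if not points.strip():  # code-level gate between the two steps
        raise ValueError("extraction step returned nothing")
    return llm(f"Write a one-paragraph summary from these points:\n{points}")

def route(llm: LLM, query: str, handlers: Dict[str, Callable[[str], str]], default: str) -> str:
    """Routing: one classification call, then ordinary code picks the branch."""
    label = llm(
        f"Classify this request as one of {sorted(handlers)}. "
        f"Reply with the label only.\n{query}"
    ).strip().lower()
    # An unexpected label falls back to the default handler instead of failing.
    return handlers.get(label, handlers[default])(query)
```

In both cases the control flow lives in code, so each step can be logged, tested, and swapped independently, which is what makes workflows easier to eval and debug than an open-ended agent loop.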
A typical mis-built-as-agent ask: "read a user-uploaded PDF, extract key points, generate a one-page summary." Sounds agentic? It's actually a prompt chain:
# BAD: build an agent and give it read_pdf / summarize / save as tools
agent("Read report.pdf and give me a one-page summary")
# → model may skip extraction and fabricate, occasionally forgets to save, hard to debug

# GOOD: prompt-chain workflow with explicit code-level control flow
# (extract_pdf, chunk_by_heading, summarize_chunk, synthesize, and summarize_short
#  are placeholders for your own helpers)
def summarize_pdf(path):
    text = extract_pdf(path)  # pure code, no LLM
    if len(text) > 50_000:
        chunks = chunk_by_heading(text)
        # one LLM call per chunk; the calls are independent, so they can be parallelized
        partials = [summarize_chunk(c) for c in chunks]
        return synthesize(partials)  # second LLM step
    return summarize_short(text)  # short documents need a single LLM call
When do you actually escalate to an agent? Three signals firing at once: (1) the number of steps can't be known in advance; (2) each step's next move depends on the previous step's result; (3) the final output is verifiable (otherwise the agent goes off the rails and you can't tell). Coding, research, complex data analysis qualify. Writing emails, translation, report generation generally don't.
Wire the four ideas into a weekend project: build a personal research harness that does 30 minutes of topic research for you. The goal isn't to clone Perplexity; it's to feel all five harness components yourself.
- web_search (Tavily or Brave API), fetch_url (requests + readability), save_note (write to local .md). Small and orthogonal.
- fetch_url only allows https + a domain allowlist; save_note is jailed to ./notes/; any shell tool: denied.
- max_iters=25; record the tool sequence each round; if the same tool is called with the same args three times in a row, force-break (deadlock guard).
- … {url, key_points[], quote} and drop the originals.
After this, every "agent product" you look at gets reflexively peeled: you find its tool registry, permission gate, and loop control, instead of getting hypnotized by "agentic" marketing.
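The permission and loop-control rules of the research harness (https plus a domain allowlist for fetch_url, save_note jailed to ./notes/, and a force-break after three identical calls in a row) can be sketched with the standard library alone. Every name here (ALLOWED_DOMAINS, NOTES_DIR, url_allowed, note_path, DeadlockGuard) is hypothetical, and the two allowlisted domains are placeholders.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"arxiv.org", "en.wikipedia.org"}  # placeholder allowlist
NOTES_DIR = Path("notes")

def url_allowed(url: str) -> bool:
    """https only, and the host must be an allowlisted domain or one of its subdomains."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and any(
        host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS
    )

def note_path(name: str) -> Path:
    """Resolve a note name inside NOTES_DIR and reject anything that escapes it."""
    root = NOTES_DIR.resolve()
    target = (root / name).resolve()
    if root not in target.parents:
        raise PermissionError(f"outside notes dir: {name}")
    return target

class DeadlockGuard:
    """Trips when the same tool is called with the same args `limit` times in a row."""

    def __init__(self, limit: int = 3):
        self.limit, self.last, self.count = limit, None, 0

    def tripped(self, tool: str, args: dict) -> bool:
        key = (tool, json.dumps(args, sort_keys=True))
        self.count = self.count + 1 if key == self.last else 1
        self.last = key
        return self.count >= self.limit
```

In the loop, the harness would call url_allowed and note_path before dispatching fetch_url or save_note, and break out as soon as guard.tripped(b.name, b.input) returns True.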