DAY 03 / PHASE 1 · ENGINEERING

Harness Engineering

Agent Runtime · Tool Registry · Permission Gate · Loop Control

2026-05-22 · BigCat

The model is just the CPU; the harness is the OS that actually determines capability.

// WHY THIS MATTERS

Reality in 2026: most people who fixate on the "model version" when they use Claude Code, Cursor, or Devin don't realize that the same Sonnet 4.5 inside Claude Code is a different organism from the one you call through a bare API. The difference isn't the model; it's the harness: the invisible operating system that dispatches tools, manages permissions, controls the loop, recycles state, and compacts history. Karpathy put it bluntly: "LLM is the new CPU, the context window is RAM, the harness is the OS." Being able to write prompts is not the entry bar; reading, modifying, and designing harnesses is. This issue covers four things: what a harness actually is and where its boundary with prompts and tools lies, how SOTA harnesses like Claude Code are designed, how to build a minimum viable harness in 100 lines of Python, and a close reading of Anthropic's underrated engineering blog post Building Effective Agents.

// 01

What a Harness Is: The Agent's Runtime and Boundary

Claim: a harness stitches a stateless LLM API into a runtime that "gets things done". It's not a framework — it's an OS.

Background & Principles

A single messages.create() is stateless inference: you hand it input, it hands you output, and the model neither knows nor cares what happened before or after. Everything needed to turn that one call into something that autonomously completes multi-step tasks is the harness: tool dispatch, permission checks, loop control, state management, and history compaction.

90% of real-world experience comes from how these pieces are engineered, not from the model's raw reasoning. Cognition said it plainly in their 2025 Don't Build Multi-Agents post: their moat is harness design, not smarter prompts. Claude Code's plan mode, Cursor's composer, Aider's git auto-commit — same base model, different OS.

┌──────────────── Harness = the LLM's runtime OS ────────────┐
│                                                            │
│  USER ──▶ ┌─────────────────────────────────────────┐      │
│           │ HARNESS                                 │      │
│           │ ┌───────────────────────────────────┐   │      │
│           │ │ Loop Controller (while not done)  │   │      │
│           │ └───┬───────────────────────────┬───┘   │      │
│           │     │                           │       │      │
│           │     ▼                           ▼       │      │
│           │ ┌──────────┐         ┌──────────────┐   │      │
│           │ │ LLM API  │ ◀────── │ Tool Registry│   │      │
│           │ │ (no mem) │         │ + Permission │   │      │
│           │ └────┬─────┘         └──────┬───────┘   │      │
│           │      │ tool_use             │ result    │      │
│           │      ▼                      ▼           │      │
│           │ ┌─────────────────────────────────┐     │      │
│           │ │ Memory · Compaction · Logs      │     │      │
│           │ └─────────────────────────────────┘     │      │
│           └─────────────────────────────────────────┘      │
│                              ▲                             │
│                              │ stop / interrupt / approve  │
│                            HUMAN                           │
└────────────────────────────────────────────────────────────┘

Hands-on Example

Separate prompt issues from harness issues with a 30-second triage — when your agent misbehaves, ask first:

# —— PROMPT's fault ——
- model picks the wrong tool
- output format unstable
- skips reasoning steps
- violates a stated constraint

# —— HARNESS's fault ——
- spins forever in a loop after 10 turns (missing loop control)
- agent crashes on first tool error (missing recovery)
- gets dumber as context grows (missing compaction)
- no chance for human approval on dangerous calls (missing permission gate)
- same prompt works in Claude Code but fails in your script (different harness)

Pin this triage checklist at the root of your agent project; running through it before you start debugging keeps you from tuning prompts to fix a harness problem, or the reverse.

Failure modes: blaming everything on "the model isn't strong enough". A common counterexample: you upgrade to a newer model and the agent gets worse — usually because the new model's tool-use output shape changed and the old harness's parser silently fails on the new format. Not a model bug — a harness contract that didn't keep up.
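One way to keep that contract from failing silently is to make the parser reject anything it does not recognize. The sketch below is a minimal illustration under stated assumptions: content blocks are represented as plain dicts, and `parse_tool_calls`, `ContractError`, and `KNOWN_BLOCK_TYPES` are hypothetical names invented for this example, not part of any SDK.

```python
# Hypothetical contract check: fail loudly when the model's output shape drifts.
KNOWN_BLOCK_TYPES = {"text", "tool_use"}  # assumption: extend as your harness learns new types


class ContractError(ValueError):
    """Raised when a response block does not match what the harness expects."""


def parse_tool_calls(blocks):
    """Return (id, name, input) for every tool_use block, rejecting unknown shapes."""
    calls = []
    for block in blocks:
        kind = block.get("type")
        if kind not in KNOWN_BLOCK_TYPES:
            # An unknown block type is surfaced instead of being skipped.
            raise ContractError(f"unrecognized content block type: {kind!r}")
        if kind == "tool_use":
            missing = {"id", "name", "input"} - block.keys()
            if missing:
                raise ContractError(f"tool_use block is missing fields: {sorted(missing)}")
            calls.append((block["id"], block["name"], block["input"]))
    return calls
```

With a check like this, a format change after a model upgrade shows up as an exception on the first run rather than as an agent that quietly stops calling tools.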
Going deeper · Karpathy on LLM-as-OS, youtube · Intro to LLMs · Cognition Don't Build Multi-Agents, cognition.ai/blog/dont-build-multi-agents · Anthropic Building Effective Agents, anthropic.com/engineering/building-effective-agents

// 02

Dissecting the Claude Code Harness: What's Actually Doing the Work

Claim: 90% of Claude Code's "agentic feel" comes from harness design, not from the model itself.

Background & Principles

Claude Code is the public SOTA coding harness today. Pulling it apart gives you a blueprint you can copy into your own agent. Six core abstractions:

None of these are "nice to have" — every single one maps to a concrete failure mode: no plan mode → the model acts before it thinks; no hooks → you re-run lint manually on every commit; no subagent → one big search pollutes the main context into uselessness. Harness design is making "the things you should do to avoid stepping on rakes" part of the system.

Hands-on Example

Reading Claude Code's own .claude/settings.json is the fastest harness onramp — directly observe how a SOTA harness exposes configuration surface:

{
  "permissions": {
    "allow": ["Read", "Grep", "Bash(git status:*)", "Bash(npm test:*)"],
    "ask":   ["Edit", "Write", "Bash(git push:*)"],
    "deny":  ["Bash(rm -rf:*)", "Bash(*--no-verify*)"]
  },
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{"type":"command", "command":"pnpm lint --fix $CLAUDE_FILE_PATHS"}]
    }],
    "Stop": [{"hooks":[{"type":"command","command":"pnpm test --silent"}]}]
  },
  "env": { "BASH_DEFAULT_TIMEOUT_MS": "120000" }
}

This 15-line config does four things that would otherwise need code: tiered permissions (allow / ask / deny with glob-style patterns), a lint pass after every Edit or Write, a test run whenever the agent finishes responding (the Stop hook), and a default timeout for Bash commands. That's the declarative interface a harness gives you: you don't fork Claude Code to make its runtime obey your engineering discipline.
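To make the tiering concrete, here is a minimal sketch of how a harness could evaluate allow / ask / deny rules. It is not Claude Code's actual matcher: the `decide` function, the reading of `prefix:*` as "command starts with prefix", and the use of Python's `fnmatch` for other patterns are all assumptions made for this illustration. Deny is checked first, then ask, then allow, and anything unmatched falls back to ask.

```python
from fnmatch import fnmatchcase

# Rules copied from the settings example; the matching logic is a guess.
RULES = {
    "allow": ["Read", "Grep", "Bash(git status:*)", "Bash(npm test:*)"],
    "ask":   ["Edit", "Write", "Bash(git push:*)"],
    "deny":  ["Bash(rm -rf:*)", "Bash(*--no-verify*)"],
}


def _matches(rule, tool, arg):
    """True if a rule such as 'Bash(git push:*)' covers this tool call."""
    if "(" not in rule:
        return rule == tool                      # bare tool name covers every call
    rule_tool, pattern = rule[:-1].split("(", 1)  # drop ")" and split off the pattern
    if rule_tool != tool:
        return False
    if pattern.endswith(":*"):                   # assumed meaning: command prefix
        prefix = pattern[:-2]
        return arg == prefix or arg.startswith(prefix + " ")
    return fnmatchcase(arg, pattern)             # otherwise treat it as a glob


def decide(tool, arg=""):
    """Return 'deny', 'ask', or 'allow'; the strictest matching tier wins."""
    for tier in ("deny", "ask", "allow"):
        if any(_matches(rule, tool, arg) for rule in RULES[tier]):
            return tier
    return "ask"                                 # unknown calls default to a human check
```

For example, `decide("Bash", "git commit --no-verify -m wip")` lands in the deny tier, while an unlisted command such as `decide("Bash", "ls")` falls through to ask.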

Failure modes: (1) overusing hooks for long-running commands — stuffing full e2e tests into PostToolUse so every tiny file edit fires them; turn latency balloons to minutes. Hooks should be sub-second sanity checks; heavy lifting goes in Stop / an explicit slash command. (2) Only writing allow rules and skipping deny — that's an unlocked back door. Always explicitly deny rm -rf, force-push, --no-verify, etc.
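The latency point is easier to see in code. The following sketch shows one way a harness could run post-tool hooks under a time budget; it is not how Claude Code implements hooks. `run_post_tool_hooks`, the `HOOKS` table, and the `budget_s` parameter are hypothetical, and the hook command is a stand-in Python one-liner instead of a real linter.

```python
import re
import subprocess
import sys
import time

# Hypothetical hook table: a regex over tool names plus the argv to run.
HOOKS = [
    {"matcher": r"Edit|Write", "argv": [sys.executable, "-c", "print('lint ok')"]},
]


def run_post_tool_hooks(tool_name, hooks=HOOKS, budget_s=1.0):
    """Run every hook whose matcher covers tool_name, killing any that exceed budget_s."""
    reports = []
    for hook in hooks:
        if not re.fullmatch(hook["matcher"], tool_name):
            continue
        start = time.monotonic()
        try:
            proc = subprocess.run(hook["argv"], capture_output=True, text=True, timeout=budget_s)
            status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
            output = proc.stdout.strip()
        except subprocess.TimeoutExpired:
            # A slow hook is reported, not allowed to stall the agent's turn.
            status, output = "timeout", ""
        reports.append({"status": status, "output": output,
                        "seconds": time.monotonic() - start})
    return reports
```

A budget enforced by the harness turns "hooks should be fast" from a convention into a property of the system: a hook that outgrows its budget shows up as a timeout in the report.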
Going deeper · Claude Code Hooks docs, docs.claude.com/.../hooks · Claude Code Settings, docs.claude.com/.../settings · Claude Code Subagents, docs.claude.com/.../sub-agents

// 03

Build Your Own Minimal Harness: 100 Lines of Python That Run an Agent

Claim: treating the harness as a black box ties your design choices to whatever Claude Code / Cursor decided. Writing a minimal harness once is the AI engineer's hello world.

Background & Principles

The minimal skeleton of a harness has five components: tool registry · loop controller · message buffer · permission gate · recovery. Everything else is a variation. The Python below comes in at roughly 50 lines, well inside the 100-line budget. It runs end-to-end, handles multi-turn tool calls, blocks dangerous commands, and feeds tool errors back to the model. It does not compact history yet; that is the first thing to add once conversations grow long. It's not pretty, but every line is something you can read and modify.

Hands-on Example

import subprocess
from pathlib import Path

import anthropic  # pip install anthropic; the client reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()
MODEL  = "claude-sonnet-4-6"

# 1) TOOL REGISTRY: schema for the model, handler stays local
TOOLS = {
  "read_file": {
    "schema": {"name":"read_file","description":"Read a file",
      "input_schema":{"type":"object","properties":{"path":{"type":"string"}},"required":["path"]}},
    "handler": lambda p: Path(p["path"]).read_text()[:8000]
  },
  "run_bash": {
    "schema": {"name":"run_bash","description":"Run a shell command (read-only)",
      "input_schema":{"type":"object","properties":{"cmd":{"type":"string"}},"required":["cmd"]}},
    "handler": lambda p: subprocess.run(p["cmd"],shell=True,capture_output=True,text=True,timeout=30).stdout[:8000]
  }
}

# 2) PERMISSION GATE: physical interception, never trust model self-discipline
DENY = ["rm -rf", "--no-verify", "sudo", "curl ", "> /"]
def permit(tool, args):
    if tool == "run_bash":
        cmd = args.get("cmd","")
        if any(d in cmd for d in DENY): return False, f"denied: {cmd}"
    return True, None

# 3) LOOP CONTROLLER + 4) MESSAGE BUFFER + 5) RECOVERY
def agent(task, max_iters=20):
    msgs = [{"role":"user", "content": task}]
    schemas = [t["schema"] for t in TOOLS.values()]
    for i in range(max_iters):
        r = client.messages.create(model=MODEL, max_tokens=4096, tools=schemas, messages=msgs,
            system="You are a careful research agent. Use tools step by step.")
        msgs.append({"role":"assistant", "content": r.content})
        if r.stop_reason != "tool_use":   # end_turn, max_tokens, ...: no tool calls left to run
            return next((b.text for b in r.content if b.type=="text"), "")
        results = []
        for b in r.content:
            if b.type != "tool_use": continue
            ok, reason = permit(b.name, b.input)
            if not ok:
                out = reason
            else:
                try:
                    out = TOOLS[b.name]["handler"](b.input)
                except Exception as e:
                    out = f"ERROR: {type(e).__name__}: {e}"   # critical: errors go back to the model too
            results.append({"type":"tool_result","tool_use_id":b.id,"content":str(out)})
        msgs.append({"role":"user", "content": results})
    return "hit max_iters"

Four key decisions: (1) schema and handler are decoupled: the schema is the contract for the model, the handler is your local code, and you never let the model directly eval. (2) Permissions are a physical gate, not a prompt constraint: the deny list runs at the harness layer, so no amount of model persuasion talks its way past the check. The substring match here is only a first line of defense, though; a rephrased command such as rm -r -f slips through, so a production gate should parse the command or switch to an allowlist. (3) Errors go back to the model rather than raising: this is the foundation of self-recovery, because error messages are some of the most valuable signal the model gets. (4) max_iters defaults to 20: you must have a ceiling, otherwise tool-selection jitter will spin forever.
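Compaction is not visible in the loop above. One simple way to add it, sketched here under stated assumptions, is to truncate old tool results once the buffer passes a budget while leaving every tool_use_id in place so the message sequence stays valid. The `compact` and `estimate_tokens` helpers, the four-characters-per-token estimate, and the default thresholds are all hypothetical choices, not measured values.

```python
def estimate_tokens(msgs):
    """Very rough size estimate; assumes about 4 characters per token."""
    return sum(len(str(m["content"])) for m in msgs) // 4


def compact(msgs, budget_tokens=60_000, keep_last=4, stub_chars=200):
    """Return a copy of msgs with old, oversized tool results truncated.

    The newest `keep_last` messages are left untouched, and the input list
    is never mutated. Only the result text shrinks; block structure and
    tool_use_id values are preserved.
    """
    if estimate_tokens(msgs) <= budget_tokens:
        return msgs
    cutoff = max(len(msgs) - keep_last, 0)
    compacted = []
    for index, msg in enumerate(msgs):
        if index < cutoff and msg["role"] == "user" and isinstance(msg["content"], list):
            blocks = []
            for block in msg["content"]:
                is_result = isinstance(block, dict) and block.get("type") == "tool_result"
                if is_result and len(str(block.get("content", ""))) > stub_chars:
                    short = str(block["content"])[:stub_chars] + " ...[truncated by harness]"
                    block = {**block, "content": short}
                blocks.append(block)
            msg = {**msg, "content": blocks}
        compacted.append(msg)
    return compacted
```

In the loop above this would be called as `msgs = compact(msgs)` just before each `client.messages.create(...)`. Summarizing old turns with a second model call is the heavier alternative; truncation is shown here because it needs no extra inference.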

Failure modes: (1) forgetting to submit tool_result under "role":"user" — Anthropic's protocol routes tool results through user role; getting this wrong returns 400. (2) The handler raises uncaught and kills the loop; always convert exceptions to strings and feed them back. (3) Writing "don't run rm" only in the prompt — the model can, and occasionally will, do it anyway. The deny list must live in the harness.
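The message shape behind failure mode (1) can be spelled out as data. The helper below is a sketch: `tool_result_turn` is a hypothetical name and the IDs in the usage note are made up. The `tool_result` block type, the `tool_use_id` field, and the optional `is_error` flag belong to Anthropic's tool-use protocol.

```python
def tool_result_turn(results):
    """Bundle (tool_use_id, output, is_error) triples into a single user message."""
    blocks = []
    for tool_use_id, output, is_error in results:
        block = {
            "type": "tool_result",
            "tool_use_id": tool_use_id,   # must echo the id from the model's tool_use block
            "content": str(output),
        }
        if is_error:
            block["is_error"] = True       # marks the result as a failure for the model
        blocks.append(block)
    # Tool results travel under the user role, all in one message.
    return {"role": "user", "content": blocks}
```

For instance, `tool_result_turn([("toolu_01", "42 lines", False), ("toolu_02", "FileNotFoundError: x.txt", True)])` produces one user turn holding both results, with the second flagged as an error.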
Going deeper · Anthropic Tool Use protocol, docs.claude.com/.../tool-use · Claude Agent SDK (Python/TS), docs.claude.com/.../agent-sdk · Simon Willison I built a small agent in 70 lines, simonwillison.net

// 04

Close Reading: Building Effective Agents — When You Should Not Write an Agent at All

Claim: the core message of Anthropic's blog isn't "how to build agents". It's "most of your needs don't call for an agent; a workflow is enough".

Background & Principles

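Anthropic's late-2024 Building Effective Agents (Erik Schluntz & Barry Zhang) is the most important agent-engineering writing of the past two years. The thing most worth citing isn't the code; it's the taxonomy. It splits all "agentic" applications into two camps: workflows, where LLMs and tools are orchestrated through predefined code paths, and agents, where the LLM dynamically directs its own process and tool usage.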

The payload of this distinction: most production systems should be workflows, not agents. Workflows are more predictable, cheaper, easier to eval, easier to debug. Agents are only the right move when the task is open-ended, the step count can't be planned in advance, and decisions depend on intermediate results (autonomous debugging, deep research, complex codework).

The blog lists five workflow patterns worth memorizing as an architectural checklist:

  1. Prompt chaining: cut the task into steps, each step a prompt, output of step N feeding step N+1. Optionally insert gates (fail-fast and roll back).
  2. Routing: a classifier prompt routes requests to different downstream prompts / models.
  3. Parallelization: split the task into parallel sub-tasks (sectioning), or run multiple voters and aggregate (voting).
  4. Orchestrator-workers: an orchestrator dynamically splits sub-tasks, dispatches to workers, collects results — exactly Claude Code's subagent pattern.
  5. Evaluator-optimizer: one LLM generates, another evaluates and gives feedback, iterate until quality bar. Great for writing, translation, complex code.

An actual agent sits one level above these five — you escalate to it only when the task is too open to express as a workflow. Quoted directly: "When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed."
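The first two patterns are small enough to sketch directly. This is an illustration, not code from the blog: `llm` stands for any callable that takes a prompt string and returns a string, and the prompts, the three-bullet gate, and the "general" fallback handler are all invented for the example.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # assumption: any prompt-in, text-out function


def prompt_chain(llm: LLM, topic: str) -> str:
    """Prompt chaining: step 1 output feeds step 2, with a code-level gate between."""
    outline = llm(f"Write a 3-bullet outline for: {topic}")
    bullets = sum(line.lstrip().startswith("- ") for line in outline.splitlines())
    if bullets < 3:
        # Gate: fail fast in plain code before spending another model call.
        raise ValueError(f"gate failed: expected 3 bullets, got {bullets}")
    return llm(f"Expand this outline into a paragraph:\n{outline}")


def route(llm: LLM, request: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """Routing: one classifier call picks the downstream handler."""
    label = llm(f"Classify as one of {sorted(handlers)}: {request}").strip().lower()
    handler = handlers.get(label, handlers["general"])  # unknown label falls back
    return handler(request)
```

In both functions the control flow lives in ordinary code and the model only fills in text, which is what makes a workflow predictable and easy to test with a stubbed `llm`.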

Hands-on Example

A typical mis-built-as-agent ask: "read a user-uploaded PDF, extract key points, generate a one-page summary." Sounds agentic? It's actually a prompt chain:

# BAD: build an agent and give it read_pdf / summarize / save as tools
agent("Read report.pdf and give me a one-page summary")
#  → model may skip extraction and fabricate, occasionally forgets to save, hard to debug

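# GOOD: prompt-chain workflow with explicit code-level control flow
# (extract_pdf, chunk_by_heading, summarize_chunk, synthesize, summarize_short
#  are your own helpers, not shown here)
def summarize_pdf(path):
    text = extract_pdf(path)                             # pure code, no LLM
    if len(text) > 50_000:
        chunks = chunk_by_heading(text)
        partials = [summarize_chunk(c) for c in chunks]  # one LLM call per chunk (sequential here; independent, so parallelizable)
        return synthesize(partials)                      # second LLM step
    return summarize_short(text)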

When do you actually escalate to an agent? Three signals firing at once: (1) the number of steps can't be known in advance; (2) each step's next move depends on the previous step's result; (3) the final output is verifiable (otherwise the agent goes off the rails and you can't tell). Coding, research, complex data analysis qualify. Writing emails, translation, report generation generally don't.

Failure modes: getting swept up in "agentic" marketing language and wrapping every LLM app as multi-agent. Cognition's Don't Build Multi-Agents retrospective on early Devin's multi-agent path is a great cautionary tale: context drifts across sub-agents, decisions become inconsistent, debugging is a nightmare. They eventually cut back to a single long-context agent + workflow.
Going deeper · Anthropic Building Effective Agents, anthropic.com/engineering/building-effective-agents · companion code anthropic-cookbook, github.com/anthropics/anthropic-cookbook · Cognition Don't Build Multi-Agents, cognition.ai/blog/dont-build-multi-agents

// Putting it together · Build yourself a "research harness"

Wire the four ideas into a weekend project: build a personal research harness that does 30 minutes of topic research for you. The goal isn't to clone Perplexity; it's to feel all five harness components yourself.

  1. Tool registry: three tools — web_search (Tavily or Brave API), fetch_url (requests + readability), save_note (write to local .md). Small and orthogonal.
  2. Permission gate: fetch_url only allows https + a domain allowlist; save_note is jailed to ./notes/; any shell tool — denied.
  3. Loop control: max_iters=25; record the tool sequence each round; if the same tool is called with the same args three times in a row, force-break (deadlock guard).
  4. Memory: trigger compaction past 60K tokens — compress fetched pages into structured JSON {url, key_points[], quote} and drop the originals.
  5. Agent or workflow? Write the workflow version first: plan → search → fetch → write. If the plan step frequently fails, upgrade to an agent (let the model decide when to stop). This step is where §4's claim becomes personal experience.
  6. Eval: write 10 questions with known ground-truth answers; run both versions; compare token cost and accuracy. You will likely find the workflow version markedly cheaper (the author's estimate: 3–5×) with equal or better accuracy, which is the point of Anthropic's advice to start with the simplest solution.
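Steps 2 and 3 of the list above can be sketched as three small checks. Everything named here is a placeholder for illustration: the domains in `ALLOWED_DOMAINS`, the `permit_fetch`, `permit_save`, and `is_stuck` helpers, and the assumption that tool arguments are JSON-serializable. `Path.is_relative_to` requires Python 3.9 or newer.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"arxiv.org", "docs.python.org"}  # hypothetical allowlist
NOTES_DIR = Path("notes")


def permit_fetch(url):
    """Allow only https URLs whose host is an allowlisted domain or a subdomain of one."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_list = any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
    return parts.scheme == "https" and on_list


def permit_save(filename, notes_dir=NOTES_DIR):
    """Allow writes only to paths that resolve inside notes_dir."""
    root = notes_dir.resolve()
    target = (root / filename).resolve()   # resolves ".." and absolute paths
    return target != root and target.is_relative_to(root)


def is_stuck(history, repeats=3):
    """True if the last `repeats` calls used the same tool with the same arguments."""
    if len(history) < repeats:
        return False
    recent = {json.dumps(call, sort_keys=True) for call in history[-repeats:]}
    return len(recent) == 1
```

In the loop, the harness would append `(tool_name, tool_args)` to `history` after each call and break out when `is_stuck(history)` returns True, returning a message that says why it stopped.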

After this, every "agent product" you look at gets reflexively peeled — you find its tool registry, permission gate, loop control — instead of getting hypnotized by "agentic" marketing.

// Deep Thinking

Claude Code's plan mode surfaces the plan for user approval before execution. Is that a UX choice or an engineering necessity? What would auto-execute look like?
It's an engineering necessity. (1) "Think before acting" lifts accuracy 30%+ (ReAct paper); (2) human approval is the cheapest verification — cheaper than LLM-judge; (3) the plan log makes failures debuggable. Without plan mode, SWE-bench success rate drops 10–20pp. Without a forced plan step, agents skip planning and rush to act.
The harness-as-OS metaphor says it manages resources, scheduling, IPC. Is the LLM a CPU or a process?
The LLM is the CPU (a stateless compute unit); each inference is a syscall. The harness maintains process state (context, memory, tool registry), schedules the syscall sequence, and handles syscall returns (tool results). This mental model also explains why multi-agent = multi-process (IPC cost is real) and why prefix caching ≈ CPU cache.
A 100-line hello-world harness is alluring, but production harnesses balloon to thousands of lines. Where does the bloat actually go?
The bloat lives in: (1) error recovery (rate-limit retries, tool failure fallbacks, partial-result handling); (2) observability (per-tool tracing/logging/cost tracking); (3) permission gates (auditing and confirmation flows for sensitive actions); (4) context management (truncation, compaction, tool-result size caps); (5) prompt-injection defenses. Hello world: 100 lines. Production: 1000+ lines.
Anthropic's blog argues that most needs should be workflows, not agents. How do you decide which?
Three checks: (1) Is the procedure known in advance? Yes → workflow. (2) Does each step need to inspect the previous step's result to decide what's next? Yes → agent. (3) Cost of failure? High → workflow (more controllable). Examples: reminder email = workflow (procedure fixed); debugging a user error = agent (depends on dynamic findings). The cost of confusing them: running a workflow-suitable task as an agent is 5–10× slower and less reliable.
The permission gate is critical. What happens if you let the agent decide which actions need confirmation?
The agent will trend toward "confirm nothing" (lazy), especially in iterative tasks. The right approach: hardcode permission policy in the harness, not in the agent. Claude Code defaults destructive actions (write/exec/network) to require confirmation. Reason: an LLM cannot estimate its own error probability, but the harness can statically classify risk by action class.

// Further Reading