DAY 19 / PHASE 2 · APPLICATIONS & SYSTEMS

Coding Agents

Edit Format · Plan Mode · Autonomy Spectrum · Closing the Loop

2026-06-02 · BigCat

The same model inside Aider / Cursor / Claude Code is three different creatures — the gap is the harness, not the weights.

// WHY THIS MATTERS

You use Cursor's tab, Claude Code's agentic loop, and the occasional Aider batch refactor every day — yet few people can articulate how these three actually differ architecturally. They all run Claude / GPT underneath, but the gap is so wide they feel like different products. The reason: the real difficulty of a coding agent isn't "make the model write correct code," but two underrated engineering problems — reliably landing the model's intent onto the filesystem (edit format), and controlling its autonomy (plan / pair / full-auto). This issue isn't a tool tutorial; it dissects the design fault-lines of the three major coding harnesses: the token-vs-success trade-off of edit formats, why plan mode is an engineering necessity rather than UX decoration, how to pick a level on the autonomy spectrum, and why an agent that scores high on SWE-bench can be harder to use in your real repo. By the end you'll see any coding product with "X-ray" eyes.

// 01

The Architectural Fault-Lines of Three Coding Harnesses

Claim: Aider / Cursor / Claude Code differ not in the model but along two axes — "who holds the control flow" and "how edits land."

Background & Principles

Put all three on one coordinate system and the split is instant. The horizontal axis is autonomy (who decides the next step: you or the model); the vertical axis is how edits land (how the model changes files):

Aider: git-native CLI. You explicitly /add files into context, the model returns changes as search/replace diff blocks, and each change is automatically committed to git. Control flow stays mostly with you: it's a highly controllable "semi-auto pairing," not a free-roaming agent. Wins: git history as audit trail, instant /undo, extremely token-thrifty.
Cursor: IDE-native. Its core is two-stage — the foreground uses a frontier model for planning (chat / composer), the background uses a specially trained fast-apply model (~70B, ~1000 tok/s) to land intent into a file diff. Tab completion is a separate lightweight autonomy track. Wins: zero context-switch inside the editor, blazing apply.
Claude Code: terminal-native agent. A full agentic loop — it Greps/Reads to explore on its own, decides which files to change, uses the Edit tool for string replace, runs Bash to verify. Control flow lives with the model; you constrain it via permission gate + plan mode. Wins: highest autonomy for open-ended tasks.

Key insight: these three aren't "who's stronger," they're three autonomy levels. Aider fits batch execution where you already know the exact change; Cursor fits high-frequency small iterations inside the editor; Claude Code fits open-ended tasks of "I'm not sure how to change it, go explore." Picking the wrong level costs more than picking the wrong model.

low autonomy ◀──────────── control flow owner ───────────▶ high autonomy
(you drive)                                                (model drives)

Aider ────────────────── Cursor ────────────────── Claude Code
/add picks files         tab + composer            Greps/Reads to explore
diff blocks + auto       fast-apply model          Edit tool string replace
commit                   lands the change          agentic multi-step loop
you drive each step      fast editor iterations    "go figure it out"
──────────────────────────────────────────────────────────────────────
"I know what to change"  "type as I go"            "you handle it"

Failure mode: wrong level. Using Claude Code's full-auto loop for a mechanical replace where you already have a precise diff in mind — it'll "improvise," explore forever, touch things you never told it to. Conversely, using Aider for open-ended debugging — you have to hand-feed every file; it won't go find them. Autonomy must match the task's "uncertainty."

Resources · Aider Edit Formats docs, aider.chat/docs/more/edit-formats · Cursor Instant Apply, cursor.com/blog/instant-apply · Anthropic Claude Code Best Practices, anthropic.com/engineering/claude-code-best-practices

// 02

Edit Format: The Coding Agent's Underrated Hidden Bottleneck

Claim: "the model can write correct code" ≠ "it can land the change into the file" — edit format decides half the reliability.

Background & Principles

A counterintuitive fact: many coding-agent failures aren't logic errors, they're failed edit application — the model's search block doesn't match the file's actual text (whitespace, quotes, hallucinated old code), so the patch silently drops or errors. Aider's leaderboard quantifies exactly this as a core metric: the model must not only write correct code but successfully apply every change without human intervention. The four mainstream edit formats each trade off:

Whole file: model emits the entire file. Easiest for the model to "get right," but tokens explode, large files exceed limits, and it tends to mangle unrelated lines.
Unified diff: standard @@ diff. Token-thrifty, but the model often miscounts line numbers/context, giving a high apply-failure rate.
Search/replace block (Aider's mainstay): a precise old-text + new-text pair. More robust than unified diff (matches by text, not line number) — a sweeter spot of token cost vs reliability.
Apply model (Cursor's route): the frontier model only emits "intent + rough diff," then hands it to a smaller, specially trained model to land it precisely. This decouples "think clearly" from "write accurately," and it is the core of Cursor's ~1000 tok/s instant apply.

The core trade-off: whole file is easiest to get right but priciest; diff is cheapest but most prone to apply failure; search/replace splits the difference; apply model buys reliability with a second model. This is why different harnesses perform so differently on the same base model — the edit contract differs.
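The token side of this trade-off is easy to see by measuring how much text each format has to emit for the same one-line change. The sketch below uses character counts as a rough stand-in for tokens; the 200-line file and the `emitted_chars` helper are made up for illustration, and it says nothing about apply success rates.

```python
def emitted_chars(new_file: str, search: str, replace: str) -> dict:
    """Characters the model must emit under two edit formats (token proxy)."""
    block = f"<<<<<<< SEARCH\n{search}\n=======\n{replace}\n>>>>>>> REPLACE\n"
    return {
        "whole_file": len(new_file),       # the entire file is re-emitted
        "search_replace": len(block),      # only the old/new text pair
    }


# Hypothetical 200-line file in which exactly one line changes.
lines = [f"value_{i} = {i}\n" for i in range(200)]
lines[100] = "value_100 = -1\n"

costs = emitted_chars(
    "".join(lines),
    search="value_100 = 100",
    replace="value_100 = -1",
)
print(costs)
```

The whole-file cost grows with file length while the search/replace cost grows only with the size of the change, which is why whole file stops being viable on large files.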

Example

The exact shape of a search/replace block (same lineage in Aider / Claude Code) — the SEARCH segment must match the file character-for-character:

src/auth.py
<<<<<<< SEARCH
def verify_token(token):
    return jwt.decode(token, SECRET)
=======
def verify_token(token):
    try:
        return jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.ExpiredSignatureError:
        raise AuthError("token expired")
>>>>>>> REPLACE

Practical takeaways: (1) before the agent edits a large file, make it Read the exact slice first — hallucinated old code is the #1 cause of apply failure; (2) keep changes small and local, one semantic unit per replace — far higher apply success than rewriting a whole function; (3) when the model repeatedly fails to apply, falling back to whole-file mode beats stubbornly forcing diffs.

Failure mode: a big search/replace inside a very long file. The longer the file, the more likely the model misremembers some line's exact text (extra spaces, a comment it skipped). A SEARCH mismatch means the edit fails, and if the harness doesn't surface that failure loudly, the change is silently dropped: you think it edited, it didn't. Symptom: the agent says "fixed" but tests are still red. Fix: shrink edit granularity + force a Read before editing.
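One way to rule out the silent-drop case is to make the apply step fail loudly. The following is a minimal sketch of a strict applier, not how Aider or Claude Code actually implement matching; `ApplyError` and `apply_search_replace` are hypothetical names.

```python
class ApplyError(Exception):
    """Raised when a SEARCH block cannot be applied unambiguously."""


def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, refusing anything but an exact, unique match."""
    count = source.count(search)
    if count == 0:
        raise ApplyError("SEARCH text not found; re-read the file and retry")
    if count > 1:
        raise ApplyError(f"SEARCH text matches {count} places; add more context")
    return source.replace(search, replace, 1)


source = "def verify_token(token):\n    return jwt.decode(token, SECRET)\n"

# Exact match: the edit lands.
patched = apply_search_replace(
    source,
    "    return jwt.decode(token, SECRET)\n",
    '    return jwt.decode(token, SECRET, algorithms=["HS256"])\n',
)

# Misremembered whitespace (two spaces after the comma): the edit is rejected.
try:
    apply_search_replace(source, "    return jwt.decode(token,  SECRET)\n", "pass\n")
    outcome = "applied"
except ApplyError as err:
    outcome = f"rejected: {err}"

print(outcome)
```

Feeding the error text back to the model gives it a concrete reason to Read the file again instead of declaring the change done.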

Resources · Aider Code Editing Leaderboard, aider.chat/docs/leaderboards/edit · Fireworks How Cursor built Fast Apply, fireworks.ai/blog/cursor · Cursor Editing at 1000 tok/s, cursor.com/blog/instant-apply

// 03

Plan Mode: Separating Research-Plan-Implement Is Engineering, Not UX

Claim: making a coding agent think before it acts boosts accuracy by "locking the problem," not by "thinking longer."

Background & Principles

Anthropic's Claude Code best practices stress one rule repeatedly: separate research/planning from implementation. The biggest risk of letting an agent code straight away isn't sloppy code — it's solving the wrong problem: patching the symptom and bypassing the root cause you actually wanted fixed. Plan mode has three layers of engineering value:

Problem locking: explore read-only first (Grep / Read), produce a plan, force the agent to articulate "what to change and why" before touching anything. A wrong direction gets caught here — this is the cheapest verification, far cheaper than reviewing hundreds of lines of diff after the fact.
Physical isolation: Claude Code's plan mode forbids Write/Edit/Bash at the harness layer — not a prompt begging it not to act, but a physical block. No matter how "itchy" the model gets, it can't change files (echoing Day 3's permission gate).
Auditability: the plan is text, so on failure you can tell whether the "plan was wrong" or the "execution was wrong" — far clearer than debugging a pile of messy diffs.

The counterintuitive point: plan mode's payoff doesn't come from the model thinking more (that's reasoning's job), it comes from giving a human a low-cost correction checkpoint. For open-ended tasks (debugging, cross-file refactors, new features) planning first is almost always worth it; for a one-line change, planning is pure overhead.
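The "physical isolation" idea can be sketched as a gate in the harness's tool dispatcher. This is an illustrative model only, not Claude Code's implementation; the `Mode` enum, `dispatch` function, and in-memory `files` dict are assumptions, and the tool names simply mirror the ones mentioned above.

```python
from enum import Enum


class Mode(Enum):
    PLAN = "plan"
    EXECUTE = "execute"


READ_ONLY_TOOLS = {"Grep", "Read"}
MUTATING_TOOLS = {"Write", "Edit", "Bash"}


class ToolDenied(Exception):
    """Raised when the harness refuses a tool call."""


def dispatch(mode: Mode, tool: str, handler, *args):
    """Run a tool call only if the current mode permits it."""
    if tool not in READ_ONLY_TOOLS | MUTATING_TOOLS:
        raise ToolDenied(f"unknown tool: {tool}")
    if mode is Mode.PLAN and tool in MUTATING_TOOLS:
        # The check lives in the harness, so no prompt can talk its way past it.
        raise ToolDenied(f"{tool} is blocked in plan mode")
    return handler(*args)


files = {"pay.py": "retry()"}  # stand-in for the filesystem


def read(path: str) -> str:
    return files[path]


def edit(path: str, text: str) -> str:
    files[path] = text
    return "ok"


print(dispatch(Mode.PLAN, "Read", read, "pay.py"))
try:
    dispatch(Mode.PLAN, "Edit", edit, "pay.py", "retry_once()")
except ToolDenied as err:
    print(err)
```

Because the model never gets to run the mutating handler in plan mode, the guarantee holds regardless of what the prompt or the model's output says.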

Example

Bake "explore → plan → code → commit" into a workflow rather than letting the agent write right away:

# Claude Code: enter plan mode (Shift+Tab), read-only explore + produce a plan
> Study how the payment module handles retries. Don't change code yet —
  give me a plan to fix the double-charge bug: which files, risk points, how to test

# You review the plan, stop wrong directions, then switch back to execute
> Plan looks good, but don't touch the webhook handler. Implement it, and run the relevant tests after each file

# Aider equivalent: use /ask to ask-only first, then switch to code mode
/ask Where in this retry logic could it double-charge? How should it be fixed?
# once satisfied: /code to land it

Key discipline: require the agent to explicitly list "what NOT to touch" in the plan. A coding agent's most common overreach is casually "improving" code you never asked it to change — drawing boundaries at the plan stage beats intercepting at review.
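The "what NOT to touch" list becomes more useful when it is checked mechanically after execution. A minimal sketch, assuming the boundaries are written as glob patterns and the changed paths come from something like `git diff --name-only`; the `boundary_violations` helper and the example paths are hypothetical.

```python
from fnmatch import fnmatch


def boundary_violations(changed_files, do_not_touch):
    """Return changed paths that match any 'do not touch' glob pattern."""
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in do_not_touch)
    ]


# Boundaries agreed at the plan stage (hypothetical paths).
do_not_touch = ["src/webhooks/*", "migrations/*"]

# Paths the agent actually changed.
changed = ["src/payments/retry.py", "src/webhooks/handler.py"]

violations = boundary_violations(changed, do_not_touch)
print(violations)  # → ['src/webhooks/handler.py']
```

A non-empty result is a cheap signal to stop and review before reading the diff itself.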

Failure mode: forcing plan mode on trivial tasks too. Making the agent produce three paragraphs of plan to fix a typo wastes turns and tokens. Plan mode's value rises with task uncertainty: skip it for a one-line change, always use it for cross-file/debugging/new features. Same criterion as §1: the more uncertain the task, the more you plan first.

Resources · Anthropic Claude Code Best Practices (explore-plan-code), anthropic.com/engineering/claude-code-best-practices · Anthropic Building Effective Agents, anthropic.com/engineering/building-effective-agents

// 04

The Autonomy Spectrum and Closing the Loop

Claim: the core skill of AI pair programming isn't writing prompts — it's "picking the right autonomy level" + "feeding verification back to the agent."

Background & Principles

"AI pair programming" isn't one mode, it's an autonomy spectrum. Each step to the right raises throughput, but it also weakens your ability to review every line and lets errors propagate further before anyone catches them. What decides the level is the task's verifiability: can the agent know on its own whether it's right?

Tab completion: you write, it continues. You review every token. Good when you have the code in mind and just want to type faster.
Chat / inline edit: you describe a local change, it emits a diff, you apply one by one. Good for clear small changes.
Full-auto agent: give a goal, it loops explore→edit→test→fix on its own. Good for open-ended tasks, but only safe when the loop can be closed by verification.

Whether full-auto at the right end works depends almost entirely on closing the loop: can the agent run tests / lint / type-check after editing and feed failures back to itself to fix? If the loop closes (tests exist, errors are clear) → let it iterate autonomously; if it can't close (UI tweaks, scripts with no tests) → drop back to a lower level and be the "test" yourself. This echoes Day 3: feeding errors back to the agent is what makes self-repair possible.

Example

Give the full-auto agent an explicit verification loop, not just a goal then walk away:

> Fix the race condition in #482. Work discipline:
  1. First write a failing test that reproduces the bug (red)
  2. Change code to make it green
  3. Run `pytest tests/ -q` + `mypy src/`; done only when all green
  4. Any step red: paste the full error, analyze, then fix — don't skip

# Key: give it an objective "done" criterion (green) and verification tools,
# so the agent can close its own loop. Without this it self-declares "fixed" and leaves hidden bugs

This is why test-first matters more for AI agents than for humans: tests are the agent's only trustworthy "right/wrong signal." Without them, an agent's self-evaluation is unreliable: it tends to declare success (backward rationalization, see Day 5). Let full-auto run only once a red test is in place.
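On the harness side, the same discipline looks roughly like the loop below: "done" is decided by the exit codes of the checks, never by the agent's own claim. This is a sketch under stated assumptions; `run_checks`, `close_loop`, and the `fake_agent` stand-in are hypothetical, and the demo check is a tiny Python one-liner so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_checks(commands):
    """Run each check; return (all_green, combined output of the failing ones)."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return (not failures, "\n".join(failures))


def close_loop(attempt_fix, commands, max_rounds=3):
    """Alternate fixing and verifying until green or out of rounds."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        attempt_fix(feedback)              # the agent edits, seeing the last failures
        green, feedback = run_checks(commands)
        if green:
            return round_no                # the checks decide "done", not the agent
    return None                            # still red: hand back to a human


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "state.txt")
    # Stand-in for a test suite: passes only when the file contains "good".
    check = [
        sys.executable, "-c",
        "import sys; sys.exit(0 if open(sys.argv[1]).read() == 'good' else 1)",
        target,
    ]

    def fake_agent(feedback):
        # Gets it wrong first, and right only after seeing a failure report.
        with open(target, "w") as fh:
            fh.write("good" if feedback else "bad")

    rounds = close_loop(fake_agent, [check])

print(rounds)
```

In a real repo the command list would be the checks from the prompt above, for example `["pytest", "tests/", "-q"]` and `["mypy", "src/"]`.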

Failure mode: being misled by SWE-bench scores. On that benchmark, a high-scoring agent has a clear PASS/FAIL test closing its loop; your real repo often lacks an equally clear verification signal. Treating "SWE-bench 80%" as "80% reliable in my project" is wrong: the benchmark closed the loop for the agent; your codebase may not. Build a verifiable loop for the agent first, and the score has a much better chance of transferring.

Resources · OpenAI Introducing SWE-bench Verified, openai.com/index/introducing-swe-bench-verified · Anthropic Building Agents with the Claude Agent SDK, anthropic.com/engineering/building-agents-with-the-claude-agent-sdk

// PUTTING IT TOGETHER · Pick the right harness + level for one task

Next time you get a coding task, don't open a tool yet — run this decision flow once. It chains this issue's four points into an executable selector:

Do I have a precise diff in mind? Yes → low-autonomy level (Cursor inline / Aider /code), don't let the agent improvise. No → continue.
Is the task loop-verifiable? (tests exist / clear errors) Yes → full-auto agent (Claude Code), make it write a red test first then let go. No → drop back to chat level and be the verifier.
Is uncertainty high? (cross-file / debugging / new feature) High → plan mode first to lock the problem and draw the "don't touch" boundary, then execute. Low → just change it.
Is the file large? Large → force the agent to Read the exact slice before editing + small-grained search/replace, to cut apply failures.
Give a done criterion: whatever the level, hand it an objective "green" signal (tests/lint/type-check all pass); don't let the agent self-declare success.
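The flow above is simple enough to write down as a function. This is only one reading of the checklist, with the plan step placed first when uncertainty is high; the `Task` fields and `pick_workflow` name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    has_precise_diff: bool   # "I know exactly what to change"
    loop_verifiable: bool    # tests exist / errors are clear
    high_uncertainty: bool   # cross-file, debugging, new feature
    large_file: bool


def pick_workflow(task: Task) -> list[str]:
    """Turn the decision flow into an ordered list of steps."""
    steps = []
    if task.has_precise_diff:
        steps.append("low autonomy: inline edit or Aider /code")
    elif task.loop_verifiable:
        steps.append("full-auto agent: write a red test first")
    else:
        steps.append("chat level: you are the verifier")
    if task.high_uncertainty:
        steps.insert(0, "plan mode first: lock the problem, list what not to touch")
    if task.large_file:
        steps.append("Read the exact slice, then small search/replace edits")
    steps.append("done = tests/lint/type-check all green")
    return steps


for step in pick_workflow(Task(False, True, True, False)):
    print(step)
```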

After a few rounds you'll find: the felt improvement from picking the right "level" often beats swapping in a stronger model. That's harness thinking — the model is the CPU; your workflow is the real lever.

// KEY TERMS

Edit Format: The contract for landing a change into a file: whole-file / unified-diff / search-replace / apply-model. Determines apply success rate.
Search/Replace Block: A diff format giving exact old-text + new-text (Aider's mainstay). Matches by text, not line number — more robust than unified diff.
Apply Model: A smaller, specially trained model that lands a frontier model's rough intent into a precise file diff. Core of Cursor instant apply.
Fast Apply: Cursor's high-speed edit landing (~1000 tok/s), accelerated by speculative edits.
Plan Mode: Claude Code's read-only exploration level, physically write-forbidden at the harness layer. Separates research/plan from implementation.
Closing the Loop: The agent runs tests/lint after editing and feeds failures back to self-fix. The prerequisite for safe full-auto.
Autonomy Spectrum: The tab → chat → full-auto autonomy range. Pick a level by task verifiability.
SWE-bench Verified: 500 human-validated real GitHub issue-fix tasks; resolved rate is the main metric.
Test-first for Agents: Write a red test before letting the agent loose. Tests are the agent's only trustworthy right/wrong signal.

// DEEP THINKING

Cursor uses a second model to apply; Claude Code has the same model emit string-replace directly. What's the cost of each route?

Cursor's route: the frontier model only "thinks," the apply model handles "accuracy." The frontier model can express intent with a cheaper rough diff, and apply is fast with a high success rate; the cost is maintaining an extra model, plus possible misalignment between the two (the apply model misreads intent). Claude Code's route: a single model end-to-end, no misalignment, no extra infra; the cost is that the frontier model must itself produce character-exact SEARCH blocks, which are easy to misremember in long files, so apply failure rises with length. It's a "reliability vs simplicity" architectural choice, and which side you land on depends on whether you can afford to train and serve an apply model.

Plan mode's physical write-ban brings safety but also cuts off "discovering mid-work that the plan must change." How to weigh this trade-off?

Plan/execute separation assumes "the plan can be basically settled without touching code." For many tasks this fails — the real constraints often surface only once you start changing (an API doesn't exist, dependency conflicts). Clinging to the plan makes the agent ram an outdated one into a wall. Pragmatic move: the plan isn't a one-shot contract, it's a checkpoint. Let the agent stop and re-plan when execution reveals the plan is void, rather than plowing through a wrong plan. A good harness supports the "execute→hit wall→back to plan" loop, not a one-way waterfall. This is why feedback-bearing workflows like evaluator-optimizer are more robust than pure linear plan-execute.

If test-first is the prerequisite for agent autonomy, where's the autonomy ceiling for domains you can't test (frontend visuals, prompt tuning, exploratory data analysis)?

The ceiling is locked by "the objectivity of the verification signal." In these domains "right/wrong" is either subjective (does it look good) or needs a human to judge, so the agent can't close its own loop → it falls back to low autonomy with a human as oracle. Ways out: (1) build proxy metrics — screenshot diff for visuals, LLM-as-judge + a fixed eval set for prompts, turning the subjective into a runnable signal; (2) human-in-the-loop, reviewing every step; (3) shrink to verifiable sub-problems (frontend logic is testable, styling is human-reviewed). Core: an agent's autonomy = how objective a verification loop you can build for it. Can't build one? Don't pretend full-auto.

Why does an agent that scores high on SWE-bench often "drop" when moved to your real repo? Beyond the test loop, what hidden differences exist?

Beyond §4's loop difference: (1) SWE-bench tasks are self-contained, well-scoped single issues, while real work is often a vague "make it better," so defining the problem is itself hard; (2) benchmark repos have complete docs/tests, while your private repo may have bad docs and many implicit conventions, so the agent lacks context; (3) some reported benchmark numbers come from multiple attempts with the best one kept (pass@k or best-of-n), while in production you often see only the first attempt; (4) data contamination: popular issues may appear in training data. So a leaderboard is a "relative ranking of capability ceilings," not an absolute prediction of "how usable in my repo." Read leaderboards for the trend; don't worship absolute numbers.
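Point (3) can be made concrete with a little arithmetic. Under the simplifying assumption that attempts are independent and each succeeds with the same probability p, the chance that at least one of k attempts succeeds is 1 - (1 - p)^k. The numbers below are hypothetical, not measurements of any agent.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


single = pass_at_k(0.5, 1)      # what you experience on a first try
best_of_3 = pass_at_k(0.5, 3)   # what a best-of-3 number would report

print(single, best_of_3)  # → 0.5 0.875
```

The same agent looks far stronger under best-of-3 than on the single attempt you usually get in day-to-day use.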

Aider auto-commits each change; Claude Code tends to batch and let you review. How do these "change granularity" philosophies affect human-machine collaboration differently?

Aider's fine-grained commits: every step is /undo-able, git history as audit, cheap rollback → encourages "let it try a lot," revert on error; cost is a fragmented history needing later squash. Claude Code's batching: gives you a semantically complete block to review, concentrated cognitive load, coherent diff, but if a middle step is wrong, rollback is costly. Philosophical difference: Aider pushes "reversibility" to the extreme, fit for exploratory, fault-tolerant work; Claude Code bets on "getting one complete unit right at once," fit for when you trust it and want clean history. Good habit either way: control commit boundaries yourself — make each commit an independently reviewable/revertable semantic unit, instead of being dragged by the tool's default granularity.

// FURTHER READING

Aider · Edit Formats — token / reliability trade-offs of four edit formats
Aider · Code Editing Leaderboard — a benchmark that treats "successful apply" as a core metric
Cursor · Editing Files at 1000 Tokens/s — apply model + speculative edits architecture
Anthropic · Claude Code Best Practices — the explore-plan-code-commit workflow
OpenAI · Introducing SWE-bench Verified — what a coding-agent benchmark actually measures