DAY 20 / PHASE 2 · APPLICATIONS & SYSTEMS

Refactoring with AI

Characterization Tests · Strangler Fig · Non-local Reasoning · Codemod vs LLM

2026-06-03 · BigCat

AI can rewrite an entire file in seconds — which is exactly what makes it dangerous.

// WHY THIS MATTERS

"Have Claude refactor this 800-line legacy function" is 2026's most tempting — and most easily disastrous — instruction. Refactoring is defined as changing structure without changing behavior — and the LLM offers zero guarantee on that second half: it generates plausible-looking tokens, not transformations proven equivalent. The hard part of refactoring was never writing new code; it's proving old and new behave identically. This issue covers four engineering things that decide success: how to wrap AI in a characterization-test safety net, why big refactors must go Strangler Fig instead of big-bang, which class of refactoring AI systematically fails at, and when you shouldn't use an LLM at all — and should reach for a deterministic codemod instead.

// 01

Characterization Tests: AI Refactoring's Safety Net

Claim: AI refactoring without tests locking behavior is like swapping parts on a running engine blind — you can't know when it broke.

Background & Principle

The characterization test, introduced by Michael Feathers in Working Effectively with Legacy Code, is a badly underrated weapon in the AI era. It doesn't verify "what the code should do" (that's a spec test); it pins down "what the code actually does right now", including those boundary behaviors no one remembers the reason for. The safety of any refactor rests on it: as long as this suite stays green before and after, behavior hasn't changed on any input the suite covers.
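
A minimal sketch of what such a suite pins down, in Python. The `calculate_invoice` body below is a hypothetical legacy function invented for illustration (this issue never shows the real one), and the expected values were recorded by running it, not derived from a spec. Note the last case: floor division on a negative total looks like a bug, and the suite locks it in anyway.

```python
def calculate_invoice(items, discount_pct=0):
    """Hypothetical legacy function, invented for this sketch.

    items: list of (price_cents, quantity) tuples.
    """
    if not items:
        return 0
    total = 0
    for price_cents, qty in items:
        total += price_cents * qty  # negative qty is not rejected
    if discount_pct:
        # Floor division: on negative totals this rounds away from zero.
        total -= total * discount_pct // 100
    return total


# Recorded behavior: (items, discount_pct, observed_output).
# Values were captured by running the legacy code, not by reasoning
# about what the result "should" be.
RECORDED_CASES = [
    ([], 0, 0),                # empty input, early return
    ([(1000, 2)], 0, 2000),    # happy path
    ([(1000, 2)], 10, 1800),   # happy path with discount
    ([(999, 1)], 15, 850),     # discount is floored to 149
    ([(1000, -1)], 0, -1000),  # negative quantity accepted as-is
    ([(333, -1)], 10, -299),   # floor on a negative total (looks like a bug)
]


def check_characterization(func):
    """Return the cases where func deviates from the recorded behavior."""
    failures = []
    for items, pct, expected in RECORDED_CASES:
        actual = func(items, pct)
        if actual != expected:
            failures.append((items, pct, expected, actual))
    return failures


def refactored_invoice(items, discount_pct=0):
    """A plausible-looking 'cleanup' that silently changes rounding."""
    total = sum(price * qty for price, qty in items)
    return total - int(total * discount_pct / 100)


print(check_characterization(calculate_invoice))   # no deviations
print(check_characterization(refactored_invoice))  # the negative-total case deviates
```

In a real suite each recorded case would be its own pytest test; the list form here just keeps the sketch dependency-free. The refactored version agrees on five of six cases, which is exactly why the odd-looking sixth one has to be in the net.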

AI plays a counterintuitive dual role here. It's excellent at mass-generating characterization tests: feed it a function, have it enumerate input branches, boundaries, and exception paths, and in minutes it lays down dozens of cases, sparing a human the most tedious chore of the job. But it has a fatal tendency: when you ask it to "refactor and keep the tests passing" and the implementation comes out wrong, it may quietly edit the tests until they pass. So the order matches what Anthropic stresses in Claude Code best practices: write tests first, commit them first, then change the implementation, treating tests as an immovable contract.

Hands-on

# Step 1: have AI generate characterization tests (not spec tests)
"Write characterization tests for calculate_invoice().
 The goal isn't to judge correctness, but to lock in its
 current behavior, including behavior that looks like a bug.
 Cover: happy path, empty input, negatives, None, huge values,
 every early-return branch. Use the real output as the assertion
 value; do NOT 'correct' any result for me."

# Step 2: run them yourself, commit the tests (net in place)
$ pytest test_invoice.py -v   # all green → git commit

# Step 3: only now have AI refactor, constraint hard-wired
"Refactor calculate_invoice: split functions, improve names, add types.
 You may not change a single line of test_invoice.py.
 Run pytest after every change; stop and report if anything goes red."

Failure mode: letting AI "write tests + refactor" at once. It'll write tests self-consistent with the new implementation: both wrong, tests still green, net useless. Tests must be generated and verified green against the old code to mean anything. Another trap: high coverage ≠ behavior firmly locked; characterization tests must target branches and boundaries, not line count.

Resources · Michael Feathers, Working Effectively with Legacy Code (origin of the characterization test) · Anthropic Claude Code Best Practices, code.claude.com/docs/en/best-practices (TDD is the strongest agentic-coding pattern)

// 02

Big Refactors Go Strangler Fig, Not Big-Bang Rewrite

Claim: AI is reliable on "small diff + immediate verification," and systematically introduces hidden regressions on "rewrite the whole module at once."

Background & Principle

"Rewrite the whole file" is AI's most confident — and most warning-worthy — operation. Emitting hundreds of new lines at once, every token accumulates probability of drifting from the original behavior, while you lose any intermediate checkpoint — errors flood in all at once, with nothing to bisect against. Martin Fowler's 2004 Strangler Fig gives the right paradigm: old and new coexist, migrate one small slice at a time, wire each in only after its own tests pass, and the old code is gradually "strangled" out. It was meant for system-level modernization, but it maps cleanly onto AI function-level refactoring — cut a big refactor into a chain of per-step verifiable small transforms.

The companion is Fowler/Kent Beck's preparatory refactoring: "make the change easy, then make the easy change." Before having AI change a big chunk, have it do one pure structural step (extract a function, introduce an interface), keep tests green, then make the real change on a clean structure. Every step is a small green→green hop.

big-bang rewrite (AI's default)             Strangler Fig (the right way)
─────────────────────────                   ─────────────────────────────
old module ─┐                               old mod  ─┬─ slice1 ─▶ new impl ✓test
            │ AI rewrites once              façade    │
            ▼ (hundreds of lines)           (routes) ─┼─ slice2 ─▶ new impl ✓test
new module ◀┘                                         │
     │                                                └─ slice3 ─▶ new impl ✓test
     ▼ all behavior changes at once                   │
✗ regression hides somewhere, no bisect     old strangled out, all green
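
At function level, the façade can be as small as a routing table. A sketch under stated assumptions: the slice names (`tax`, `shipping`), the `legacy_*` / `new_*` functions, and the `OrderFacade` class are all hypothetical stand-ins for pieces of a service like OrderService, and the parity check over sample inputs plays the role of the per-slice tests.

```python
def legacy_tax(amount_cents):
    return amount_cents * 8 // 100


def legacy_shipping(amount_cents):
    return 0 if amount_cents >= 5000 else 499


def new_tax(amount_cents):
    # Re-implemented slice; must match legacy_tax on every sampled input.
    rate_pct = 8
    return amount_cents * rate_pct // 100


class OrderFacade:
    """Routes each slice to the old or new implementation."""

    def __init__(self):
        self._legacy = {"tax": legacy_tax, "shipping": legacy_shipping}
        self._routes = dict(self._legacy)  # everything starts on old code

    def call(self, slice_name, amount_cents):
        return self._routes[slice_name](amount_cents)

    def migrate(self, slice_name, new_impl, samples):
        """Switch one slice, but only if it agrees with legacy on all samples."""
        old_impl = self._legacy[slice_name]
        mismatches = [s for s in samples if old_impl(s) != new_impl(s)]
        if mismatches:
            return mismatches  # route unchanged, caller can inspect
        self._routes[slice_name] = new_impl
        return []

    def on_new_code(self, slice_name):
        return self._routes[slice_name] is not self._legacy[slice_name]


facade = OrderFacade()
samples = [0, 1, 99, 1250, 4999, 5000, 10**9]

# Step 1: migrate only the tax slice; shipping stays on legacy code.
print(facade.migrate("tax", new_tax, samples))  # empty list means parity held
print(facade.on_new_code("tax"), facade.on_new_code("shipping"))

# A broken candidate is rejected and the old route stays live.
print(facade.migrate("shipping", lambda cents: 499, samples))
print(facade.on_new_code("shipping"))
```

Each successful `migrate` call corresponds to one commit; a rejected one leaves callers on the old path, so a bad slice never blocks the slices already moved.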

Hands-on

# Break a big refactor into a plan AI can execute reliably
"We're refactoring OrderService (500 lines). Do not rewrite at once.
 Step it via Strangler Fig, each step satisfying:
   1. a small single-responsibility change (extract one method /
      replace one implementation)
   2. run tests right after; they must stay green
   3. stop when green, let me commit, then proceed to the next step
 Give me the step plan (6-10 steps) first, no code. I'll confirm, then we go step by step."

Failure mode: approving a 10-step plan in plan mode, then saying "just run it all at once." AI collapses the 10 steps into one giant diff and every benefit of Strangler Fig is gone. Insist on one commit per step: that bit of friction buys you a clean rollback at any failing step.

Resources · Martin Fowler, Strangler Fig Application, martinfowler.com/bliki/StranglerFigApplication.html · Fowler, Preparatory Refactoring, martinfowler.com/articles/preparatory-refactoring-example.html

// 03

AI Systematically Fails at "Non-local Reasoning" Refactors

Claim: AI's refactoring reliability is inversely proportional to the amount of non-local reasoning it requires.

Background & Principle

There's a ruler sharper than "difficulty" for judging whether AI can safely do a refactor: how much non-local reasoning does this transform need? Local refactors — rename, extract method, inline a variable, reformat — only need a small span of code to judge equivalence, and AI almost never gets them wrong. But the moment correctness depends on global invariants spanning files, time, and execution paths, AI fails systematically — because its attention is similarity-based pattern matching, not formal tracking of program invariants.

Typical AI danger zones: concurrency and lock ordering (reordering code can introduce deadlock or races, invisible when reading the code); memory aliasing and shared mutable state (whether two variables point to the same object decides whether a split is safe); implicit cross-module contracts (a caller depends on a side effect's ordering); resource lifecycle (release order on exception paths). Here AI's refactor "reads perfectly reasonable" — which is precisely the danger: its confidence and its correctness are nearly uncorrelated for this class of problem.
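
A small illustration of the aliasing and implicit-contract cases, in Python. Both functions and the `checkout` caller are hypothetical. The "cleaner" version returns the same value on every input, yet it breaks a caller that relied on the in-place mutation, and a test that asserts only on the return value stays green.

```python
def apply_discount_legacy(cart):
    """Discounts each item 10% IN PLACE and returns the new total."""
    for item in cart:
        item["price"] = item["price"] * 90 // 100
    return sum(item["price"] for item in cart)


def apply_discount_refactored(cart):
    """'Cleaner' pure version: same return value, no mutation."""
    discounted = [{**item, "price": item["price"] * 90 // 100} for item in cart]
    return sum(item["price"] for item in discounted)


def checkout(apply_discount):
    """A caller that silently depends on the side effect."""
    cart = [{"sku": "A", "price": 1000}, {"sku": "B", "price": 500}]
    total = apply_discount(cart)
    # The receipt reads prices from the same list the function was handed.
    receipt_lines = [item["price"] for item in cart]
    return total, receipt_lines


print(checkout(apply_discount_legacy))      # total and receipt agree
print(checkout(apply_discount_refactored))  # same total, receipt shows old prices
```

Nothing inside either function reveals the problem; the evidence lives in `checkout`, which may sit in another file. That is what makes the transform non-local.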

Hands-on

# Before refactoring, force a non-local impact analysis (not a direct edit)
"Before refactoring this function, don't touch code yet. Answer 4 questions:
 1. Does it mutate any shared state outside the function? Who depends on that side effect?
 2. Is there locking / concurrency / async here? Will my change touch execution order?
 3. Are any parameters mutable objects referenced in multiple places (aliasing)?
 4. Does any caller depend on the current execution order or exception timing?
 If any answer is 'yes' or 'unsure', mark it high-risk — I'll do that part by hand."

Failure mode: treating "the tests are still green after AI refactored" as proof of safety. Concurrency, aliasing, and lifecycle bugs are exactly the ones hardest for tests to cover, because they often surface only under specific timing or scheduling. Such refactors, even when green, demand manual review of the invariants, or simply shouldn't go to AI.

Resources · Martin Fowler, Refactoring (2nd ed.) (a catalog of refactoring moves, distinguishing the mechanical ones from those needing global judgment)

// 04

Deterministic Mechanical Refactors → Codemod; Judgment → LLM

Claim: for any refactor an AST tool can do deterministically, an LLM is the more expensive, slower, less reliable choice.

Background & Principle

A whole class of refactors is deterministic: rename a symbol repo-wide, change every call site of a deprecated API to the new signature, bulk-adjust imports, apply a lint rule. These have a single correct answer and can be expressed precisely as syntax-tree transforms. For them, codemod / AST tools (ast-grep, jscodeshift, Python's ruff) beat LLMs on three axes: correctness (structural matching on the parsed syntax tree, tree-sitter-based in ast-grep's case, applies the rule to every node that matches and never touches same-named text inside strings), cost (zero tokens, milliseconds across thousands of files), and auditability (the rule is declarative: reviewable, reusable, CI-able).
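
To see why structural matching avoids the string-literal problem, here is a sketch using Python's standard `ast` module rather than ast-grep itself (an illustrative substitute, not how ast-grep is implemented). The snake_case function names are hypothetical. It needs Python 3.9+ for `ast.unparse`, which discards comments and original formatting; that is one reason real codemods use format-preserving tools instead.

```python
import ast

SOURCE = '''
user = get_user_data(42)
label = "get_user_data is deprecated"
'''


class RenameCalls(ast.NodeTransformer):
    """Rename calls to a bare function name; leave everything else alone."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)  # handle nested calls in the arguments
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        return node


tree = RenameCalls("get_user_data", "fetch_user").visit(ast.parse(SOURCE))
structural = ast.unparse(tree)

# Naive text replacement for comparison: it cannot tell code from strings.
textual = SOURCE.replace("get_user_data", "fetch_user")

print(structural)
print(textual)
```

The transformer only ever sees a `Call` node whose callee is a `Name`, so the string literal is never a candidate. The text replacement has no such notion and rewrites the label too.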

The dividing line is clear: structurally-determined work to deterministic tools, semantically-judged work to the LLM. "Rename all getUserData to fetchUser" is codemod work; "how should this logic split into functions and what should each be named to be expressive" is LLM work. The clever play is to have the LLM write the codemod rule for you — turning its semantic understanding into one reusable, auditable transform rule instead of editing file by file.

          ┌──────────────────────────────────────────┐
Who       │ Single answer, expressible as a syntax   │
should    │ tree transform?                          │
do it?    └───────────────┬──────────────┬───────────┘
                      yes │              │ no
                          ▼              ▼
                 ┌────────────────┐  ┌────────────────────────────┐
                 │ Codemod / AST  │  │ Needs semantic judgment /  │
                 │ ast-grep,      │  │ design choice → LLM        │
                 │ jscodeshift,   │  │ (splitting, naming, design)│
                 │ ruff           │  └────────────────────────────┘
                 └────────────────┘  Best: have the LLM write the
             rename / API migration  codemod rule above
             / imports / lint        (semantics → determinism)

Hands-on

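# Anti-pattern: LLM renaming file by file (costly, slow, hits same-name in strings)
# Right way: ast-grep, structural match, milliseconds, CI-able
# Note: this pattern rewrites bare call sites only; the declaration, imports/exports,
# and member calls such as api.getUserData() each need their own pattern.
# -U applies every rewrite without prompting, so run it on a clean working tree.
$ ast-grep --pattern 'getUserData($$$ARGS)' \
           --rewrite  'fetchUser($$$ARGS)' \
           --lang js -U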

# Advanced: have the LLM translate a fuzzy need into a deterministic rule
"I want all `logger.log(x)` changed to `logger.info(x)`,
 but do NOT touch 'logger.log' inside string literals or comments.
 Write me an ast-grep rule (YAML) — I'll run it instead of you editing by hand."

Failure mode: handing a deterministic repo-wide rename to an LLM across 5000 files. Result: exploding token cost, several files missed, same-named text in comments and strings wrongly edited, and no reproducibility. For any "mechanical, rule-expressible" refactor, ask first: can this be an ast-grep / codemod? If yes, don't have the LLM hand-edit.

Resources · ast-grep (multi-language structural search/rewrite), ast-grep.github.io · jscodeshift (JS/TS codemod toolkit), github.com/facebook/jscodeshift

// PUTTING IT TOGETHER · Safely refactor a real legacy function

Chain the four points into a flow you can use on your own repo today. Pick the legacy function you most dread touching, and walk it through:

  1. Classify (§3): do the non-local impact analysis first — concurrency? aliasing? cross-module side effects? Circle the high-risk parts and exclude them from AI's scope.
  2. Route (§4): of what's left, pull out the purely mechanical bits (rename, imports, API migration) and run them through ast-grep — zero tokens.
  3. Weave the net (§1): for the core logic that genuinely needs judgment, have AI generate characterization tests, run them green yourself, and commit the tests.
  4. Step it (§2): have AI produce a Strangler Fig step plan (6-10 steps); execute step by step, run tests each time, commit when green, then proceed.
  5. Verify for real: for the §3 high-risk parts, review the invariants by hand even if tests are green. Finally, diff test results and performance before vs after.

Do this once and you build the muscle memory: the engineering work of refactoring is in "proving behavior unchanged," not in "writing new code" — and AI only helps with the latter; the former rests on the test + step + routing system you design.

// DEEP THINKING

Characterization tests lock in "current behavior," bugs included. If some downstream depends on a bug, "fixing" it during refactor is itself a regression — doesn't this net just calcify bugs?
It does, and that's by design. The characterization test's job is to decouple "behavior change" from "refactoring": during the refactor phase, all behavior (bugs included) must stay frozen, and tests prove it. Fixing a bug is a separate commit — where you explicitly change the corresponding assertion so the diff clearly records "this is a behavior change, not a refactor." Mixing the two is the real danger: you can never tell whether a behavior difference was an intentional fix or an unintentional regression. Refactor first (freeze behavior), then fix the bug separately (behavior change + test synced).
§4 says hand mechanical refactors to codemod. But writing an ast-grep rule has a cost too — for a one-off small rename, isn't just letting the LLM edit faster?
Yes, the break-even depends on scale and reproducibility needs. Three files, one-off, not in CI — letting the LLM (or you) edit is faster; the fixed cost of a rule isn't worth it. But the moment any one of these holds, write the rule: hundreds of files (LLM miss rate climbs with scale), needs reproducing (teammates/future you re-run it), needs auditing (a rule is reviewable; per-file LLM edits aren't), collision risk (the same name appears in strings/comments). The formula: cost of writing the rule vs (scale × cost per error + reruns × redo cost).
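
The break-even comparison can be written out as a few lines of arithmetic. This is a sketch with made-up numbers, all in minutes; `miss_rate` is an assumption added here to turn "scale × cost per error" into an expected cost, and it is not part of the formula as stated.

```python
def llm_edit_cost(files, miss_rate, cost_per_error, reruns, redo_cost):
    """Expected cost of hand/LLM editing: missed edits plus repeated runs."""
    return files * miss_rate * cost_per_error + reruns * redo_cost


def should_write_rule(rule_cost, **kwargs):
    """Write the codemod rule when it is cheaper than the expected LLM cost."""
    return rule_cost < llm_edit_cost(**kwargs)


# Three files, one-off: the rule's fixed cost is not worth it.
small = should_write_rule(
    rule_cost=30, files=3, miss_rate=0.02, cost_per_error=20,
    reruns=0, redo_cost=15,
)

# Five hundred files, re-run twice by teammates: write the rule.
large = should_write_rule(
    rule_cost=30, files=500, miss_rate=0.02, cost_per_error=20,
    reruns=2, redo_cost=15,
)

print(small, large)
```

The exact numbers matter less than which term dominates: at three files the fixed rule cost wins, at five hundred the error term alone is several times larger than it.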
If future models get strong enough to formally track program invariants, does §3's "non-local reasoning failure" claim become void?
Partly, but the boundary moves up rather than disappears. Even if a model perfectly tracks static invariants, a class of problems is in principle not purely statically decidable — concurrency races dependent on runtime scheduling, distributed behavior dependent on external timing. These need not stronger reasoning but runtime evidence (fuzzing, model checking, concurrency tests). What changes: today's "medium-risk" cross-module contract refactors enter the AI-reliable zone; concurrency correctness needing TLA+-level reasoning stays in the human + formal-tools domain. The claim's form is unchanged — reliability inversely proportional to non-local reasoning — only the threshold moves.
"Commit tests first, then refactor" guards against AI sneaking test edits. But if AI changes the function's interface signature, old tests won't even compile — now what?
This exposes a key distinction: changing an interface isn't refactoring, it's an API change. Pure refactoring (extract, internal rename, reimplement) doesn't touch the public contract, and characterization tests should pass as-is. If a refactor "needs" a signature change, it actually contains an interface change — split it into two steps: first use §2's preparatory refactoring to tidy internal structure without touching the signature (tests green throughout); then the interface change as a separate, explicit step, updating callers and their tests in sync. If tests fail to compile due to a signature change, the AI usually merged "refactor" and "change API" on the sly — which is exactly what to catch.
This issue repeatedly stresses "AI only helps you write new code; proving equivalence rests on your system." So what irreplaceable value does AI contribute to refactoring?
Three irreplaceable things: 1) breadth of characterization tests — enumerating boundaries and exception paths is tedious work AI fills in minutes where a human takes half a day and misses cases; 2) semantic naming and structure judgment — "which concepts should this split into, named what" is genuine design judgment deterministic tools can't do; 3) translating fuzzy intent into deterministic rules — you speak plainly, it produces an ast-grep/codemod rule. Note none of the three is about "guaranteeing equivalence" — AI is an amplifier of refactoring throughput, but the safety boundary is still defined by the test + step + routing system you design. Outsourcing safety to AI is the misuse.

// FURTHER READING