DAY 33 / PHASE 4 · GOVERNANCE

AI Governance for Large Legacy Codebases

Strangler Rollout · The Ratchet · Guardrails-as-Code · Review Without the Bottleneck

2026-06-15 · BigCat

AI maxed out the throughput of writing code, so the bottleneck moved wholesale to review. The fix isn't more reviewers — it's letting machines filter the noise first.

// WHY THIS MATTERS

Two realities hit at once. First, you inherit or maintain a 300k-line legacy codebase with no tests and no standards, left behind by a previous generation of engineers, and you want to introduce new governance (types, architecture rules, test gates). But enforcing it across the whole repo on day one instantly surfaces thousands of violations. Nobody reads them, and the whole ruleset gets switched off. Second, AI has exploded the throughput of writing code: PR counts multiply while review is still human and still line-by-line, so the bottleneck shifts from writing to reviewing. Reviewers fatigue into rubber-stamping, and the hallucinated APIs, quietly edited tests, and missed edge cases in AI-generated code slip into main.

These are the same engineering problem: how to introduce sustainable quality constraints into a large, messy system without relying on more people. Today covers four shippable pieces: roll in gradually with strangler / codemod, make the repo only get better with a ratchet, turn rules into CI red/green with guardrails-as-code, and tier review so humans only see what machines can't judge. Core mantra: governance isn't "write it in the wiki and hope everyone notices"; it's turning constraints into automated gates on the diff.

// 01

Gradual Rollout: Strangler & codemod, no big bang

Claim: "enforce a new rule across the whole legacy repo at once" is doomed — thousands of warnings nobody reads = the rule gets turned off. New governance should apply only to new code + the files this change touches.

Background & principle

Martin Fowler's Strangler Fig: don't rewrite; grow new structure at the boundary and gradually "strangle" the old until it can be removed. For governance this yields two concrete tactics. First, changed-files-only gating: new rules (type checks, lint, complexity limits) are enforced only on the files a PR actually modifies. Untouched legacy code is exempt but frozen as it is, so the rule seeps in as developers "fix what they touch" instead of detonating overnight. Second, hand mechanical legacy conversion to codemods (jscodeshift / OpenRewrite / ast-grep): AST-level scripts do the bulk edits instead of humans hand-editing file by file. AI's role is clear: the long tail a codemod can't handle (irregular semantics, cases needing judgment) goes to an agent for bulk edits, but every change still passes the §3/§4 gates.
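As a rough illustration of what an AST-level codemod does, here is a minimal sketch in Python using only the standard library. It is not the jscodeshift or OpenRewrite API, and the names fetch_user / get_user are hypothetical. The transformer renames calls to one function and leaves string literals and similarly named identifiers alone, which a plain text search-and-replace would not.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename bare calls `old(...)` to `new(...)`; everything else is untouched."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new
        self.changed = 0

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
            self.changed += 1
        return node


def rename_call(source: str, old: str, new: str) -> tuple[str, int]:
    """Return (rewritten source, number of call sites changed)."""
    transformer = RenameCall(old, new)
    tree = transformer.visit(ast.parse(source))
    return ast.unparse(tree), transformer.changed
```

One caveat on this sketch: ast.unparse discards comments and original formatting, so a real migration would more likely use a formatting-preserving library such as LibCST, or one of the tools named above.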

In practice

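# strict checks only on files changed in this PR; legacy untouched
# --diff-filter=ACMR skips deleted files, which no longer exist to be checked
CHANGED=$(git diff --name-only --diff-filter=ACMR origin/main... | grep '\.ts$' || true)
if [ -n "$CHANGED" ]; then
  # caveat: when files are passed on the command line, tsc ignores tsconfig.json,
  # and it still type-checks whatever those files import
  npx tsc --noEmit --strict $CHANGED      # new/changed files must pass strict
  npx eslint --max-warnings 0 $CHANGED    # legacy files not in the list → exempt
fi
# mechanical migration via codemod, not by hand:
# npx jscodeshift -t rename-api.js src/   ·   ast-grep --pattern '...' --rewrite '...'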

Core mental model: wire governance to the diff, not to a whole-repo scan. A whole-repo scan produces thousands of violations you won't fix today — it only trains the team to "ignore red," which is how governance dies.
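To make "wire governance to the diff" concrete, here is a small sketch of the file-selection step in Python. It assumes the input is the text printed by git diff --name-status against the base branch; the function name and the default suffixes are illustrative choices, not part of any tool.

```python
def files_to_gate(name_status: str, suffixes: tuple[str, ...] = (".ts", ".tsx")) -> list[str]:
    """Pick the files a PR must bring up to the new standard.

    `name_status` is the output of `git diff --name-status <base>...`.
    Each line is TAB-separated: "<status> <path>", or
    "<status> <old path> <new path>" for renames and copies.
    """
    selected = []
    for line in name_status.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # blank or malformed line
        status, path = parts[0], parts[-1]  # for renames, the last field is the new path
        if status.startswith("D"):
            continue  # deleted files no longer exist, nothing to check
        if path.endswith(suffixes):
            selected.append(path)
    return selected
```

Everything the function does not return stays exempt, which is the whole point: the gate's scope is the PR, not the repo.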

Failure modes: (1) Big-bang rewrite / whole-repo enforcement at once — a violation tsunami, and the team learns to ignore the red CI. (2) Hand-editing mechanical changes a codemod could do — slow and error-prone. (3) Touching a legacy file without "upgrading" it to the new standard — files you've touched should be folded into governance, or the strangling never finishes.
Going deeper · Martin Fowler Strangler Fig Application, martinfowler.com/bliki/StranglerFigApplication · Software Engineering at Google (Large-Scale Changes / codemod), abseil.io/resources/swe-book

// 02

The Ratchet & Guardrails-as-Code: make the repo only get better

Claim: governance's goal isn't "zero violations now" — it's monotonic improvement: tolerate existing debt, forbid new debt, and let debt only decrease.

Background & principle

Ratchet: record the current violation count as a baseline; CI fails only when it's worse than baseline. Existing debt is tolerated, new debt is blocked; each fix lowers the baseline, and it can't ratchet back up — one direction only. Tools like betterer make this a general mechanism. The companion is guardrails-as-code / fitness functions (Building Evolutionary Architectures): write architecture constraints as executable tests — dependency direction, layer boundaries, forbidden imports, complexity limits — run via ArchUnit / dependency-cruiser / import-linter. Governance shifts from "docs + people watching" to "CI red/green," depending on no one's memory. AI's place: have an agent shave the baseline a little per PR (fix one or two legacy items), so the long-tail debt gets ground down inside normal work — not deferred to a "refactor sprint" no one ever has time for.
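A fitness function does not need a dedicated tool to get started. The sketch below, in Python with only the standard library, checks one "forbidden import" constraint by parsing a module's source. The package names app.ui and app.db are hypothetical, and relative imports are deliberately ignored to keep the example short.

```python
import ast


def forbidden_imports(source: str, forbidden_prefix: str) -> list[str]:
    """Return the modules imported by `source` that fall under `forbidden_prefix`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]  # absolute `from x import y` only
        else:
            continue
        for module in modules:
            # match the package itself or anything below it, but not "app.dbx"
            if module == forbidden_prefix or module.startswith(forbidden_prefix + "."):
                hits.append(module)
    return hits
```

In a test suite, one would read every file under the UI package and assert that this returns an empty list; import-linter expresses the same kind of contract as configuration.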

In practice

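// .dependency-cruiser.js
// fitness function: "UI layer must not import the DB directly" as a rule
module.exports = {
  forbidden: [{
    name: "ui-not-to-db",
    severity: "error",                  // error severity makes depcruise exit non-zero
    from: { path: "^src/ui" },
    to:   { path: "^src/db" },          // violation = CI red, not left for review to catch
  }],
};
// run in CI: npx depcruise --config .dependency-cruiser.js src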

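# ratchet: only block "worse than baseline"
BASE=$(cat .debt-baseline)               # e.g. 412
NOW=$(count_violations)                  # placeholder for your linter's violation count
if [ "$NOW" -gt "$BASE" ]; then echo "debt rose: $BASE→$NOW"; exit 1; fi
# if/fi instead of `[ ... ] && ...`: a false test as the last command would fail the step
if [ "$NOW" -lt "$BASE" ]; then echo "$NOW" > .debt-baseline; fi   # lock the new low (commit the file)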
Failure modes: (1) Rules only in the wiki / CONTRIBUTING, not in CI — they effectively don't exist; governance run by human memory inevitably decays. (2) The ratchet lets the baseline rise again — debt climbs back, the ratchet is pointless. (3) A fitness function set too strict at once turns all legacy red — back to §1's tsunami; pair it with a ratchet to tighten gradually.
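One way to guard against failure mode (2) is to ratchet each rule separately, so that new violations of one kind cannot hide behind fixes of another. The following Python sketch assumes violation counts are already available as dictionaries keyed by rule name; the rule names used with it are hypothetical.

```python
def ratchet(baseline: dict[str, int], current: dict[str, int]) -> tuple[bool, dict[str, int], list[str]]:
    """Compare per-rule violation counts against the baseline.

    Returns (ok, new_baseline, regressions). The new baseline only ever moves
    down: for each rule it is min(baseline, current).
    """
    regressions = []
    new_baseline = {}
    for rule in sorted(set(baseline) | set(current)):
        allowed = baseline.get(rule, 0)  # a rule with no baseline tolerates nothing
        found = current.get(rule, 0)
        if found > allowed:
            regressions.append(f"{rule}: {allowed} -> {found}")
        new_baseline[rule] = min(allowed, found)
    return (not regressions, new_baseline, regressions)
```

The caller would fail CI when ok is False and persist new_baseline only on a passing run of the main branch.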
Going deeper · Ford/Parsons/Kua Building Evolutionary Architectures (fitness functions), thoughtworks.com · betterer (general ratchet tool), betterer docs

// 03

Review Without the Bottleneck (I): machines filter first

Claim: AI makes "writing" 5× faster while review stays human and line-by-line → the bottleneck moves to review. The fix isn't more reviewers — it's letting machines filter first, so humans only see what machines can't judge.

Background & principle

Think of code review as Day 34 HITL's confidence routing applied to code: each PR first passes a row of pre-human gates (tests, types, lint, coverage delta, SAST/secret scan, dependency audit, change-impact analysis). A failed gate means deny: the PR never enters the human queue. Only when all gates pass does a human decide whether to review closely. This is especially valuable for AI-generated code, because AI's most frequent failures are largely machine-detectable: hallucinated APIs, turning a failing test into assert True to make CI green (see Day 31's test tampering), and missed null / concurrency edge cases. Let sanitizers, the type system, mutation tests, and secret scanners catch that layer, so human attention isn't wasted on what a machine could have found. The inverse holds too: don't make a human do what a machine does better (a human rarely spots a race condition by reading code, but ThreadSanitizer can detect one at runtime).

PR ──▶ ┌─────────── PRE-HUMAN GATES (fully automated) ───────┐
       │ tests · types · lint · coverageΔ · SAST             │
       │ secret-scan · dep-audit · change-impact analysis    │
       └───────────────┬──────────────────────────────────────┘
         any fail ──▶ DENY: back to author, no human slot
         all pass ─▶ enter §4 risk tiering
                       ├─ high risk → human full review
                       ├─ low risk → sampling + post-merge audit
                       └─ mechanical/low → auto-merge

don't give humans what machines can judge · humans see only "passed the machine but still needs judgment"
Failure modes: (1) Leaving machine-checkable things to human eyes as the line of defense — fatigue + misses, lose-lose. (2) No coverage / secret / SAST gate, hoping a reviewer spots a leaked key or uncovered branch by eye. (3) Gates too slow (30-min CI) — devs bypass and batch-merge, the gate is theater; gates must be fast enough that "didn't pass" is known instantly.
Going deeper · Google Code Review Developer Guide (what to look for / speed), google.github.io/eng-practices/review · This site, Day 34 Human-in-the-Loop (the general form of confidence routing)

// 04

Review Without the Bottleneck (II): risk tiers · sampling · provenance

Claim: for code that passed the automated gates, let risk decide review intensity — high-risk full review, low-risk sampling + post-merge audit. A flat "review every PR closely" is both slow and fake.

Background & principle

Three things shrink the human review load to a level where careful review is actually possible. ① Risk scoring: tier by the paths touched (payment / auth / data migration / crypto = high; copy / tests / comments = low), layered with diff size and blast radius (see Day 31). High-risk changes force full review plus a CODEOWNERS owner; low-risk changes get sampled human review plus a full post-merge audit. ② Trust tiering: give "AI authors" and "human authors" different default review intensity, adjusted by track record (same as Day 34's audit-driven retuning): a newly wired agent is reviewed strictly, then relaxed once it has proven itself. ③ Provenance / attestation: record which code was AI-generated, and with which model / prompt (SLSA-style). When something breaks you can pinpoint it and bulk-recall "all code from one bad prompt", which is a necessity at AI scale, not a compliance ornament.
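The track-record adjustment in ② could be sketched as a smoothed clean-merge rate. This is one possible formula, not the author's: the prior values and the prior_weight of 20 are assumptions chosen for illustration, and the signature takes raw counts rather than an author object.

```python
def author_trust(clean_merges: int, defective_merges: int, prior: float, prior_weight: int = 20) -> float:
    """Smoothed clean-merge rate in [0, 1].

    `prior` is the starting trust (hypothetical: lower for a newly wired agent,
    higher for a senior human). `prior_weight` is how many merged PRs of
    evidence it takes to outweigh that starting point.
    """
    total = clean_merges + defective_merges
    return (prior * prior_weight + clean_merges) / (prior_weight + total)
```

With a prior of 0.5, a new agent starts below a 0.7 threshold and therefore gets full review; a long run of clean merges lifts it above the threshold, and defects pull it back down.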

In practice

def review_policy(pr):
    paths = pr.changed_paths
    high = any(p.startswith(("src/payment","src/auth","migrations/")) for p in paths)
    risk = "high" if high or pr.diff_lines > 400 else "low"
    trust = author_trust(pr.author)         # new AI < senior human, adjusts with record
    if risk == "high": return "full_review + CODEOWNERS"
    if trust < 0.7:      return "full_review"
    return "sample_5pct + post_merge_audit"   # low risk, high trust → sample

# carry origin metadata on the PR for tracing / recall
provenance = { "ai_generated": True, "model": "claude-opus-4-8", "prompt_id": "refactor-v3" }
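Once PRs carry this metadata, bulk recall becomes a simple query. The Python sketch below assumes merged PRs are available as dicts with a number field and an optional provenance dict shaped like the one above; that record layout is an assumption for illustration.

```python
def prs_to_recall(merged_prs: list[dict], prompt_id: str) -> list[int]:
    """Return the numbers of merged PRs whose provenance names `prompt_id`."""
    return [
        pr["number"]
        for pr in merged_prs
        if pr.get("provenance", {}).get("ai_generated")
        and pr["provenance"].get("prompt_id") == prompt_id
    ]
```

PRs with no provenance at all are skipped here, which is exactly the blind spot the provenance requirement is meant to close.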
Failure modes: (1) One review intensity for all PRs — high-risk under-reviewed, low-risk wastes human effort. (2) Trusting AI code as much as human code — AI's error distribution differs (more "confidently wrong"), so its default trust should be lower. (3) No provenance — one bad prompt mass-produces 200 risky PRs and you can't locate which batch or bulk-recall it afterward.
Going deeper · SLSA Supply-chain provenance / attestation, slsa.dev · This site, Day 31 Personal AI Safety (blast radius / reversibility) · Day 34 HITL (audit-driven retuning)

// Hands-on · give a legacy repo an "AI-ready governance chassis"

A one-week plan: pick a messy large repo you maintain, install — in this order — a governance layer that doesn't lean on more people, and verify it actually frees review from being the bottleneck.

  1. Ratchet: baseline your most painful rule (e.g. the number of TypeScript "any" usages, or cyclomatic complexity); CI blocks new debt and locks in the new low whenever the count improves.
  2. Changed-files gate: enforce new rules only on files the PR touches; legacy exempt and frozen.
  3. Fitness function: encode one architecture constraint (dependency direction / forbidden import) as a CI test.
  4. Pre-human gates: tests + types + secret scan + coverageΔ fully automated; failing them never enters the human queue.
  5. Risk routing: write review_policy() — high-risk full, low-risk sampling + post-merge audit, AI authors stricter by default.
  6. Provenance: add "AI-generated? / model / prompt_id" fields to the PR template; build trace & recall capability.
  7. Retro: a week later, see where reviewers spent their time — if it's still on "what a machine could have caught," your §3 gates are insufficient; add gates, don't add people.

Once you've built this, you'll instinctively ask of any "AI writes code" workflow: how much noise did the machine gates filter, is human review risk-tiered or flat, can an incident be traced to a specific prompt — instead of agonizing over the wrong question, "should I trust AI-written code." The right question is always: which layer holds the trust, and what backs it up.

// ENGLISH GLOSSARY

Strangler Fig
Gradual migration that grows new structure at the old system's boundary and replaces it incrementally instead of rewriting. Named by Martin Fowler.
Codemod
AST-level scripted bulk code change (jscodeshift / OpenRewrite / ast-grep), replacing hand-editing file by file.
Ratchet
Baseline the current violation count; block only new debt, lock in improvements, so debt decreases monotonically.
Fitness Function
Architecture / quality constraints written as executable tests, turning governance into CI red/green.
Guardrails-as-Code
Rules live in CI, not the wiki — independent of anyone's memory.
Pre-human Gate
Fully automated gates before human review (tests / types / scans); failing them never enters the human queue.
Risk-based Review
Review intensity set by change risk: high-risk full review, low-risk sampling.
Sampling Review
Sample human review of low-risk changes + full post-merge audit, instead of inspecting each.
Provenance / Attestation
Record who / which model / which prompt generated code, enabling tracing and bulk recall. Modeled on SLSA provenance.
CODEOWNERS
A review mechanism binding high-risk paths to designated owners.

// Deeper Thinking

A ratchet tolerates existing violations and only blocks new ones — won't tech debt stay stuck at a high baseline and never get paid down?
It will, if you only block and never repay. The ratchet's discipline is "never get worse," but "get better" needs a separate driver. Two common approaches: (a) make lowering the baseline a soft norm of every PR (fix one or two items in files you touch), grinding debt down in normal traffic; (b) the AI era has a stronger lever — dispatch an agent to bulk-clear legacy (small batches, each through the full §3/§4 gates), turning the "refactor sprint no one has time for" into hundreds of low-risk small PRs. The ratchet guarantees no regression; agent traffic drives progress. Miss either and you stall (ratchet-only) or get offset by new debt (refactor-only).
Automated gates catch AI's crude errors, but AI's most dangerous output is the "looks-totally-correct subtle logic error" — machines miss it and sampling misses it too. What then?
This is the system's real boundary: gates + sampling only lower the cost of detectable errors and low-risk paths; sampling cannot be relied on to catch subtle logic errors on high-risk paths. That's why §4's risk tiering is the key valve: high-risk paths are never sampled, they are always fully reviewed, and precisely because machines have filtered out the low-risk noise, humans have the bandwidth to slow down and really read high-risk PRs. In other words, this isn't "review less"; it's "redeploy the saved attention to where a human brain is genuinely needed." Subtle logic errors are still guarded by humans; you've just made sure those humans aren't drowning in trivia. Strong property-based / mutation tests add another layer, but don't replace full review of high-risk paths.
Sampling review = accepting that some fraction of AI code reaches main without close human inspection. Doesn't that contradict "AI code needs more review"? How is the risk math not self-deception?
No contradiction, provided sampling is used only in the low-risk × high-trust quadrant and is backstopped by post-merge audit. The logic: on low-risk paths (copy / tests / internal tools), even an error has a small blast radius and is reversible (see Day 31), so "sampling + full post-merge audit + fast rollback" has lower expected cost than full review of each. Two places it becomes self-deception: one, distorted risk tiering (mislabeling a truly high-risk path as low) — so keep the risk table conservative, defaulting unclassified to high; two, audit in name only (you sampled but nobody actually audits) — then it degrades into "no review." Sampling's legitimacy rests entirely on "audits actually run and rollback is actually fast," else it's an excuse to cut corners.
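"Defaulting unclassified to high" can be sketched by listing low-risk paths explicitly and treating everything else as high risk. The prefixes below describe a hypothetical repo layout, not a recommendation.

```python
LOW_RISK_PREFIXES = ("docs/", "copy/", "tests/", "tools/internal/")  # hypothetical layout


def classify_risk(changed_paths: list[str]) -> str:
    """A PR is low risk only if every path is explicitly listed as low risk."""
    if not changed_paths:
        return "high"  # nothing to classify: stay conservative
    if all(path.startswith(LOW_RISK_PREFIXES) for path in changed_paths):
        return "low"
    return "high"
```

The design choice is an allowlist rather than a denylist: a newly created directory is high risk until someone deliberately adds it to the low-risk table.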
Provenance marking code as "AI-generated" — won't it become a blame tool (blame the AI when things break) rather than a driver of improvement?
Depends what you do with it. Used as "evidence" = blame ("the AI wrote it, not me"); used as an observability signal = improvement. The concrete improvement move: aggregate defect rates by model / prompt_id — if one prompt template's output has a markedly higher defect rate, go fix that prompt (back to Day 34's audit-driven retuning, and the future Day 47 prompt-as-code); if a high-risk path's AI defect rate is high, tighten its trust threshold and force full review. Provenance's value isn't "assigning fault" — it's "making AI output quality a measurable, regressable quantity." Incident recall is the floor use; driving prompt / routing improvement is its real leverage.
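Aggregating defect rates by prompt_id could look like the Python sketch below. It assumes each merged PR record carries a prompt_id (absent for human-authored PRs) and a caused_defect flag set when a later defect is traced back to it; both field names are hypothetical.

```python
from collections import defaultdict


def defect_rate_by_prompt(merged_prs: list[dict]) -> dict[str, float]:
    """Share of merged PRs per prompt_id that were later linked to a defect."""
    totals = defaultdict(int)
    defects = defaultdict(int)
    for pr in merged_prs:
        prompt_id = pr.get("prompt_id")
        if prompt_id is None:
            continue  # human-authored or unlabelled
        totals[prompt_id] += 1
        if pr.get("caused_defect", False):
            defects[prompt_id] += 1
    return {pid: defects[pid] / totals[pid] for pid in totals}
```

Rates computed from a handful of PRs are noisy, so a threshold on sample size is worth adding before using the numbers to retune a prompt or a trust level.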
How does this code governance relate to Day 34 HITL and Day 31 safety? When do you use which?
All three are the same "tiered trust" idea projected onto different objects. Day 31 governs an agent's runtime actions (kill switch / sandbox / irreversible gate) — defending against harm while the agent executes. Day 34 HITL is the general approval orchestration (when to ask, how to go async, how to take over) — the mechanism layer. Today instantiates that mechanism onto one specific scenario: "code entering the repo." Pre-human gate = confidence routing, risk tiering = the impact axis, provenance = the audit. Which to use depends on the object: constraining an agent's running actions → Day 31; building a general human-review workflow → Day 34; governing the specific flood of "lots of AI code into a legacy repo" → today. They stack; they're not mutually exclusive.

// Further Reading