When Code Becomes Cheap: Theory and Evidence for the AI-Native Transformation of Software Organizations

TL;DR

Transaction costs don't disappear — they migrate from production/coordination to integration/verification. This single pivot explains METR's slowdown, DORA's stability penalty, and the restructuring of human roles, and yields the sequence for high-reliability legacy organizations: enter through migrations → guardrails before scale → capitalize verification infrastructure. The essay closes with six testable claims.

86 sources · 166 adversarial votes · 4 theoretical spectra · 6 testable claims

Every empirical citation in this essay is graded. Theoretical claims went through adversarial verification during the research phase — three independent verifiers per claim, checking quotes against primary sources and searching for counter-evidence. The 13 load-bearing empirical figures cited in the text (METR, DORA, the Google/Airbnb migrations, the Copilot RCTs, the Knight Capital and TSB cases, and others) went through a second three-vote adversarial pass after drafting: 12 held, 1 was corrected per the verifiers' notes, 0 were overturned. Methodological caveats — correlational vs. causal, self-reported, vendor-sourced, n=1 — are stated inline. A full source index appears at the end.

0. A paradox that needs explaining

In 2025, DORA's survey of nearly 5,000 technology professionals produced a set of numbers that contradict each other: 90% of respondents use AI at work, and more than 80% believe it has made them more productive. At the same time, the negative relationship between AI adoption and software delivery stability held for the second consecutive year — "without robust control systems, like strong automated testing, mature version control practices, and fast feedback loops, an increase in change volume leads to instability" (DORA 2025; cross-sectional, correlational; adversarially verified).

The gap between individual perception and organizational metrics is not survey noise. It was measured directly, in a randomized controlled trial. In 2025, METR recruited 16 experienced open-source maintainers (averaging 5 years of contribution history on their own repositories; the repositories averaged over a million lines of code and roughly ten years of history) and randomly assigned each of 246 real tasks to an AI-allowed or an AI-forbidden condition. The result: allowing AI increased task completion time by an average of 19% (roughly +2% to +40%, the 95% CI as read from the paper's figure). The developers had predicted beforehand that AI would make them 24% faster, and afterward, having actually been slowed down, they still estimated they had been sped up by 20%. Subjective perception and objective measurement pointed in opposite directions, roughly 40 percentage points apart (a perceived 20% speedup against a measured 19% slowdown). (This result has a sequel that must be read with it: in METR's February 2026 follow-up, the original cohort's point estimate flipped to roughly an 18% speedup, but the confidence interval still crossed zero; 30–50% of participants admitted withholding tasks they didn't want to do without AI, so METR called the new data "an unreliable signal" of current productivity effects, said the estimates are "likely a lower bound," and is revising the study design. The study's most durable contribution is not the 19% point estimate; it is the hard evidence of the perception-reality gap, and the list of slowdown mechanisms. Both return later in this essay.)

Put these together and you get the paradox this essay sets out to explain: individuals genuinely feel accelerated, while organizations genuinely fail to get faster — and become less stable. If your transformation decisions run on engineers' gut feel, vendor demos, and velocity charts, you are most likely optimizing the perception, not the reality.

This paradox is not an AI-specific curiosity. It has a rigorous theoretical explanation, and that explanation can be run forward to derive how organizations should change. That is the route this essay takes: the theoretical pivot first (Section 1), then where human roles move (Section 2) and where organizational boundaries move (Section 3); then the hardest case — organizations maintaining extremely complex legacy codebases at very high QPS under strict reliability requirements, BigQuery-class organizations (Section 4); and finally, why traditional organizations struggle to transform and how to sequence the path (Section 5).

1. The pivot: the cost didn't disappear — it moved

1.1 Starting from Coase

The framework Coase laid out in "The Nature of the Firm" (1937) still has no better replacement: firms exist because coordinating through the market — search, bargaining, contracting, monitoring — is costly; when internal hierarchy coordinates more cheaply than market transactions, activity is pulled inside the firm. Transaction costs determine the boundary of the firm.

AI agents act directly on every component of transaction cost: search, contracting, monitoring, settlement, and verification can all be automated. The NBER chapter "The Coasean Singularity?" (Shahidi, Rusak, Manning, Fradkin & Horton, 2025) argues that AI agents execute price discovery, negotiation, and compliance monitoring at near-zero marginal cost, and that the firm–market boundary will therefore move. This direction-neutral claim — the boundary will move — survived adversarial verification.

But the popular inference — "transaction costs fall, therefore firms shrink and hierarchy dies" — does not survive contact with the evidence. A 2025 analysis in California Management Review (Warin, "From Coase to AI Agents") makes three counter-arguments, all of which passed adversarial verification:

AI agents lower transaction costs at the micro level, but the proliferation of agents creates new coordination costs at the organizational level — duplicated effort, conflicting processes, organizational entropy. Coase's logic hasn't failed; the cost has been transferred from production and coordination to integration.
Relying on external platforms to deploy agents creates a new form of lock-in: organizations unknowingly outsource organizational coherence itself to platform vendors (the author calls it "digital feudalism"). Corroborating industry data: a16z's 2025 enterprise survey found switching costs rise significantly in agentic-workflow scenarios; in a 2026 Zapier survey, 81% of business leaders worried about AI vendor dependence and only 6% believed they could switch painlessly.
Ungoverned agent adoption threatens the firm's existence as a coherent entity — an AI-native organization needs a centralized agent-governance layer. That is not bureaucracy; it is a Coasean necessity.

In the same literature stream, "The Agentic Economy" (2026) extends transaction-cost economics with the concept of coordination friction (verified; a conceptual framework, not an empirical estimate): agentic systems lower the old transaction costs while manufacturing new frictions — model error, protocol failure, audit opacity, human-review burden. One sentence from the paper deserves quoting in full: "A system may act faster but become less auditable."

So the robust conclusion of this first spectrum fits in one sentence: transaction costs do not disappear; they migrate from production and interpersonal coordination to integration, verification, and platform relationships. That is the pivot of this essay.

1.2 A seductive strong version that has already been falsified

One more radical version deserves explicit demolition, because it circulates widely: "The coordination frictions that limit firm size — communication bandwidth, tacit knowledge, shirking incentives — are human-specific. They don't apply to AI agents, so agentified firms can break through the traditional scale ceiling." The source of this claim (Hadfield & Koh, 2025) actually phrases it as a research-agenda conjecture, hedged with "seem" — and current empirical work points the other way. Google Research/MIT/DeepMind's study of scaling agent systems (2025) measured coordination rounds growing as a power law in the number of agents (exponent 1.724), coordination overheads of 263–515%, and per-agent reasoning capacity becoming "too thin to be usable" beyond 3–4 agents. The MAST failure taxonomy (Cemri et al., 2025), built from over 1,600 execution traces, attributes roughly 79% of failures to specification, coordination, and verification gaps.

Coordination friction changed form; it did not disappear. Bounded context maps to bounded rationality, misalignment maps to opportunism and agency cost, error propagation maps to team-production moral hazard — and these frictions bite at scales far smaller than any human firm.
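To get a feel for what an exponent of 1.724 implies, the sketch below computes how coordination rounds grow relative to a baseline as agents are added. It assumes rounds scale as c · n^1.724 with an unknown constant c that cancels in ratios; the function name and the example agent counts are illustrative and do not come from the study.

```python
# Assumption: coordination rounds scale as c * n**EXPONENT; the constant c
# is unknown here and cancels when two agent counts are compared.
EXPONENT = 1.724  # exponent reported in the agent-scaling study cited above


def relative_rounds(n_from: int, n_to: int, exponent: float = EXPONENT) -> float:
    """Ratio of coordination rounds at n_to agents versus n_from agents."""
    if n_from < 1 or n_to < 1:
        raise ValueError("agent counts must be positive")
    return (n_to / n_from) ** exponent


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"1 -> {n} agents: rounds x{relative_rounds(1, n):.2f}")
```

Under this assumption, doubling the agent count multiplies coordination rounds by roughly 3.3, which outpaces the doubling of nominal capacity.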

Principal-agent theory, incidentally, gets a second life here: the AI alignment problem stands in close analogy to incomplete-contract theory (Hadfield-Menell & Hadfield, AIES 2019). Note that this is an analogy, not an equivalence: human incomplete contracts are backstopped by external norms and legal enforcement, and AI agents have no such backstop. Reward specification is the inevitably incomplete contract between principal and agent: "even though AI agents are optimizers, we cannot be sure what they are optimizing." Delegating work to agents doesn't eliminate the principal-agent problem; it transforms it. Alignment, monitoring, and verification costs are the new agency costs.

1.3 The software-engineering mirror: Brooks rewritten, not repealed

The economics of cost transfer has a precise mirror in software engineering theory. Brooks's n(n−1)/2 communication cost from The Mythical Man-Month is the most famous law of software organization. When agents take over most of the coding, does the O(n²) term vanish?

Three independent authors converged on the same answer in 2025–2026 (all high-reputation practitioner accounts — directionally consistent, but not controlled studies):

Wes McKinney (creator of pandas): the coordination problem "doesn't disappear — it changes shape," becoming the human work of reconciling contradictory plans produced by multiple agent sessions that share no persistent memory and no common understanding — a crew of temp workers who never meet and forget everything each morning. In the classic dichotomy of Brooks's "No Silver Bullet" (essential complexity is the difficulty inherent in the problem itself; accidental complexity is the friction added by tools and process): AI largely eliminates accidental complexity, but agents cannot reliably recognize which difficulty is essential, and they generate new accidental complexity at machine speed (defensive boilerplate, overengineering). He also reports an empirical inflection point — a "brownfield barrier" at roughly 100 KLOC where agents begin "chasing their own tails," with visibly worse struggles in million-line codebases. "Agents are accelerating the 'easy part' while paradoxically making the 'hard part' potentially even more difficult."
Forret transplants Brooks's Law term by term into the agent era: "Adding autonomous AI agents to a late software project makes it later." Brooks's original mechanism had two parts — new hires need ramp-up time from existing staff, and every added person multiplies communication paths. With agents, neither cost disappears; each is renamed. "Onboarding" becomes preparing context for the agent — assembling task background, code conventions, and undocumented tacit knowledge into inputs the agent can digest. "Communication" becomes the human work of reviewing and integrating agent output. And that review often costs more than reviewing a human colleague's code, because agent errors have a particular deceptiveness: syntactically perfect and stylistically polished, but logically flawed and architecturally inconsistent. A human junior's mistakes are usually visible at a glance; an agent's mistakes hide behind a professional surface, and you have to genuinely understand every line to catch them.
O'Reilly Radar: "When generating code is free, knowing when to say 'no' is your last defense."

Conway's Law is likewise amplified rather than repealed: where domain boundaries are clean, agents reinforce them; where they are blurry, agents manufacture coupling — AI amplifies the organization's existing patterns. This is strictly congruent with DORA 2025's central, verified finding: "AI doesn't fix a team; it amplifies what's already there."

The Team Topologies cognitive-load lens states the same conclusion a third way: implementation load (syntax, API recall) collapses, while integration and coordination load rises. One practitioner hypothesis puts it bluntly (note: none of its numbers have an empirical source): "A 5-person AI-augmented team producing at the rate of 12 traditional engineers creates 12 engineers' worth of integration surface area."
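Brooks's n(n−1)/2 count from the start of this section makes the practitioner hypothesis easy to put in numbers. The sketch below assumes, as the hypothesis does and without empirical support, that integration surface tracks output-equivalent headcount rather than actual headcount; the function name is illustrative.

```python
def communication_paths(n: int) -> int:
    """Brooks's pairwise communication channels among n contributors."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


if __name__ == "__main__":
    # 5 actual people versus the 12 "engineer-equivalents" of output.
    for n in (5, 12):
        print(f"{n} contributors -> {communication_paths(n)} pairwise paths")
```

Five people share 10 pairwise paths, while 12 contributors would share 66. If the hypothesis holds, the five humans inherit something closer to the larger figure in integration work.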

1.4 What the pivot explains: back to the opening paradox

Now the paradox of Section 0 resolves. Cost-transfer theory explains all four datasets at once:

Schematic: as generation cost approaches zero, cost and the bottleneck move to verification/integration (proportions illustrative)

METR's slowdown. For experienced developers in mature codebases, generation was never the bottleneck; verification is. The screen recordings show the AI-allowed group spending roughly 9% of their time reviewing and cleaning up AI output, with active coding time displaced by prompting, waiting, and reviewing. The speedup on the generation side is eaten by new verification-side costs, and in the setting where verification is most expensive (experts plus complex legacy), the balance goes negative.
DORA's stability penalty. Integration cost surfaces as instability. Amplified change volume punches through weak downstream control systems — that is precisely the transferred cost, made visible.
Faros's telemetry (10,000+ developers; vendor research): high AI adopters merge 98% more PRs — but PR size is up 154%, review time up 91%, and there is no correlation between AI adoption and cycle time. Individual throughput rises, organizational delivery speed doesn't; the cost piles up in review.
GitClear's longitudinal code-quality data (211 million changed lines, 2020–2024; vendor dataset, correlational): cloned/copy-pasted lines rose from 8.3% to 12.3% of changes, while refactoring-related (moved) lines fell from 25% in 2021 to under 10% in 2024 — the first year on record in which copy/paste lines exceeded moved lines. Generation got cheap, and structural maintenance got crowded out.

Section 1 in one sentence: AI pushes the cost of writing software toward zero and pushes the cost of confirming it is right into the organizational bottleneck. Every AI-native organizational design question is, at bottom, a question of repositioning around this cost transfer.

2. Where the humans go: the division between prediction and judgment

2.1 An economics framework that survived twelve verification votes

For the question "what do people do once agents do the work," the most solid theoretical source available is Agrawal, Gans & Goldfarb's prediction-judgment framework (NBER, 2018; the authors reaffirmed in HBR in 2024 that the framework applies to generative AI). It is the most thoroughly verified theory in this research base: 12+ adversarial votes, all passed, quotes checked word-for-word.

Two core propositions:

"We interpret recent developments in the field of artificial intelligence (AI) as improvements in prediction technology" — not general intelligence. This is not a dismissal; it is what makes AI tractable for economic modeling. When the cost of prediction falls, complements to prediction appreciate and substitutes depreciate.
"Prediction and judgment are complements as long as judgment is not too difficult." Judgment means the costly activity of determining payoffs across states — setting goals, defining the reward function, deciding what counts as good.

Mapped onto software engineering (note: this mapping is our inference — the 2018 paper never mentions software development): code generation, completion, and execution sit on the prediction side; requirement specification, architectural trade-offs, acceptance criteria, and "is this change worth the risk" sit on the judgment side. Prediction gets cheap; judgment appreciates — human value concentrates in specification and acceptance.

The framework carries a reverse clause that citers routinely drop: if judgment itself can be encoded into state-contingent machine behavior — if "what counts as good" can be written as rules — the complementarity flips into substitution. The same authors' companion paper states plainly that "not all human judgment will be a complement to AI." "Humans will always keep judgment" is not what this theory promises. It promises a moving boundary.

2.2 The evidence: directionally consistent, but vendor-sourced

Anthropic's analysis of roughly 400,000 Claude Code sessions from 235,000 users (vendor self-study; not adversarially verified) shows a division of labor consistent with the framework: humans retain about 70% of planning decisions while delegating about 80% of execution decisions to the agent — the human/machine split falls almost exactly along the planning/execution axis. More interesting is the role of expertise: a single instruction from an expert triggers about 12 actions and 3,200 words of output, versus 5 actions and 600 words for a novice; and coding professionals and non-coding professionals achieve nearly comparable verified success rates on coding tasks (34% vs. 29%, among sessions that produced code). Execution skill is depreciating; domain knowledge is appreciating — in the study's words, "the ability to steer Claude toward success comes more from command of a domain than from the ability to write code."

2.3 A boundary test: how far can one person plus a fleet of agents go

The Itaú Unibanco case (arXiv 2605.18461, 2026) is the most complete first-hand record so far on the minimum-team-size frontier: in a regulated brownfield banking environment, one staff engineer with 8+ years of experience, working under Spec-Driven Development with three AI tools covering four agent roles, delivered in 3 sprints a project scoped for a 4-person squad over 6 sprints. The quality and cost figures were adversarially verified: 90% of AI-generated code accepted in first-round review without structural changes; 113/113 integration tests passing; zero post-release defects; direct cost down 88%; effort per business capability point down from 8.93 to 4.35 hours (−51%).

The caveats are equally clear: n=1; the engineer is an author of the paper; the baseline is a historical project, not a parallel control; and the paper itself calls the setup "a boundary test rather than a target operating model."

But the most theoretically valuable part of the case is not the numbers — it is the mechanism, which fits Section 1 exactly: the speedup came not from one person coding faster, but from compressing the "outer loop" of cross-functional coordination. When product analysis, architecture, security, and QA are embodied in agents directed by a single engineer, cross-discipline round-trips drop from days to minutes. And what constrained success was not model capability but specification quality and institutional knowledge: vague specs produce garbage regardless of tooling, and undocumented legacy integration contracts were the largest source of rework.

Three lines of evidence — theory, telemetry, case study — point to the same organizational-design conclusion: the scarce role in an AI-native organization is not "programmer who can use AI" but someone who can compress domain knowledge into high-quality specifications and who holds acceptance authority. If your hiring, promotion, and development systems still price people on coding execution, they are pricing an asset in decline.

3. Will organizations get smaller? An honest stalemate

"AI lets ten people do the work of a hundred; firms will fragment into micro-specialist units" — this is the most popular inference in the AI-native narrative, and it is also the inference most thoroughly demolished by adversarial verification in this research. This section offers no conclusion, only the actual state of the dispute — the evidence itself is unresolved, and any one-sided claim outruns what it can support.

The shrinkage camp's star paper did not survive verification. "The Headless Firm" (2026) proposes an hourglass model — generative interfaces on top, a protocol waist in the middle, a market of micro-specialized execution agents below. Its three core assertions ("integration cost collapses from O(n²) to O(n)," "the hourglass is a stable equilibrium," "firm-size distributions bifurcate by knowledge-decay rate") were all refuted as statements of fact: the paper is an unreviewed preprint; both authors are founders of Mantix, an agentic-infrastructure company (a disclosed conflict of interest — the hourglass architecture is their product thesis); the paper is entirely conditional model reasoning with zero empirics; and the multi-agent failure evidence (MAST) shows verification and coordination costs still growing with interaction count. What can be legitimately cited are its two falsifiable predictions: if a protocol waist holds, the marginal integration cost of adding an execution provider should be approximately constant; and the coordination-cost-to-throughput ratio (C/T) should stay stable as the ecosystem grows — a sharp rise signals the model's collapse. These are worth remembering: they are observable criteria for judging whether the great unbundling is actually happening.

The expansion camp, meanwhile, holds peer-reviewed evidence. Babina, Fedyk, He & Hodson (Journal of Financial Economics, 2024) find that AI investment concentrates in larger firms, drives their growth, and increases industry concentration — including in fast-moving technology sectors. No fragmentation has been observed to date. On the theory side, the Chen/Elliott/Koh capability-formation model (JET, 2023; transmission verified) predicts a phase transition: when AI lowers the organizational cost of maintaining diverse capabilities and different markets begin to value similar capabilities (e.g., via transfer learning), the economy flips discontinuously from many specialized small firms to a few cross-industry giants. Note the two joint conditions — and note that this is a theoretical prediction, standing as a competing equilibrium to the shrinkage story.

A side observation, for what it is worth: Anthropic and OpenAI generate roughly $14M and $6.5M of revenue per employee respectively, exceeding every technology company in the Forbes Global 2000 (Epoch AI). But this is correlational, and too many confounders separate emerging-monopoly economics from organizational form.

What can be safely asserted is the mechanism, not the direction: transferred transaction costs will move firm boundaries. Which way they move depends on the scaling law of verification cost (if verification scales sublinearly with task volume, unbundling has a chance; if it still scales with interaction topology, concentration continues) and on accountability regimes (professional sign-off, regulated workflows, evidence chains — Hydari & Muzaffar argue this is a necessary condition independent of verification cost). Anyone who claims to know the direction is selling certainty beyond the evidence.

The operational takeaway: rather than betting on "micro-teams" versus "platform giants," watch your own observables — how does your per-change verification cost grow with change volume? Is your agent ecosystem's C/T ratio rising or stable? Both are measurable from your own telemetry.
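Both observables can be estimated from data most teams already collect. The sketch below fits the exponent of total verification cost against change volume with a log-log least-squares slope, and computes a per-period C/T ratio. The telemetry values, the choice of review hours as the cost measure, and the function names are all hypothetical.

```python
import math


def scaling_exponent(volumes, costs):
    """Least-squares slope of log(cost) against log(volume).

    A slope below 1 means total verification cost grows sublinearly with
    change volume; a slope above 1 means it grows superlinearly.
    """
    if len(volumes) != len(costs) or len(volumes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(v) for v in volumes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("volumes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def ct_ratio(coordination_hours, throughput):
    """Coordination cost per unit of delivered throughput for one period."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return coordination_hours / throughput


if __name__ == "__main__":
    # Hypothetical monthly telemetry: merged changes and review hours.
    changes = [100, 200, 400, 800]
    review_hours = [50, 120, 290, 700]
    print(f"exponent: {scaling_exponent(changes, review_hours):.2f}")
    print([round(ct_ratio(h, c), 2) for h, c in zip(review_hours, changes)])
```

An exponent drifting above 1, or a C/T series that climbs period over period, is the kind of signal the section describes. The thresholds that should trigger action are a local judgment and are not supplied by the sources cited here.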

4. The special case: when what you run is BigQuery — AI-native transformation for high-reliability legacy organizations

The first three sections apply to all software organizations. But one class deserves separate treatment: organizations whose codebases are extremely complex and deeply legacy, which serve very high QPS, and whose reliability requirements are unforgiving — Google BigQuery/Spanner, large payment and trading systems, core banking. Nearly all the AI-native evangelism these organizations encounter was written from a startup context, and the startup playbook is not merely inapplicable to them — it is systematically dangerous. This section argues why, and what a viable path looks like.

4.1 Why the startup playbook doesn't transfer: the risk geometry is different

Perrow's Normal Accident Theory (Normal Accidents, 1984) supplies the right lens: systems that combine tight coupling (millisecond propagation, little slack) with interactive complexity (tens of millions of lines of legacy, webs of hidden dependencies) sit in the highest-risk quadrant; de-escalating on either dimension — looser coupling or more linear interactions — lowers the rate of catastrophic error.

Perrow risk lens (schematic): the same AI tool carries very different risk in the two quadrants; the danger quadrant adds short latency, where systems slide to catastrophe faster than human cognition can track

The theory's limits need stating up front. NAT and its rival, High Reliability Organization theory (HRO), share a falsifiability problem — the thresholds of coupling and complexity cannot be objectively measured (verified). Under the strictest technical definition, arguably no real accident fully matches Perrow's "normal accident" criteria; Le Coze's (2015) reconstruction is that "Perrow was right that accidents are normal, but for the wrong reasons — the roots lie in organizational and governance dynamics, not the technical architecture itself." Accordingly, this essay uses Perrow as a risk-classification lens, not an accident law — strong causal claims of the form "introducing AI will cause accidents" are not supported by the evidence, and this essay does not make them.

Through that lens, the difference between startups and high-reliability organizations is geometric. A greenfield system naturally sits in the loose-coupling, low-complexity quadrant: error budgets are generous, blast radii are small, and low reliability can be traded for learning speed. A BigQuery-class system sits in the dangerous quadrant — and in its short-latency corner, where the system slides from apparently normal into catastrophe faster than human cognition can track (a verified theoretical proposition). Knight Capital calibrates the magnitude (an analogy case — no AI involved): on August 1, 2012, a deployment error — new code rolled to only 7 of 8 servers, with the unpatched server activating Power Peg, dead code retired eight years earlier but never deleted — lost over $460 million in about 45 minutes, and the firm lost its independence within a year. Note the shape of the causal chain: dead code, plus a reused flag, plus pre-open automated error notifications that no one acted on (the SEC specifically noted that the 97 error emails "were not designed to be system alerts" and that staff did not routinely review them — the existence of notifications and the functioning of alerts are different things). An organization generating code at AI speed without dead-code governance is mass-producing Power Pegs.
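One guardrail the Knight Capital chain suggests is a pre-activation check that refuses to enable new behavior unless every server reports the same build. The sketch below is illustrative only: the server names, build labels, and function names are hypothetical, and it does not describe Knight's systems or any real deployment tool.

```python
from collections import Counter


def inconsistent_servers(build_by_server: dict[str, str]) -> list[str]:
    """Return servers whose build differs from the most common build."""
    if not build_by_server:
        return []
    majority_build, _ = Counter(build_by_server.values()).most_common(1)[0]
    return sorted(s for s, b in build_by_server.items() if b != majority_build)


def safe_to_activate(build_by_server: dict[str, str]) -> bool:
    """Allow activation only when the fleet is non-empty and uniform."""
    return bool(build_by_server) and not inconsistent_servers(build_by_server)


if __name__ == "__main__":
    # Hypothetical fleet: seven servers updated, one left on the old build.
    fleet = {f"srv{i}": "v2" for i in range(1, 8)}
    fleet["srv8"] = "v1"
    print("stragglers:", inconsistent_servers(fleet))
    print("safe to activate:", safe_to_activate(fleet))
```

A check like this addresses only the partial-rollout link in the chain. Dead code and reused flags need their own governance, and a notification that nobody reads is not an alert.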

The same AI tool carries a completely different risk exposure in the two quadrants. That is the first-principles version of "don't copy the startup playbook."

4.2 The benefit side fails too: METR's setting is your daily reality

For these organizations the startup playbook fails on the benefit side as well. Recall METR's experimental setting: experienced developers (averaging 5 years on the repository), large mature codebases (million-plus lines, a decade of history), and demanding contribution standards (documentation, test coverage, lint discipline). That is not an arbitrary experimental context; it is the daily reality of a high-reliability legacy organization. It is exactly there that the 19% slowdown was measured (CI roughly +2% to +40%; early-2025 tooling; follow-up caveats in Section 0). The paper's factor analysis supplies the mechanisms: the more familiar the developer was with the task, the worse the slowdown; AI could not exploit tacit repository context; and high quality standards inflate verification cost.

Why bigger context windows won't save you: Chroma's Context Rot evaluation across 18 frontier models shows performance degrading steadily with input length even on simple tasks, with semantically similar but irrelevant distractors compounding the degradation — and a giant codebase is precisely a sea of similar-but-irrelevant code patterns. Stuffing ten million lines into a window is not a plan.

Google's internal counter-move points at the real lever: DIDACT trains on internal development process data — fine-grained edits, build fixes, review round-trips — not just finished code. The implication is harsh and clear: a high-reliability legacy organization's AI capability partly depends on its private process-data assets, and that cannot be bought with tool licenses.

Lehman's laws of software evolution explain why this constraint is structural: complexity in an evolving system increases by default unless work is invested to contain it — legacy complexity is not an engineering failure, it is thermodynamics. And Hyrum's Law guarantees that "behavior-preserving" refactors are never zero-risk in a huge codebase: every observable behavior of an interface is already depended on by someone. The bulk of this complexity lives as tacit knowledge — historical edge-case fixes, undocumented behavioral dependencies, organizational conventions — exactly the distribution that publicly trained LLMs lack most.
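A toy case shows how Hyrum's Law bites. In the sketch below, a refactor preserves the documented result (the set of active IDs) but changes an undocumented detail (their order), and a caller that relied on that detail silently returns a different answer. All names and records are invented for illustration.

```python
def active_ids_v1(records):
    """Original implementation: happens to return IDs in ascending order."""
    return sorted(r["id"] for r in records if r["active"])


def active_ids_v2(records):
    """'Behavior-preserving' refactor: same IDs, input order retained.

    The documented contract (the collection of active IDs) is unchanged;
    only the undocumented ordering differs.
    """
    return [r["id"] for r in records if r["active"]]


def lowest_active(ids):
    """Caller that silently depends on ascending order (lowest ID first)."""
    return ids[0]


if __name__ == "__main__":
    records = [
        {"id": 42, "active": True},
        {"id": 7, "active": True},
        {"id": 19, "active": False},
    ]
    print(lowest_active(active_ids_v1(records)))  # 7
    print(lowest_active(active_ids_v2(records)))  # 42
```

No test of the documented contract catches this. Only knowledge of who depends on the ordering does, and that knowledge is the tacit kind the paragraph above describes.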

4.3 Where the two curves meet: the verification bottleneck, and an organizational principle from 1999

The risk side (4.1) and the benefit side (4.2) converge on a single mechanism — the pivot of Section 1 under extreme conditions: as generation cost approaches zero, the bottleneck moves to verification, and verification capacity happens to be the most expensive, least compressible asset a high-reliability organization owns.

The evidence on this line converges from four independent directions — the most solid cluster in the entire research base:

Industry behavior data (verified; traced to the official press release and cross-confirmed by independent media): per Sonar's 2026 survey (n > 1,100; note that Sonar, a code-quality vendor, has a commercial interest in this narrative), 96% of developers do not fully trust the functional correctness of AI-generated code — yet only 48% always review AI-assisted code before committing. Between distrust and verification behavior lies a 48-percentage-point gap, and that gap is a reservoir of organizational risk.
Front-line practice (verified): security-audit firm SRLabs' operational conclusion — "AI produces useful leads, but verification and context-setting remain the dominant cost." They deploy AI late in the audit process, narrowly scoped, as a QA/coverage tool rather than a first-line discovery tool; their early practice saw roughly 80% false-positive rates in AI vulnerability findings, each requiring human triage.
Academic quantification: automated verifiers are far from reliable — the SAGA approach detects 90.62% of defects on the TCGBench benchmark, but when acting as a judge of whether a piece of code is correct overall, its accuracy is only 32.58%. One layer deeper: weak test cases contaminate model training itself — the now-standard technique of reinforcement learning from verifiable rewards (RLVR) pays the model for passing tests, so when the tests are lax, the model learns "fooling bad tests" as if it were "writing correct code." Verification quality caps not only deployment, but the AI improvement loop itself.
Organization-level evidence (first-hand, self-reported): in Google's JUnit3→JUnit4 migration, about 87% of AI-generated code was submitted unmodified — and the team explicitly recorded that the bottleneck was human review speed; they deliberately throttled change generation to avoid overwhelming reviewers. Surplus capacity on the generation side, rate-limiting on the verification side: the most direct organizational confirmation of the verification-bottleneck thesis.
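The point in the list above about lax tests rewarding the wrong behavior can be shown in miniature. In this sketch, a weak test accepts an incorrect implementation, so any process that rewards "tests pass" would reward the bug; the functions and test inputs are invented for illustration and are unrelated to SAGA or TCGBench.

```python
def median_buggy(values):
    """Incorrect: returns the middle element without sorting."""
    return values[len(values) // 2]


def median_correct(values):
    """Median for odd-length input: middle element after sorting."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def lax_test(fn):
    """Weak check: only an already-sorted input, so ordering bugs pass."""
    return fn([1, 2, 3]) == 2


def stricter_test(fn):
    """Adds an unsorted input, which exposes the missing sort."""
    return lax_test(fn) and fn([3, 1, 2]) == 2


if __name__ == "__main__":
    for fn in (median_buggy, median_correct):
        print(fn.__name__, "lax:", lax_test(fn), "strict:", stricter_test(fn))
```

Both implementations pass the lax test, and only the correct one passes the stricter test. A reward signal built on the lax test cannot tell them apart.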

On the theory side, an organizational principle from 1999 might have been written for this moment — and it is one of the most cleanly verified theoretical claims in this research (six votes, verifiers downloading the original PDF and checking word-for-word): Weick, Sutcliffe & Obstfeld's characterization of high-reliability organizations — reliability comes not from stable patterns of activity but from "stable processes of cognition" applied to "variable patterns of production." Cognition stable, action variable. Efficiency-seeking organizations often enact the opposite split: activity stable (frozen process), cognition variable (judgment drifting with whoever is on shift).

Translated to the AI era (the translation is our derivation; the mechanism itself is verified): AI makes action generation unprecedentedly cheap and variable, so the organization must make its cognitive processes — detection, evaluation, correction — correspondingly more stable and more capitalized, rather than letting AI take over both generation and verification. That is the theoretical core of "guardrails before scale." It is not conservatism; it is control theory.

Dietterich (2018; verified; a normative position paper) states the same logic as a deployment criterion: in high-risk applications, reliability is a property of the human-plus-AI combined system, not of the AI alone; do not deploy AI where the surrounding human organization cannot achieve high reliability — a team without verification capacity should not absorb high-volume AI output. The startup playbook — agents for everyone, adoption driven by gut feel, ship first and see — is a systematic violation of exactly this criterion.

4.4 The viable first step: migrations — the only entry point with first-hand success evidence

Having stated the constraints, the good news: there is one task category where AI has first-hand success records from multiple independent organizations on giant legacy codebases — and it happens to be the deepest standing pain of legacy organizations.

The task profile: old-pattern-to-new-pattern transformation, unit granularity, a built-in oracle, labor-intensive, and universally unloved. An "oracle" is any automatic judge of whether the result is correct; here it is the compiler, the test suite, or the type checker. Concretely: large-scale code migrations, test-framework migrations, language/framework upgrades, build fixes.

Google, int32→int64 monorepo migration (FSE 2025): 12 months, 3 developers, 39 migrations, 93,574 edits; the LLM generated 74.45% of code changes; developers self-estimated total time savings of 50% (note: a 3-person self-report, not instrumented measurement). The paper's statement of motivation is itself the argument: these migrations are labor-intensive, unrewarding, and can drag on for years — they are the economically rational first AI workload for a legacy organization.
Google, Ads int32→int64 and JUnit migrations (2025): 80% of the code changes in landed changelists were written entirely by AI (measured by character-level diff); the JUnit3→JUnit4 effort migrated 5,359 files and 149,000+ lines in three months, with ~87% of AI-generated code submitted unmodified. LLMs restarted migrations that had been shelved for years; a handful of engineers did work that would have taken hundreds of engineer-years. Google's stated success bar for each migration project is ≥50% end-to-end time savings.
Airbnb, Enzyme→React Testing Library (2025): nearly 3,500 test files, originally estimated at 1.5 years of engineering time, completed in 6 weeks. The automation curve is informative: the first pass migrated 75% in 4 hours; 4 days of tuning against common failure modes brought it to 97%; the last 3% took about a week of manual work. Automation has a hard ceiling and a human long tail. The team's own retrospective attributes success to "picking the right related files" (prompts of 40K–100K tokens, pulling in as many as 50 related files): context engineering, not prompt wording.
Amazon Q, Java 8/11→17 (vendor-reported, not peer-reviewed): tens of thousands of production applications, per-application effort down from ~50 developer-days to hours, with claimed cumulative savings of 4,500 developer-years.

The architectural intersection of the four cases is a ready-made guardrail template:

Deterministic sandwich: deterministic tools (AST, code search/Kythe, static analysis) do the locating; the LLM generates only in the middle layer; deterministic verification loops (test/compile/lint) adjudicate — the AI never owns the merge decision.
Same-process principle: AI code goes through the same review and the same test gates as human code.
Throttle by verification capacity: generation throughput follows reviewer capacity, not generation capability.
Context engineering is the primary lever.
Honest unknowns: the long-term effect on code quality is not yet known (Google's own admission).
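The deterministic sandwich in the first item can be sketched in a few lines. This is an illustration under stated assumptions, not the architecture of any of the four organizations: the toy migration (`assertEquals` to `assertEqual` in Python test code) is chosen only for the example, `propose` is a hypothetical stand-in for an LLM call, and the verification step here only parses and re-scans the result, where a real pipeline would also run the compile, test, and lint gates.

```python
import ast
from typing import Callable

OLD, NEW = "assertEquals", "assertEqual"  # illustrative migration, not from the source


def locate(source: str) -> list[int]:
    """Deterministic step 1: find 1-based line numbers that use the old API."""
    tree = ast.parse(source)
    return sorted({
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and node.attr == OLD
    })


def verify(source: str) -> bool:
    """Deterministic step 3: the result must parse and contain no old-API uses."""
    try:
        return locate(source) == []
    except SyntaxError:
        return False


def migrate(source: str, propose: Callable[[str], str]) -> tuple[str, bool]:
    """Run locate -> generate -> verify and return (result, accepted)."""
    lines = source.splitlines()
    for lineno in locate(source):
        # Step 2: the only non-deterministic step; `propose` stands in for an LLM.
        lines[lineno - 1] = propose(lines[lineno - 1])
    candidate = "\n".join(lines) + ("\n" if source.endswith("\n") else "")
    if verify(candidate):
        return candidate, True
    return source, False  # rejected: keep the original and escalate to a human
```

The generator only ever sees lines that the deterministic locator selected, and its output is kept only if the deterministic verifier accepts it. A proposal that fails verification leaves the original source untouched.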

The counter-cases are ready to hand. In 2018, TSB migrated 5 million customers to a new platform in a single big-bang event: the independent investigation found the bank had not adequately justified a single-event migration, and live proving was too small to surface the problems. The result: online banking down, phishing fraud spiking 70-fold, migration-related total costs around £330 million (later partly offset by recoveries from the supplier), 80,000 customers lost, and a £48.7 million regulatory fine in 2022. Queensland Health's payroll rewrite turned an A$6 million project into roughly A$1.2 billion of losses. The mechanism by which big rewrites kill organizations is exactly what Lehman's laws and Hyrum's Law predict: institutional knowledge sedimented in code is discarded with the old system, then painfully rediscovered.

The strangler fig pattern (Fowler) is the right path because its structure — old and new coexisting, replacement in small verifiable slices, every step reversible — is exactly the guardrail AI needs. AI did not change the correctness of the strangler fig pattern; it made the pattern's most expensive manual step — unit-by-unit rewriting — an order of magnitude cheaper. That is the complete argument for migrations as the optimal AI entry point for legacy organizations.
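The three structural properties named above (old and new coexisting, replacement in small verifiable slices, every step reversible) can be made concrete with a small routing facade. This is a sketch under stated assumptions rather than Fowler's own formulation: the class `StranglerFacade`, the sample-based equivalence check, and the integer-to-integer handler signature are all hypothetical simplifications.

```python
from typing import Callable, Dict

Handler = Callable[[int], int]


class StranglerFacade:
    """Route each slice to the legacy or the new implementation, reversibly."""

    def __init__(self, legacy: Dict[str, Handler]):
        self.legacy = dict(legacy)            # old system stays in place
        self.replacement: Dict[str, Handler] = {}
        self.migrated: set = set()            # slices currently served by new code

    def propose(self, name: str, new_impl: Handler, samples: list) -> bool:
        """Cut over one slice only if new code matches old on every sample."""
        old_impl = self.legacy[name]
        if all(new_impl(x) == old_impl(x) for x in samples):
            self.replacement[name] = new_impl
            self.migrated.add(name)
            return True
        return False

    def rollback(self, name: str) -> None:
        """Reversal is a routing change, because the legacy code is still there."""
        self.migrated.discard(name)

    def call(self, name: str, arg: int) -> int:
        impl = self.replacement[name] if name in self.migrated else self.legacy[name]
        return impl(arg)
```

In this framing the AI's contribution is the cheap production of `new_impl` candidates; the facade, the comparison against legacy behavior, and the rollback path are the guardrail and stay deterministic.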

4.5 A three-stage sequence, with the evidence grade of each stage

Collapsing this section into an executable sequence, with honest labels on what each step stands on:

Stage one: enter through closed-verification tasks. (Multi-organization first-hand evidence.) Large-scale migrations, test migrations, framework upgrades, build fixes — using the Google/Airbnb deterministic-sandwich architecture. This stage simultaneously builds two assets for stage two: your organization's own telemetry baseline for AI changes, and its context-engineering capability.

Stage two: guardrails before scale. (Partly empirical, partly extrapolated.) The empirically supported part: same-process principle, deterministic sandwich, throttling by review capacity, late-and-narrow AI placement, and embedding into existing toolchains. Google's 2024 overview shows its own AI rollout following a "measure and stage" path, with the finding that AI features requiring users to remember to trigger them do not scale; the leverage is in embedding into existing workflows. DORA 2025 prescribes in the same direction: returns come from investment in testing, version control, and feedback loops, not from the tool itself. The extrapolated part (no first-hand case of any organization implementing it) has two pieces. First, folding AI-generated changes into a unified error budget, with the remaining reliability headroom automatically regulating AI change quotas. The error budget is the core SRE mechanism: the slack between your reliability target and perfection is a spendable "change budget," and when it is exhausted, releases freeze, which turns "stability vs. speed" from a culture war into arithmetic. Second, using TLA+-style formal specifications as acceptance oracles for high-risk AI changes. AWS demonstrated formal methods' feasibility and ROI in exactly this class of organization: model checking found a DynamoDB design bug with a 35-step minimal error trace that had survived every human review, and engineers went from zero TLA+ to useful results in 2–3 weeks. But Newcombe et al. concede the gap between design-level and code-level verification ("The answer is we do not know," in their words, on whether code correctly implements the verified design). "Formal methods guarding AI changes" is this essay's derivation, not established practice.
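The error-budget arithmetic is short enough to write down. The sketch below is hedged in the same way as the paragraph it follows: the budget calculation is the standard SRE definition, but the proportional quota rule in `ai_change_quota` is a hypothetical illustration of this essay's extrapolation, not a practice any organization is known to run. Exact fractions are used so the example does not depend on floating-point rounding.

```python
from fractions import Fraction


def error_budget(slo: Fraction, total_events: int) -> Fraction:
    """Allowed failures in the window: the slack between the target and perfection."""
    return (1 - slo) * total_events


def remaining_fraction(slo: Fraction, total_events: int, failed_events: int) -> Fraction:
    """Share of the budget still unspent, clamped to the range [0, 1]."""
    budget = error_budget(slo, total_events)
    if budget <= 0:
        return Fraction(0)
    unspent = 1 - Fraction(failed_events) / budget
    return max(Fraction(0), min(Fraction(1), unspent))


def ai_change_quota(base_quota: int, slo: Fraction,
                    total_events: int, failed_events: int) -> int:
    """Hypothetical rule: scale the AI change quota by remaining budget.

    A result of 0 corresponds to the release freeze described in the text.
    """
    return int(base_quota * remaining_fraction(slo, total_events, failed_events))
```

For example, a 99.9% target over 1,000,000 requests leaves a budget of 1,000 failures. With 250 already spent, three quarters of the budget remains, so a base quota of 40 AI-generated changes per window shrinks to 30; at 1,000 failures or more it drops to zero.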

Stage three: organizational structure. (Mostly theoretical extrapolation, standing on verified mechanisms.) Establish verification infrastructure (test oracles, staged rollout, observability, review toolchains) as a first-class asset owned by dedicated teams rather than left to each team's discretion; this is the organizational translation of Weick's mechanism. Centralize governance of agent authority while executing locally. Leveson's critique of decentralization applies here (cited as her position: she argues that in tightly coupled systems, uncoordinated local decisions are themselves an accident mode; the HRO school's counterweight is "migrating decision rights," expertise over hierarchy in anomaly handling). And instrument all AI benefit assessment (delivery metrics, change failure rate, review latency), banning self-reported gut feel as a decision input. That is the direct governance consequence of METR's perception-reality gap.
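As a sketch of what "instrumented rather than self-reported" could look like, the snippet below computes change failure rate and median review latency separately for AI-assisted and human-written changes. The record shape is an assumption: the `Change` fields, and in particular how a change gets labeled `"ai"` or `"human"`, are hypothetical and would depend on an organization's own tooling.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Change:
    origin: str            # "ai" or "human"; the labeling scheme is an assumption
    review_hours: float    # time from review request to approval
    caused_failure: bool   # did the deployment need a rollback or hotfix?


def summarize(changes: list) -> dict:
    """Per-origin change count, change failure rate, and median review latency."""
    report = {}
    for origin in sorted({c.origin for c in changes}):
        group = [c for c in changes if c.origin == origin]
        report[origin] = {
            "changes": len(group),
            "change_failure_rate": sum(c.caused_failure for c in group) / len(group),
            "median_review_hours": median(c.review_hours for c in group),
        }
    return report
```

Splitting by origin is the point of the design: an aggregate failure rate can look stable while the AI-assisted subset degrades, and only the split view feeds a per-origin quota or freeze decision.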

One more thing has to be said out loud: this section's prescription — guardrails before scale, entry through verifiable tasks — matches the actual paths of Google, AWS, and Airbnb. But those organizations already had the strongest verification infrastructure on the planet. "The organizations with the strongest guardrails succeed with AI first" is both evidence for this essay's thesis and a possible survivorship bias. It is an open question, not a closed conclusion.

5. Why transformation is hard, and how to sequence it: the last piece, from innovation theory

Four sections in, one question remains: why organizations fail to move. Management science has forty years of theory and evidence on this, and the findings are unusually consistent.

Transformation is hard not because organizations are stupid, but because they are rational. Christensen's framework, in its AI-era form: incumbents adopting AI will rationally use it to optimize existing processes — doing the same things faster — rather than reorganizing how the work itself is done. HBR's 2026 observation matches: incumbents deploying AI broadly are reaping only marginal gains, and the constraint is organizational design, not deployment intensity. On the peer-reviewed side, Bughin (2025) finds that AI-induced competitive pressure does spur innovation, but the dominant driver of strategic renewal is internal organizational dynamics — the bottleneck is inside.

The quantified mechanism: systematic under-exploration. March's exploration-exploitation framework comes with hard numbers: Uotila et al.'s estimate is that 80% of firms explore too little and exploit too much — organizations naturally favor short-term certainty and systematically squeeze out exploration. Gilbert's study of newspapers going digital supplies the most lethal detail: transformation failed not for lack of resources, but for failure to change the processes that used the resources — a direct refutation of "buy AI tools, add budget, transformation follows." The three Copilot field experiments (4,867 developers; post-registered) hold an often-overlooked number: with free access and management encouragement (one site also provided training), 30–40% of engineers never tried the tool at all. The gate is not access; it is absorption friction.

Path design: structural ambidexterity. O'Reilly & Tushman's synthesis across hundreds of studies supports this recipe: separate exploration and exploitation into autonomous units, each with its own capabilities, incentives, processes, and culture, linked by common strategic intent — rather than asking one team to both protect delivery and drive transformation (sequential switching fails in fast-moving environments). Applied here: charter an AI-native exploration unit, let it redesign the work around the new cost structure, and reintegrate once it stands. The honest boundary: the classic ambidexterity evidence is thick, but there is as yet no controlled case of "the AI-native exploration unit wins" — this is extrapolation of classic theory into a new setting.

The sequence cannot be inverted. Absorptive capacity theory (Cohen & Levinthal) and DORA's amplifier finding (verified) converge: an organization's return on new technology depends on its existing capability stock, and AI amplifies what an organization already is rather than fixing it. Corollary: weak organizations must repair their underlying systems first — testing, version control, feedback loops — before scaling AI; strong organizations' AI dividend is compound interest on existing engineering capability. Buy tools first and fix fundamentals later, and you will collect DORA's stability penalty first, then lose your second chance when trust collapses.

The bottleneck's final address: the executive layer. Teece's dynamic-capabilities framework (sensing–seizing–reconfiguring) finishes the argument: the binding constraint on AI-native transformation is not developers' tool-adoption rate but the executive layer's capacity to sense the opportunity and reconfigure assets and organization. This echoes the pivot: the concrete content of that reconfiguration is moving investment from generation capacity to verification infrastructure, repricing talent from execution ability to judgment, and moving governance from after-the-fact compliance to a centrally coordinated, locally executed model.

6. Closing: six testable claims

The essay's argument reduces to six claims, ordered by evidence strength, each stated so it can be checked:

Individual productivity gains do not automatically convert into organizational delivery performance; the bottleneck moves to downstream control systems. (Strong: two consecutive years of DORA survey data (cross-sectional) + Faros telemetry + METR micro-level time data; correlational and causal evidence cross-confirming.)
AI's benefit declines with developer experience and codebase maturity, and can go negative at the tail; practitioner self-perception is unreliable. (Strong: METR and the Copilot RCTs are causal and complementary in direction; METR's point estimate carries uncertainty — the perception gap is its most robust finding.)
Transaction costs migrate from production/coordination to integration/verification/platform relationships; this is the unifying mechanism behind the above. (Medium-strong: the theory is verified; the software-side evidence is independently convergent practitioner accounts; no one has yet produced a quantified cost-transfer ledger.)
Human roles restructure along the judgment/execution axis; specification quality and domain knowledge become the binding constraints. (Medium: theory verified and carrying its own reverse clause; empirics directionally consistent but mostly vendor data and an n=1 case.)
For high-reliability legacy organizations, the optimal path is: enter through closed-verification tasks → guardrails before scale → capitalize verification infrastructure. (Stage one has multi-organization first-hand evidence; stages two and three are extrapolations standing on verified mechanisms, with a survivorship-bias risk.)
Firm boundaries will move, but the direction is unresolved; the observable criteria are the scaling law of verification cost and the accountability regime. (An honest stalemate: the shrinkage camp's flagship evidence failed verification; the expansion camp holds peer-reviewed empirics.)

If, two years from now, DORA's stability penalty has disappeared, METR-style experiments measure consistent speedups on mature codebases, or the first controlled case of an AI-native exploration unit outperforming its parent appears, the corresponding claims here should be revised. The value of a theory is not that it is forever right, but that it tells you which numbers to watch.

Appendix: principal sources

Organizational economics: Coase, "The Nature of the Firm" (1937) · Shahidi et al., "The Coasean Singularity?" (NBER, 2025) · Warin, "From Coase to AI Agents" (California Management Review Insights, 2025) · Gondauri & Batiashvili, "The Agentic Economy" (arXiv:2605.18935) · Hadfield & Koh, "An Economy of AI Agents" (arXiv:2509.01063) · Hadfield-Menell & Hadfield, "Incomplete Contracting and AI Alignment" (AIES 2019) · Agrawal, Gans & Goldfarb, "Prediction, Judgment, and Complexity" (NBER, 2018) + "Generative AI Is Still Just a Prediction Machine" (HBR, 2024) · Chen, Elliott & Koh (JET, 2023) · Babina et al. (Journal of Financial Economics, 2024) · Klein & Wieczorek, "The Headless Firm" (arXiv:2602.21401 — restricted citation: unreviewed, author COI) · Hydari & Muzaffar, "Going Headless?" (arXiv:2605.17812)

Software engineering empirics: DORA 2024 (dora.dev/research/2024/dora-report) · DORA 2025 (dora.dev/dora-report-2025) · METR (arXiv:2507.09089; update: metr.org/blog/2026-02-24-uplift-update) · Cui et al., three Copilot field experiments (MIT working paper) · GitClear 2025 · Faros AI telemetry study · Itaú one-person squad (arXiv:2605.18461) · Anthropic Claude Code research (anthropic.com/research/claude-code-expertise) · Wes McKinney, "The Mythical Agent-Month" · O'Reilly Radar, same title · Forret, same title · Google, "Towards a Science of Scaling Agent Systems" (arXiv:2512.08296) · Cemri et al., MAST (arXiv:2503.13657)

High-reliability and safety theory: Perrow, Normal Accidents (1984) · Weick, Sutcliffe & Obstfeld, "Organizing for High Reliability" (ROB, 1999) · Leveson et al., "Moving Beyond Normal Accidents and High Reliability Organizations" (Organization Studies, 2009) · Le Coze (2015) · Dietterich, "Robust Artificial Intelligence and Robust Human Organizations" (arXiv:1811.10840) · Williams & Yampolskiy (Philosophies, 2021) · Bainbridge, "Ironies of Automation" (1983) · Lehman, "Programs, Life Cycles, and Laws of Software Evolution" (1980) · Hyrum's Law (hyrumslaw.com) · Beyer et al., Site Reliability Engineering (2016) · Newcombe et al., "How Amazon Web Services Uses Formal Methods" (CACM, 2015)

Migrations and practice: Fowler, "Strangler Fig Application" (2004) · Ziftci et al., "Migrating Code At Scale With LLMs At Google" (FSE 2025) · Nikolov et al. (arXiv:2501.06972) · Airbnb Engineering, "Accelerating Large-Scale Test Migration with LLMs" (2025) · AWS DevOps Blog (Amazon Q; vendor-reported) · Google Research, "AI in software engineering at Google" (2024) · DIDACT (2023) · SRLabs, "The verification bottleneck" (2026) · Sonar State of Code Survey 2026 · Chroma, "Context Rot" · TSB independent investigation (Slaughter and May, 2019) · Knight Capital (SEC order, 2013) · Queensland Health payroll audit

Management and innovation: Christensen, The Innovator's Dilemma (1997) · March, "Exploration and Exploitation" (1991) · O'Reilly & Tushman, "Organizational Ambidexterity" (AMP, 2013) · Gilbert (2005) · Uotila et al. (2008) · Teece et al. (1997/2007) · Cohen & Levinthal, "Absorptive Capacity" (1990) · Bughin (TASM, 2025) · HBR 2026-02, "Why New Technologies Don't Transform Incumbents"

Cybernetics (cited as design language; no controlled empirics): Ashby, An Introduction to Cybernetics (1956) · Conant & Ashby (1970) · Beer, Brain of the Firm (1972) · Snowden & Boone (HBR, 2007) · Trist & Bamforth (1951) · Ang, Sankaran & Liu (Applied Ergonomics, 2025) · Thoughtworks, "Cybernetics and human-on-the-loop" (2026)