DAY 32 / PHASE 3 · FRONTIER

The Next Five Years of AI Coding

SWE-bench Limits · Time Horizon · Skill Repricing · Agent-Ready Codebase

2026-06-15 · BigCat

Don't predict the model. Predict how your own engineering leverage changes.

// WHY THIS MATTERS

Talk of "the future of AI coding" slides into two traps: the panic ("developers are obsolete") and the arrogance ("it's just better autocomplete"). Both bet on the model curve—a variable you don't control. This issue flips the frame. Instead of predicting how strong models get, it asks four questions that cash out into engineering moves: (1) why even OpenAI deprecated SWE-bench Verified in 2026, and why leaderboard scores can't be your selection criterion; (2) why the real exponential is task time horizon—not "IQ"—and how it decides how long a task you can delegate to an agent; (3) which engineering skills are depreciating and which are appreciating; (4) how to make your codebase agent-ready now, turning five-year leverage into compounding moves you can make today. The conclusion up front: what determines your output in five years isn't which model you use—it's how verifiable your environment is to an agent.

// 01

Don't Select Models by SWE-bench Score: The Death of a Benchmark

Claim: the half-life of coding benchmarks is collapsing. SWE-bench Verified was officially deprecated in 2026—proof that public leaderboards carry near-zero information for your selection.

Background & Principle

SWE-bench (Jimenez et al., ICLR 2024) has been the de facto yardstick for coding agents: pull a bug from a real GitHub issue, have the agent fix the code, and judge correctness with the repo's own tests. OpenAI then co-built SWE-bench Verified, a 500-task, human-filtered subset. But in February 2026 OpenAI stopped reporting Verified scores. The reasons are worth memorizing, because they are failure modes every benchmark shares:

Contamination: all 500 tasks come from public Python repos predating every model's cutoff. Models can simply "recall" the gold patch and problem-statement specifics—you think you're testing reasoning, but you're testing recitation.
The tasks themselves are broken: on review, more than half of the remaining unsolved problems turned out to be flawed. Some tests were too narrow, rejecting functionally correct submissions; others were too broad, demanding features the problem statement never mentioned. The score ceiling is set by noise, not by ability.
Construct gap: the distribution—"single repo, existing tests, a few dozen lines of bug-fix"—is far from your daily reality of cross-service changes, vague requirements, writing your own tests. A high score ≠ getting work done in your codebase.

More subtle: swap the scaffold (harness) on the same model and SWE-bench scores can shift by double digits—the leaderboard measures the "model + harness" combo, not the model itself. Scale AI's SWE-bench Pro uses legally-untrainable private codebases as a held-out set to fight contamination, but it's still not your distribution. Bottom line: public benchmarks are vendors' arms-race metric, not your purchase contract.

Hands-On

The only trustworthy coding eval is your own, private, same-distribution-as-your-real-work one. Ten minutes gets you a prototype—pull from your merged PRs:

# Build a held-out coding eval from your git history (private = contamination-proof)
# Pick 5 merged PRs with "known answer + tests"; roll each back to its pre-change commit
# held_out_prs: PR records such as "#412 fix timezone off-by-one", "#455 add pagination", ...
for pr in held_out_prs:
    checkout(pr.base_commit)              # back to before the change
    run_agent(task=pr.issue_title)        # have the agent under test redo it
    passed = run(pr.test_files)           # the real tests as oracle
    log(model, harness, passed, tokens, wall_time)
# Key: eval the model+harness combo; keep the 5 tasks secret, rotate periodically

Run two or three candidate models / IDEs through it and you get a cost-and-latency-annotated comparison that's valid for your work—worth more than any leaderboard. Keep these 5 tasks confidential; the moment they hit any public channel, they're contaminated and void.
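The loop above can be made concrete with a small runner. This is a minimal sketch, not a finished tool: `EvalTask`, `EvalResult`, `run_eval`, and the two git helpers are hypothetical names, and the agent is assumed to be a callable that takes a prompt and returns the tokens it used. One step the pseudocode glosses over is made explicit here: the PR's tests do not exist at the pre-change commit, so they are checked out from the merge commit only after the agent has finished.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalTask:
    pr_id: str                 # label only, e.g. "#412"
    base_commit: str           # commit just before the change
    merge_commit: str          # commit that contains the PR's real tests
    prompt: str                # issue title handed to the agent
    test_paths: Sequence[str]  # test files the PR added or changed


@dataclass(frozen=True)
class EvalResult:
    pr_id: str
    passed: bool
    tokens: int
    wall_seconds: float


def git_prepare(repo: str, commit: str) -> None:
    """Check out the pre-change commit so the agent starts from the old state."""
    subprocess.run(["git", "-C", repo, "checkout", "--quiet", commit], check=True)


def git_restore_tests(repo: str, task: EvalTask) -> None:
    """Bring in the PR's own test files, to be called after the agent is done."""
    subprocess.run(
        ["git", "-C", repo, "checkout", task.merge_commit, "--", *task.test_paths],
        check=True,
    )


def run_eval(
    tasks: Sequence[EvalTask],
    prepare: Callable[[EvalTask], None],
    run_agent: Callable[[str], int],     # assumed to return tokens used
    oracle: Callable[[EvalTask], bool],  # True if the real tests pass
) -> list[EvalResult]:
    """Run every task through one model+harness combo and record the outcome."""
    results = []
    for task in tasks:
        prepare(task)
        start = time.monotonic()
        tokens = run_agent(task.prompt)
        elapsed = time.monotonic() - start
        results.append(EvalResult(task.pr_id, oracle(task), tokens, elapsed))
    return results


def pass_rate(results: Sequence[EvalResult]) -> float:
    """Fraction of tasks whose real tests passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0
```

In real use, `prepare` would wrap `git_prepare`, and `oracle` would call `git_restore_tests` and then run your test command. Passing them in as callables keeps the loop identical across the candidates you compare.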

Failure mode: using a public benchmark directly as your regression set and swapping models by leaderboard position. The next model gains 5 percentage points on the board yet feels dumber in your production codebase, because it didn't improve on your construct (cross-file changes, weak tests, implicit constraints), and its changed output style may even break your old harness's diff parser.

Resources · OpenAI Why we no longer evaluate SWE-bench Verified, openai.com/index/why-we-no-longer-evaluate-swe-bench-verified · SWE-bench Pro, arXiv:2509.16941 · SWE-bench original, arXiv:2310.06770

// 02

What's Growing Is "Task Length," Not "IQ": Allocate Autonomy by Time Horizon

Claim: the real exponential is task time horizon—the length of task an agent completes independently, doubling roughly every 7 months. It should drive how long a task you hand to an agent, more than MMLU does.

Background & Principle

METR's 2025 work (Measuring AI Ability to Complete Long Tasks, arXiv:2503.14499) proposes a more engineering-meaningful metric: the 50% time horizon, the length of task, measured in human working time, that the model completes with 50% success. Calibrated on 170 tasks + 800 human baselines, this length has risen exponentially for six years, doubling about every 7 months. Crucially, the growth comes less from stronger raw reasoning than from better reliability, error recovery, and tool use: exactly what lets an agent survive longer chains without collapsing.

Engineering implication: autonomy is not a switch but a continuous curve, and the x-axis is "how long would this take a human". Don't ask "should AI do this"; ask "how much human-time is it worth, and is it verifiable," then pick the authority tier. A 50% success rate means the other half fails—so longer tasks need verifiable guardrails, not blind trust.

Task human-time     Recommended authority tier (rises as the curve shifts right)
────────────────────────────────────────────────────────────
sec ~ few min     │ ████░░░░░  full auto · glance at the diff
few min ~ 1 hr    │ ██████░░░  Plan → approve → execute → review gate
1 hr ~ half-day   │ ████████░  human splits into verifiable subtasks, delegate piecewise
> half-day        │ █████████  human drives architecture, agent fills impl + strong tests
────────────────────────────────────────────────────────────
curve shifts one notch right every ~7 months: what you watch today, you may release next year
(but the oracle is never retired)
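Why an oracle matters so much at a 50% success rate can be shown with a little probability. The sketch below rests on two simplifying assumptions that are not from METR: attempts are independent, and the oracle (tests, review) always catches a failed attempt. Under those assumptions, a checked retry loop turns a coin flip into something usable, while the same coin flip without a check simply ships half its failures. `success_within` is a hypothetical helper name.

```python
def success_within(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


# With an oracle that rejects bad attempts, retries at p = 0.5 compound quickly.
for attempts in (1, 2, 3, 4):
    print(attempts, success_within(0.5, attempts))
```

Real attempts are correlated, so the true gain is smaller, but the direction holds: the value of a retry depends entirely on something being able to say "that one failed."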

Hands-On

Turn it into a routing rule: tag each delegated task with an estimated human-time first, then choose autonomy—rather than guessing by "sounds hard or not."

def route(task):
    h = estimate_human_minutes(task)        # your call: 5 / 30 / 180...
    verifiable = has_tests(task) or reversible(task)
    if h <= 5:                 return "auto"            # full auto, glance
    if h <= 60 and verifiable: return "plan+review"     # see the plan before go
    return "human-split"      # human splits into verifiable subtasks first
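A runnable version of the same rule might look like the sketch below. The `Task` record and the `HALF_DAY_MINUTES` threshold are assumptions added for illustration, as is the fourth `"human-led"` outcome, which mirrors the top rung of the ladder; the first three outcomes follow the rule above unchanged.

```python
from dataclasses import dataclass

HALF_DAY_MINUTES = 240  # assumption: a half-day is about four working hours


@dataclass(frozen=True)
class Task:
    human_minutes: int  # your own estimate: 5 / 30 / 180 ...
    has_tests: bool     # an automated oracle exists
    reversible: bool    # the change can be undone cheaply


def route(task: Task) -> str:
    """Pick an authority tier from estimated human-time and verifiability."""
    verifiable = task.has_tests or task.reversible
    if task.human_minutes <= 5:
        return "auto"          # full auto, glance at the diff
    if task.human_minutes <= 60 and verifiable:
        return "plan+review"   # see the plan before execution
    if task.human_minutes <= HALF_DAY_MINUTES:
        return "human-split"   # human splits into verifiable subtasks first
    return "human-led"         # human drives architecture, agent fills in


print(route(Task(human_minutes=30, has_tests=True, reversible=False)))
```

Note that a 30-minute task with neither tests nor reversibility falls through to `"human-split"`: the verifiability axis can demote a task even when its human-time is short.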

Failure mode: misreading "the 50% time horizon hit 1 hour" as "tasks under 1 hour are safe to delegate." 50% is a coin flip—running long tasks on full auto without an oracle (tests / reversibility / review) is gambling. The shifting curve lowers supervision frequency; it never cancels verification.

Resources · METR Measuring AI Ability to Complete Long Tasks, metr.org/blog/2025-03-19 · Paper, arXiv:2503.14499

// 03

Repricing Skills: What's Depreciating, What's Appreciating

Claim: as "producing code" approaches free, the bottleneck moves from writing code to "defining correctness + verifying at scale." The moat shifts from writing fast to judging well.

Background & Principle

This is the issue's most counterintuitive—and most actionable-today—point. When the marginal cost of code generation approaches zero, value moves from the production end to two poles: upstream specification (turning vague needs into precise, executable intent) and downstream verification (judging correctness across a flood of AI output). Day 25 covered how verification is already the real bottleneck of agentic IDEs—scaled to a career, that's a repricing of skills:

Depreciating: speed at hand-writing boilerplate, memorized APIs and syntax, single-point debugging tricks, keyboard-shortcut muscle memory. This is "production speed"—the first thing AI eats.
Appreciating: turning needs into executable spec / eval; code review at scale (reading and judging AI-written code—scarcer than writing it); systems-design taste (architecture errors get amplified by AI at speed); harness and workflow design; decomposing vague problems into verifiable subtasks.

Counterintuitive corollary: the junior edge built on "speed" gets erased, while the senior premium on "taste / judgment" rises. In Karpathy's Software 3.0 (2025), the human role shifts from "writing instructions" to "gatekeeping within partial autonomy"—your scarcity isn't in output speed, it's in the bottleneck seat that receives generation and renders the verdict.

Hands-On

Run an honest time audit to force the migration toward the appreciating zone:

# For one week, tag each block of work; tally the ratios on the weekend
PROD  = hand-writing impl / boilerplate / API lookup / point debugging  # ← depreciating, ratio↓
SPEC  = writing requirements / acceptance criteria / evals / decomposing # ← appreciating
VERIFY= reviewing AI output / reading code / designing tests & guardrails # ← appreciating

# Many engineers, seniors included, find PROD is still > 50%; that share is the time you can shift
# Action: each week, actively convert 5% of PROD time into SPEC / VERIFY
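The weekend tally can be a few lines of Python. This is only a sketch: the `(tag, minutes)` log format and the `audit` function name are assumptions, and the sample week is made-up illustration data, not a measurement.

```python
from collections import Counter
from typing import Iterable

TAGS = ("PROD", "SPEC", "VERIFY")


def audit(blocks: Iterable[tuple[str, int]]) -> dict[str, float]:
    """Return each tag's share of total logged minutes."""
    totals: Counter = Counter()
    for tag, minutes in blocks:
        if tag not in TAGS:
            raise ValueError(f"unknown tag: {tag}")
        totals[tag] += minutes
    grand_total = sum(totals.values())
    return {tag: (totals[tag] / grand_total if grand_total else 0.0) for tag in TAGS}


# Made-up example week: (tag, minutes) per block of work.
week = [("PROD", 300), ("SPEC", 60), ("VERIFY", 120), ("PROD", 120)]
for tag, share in audit(week).items():
    print(f"{tag:<7}{share:.0%}")
```

Logging in minutes rather than counting blocks keeps one long debugging session from weighing the same as a five-minute review.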

Failure mode: continuing to invest in the depreciating zone by memorizing more APIs, learning faster shortcuts, and grinding typing speed. That's a head-on collision with AI's core advantage. Equally dangerous is abandoning review ability ("the AI wrote it anyway"): once you can't read the AI's output, you degrade from gatekeeper to mere button-pusher.

Resources · Karpathy Software Is Changing (Again) / Software 3.0, latent.space/p/s3 · Anthropic Building Effective Agents, anthropic.com/engineering/building-effective-agents

// 04

Make Your Codebase Agent-Ready: Compounding Moves That Invest in Five Years Out

Claim: what determines your AI leverage in five years isn't which model you use—it's how readable and verifiable your codebase and workflow are to an agent.

Background & Principle

An agent's throughput is bounded by its feedback-loop quality, not by how smart the model is. Karpathy notes in Software 3.0: to get work done, agents need docs, memory, tests, and clear feedback loops. In other words—however strong the model, it stalls in an environment with no verification oracle. So "waiting for a stronger model" is a false strategy; real compounding comes from reshaping the environment—moves that lower your own cost today and become the agent's runway in five years:

High-coverage, fast test suite: the agent's most important asset—its self-verification oracle. Without it, the agent can't tell if its change is right and has to wait for you as the human test harness.
Make tacit knowledge explicit: use AGENTS.md / CLAUDE.md to write down "how to run tests, what the gotchas are, what the conventions are"—feeding both the agent and new hires.
Shrink the blast radius: reversible operations, sandboxes, clear module boundaries. The safer a task is to try-and-fail, the more it can be delegated to higher-autonomy agents (echoing Day 31).
Decompose vague tasks into verifiable subtasks: both the output of your spec skill and the prerequisite for agents to run long chains.

Hands-On

A minimal agent-readiness skeleton you can commit into any repo today:

# AGENTS.md: turn tacit knowledge into the agent's operating manual
## Build & Test
- install: `pnpm i`   all tests: `pnpm test`   single: `pnpm test <file>`
- must be green before commit: `pnpm lint && pnpm test`

## Architecture boundaries (don't cross)
- `core/` must not import `web/`; all IO goes through `adapters/`

## Known gotchas (verification oracle)
- store all timezones in UTC; money as integer cents, no float
- changing `schema.sql` requires a matching migration + regression test

# Agent-readiness checklist: each missing item is runway your "stronger model" loses
# [ ] tests run in one command, <2min   [ ] critical paths covered
# [ ] dangerous ops reversible / sandboxed   [ ] conventions in AGENTS.md

Failure mode: betting everything on "the next model" without touching the environment. The model's benchmark score jumps from 50 to 90, but your repo has no fast tests and no docs, and a one-line change takes ten minutes to verify by hand, so the agent still inches along, stopping at every step to wait for your confirmation. The model can't pay down the environment's debt.

Resources · Karpathy Software 3.0, on agent infra (docs/memory/tests/feedback), latent.space/p/s3 · Anthropic Building Effective Agents, anthropic.com/engineering/building-effective-agents

// CAPSTONE · Run an "AI Leverage Audit" on Yourself

String the four points into one weekend move. The goal isn't to predict the future; it's to measure your current leverage gap and fix the weakest link first.

Build a private eval (§1): pull 5 changes with "known answer + tests" from git history into a held-out set; run the 2 models/IDEs you use now; log accuracy + tokens + wall-time. Your only trustworthy selection criterion.
Tag time horizon (§2): sort last week's 10 tasks by human-time against the authority ladder; see whether you're "still manual on short tasks, blindly trusting on long ones"—both ends are common mismatches.
Time audit (§3): compute PROD vs SPEC/VERIFY ratio. If PROD > 50%, take one implementation task this week and force yourself to only write spec + review, letting the agent write the impl.
Patch agent-readiness (§4): add an AGENTS.md to the repo you change most and get "one command runs the tests" working. Compounding interest saved for your future self.

You'll find that "the future of AI coding" isn't a prediction problem for you—it's four engineering gaps you can act on today. Leave the model curve to the vendors; hold the leverage curve yourself.

// DEEP THINKING

If time horizon doubles every 7 months, that's about 8.6 doublings, or ~380×, in five years: "half-day tasks" become "multi-month tasks." Does that mean §4's agent-readiness investment gets drowned out by general capability and wasted?

The opposite. Growth's source is reliability and error recovery, both of which depend on external feedback—however good a model is at self-correction, something must first tell it "you're wrong." An agent that can run for three months without a strong test backstop is just compounding errors unsupervised for three months. General capability grows "per-step ability × steps"; agent-readiness grows "verifiability per step"—the latter is a multiplier on the former, not a substitute. Environment investment is one of the few forms of compounding the model curve won't eat.
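The doubling arithmetic behind the question is easy to check. The sketch below assumes the 7-month doubling simply continues, which is an extrapolation rather than a guarantee; `horizon_multiplier` is a hypothetical helper name. Sixty months is about 8.6 doublings: rounding up to nine gives 512×, while the unrounded figure is about 380×.

```python
def horizon_multiplier(months: float, doubling_months: float = 7.0) -> float:
    """Growth factor after `months`, given one doubling per `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


five_years = horizon_multiplier(60)
half_day_hours = 4  # assumption: a half-day task is about four working hours
print(f"doublings in five years: {60 / 7:.1f}")
print(f"multiplier: about {five_years:.0f}x")
print(f"a half-day task grows to about {half_day_hours * five_years:.0f} human-hours")
```

At roughly 1,500 human-hours, the "multi-month tasks" label holds either way; the point of the answer above is that tasks this long make the oracle more important, not less.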

§1 says public benchmarks carry near-zero information for personal selection, yet everyone watches SWE-bench at launches. If it's close to meaningless, why does the whole industry use it?

Because it serves a different decision. Vendors need a unified, comparable north star that drives iteration—even contaminated, relative ordering within a generation still carries signal. Your decision is "which one in my codebase, on my budget"; the leaderboard's distribution, harness, and cost aren't yours. Analogy: official fuel economy is comparable across cars but won't predict your real mileage in daily traffic.

§3 says speed depreciates, judgment appreciates. But judgment often comes from writing a lot of code by hand. If juniors no longer hand-write, how do they grow the taste to review a PR? Does this generational chain break?

A real problem with no clean answer. Taste partly comes from "having written it, hit the pitfalls, debugged it," which is exactly what AI erases. Possible compensations: (1) review is itself high-intensity learning: reading lots of good and bad code, paired with an AI that explains "why," may not be slower than accumulating experience by hand; (2) like pilots who still practice manual landings, deliberately keep an "autocomplete off" training zone where you work through the hard problems yourself. The risk is real: someone who only hits "accept" never grows taste and degrades into a button-pusher. This demands deliberately designing "intentionally effortful" training rather than sliding down the effort-saving default.

§2's authority ladder tiers by human-time. But some tasks are quick for humans and hard for AI (a one-line change depending on implicit context), and vice versa. Is human-time a good proxy variable?

It's a good-enough but imperfect first-order approximation: measurable, comparable, and unified across tasks. But "human–machine difficulty mismatch" is real. That's why §2's routing adds a second axis, verifiable: the real authority decision is a two-dimensional plane of human-time × verifiability, not a single axis. Human-time gives you a starting point; calibrate the mismatch cases against your own private eval (§1). That's why the two are used together.

Put the four together: if everyone does agent-readiness and migrates to SPEC/VERIFY, doesn't the moat get erased again? Where's the new scarcity?

Scarcity moves up but doesn't vanish, because the new scarce skills are hard to replicate at scale. When everyone can have an agent write code, the bottleneck becomes "asking the right question" and "judging what's worth doing" (taste, understanding of real needs, anticipating second-order effects), none of which has a benchmark, can be distilled, or can be learned fast. Another direction is responsibility and trust: those willing to put their name on AI output, and those who can design architectures where a swarm of agents collaborates safely. Every prior productivity tool (compilers, frameworks, cloud) failed to make engineers obsolete and instead pushed scarcity to a higher abstraction layer. The same is likely here, only the migration is faster and falling behind costs more.

// FURTHER READING

OpenAI · Why we no longer evaluate SWE-bench Verified — a first-hand post-mortem of benchmark contamination and saturation
METR · Measuring AI Ability to Complete Long Tasks — evidence for time horizon doubling every 7 months
Karpathy · Software Is Changing Again (Software 3.0) — the partial-autonomy and agent-infra framing
SWE-bench Pro — next-gen coding benchmark using a private held-out set against contamination
Anthropic · Building Effective Agents — feedback loops and a simplicity-first engineering principle