DAY 35 / PHASE 4 · PROMPT-AS-CODE

Prompt-as-Code / Version Governance

Registry · Eval Gate · Canary & Rollback · Cross-Model Drift

2026-06-15 · BigCat

A prompt is a program you wrote for one specific model — it just has no version number, no tests, no rollback. This issue adds them.

// WHY THIS MATTERS

For most people a prompt lives like this: hardcoded in some f"""...""", scattered across a dozen files; who changed it, what changed, which version is live — all from memory; the model upgrades and the behavior quietly shifts with nobody noticing. Fine in the demo stage. But once a prompt is in production with real users depending on it, it's the most fragile yet least engineered part of the system — a one-comma change can break format adherence and no CI catches it. This issue ports the software-engineering loop — version → test → release → rollback → monitor — onto prompts, while keeping one essential difference in view: a prompt is not deterministic code. Its "correct" is distributional (same input, different outputs), it's model-specific (swap the model, it changes), and its decay is silent (no error, quality just slips). So copy-pasting CI/CD isn't enough — you must design for those three traits. Core mantra: treat a prompt as a versioned, eval-gated, one-click-rollbackable artifact, not a string.

// 01

Prompt as Artifact: from scattered strings to a registry

Claim: inline-hardcoded prompts have no version, no traceability, and are tightly coupled to code. Step one is extracting them into id@version artifacts that code references, not inlines.

Background & principle

Pull prompts out of code into a prompt registry as first-class artifacts: each has an id, a version, and metadata (purpose, target model, owner, linked eval set). Code calls get("triage", "v7") instead of inlining a 600-word system prompt into an f-string. Three immediate wins: traceability (which version is live right now — one lookup), independent iteration (change a prompt without changing code or cutting a release), and non-engineer editing (product / ops edit copy via the registry, not a source PR). The registry itself goes into git — getting review, diff, history. But face a difficulty §4 expands on: a prompt diff isn't "logically right or wrong at a glance" like a code diff — so git only solves "keep a trail"; "is it better" is settled by §2's eval.

In practice

# registry: a prompt is a versioned artifact, not an inline string
from registry import prompts

p = prompts.get("ticket-triage", version="v7")   # pin the version, not latest
msg = p.render(ticket=text, lang="en")              # template + variables

# registry entry (in git: reviewable / diffable / rollbackable)
# id: ticket-triage   version: v7   model: claude-opus-4-8
# owner: bigcat       eval_set: triage-golden-120   created: 2026-06-15

Core mental model: a prompt is an "external resource" code references, not part of the code. Just as you wouldn't inline a database schema into business logic — same for a prompt; it has its own lifecycle.

Failure modes: (1) Prompts inline-hardcoded and scattered — "changed it, but can't say what changed or which version shipped." (2) No version number, only "the latest one" — no way to recover the prior version on rollback. (3) Prompt tightly bound to code releases — changing one word requires the full release pipeline, killing iteration speed.

Going deeper · Langfuse Prompt Management (versioning / deployment), langfuse.com/docs/prompts · This site, Day 33 Legacy Code Governance (provenance / artifact tracing, same root)

// 02

Test & Release: a prompt's CI (but it's non-deterministic)

Claim: a prompt change should pass a gate like code — but not via assert ==. Judge aggregates on an eval set, the new version must not be worse than baseline; ship by canary, never a wholesale swap.

Background & principle

LLMs are non-deterministic; "tried one case, looked OK" has no statistical meaning. The right move is a regression gate: maintain an eval set (golden samples + LLM-judge + rule assertions, see Day 6), and the new prompt must score no lower than the live version on the whole set's aggregate (this is Day 33's ratchet projected onto prompts — only get better). Release likewise can't be "just swap it": shadow (run alongside, no user impact, compare old vs new offline) → canary (cut 5% of real traffic, watch metrics) → full; stop at any step where metrics drop. Key realization: a prompt's "correct" is distributional — one or two cases passing doesn't mean the whole is good; look at the aggregate over the eval set.

new prompt v8 ──▶ ┌─── EVAL GATE (offline, vs baseline v7) ───┐ │ golden set · LLM-judge · rule assertions │ │ aggregate < v7 ? ──▶ reject, don't ship │ └──────────────┬─────────────────────────────┘ pass ──▶ SHADOW: run alongside, compare only ──▶ CANARY: 5% real traffic, watch metrics metric drop ──▶ auto-stop + rollback v7 ──▶ full cutover to v8 (v7 still 1-click back) a prompt's "right" is distributional: aggregate over the eval set

In practice

def eval_gate(new, baseline, eval_set):
    s_new  = run_eval(new,      eval_set)   # aggregate of LLM-judge + rule asserts
    s_base = run_eval(baseline, eval_set)
    if s_new.score < s_base.score:          # ratchet: not worse than live
        raise Reject(f"{s_new.score} < baseline {s_base.score}")
    return "shadow"                          # passed → shadow, not straight to full

deploy = { "stage": "canary", "traffic": 0.05, "rollback_to": "v7" }

Failure modes: (1) Ship to 100% after one or two cases looked OK — blinded by single points to a distributional effect. (2) No baseline comparison, ship on "feels better" — without a ratchet it quietly regresses. (3) Wholesale swap, no canary, no rollback path — when it breaks, all users break together.

Going deeper · Martin Fowler Canary Release, martinfowler.com/bliki/CanaryRelease · This site, Day 6 Eval Engineering (LLM-judge / regression tests)

// 03

Rollback & Observability: prompts decay silently

Claim: prompt decay usually throws no error — quality just slips (more verbose, missed edge cases, tone drift). So: pin versions in prod, one-click rollback, watch live metrics.

Background & principle

A code bug throws; prompt decay just "answers a bit worse," with no red light — that's what makes it dangerous. Two lines of defense. One, pin + one-click rollback: production references a concrete v7, never latest; like a feature flag, on trouble snap back to the prior version instead of editing the prompt live to firefight. Two, live metrics: each prompt version is tied to quantifiable signals — downstream task success rate, format / schema compliance, user feedback (thumbs-down / retry rate), length, cost, latency — and dropping below threshold alerts and, if needed, auto-rolls back. This mirrors Day 34 HITL's "retune thresholds from production data": make prompt quality a monitorable, regressable quantity, not a vibe.

In practice

# pinned prod version + metric guard + auto rollback
LIVE = "v7"                              # always pin a concrete version, never latest

def on_metrics(ver, m):
    if m.schema_ok < 0.95 or m.thumbs_down > 0.08:   # visible proxies for silent decay
        alert(f"{ver} quality dropped"); rollback(to=prev(ver))   # snap back

# keep per-version metrics for comparison & regression (cf. Day 34 audit)
# v6: schema 0.97 / 👎 0.04   v7: schema 0.93 / 👎 0.09 ← regressed

Failure modes: (1) Production references latest — an upstream change silently shifts your system, with no "before" to roll back to. (2) Never rehearsed a rollback — at incident time you firefight by editing the prompt live, making it worse. (3) Watching only offline eval, not live metrics — passing the eval set doesn't mean no regression on real traffic.

Going deeper · Chip Huyen Building LLM Applications for Production, huyenchip.com/2023/04/11/llm-engineering · This site, Day 34 HITL (same metric-retuning engine)

// 04

Cross-Model Drift: the same prompt on a different model isn't the same prompt

Claim: a prompt is a program for a specific model. Swapping models — even a minor model-version bump — changes the runtime, so the same prompt's behavior can drift. Prompt × model is a matrix; eval each cell independently.

Background & principle

The most counterintuitive, most-stepped-on trap: treating a prompt as portable. But "a prompt tuned to perfection on Opus" moved to another model (or the same model's next version) can shift in format adherence, CoT depth, refusal boundaries, tone — because what you swapped is the runtime that interprets the prompt. Three governance rules. One, treat a model upgrade as a dependency upgrade: when the provider ships a new version, run the regression eval before cutting over, don't "just upgrade." Two, in multi-model routing (you have fallback / cost routing, see Day 23) — test each route's prompt on its model; don't assume good-on-A means good-on-B. Three, a drift sentinel: periodically run a fixed eval set against the production model to detect "same prompt, same input — has the output distribution drifted?" — catching silent upstream updates (the provider swaps weights without telling you).

claude-opus-4-8 gpt-class local 8B triage eval ✓ eval ? eval ✗ ← each cell tested summarize eval ✓ eval ✓ eval ? extract eval ✓ eval ✗ eval ✗ · model upgrade = dependency upgrade: re-run the whole column · run a fixed eval set periodically → catch "silent upstream drift"

Failure modes: (1) Swapping models / bumping versions without re-testing — "just cut over," and behavior drift surfaces in production. (2) Assuming prompts are portable — tuning on A is defaulted onto B. (3) No monitoring of silent upstream updates — the provider swaps weights and your output distribution quietly shifts, unnoticed.

Going deeper · Anthropic Prompt Engineering Overview (tuning to a model), docs.anthropic.com · This site, Day 23 Personal AI Infra (multi-model routing)

// Hands-on · build a prompt-as-code pipeline for your prompt assets

A one-week plan: take your most critical production prompt and upgrade it from "a string buried in code" to "a versioned, gated, rollbackable, monitored artifact."

Extract a registry: pull hardcoded prompts into id@version artifacts; change code to get(id, version).
Build an eval set: gather 50–100 golden samples + LLM-judge + rule assertions as the regression baseline.
Eval gate: a new version must score no lower than the live one on the eval set (ratchet), or it doesn't ship.
Canary release: shadow compare → canary 5% → full, watching metrics at each step.
Pin + rollback: pin a concrete version in prod, and actually rehearse a one-click rollback (don't let an incident be the first time).
Live metrics: tie each version to schema compliance / thumbs-down / cost / latency, alert on drops.
Cross-model matrix: list prompt × model, treat the next model upgrade as a dependency upgrade with regression; add a periodic drift sentinel.

Once you've built this, hearing "we tuned a better prompt" will make you instinctively ask: against what baseline, on how large an eval set, was it canaried, can it roll back, was it tested on the model swap — instead of trusting "feels better." A prompt is part of the software and deserves software's discipline — you just have to add extra defenses for its "non-determinism, model-dependence, silent decay."

// ENGLISH GLOSSARY

Prompt Registry: A central store of prompt artifacts, each with id / version / metadata, referenced by version from code.
Prompt-as-Code: Governing prompts as engineering artifacts that need versioning, testing, release, and rollback.
Eval Set / Golden Set: A golden-sample set for regression, with LLM-judge + rule assertions measuring aggregate quality.
Regression Gate: A gate that admits a new version only if it scores no lower than baseline on the eval set (ratchet).
Shadow Run: Run the new version alongside, with no user impact, comparing old vs new outputs offline.
Canary Release: Cut a small slice of real traffic to validate metrics before going full.
Pinning: Production references a concrete version, not latest, avoiding silent upstream propagation.
Rollback: One-click revert to the last known-good version, like a feature flag.
Prompt Drift: The same prompt's behavior drifting due to a model change / silent upstream update.
Prompt × Model Matrix: The grid of prompts and models, each cell eval'd independently; a model upgrade is a dependency upgrade.

// Deeper Thinking

Prompts go into git and get reviewed like code — but how does a reviewer judge a prompt change is "better"? Code has right/wrong logic; a prompt has only distributional effects.

This is the root divergence between prompt-as-code and real code: a prompt diff can't be reliably judged by eye. So review shifts focus — humans reviewing a prompt diff mainly check "did this introduce obvious risk" (injection surface, over-privileged instructions, PII, policy conflicts); "is it better" goes not to human eyes but to the eval gate. In other words, the "correctness" part of code review is taken over here by §2's eval set; humans keep only "safety / compliance / style" judgment. If you find yourself arguing in a PR over "is this wording better," that's a signal the eval data should decide, not reviewer intuition.

The eval set is the foundation of this whole governance — but it overfits: tune the prompt against it and it stops representing the real distribution. How do you keep it from rotting?

Real and serious — the mechanism is identical to ML train/test leakage. Three countermeasures: (1) continuously top up the eval set from production, especially folding back cases that broke live — keep it tracking the real distribution, not a static fossil. (2) Hold out a set: separate the dev set you tune against from the test set you gate on; the test set doesn't take part in iteration, seen once before release. (3) Periodically audit coverage: does it still represent what users are actually asking? Overfitting's root cause is often an eval set too small and too old — it should have its own version and refresh cadence, like a dataset, not built-once-used-for-a-year.

Canary / rollback are borrowed from code deploys, but a prompt's "metrics dropped" is often lagging and noisy — by the time you notice, many users were served. How do you catch decay faster?

The key is moving detection earlier than release + using faster proxies. Earlier: the shadow stage already catches most decay offline (no user impact, instantly comparable) — don't wait for canary's live feedback. Proxies: "real" metrics like thumbs-down / retention lag and are noisy, but machine-instant proxies don't lag — schema/format compliance, output-length distribution, refusal rate, whether the expected tool was called, semantic distance from the old version's output. These alert at request level in real time. Real metrics "confirm," proxies "front-run." Canary traffic should also scale by risk — high-risk scenarios start at 1% with tighter auto-rollback thresholds.

The cross-model matrix prompt × model explodes combinatorially as models multiply. Can an individual / small team afford it? Where should you cut?

You can't afford the full matrix, and don't need it. Cut by risk and traffic: only "high-traffic × high-risk" prompts get the full model matrix; long-tail prompts test only the primary model, with a small smoke set on the fallback as a backstop. Another cut: don't keep too many models — the prompt×model explosion is often rooted in scattered model choices; converging to 1 primary + 1 backup kills most combinations. What you can't cut is the primary model's upgrade regression: that column must re-run every time the provider ships, because it covers most of your traffic. Spend the budget on "high-traffic columns + upgrade regression," sample the rest.

How does this relate to Day 33 code governance, Day 34 HITL, Day 6 eval? Is prompt-as-code just software engineering ported to prompts?

It is indeed a "software-engineering discipline → prompt" migration, but with three things that can't be copy-pasted — which is exactly why this issue exists. Re Day 6 eval: eval is the "test framework" here; this issue is the version / release / rollback shell around it. Re Day 33: both govern artifacts (code vs prompt), sharing ratchet / provenance / risk-tiering, but the objects differ in "decidability of correctness" — code allows static analysis, a prompt only statistical eval. Re Day 34: HITL's "metric-retuned thresholds" and this issue's "prompt-metric regression" are the same engine. The three essential differences: non-determinism (use distributions, not assertions), model-dependence (the prompt×model matrix, a dimension code lacks), silent decay (must actively monitor, unlike a bug that surfaces itself). Hold those three and you won't treat a prompt like a throwaway config file.

// Further Reading

Langfuse · Prompt Management — an open-source reference for prompt versioning / deployment / rollback
Martin Fowler · Canary Release — the foundational definition of canary releases, same logic for shipping prompts
Chip Huyen · Building LLM Applications for Production — a systematic take on prompt drift, monitoring, and productionization
Anthropic · Prompt Engineering Overview — the official basis for "a prompt is a program for a specific model"
Google · Code Review Developer Guide — porting "what to look for" discipline to prompt review