A prompt without an eval is folk magic. Editing prompts on vibes is engineering by vibes.
Ask 100 prompt authors "how do you know your last edit was an improvement?" and nearly all of them will say "I tried a few examples and it felt better". That is how AI products quietly get worse after they ship: prompt engineering without evals isn't engineering, it's witchcraft. Hamel Husain's widely shared post "Your AI Product Needs Evals" points at something specific, and the takeaway here is: the model is a commodity, your eval suite is the moat.
This issue doesn't recap MMLU / HumanEval (mostly irrelevant to your real use case). It covers four things every senior practitioner should be doing: how to spin up a minimum viable eval from scratch, why LLM-as-judge must be debiased before you can trust it, how to treat prompts as code with regression tests, and how Anthropic itself uses evals to build Claude. By the end you should be able to set up a real regression-blocking eval for any prompt you own in an hour.
Most people don't ship evals because they think it requires infrastructure first. That's what the landing pages for Weights & Biases / Braintrust / Langfuse led them to believe. Hamel Husain's "Your AI Product Needs Evals" gives the dead-simple opening move: handpick 20 real inputs covering typical and edge cases, write Python to run the prompt and collect outputs, write Python to score each output. That's it. Keep it all in one file.
The core of an eval isn't the tool — it's dataset quality. 20 carefully selected examples (10 happy path + 5 edge cases + 5 adversarial) are worth more than 10,000 random rows scraped from production logs. Karpathy has said the same in multiple talks — "dataset is the new code". Your golden set is the source of truth for the eval; maintain it like core code (diff, review, version).
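If the golden set is core code, it deserves a loader that refuses bad data. Below is a minimal sketch, assuming the set lives in a JSONL file where each row has `id`, `input`, and `expect`, and assuming the id prefix (`happy_`, `edge_`, `adv_`) encodes the category. The function names are hypothetical, not from any library.

```python
import json
from collections import Counter

def load_golden(lines):
    """Parse JSONL lines into cases, rejecting malformed or duplicate entries."""
    cases, seen = [], set()
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        case = json.loads(line)
        for key in ("id", "input", "expect"):
            if key not in case:
                raise ValueError(f"line {n}: missing '{key}'")
        if case["id"] in seen:
            raise ValueError(f"line {n}: duplicate id {case['id']}")
        seen.add(case["id"])
        cases.append(case)
    return cases

def category_mix(cases):
    """Count cases per category, using the id prefix before '_' (an assumed convention)."""
    return Counter(c["id"].split("_", 1)[0] for c in cases)

sample = [
    '{"id": "happy_01", "input": "Weekly sync tomorrow 3pm in room A", "expect": {"time": "15:00"}}',
    '{"id": "edge_tz", "input": "Mon 9am PT sync", "expect": {"time": "09:00", "tz": "PT"}}',
    '{"id": "adv_noop", "input": "Nice weather today", "expect": {"time": null}}',
]
cases = load_golden(sample)
print(dict(category_mix(cases)))
```

Run this in review whenever the golden file changes, so a PR that quietly drops every adversarial case shows up as a changed category count.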
Scorers have three tiers; always reach for the cheapest one first:
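Going by the scorers used in the rest of this issue, the tiers are presumably code-based checks, then an LLM judge, then human review. Here is a rough sketch of "cheapest first" under that assumption. The case keys `expect` and `rubric`, and the functions `code_score`, `score_case`, and `stub_judge`, are illustrative names only.

```python
import json

def code_score(raw_output, expect):
    """Tier 1: deterministic field match on JSON output. Returns 1 or 0."""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0  # unparseable output is a hard fail
    if not isinstance(pred, dict):
        return 0
    return int(all(pred.get(k) == v for k, v in expect.items()))

def score_case(raw_output, case, llm_judge=None):
    """Return (tier, score). score is None when only a human can decide."""
    if "expect" in case:  # cheapest: plain code
        return "code", code_score(raw_output, case["expect"])
    if "rubric" in case and llm_judge is not None:  # mid-cost: model call
        return "llm_judge", llm_judge(raw_output, case["rubric"])
    return "human", None  # most expensive: goes to a review queue

# Stub judge so the sketch runs offline; swap in a real model call.
def stub_judge(raw_output, rubric):
    return int(len(raw_output) > 0)

print(score_case('{"time": "15:00"}', {"expect": {"time": "15:00"}}))
print(score_case("A short summary.", {"rubric": "faithful to source?"}, stub_judge))
print(score_case("A short summary.", {"notes": "needs expert review"}))
```

The point of the ordering is cost and stability: a case that can be scored by code should never reach a judge, and a case that reaches a human should be rare enough to afford.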
MVP eval for "extract meeting time from an email" — fewer than 60 lines, runnable end-to-end:
# eval_meeting_extractor.py: minimal eval that actually runs
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# ① Golden Set: 20 handpicked cases, each is (input + expectation)
GOLDEN = [
    {"id": "happy_01", "input": "Weekly sync tomorrow 3pm in room A",
     "expect": {"time": "15:00", "location": "room A"}},
    {"id": "edge_tz", "input": "Mon 9am PT sync",
     "expect": {"time": "09:00", "tz": "PT"}},
    {"id": "adv_noop", "input": "Nice weather today",
     "expect": {"time": None}},  # must not hallucinate
    # ... 17 more
]

# Literal JSON braces are doubled so that str.format only fills {text}
PROMPT = """You are a meeting time extractor. Output strict JSON:
{{"time": "HH:MM" or null, "location": str or null, "tz": str or null}}
Input: {text}"""

def run(text):
    r = client.messages.create(model="claude-opus-4-7", max_tokens=200,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}])
    return json.loads(r.content[0].text)  # raises on non-JSON output

# ② Scorer: all code-based; cheap and stable
def score(pred, expect):
    # pass only if every expected field matches exactly
    for k, v in expect.items():
        if pred.get(k) != v:
            return 0
    return 1

# ③ Runner: one for loop
def evaluate():
    fails = []
    for case in GOLDEN:
        try:
            pred = run(case["input"])
            if score(pred, case["expect"]) == 0:
                fails.append((case["id"], pred, case["expect"]))
        except Exception as e:  # API errors and bad JSON count as failures
            fails.append((case["id"], "ERROR", str(e)))
    pass_rate = 1 - len(fails) / len(GOLDEN)
    print(f"PASS {pass_rate:.0%} ({len(GOLDEN)-len(fails)}/{len(GOLDEN)})")
    for f in fails:
        print("  FAIL", f)
    return pass_rate

if __name__ == "__main__":
    evaluate()
This is enough to ship behind. Run it on every prompt change; dropping from 17/20 to 14/20 is a regression. Three months in you'll have a 200-case golden set — then start thinking about Braintrust / Langfuse.
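A pass rate alone can hide a regression: fix one case, break another, and the headline number doesn't move. A small sketch of comparing two runs case by case, assuming you keep each run as a `{case_id: passed}` map (the `diff_runs` helper is hypothetical):

```python
def diff_runs(before, after):
    """Compare two {case_id: passed} maps and report which cases flipped."""
    shared = before.keys() & after.keys()
    regressed = sorted(c for c in shared if before[c] and not after[c])
    fixed = sorted(c for c in shared if not before[c] and after[c])
    return {"regressed": regressed, "fixed": fixed}

# Both runs score 2/3, yet one case silently broke.
before = {"happy_01": True, "edge_tz": False, "adv_noop": True}
after = {"happy_01": True, "edge_tz": True, "adv_noop": False}
flips = diff_runs(before, after)
print(flips)
```

Treat any entry under `regressed` as a blocker until someone has looked at it, even when the overall pass rate went up.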
When the task is open-ended generation (summary quality, answer relevance, faithful rewrite), code-based scoring breaks down and most people switch to LLM-as-judge: give a judge an instruction and have it score or pick a winner. The problem: LLM judges carry systematic biases. If you use them raw, you measure the judge's preferences, not your model's capability.
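Zheng et al. 2023's "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (the MT-Bench paper, NeurIPS Datasets & Benchmarks) systematically documented four judge limitations that almost all subsequent work builds on: position bias (favoring an answer because of where it appears), verbosity bias (favoring longer answers), self-enhancement bias (favoring answers the judge model itself generated), and limited ability to grade math and reasoning questions.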
None of these are prompt issues. They come from how the judge models were trained (RLHF objectives are the usual suspect), and you can't eliminate them just by telling the judge "don't be biased". You need structural debiasing.
# —— 1. Position Swap ——
# Always run pairwise twice: (A,B) and (B,A). Winner must win both. Disagreement → TIE.
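# judge() returns "A" if the first answer shown wins, "B" if the second does
score_ab = judge(q, ans_a, ans_b)  # ans_a shown first
score_ba = judge(q, ans_b, ans_a)  # ans_b shown first, so "B" here is a win for ans_a
winner = ("A" if score_ab == "A" and score_ba == "B" else
          "B" if score_ab == "B" and score_ba == "A" else "TIE")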
# —— 2. Verbosity Control ——
# Hard-constrain in the judge prompt: when length differs >30%, prioritize substance check
JUDGE_PROMPT = """Compare A and B. CRITICAL: do not reward verbosity.
If lengths differ by >30%, explicitly check whether the longer one
adds substance or just padding. Penalize padding."""
# —— 3. Cross-family Judging ——
# Judge Claude outputs with GPT; judge GPT outputs with Claude
# Or use 3 different models as judges and take the majority vote
judges = ["claude-opus-4-7", "gpt-5", "gemini-2.5-pro"]
votes = [judge_with(m, q, a, b) for m in judges]
winner = max(set(votes), key=votes.count) # majority
# —— 4. Rubric-ize (instead of a single score) ——
# Don't ask for an "overall score"; ask for independent 0/1 dimensions
RUBRIC = """Score each independently (0 or 1):
- factual_correct: all stated facts verifiable from source?
- completeness: covers all parts of the question?
- format_valid: matches required JSON schema?
- no_hallucination: no info absent from source?
Return JSON: {factual_correct: 0/1, ...}"""
# —— 5. Human Calibration Set ——
# 50 human-labeled samples; before trusting the judge, measure agreement on these
# judge ↔ human agreement < 80% → judge prompt is unreliable, iterate
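# load() is your own helper; each row is assumed to look like {"sample": ..., "label": ...}
calibration = load("calibration_50.json")
judge_labels = [judge(c["sample"]) for c in calibration]
agreement = sum(c["label"] == j for c, j in zip(calibration, judge_labels)) / len(calibration)
assert agreement >= 0.80, f"Judge unreliable: {agreement:.0%}"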
Mental model: treat the judge like an intern you hired to score things. You wouldn't let an intern read a paragraph and slap a single score on it — you'd give them a rubric, have them write reasoning, and spot-check their grading. LLM judges are the same.
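To make the intern analogy concrete, here is one way to consume rubric verdicts. It is a sketch that assumes the judge returns JSON with the four 0/1 dimensions from the rubric above plus a free-text `reasoning` field. The `reasoning` key and all function names are assumptions for illustration.

```python
import json

DIMENSIONS = ("factual_correct", "completeness", "format_valid", "no_hallucination")

def parse_verdict(raw):
    """Parse one judge reply; reject anything that is not a complete 0/1 rubric."""
    verdict = json.loads(raw)
    for dim in DIMENSIONS:
        if verdict.get(dim) not in (0, 1):
            raise ValueError(f"bad or missing score for {dim!r}")
    return verdict

def dimension_rates(verdicts):
    """Pass rate per rubric dimension across all verdicts."""
    n = len(verdicts)
    return {dim: sum(v[dim] for v in verdicts) / n for dim in DIMENSIONS}

def spot_check_queue(verdicts):
    """Indices of verdicts with any failed dimension, for a human to read first."""
    return [i for i, v in enumerate(verdicts) if any(v[d] == 0 for d in DIMENSIONS)]

replies = [
    '{"reasoning": "all claims match the source", "factual_correct": 1,'
    ' "completeness": 1, "format_valid": 1, "no_hallucination": 1}',
    '{"reasoning": "adds a date not in the source", "factual_correct": 0,'
    ' "completeness": 1, "format_valid": 1, "no_hallucination": 0}',
]
verdicts = [parse_verdict(r) for r in replies]
print(dimension_rates(verdicts))
print(spot_check_queue(verdicts))
```

Per-dimension rates tell you which kind of failure moved. The spot-check queue is where a human reads the judge's reasoning and decides whether the judge or the model got it wrong.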
Change one line in a prompt, change the behavior entirely. That's not a flaw — it's a fundamental property of LLMs. The problem is most teams' relationship with prompts is "scattered across Slack screenshots and Jupyter cells", and the result is: a prompt change last week silently flipped an edge case from pass to fail, nobody noticed, until a customer complained.
The senior approach is to treat prompts as full-stack code:
Store prompts as prompts/*.md or prompts/*.txt, in git. Variables use {{var}} placeholders (Jinja style). Never embed prompts in Python string literals, because diffs lose newline changes.
Cover them with tests/test_*.py built on the §1 golden set. No prompt change ships without running regression.
Log (prompt_version, input_hash, output, score) into SQLite / DuckDB so you can precisely attribute "between v3 and v4, what kind of cases changed".
Mental model: prompts are not config, they're behavior code. Config changes alter business parameters; prompt changes can flip model decisions entirely. So they deserve stricter review than code, not looser.
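The run log is the piece teams skip most often, so here is a minimal sketch with the standard-library `sqlite3` module. The table name `runs`, the 16-character hash prefix, and the helper names are assumptions. Only the four logged fields come from the text above.

```python
import hashlib
import sqlite3

def open_log(path=":memory:"):
    """Open (or create) the run log; one row per (prompt_version, input)."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS runs (
        prompt_version TEXT, input_hash TEXT, output TEXT, score INTEGER,
        PRIMARY KEY (prompt_version, input_hash))""")
    return db

def log_run(db, prompt_version, text, output, score):
    """Record one eval result, keyed by a short hash of the input text."""
    input_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?)",
               (prompt_version, input_hash, output, score))

def changed_between(db, old, new):
    """Inputs whose score differs between two prompt versions."""
    return db.execute("""
        SELECT a.input_hash, a.score, b.score
        FROM runs a JOIN runs b ON a.input_hash = b.input_hash
        WHERE a.prompt_version = ? AND b.prompt_version = ?
          AND a.score != b.score
        ORDER BY a.input_hash""", (old, new)).fetchall()

db = open_log()
log_run(db, "v3", "Mon 9am PT sync", '{"time": "09:00"}', 1)
log_run(db, "v4", "Mon 9am PT sync", '{"time": null}', 0)
log_run(db, "v3", "Nice weather today", '{"time": null}', 1)
log_run(db, "v4", "Nice weather today", '{"time": null}', 1)
changed = changed_between(db, "v3", "v4")
print(changed)
```

For a file-backed log, call `db.commit()` after each batch of writes. The in-memory database here only exists to keep the sketch self-contained.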
# —— Repo layout ——
prompts/
  meeting_extractor.md          # prompt text with {{var}} placeholders
  meeting_extractor.meta.yml    # model, temperature, max_tokens, min_pass_rate
tests/
  golden_meeting.jsonl          # golden set
  test_meeting.py               # pytest invoking the §1 evaluate()
.github/workflows/
  eval.yml                      # CI
# —— prompts/meeting_extractor.meta.yml ——
model: claude-opus-4-7
temperature: 0
max_tokens: 200
min_pass_rate: 0.90 # dropping below this turns CI red
allow_regression: [] # explicitly approved failing case_ids
# —— tests/test_meeting.py ——
import yaml
# evaluator.evaluate is assumed to wrap the §1 runner and return an object
# with .pass_rate and .failed_ids (the §1 version returns only the pass rate)
from evaluator import evaluate

def test_meeting_regression():
    with open("prompts/meeting_extractor.meta.yml") as f:
        meta = yaml.safe_load(f)
    result = evaluate("meeting_extractor")
    assert result.pass_rate >= meta["min_pass_rate"], \
        f"Pass rate dropped: {result.pass_rate:.0%} < {meta['min_pass_rate']:.0%}"
    # ensure no new failure has slipped in
    new_fails = set(result.failed_ids) - set(meta["allow_regression"])
    assert not new_fails, f"New regressions: {new_fails}"
# —— .github/workflows/eval.yml ——
name: prompt-eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
With this minimum in place, the next time anyone changes a prompt in a PR, CI automatically runs the 20 golden cases and surfaces both the pass rate and the diff. Within a week your team won't be going back to vibes-based prompt edits.
(3) Letting allow_regression grow endlessly: that's a tech-debt alarm; every allowed failure needs a tracking issue. (4) Co-evolving the golden set with the model: optimizing the model against data the model itself generated is circular.
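The allow-list rule is easy to enforce mechanically. A small sketch, assuming you keep a separate map from case id to tracking issue next to `allow_regression`. That map, the `max_allowed` cap, and the `TRACKER-101` reference are all hypothetical.

```python
def audit_allowances(allow_regression, tracking_issues, max_allowed=5):
    """Return a list of problems with the allow-list; empty means it is healthy."""
    problems = []
    missing = sorted(cid for cid in allow_regression if not tracking_issues.get(cid))
    if missing:
        problems.append(f"no tracking issue for: {', '.join(missing)}")
    if len(allow_regression) > max_allowed:
        problems.append(
            f"{len(allow_regression)} allowed failures exceeds cap of {max_allowed}")
    return problems

allowed = ["edge_tz", "adv_noop"]
issues = {"edge_tz": "TRACKER-101"}  # hypothetical issue reference
problems = audit_allowances(allowed, issues)
print(problems)
```

Calling this from the same pytest file as the regression test means an untracked allowance fails CI just like a new regression would.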
Anthropic has sketched the shape of its internal eval system in blogs and docs (Claude's Constitution, Core views on AI safety, eval examples in the Anthropic Cookbook). Key insight: split evals into two layers, capability evals that test one atomic skill at a time and task evals that run a real workflow end to end.
The common mistake at most teams is to run only one of the two layers.
The right approach is two layers + traceability: when a task eval fails, you can drill down to which specific capability went red. Example: customer-support agent dropping orders — is it instruction-following (didn't follow SOP)? entity extraction (missed the order number)? refusal (wrongly refused)? You need a capability eval for each.
evals/
├── capability/ # atomic capabilities, run on model/prompt change
│ ├── instruction_following.jsonl # 200 "follow the format" cases
│ ├── entity_extraction.jsonl # 150 "extract JSON" cases
│ ├── refusal_accuracy.jsonl # 80 "refuse when you should" cases
│ ├── long_context_recall.jsonl # 100 needle-in-haystack cases
│ └── tool_selection.jsonl # 120 "pick the right tool" cases
└── task/ # end-to-end, run on every release
├── customer_support.jsonl # 50 real tickets → expected category + reply
└── code_review.jsonl # 30 PRs → expected bug list
# —— two-way traceability: when a task fails, drill to capabilities ——
def diagnose(task_failure):
    """task eval fail → auto-run related capability evals to find the weak link"""
    related = TASK_TO_CAPABILITY[task_failure.task_name]
    # e.g. customer_support relates to [instruction_following, entity_extraction]
    diag = {}
    for cap in related:
        diag[cap] = run_capability_eval(cap).pass_rate
    return sorted(diag.items(), key=lambda x: x[1])  # lowest is the suspect
# —— usage ——
if task_eval.pass_rate < 0.85:
    print("Task regressed. Capability diagnosis:")
    for cap, rate in diagnose(task_eval):
        print(f"  {cap}: {rate:.0%}")
# Output (lowest first, so the first line is the culprit):
#   instruction_following: 62%
#   entity_extraction: 94%
A hidden bonus: this structure makes model migration explainable. Upgrading Claude Opus 4.7 → 4.8, or switching to Sonnet, you see "instruction_following +2%, entity_extraction -5%" instead of a blurry "overall it seems similar". Model selection stops being vibes.
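The migration report is a per-capability subtraction. A sketch, assuming each model's capability eval results are available as a `{capability: pass_rate}` map. The numbers below are made up to mirror the example in the text, and `capability_delta` is a hypothetical helper.

```python
def capability_delta(old_rates, new_rates):
    """Per-capability pass-rate change (new minus old), worst regression first."""
    shared = old_rates.keys() & new_rates.keys()
    deltas = {cap: round(new_rates[cap] - old_rates[cap], 4) for cap in shared}
    return sorted(deltas.items(), key=lambda kv: kv[1])

# Illustrative numbers only, not measured results.
old = {"instruction_following": 0.90, "entity_extraction": 0.94}
new = {"instruction_following": 0.92, "entity_extraction": 0.89}
report = capability_delta(old, new)
for cap, delta in report:
    print(f"{cap}: {delta:+.0%}")
```

Sorting worst-first puts the capability that needs attention at the top of the report, which is usually the only line anyone reads.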
Wire the four sections into a deliverable workflow. Pick the prompt you run most often — say "email summary + todo extraction" — and walk through these 7 steps:
tests/golden_xxx.jsonl.
90 minutes later, what you have isn't "one eval script"; it's this prompt's moat for the next year. Next model upgrade, next prompt rewrite, next team member change: the eval stays as your gatekeeper. That's the biggest capability gap between Anthropic / OpenAI / Cursor engineering teams and average teams: not using the model better, but building the eval earlier.