DAY 06 / PHASE 1 · ENGINEERING

Eval Engineering

Golden Set · LLM-as-Judge Debiasing · Prompt Regression · Anthropic Evals

2026-05-24 · BigCat

A prompt without an eval is folk magic. Editing prompts on vibes is engineering by vibes.

Foundation concepts → ai-ml-daily Day 15: Evaluation and Benchmarks (MMLU, HumanEval, MT-Bench)

// WHY THIS MATTERS

Ask 100 prompt authors "how do you know your last edit was an improvement?" and 99 will say "I tried a few examples and it felt better". That's why most AI products get worse after they ship: prompt engineering without evals isn't engineering, it's witchcraft. Hamel Husain's widely shared line, "Your AI product needs evals", means something specific: the model is a commodity, your eval suite is the moat.

This issue doesn't recap MMLU / HumanEval (mostly irrelevant to your real use case). It covers four things every senior practitioner should be doing: how to spin up a minimum viable eval from scratch, why LLM-as-judge must be debiased before you can trust it, how to treat prompts as code with regression tests, and how Anthropic itself uses evals to build Claude. By the end you should be able to set up a real regression-blocking eval for any prompt you own in about 90 minutes.

// 01

Minimum Viable Eval: 20 Examples + One Scorer

Claim: your first eval doesn't need a framework, an LLM judge, or 1000 rows — 20 real-scenario cases + one Python function block 80% of regressions.

Background & Principles

Most people don't ship evals because they think it requires infrastructure first. That's what landing pages for Weights & Biases / Braintrust / Langfuse made them believe. Hamel Husain's Your AI Product Needs Evals gives the dead-simple opening move: handpick 20 real inputs covering typical + edge cases, write Python to run the prompt and collect outputs, write Python to score each output. That's it. Don't add another file.

The core of an eval isn't the tool — it's dataset quality. 20 carefully selected examples (10 happy path + 5 edge cases + 5 adversarial) are worth more than 10,000 random rows scraped from production logs. Karpathy has said the same in multiple talks — "dataset is the new code". Your golden set is the source of truth for the eval; maintain it like core code (diff, review, version).

Scorers come in four tiers; always reach for the cheapest one first:

Eval engineering tiers (cost / latency / variance increase upward)
┌────────────────────────────────────────────────┐
│ Human eval   · gold standard · not scalable    │
├────────────────────────────────────────────────┤
│ LLM-as-judge · fallback for subjective tasks   │
│              · must be debiased & calibrated   │
├────────────────────────────────────────────────┤
│ Heuristic    · length / format / keyword       │
│              · effectively free                │
├────────────────────────────────────────────────┤
│ Code-based   · regex / schema / unit test      │
│              · always preferred when possible  │
└────────────────────────────────────────────────┘
Rule: drop one tier if you can; don't use a higher one for vibes
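To make the two cheapest tiers concrete, here is a minimal sketch for the meeting-time task. The function names `code_check` and `heuristic_check`, the 300-character limit, and the "chatty preamble" keywords are illustrative assumptions, not part of any framework.

```python
import json
import re

# 24-hour HH:MM, e.g. "09:00" or "15:30"
TIME_RE = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")

def code_check(raw: str) -> bool:
    """Code-based tier: output parses as a JSON object and `time` is null or HH:MM."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or "time" not in data:
        return False
    t = data["time"]
    return t is None or (isinstance(t, str) and TIME_RE.fullmatch(t) is not None)

def heuristic_check(raw: str, max_chars: int = 300) -> bool:
    """Heuristic tier: output is short and does not open with a chatty preamble."""
    opening = raw.lstrip().lower()
    return len(raw) <= max_chars and not opening.startswith(("sure", "here"))
```

Both checks run in microseconds and never disagree with themselves, which is why they sit below the judge in the tier diagram.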

Hands-on Example

MVP eval for "extract meeting time from an email" — fewer than 60 lines, runnable end-to-end:

# eval_meeting_extractor.py — minimal eval that actually runs
import json
import anthropic

client = anthropic.Anthropic()

# ① Golden Set — 20 handpicked cases, each is (input + expectation)
GOLDEN = [
    {"id":"happy_01", "input":"Weekly sync tomorrow 3pm in room A",
     "expect":{"time":"15:00", "location":"room A"}},
    {"id":"edge_tz",  "input":"Mon 9am PT sync",
     "expect":{"time":"09:00", "tz":"PT"}},
    {"id":"adv_noop", "input":"Nice weather today",
     "expect":{"time":None}},  # must not hallucinate
    # ... 17 more
]

# Literal JSON braces are doubled so str.format() only substitutes {text}
PROMPT = """You are a meeting time extractor. Output strict JSON:
{{"time": "HH:MM" or null, "location": str or null, "tz": str or null}}
Input: {text}"""

def run(text):
    r = client.messages.create(model="claude-opus-4-7", max_tokens=200,
        messages=[{"role":"user","content":PROMPT.format(text=text)}])
    return json.loads(r.content[0].text)

# ② Scorer — all code-based; cheap and stable
def score(pred, expect):
    for k, v in expect.items():
        if pred.get(k) != v: return 0
    return 1

# ③ Runner — one for loop
def evaluate():
    fails = []
    for case in GOLDEN:
        try:
            pred = run(case["input"])
            if score(pred, case["expect"]) == 0:
                fails.append((case["id"], pred, case["expect"]))
        except Exception as e:
            fails.append((case["id"], "ERROR", str(e)))
    pass_rate = 1 - len(fails) / len(GOLDEN)
    print(f"PASS {pass_rate:.0%} ({len(GOLDEN)-len(fails)}/{len(GOLDEN)})")
    for f in fails: print("  FAIL", f)
    return pass_rate

if __name__ == "__main__": evaluate()

This is enough to ship behind. Run it on every prompt change; dropping from 17/20 to 14/20 is a regression. Three months in you'll have a 200-case golden set — then start thinking about Braintrust / Langfuse.

Failure modes: (1) sampling 1000 rows from production logs as your golden set — 80% are trivial cases, coverage is worse than 20 handpicked ones; stratify by cluster / difficulty. (2) 100% pass rate — means your golden set is too easy; an eval that never fails isn't an eval. (3) Only watching pass rate, never the fail distribution — which fails are the same category? Edge case? Adversarial? Prompt gap? Stratified reports are 10× more useful than a single number.
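A stratified report can be as small as the sketch below. It assumes case ids carry their category as a prefix (`happy_`, `edge_`, `adv_`), as in the golden set above; `stratified_report` and the id `edge_dst` are hypothetical names used only for illustration.

```python
from collections import Counter

def stratified_report(case_ids, failed_ids):
    """Return {category: (passed, total)}, where category is the id prefix before '_'."""
    total = Counter(cid.split("_", 1)[0] for cid in case_ids)
    failed = Counter(cid.split("_", 1)[0] for cid in failed_ids)
    return {cat: (total[cat] - failed[cat], total[cat]) for cat in sorted(total)}

ids = ["happy_01", "happy_02", "edge_tz", "edge_dst", "adv_noop"]
report = stratified_report(ids, failed_ids=["edge_tz", "adv_noop"])
for cat, (passed, n) in report.items():
    print(f"{cat:<6} {passed}/{n}")
```

A 3/5 pass rate says little on its own; seeing that every adversarial case failed while every happy-path case passed points directly at what to fix.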

Going deeper · Hamel Husain Your AI Product Needs Evals, hamel.dev/blog/posts/evals · Eugene Yan Task-Specific LLM Evals, eugeneyan.com/writing/evals · Anthropic Create strong empirical evaluations, docs.anthropic.com/.../develop-tests

// 02

LLM-as-Judge Biases and How to Debias

Claim: an uncalibrated LLM judge isn't an eval — it's an amplifier. It magnifies your prompt's own bias into a 90% fake pass rate.

Background & Principles

When the task is open-ended generation (summary quality, answer relevance, faithful rewrite), code-based scoring breaks down and most people switch to LLM-as-judge: give a judge an instruction and have it score or pick a winner. The problem: LLM judges carry systematic biases. If you use them raw, you measure the judge's preferences, not your model's capability.

Zheng et al. 2023's Judging LLM-as-a-Judge (the MT-Bench paper, NeurIPS Datasets & Benchmarks) systematically demonstrated four biases that almost all subsequent work builds on:

None of these are prompt issues — they're side effects of RLHF objectives, and you can't eliminate them by telling the judge "don't be biased". You need structural debiasing.

Hands-on Example: 5 ready-to-use debias tricks

# —— 1. Position Swap ——
# Always run pairwise twice: (A,B) and (B,A). Winner must win both. Disagreement → TIE.
score_ab = judge(q, ans_a, ans_b)   # A first
score_ba = judge(q, ans_b, ans_a)   # B first
winner = "A" if score_ab == "A" and score_ba == "B" else \
         "B" if score_ab == "B" and score_ba == "A" else "TIE"

# —— 2. Verbosity Control ——
# Hard-constrain in the judge prompt: when length differs >30%, prioritize substance check
JUDGE_PROMPT = """Compare A and B. CRITICAL: do not reward verbosity.
If lengths differ by >30%, explicitly check whether the longer one
adds substance or just padding. Penalize padding."""

# —— 3. Cross-family Judging ——
# Judge Claude outputs with GPT; judge GPT outputs with Claude
# Or use 3 different models as judges and take the majority vote
judges = ["claude-opus-4-7", "gpt-5", "gemini-2.5-pro"]
votes = [judge_with(m, q, a, b) for m in judges]
winner = max(set(votes), key=votes.count)  # majority

# —— 4. Rubric-ize (instead of a single score) ——
# Don't ask for an "overall score"; ask for independent 0/1 dimensions
RUBRIC = """Score each independently (0 or 1):
- factual_correct: all stated facts verifiable from source?
- completeness:   covers all parts of the question?
- format_valid:   matches required JSON schema?
- no_hallucination: no info absent from source?
Return JSON: {factual_correct: 0/1, ...}"""

# —— 5. Human Calibration Set ——
# 50 human-labeled samples; before trusting the judge, measure agreement on these
# judge ↔ human agreement < 80% → judge prompt is unreliable, iterate
import json
with open("calibration_50.json") as f:
    cases = json.load(f)  # assumed shape: [{"input": ..., "human_label": ...}, ...]
judge_labels = [judge(c["input"]) for c in cases]
agreement = sum(c["human_label"] == j for c, j in zip(cases, judge_labels)) / len(cases)
assert agreement >= 0.80, f"Judge unreliable: {agreement:.0%}"

Mental model: treat the judge like an intern you hired to score things. You wouldn't let an intern read a paragraph and slap a single score on it — you'd give them a rubric, have them write reasoning, and spot-check their grading. LLM judges are the same.

Failure modes: (1) single-shot pairwise with a single judge — position bias eats 5–10% of your real signal. (2) Same-family judging (Claude judging Claude rewrites) — self-preference makes your prompt look like it's "always getting better" while it isn't. (3) Asking the judge for "A wins / B wins" without reasoning — judges pattern-match most of the time; requiring reasoning before the verdict lifts accuracy 5–15pp (Zheng 2023). (4) Generating the calibration set with an LLM — the calibration set must be human-labeled, otherwise you're proving yourself with yourself.
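For failure mode (3), one way to force reasoning before the verdict is to fix the output shape and parse only the final line. This is a sketch: the template wording, the `VERDICT:` convention, and the choice to treat a missing verdict as a TIE are all assumptions, not a prescribed format.

```python
import re

JUDGE_TEMPLATE = """Compare answers A and B to the question below.
First write 2-4 sentences of reasoning, then finish with a final line
of the exact form "VERDICT: A", "VERDICT: B", or "VERDICT: TIE".

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

VERDICT_RE = re.compile(r"^VERDICT:\s*(A|B|TIE)\s*$", re.MULTILINE)

def parse_verdict(judge_output: str) -> str:
    """Return the last VERDICT line; a missing or malformed verdict counts as TIE."""
    matches = VERDICT_RE.findall(judge_output)
    return matches[-1] if matches else "TIE"

prompt = JUDGE_TEMPLATE.format(question="What is 2+2?", answer_a="4", answer_b="5")
```

Taking the last match means the judge can mention "VERDICT" while reasoning without confusing the parser, and defaulting to TIE keeps a malformed reply from silently counting as a win.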

Going deeper · Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023), arxiv.org/abs/2306.05685 · Panickssery et al. LLM Evaluators Recognize and Favor Their Own Generations, arxiv.org/abs/2404.13076 · Anthropic Reducing bias in LLM-graded evaluations, docs.anthropic.com/.../test-and-evaluate

// 03

Prompt as Code: Regression Tests and Versioning

Claim: a prompt is a form of source code. It must be diffed, reviewed, and exercised by CI regression tests. Copy-pasting prompts across notebooks is 2023 hygiene.

Background & Principles

Change one line in a prompt and the behavior can change entirely. That's not a flaw; it's a fundamental property of LLMs. The problem is that most teams keep their prompts "scattered across Slack screenshots and Jupyter cells". The result: last week's prompt change silently flips an edge case from pass to fail, and nobody notices until a customer complains.

The senior approach is to treat prompts as full-stack code:

Mental model — prompts are not config, they're behavior code. Config changes alter business parameters; prompt changes can flip model decisions entirely. So they deserve stricter review than code, not looser.

Hands-on Example: minimal CI regression setup

# —— Repo layout ——
prompts/
  meeting_extractor.md       # prompt text with {{var}} placeholders
  meeting_extractor.meta.yml # model, temperature, max_tokens, min_pass_rate
tests/
  golden_meeting.jsonl       # golden set
  test_meeting.py            # pytest invoking the §1 evaluate()
.github/workflows/
  eval.yml                   # CI

# —— prompts/meeting_extractor.meta.yml ——
model: claude-opus-4-7
temperature: 0
max_tokens: 200
min_pass_rate: 0.90       # dropping below this turns CI red
allow_regression: []      # explicitly approved failing case_ids

# —— tests/test_meeting.py ——
import pytest, yaml
from evaluator import evaluate

def test_meeting_regression():
    with open("prompts/meeting_extractor.meta.yml") as f:
        meta = yaml.safe_load(f)
    result = evaluate("meeting_extractor")
    assert result.pass_rate >= meta["min_pass_rate"], \
        f"Pass rate dropped: {result.pass_rate:.0%} < {meta['min_pass_rate']:.0%}"
    # ensure no new failure has slipped in
    new_fails = set(result.failed_ids) - set(meta["allow_regression"])
    assert not new_fails, f"New regressions: {new_fails}"

# —— .github/workflows/eval.yml ——
name: prompt-eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/ -v
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

With this minimum in place, the next time anyone changes a prompt in a PR, CI automatically runs the 20 golden cases and surfaces both the pass rate and the diff. Within a week your team won't be going back to vibes-based prompt edits.
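The test above expects `evaluate()` to return an object with `pass_rate` and `failed_ids`, while the §1 script returns a bare float. One possible shape for `evaluator.py` is sketched below. `EvalResult` is a hypothetical name, and this version takes the golden set and a `run` callable directly; loading them by prompt name, as the test's `evaluate("meeting_extractor")` call implies, is left out.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalResult:
    total: int
    failed_ids: list = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        return 1 - len(self.failed_ids) / self.total if self.total else 0.0

def evaluate(golden: list, run: Callable[[str], dict]) -> EvalResult:
    """Run every golden case through `run` and collect the ids that fail."""
    failed = []
    for case in golden:
        try:
            pred = run(case["input"])
            ok = all(pred.get(k) == v for k, v in case["expect"].items())
        except Exception:  # a crash or unparseable output counts as a failure
            ok = False
        if not ok:
            failed.append(case["id"])
    return EvalResult(total=len(golden), failed_ids=failed)

# Usage with a stub in place of a real model call
golden = [
    {"id": "happy_01", "input": "Sync at 3pm", "expect": {"time": "15:00"}},
    {"id": "adv_noop", "input": "Nice weather today", "expect": {"time": None}},
]
stub = lambda text: {"time": "15:00"}  # always answers 15:00, so adv_noop fails
result = evaluate(golden, stub)
print(result.pass_rate, result.failed_ids)
```

Passing `run` in as an argument keeps the evaluator testable without an API key; CI supplies the real model call, unit tests supply a stub.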

Failure modes: (1) running eval with temperature ≠ 0: results jitter and no one trusts the numbers; regression tests should run at temperature=0 with a pinned model version, plus a fixed seed where the API supports one. (2) Watching only the pass-rate number, never the fail diff: the same pass rate can hide a swapped set of failures; always list "newly failing case_ids / newly passing case_ids". (3) allow_regression growing endlessly: that's a tech-debt alarm; every allowed failure needs a tracking issue. (4) Co-evolving the golden set with the model: optimizing the model against data the model itself generated is circular.
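The fail diff from failure mode (2) is a pair of set differences. A minimal sketch, assuming you persist the failed case ids from the baseline run (the function name `fail_diff` and the ids are illustrative):

```python
def fail_diff(baseline_failed, current_failed):
    """Compare two runs by which cases fail, not by how many."""
    base, cur = set(baseline_failed), set(current_failed)
    return {
        "newly_failing": sorted(cur - base),
        "newly_passing": sorted(base - cur),
    }

# Both runs fail 2 cases, so the pass rate is identical, yet the failures swapped.
diff = fail_diff(
    baseline_failed=["edge_tz", "adv_noop"],
    current_failed=["edge_tz", "happy_03"],
)
print(diff)
```

Here a happy-path case broke while an adversarial one got fixed, which a single pass-rate number would report as "no change".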

Going deeper · Eugene Yan What We've Learned From A Year of Building with LLMs (Part I: Evals & Monitoring), applied-llms.org · Shreya Shankar Operationalizing ML Tests, shreya-shankar.com · OpenAI Evals source, github.com/openai/evals

// 04

The Anthropic Evals Framework: Capability vs Task Eval

Claim: the eval framework Anthropic actually uses to build Claude isn't MMLU — it's thousands of capability slices combined with product-level task evals. You should split your eval the same way.

Background & Principles

Anthropic has sketched the shape of its internal eval system in blogs and docs (Claude's Constitution, Core views on AI safety, eval examples in the Anthropic Cookbook). Key insight: split evals into two layers.

The common mistake at most teams is to do only one:

The right approach is two layers + traceability: when a task eval fails, you can drill down to which specific capability went red. Example: customer-support agent dropping orders — is it instruction-following (didn't follow SOP)? entity extraction (missed the order number)? refusal (wrongly refused)? You need a capability eval for each.

Hands-on Example: two-layer eval layout

evals/
├── capability/                   # atomic capabilities, run on model/prompt change
│   ├── instruction_following.jsonl   # 200 "follow the format" cases
│   ├── entity_extraction.jsonl       # 150 "extract JSON" cases
│   ├── refusal_accuracy.jsonl        # 80 "refuse when you should" cases
│   ├── long_context_recall.jsonl     # 100 needle-in-haystack cases
│   └── tool_selection.jsonl          # 120 "pick the right tool" cases
└── task/                         # end-to-end, run on every release
    ├── customer_support.jsonl        # 50 real tickets → expected category + reply
    └── code_review.jsonl             # 30 PRs → expected bug list

# —— two-way traceability: when a task fails, drill to capabilities ——
def diagnose(task_failure):
    """task eval fail → auto-run related capability evals to find the weak link"""
    related = TASK_TO_CAPABILITY[task_failure.task_name]
    # e.g. customer_support relates to [instruction_following, entity_extraction]
    diag = {}
    for cap in related:
        diag[cap] = run_capability_eval(cap).pass_rate
    return sorted(diag.items(), key=lambda x: x[1])  # lowest is the suspect

# —— usage ——
if task_eval.pass_rate < 0.85:
    print("Task regressed. Capability diagnosis:")
    for cap, rate in diagnose(task_eval): print(f"  {cap}: {rate:.0%}")
# Output (rates are printed with the :.0% format used above):
#   instruction_following: 62%  ← the culprit
#   entity_extraction: 94%

Two-layer eval topology (Anthropic-style, simplified)
┌─────────────── Task Eval (product) ────────────────┐
│ customer_support · code_review · research_agent    │
│ 20–200 cases · real business outcome               │
│ run on every release                               │
└────────────────────┬───────────────────────────────┘
                     │ drill down on failure
┌────────────────────▼───────────────────────────────┐
│ Capability Eval (atomic capability slices)         │
│ instruction_following · entity_extraction          │
│ refusal · long_context · tool_selection · math     │
│ 100–1000 per capability · run on prompt/model      │
└────────────────────────────────────────────────────┘
Rule: every task fail should map to one or more capability regressions.
capabilities all green but task red → your capability slicing doesn't yet cover the real workload

A hidden bonus: this structure makes model migration explainable. Upgrading Claude Opus 4.7 → 4.8, or switching to Sonnet, you see "instruction_following +2%, entity_extraction -5%" instead of a blurry "overall it seems similar". Model selection stops being vibes.
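A migration report of that kind could be produced from two sets of capability pass rates. The sketch below uses made-up numbers and a hypothetical `migration_report` helper; only capabilities measured on both models are compared.

```python
def migration_report(old_rates, new_rates):
    """Per-capability pass-rate delta (new minus old), worst regression first."""
    shared = old_rates.keys() & new_rates.keys()
    deltas = {cap: new_rates[cap] - old_rates[cap] for cap in shared}
    return sorted(deltas.items(), key=lambda item: item[1])

# Illustrative numbers only
old = {"instruction_following": 0.90, "entity_extraction": 0.95}
new = {"instruction_following": 0.92, "entity_extraction": 0.90}
migration = migration_report(old, new)
for cap, delta in migration:
    print(f"{cap}: {delta:+.0%}")
```

Sorting by delta puts the largest regression at the top, so the first line of the report is the capability to investigate before switching models.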

Failure modes: (1) running only public benchmarks — MMLU / HumanEval / GSM8K are mostly irrelevant to your business and largely train-set contaminated. (2) Capability evals that are too broad — one "capability" covering 5 sub-skills; when it breaks you don't know which sub-skill. Slice atomically; err on the side of finer cuts. (3) Task and capability not linked — when something breaks, you can't drill down; the eval degrades into a dashboard. (4) Ignoring cost / latency — product eval isn't only quality; record tokens / latency / cache hit alongside. Launch decisions look at the Pareto frontier, not a single point.
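For failure mode (4), "look at the Pareto frontier" means discarding any configuration that another one beats on both quality and cost. A small sketch, with hypothetical configuration names and invented pass rates and costs:

```python
def pareto_frontier(candidates):
    """Keep configs that no other config beats on (higher pass_rate, lower cost)."""
    frontier = []
    for name, rate, cost in candidates:
        dominated = any(
            r >= rate and c <= cost and (r > rate or c < cost)
            for _, r, c in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, pass_rate, cost in USD per 1k cases); all values are illustrative
configs = [
    ("large_model", 0.95, 12.0),
    ("small_model", 0.90, 2.0),
    ("small_verbose", 0.88, 3.5),  # worse and pricier than small_model
]
frontier = pareto_frontier(configs)
print(frontier)
```

The launch decision is then a choice among the surviving configurations, with latency or cache-hit rate added as further dimensions in the same way.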

Going deeper · Anthropic Cookbook · Evals examples, github.com/anthropics/anthropic-cookbook · Anthropic Develop empirical tests, docs.anthropic.com/.../develop-tests · Liang et al. HELM: Holistic Evaluation of Language Models, crfm.stanford.edu/helm

// Putting it together · Build a complete eval for one prompt in 90 minutes

Wire the four sections into a deliverable workflow. Pick the prompt you run most often — say "email summary + todo extraction" — and walk through these 7 steps:

  1. Layer (§4, 10 min): write down what the task is, list the 2–4 capabilities it touches (long-context recall, entity extraction, format compliance). Decide which deserve their own slice.
  2. Build the golden set (§1, 30 min): handpick 20 real inputs — 10 happy, 5 edge, 5 adversarial. Annotate the expected JSON for each. Save as tests/golden_xxx.jsonl.
  3. Write a code-based scorer (§1, 10 min): anything that can be tested with JSON-field equality / regex / schema does not get an LLM judge. Write scorer logic per case.
  4. Run baseline (§1, 5 min): run on the current prompt; record pass rate and fail distribution. This is the floor — never drop below.
  5. Add a judge for the open-ended bits (§2, 20 min): for summary quality and other things that can't be string-equated, add an LLM judge: rubric-ized (3–4 independent 0/1 dimensions) + position swap + cross-family judge. Calibrate against 20 human labels until ≥ 80% agreement.
  6. Wire into CI (§3, 10 min): connect the prompt file, meta.yml, pytest, and GitHub Actions. Auto-run regression on every PR.
  7. Run a deliberate break (5 min): change one key instruction in the prompt and confirm CI turns red. If it doesn't, your golden set isn't sharp enough — add cases until breaking the prompt is detectable.

90 minutes later, what you have isn't "one eval script" — it's this prompt's moat for the next year. Next model upgrade, next prompt rewrite, next team member change — the eval stays as your gatekeeper. That's the biggest capability gap between Anthropic / OpenAI / Cursor engineering teams and average teams: not using the model better, but building the eval earlier.

// Deep Thinking

20 examples is clearly not production scale — but how much signal does it give? When do you need to expand to 200+?
20 cases reliably detect quality swings of ±20% (coarse-grained, good enough to block most regressions). Scale to 200+ when: (1) prompt iteration moves into fine-tuning (need to detect ±5%); (2) the product covers multiple languages / scenarios (each sub-pattern wants ≥ 30 cases); (3) you're running A/B experiments and need statistical power (typically ≥ 100 per arm). Anthropic's internal evals usually run 100–1000 cases per stratum.
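The resolution figures above can be sanity-checked with a confidence interval on the pass rate. The sketch below uses the Wilson score interval at 95% confidence, which is one reasonable choice for small samples and not necessarily how the figures in this answer were derived.

```python
import math

def wilson_interval(passed, n, z=1.96):
    """95% Wilson score interval for an observed pass rate of passed/n."""
    p = passed / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Same 85% observed pass rate at two golden-set sizes
for passed, n in [(17, 20), (170, 200)]:
    lo, hi = wilson_interval(passed, n)
    print(f"{passed}/{n}: {lo:.0%} to {hi:.0%}")
```

At an 85% pass rate the interval is roughly 15 points either side with 20 cases and about 5 points with 200, which matches the coarse-versus-fine distinction drawn above.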
Which LLM-as-judge bias is hardest to remove — position or verbosity?
Verbosity. Position bias is essentially eliminated by swap-and-average (run each pair twice with reversed order). Verbosity needs: (1) explicit "do not prefer longer answers" in the judge prompt (limited effect); (2) length normalization post-hoc; (3) multi-judge ensemble. It's stubborn because "longer = better" is a strong prior in GPT/Claude RLHF data.
Treating prompts as source code with regression tests — but LLM outputs are non-deterministic. How do you handle false positives?
Three-layer defense: (1) fixed temperature=0 + fixed seed (kills most non-determinism); (2) fuzzy match rather than exact match ("contains key concept" instead of byte-equal); (3) allow N-of-M (a 10-case set tolerates 1 failure; only past that does CI break). Residual false positives go through human triage. Anthropic's evals framework has these baked in.
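Layers (2) and (3) of that defense fit in a few lines. This is a sketch with hypothetical helper names (`fuzzy_pass`, `gate`); substring matching is the simplest possible reading of "contains key concept", and real cases may need something stricter.

```python
def fuzzy_pass(output: str, required_concepts: list[str]) -> bool:
    """Pass if every required concept appears in the output, ignoring case."""
    text = output.lower()
    return all(concept.lower() in text for concept in required_concepts)

def gate(results: list[bool], max_failures: int = 1) -> bool:
    """N-of-M gate: CI stays green while failures do not exceed max_failures."""
    return results.count(False) <= max_failures

outputs = ["Refund issued within 5 days.", "We will REFUND you in 5 days", "Sorry, no."]
results = [fuzzy_pass(o, ["refund", "5 days"]) for o in outputs]
print(results, gate(results))
```

Keep `max_failures` small and explicit: a generous tolerance hides real regressions just as effectively as having no eval.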
"Capability eval vs Product eval" — concrete example?
Capability eval measures the model in isolation (can Claude detect SQL injection patterns?). Product eval measures end-to-end UX. Example for a support agent: capability eval = "given a ticket, can Claude find the correct knowledge-base article" (tests retrieval); product eval = "did the user, 30s after sending the message, get a response that solved their problem" (covers retrieval + ranking + reply style + latency). You need both — capability helps debug, product aligns to business.
Why doesn't Anthropic use MMLU to evaluate Claude? Why does the industry still use it?
Anthropic doesn't because: (1) MMLU is largely fit into training sets (heavy contamination); (2) it only tests multiple-choice, not generation; (3) the questions are dated. The industry still uses it because: (1) publications need a standard benchmark for comparability; (2) marketing wants a single number; (3) it still carries weak signal for model selection. Anthropic internally uses thousands of handcrafted capability slices + real product tasks.

// Further Reading