DAY 29 / PHASE 3 · FRONTIER

Eval Beyond Benchmark

Contamination · Construct Gap · Slice-Based · Agentic Eval

2026-06-12 · BigCat

A model's SOTA leaderboard score and its reliability in your pipeline are two nearly unrelated numbers.

Prerequisite → ai-ml-daily Day 15: Evaluation & Benchmarks (MMLU, HumanEval, MT-Bench)

// WHY THIS MATTERS

You pick a model by its SWE-bench, MMLU, and Arena rank — then it falls over in your scenario after launch. That isn't bad luck, it's structural. Public benchmarks measure "average score on a fixed set that's been gamed endlessly and very likely leaked into pretraining," while you need "no failures across 10 consecutive multi-turn interactions on my real input distribution." Three chasms sit between them: contamination, saturation, construct gap. Day 6 covered the mechanics of eval (golden sets, judge debiasing, regression tests); this issue covers the upstream judgment: why leaderboards systematically mislead, how to build a trustworthy eval set by mining your own production traces, why aggregate scores are the most dangerous metric, and the agent-era paradigm shift of "evaluate the trajectory, not the answer." Bottom line up front: a usable eval is your own, sliced, and adversarially refreshed.

// 01

The Three Sins of Public Benchmarks: Contamination, Saturation, Construct Gap

Claim: leaderboard scores systematically overestimate real capability — they don't measure the distribution you face in production.

Background & Principle

Three mechanisms decouple public scores from production reliability:

Contamination: long-lived public sets like MMLU and HumanEval have almost certainly entered new models' pre/post-training data in some form. SWE-bench has a subtler leak: analyses of the Verified subset found that roughly one-third of issues describe the fix directly in the issue text or comments, so models "copy" rather than "solve," inflating scores.
Saturation: top models cluster in an 88%–92% ceiling band, so the benchmark loses discriminative power — score gaps fall into the noise and can't tell you whether A beats B in your setting.
Construct Gap: a benchmark measures a proxy for the target, not the target. HumanEval tests "complete an isolated function"; your job is "fix a cross-file bug in a 500k-line repo." Research shows that swapping SWE-bench's long GitHub issues for the terse, vague phrasing real users write drops relative success rates by 20–40%.

Underneath it all is Goodhart's law: when a metric becomes a target, it ceases to be a good metric. The whole field is overfitting public benchmarks, and the correlation between score and real capability decays year over year.

Public benchmark score ────────────▶ Your production reliability
         88%                                    ?
┌───────────────────── three leaks ──────────────────────┐
│ 1) Contamination: answer in training data / issue text │  inflated
│ 2) Saturation: top cluster, falls into noise           │  useless for picking
│ 3) Construct gap: proxy task != your real distribution │  −20~40% drop
└────────────────────────────────────────────────────────┘
                            ▼
      Leaderboard #1 != most reliable in YOUR scenario

Hands-on

Turn "can I trust this benchmark" into a 30-second checklist pinned to your model-selection doc:

# See a shiny score? Ask these 4 first
1. How long public?      > 1 yr   → likely contaminated, discount it
2. Gap among top models? < 3pp    → it's noise, don't pick on this
3. Task shape like mine? single-fn/MCQ != multi-turn agent / long repo
4. Strict oracle?        weak tests pass "looks-right-but-wrong" answers

# Fails any one → that score can't drive a production decision
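Checklist item 2 can be made concrete with a quick sampling-error estimate. The sketch below is a rough approximation, not a full significance test: it treats each benchmark item as an independent pass/fail trial and ignores that both models see the same items (a paired test would be tighter). The function names are made up for illustration; the 164-item size is HumanEval's.

```python
import math

def accuracy_margin(acc: float, n_items: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for one accuracy score."""
    return z * math.sqrt(acc * (1 - acc) / n_items)

def gap_is_noise(acc_a: float, acc_b: float, n_items: int, z: float = 1.96) -> bool:
    """True if the gap between two scores is within their combined sampling error.

    Assumption: the two scores are treated as independent samples, which is
    conservative when both models are graded on the same items.
    """
    se = math.sqrt(acc_a * (1 - acc_a) / n_items + acc_b * (1 - acc_b) / n_items)
    return abs(acc_a - acc_b) < z * se

# A 164-item benchmark (HumanEval's size) at ~90% accuracy
margin = accuracy_margin(0.90, 164)          # roughly +/- 4.6pp
print(f"+/- {margin:.1%}")
print(gap_is_noise(0.92, 0.89, 164))         # a 3pp gap at the ceiling
print(gap_is_noise(0.92, 0.60, 164))         # a tier-level gap
```

On a small benchmark, a 3pp gap between two top models sits inside the error bars, which is why the checklist treats it as noise. Tier-level gaps survive the same check, which is what makes coarse filtering legitimate.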

Failure mode: picking a model straight off the leaderboard and shipping it. The right order is — use public leaderboards for coarse filtering (cut the clearly weak), use your own eval for the final pick. Public scores answer "roughly what tier is this model"; they can never answer "is it reliable enough on my task."

Going deeper · SWE-bench (Jimenez et al. 2024, arXiv:2310.06770), arxiv.org/abs/2310.06770 · HumanEval / Codex (Chen et al. 2021), MMLU (Hendrycks et al. 2021) originals

// 02

Build Your Eval Set Backward from Production Traces

Claim: the only trustworthy eval set comes from your own real traffic — not benchmarks, not synthetic examples conjured from nothing.

Background & Principle

Hamel Husain hammers one point in Your AI Product Needs Evals: failed AI products almost always share one root cause, the lack of an evaluation system, yet a usable one starts shockingly small. A few dozen real examples plus one scorer is enough to run. The key isn't volume, it's distribution match: your eval inputs must look like production inputs, or a high score is just self-deception.

The method is error analysis / open coding: pull a batch of real traces, read outputs by hand, and label the failures (don't fix a rubric first — look at the data, then induce failure modes). The EvalGen paper (Shankar et al. 2024) names this phenomenon criteria drift — you need criteria to grade outputs, but grading outputs is exactly what helps you define the criteria. So an eval isn't a write-once static doc; it co-evolves with your data.
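Once a batch has been read and labeled, the "induce failure modes" step is mostly counting. A minimal sketch, assuming your open-coding notes are (trace_id, label) pairs; the trace ids and the `failure_mode_counts` helper are hypothetical, and the labels reuse the examples from this section:

```python
from collections import Counter

# Hypothetical notes from one open-coding pass: (trace_id, free-text label)
notes = [
    ("t01", "wrong_tool"),
    ("t02", "ignored_constraint"),
    ("t03", "wrong_tool"),
    ("t04", "ok"),
    ("t05", "hallucinated_field"),
    ("t06", "wrong_tool"),
]

def failure_mode_counts(notes):
    """Count failure labels, most frequent first; 'ok' traces are skipped."""
    return Counter(label for _, label in notes if label != "ok").most_common()

for label, count in failure_mode_counts(notes):
    print(f"{label:<22} {count}")
```

The most frequent labels become the first rubric items and the first slice tags. Re-run the tally after each new labeling pass, since the criteria are expected to drift.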

Hands-on

# Minimal pipeline: build an eval set backward from prod logs
# load_prod_traces() and human_label() are placeholders for your own log loader and labeling step
import json, random

# 1) Stratified sampling: don't only grab successes; oversample failures
traces = load_prod_traces("2026-05")
base = random.sample(traces, min(60, len(traces)))      # natural distribution
picked = {id(t) for t in base}
hard = [t for t in traces
        if (t.thumbs_down or t.retried or t.fallback) and id(t) not in picked]  # no duplicates
sample = base + random.sample(hard, min(40, len(hard)))  # deliberately skew to hard cases

# 2) Open coding: read + label failure modes (data first, rubric second)
#    e.g. wrong_tool / hallucinated_field / ignored_constraint / too_verbose

# 3) Freeze into a ground-truth eval set, commit to the repo
eval_set = []
for t in sample:
    eval_set.append({"input": t.input, "context": t.context,
                     "expected": human_label(t),     # you label it, not the model
                     "slice": t.category})           # ← needed for §3 slicing
with open("evals/v1.json", "w") as f:
    json.dump(eval_set, f, ensure_ascii=False, indent=2)

Note the oversampling of hard cases in step 1: production traffic is dominated by easy requests (say, 95%), so natural-distribution sampling dilutes the eval with easy examples and masks critical failures. The eval set should deliberately skew toward where you most fear being wrong.

Failure mode: synthesizing eval data purely with an LLM. Synthetic sets have a systematic gap from the production distribution (cleaner phrasing, less ambiguity, missing tails); tuning to 95% on them still crashes on launch. Synthetic data can supplement tail coverage, but the skeleton must be real traces.

Going deeper · Hamel Husain Your AI Product Needs Evals, hamel.dev/blog/posts/evals · EvalGen Who Validates the Validators? (arXiv:2404.12272), arxiv.org/abs/2404.12272

// 03

Slice-Based Eval: Aggregate Scores Lie to You

Claim: 90% overall accuracy can hide a critical subgroup sitting at 40% — watching the aggregate is blindfolding yourself.

Background & Principle

This is the most counterintuitive — and most valuable — rule in real-world eval. A "90% pass" agent might be 95% on "refund requests" and only 45% on "address-change requests" — and address changes happen to be your highest-volume or highest-risk scenario. The aggregate averages away this in-distribution catastrophe.

The fix is slice-based evaluation: cut along business-meaningful dimensions, report each slice's score separately, and watch the worst slice rather than the mean. Slice dimensions usually come from the failure modes you open-coded in §2, or business category, input length, language, user segment. It's also a natural defense against Goodhart — when you optimize the aggregate it's easy to sacrifice a small slice for the headline; slicing leaves that trade nowhere to hide.

Hands-on

from collections import defaultdict

def sliced_report(eval_set, run_agent, scorer):
    buckets = defaultdict(lambda: [0, 0])   # slice -> [pass, total]
    for ex in eval_set:
        ok = scorer(run_agent(ex), ex["expected"])
        b = buckets[ex["slice"]]
        b[0] += int(ok); b[1] += 1
    # key: print ascending by score, worst slice on top
    rows = sorted(((s, p/n, n) for s,(p,n) in buckets.items()),
                  key=lambda r: r[1])
    for s, acc, n in rows:
        flag = "  ⚠️ below bar" if acc < 0.8 else ""
        print(f"{s:<22} {acc:5.0%}  (n={n}){flag}")

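# Illustrative output (made-up numbers; higher-scoring slices omitted).
# The aggregate looks fine at 90%, but the worst slice is only 45%: that's what to fix
# change_address        45%  (n=20)  ⚠️ below bar
# multi_intent          60%  (n=15)  ⚠️ below bar
# refund                95%  (n=40)
# ...
# ── aggregate ──       90%  (computed over all examples; sliced_report prints per-slice rows only)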

Your launch gate should hang on the worst slice ("every slice ≥ 80% to ship"), not the aggregate mean. A 45% slice masked by the aggregate is exactly where post-launch complaints come from.
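The gate itself can be a small pure function that CI calls. A minimal sketch, assuming per-slice results arrive as (passed, total) pairs; `failing_slices` and its `min_n` floor are illustrative names and defaults, not part of any library:

```python
def failing_slices(slice_scores, min_acc=0.80, min_n=12):
    """Return (slice, reason) for every slice that should block a launch.

    slice_scores: {slice_name: (passed, total)}
    min_acc:      every slice must reach this accuracy
    min_n:        slices smaller than this are flagged as undersampled
                  rather than scored (assumed floor, tune to your data)
    """
    problems = []
    for name, (passed, total) in sorted(slice_scores.items()):
        if total < min_n:
            problems.append((name, f"only n={total}, need >= {min_n}"))
        elif passed / total < min_acc:
            problems.append((name, f"accuracy {passed / total:.0%} < {min_acc:.0%}"))
    return problems

# Hypothetical results for three slices
scores = {"refund": (38, 40), "change_address": (9, 20), "multi_intent": (9, 15)}
for name, reason in failing_slices(scores):
    print(f"BLOCKED by {name}: {reason}")
```

In a test suite this reduces to `assert failing_slices(scores) == []`, so the failure message names the slice that blocked the release instead of reporting a single aggregate number.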

Failure mode: (1) tracking only the aggregate: across an iteration it rises 2pp while a small slice quietly collapses 20pp, invisible until users complain. (2) slicing too fine: at n=3 per slice, a single example moves the score by 33pp, so the scores are all noise. Keep at least a dozen-plus samples per slice so that one example shifts the score by less than 10pp.
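To see how little a tiny slice tells you, put an interval around each slice score. The sketch below uses the Wilson score interval, one common choice for small-sample proportions and not something the sources above prescribe; `wilson_interval` is a hand-rolled helper.

```python
import math

def wilson_interval(passed: int, total: int, z: float = 1.96):
    """Wilson score interval (about 95% for z=1.96) for a pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

for passed, total in [(1, 3), (2, 3), (9, 20)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{passed}/{total}: {lo:.0%} to {hi:.0%}")
```

At n=3 the intervals for 1/3 and 2/3 overlap heavily, so the two scores cannot be told apart. At n=20 a 9/20 slice has an upper bound well below an 80% bar, which is enough to act on.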

Going deeper · Eugene Yan Task-Specific LLM Evals that Do & Don't Work, eugeneyan.com/writing/evals · Product Evals in Three Simple Steps, eugeneyan.com/writing/product-evals

// 04

Agentic Eval: Evaluate the Trajectory, Not the Answer

Claim: single-turn benchmarks can't measure agents. The agent-era metrics are multi-run consistency and trajectory quality, not pass@1 on the final answer.

Background & Principle

Traditional benchmarks ask once and check once; a real agent is a multi-turn, tool-calling trajectory interacting back-and-forth with a (possibly misleading) user. Two paradigm shifts:

From outcome to trajectory: only checking whether the final answer is right lets through fragile trajectories that "got lucky but called the wrong tools in the wrong order." Evaluate the intermediate steps — which tools were called, were policies violated, were there redundant actions.
From pass@1 to pass^k (consistency): τ-bench (Yao et al. 2024) proposes pass^k — the probability that the same task succeeds on all of k consecutive runs. In the paper, SOTA function-calling agents already fall below 50% single-task success, and in the retail domain pass^8 drops below 25%. That gap is the chasm between "the demo works" and "production is reliable." Its scoring is hardcore too: it ignores conversational phrasing and compares the final database state against the annotated goal state.

For individuals and small teams, two ideas from τ-bench are enough: state-based scoring (verify side effects, not phrasing) + pass^k consistency (run the same task several times to see if it's stable).
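If you run each task n times and want pass^k for some k ≤ n, you don't need to rerun in blocks of k. The sketch below follows the combinatorial form described in the τ-bench paper as best recalled here (the per-task fraction of k-subsets of the n trials that are all successes, averaged over tasks). Treat the exact form as an assumption to verify against the paper; `estimate_pass_hat_k` is an illustrative name.

```python
from math import comb

def estimate_pass_hat_k(successes_per_task, n_trials: int, k: int) -> float:
    """Estimate pass^k from n_trials runs per task.

    successes_per_task: number of successful runs (0..n_trials) for each task
    Per task: C(c, k) / C(n, k), the chance that k runs drawn from the
    n observed runs are all successes. The result is the mean over tasks.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("need 1 <= k <= n_trials")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical: 4 tasks, 8 runs each
successes = [8, 8, 7, 4]
for k in (1, 2, 4, 8):
    print(f"pass^{k} = {estimate_pass_hat_k(successes, 8, k):.2f}")
```

With k=1 this reduces to the ordinary average success rate. As k grows, a task with even one failed run contributes less and less, and at k=n it contributes nothing.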

single-turn benchmark      agentic eval (trajectory)
┌───────────┐              ┌──────────────────────────────────────┐
│ Q ─▶ A ?  │              │ multi-turn: U⇄Agent, tools, misleads │
│ pass@1    │              │ ├ trajectory: tool seq / policy ok?  │
└───────────┘              │ ├ state: final DB == goal state?     │
check once                 │ └ consistency: pass^k (k-in-a-row)   │
luck counts as win         └──────────────────────────────────────┘

Hands-on

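def pass_hat_k(task, run_agent, check_final_state, k=8):
    # pass^k for ONE task: stable only if all k runs pass; any failure → False
    # run_agent must reset the environment before each run, otherwise later
    # runs pass trivially on state left behind by an earlier one.
    # Average this True/False over all tasks to get the pass^k rate.
    return all(check_final_state(run_agent(task)) for _ in range(k))

def check_final_state(result):
    # state-based: verify side effects, ignore how it phrased things
    # `db` and `charged_extra_fee` stand in for your own environment handles;
    # `result` is unused here because the check reads the world, not the reply
    return (db.get("order#123").address == "new addr"
            and not charged_extra_fee())     # also verify "did no harm"

# Trajectory compliance: is the PROCESS right, not just the result
def check_trajectory(trace):
    if "refund" in trace.tools_called and not trace.asked_confirmation:
        return "VIOLATION: refunded without confirmation"  # policy breach
    return "ok"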

For an agent, "succeeded once" is meaningless; pass^k is the launch signal (k=8 here). An agent with pass@1=80% but pass^8=30% gets through all eight runs cleanly on only 30% of tasks, so roughly seven in ten tasks fail at least once across eight attempts.

Failure mode: (1) judging only the final answer, never the trajectory — an agent that "stumbles into the right result" (skips confirmation but happens to do no harm) gets logged as a success and blows up after launch. (2) running only once. LLMs are stochastic; a single success doesn't mean stability — skipping consistency means reporting a 70% probability to your boss as 100%.
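Beyond a single policy flag, a trajectory audit can check ordering rules and redundant calls over the recorded tool sequence. A minimal sketch, assuming your traces can be reduced to a list of (tool name, args) records; `ToolCall`, `audit_trajectory`, and the tool names are hypothetical:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple = ()   # kept hashable so identical calls can be counted

def audit_trajectory(calls, must_precede=(("ask_confirmation", "refund"),)):
    """Return a list of findings; an empty list means the process was clean."""
    findings = []
    names = [c.name for c in calls]
    # Ordering policy: `before` must appear ahead of the first `after`
    for before, after in must_precede:
        if after in names and before not in names[:names.index(after)]:
            findings.append(f"policy: {after} without prior {before}")
    # Redundancy: the exact same call issued more than once
    for call, count in Counter(calls).items():
        if count > 1:
            findings.append(f"redundant: {call.name} called {count}x with identical args")
    return findings

lucky = [ToolCall("lookup_order", ("123",)),
         ToolCall("lookup_order", ("123",)),
         ToolCall("refund", ("123",))]
print(audit_trajectory(lucky))
```

A run like `lucky` can still end in the right database state, which is exactly the case outcome-only scoring logs as a success. Gate on both the state check and an empty findings list.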

Going deeper · τ-bench (Yao et al. 2024, arXiv:2406.12045), arxiv.org/abs/2406.12045 · Code: sierra-research/tau-bench, github.com/sierra-research/tau-bench

// Capstone · Wire a launch gate onto your agent

Chain the four points into a CI-ready eval gate in half a day:

Coarse filter: use public leaderboards only to cut clearly weak models, never for the final pick (§1).
Build the set: stratified-sample 100 traces from the last month's production, oversample thumbs-down / retry / fallback hard cases, hand-label ground truth + slice tags, freeze into evals/v1.json (§2).
Slice: scorer reports per slice; the gate hangs on worst slice ≥ 80%, not the aggregate mean (§3).
Consistency: run core tasks at pass^5, state-based verify side effects + check trajectory for policy compliance (§4).
Anti-overfit: keep one held-out slice never used to tune prompts, and rotate 20% with fresh traces monthly — once a set has been optimized against repeatedly, it's been Goodharted and needs new blood.

From then on, "can we ship" is the red/green of pytest evals/, not someone declaring "feels fine." Every regression this gate blocks is a production incident that would have happened.

// KEY TERMS

Contamination: The test set (partly) entered training data, inflating scores. Almost inevitable for old public benchmarks.
Saturation: Top models cluster at the ceiling; the benchmark loses discriminative power and can't guide selection.
Construct Validity: Whether a metric actually measures the capability you intend. Benchmark proxy task != your real task.
Goodhart's Law: "When a metric becomes a target, it ceases to be a good metric." The field's overfitting of benchmarks.
Slice-Based Eval: Report scores per business dimension, watch the worst slice rather than the aggregate mean.
Open Coding / Error Analysis: Building evals by reading real data and inducing failure modes first, then defining the rubric.
Criteria Drift: From EvalGen: grading outputs itself helps redefine grading criteria; evals co-evolve with data.
Trajectory Eval: Evaluating an agent's intermediate steps (tool sequence, policy compliance), not just the final answer.
pass^k: τ-bench metric: probability the same task passes all k consecutive runs; measures reliability, not one-shot ability.
Held-out Set: A set never used to tune prompts, guarding the eval against overfitting (Goodhart).

// DEEP THINKING

If public benchmarks systematically overestimate, why does the whole industry still chase them and keep climbing?

Because they solve a coordination problem, not a measurement problem: the field needs a comparable, reproducible, zero-marginal-cost shared coordinate system to communicate "who is roughly stronger," and benchmarks fill that need even when they don't measure your scenario. The error is on the consumer side, mistaking a "shared coordinate" for a "production decision basis." The right stance is a split: benchmarks for industry-level coarse filtering and trend-watching, your own eval for the product-level final pick. Rising scores aren't necessarily fake either — they may be half real gains, half overfitting, and you can't separate the two from the score itself, which is exactly why you need your own held-out eval.

§3 says hang the gate on the worst slice. But slices can be subdivided infinitely; at n=1 per slice everyone "has a failing slice." How do you set the granularity?

A slice is a statistical unit, not a magnifying glass. Each needs at least a dozen-plus samples for the score to escape noise — 33% vs 67% at n=3 are indistinguishable. Principle: slice dimensions must be business-actionable (when this slice is bad you know what to fix) and sufficiently sampled. When you can't have both, merge adjacent slices. Another signal: if a slice has only 1-2 samples yet matters a lot, that's not a slicing problem but undersampling of that scenario in your eval set — go back to §2 and add samples, don't make decisions on statistical noise.

pass^k sounds hardcore, but it lumps a 99%-once task and a 50%-once task both as "unstable." Isn't it too harsh, killing genuinely usable agents?

pass^k's harshness is deliberate: it aligns with the "lower bound of user experience," not the average. An agent at pass@1=95% has only a 0.95^8≈66% chance of eight clean uses in a row (assuming independent runs), meaning at least one failure in roughly 1/3 of eight-use sessions. In conversational products users perceive the worst turn, not the average. Whether it "over-kills" depends on the cost of failure: high-cost scenarios (payments, medical) warrant high k; low-cost, cheaply-retryable scenarios can relax to pass@1 or pass^2. k is your knob for failure tolerance, not an absolute truth.
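The arithmetic is worth running in both directions: what a given single-run rate implies for k clean runs, and what single-run rate a target requires. This is a back-of-envelope sketch that assumes runs are independent with the same success probability, which real agents only approximate; the function names are made up.

```python
def all_k_clean(p_single: float, k: int) -> float:
    """Chance that k independent runs all succeed."""
    return p_single ** k

def single_run_rate_needed(target: float, k: int) -> float:
    """Single-run success rate required to hit `target` for k clean runs."""
    return target ** (1 / k)

for p in (0.80, 0.95, 0.99):
    print(f"pass@1={p:.0%} -> 8 clean runs: {all_k_clean(p, 8):.0%}")

print(f"for 90% at k=8 you need pass@1 of about {single_run_rate_needed(0.90, 8):.1%}")
```

The inverse is the useful planning number: to get eight clean runs 90% of the time, a single run has to succeed nearly 99% of the time.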

Scoring trajectories with an LLM-as-judge — but the judge is itself a hallucinating LLM. Isn't this "letting the fox guard the henhouse"?

It's a real problem, which is precisely EvalGen's thesis: "validate the validators" — the judge must be aligned against human labels before use. Two lines of defense: 1) prefer a deterministic scorer whenever possible — §4's state-based checks (verify DB state, verify tool sequence) are deterministic and far more reliable than an LLM judge; agentic eval should lean on these as much as possible. 2) When an LLM judge is unavoidable (e.g. rating "is the answer helpful," a subjective axis), first measure the judge's agreement rate with humans on a labeled batch, only trust it if it passes, and re-measure periodically. The judge is a component that needs evaluating, not an exempt referee.
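Measuring judge-human agreement takes a few lines. The sketch below reports raw agreement plus Cohen's kappa, which corrects for agreement that would happen by chance. Kappa is a standard statistic chosen here for illustration, not something EvalGen mandates, and the labels are invented.

```python
def agreement_and_kappa(human, judge):
    """Raw agreement rate and Cohen's kappa between two label lists."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: both raters independently pick the same label
    labels = set(human) | set(judge)
    expected = sum((human.count(label) / n) * (judge.count(label) / n)
                   for label in labels)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical batch: True = "helpful", as labeled by a human and by the LLM judge
human = [True, True, False, False, True, False, True, True]
judge = [True, True, False, True, True, False, False, True]
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.0%}  kappa={kappa:.2f}")
```

Raw agreement can look healthy when one label dominates, so set your pass threshold on kappa (or on per-class agreement) and re-measure whenever the judge prompt or model changes.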

Rotating 20% of the held-out set monthly defends against Goodhart, but new samples aren't necessarily the same difficulty as old ones, so score swings mix in "set difficulty change" noise. What do you do?

This is eval engineering's classic tension: stability (fixed set is comparable) vs anti-overfit (rotation defends against Goodhart). In practice, split into two layers: a stable core (a long-unchanged anchor set for cross-version trends, accepting it'll slowly get overfit) + a rotation layer (monthly fresh blood, to catch new failures the anchor can't). Read the two scores separately: anchor up but rotation flat = overfitting; both up = real progress. Difficulty drift can be mitigated by stratified sampling (keep slice proportions constant when rotating), but not fully eliminated — which is why you need the anchor as a relative baseline rather than reading absolute scores alone.
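Stratified rotation of the rotation layer can be sketched as follows, assuming each example is a dict with a "slice" key as in the §2 eval set. `rotate_by_slice` is an illustrative helper, and the anchor set is simply never passed through it.

```python
import random
from collections import defaultdict

def rotate_by_slice(current, fresh_pool, frac=0.2, seed=0):
    """Swap about `frac` of each slice for fresh examples of the same slice.

    Slice sizes stay constant, so slice proportions do not shift between
    months. A slice with no fresh examples available is left untouched.
    """
    rng = random.Random(seed)
    cur_by_slice, new_by_slice = defaultdict(list), defaultdict(list)
    for ex in current:
        cur_by_slice[ex["slice"]].append(ex)
    for ex in fresh_pool:
        new_by_slice[ex["slice"]].append(ex)

    rotated = []
    for name, examples in cur_by_slice.items():
        n_swap = min(round(frac * len(examples)), len(new_by_slice[name]))
        rotated += rng.sample(examples, len(examples) - n_swap)   # survivors
        rotated += rng.sample(new_by_slice[name], n_swap)         # fresh blood
    return rotated

# Hypothetical sets: 10 examples in slice "a", 5 in "b", fresh traces only for "a"
current = ([{"id": f"a{i}", "slice": "a"} for i in range(10)]
           + [{"id": f"b{i}", "slice": "b"} for i in range(5)])
fresh = [{"id": f"new{i}", "slice": "a"} for i in range(5)]
rotated = rotate_by_slice(current, fresh)
print(len(rotated), sum(ex["id"].startswith("new") for ex in rotated))
```

Keeping slice sizes fixed removes one source of difficulty drift (mix shift) but not within-slice difficulty changes, which is why the anchor score is still needed as the baseline.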

// FURTHER READING

Hamel Husain · Your AI Product Needs Evals — the foundational practical guide to production eval
τ-bench (arXiv:2406.12045) — multi-turn tool-agent-user eval and the pass^k consistency metric
Who Validates the Validators? / EvalGen (arXiv:2404.12272) — criteria drift and judge alignment
Eugene Yan · Task-Specific LLM Evals that Do & Don't Work — designing evals by task type
SWE-bench (arXiv:2310.06770) — real GitHub issue eval and its contamination / construct limits