AI/ML Explained: Evaluation & Benchmarks

Day 15 · 2026-06-01
For: engineers with coding experience, outside the AI field
Engineering counterpart → super-individual D6: Eval engineering (building your own offline eval pipeline)

An unavoidable question: on what grounds do you say "this model is better than that one"? In software engineering you have a test suite: red/green, unambiguous. But an LLM's output is open-ended text with no single correct answer. Today we dissect four core evaluation mechanisms, and they happen to map onto the testing pyramid you already know: MMLU is a unit assertion, HumanEval is an acceptance test where code only counts if the tests pass, MT-Bench is a multi-turn integration test, and LLM-as-Judge is the automated referee, along with the referee's own bugs.

Massive Multitask Language Understanding (MMLU)

knowledge benchmark · multiple choice
One-line analogy

MMLU is the standardized exam we give models—a multiple-choice question bank across 57 subjects (math, law, medicine, history…). In your world it maps to a regression suite: a fixed, unchanging set of assertions you run against every new model version to see if the score regressed. Each question is one assert.

The problem it solves + how it works

Pain point: early NLP had one benchmark per task, fragmented like a hundred incompatible test frameworks—nobody could answer "how much world knowledge does this model actually have?" Hendrycks et al. (2021) merged multiple-choice questions from 57 domains into a single score, spanning elementary math to professional law.

The key mechanism isn't free-form answering (you can't auto-grade that)—it's collapsing open generation into a classification problem: give the model the stem plus four options A/B/C/D and see which option it assigns the highest log-likelihood (how probable the model thinks that string is). Highest-probability option = the model's pick; correct = +1. That "dimensionality reduction" is exactly why the benchmark scales with zero human labor.

Honestly, it has a fatal weakness: contamination—the question text leaked into training data, so the model isn't "solving" but "reciting," inflating the score. It's like test cases sneaking into the code under test.
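One rough way to probe for contamination is to check whether long word n-grams from a benchmark question also show up in a sample of training text. This is offered as an assumption, not something the MMLU authors prescribe. The helper names (`ngrams`, `overlap_ratio`), the choice of n = 8, and the sample texts are all hypothetical; a real pipeline would normalize text more carefully and scan a far larger corpus.

```python
import re

def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question, corpus_docs, n=8):
    """Fraction of the question's n-grams that also appear in the corpus sample."""
    q_grams = ngrams(question, n)
    if not q_grams:  # question shorter than n words: nothing to compare
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(q_grams & corpus_grams) / len(q_grams)

# Hypothetical data: one document that quotes the question, one that does not
question = ("A ball is thrown straight up at 20 meters per second. "
            "How high does it rise before it stops?")
leaked = ["Practice set: a ball is thrown straight up at 20 meters per second. "
          "How high does it rise before it stops? Answer below."]
clean = ["Projectile motion follows from constant acceleration under gravity."]

print(overlap_ratio(question, leaked))  # high ratio: the question text was seen verbatim
print(overlap_ratio(question, clean))   # low ratio: no long shared spans
```

A high ratio is only a warning sign. Paraphrased or translated leaks slip past an exact n-gram match, so a low ratio does not prove the benchmark is clean.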

Code example
# MMLU scoring is really about comparing option log-likelihoods, not free generation
from datasets import load_dataset

ds = load_dataset("cais/mmlu", "high_school_physics", split="test")
q = ds[0]
# q["question"], q["choices"] = ["option A",...], q["answer"] = correct index

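# Build the full prompt: stem, four lettered options, then an answer cue
options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"]))
prompt = f"{q['question']}\n{options}\nAnswer:"

# For each candidate, compute log P(option | prompt); pick the max
def score_choice(prompt, letter):
    # `model.logprob` is a placeholder for your model's scoring call;
    # real frameworks (lm-eval-harness) take the answer token's logprob
    return model.logprob(prompt + " " + letter)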

pred = max("ABCD", key=lambda L: score_choice(prompt, L))
correct = ("ABCD"[q["answer"]] == pred)  # auto-graded, zero human labor
Common misconception + use case
Misconception: "high MMLU = the model is genuinely smart." Wrong. MMLU measures knowledge retrieval + multiple-choice test-taking skill—not reasoning depth, not open-generation quality, and definitely not multi-turn dialogue. A model that's good at reciting can score high on MMLU yet write poor code and make a poor agent.
📌 Super-individual scenario: when picking a model, treat MMLU as a "smoke test"—quickly eliminate the obviously inadequate, but never as the final verdict. To actually pick a model for "help me read papers / make decisions," look at evals close to your task (reasoning, long-context comprehension), not one aggregate number.
Takeaway + question
💡 MMLU's genius is "collapsing an open problem into auto-gradable multiple choice"; its limit comes from the same source—what can be auto-graded is often not the ability you most want to measure.
🤔 In your CI, which checks are "multiple-choice" deterministic assertions, and which need a human to judge quality? How do you test the latter today?

HumanEval and pass@k: Functional Correctness

code eval · probability estimation
One-line analogy

HumanEval is the model's TDD acceptance test: 164 Python problems, each with a set of hidden unit tests; generated code only counts if the tests pass, regardless of how pretty it looks. And pass@k is something you already know: when you configure retries for a flaky downstream service, what's the probability of at least one success in k retries?

The problem it solves + how it works

Pain point: code can't be graded by string matching. Two programs can be character-for-character different yet both correct, or nearly identical yet one right and one wrong. So BLEU and edit distance both fail. The answer from Chen et al. (2021, the Codex paper) is blunt: functional correctness. Run the unit tests; passing = correct.

The mathematically interesting part is pass@k. It answers: "sampling k solutions, what's the probability at least one is correct?" Estimating it directly from only k samples per problem has high variance, and plugging an empirical pass rate p̂ into the naive form 1−(1−p̂)^k is biased (the Codex paper notes that it consistently underestimates). The Codex paper instead uses an unbiased estimator: per problem, sample n ≥ k solutions (the paper uses n = 200 for k ≤ 100), count the c correct ones, then:

pass@k = 1 − C(n−c, k) / C(n, k)

Unpacked symbol by symbol it's intuitive:
· C(n, k): number of ways to draw k from n solutions (the binomial coefficient).
· C(n−c, k): number of ways to draw k from the n−c wrong solutions—i.e., the count of "all k drawn are wrong."
· Their ratio = probability that all k drawn are wrong; 1 minus it = probability of at least one correct.
Estimate the true success rate with a large n, then compute this combinatorial formula—far more stable than directly retrying k times. In essence: trade more samples for lower estimation variance.

One problem, n=8 sampled solutions, c=2 correct:

pass@1 = 1 − C(6,1)/C(8,1) = 1 − 6/8 = 25%
pass@4 = 1 − C(6,4)/C(8,4) = 1 − 15/70 ≈ 79%
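The worked numbers for n=8, c=2 can be checked by evaluating the combinatorial formula literally with Python's `math.comb`. This is a minimal sketch for small n; the function name `pass_at_k_exact` is illustrative.

```python
from math import comb

def pass_at_k_exact(n, c, k):
    """1 - C(n-c, k) / C(n, k): chance that k draws from n samples include a correct one."""
    if k > n:
        raise ValueError("k cannot exceed n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case
    return 1.0 - comb(n - c, k) / comb(n, k)

for k in (1, 4, 7):
    print(k, round(pass_at_k_exact(8, 2, k), 4))
# prints:
# 1 0.25
# 4 0.7857
# 7 1.0
```

With only 6 wrong samples, any draw of 7 must include a correct one, which is why k=7 gives exactly 1.0.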
Code example
# pass@k unbiased estimator (the standard implementation from the Codex paper)
import numpy as np

def pass_at_k(n, c, k):
    # n=total samples, c=correct count, k=the k to evaluate
    if n - c < k:                 # fewer than k wrong → must hit a correct one
        return 1.0
    # product form of C(n-c,k)/C(n,k), avoids large-factorial overflow
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Evaluate one problem: sample n solutions, run hidden unit tests on each
# (`model`, `problem`, and `run_unit_tests` are placeholders for your own harness)
samples = [model.generate(problem["prompt"]) for _ in range(8)]
c = sum(run_unit_tests(s, problem["test"]) for s in samples)

print(pass_at_k(n=8, c=c, k=1))   # prob. of getting it right in one shot
print(pass_at_k(n=8, c=c, k=4))   # prob. of at least one correct in 4
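A small exact calculation shows why the combinatorial estimator is preferred over plugging the empirical pass rate c/n into 1−(1−p)^k. The sketch assumes each sample is independently correct with probability p, so c follows a Binomial(n, p) distribution, and it averages each estimator over every possible c. The values n=20, k=10, p=0.05 are illustrative choices, not figures from the paper.

```python
from math import comb

def unbiased(n, c, k):
    """Combinatorial estimator: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def plug_in(n, c, k):
    """Naive estimator: substitute the empirical rate c/n into 1 - (1-p)^k."""
    return 1.0 - (1.0 - c / n) ** k

def expected_value(estimator, n, k, p):
    """Exact expectation of an estimator when c ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )

n, k, p = 20, 10, 0.05
truth = 1.0 - (1.0 - p) ** k
print(f"true pass@{k}:      {truth:.4f}")
print(f"unbiased, averaged: {expected_value(unbiased, n, k, p):.4f}")
print(f"plug-in, averaged:  {expected_value(plug_in, n, k, p):.4f}")
```

On average the combinatorial estimator matches the true value, while the plug-in form lands below it. The gap shrinks as n grows relative to k.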
Common misconception + use case
Misconception: "low pass@1 means the model is useless." Not necessarily. When you have a verifier (you can auto-run tests), high pass@10 is good enough—let the model try repeatedly and the verifier picks the right one. That's exactly how coding agents work: generate → run tests → retry on failure. When auto-verification is cheap, "sampling multiple times" is an underrated strategy.
📌 Super-individual scenario: when evaluating a coding agent, don't just look at pass@1. If your workflow lets the agent run its own tests and iterate, pass@k (larger k) is the true usability metric—it directly decides whether you can let it auto-fix bugs hands-off.
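The generate → run tests → retry loop can be sketched in a few lines. Everything here is a stand-in: `first_passing` and `passes_tests` are hypothetical helpers, and the "model samples" are hard-coded strings rather than real generations.

```python
from typing import Callable, Iterator, Optional

def first_passing(candidates: Iterator[str],
                  verifier: Callable[[str], bool],
                  max_tries: int) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None after max_tries."""
    for attempt, code in enumerate(candidates, start=1):
        if verifier(code):
            return code
        if attempt >= max_tries:
            break
    return None

def passes_tests(code: str) -> bool:
    """Toy verifier: execute the candidate and check add() against two cases."""
    scope: dict = {}
    try:
        exec(code, scope)  # only safe for trusted input; sandbox real model output
        return scope["add"](2, 3) == 5 and scope["add"](-1, 1) == 0
    except Exception:      # syntax errors, missing function, runtime errors
        return False

# Stand-in for model samples: two wrong attempts, then a correct one
fake_samples = iter([
    "def add(a, b): return a - b",   # wrong logic
    "def add(a, b) return a + b",    # syntax error
    "def add(a, b): return a + b",   # correct
])
print(first_passing(fake_samples, passes_tests, max_tries=4))
```

The verifier carries the whole design. The loop is only as trustworthy as the tests it runs, so weak tests turn "passes" into a false signal.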
Takeaway + question
💡 The paradigm shift in code eval is "from looks-right to runs-right"; pass@k then tells you: when you have a verifier, the model's value isn't "right on the first try" but "can it hit the answer in a few tries."
🤔 Which of your tasks have a cheap automated verifier (tests, compilation, schema validation)? Those are the best candidates for "AI samples many times + auto-filters."

MT-Bench: Multi-Turn Benchmark

open generation · multi-turn
One-line analogy

If MMLU/HumanEval are unit tests, MT-Bench is the integration test: it checks whether a model can keep context consistent and not drop the ball on the follow-up across two turns. Like testing a stateful session service: passing the first request doesn't count; you need the session state to still be correct across several requests.

The problem it solves + how it works

Pain point: multiple-choice and code problems can't measure "open dialogue quality"—writing, role-play, explaining reasoning, rewriting. These have no single correct answer; you can't assert them. MT-Bench (from Zheng et al. 2023, the MT-Bench / Chatbot Arena paper) uses 80 high-quality two-turn questions across 8 categories (writing, reasoning, math, coding, etc.), specifically stress-testing "can it still keep up on turn two."

The core difficulty follows: how do you auto-grade open answers? MT-Bench's answer is to outsource grading to a strong model—LLM-as-Judge—scoring each answer 1–10 (single-answer grading) or comparing two models' answers head-to-head (pairwise). That leads to today's most important, and most dangerous, mechanism → next card.

Code example
# MT-Bench-style single-answer grading: a strong model scores 1-10 by rubric
import anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")  # placeholder

JUDGE = """You are an impartial judge. Score the answer below 1-10 on
[helpfulness, relevance, accuracy, depth]. Give brief reasoning, then on
a single final line output "Rating: [[score]]".

[Question] {q}
[Turn-1 answer] {a1}
[Follow-up] {q2}
[Turn-2 answer] {a2}"""

msg = client.messages.create(
    model="claude-opus-4-8", max_tokens=512,
    messages=[{"role":"user",
        "content": JUDGE.format(q=q, a1=a1, q2=q2, a2=a2)}])
# Regex out the score from "Rating: [[8]]"—structured output is easy to parse
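The regex step mentioned in the last comment can be sketched as a small parser. It assumes the judge follows the prompt's `Rating: [[score]]` convention. The function name `parse_rating`, and the choice to return `None` for missing or out-of-range scores, are illustrative.

```python
import re
from typing import Optional

# Matches "Rating: [[8]]" or "Rating: [[8.5]]"
RATING_RE = re.compile(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]")

def parse_rating(judge_text: str) -> Optional[float]:
    """Return the last 'Rating: [[x]]' score if it lies in 1-10, else None."""
    matches = RATING_RE.findall(judge_text)
    if not matches:
        return None            # judge ignored the output format
    score = float(matches[-1])  # take the final rating if several appear
    return score if 1 <= score <= 10 else None

print(parse_rating("Clear and accurate, but shallow on turn two.\nRating: [[7]]"))
print(parse_rating("I would give this an eight."))
```

Treating an unparseable reply as `None` instead of guessing keeps format failures visible, so they can be counted or retried separately from low scores.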
Common misconception + use case
Misconception: "MT-Bench scores can be compared in absolute terms across papers." No. Scores depend heavily on which judge model, which version, and what prompt. Swap the judge and the whole score shifts. It's suited for relative comparison under one fixed setup (how much better A is than B), not as an absolute scale across studies.
📌 Super-individual scenario: when evaluating your own writing assistant, fix one judge + one rubric and have it score "old prompt vs new prompt" outputs head-to-head. What you care about is relative progress, not the absolute number.
Takeaway + question
💡 Once the target becomes "open quality," no assertion is available—you can only bring in a judge, and the judge itself needs to be evaluated.
🤔 How do you know "the turn-two answer was good"? Writing the implicit rubric in your head as 5 explicit criteria reveals it's harder than it seems.

LLM-as-Judge

auto eval · bias · Elo
One-line analogy

Using a strong model to score another model's output is like "one service monitoring another service," or an automated code-review bot. But here's the twist: the judge itself has systematic biases—just as your monitoring system can have bugs that produce false alarms or miss real ones. An uncalibrated judge is an untested probe you've chosen to trust.

The problem it solves + how it works

Pain point: human annotation is expensive and slow, and doesn't scale. Zheng et al. (2023) gave LLM-as-Judge its legitimacy: GPT-4 as a judge agrees with human preferences over 80% of the time—the same level of agreement humans have with each other. In other words, a strong judge already approaches the "human ceiling."

But the same paper names three systematic biases you must know:

  • Position bias: in pairwise comparison, it tends to prefer the answer that appears first, regardless of content.
  • Verbosity bias: it tends to score longer answers higher, even when they aren't better.
  • Self-enhancement bias: it tends to prefer answers generated by itself or same-family models.

The most practical fix targets position bias: swap the order and run twice, so each of A/B goes first once. Only a win in both runs counts as a win; otherwise it's a tie. This keeps a judge's slot preference from deciding the verdict, at the cost of doubling the judge calls.

Position bias and the swap fix:
Run 1: Answer A vs Answer B → judge picks the first one
Run 2: Answer B vs Answer A → judge still picks the first one?
→ the two runs contradict each other = the judge is looking at position, not content; scored as a tie

How do you quantify agreement? Use the agreement rate, or more rigorously Cohen's kappa—which subtracts out the "agree by random chance" part, leaving only genuine agreement. Another route is Chatbot Arena: real users vote between two anonymous models' answers, then an Elo / Bradley-Terry model converts massive pairwise win/loss into a single ranking score (the same math as chess ladder ratings).
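The three quantities in this paragraph can each be computed in a few lines. This is a minimal sketch with made-up labels. `cohens_kappa` uses the standard definition κ = (p_o − p_e) / (1 − p_e). `elo_update` is the classic online Elo step with an assumed K-factor of 32, which is not necessarily the fitting procedure a production leaderboard uses.

```python
from collections import Counter

def agreement_rate(a, b):
    """Fraction of items where the two raters gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_e is agreement expected by chance."""
    n = len(a)
    p_o = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined if both raters use a single label

def elo_update(r_winner, r_loser, k=32):
    """One Elo step: the winner gains more when the win was unexpected."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Hypothetical preference labels on ten pairwise comparisons
human = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
judge = ["A", "A", "B", "A", "A", "B", "A", "B", "B", "A"]

print(agreement_rate(human, judge))           # 0.8
print(round(cohens_kappa(human, judge), 3))   # 0.583
print(elo_update(1000, 1000))                 # (1016.0, 984.0)
```

The same 80% raw agreement shrinks to a kappa of about 0.58 once the 52% chance-level agreement is subtracted. That gap is the reason to report kappa alongside the raw rate.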

Code example
# Pairwise judging + position swap to remove position bias
def judge_pair(q, ans_a, ans_b):
    def ask(first, second):
        p = f"Question: {q}\nAnswer 1: {first}\nAnswer 2: {second}\n" \
            "Which is better? Reply only '1' or '2'."
        return client.messages.create(
            model="claude-opus-4-8", max_tokens=5,
            messages=[{"role":"user","content":p}]).content[0].text.strip()

    r1 = ask(ans_a, ans_b)   # A first
    r2 = ask(ans_b, ans_a)   # B first (swapped)
    # A wins only if both runs pick A, else tie—cancels position bias
    if r1 == "1" and r2 == "2": return "A"
    if r1 == "2" and r2 == "1": return "B"
    return "tie"
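Before relying on the swap-and-tie rule, it can help to measure how often the judge's two runs actually contradict each other. The sketch below assumes you have logged the raw replies `r1` and `r2` from `judge_pair` for a batch of comparisons. `position_bias_report` and the sample data are hypothetical.

```python
def position_bias_report(runs):
    """Summarize (r1, r2) pairs: the judge's '1'/'2' reply with A first, then with B first."""
    total = len(runs)
    # Consistent = the same underlying answer won in both orders
    consistent = sum((r1, r2) in {("1", "2"), ("2", "1")} for r1, r2 in runs)
    # Picking slot 1 both times (or slot 2 both times) means position decided the outcome
    always_first = sum(r1 == "1" and r2 == "1" for r1, r2 in runs)
    always_second = sum(r1 == "2" and r2 == "2" for r1, r2 in runs)
    return {
        "consistent": consistent / total,
        "prefers_first_slot": always_first / total,
        "prefers_second_slot": always_second / total,
    }

# Hypothetical logged replies for eight comparisons
runs = [("1", "2"), ("1", "1"), ("2", "1"), ("1", "1"),
        ("1", "2"), ("2", "2"), ("1", "1"), ("1", "2")]
print(position_bias_report(runs))
```

A large gap between `prefers_first_slot` and `prefers_second_slot` points to a systematic slot preference rather than random noise. In that case a high tie rate reflects the judge, not the answers.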
Common misconception + use case
Misconception: "an LLM judge's score = objective truth." No. What it amplifies are the preferences in its training distribution—longer, more polite, more like-its-own-style answers score higher, even when no better in substance. Optimizing a model directly against the judge score as a KPI teaches it to "please the judge" rather than "actually improve" (a form of reward hacking).
📌 Super-individual scenario: using an LLM judge to auto-evaluate your prompt iterations is efficient, but pair it with two safeguards: (1) pairwise comparison + swap to remove position bias; (2) periodic human spot-checks of a small batch to calibrate whether the judge has drifted. The judge must itself be judged.
Takeaway + question
💡 LLM-as-Judge makes evaluation scalable, but it's not a truth detector—it's a probe with known biases that needs continuous calibration.
🤔 Does your monitoring system monitor itself? Same here: when you use AI to evaluate AI, who evaluates the evaluator? Where does this chain bottom out?

Further Reading

Deep Questions

1. Why do nearly all public benchmarks eventually "go stale"? Is this the same thing as "test overfitting" in software testing?
Essentially the same, but more insidious. In software testing you might write an implementation that "only passes a specific test"; an LLM's staleness has two layers: (1) contamination—benchmark questions flow into training data, and the model degrades from "able to solve" to "has memorized," inflating the score without representing ability; (2) overfitting to the distribution—the whole industry optimizes against the same leaderboard, so the model learns "the test-taking pattern of these questions" rather than the underlying capability, like every school teaching the same exam tricks. The result: once a benchmark is widely known, its information content decays over time—which is why MMLU (2021) has been followed by MMLU-Pro, GPQA, and other "harder, newer, more contamination-resistant" benchmarks. The engineering analogy: periodically rotate a private holdout set, just as a security team doesn't publish all its penetration cases. The takeaway for BigCat: when you see a model "crush a leaderboard," the first reaction should be "how much information does this board still carry," not "it must be this strong."
2. Statistically, what's the difference between "sample n then estimate" in pass@k and "just retry k times"? Why does the paper bother?
The difference is estimation variance. Retrying k times gives you a single 0/1 outcome (did these k succeed), very low information per problem, and the noise is large after averaging across problems—especially when a problem's success rate is something like 5%: retrying 5 times you very likely fail every time, misjudging it as "pass@5 = 0," though the true value isn't 0. The paper's approach is to first use a large n (e.g., 100 or 200) to estimate that problem's true success rate c/n accurately, then analytically compute the expected pass@k via the combinatorial formula 1 − C(n−c,k)/C(n,k). This is "buying a more stable estimate with more samples." The statistical intuition: you're not "simulating one k-retry"—you're "estimating the problem's difficulty parameter, then deriving the theoretical pass rate at any k." Analogy to your load testing: rather than manually replaying 5 requests to see a success rate, you precisely measure the single-request success rate p, then analytically derive availability at any retry count—the latter has far higher sample efficiency and a much more stable conclusion.
3. Given LLM-as-Judge has 80% agreement with humans, is the remaining 20% disagreement "the judge being wrong" or "humans being wrong"? Why is there no clean answer?
Because "the correct answer" itself doesn't exist—open-generation quality is a preference, not a fact, and human annotators agree with each other only about 80% of the time (i.e., two human labels also clash on 20% of samples). So in that 20% disagreement, three cases are mixed together and can't be cleanly separated: (a) the judge, influenced by bias (preferring longer answers), judged wrong; (b) annotator attention/preference differences make the "human gold answer" itself unstable; (c) the question genuinely has no consensus right or wrong (e.g., two equally reasonable writing styles). This means 80% isn't "the judge's accuracy" but "the judge's match with an equally noisy reference"—and the reference itself has 20% internal noise. The deeper implication: when evaluating open quality, don't pursue "approaching truth"—pursue "alignment with your target population's preferences." For BigCat: when you use a judge to evaluate "which writing style is better," first be clear about better for whom—the judge should be calibrated toward your real audience, not some abstract "objective quality."
4. Stringing today's four mechanisms together: why is evaluation difficulty such that "the closer an ability is to real value, the harder it is to auto-evaluate"? What does this mean for designing AI workflows?
The four mechanisms line up along an inverse curve of "degree of automatability vs closeness to real value." MMLU is fully auto-gradable (multiple choice) but farthest from real value (only tests knowledge retrieval); HumanEval is auto-gradable (run tests) with higher fidelity (it can actually write correct code); MT-Bench measures open quality, close to real usage, but can no longer be graded by assertion, forcing a judge in; LLM-as-Judge tries to automate "quality judgment" itself, but brings bias and the infinite regress of "who evaluates the evaluator." The rule: abilities you can write as an assert are often not the ones you care most about; the abilities you care most about often can't be written as an assert. Implications for workflow design: (1) in steps with a cheap verifier (code, data validation, formatting), automate aggressively and let AI sample many times and self-filter; (2) in open-quality steps (writing, decisions, cross-disciplinary insight), keep a human in the loop—use an LLM judge for first-pass screening and a human for final review; (3) beware "over-optimizing an ability just because it's easy to measure"—this is the eval-driven local-optimum trap, where you over-invest where it's quantifiable and go blind where it truly matters but is hard to quantify.