An unavoidable question: on what grounds do you say "this model is better than that one"? In software engineering you have a test suite: red/green, unambiguous. But an LLM's output is open-ended text with no single correct answer. Today we dissect four core evaluation mechanisms, and they happen to map onto the testing pyramid you already know: MMLU is a unit assertion, HumanEval is an acceptance test where code only counts if the tests pass, MT-Bench is a multi-turn integration test, and LLM-as-Judge is the automated referee, along with the referee's own bugs.
MMLU is the standardized exam we give models—a multiple-choice question bank across 57 subjects (math, law, medicine, history…). In your world it maps to a regression suite: a fixed, unchanging set of assertions you run against every new model version to see if the score regressed. Each question is one assert.
Pain point: early NLP had one benchmark per task, fragmented like a hundred incompatible test frameworks—nobody could answer "how much world knowledge does this model actually have?" Hendrycks et al. (2021) merged multiple-choice questions from 57 domains into a single score, spanning elementary math to professional law.
The key mechanism isn't free-form answering (you can't auto-grade that)—it's collapsing open generation into a classification problem: give the model the stem plus four options A/B/C/D and see which option it assigns the highest log-likelihood (how probable the model thinks that string is). Highest-probability option = the model's pick; correct = +1. That "dimensionality reduction" is exactly why the benchmark scales with zero human labor.
Honestly, it has a fatal weakness: contamination—the question text leaked into training data, so the model isn't "solving" but "reciting," inflating the score. It's like test cases sneaking into the code under test.
# MMLU scoring is really about comparing option log-likelihoods, not free generation
from datasets import load_dataset

ds = load_dataset("cais/mmlu", "high_school_physics", split="test")
q = ds[0]  # q["question"], q["choices"] = ["option A",...], q["answer"] = correct index
prompt = f"{q['question']}\nA. {q['choices'][0]}\nB. {q['choices'][1]}\n..."

# For each candidate, compute log P(option | prompt); pick the max
def score_choice(prompt, letter):
    # Real frameworks (lm-eval-harness) take that token's logprob
    return model.logprob(prompt + " " + letter)

pred = max("ABCD", key=lambda L: score_choice(prompt, L))
correct = ("ABCD"[q["answer"]] == pred)  # auto-graded, zero human labor
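The contamination concern above can be probed, very roughly, by checking how much of a benchmark question appears verbatim in a candidate training document. The sketch below uses a simple word n-gram overlap heuristic; the function names, the window size n=8, and the example strings are all assumptions made for illustration, not something the benchmark or its authors prescribe.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question, corpus_doc, n=8):
    """Fraction of the question's n-grams that also occur in corpus_doc."""
    q_grams = ngrams(question, n)
    if not q_grams:  # question shorter than n words: nothing to compare
        return 0.0
    return len(q_grams & ngrams(corpus_doc, n)) / len(q_grams)

# Hypothetical example data, invented for this sketch
question = ("A ball is thrown upward with an initial speed of ten meters "
            "per second what is its maximum height")
leaked_doc = "forum dump: " + question + " answer is B"
clean_doc = ("A completely unrelated paragraph about medieval trade routes "
             "and the price of salt in coastal towns.")

leak_score = overlap_ratio(question, leaked_doc)
clean_score = overlap_ratio(question, clean_doc)
print(leak_score, clean_score)
```

A high ratio only flags a question for closer inspection; paraphrased leaks would slip past an exact-match check like this one.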
HumanEval is the model's TDD acceptance test: 164 Python problems, each with a set of hidden unit tests; generated code only counts if the tests pass, regardless of how pretty it looks. And pass@k maps to something you know well, retries against a flaky downstream service: what's the probability of at least one success in k retries?
Pain point: code can't be graded by string matching. Two programs can be character-for-character different yet both correct, or nearly identical yet one right and one wrong. So BLEU and edit distance both fail. The answer from Chen et al. (2021, the Codex paper) is blunt: functional correctness. Run the unit tests; passing = correct.
The mathematically interesting part is pass@k. It answers: "sampling k solutions, what's the probability at least one is correct?" Sampling only k solutions and checking them directly gives a high-variance estimate, and plugging an empirical pass rate into the naive form 1−(1−p)^k is biased when per-problem samples are few. The Codex paper uses an unbiased estimator: per problem, first sample n solutions (n ≥ k, ideally far larger; the paper uses n = 200), count the c correct ones, then compute the following and average it over all problems:
$$\text{pass@}k = 1 - \frac{C(n-c,\,k)}{C(n,\,k)}$$
Unpacked symbol by symbol it's intuitive:
· C(n, k): number of ways to draw k from n solutions (the binomial coefficient).
· C(n−c, k): number of ways to draw k from the n−c wrong solutions—i.e., the count of "all k drawn are wrong."
· Their ratio = probability that all k drawn are wrong; 1 minus it = probability of at least one correct.
Estimate the true success rate with a large n, then compute this combinatorial formula—far more stable than directly retrying k times. In essence: trade more samples for lower estimation variance.
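To make the formula concrete, here is a small worked example using exact integer binomials from Python's standard library. The numbers (n = 10 samples, c = 3 correct) are made up for illustration, and `pass_at_k_exact` is a name chosen for this sketch.

```python
from math import comb

def pass_at_k_exact(n, c, k):
    """1 - C(n-c, k) / C(n, k), computed with exact integer binomials."""
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case
    return 1.0 - comb(n - c, k) / comb(n, k)

n, c = 10, 3                    # hypothetical: 10 samples, 3 pass the tests
p1 = pass_at_k_exact(n, c, 1)   # 1 - C(7,1)/C(10,1) = 1 - 7/10
p5 = pass_at_k_exact(n, c, 5)   # 1 - C(7,5)/C(10,5) = 1 - 21/252
print(round(p1, 4), round(p5, 4))
```

With k = 1 the estimator reduces to c/n, the plain fraction of correct samples, which is a handy sanity check.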
# pass@k unbiased estimator (the standard implementation from the Codex paper)
import numpy as np

def pass_at_k(n, c, k):
    # n = total samples, c = correct count, k = the k to evaluate
    if n - c < k:  # fewer than k wrong → must hit a correct one
        return 1.0
    # product form of C(n-c,k)/C(n,k), avoids large-factorial overflow
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Evaluate one problem: sample n solutions, run hidden unit tests on each
# (model, problem, and run_unit_tests are placeholders for your own harness)
samples = [model.generate(problem["prompt"]) for _ in range(8)]
c = sum(run_unit_tests(s, problem["test"]) for s in samples)
print(pass_at_k(n=8, c=c, k=1))  # prob. of getting it right in one shot
print(pass_at_k(n=8, c=c, k=4))  # prob. of at least one correct in 4
If MMLU/HumanEval are unit tests, MT-Bench is the integration test: it checks whether a model can keep context consistent and not drop the ball on the follow-up across two turns. Like testing a stateful session service: passing the first request doesn't count; you need the session state to still be correct on the next request.
Pain point: multiple-choice and code problems can't measure "open dialogue quality"—writing, role-play, explaining reasoning, rewriting. These have no single correct answer; you can't assert them. MT-Bench (from Zheng et al. 2023, the MT-Bench / Chatbot Arena paper) uses 80 high-quality two-turn questions across 8 categories (writing, reasoning, math, coding, etc.), specifically stress-testing "can it still keep up on turn two."
The core difficulty follows: how do you auto-grade open answers? MT-Bench's answer is to outsource grading to a strong model, LLM-as-Judge, scoring each answer 1–10 (single-answer grading) or comparing two models' answers head-to-head (pairwise). That leads to today's most important, and most dangerous, mechanism.
# MT-Bench-style single-answer grading: a strong model scores 1-10 by rubric
import re
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")  # placeholder

JUDGE = """You are an impartial judge. Score the answer below 1-10 on
[helpfulness, relevance, accuracy, depth]. Give brief reasoning, then on a
single final line output "Rating: [[score]]".
[Question] {q}
[Turn-1 answer] {a1}
[Follow-up] {q2}
[Turn-2 answer] {a2}"""

# q, a1, q2, a2: the two questions and the evaluated model's two answers
msg = client.messages.create(
    model="claude-opus-4-8", max_tokens=512,
    messages=[{"role": "user",
               "content": JUDGE.format(q=q, a1=a1, q2=q2, a2=a2)}])

# Regex out the score from "Rating: [[8]]"; structured output is easy to parse
m = re.search(r"Rating: \[\[(\d+)\]\]", msg.content[0].text)
score = int(m.group(1)) if m else None  # None if the judge ignored the format
Using a strong model to score another model's output is like "one service monitoring another service," or an automated code-review bot. But here's the twist: the judge itself has systematic biases—just as your monitoring system can have bugs that produce false alarms or miss real ones. An uncalibrated judge is an untested probe you've chosen to trust.
Pain point: human annotation is expensive and slow, and doesn't scale. Zheng et al. (2023) gave LLM-as-Judge its legitimacy: GPT-4 as a judge agrees with human preferences over 80% of the time—the same level of agreement humans have with each other. In other words, a strong judge already approaches the "human ceiling."
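But the same paper names three systematic biases you must know: position bias (the judge favors an answer because of the slot it appears in, often the first), verbosity bias (the judge favors longer answers even when they are no better), and self-enhancement bias (the judge favors answers that it generated itself).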
The most practical fix targets position bias: swap the order and run twice—each of A/B gets to go first once, and only a win in both runs counts as a win, otherwise it's a tie. This eliminates "position" as a confound.
How do you quantify agreement? Use the agreement rate, or more rigorously Cohen's kappa—which subtracts out the "agree by random chance" part, leaving only genuine agreement. Another route is Chatbot Arena: real users vote between two anonymous models' answers, then an Elo / Bradley-Terry model converts massive pairwise win/loss into a single ranking score (the same math as chess ladder ratings).
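Both quantities mentioned above fit in a few lines. The sketch below computes Cohen's kappa for two label sequences and applies one online Elo update after a single pairwise vote. The label lists, the starting rating of 1000, and the step size k = 32 are assumptions for illustration; it shows the textbook formulas, not Chatbot Arena's actual pipeline, and it does not fit a Bradley-Terry model.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters pick the same label independently
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:  # both raters always use one identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

def elo_update(r_winner, r_loser, k=32.0):
    """Return the new (winner, loser) ratings after one decided comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# Hypothetical verdicts on eight comparisons
judge = ["A", "A", "B", "B", "A", "tie", "B", "A"]
human = ["A", "B", "B", "B", "A", "tie", "A", "A"]
kappa = cohens_kappa(judge, human)

new_winner, new_loser = elo_update(1000.0, 1000.0)
print(round(kappa, 3), new_winner, new_loser)
```

Here the raw agreement rate is 6/8 = 0.75, but kappa is noticeably lower because some of that agreement would be expected by chance given how often each label is used.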
# Pairwise judging + position swap to remove position bias
def judge_pair(q, ans_a, ans_b):
    def ask(first, second):
        p = f"Question: {q}\nAnswer 1: {first}\nAnswer 2: {second}\n" \
            "Which is better? Reply only '1' or '2'."
        return client.messages.create(
            model="claude-opus-4-8", max_tokens=5,
            messages=[{"role": "user", "content": p}]).content[0].text.strip()

    r1 = ask(ans_a, ans_b)  # A first
    r2 = ask(ans_b, ans_a)  # B first (swapped)
    # A wins only if both runs pick A, else tie; cancels position bias
    if r1 == "1" and r2 == "2":
        return "A"
    if r1 == "2" and r2 == "1":
        return "B"
    return "tie"
The pass@k estimator, 1 − C(n−c, k)/C(n, k), is "buying a more stable estimate with more samples." The statistical intuition: you're not "simulating one k-retry"; you're "estimating the problem's difficulty parameter, then deriving the theoretical pass rate at any k." Analogy to your load testing: rather than manually replaying 5 requests to see a success rate, you precisely measure the single-request success rate p, then analytically derive availability at any retry count. The latter has far higher sample efficiency and a much more stable conclusion.
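The "unbiased" claim can be checked without any simulation. Assuming each sample is independently correct with probability p, the correct count c follows a Binomial(n, p) distribution, so the expected value of an estimator is a finite sum. The sketch below compares the combinatorial estimator with the naive plug-in 1 − (1 − c/n)^k; the function names and the values n = 20, k = 5, p = 0.1 are picked for illustration.

```python
from math import comb

def unbiased(n, c, k):
    """Combinatorial estimator: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def plug_in(n, c, k):
    """Naive estimator: plug the empirical pass rate c/n into 1 - (1-p)^k."""
    return 1.0 - (1.0 - c / n) ** k

def expectation(estimator, n, k, p):
    """Exact E[estimator(n, c, k)] when c ~ Binomial(n, p)."""
    return sum(comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
               for c in range(n + 1))

n, k, p = 20, 5, 0.1             # hypothetical settings
truth = 1.0 - (1.0 - p) ** k     # the quantity both estimators target
mean_unbiased = expectation(unbiased, n, k, p)
mean_plug_in = expectation(plug_in, n, k, p)
print(truth, mean_unbiased, mean_plug_in)
```

Under these assumptions the combinatorial estimator's expectation matches the true value up to floating-point error, while the plug-in version systematically underestimates it.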