Evaluation and Benchmarks
How to measure what LLMs can actually do — and what the leaderboards won't tell you
Seven articles in, you have built the model, served the model, and prompted the model. Now comes the question every production team trips on the same way: is any of this actually working?
The reflex is to grab a benchmark. MMLU. SWE-bench. Chatbot Arena Elo. Pick a number, compare it to the last release, decide. That is how every model launch post reads — a table of three-letter acronyms with bigger numbers next to the newer model.
The problem is that the numbers are lying to you. Not maliciously — structurally. Benchmarks saturate. Test data leaks into training data. The leaderboard you cited last quarter has been deprecated by its own authors. And none of the benchmarks measure the thing your users actually care about, which is whether the model is useful on your inputs, in your product, against your quality bar.
Evaluation is the part of the LLM stack engineers underestimate most. It is harder than training, more thankless than serving, and more important than prompting. This article covers the current frontier benchmark set as of May 2026, the mechanics and biases of LLM-as-judge, the production frameworks people are actually using, and the only evaluation that matters in the end — the one you build for your own workload.
01
Why LLM Evaluation Is Hard
Software testing has a simple contract. You write an input, you assert an output, the test passes or fails. The contract is binary, deterministic, and cheap.
LLMs break every part of that contract. Outputs are sampled from a distribution, not computed. The same prompt produces different completions on different runs. “Correct” is multidimensional — a response can be accurate but verbose, helpful but evasive, factual but stylistically wrong. There is no assertEqual for “this is a good answer.”
The deeper problem is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure (Strathern, 1997, generalising Goodhart, 1975). Every benchmark that gains traction becomes a training target. Once it is a training target, scores climb — not because models got better at the underlying capability, but because they got better at the benchmark. The signal degrades. By the time the benchmark hits 90%, it has stopped discriminating between models.
Frontier capability shape — snapshot May 2026
Values normalised 0–100 from public benchmark scores. The shapes matter, not the precise numbers — no single model wins on every axis.
The right question is not “which model is best” but “which model is best for this workload, against my constraints, with my failure tolerance.” That answer never lives on a leaderboard.
02
The Frontier Benchmark Set
The benchmarks the frontier labs actually quote in May 2026 are not the ones you learned about two years ago. MMLU, HumanEval, and GSM8K — the staples of the 2022–2024 launch posts — have all saturated. Top models score in the 90s on each. The score range has compressed below evaluation noise, and the benchmarks no longer rank frontier work in any meaningful way (Stanford AI Index 2025).
What replaced them is a frontier set that is itself saturating fast.
Frontier benchmark snapshot — as of May 2026
Scores cited from Artificial Analysis, ARC Prize, Epoch AI, Scale AI, and OpenAI deprecation notices. The list ages fast — the saturation bar on the right is the part you should read carefully.
| Benchmark | What it measures | Top score | Top model | Status |
|---|---|---|---|---|
| FrontierMath | Research-level mathematics | 52.4% | GPT-5.5 Pro | |
| SWE-bench Pro | Real-world software engineering | mid-40s | Frontier set | |
| ARC-AGI-3 | Novel visual reasoning (post-reset) | <1% | Frontier set | |
| GPQA Diamond | Graduate physics, chemistry, biology | 94.2% | Claude Opus 4.7 | |
| AIME 2025 | Olympiad-style math | 96% | Kimi K2.5 / GLM-4.7 / GPT-5.2 | |
| ARC-AGI-2 | Fluid reasoning on visual puzzles | 85% | GPT-5.5 | |
| MMLU-Pro | Multitask academic knowledge (10-option) | 89.8% | Gemini 3 Pro | |
| Chatbot Arena (LMArena) | Human preference (Bradley-Terry) | ~1500 Elo | Claude Opus 4.6 | |
| SWE-bench Verified | Curated 500 GitHub issues | n/a | Deprecated Feb 2026 | |
| MMLU | Multitask academic knowledge | >90% | All frontier models | |
| HumanEval | Function-completion code | >95% | All frontier models | |
| GSM8K | Grade-school math | >95% | All frontier models | |
SWE-bench Verified measures real-world software engineering: 500 GitHub issues from 12 Python repositories, manually screened by a pool of 93 contracted developers (OpenAI, August 2024). It dominated coding evaluation through 2025. Then, in February 2026, OpenAI deprecated it: frontier models could reproduce verbatim gold patches on certain instances, which is a contamination signature, not a capability signal (OpenAI, February 2026). SWE-bench Pro, with a standardised scaffold and multi-language coverage, is the successor; the current frontier sits in the mid-40s, well below saturation.
GPQA Diamond is the hardest 198-question subset of GPQA (Graduate-Level Google-Proof Q&A), which covers physics, chemistry, and biology. PhDs in the question’s own field score about 65%; non-experts with full web access manage 34%. The top frontier models now sit above 94%: Claude Opus 4.7 at 94.2% and Gemini 3.1 Pro Preview at 94.1%, with GPT-5.4 at 92.0%, as of May 2026 (Artificial Analysis). The gap between frontier models and domain PhDs has widened from +7 points in late 2024 to +24 points in early 2026. Saturation is imminent.
AIME 2025 uses the 30 problems from the 2025 American Invitational Mathematics Examination. The top models cluster between 95% and 96% — Kimi K2.5, GLM-4.7, GPT-5.2 (xhigh) all within half a point. When the top three are tied to within rounding error, the benchmark is no longer ranking them; it is measuring noise.
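To see why a one-point gap on 30 problems is noise, it helps to put a confidence interval around each score. The sketch below is an illustration, not how any leaderboard reports results: it assumes a single pass over 30 independent problems and uses hypothetical counts (29 and 28 correct). Published scores are often averaged over several runs, which narrows the interval but does not remove the problem.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = correct / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical single-run results on a 30-problem set: 29 vs 28 correct.
model_a = wilson_interval(29, 30)
model_b = wilson_interval(28, 30)
print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("indistinguishable at this sample size:", intervals_overlap(model_a, model_b))
```

With so few items, each interval spans well over ten percentage points, so a ranking based on a single problem's difference carries very little information.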
FrontierMath is the exemplar of “still has headroom.” Epoch AI commissioned 350 original research-level math problems from mathematicians including IMO gold medalists and Fields medalists. Each problem requires hours to days of work from a domain researcher. The top score as of April 2026 is GPT-5.5 Pro at 52.4%. At that level, the benchmark still discriminates — and likely will for a year or two.
ARC-AGI-2 tests fluid reasoning on novel visual puzzles. Frontier models broke through the 85% grand prize threshold in early 2026 — GPT-5.5 at 85%, GPT-5.4 Pro at 83.3%, Gemini 3.1 Pro at 77.1% (ARC Prize). The Prize team responded by releasing ARC-AGI-3, on which the same frontier models score below 1%. The frontier of generalisation got reset.
Chatbot Arena — rebranded LMArena in January 2026 — is the only major eval driven by human preference rather than ground-truth labels. Users pick between two anonymised responses; the system fits a Bradley-Terry rating from the pairwise comparisons (Chiang et al., ICML 2024). The catch: a 2025 Cohere/Princeton analysis found that Meta, OpenAI, Google, and Amazon had been submitting many private model variants and publishing only the highest-scoring — a selection effect worth up to 100 Elo points of inflation. Crowdsourced preference is real signal, but it has the same Goodhart problem as every other ranking once labs optimise for it.
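For readers who have not met Bradley-Terry before, the following is a minimal sketch of fitting it from pairwise outcomes with the classic minorise-maximise update. It is not LMArena's pipeline: the battle data is made up, ties are ignored, and the `to_elo_scale` helper with its 400 * log10 scaling and 1000-point anchor is an assumption chosen only so that gaps read like Elo differences.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles: list[tuple[str, str]], iterations: int = 200) -> dict[str, float]:
    """Fit strengths from (winner, loser) pairs; strengths are normalised to sum to 1.

    Assumes every model has at least one win, otherwise its strength collapses to 0.
    """
    wins: dict[str, float] = defaultdict(float)
    pair_counts: dict[frozenset, float] = defaultdict(float)
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
    models = sorted({m for battle in battles for m in battle})
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over opponents of (games played together) / (combined strength).
            denom = sum(
                pair_counts[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength: dict[str, float], anchor: float = 1000.0) -> dict[str, float]:
    """Assumed convention: centre the ratings on `anchor`, 400 points per factor of 10."""
    mean_log = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {m: anchor + 400 * (math.log10(s) - mean_log) for m, s in strength.items()}


# Made-up battles between three hypothetical models.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(battles))
for model, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(model, round(rating))
```

The selection effect described above falls straight out of this model: if a lab fits ratings for many private variants and publishes only the maximum, the published number is the top of a noisy sample rather than an unbiased estimate.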
The pattern is unmistakable. Every benchmark that gains industry traction enters a saturation curve. Two-year-old benchmarks are historical context. Six-month-old benchmarks are the ones being optimised for right now. Anything in production should treat benchmark scores as a coarse capability filter — not a substitute for evaluation on your actual workload.
03
Contamination, Saturation, and Goodhart in Practice
Saturation has two causes. The benign one is genuine capability progress — models really do get better, and a fixed test eventually maxes out. The malignant one is contamination: test data leaks into training data, and the model is no longer being evaluated, it is being asked to recite.
Both are happening simultaneously, and they are hard to separate.
The contamination evidence is concrete. A 2023 analysis of GSM8K found that removing examples whose exact wording appeared in common training corpora dropped some models’ accuracy by up to 13 percentage points (GSM8K-Platinum, 2025). OpenAI’s stated reason for deprecating SWE-bench Verified was that frontier models could reproduce verbatim gold patches — exact memorisation, not generalisation. A survey from EMNLP 2025 catalogues contamination as the central methodological challenge in the field, with neither static decontamination nor dynamic benchmark regeneration providing a clean fix.
The defences that actually work are structural, not procedural:
- Timestamp-gated benchmarks. LiveCodeBench tags every problem with its publication date. You evaluate a model only on problems published after the model’s training cutoff. New problems flow in monthly; old ones become reference but not signal.
- Held-out evaluation. Build private eval sets that never leave your infrastructure. Public benchmarks tell you the floor; private benchmarks tell you the truth.
- Adversarial perturbation. Rewrite known benchmark questions just enough that memorisation fails but the underlying capability still applies. The performance gap between original and perturbed versions is a contamination proxy.
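Two of these defences reduce to very little code. The sketch below shows a timestamp gate and a perturbation gap under stated assumptions: the `Problem` record, its field names, and the sample data are hypothetical, and this is not LiveCodeBench's actual API or schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    published: date


def post_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


def contamination_gap(original_correct: list[bool], perturbed_correct: list[bool]) -> float:
    """Accuracy on original items minus accuracy on their perturbed rewrites."""
    if len(original_correct) != len(perturbed_correct) or not original_correct:
        raise ValueError("need paired, non-empty result lists")
    original_acc = sum(original_correct) / len(original_correct)
    perturbed_acc = sum(perturbed_correct) / len(perturbed_correct)
    return original_acc - perturbed_acc


# Hypothetical problem pool and a hypothetical training cutoff.
pool = [
    Problem("p1", date(2025, 3, 1)),
    Problem("p2", date(2025, 9, 15)),
    Problem("p3", date(2026, 1, 10)),
]
eligible = post_cutoff(pool, training_cutoff=date(2025, 6, 30))
print([p.problem_id for p in eligible])

# Hypothetical paired results: same questions, original wording vs perturbed wording.
gap = contamination_gap([True, True, True, True], [True, False, True, False])
print(f"original minus perturbed accuracy: {gap:.2f}")
```

A gap near zero suggests the capability transfers to reworded items; a large positive gap is a reason to suspect memorisation, though it is a proxy and can also reflect perturbations that made the questions harder.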
The deeper lesson is Goodhart. The instant a benchmark becomes the metric a model launch is judged on, it becomes a training target — and a training target is no longer a measurement. You cannot escape this dynamic by inventing a harder benchmark. You can only outrun it — which is what the frontier set has been doing every six months for two years.
04
LLM-as-Judge
If human evaluation does not scale and rigid metrics like BLEU miss the point of generative text, the obvious move is to use one LLM to grade another. The technique is now standard across production eval stacks. The foundational result is from Zheng et al., NeurIPS 2023: GPT-4 acting as judge agrees with human preferences over 80% of the time — comparable to the agreement rate between two human annotators on the same task.
That number is what made LLM-as-judge production-viable. It scales, it is consistent, it is two orders of magnitude cheaper than crowdsourced annotation, and — crucially — a strong judge model can grade tasks that are too specialised or too long-form for fast human review.
It also has known failure modes that you must design around.
LLM-as-judge — protocol and known biases
Candidate response
“Atlas uses Snappy compression by default and also supports zstd, zlib, and LZ4.”
Rubric (judge prompt)
- Extract every factual claim.
- Verdict each: supported / contradicted / absent.
- Score = supported / total.
Judge output
- supported · Snappy default
- supported · zstd, zlib
- absent · LZ4
score = 0.67
Three biases to design around
Position bias: present each pair in both orderings; average the result.
Verbosity bias: add length constraints to the rubric or normalise by length.
Self-preference: use a judge from a different model family than the candidate.
The three biases
Position bias. When given two responses to compare, judges prefer the response that appears first (or sometimes second — it varies by judge family). The effect is strongly modulated by the quality gap: when the two responses are close, position bias dominates (Shi et al., IJCNLP 2025). Mitigation: present each pair in both orderings and average the results.
Verbosity bias. Judges prefer longer responses, even when the additional length adds no informational content. This is an artifact of generative pre-training and RLHF: the same training process that makes models hedge and pad in their own outputs also biases them to reward those patterns in others (Justice or Prejudice?, 2024). Mitigation: add length constraints to the rubric or normalise scores by length.
Self-preference. Judges prefer outputs that look like their own. Wataoka et al. (2024) traced the mechanism: judges assign higher scores to outputs with lower perplexity, regardless of authorship. A model’s own outputs naturally have low perplexity under its own distribution — so it preferentially rewards them. Mitigation: use a judge from a different model family than the candidate.
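The position-bias mitigation is the easiest of the three to mechanise. The following sketch assumes a hypothetical judge callable that returns the probability that the first-listed response is better; the `Judge` type, the stub, and its 0.2 first-slot bonus are illustrative stand-ins, not any vendor's API.

```python
from typing import Callable

# Assumed interface: (question, first_response, second_response) -> P(first is better).
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> dict:
    """Query the judge in both orderings and average the preference for `a`."""
    forward = judge(question, a, b)          # a shown first
    backward = 1.0 - judge(question, b, a)   # b shown first, flipped into a preference for a
    return {
        "preference_for_a": (forward + backward) / 2,
        # True when the winner changes with the ordering, a sign that position dominates.
        "order_flipped_verdict": (forward > 0.5) != (backward > 0.5),
    }


def biased_stub_judge(question: str, first: str, second: str) -> float:
    """Stand-in judge: treats the answers as equally good but adds 0.2 for the first slot."""
    return 0.5 + 0.2


result = debiased_preference(biased_stub_judge, "What is 2 + 2?", "4", "Four")
print(result)
```

Averaging cancels a constant first-slot bonus, and the flipped-verdict flag marks the close pairs where the cited work says position bias dominates; those are good candidates for the human review tier.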
A worked rubric: RAG faithfulness
Let’s make this concrete with the eval most production teams encounter first — whether a RAG system is staying grounded in its retrieved context.
The metric. Faithfulness measures how factually consistent a generated answer is with the documents the retriever returned. It catches a specific failure — the model adds claims that are not supported by the context. The Ragas formulation: faithfulness = (claims supported by the context) / (total claims in the response). Range 0 to 1. Higher is better (Ragas docs).
Judge prompt — RAG faithfulness rubric
You are evaluating a RAG system's faithfulness to its retrieved context.

GIVEN:
- A user question
- The context chunks the retriever returned
- The generated answer

TASK:
1. Extract every factual claim made in the answer. A claim is any
   statement that asserts something is true.
2. For each claim, determine whether it is supported by, contradicted
   by, or absent from the provided context.
3. A claim is supported only if the context directly entails it.
   Reasonable inference is not support. World knowledge is not support.
4. Compute: faithfulness = supported / total_claims.

OUTPUT FORMAT (strict JSON):
{
  "claims": [
    { "text": "<claim>", "verdict": "supported|contradicted|absent",
      "evidence": "<exact quote from context, or null>" }
  ],
  "score": <float between 0 and 1>
}

Why this rubric works. It atomises the judgment. Instead of asking the judge “is this answer faithful, on a scale of 1–5”, which invites verbosity bias and rubric drift, it asks “list the claims, check each one, return a ratio.” The output is structured. The score is computed, not chosen. The evidence quote is mandatory, which gives you an audit trail when you need to debug a low score.
A concrete example. Question: which compression algorithm does MongoDB Atlas use for collection data? Retrieved context says: “Atlas uses Snappy by default; you can switch to zstd or zlib at collection creation time.” The generated answer claims Atlas supports Snappy, zstd, zlib, and LZ4. The judge extracts three claims: (1) Snappy default — supported; (2) zstd and zlib — supported; (3) LZ4 — absent from context (and incidentally false). Faithfulness = 2/3 = 0.67. The audit trail tells you exactly which claim was unsupported, so you know whether to fix the retriever (missing context) or the prompt (model hallucinated).
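One way to make "computed, not chosen" literal is to ignore the judge's own `score` field and recompute the ratio from the per-claim verdicts. The sketch below does that for the example above; the function name and the validation rules are assumptions for illustration, and the JSON is a hand-written stand-in for real judge output.

```python
import json

ALLOWED_VERDICTS = {"supported", "contradicted", "absent"}


def faithfulness_from_judge(raw_json: str) -> float:
    """Validate the judge's claim list and recompute supported / total ourselves."""
    claims = json.loads(raw_json)["claims"]
    if not claims:
        raise ValueError("judge returned no claims")
    for claim in claims:
        if claim["verdict"] not in ALLOWED_VERDICTS:
            raise ValueError(f"unknown verdict: {claim['verdict']!r}")
        # The rubric makes the evidence quote mandatory for supported claims.
        if claim["verdict"] == "supported" and not claim.get("evidence"):
            raise ValueError(f"supported claim without evidence: {claim['text']!r}")
    supported = sum(claim["verdict"] == "supported" for claim in claims)
    return supported / len(claims)


# Hand-written stand-in for the judge's response to the Atlas example.
judge_output = json.dumps({
    "claims": [
        {"text": "Snappy is the default", "verdict": "supported",
         "evidence": "Atlas uses Snappy by default"},
        {"text": "zstd and zlib are supported", "verdict": "supported",
         "evidence": "you can switch to zstd or zlib at collection creation time"},
        {"text": "LZ4 is supported", "verdict": "absent", "evidence": None},
    ],
    "score": 0.67,
})
score = faithfulness_from_judge(judge_output)
print(round(score, 2))
```

Recomputing the ratio also catches a judge whose stated score disagrees with its own verdicts, which is worth logging as a judge-quality signal.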
That structure — atomic claims, per-claim verdicts, computed score — is what separates an evaluation from a vibe check. Apply the same pattern to other rubrics: code correctness becomes “list the test cases, check each one.” Answer relevance becomes “list the question’s sub-asks, check coverage of each.” The more atomic the judgment, the less room for bias.
When not to use LLM-as-judge
It is not a universal hammer. The judge needs the capability to grade the task. A judge weaker than the candidate is unreliable, especially on tasks the judge cannot itself solve (advanced math, complex multi-step reasoning, domain-specific code review). For these, you need a stronger judge model, ground-truth labels, or human review on a sampled tier.
LLM-as-judge also costs real money at scale. A faithfulness eval over 10K traces with a GPT-5-class judge is non-trivial in dollar terms. The usual production setup: cheap automatic checks (schema validity, exact match, regex) catch the obvious failures, LLM-as-judge handles the nuanced metrics, and a sampled human review tier calibrates the judge itself.
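The tiering described above can be sketched as a short-circuiting pipeline: deterministic checks run first, and the paid judge call only happens when they pass. Everything here is a hypothetical illustration, including the required `answer` key, the 0.8 threshold, the 5% review rate, and the stub judge.

```python
import hashlib
import json
from typing import Callable, Optional


def deterministic_checks(output: str, required_keys: tuple[str, ...]) -> Optional[str]:
    """Return a failure reason, or None when the output passes the cheap checks."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(parsed, dict):
        return "not_an_object"
    missing = [key for key in required_keys if key not in parsed]
    return f"missing_keys:{','.join(missing)}" if missing else None


def grade(output: str, judge: Callable[[str], float],
          required_keys: tuple[str, ...] = ("answer",),
          judge_threshold: float = 0.8) -> dict:
    """Tier 1 is free; tier 2 (the judge) is only called when tier 1 passes."""
    failure = deterministic_checks(output, required_keys)
    if failure is not None:
        return {"passed": False, "tier": "deterministic", "reason": failure}
    score = judge(output)
    return {"passed": score >= judge_threshold, "tier": "judge", "score": score}


def needs_human_review(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministic sampling: hash the trace id into [0, 1) and compare to the rate."""
    bucket = int.from_bytes(hashlib.sha256(trace_id.encode()).digest()[:4], "big") / 2**32
    return bucket < rate


def stub_judge(output: str) -> float:
    """Stand-in for a real judge call; always returns the same score."""
    return 0.9


print(grade("not json at all", stub_judge))
print(grade('{"answer": "Snappy by default"}', stub_judge))
print(needs_human_review("trace-0001"))
```

Hashing the trace id rather than calling a random number generator keeps the human sample stable across reruns, so the same traces are reviewed when the eval is repeated.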
05
Building Your Own Evaluation
Generic benchmarks tell you the model has the capability in general. They do not tell you the model performs your task on your inputs. A model that scores 94% on GPQA Diamond may still fail to write a correct MongoDB aggregation against your specific schema. The only evaluation that answers that is the one you build.
The playbook is simpler than most teams think, but it requires discipline.
The eval flywheel — six steps that close on themselves
| Step | Action | What happens | Needs |
|---|---|---|---|
| 01 | Collect production traces | Sample 50–200 real user inputs from the application. | Logged traces · sampling policy |
| 02 | Annotate ideal outputs | A domain expert writes the correct answer for each input. | Annotation guidelines · gold labels |
| 03 | Layer the graders | Automatic checks → LLM-as-judge → sampled human review. | Schema validators · judge prompt · review rota |
| 04 | Version everything | Prompt, model, retrieval config, eval set, judge model. | Git refs · config registry |
| 05 | Wire into CI | Eval runs on every prompt change, retrieval change, model swap. | CI hooks · pass/fail thresholds |
| 06 | Iterate (flywheel) | Every production failure becomes a new eval case. | Feedback loop · monotonic growth |
Every failure flows back into step 01 — the eval set grows monotonically.
Start from production traces, not synthetic prompts. Sample 50 to 200 real user inputs from your application. Synthetic prompts written by your team will not match the distribution your users actually generate — they will be cleaner, better-formed, and miss the long tail of real-world phrasing.
Annotate ideal outputs. Have a domain expert write what the correct answer looks like for each input. This is the only step that cannot be automated. The eval set is only as good as these annotations — shortcuts here poison every downstream metric.
Layer the graders. Three tiers, applied in order: automatic deterministic checks (exact match, JSON schema validity, regex, latency, token count, cost) catch the largest class of failures cheaply; LLM-as-judge handles faithfulness, answer relevance, tone, rubric scoring; a rotating 5–10% human sample calibrates the LLM judge and surfaces failures both other graders miss.
Version everything. Prompt version, model version, retrieval config, eval set version, judge model and judge prompt. When a metric regresses, you need to know what changed.
Wire it into CI. Eval runs on every prompt change, every retrieval config change, every model swap. If it does not run automatically, it does not run. The cost of running a 200-example eval suite per change is dollars; the cost of shipping a regression to production is days of debugging plus user trust.
Iterate via the flywheel. Every production failure — bug report, support ticket, thumbs-down — becomes a new eval case. The eval set grows monotonically. Over time, it becomes a high-fidelity map of every failure mode your system has ever hit, and the regression suite catches them all before they ship again.
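The CI step comes down to a gate that compares the current run against fixed floors and against the previous baseline. The following is a minimal sketch under assumptions: the metric names, floor values, and the 0.02 regression tolerance are hypothetical, and a real pipeline would load both score sets from versioned eval runs.

```python
def gate(current: dict[str, float], baseline: dict[str, float],
         floors: dict[str, float], max_regression: float = 0.02) -> list[str]:
    """Return human-readable failures; an empty list means the change may ship."""
    failures = []
    for metric, floor in floors.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
            continue
        if value < floor:
            failures.append(f"{metric}: {value:.3f} is below the floor of {floor:.3f}")
        previous = baseline.get(metric)
        if previous is not None and previous - value > max_regression:
            failures.append(f"{metric}: regressed from {previous:.3f} to {value:.3f}")
    return failures


# Hypothetical scores from the previous release and from the candidate change.
baseline_scores = {"faithfulness": 0.91, "schema_valid": 1.00}
current_scores = {"faithfulness": 0.88, "schema_valid": 1.00}
floors = {"faithfulness": 0.85, "schema_valid": 0.99}

failures = gate(current_scores, baseline_scores, floors)
for line in failures:
    print("FAIL", line)
# In a CI job, a non-empty list would translate into a non-zero exit code.
```

Checking both a floor and a delta matters: the floor stops a slow slide across many small changes, while the delta catches a single change that hurts quality but still clears the absolute bar.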
The eval set is your single most valuable artifact. It captures what “good” means for your product in a form that survives model upgrades, prompt rewrites, and team changes. Treat it the same way you treat your production code — version it, review it, protect it.
A brief word on safety
Safety evaluation is the same discipline applied to adversarial inputs and undesired outputs. The pattern does not change — you collect inputs (jailbreak prompts, prompt injection attempts, harmful requests), define failure modes, and run automated checks in CI. What is different is the input distribution. Adversarial inputs are not in your production traces; you have to either curate them yourself, use a published red-team suite, or generate them programmatically. The UK AI Security Institute’s Inspect framework is the most credible open-source toolkit for this — it includes a sandboxing layer for agent tool calls, which matters once your system can execute code or hit APIs. The recurring finding across published red-team studies is that role-play jailbreaks (fictional-character framing, hypothetical scenarios) succeed against frontier models more often than direct attacks. Treat safety the way you treat any other failure mode: a defined set of inputs, a defined set of bad outputs, automated checks on every change. The moment safety eval lives in a Word document instead of CI, it is decoration.
06
The Production Eval Framework Landscape
You do not have to build the eval infrastructure from scratch. The ecosystem matured significantly through 2024 and 2025. The frameworks split along three rough axes: open vs commercial, academic-benchmark-focused vs application-focused, and standalone vs platform.
Production eval framework landscape — snapshot May 2026
The ecosystem ages faster than this article does. Treat the row order as a starting catalogue, not a ranking.
| Framework | Owner / Licence | Best for | Distinct strength |
|---|---|---|---|
| EleutherAI | MIT · Academic | Reproducible academic benchmarks | Powers the HF Open LLM Leaderboard |
| Braintrust | Proprietary · Platform | End-to-end eval platform | Prompts + traces + regression diagnosis |
The recurring production pattern: a lightweight framework for CI/CD gating (Promptfoo, or Ragas for RAG) paired with an observability platform for traces and human annotation (Braintrust or LangSmith), plus Inspect for safety and agent work. There is no single tool that does all three jobs well, and trying to force it usually means missing one of them entirely.
Evaluation tells you whether your single-model system is working. But the frontier of LLM deployment has moved past single-turn question answering — the models are taking actions. They call tools, query databases, write and execute code, navigate the web, and chain dozens of steps to achieve a goal. Each of those steps multiplies the surface area of what can go wrong, and the same eval discipline you just learned has to extend across action sequences, tool calls, and multi-step trajectories.
That world is the subject of Article 9: Agents, Tool Use, and the Agentic Future.