Evaluation and Benchmarks

How to measure what LLMs can actually do — and what the leaderboards won't tell you

Seven articles in, you have built the model, served the model, and prompted the model. Now comes the question every production team trips on the same way: is any of this actually working?

The reflex is to grab a benchmark. MMLU. SWE-bench. Chatbot Arena Elo. Pick a number, compare it to the last release, decide. That is how every model launch post reads — a table of three-letter acronyms with bigger numbers next to the newer model.

The problem is that the numbers are lying to you. Not maliciously — structurally. Benchmarks saturate. Test data leaks into training data. The leaderboard you cited last quarter has been deprecated by its own authors. And none of the benchmarks measure the thing your users actually care about, which is whether the model is useful on your inputs, in your product, against your quality bar.

Evaluation is the part of the LLM stack engineers underestimate most. It is harder than training, more thankless than serving, and more important than prompting. This article covers the current frontier benchmark set as of May 2026, the mechanics and biases of LLM-as-judge, the production frameworks people are actually using, and the only evaluation that matters in the end — the one you build for your own workload.

Why LLM Evaluation Is Hard

Software testing has a simple contract. You write an input, you assert an output, the test passes or fails. The contract is binary, deterministic, and cheap.

LLMs break every part of that contract. Outputs are sampled from a distribution, not computed. The same prompt produces different completions on different runs. “Correct” is multidimensional — a response can be accurate but verbose, helpful but evasive, factual but stylistically wrong. There is no assertEqual for “this is a good answer.”

The deeper problem is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure (Strathern, 1997, generalising Goodhart, 1975). Every benchmark that gains traction becomes a training target. Once it is a training target, scores climb — not because models got better at the underlying capability, but because they got better at the benchmark. The signal degrades. By the time the benchmark hits 90%, it has stopped discriminating between models.

Figure: Frontier capability shape, snapshot May 2026. Chart comparing Claude Opus 4.7, GPT-5.4, and Gemini 3.1 Pro. Values normalised 0–100 from public benchmark scores. The shapes matter, not the precise numbers: no single model wins on every axis. Sources: GPQA Diamond, AIME 2025, and SWE-bench Pro from Artificial Analysis; ARC-AGI-2 from ARC Prize, May 2026.

The right question is not “which model is best” but “which model is best for this workload, against my constraints, with my failure tolerance.” That answer never lives on a leaderboard.

The Frontier Benchmark Set

The benchmarks the frontier labs actually quote in May 2026 are not the ones you learned about two years ago. MMLU, HumanEval, and GSM8K — the staples of the 2022–2024 launch posts — have all saturated. Top models score in the 90s on each. The score range has compressed below evaluation noise, and the benchmarks no longer rank frontier work in any meaningful way (Stanford AI Index 2025).

What replaced them is a frontier set that is itself saturating fast.

Frontier benchmark snapshot — as of May 2026

Scores cited from Artificial Analysis, ARC Prize, Epoch AI, Scale AI, and OpenAI deprecation notices. The list ages fast — the saturation bar on the right is the part you should read carefully.

| Benchmark | What it measures | Top score | Top model | Status |
|---|---|---|---|---|
| FrontierMath | Research-level mathematics | 52.4% | GPT-5.5 Pro | Active |
| SWE-bench Pro | Real-world software engineering | mid-40s | Frontier set | Active |
| ARC-AGI-3 | Novel visual reasoning (post-reset) | <1% | Frontier set | Active |
| GPQA Diamond | Graduate physics, chemistry, biology | 94.2% | Claude Opus 4.7 | Approaching ceiling |
| AIME 2025 | Olympiad-style math | 96% | Kimi K2.5 / GLM-4.7 / GPT-5.2 | Approaching ceiling |
| ARC-AGI-2 | Fluid reasoning on visual puzzles | 85% | GPT-5.5 | Approaching ceiling |
| MMLU-Pro | Multitask academic knowledge (10-option) | 89.8% | Gemini 3 Pro | Approaching ceiling |
| Chatbot Arena (LMArena) | Human preference (Bradley-Terry) | ~1500 Elo | Claude Opus 4.6 | Approaching ceiling |
| SWE-bench Verified | Curated 500 GitHub issues | n/a | Deprecated Feb 2026 | Deprecated |
| MMLU | Multitask academic knowledge | >90% | All frontier models | Saturated |
| HumanEval | Function-completion code | >95% | All frontier models | Saturated |
| GSM8K | Grade-school math | >95% | All frontier models | Saturated |

SWE-bench Verified measures real-world software engineering: 500 GitHub issues from 12 Python repositories, manually screened by a pool of 93 contracted developers (OpenAI, August 2024). It dominated coding evaluation through 2025. Then, in February 2026, OpenAI deprecated it: frontier models could reproduce verbatim gold patches on certain instances, which is a contamination signature, not a capability signal (OpenAI, February 2026). SWE-bench Pro, with a standardised scaffold and multi-language coverage, is the successor; current frontier scores sit in the mid-40s, well below saturation.

GPQA Diamond is the hardest 198 questions of GPQA, the Graduate-Level Google-Proof Q&A benchmark covering physics, chemistry, and biology. PhDs in the question’s own field score about 65%; non-experts with full web access manage 34%. The top frontier models now sit above 94%: Claude Opus 4.7 at 94.2% and Gemini 3.1 Pro Preview at 94.1%, with GPT-5.4 at 92.0%, as of May 2026 (Artificial Analysis). The gap between frontier models and domain PhDs has widened from +7 points in late 2024 to +24 points in early 2026. Saturation is imminent.

AIME 2025 uses the 30 problems from the 2025 American Invitational Mathematics Examination. The top models cluster between 95% and 96% — Kimi K2.5, GLM-4.7, GPT-5.2 (xhigh) all within half a point. When the top three are tied to within rounding error, the benchmark is no longer ranking them; it is measuring noise.
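A rough sketch of the sampling error shows why a half-point gap on a 30-problem test cannot rank models. It assumes a single pass over the problems and treats each problem as an independent pass/fail trial; that is a simplification, since published scores are often averaged over several runs, which narrows the interval. The function name is illustrative.

```python
import math


def binomial_standard_error(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy estimate from n independent pass/fail problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# AIME 2025: 30 problems, top models near 96%.
se = binomial_standard_error(0.96, 30)
print(f"standard error: {se * 100:.1f} points")       # 3.6 points
print(f"one problem is worth {100 / 30:.1f} points")  # 3.3 points
```

Under these assumptions a single problem moves the score by more than three points, so models separated by half a point are indistinguishable on one run.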

FrontierMath is the exemplar of “still has headroom.” Epoch AI commissioned 350 original research-level math problems from mathematicians including IMO gold medalists and Fields medalists. Each problem requires hours to days of work from a domain researcher. The top score as of April 2026 is GPT-5.5 Pro at 52.4%. At that level, the benchmark still discriminates — and likely will for a year or two.

ARC-AGI-2 tests fluid reasoning on novel visual puzzles. Frontier models broke through the 85% grand prize threshold in early 2026 — GPT-5.5 at 85%, GPT-5.4 Pro at 83.3%, Gemini 3.1 Pro at 77.1% (ARC Prize). The Prize team responded by releasing ARC-AGI-3, on which the same frontier models score below 1%. The frontier of generalisation got reset.

Chatbot Arena — rebranded LMArena in January 2026 — is the only major eval driven by human preference rather than ground-truth labels. Users pick between two anonymised responses; the system fits a Bradley-Terry rating from the pairwise comparisons (Chiang et al., ICML 2024). The catch: a 2025 Cohere/Princeton analysis found that Meta, OpenAI, Google, and Amazon had been submitting many private model variants and publishing only the highest-scoring — a selection effect worth up to 100 Elo points of inflation. Crowdsourced preference is real signal, but it has the same Goodhart problem as every other ranking once labs optimise for it.
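To make the rating step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise votes with the standard iterative (minorise-maximise) update. It is not the leaderboard's actual pipeline, which is not described here; the function names and the vote data are made up for illustration, and ties are ignored.

```python
import math


def fit_bradley_terry(comparisons, iterations=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs; strengths sum to 1."""
    wins = {}
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    models = sorted(wins)
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if j != i and n_ij > 0:  # skip pairs that were never compared
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def elo_gap(strength, a, b):
    """Rating difference on an Elo-style scale (400 * log10 of the strength ratio)."""
    return 400.0 * math.log10(strength[a] / strength[b])


# Hypothetical votes between three anonymised models.
votes = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
strengths = fit_bradley_terry(votes)
print(sorted(strengths, key=strengths.get, reverse=True))  # ['A', 'B', 'C']
```

Because the fit sees only the votes it is given, publishing just the best of many private variants shifts the rating without any change in the fitting code.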

The pattern is unmistakable. Every benchmark that gains industry traction enters a saturation curve. Two-year-old benchmarks are historical context. Six-month-old benchmarks are the ones being optimised for right now. Anything in production should treat benchmark scores as a coarse capability filter — not a substitute for evaluation on your actual workload.

Contamination, Saturation, and Goodhart in Practice

Saturation has two causes. The benign one is genuine capability progress — models really do get better, and a fixed test eventually maxes out. The malignant one is contamination: test data leaks into training data, and the model is no longer being evaluated, it is being asked to recite.

Both are happening simultaneously, and they are hard to separate.

The contamination evidence is concrete. A 2023 analysis of GSM8K found that removing examples whose exact wording appeared in common training corpora dropped some models’ accuracy by up to 13 percentage points (GSM8K-Platinum, 2025). OpenAI’s stated reason for deprecating SWE-bench Verified was that frontier models could reproduce verbatim gold patches — exact memorisation, not generalisation. A survey from EMNLP 2025 catalogues contamination as the central methodological challenge in the field, with neither static decontamination nor dynamic benchmark regeneration providing a clean fix.

The defences that actually work are structural, not procedural:

Timestamp-gated benchmarks. LiveCodeBench tags every problem with its publication date. You evaluate a model only on problems published after the model’s training cutoff. New problems flow in monthly; old ones become reference but not signal.
Held-out evaluation. Build private eval sets that never leave your infrastructure. Public benchmarks tell you the floor; private benchmarks tell you the truth.
Adversarial perturbation. Rewrite known benchmark questions just enough that memorisation fails but the underlying capability still applies. The performance gap between original and perturbed versions is a contamination proxy.
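Two of these defences reduce to a few lines of code. The sketch below is illustrative only: the `Problem` record, the function names, and the sample dates and accuracies are assumptions, not LiveCodeBench's actual schema or any published result.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    problem_id: str
    published: date


def problems_after_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


def contamination_gap(original_accuracy, perturbed_accuracy):
    """Accuracy drop when known questions are reworded; a large gap hints at memorisation."""
    return original_accuracy - perturbed_accuracy


# Hypothetical problem pool and a hypothetical training cutoff.
pool = [
    Problem("p1", date(2025, 3, 1)),
    Problem("p2", date(2025, 11, 15)),
    Problem("p3", date(2026, 2, 1)),
]
fresh = problems_after_cutoff(pool, training_cutoff=date(2025, 9, 30))
print([p.problem_id for p in fresh])  # ['p2', 'p3']

gap = contamination_gap(original_accuracy=0.91, perturbed_accuracy=0.78)
print(f"gap: {gap:.2f}")  # gap: 0.13
```

The gap is a proxy, not proof: a perturbation that accidentally makes the question harder also lowers accuracy, so the rewritten items need human review.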

The deeper lesson is Goodhart. The instant a benchmark becomes the metric a model launch is judged on, it becomes a training target — and a training target is no longer a measurement. You cannot escape this dynamic by inventing a harder benchmark. You can only outrun it — which is what the frontier set has been doing every six months for two years.

LLM-as-Judge

If human evaluation does not scale and rigid metrics like BLEU miss the point of generative text, the obvious move is to use one LLM to grade another. The technique is now standard across production eval stacks. The foundational result is from Zheng et al., NeurIPS 2023: GPT-4 acting as judge agrees with human preferences over 80% of the time — comparable to the agreement rate between two human annotators on the same task.

That number is what made LLM-as-judge production-viable. It scales, it is consistent, it is two orders of magnitude cheaper than crowdsourced annotation, and — crucially — a strong judge model can grade tasks that are too specialised or too long-form for fast human review.

It also has known failure modes that you must design around.

Figure: LLM-as-judge, protocol and known biases

Candidate response: “Atlas uses Snappy compression by default and also supports zstd, zlib, and LZ4.”

Rubric (judge prompt): Extract every factual claim. Verdict each: supported / contradicted / absent. Score = supported / total.

Judge output: supported (Snappy default); supported (zstd, zlib); absent (LZ4). Score = 0.67.

Three biases to design around:

Position bias. Present each pair in both orderings; average the result.
Verbosity bias. Add length constraints to the rubric or normalise by length.
Self-preference. Use a judge from a different model family than the candidate.

The three biases

Position bias. When given two responses to compare, judges prefer the response that appears first (or sometimes second — it varies by judge family). The effect is strongly modulated by the quality gap: when the two responses are close, position bias dominates (Shi et al., IJCNLP 2025). Mitigation: present each pair in both orderings and average the results.

Verbosity bias. Judges prefer longer responses, even when the additional length adds no informational content. This is an artifact of generative pre-training and RLHF: the same training process that makes models hedge and pad in their own outputs also biases them to reward those patterns in others (Justice or Prejudice?, 2024). Mitigation: add length constraints to the rubric or normalise scores by length.

Self-preference. Judges prefer outputs that look like their own. Wataoka et al. (2024) traced the mechanism: judges assign higher scores to outputs with lower perplexity, regardless of authorship. A model’s own outputs naturally have low perplexity under its own distribution — so it preferentially rewards them. Mitigation: use a judge from a different model family than the candidate.
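The position-bias mitigation is mechanical enough to sketch. The code assumes a hypothetical `judge` callable that returns the probability that the first-listed response is better; `toy_judge` is a stand-in with a deliberate first-slot bias, not a real model call.

```python
from typing import Callable

# judge(question, first, second) returns P(first response is better), in [0, 1].
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> float:
    """Probability that `a` beats `b`, averaged over both presentation orders."""
    a_first = judge(question, a, b)
    b_first = judge(question, b, a)
    return (a_first + (1.0 - b_first)) / 2.0


def toy_judge(question: str, first: str, second: str) -> float:
    """Stand-in judge: true quality gap plus a +0.2 bonus for whatever is shown first."""
    quality = {"short answer": 0.5, "better answer": 0.6}
    raw = 0.5 + (quality[first] - quality[second]) + 0.2
    return min(1.0, max(0.0, raw))


single = toy_judge("q", "better answer", "short answer")
both = debiased_preference(toy_judge, "q", "better answer", "short answer")
print(f"one ordering: {single:.2f}, both orderings: {both:.2f}")
# one ordering: 0.80, both orderings: 0.60
```

Averaging cancels the bias exactly here only because the toy bias is additive and never clipped; with a real judge it reduces the effect rather than removing it, and it doubles the judge cost.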

A worked rubric: RAG faithfulness

Let’s make this concrete with the eval most production teams encounter first — whether a RAG system is staying grounded in its retrieved context.

The metric. Faithfulness measures how factually consistent a generated answer is with the documents the retriever returned. It catches a specific failure — the model adds claims that are not supported by the context. The Ragas formulation: faithfulness = (claims supported by the context) / (total claims in the response). Range 0 to 1. Higher is better (Ragas docs).

Judge prompt — RAG faithfulness rubric

You are evaluating a RAG system's faithfulness to its retrieved context.

GIVEN:
- A user question
- The context chunks the retriever returned
- The generated answer

TASK:
1. Extract every factual claim made in the answer. A claim is any
   statement that asserts something is true.
2. For each claim, determine whether it is supported by, contradicted
   by, or absent from the provided context.
3. A claim is supported only if the context directly entails it.
   Reasonable inference is not support. World knowledge is not support.
4. Compute: faithfulness = supported / total_claims.

OUTPUT FORMAT (strict JSON):
{
  "claims": [
    { "text": "<claim>", "verdict": "supported|contradicted|absent",
      "evidence": "<exact quote from context, or null>" }
  ],
  "score": <float between 0 and 1>
}

Why this rubric works. It atomises the judgment. Instead of asking the judge “is this answer faithful, on a scale of 1–5” — which invites verbosity bias and rubric drift — it asks “list the claims, check each one, return a ratio.” The output is structured. The score is computed, not chosen. The evidence quote is mandatory, which gives you an audit trail when you need to debug a low score.

A concrete example. Question: which compression algorithm does MongoDB Atlas use for collection data? Retrieved context says: “Atlas uses Snappy by default; you can switch to zstd or zlib at collection creation time.” The generated answer claims Atlas supports Snappy, zstd, zlib, and LZ4. The judge extracts three claims: (1) Snappy default — supported; (2) zstd and zlib — supported; (3) LZ4 — absent from context (and incidentally false). Faithfulness = 2/3 = 0.67. The audit trail tells you exactly which claim was unsupported, so you know whether to fix the retriever (missing context) or the prompt (model hallucinated).
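One way to keep the score "computed, not chosen" is to recompute it in code from the per-claim verdicts and ignore the judge's own arithmetic. The sketch below assumes the judge returned JSON in the format shown in the rubric; the function name is illustrative, and the sample output is hand-written to match the Atlas example rather than produced by a real judge.

```python
import json

VALID_VERDICTS = {"supported", "contradicted", "absent"}


def faithfulness_from_judge(raw_json: str) -> float:
    """Recompute faithfulness from per-claim verdicts, ignoring the judge's 'score' field."""
    claims = json.loads(raw_json)["claims"]
    if not claims:
        raise ValueError("judge returned no claims")
    for claim in claims:
        if claim["verdict"] not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {claim['verdict']!r}")
        # The rubric makes the evidence quote mandatory for supported claims.
        if claim["verdict"] == "supported" and not claim.get("evidence"):
            raise ValueError("supported claim is missing an evidence quote")
    supported = sum(1 for c in claims if c["verdict"] == "supported")
    return supported / len(claims)


# Hand-written judge output for the Atlas compression example.
judge_output = json.dumps({
    "claims": [
        {"text": "Snappy is the default", "verdict": "supported",
         "evidence": "Atlas uses Snappy by default"},
        {"text": "zstd and zlib are supported", "verdict": "supported",
         "evidence": "you can switch to zstd or zlib at collection creation time"},
        {"text": "LZ4 is supported", "verdict": "absent", "evidence": None},
    ],
    "score": 0.9,  # deliberately wrong: the recomputed value is what gets used
})
print(round(faithfulness_from_judge(judge_output), 2))  # 0.67
```

Validating verdicts and evidence before scoring also turns a malformed judge response into a loud failure instead of a silently wrong metric.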

That structure — atomic claims, per-claim verdicts, computed score — is what separates an evaluation from a vibe check. Apply the same pattern to other rubrics: code correctness becomes “list the test cases, check each one.” Answer relevance becomes “list the question’s sub-asks, check coverage of each.” The more atomic the judgment, the less room for bias.

When not to use LLM-as-judge

It is not a universal hammer. The judge needs the capability to grade the task. A judge weaker than the candidate is unreliable, especially on tasks the judge cannot itself solve (advanced math, complex multi-step reasoning, domain-specific code review). For these, you need a stronger judge model, ground-truth labels, or human review on a sampled tier.

LLM-as-judge also costs real money at scale. A faithfulness eval over 10K traces with a GPT-5-class judge is non-trivial in dollar terms. The usual production setup: cheap automatic checks (schema validity, exact match, regex) catch the obvious failures, LLM-as-judge handles the nuanced metrics, and a sampled human review tier calibrates the judge itself.

Building Your Own Evaluation

Generic benchmarks tell you the model has the capability in general. They do not tell you the model performs your task on your inputs. A model that scores 94% on GPQA Diamond may still fail to write a correct MongoDB aggregation against your specific schema. The only evaluation that answers that is the one you build.

The playbook is simpler than most teams think, but it requires discipline.

The eval flywheel: six steps that close on themselves

1. Collect production traces. Sample 50–200 real user inputs from the application. Needs: logged traces, sampling policy.
2. Annotate ideal outputs. A domain expert writes the correct answer for each input. Needs: annotation guidelines, gold labels.
3. Layer the graders. Automatic checks, then LLM-as-judge, then sampled human review. Needs: schema validators, judge prompt, review rota.
4. Version everything. Prompt, model, retrieval config, eval set, judge model. Needs: Git refs, config registry.
5. Wire into CI. Eval runs on every prompt change, retrieval change, model swap. Needs: CI hooks, pass/fail thresholds.
6. Iterate (flywheel). Every production failure becomes a new eval case. Needs: feedback loop, monotonic growth.

Every failure flows back into step 1, so the eval set grows monotonically.

Start from production traces, not synthetic prompts. Sample 50 to 200 real user inputs from your application. Synthetic prompts written by your team will not match the distribution your users actually generate — they will be cleaner, better-formed, and miss the long tail of real-world phrasing.

Annotate ideal outputs. Have a domain expert write what the correct answer looks like for each input. This is the only step that cannot be automated. The eval set is only as good as these annotations — shortcuts here poison every downstream metric.

Layer the graders. Three tiers, applied in order: automatic deterministic checks (exact match, JSON schema validity, regex, latency, token count, cost) catch the largest class of failures cheaply; LLM-as-judge handles faithfulness, answer relevance, tone, rubric scoring; a rotating 5–10% human sample calibrates the LLM judge and surfaces failures both other graders miss.

Version everything. Prompt version, model version, retrieval config, eval set version, judge model and judge prompt. When a metric regresses, you need to know what changed.

Wire it into CI. Eval runs on every prompt change, every retrieval config change, every model swap. If it does not run automatically, it does not run. The cost of running a 200-example eval suite per change is dollars; the cost of shipping a regression to production is days of debugging plus user trust.

Iterate via the flywheel. Every production failure — bug report, support ticket, thumbs-down — becomes a new eval case. The eval set grows monotonically. Over time, it becomes a high-fidelity map of every failure mode your system has ever hit, and the regression suite catches them all before they ship again.
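The layered graders and the CI gate described above can be sketched in a few functions. Everything here is an assumption made for illustration: the output schema (a JSON object with an `answer` field), the refusal regex, the 0.8 judge threshold, the 10% review sample, and the 95% pass rate are placeholders to tune, and `judge` stands in for whatever LLM-as-judge call you use.

```python
import hashlib
import json
import re
from typing import Callable, Optional


def deterministic_checks(output: str) -> Optional[str]:
    """Tier 1: cheap checks. Returns a failure reason, or None if the output passes."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(payload, dict) or "answer" not in payload:
        return "missing 'answer' field"
    if re.search(r"as an ai language model", str(payload["answer"]), re.IGNORECASE):
        return "boilerplate refusal"
    return None


def grade(output: str, judge: Callable[[str], float], judge_threshold: float = 0.8) -> dict:
    """Run tier 1 first; call the more expensive judge only if tier 1 passes."""
    failure = deterministic_checks(output)
    if failure is not None:
        return {"passed": False, "tier": "deterministic", "reason": failure}
    score = judge(output)
    return {"passed": score >= judge_threshold, "tier": "judge", "score": score}


def needs_human_review(case_id: str, sample_rate: float = 0.1) -> bool:
    """Tier 3: hash-based sample, so the same cases are selected on every run."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 1000 < sample_rate * 1000


def ci_gate(results: list, min_pass_rate: float = 0.95) -> bool:
    """Fail the build when the pass rate drops below the threshold."""
    if not results:
        return False
    passed = sum(1 for r in results if r["passed"])
    return passed / len(results) >= min_pass_rate


# Usage with a stub judge that always returns 0.9.
results = [
    grade('{"answer": "Snappy"}', judge=lambda out: 0.9),
    grade("not json at all", judge=lambda out: 0.9),
]
print([r["tier"] for r in results])  # ['judge', 'deterministic']
print(ci_gate(results))              # False
```

Running the deterministic tier first means a malformed output never reaches the judge. A rotating human sample, as described above, would mix a run identifier into the hashed key so the selection changes between review rounds.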

The eval set is your single most valuable artifact. It captures what “good” means for your product in a form that survives model upgrades, prompt rewrites, and team changes. Treat it the same way you treat your production code — version it, review it, protect it.
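Versioning is easier to enforce when every eval run carries a manifest of what produced it. A minimal sketch follows, with hypothetical field values and helper names; in practice the versions would come from Git refs or a config registry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalManifest:
    prompt_version: str
    model_version: str
    retrieval_config: str
    eval_set_version: str
    judge_model: str
    judge_prompt_version: str

    def fingerprint(self) -> str:
        """Stable short hash of the full configuration, for tagging a results row."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_fields(before: EvalManifest, after: EvalManifest) -> list:
    """Names of the fields that differ between two runs."""
    a, b = asdict(before), asdict(after)
    return [name for name in a if a[name] != b[name]]


# Hypothetical runs: only the judge model changed between them.
last_week = EvalManifest(
    prompt_version="prompt-v12",
    model_version="model-a",
    retrieval_config="top_k=5",
    eval_set_version="evalset-v7",
    judge_model="judge-x",
    judge_prompt_version="faithfulness-v3",
)
today = replace(last_week, judge_model="judge-y")
print(changed_fields(last_week, today))                 # ['judge_model']
print(last_week.fingerprint() == today.fingerprint())   # False
```

When a metric moves and the only changed field is the judge, the regression may be in the measurement rather than in the system under test.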

A brief word on safety

Safety evaluation is the same discipline applied to adversarial inputs and undesired outputs. The pattern does not change — you collect inputs (jailbreak prompts, prompt injection attempts, harmful requests), define failure modes, and run automated checks in CI. What is different is the input distribution. Adversarial inputs are not in your production traces; you have to either curate them yourself, use a published red-team suite, or generate them programmatically. The UK AI Security Institute’s Inspect framework is the most credible open-source toolkit for this — it includes a sandboxing layer for agent tool calls, which matters once your system can execute code or hit APIs. The recurring finding across published red-team studies is that role-play jailbreaks (fictional-character framing, hypothetical scenarios) succeed against frontier models more often than direct attacks. Treat safety the way you treat any other failure mode: a defined set of inputs, a defined set of bad outputs, automated checks on every change. The moment safety eval lives in a Word document instead of CI, it is decoration.

The Production Eval Framework Landscape

You do not have to build the eval infrastructure from scratch. The ecosystem matured significantly through 2024 and 2025. The frameworks split along three rough axes: open vs commercial, academic-benchmark-focused vs application-focused, and standalone vs platform.

Production eval framework landscape — snapshot May 2026

The ecosystem ages faster than this article does. Treat the row order as a starting catalogue, not a ranking.

| Framework | Owner | Licence | Category | Best for | Distinct strength |
|---|---|---|---|---|---|
| lm-evaluation-harness | EleutherAI | MIT | Academic | Reproducible academic benchmarks | Powers the HF Open LLM Leaderboard |
| OpenAI Evals | OpenAI | MIT | Academic | Custom benchmark research | Rigour over ergonomics |
| Inspect | UK AISI + Meridian | MIT | Safety | Frontier safety + agentic evals | Built-in agent sandbox |
| Promptfoo | Promptfoo / OpenAI | MIT | Production | CI/CD-gated prompt evals | Declarative YAML, model-agnostic |
| Ragas | Exploding Gradients | Apache 2.0 | RAG | RAG faithfulness + relevancy | Pre-built RAG metrics |
| Braintrust | Braintrust | Proprietary | Platform | End-to-end eval platform | Prompts + traces + regression diagnosis |
| LangSmith | LangChain | Proprietary | Platform | LangChain-stack observability | Tightest fit for LangChain apps |
| HELM | Stanford CRFM | Apache 2.0 | Academic | Holistic multi-metric comparison | 42 scenarios × 7 metrics |

The recurring production pattern: a lightweight framework for CI/CD gating (Promptfoo, or Ragas for RAG) paired with an observability platform for traces and human annotation (Braintrust or LangSmith), plus Inspect for safety and agent work. There is no single tool that does all three jobs well, and trying to force it usually means missing one of them entirely.

Evaluation tells you whether your single-model system is working. But the frontier of LLM deployment has moved past single-turn question answering — the models are taking actions. They call tools, query databases, write and execute code, navigate the web, and chain dozens of steps to achieve a goal. Each of those steps multiplies the surface area of what can go wrong, and the same eval discipline you just learned has to extend across action sequences, tool calls, and multi-step trajectories.

So start with the system you have. Pick the three failure modes that would hurt most in production, write evals that catch them, and wire those evals into CI before you ship the next prompt change. The teams that ship reliable models are not the ones with the best benchmark scores — they are the ones who measure the things their benchmarks miss.