The most dangerous eval mistake a shipping team makes is not the absence of evals.
It is having a leaderboard score and treating it as an eval.
Leaderboards measure a model's ability on tasks the benchmark authors chose.
Task suites measure a product's ability on tasks your users actually perform.
These are not the same measurement, and mixing them produces confident numbers that do not answer the question you are actually asking.
The picture
Two stacked diagrams with different axes.
Leaderboard (top): rows are models, columns are generic task categories drawn from published benchmarks — MMLU, GSM8K, HumanEval, SWE-bench Lite. Optimized for public comparability across vendors. The audience is researchers, analysts, and vendors competing for ranking. The lifespan is long — benchmarks are designed to be stable so ranks are comparable over time.
Task suite (bottom): rows are your JTBDs (jobs-to-be-done: the specific outcomes your users are trying to achieve), columns are inputs sampled from your real production traffic. Optimized for one product's ship/don't-ship decision. The audience is the team deploying the system. The lifespan is as long as the product's task distribution is stable, which is usually months, not years.
Different audiences. Different criteria. Different lifespans. Never averaged into a single number.
Why it matters now
The public-leaderboard era created a useful comparability surface and an unhelpful illusion at the same time.
The useful surface: you can, for the first time, orient quickly to how a new model family performs across a broad class of tasks. When GPT-4o drops, the benchmarks give you a starting hypothesis about whether it is worth evaluating. That is real value.
The illusion: that the model's leaderboard rank is a meaningful proxy for whether it works in your product.
It usually is not — for two reasons.
First, task distribution mismatch. SWE-bench (a benchmark built from real GitHub issues on open-source Python repos) measures something different from "can this agent fix bugs in our enterprise Java monolith." The surface resemblance is misleading. Your users are not filing issues on public repos; they are doing something that looks superficially similar and behaves entirely differently at the distribution level.
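Second, contamination risk. When a benchmark's test items overlap with a model's training data, the model can score well by recalling those items or their formats rather than by generalizing. A published score may therefore overstate performance on genuinely novel examples. Contamination is not the only way a public score misleads: SWE-bench Verified, a human-validated subset of SWE-bench, was released in 2024 because some original tasks had underspecified issue descriptions or tests that rejected valid fixes, and reported scores shifted meaningfully once those tasks were filtered out.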
When you make a model selection decision based on leaderboard rank and the decision turns out wrong, the cause is usually one of these two.
Not bad judgment.
Wrong input.
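The first failure, task distribution mismatch, can be roughly quantified before any model is run. The sketch below is an illustration, not a standard tool. It assumes you have labeled a sample of your production requests and a sample of benchmark items with a shared set of task categories. The function names `category_mix` and `total_variation_distance` and all labels are hypothetical. It computes the total variation distance between the two category mixes: 0 means identical mixes, 1 means no overlap at all.

```python
from collections import Counter
from typing import Dict, Iterable


def category_mix(labels: Iterable[str]) -> Dict[str, float]:
    """Turn a list of task-category labels into a share per category."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the summed absolute difference between two category mixes."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Made-up labels for illustration: your traffic vs. a bug-fix-only benchmark.
production = category_mix(["triage", "triage", "bug_fix", "refactor"])
benchmark = category_mix(["bug_fix", "bug_fix", "bug_fix", "bug_fix"])

mismatch = total_variation_distance(production, benchmark)
print(f"{mismatch:.2f}")  # prints 0.75
```

A high distance does not prove the leaderboard is useless; it only signals how much of your traffic the benchmark never exercises. Category labels hide differences within a category, so treat a low distance as weak evidence of a match.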
Sources you should trust
- METR's writing on task suite design. METR (Model Evaluation and Threat Research) publishes operator-grade discipline on what a good internal suite looks like, distinct from a publishable benchmark. Their framing of "what does this number actually measure" is consistently useful.
- The SWE-bench Verified release notes (princeton-nlp, 2024). The clearest example of a public leaderboard being explicitly upgraded to better reflect real engineering work: human annotators screened out tasks with underspecified issue descriptions or tests that rejected valid fixes. The gap between original SWE-bench scores and Verified scores on the same agents is instructive about benchmark quality in general.
- BIG-bench (Beyond the Imitation Game benchmark). A useful counter-example: a benchmark deliberately designed to be hard to saturate, with task diversity that covers failure modes standard benchmarks miss. The design choices explain why it is harder to overfit.
- "A Survey on Evaluation of Large Language Models" (Chang et al. 2023). Dense, but the taxonomy of evaluation axes (knowledge, reasoning, instruction following, factuality, calibration) is worth having in your vocabulary when you design your own suite.
A recipe
How to build your task suite starting from a published leaderboard, rather than starting from scratch:
- Pick the published leaderboard whose domain most resembles yours. SWE-bench for code agents. MMLU for broad multiple-choice knowledge questions. MT-Bench for open-ended response quality. Note the resemblance and the gaps; both matter.
- List the top three ways your JTBD distribution differs from the leaderboard's. Write these down explicitly. "Our users ask multi-turn follow-ups, the benchmark is single-turn." "Our content is in Hindi and Marathi, the benchmark is English-only." "Our task requires tool invocation, the benchmark grades text output."
- For each difference, write one task type in your suite that explicitly covers it. These are the load-bearing additions — the places where the leaderboard will give you a false read.
- For any leaderboard-like tasks you do include, gather inputs from your own users, not from the leaderboard. The leaderboard's inputs may be memorized. Your users' inputs are fresh.
- Score the suite separately from the public leaderboard. Never average them into a single quality number. They are measuring different things; mixing them destroys the signal in both.
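The recipe above can be sketched as a small data structure plus one scoring function. This is a minimal sketch under stated assumptions, not a prescribed harness: `SuiteCase`, `score_suite`, `EvalReport`, the stand-in model `echo_upper`, and the `exact_match` grader are all hypothetical names, and a real suite would call your deployed system and use your own grading rubric.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class SuiteCase:
    task_type: str   # one task type per documented gap, e.g. "multi_turn"
    user_input: str  # sampled from your own users, never copied from the benchmark
    expected: str    # your own expected output


def score_suite(
    cases: Iterable[SuiteCase],
    model: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the pass rate per task type for one model."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.task_type] += 1
        if grade(model(case.user_input), case.expected):
            passed[case.task_type] += 1
    return {task_type: passed[task_type] / total[task_type] for task_type in total}


@dataclass(frozen=True)
class EvalReport:
    leaderboard: Dict[str, float]  # published scores, kept for orientation only
    suite: Dict[str, float]        # per-task-type pass rates on your own inputs
    # Deliberately no combined or averaged field: the two are never merged.


# Stand-in model and grader so the sketch runs without any API access.
def echo_upper(prompt: str) -> str:
    return prompt.upper()


def exact_match(output: str, expected: str) -> bool:
    return output == expected


cases = [
    SuiteCase("multi_turn", "hello", "HELLO"),
    SuiteCase("multi_turn", "bye", "goodbye"),
    SuiteCase("tool_use", "ping", "PING"),
]
report = EvalReport(
    leaderboard={"public_benchmark": 0.0},  # placeholder, not a real published score
    suite=score_suite(cases, echo_upper, exact_match),
)
print(report.suite)
```

Keeping the leaderboard and suite scores in separate fields, with no field that combines them, makes the "never average" rule a property of the report format rather than something reviewers have to remember.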
The smell of it going wrong
- A team cites SWE-bench numbers as the reason to ship a coding-agent feature whose JTBD is internal ticket triage — which shares the word "code" with SWE-bench and essentially nothing else.
- The task suite uses public benchmark inputs verbatim. This creates contamination risk and defeats the purpose of the internal suite.
- The team has no articulated position on where their suite and the leaderboard agree or disagree. They have two numbers and no model connecting them.
- One team member tracks the leaderboard, a different team member runs the internal suite, and neither communicates. The numbers contradict each other at the next model upgrade and nobody can explain why.
- The suite was built once, and when a new model arrived the team read the leaderboard as a "proxy update" instead of rerunning the suite. This is the leaderboard-as-proxy failure that builds up silently until a model upgrade goes wrong.
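One of these smells, public benchmark inputs reused verbatim in the internal suite, is cheap to check mechanically. The sketch below is a hypothetical helper (`normalize` and `find_verbatim_reuse` are names invented here, and the example strings are made up). It only catches exact reuse after lowercasing and whitespace cleanup, so paraphrased or lightly edited benchmark items will pass through unnoticed.

```python
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide reuse."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_verbatim_reuse(
    suite_inputs: Iterable[str], benchmark_inputs: Iterable[str]
) -> List[str]:
    """Return suite inputs that also appear in the public benchmark."""
    benchmark_seen = {normalize(item) for item in benchmark_inputs}
    return [item for item in suite_inputs if normalize(item) in benchmark_seen]


# Invented examples: the first suite input duplicates a benchmark item.
suite_inputs = [
    "Fix the off-by-one error in   the pagination helper",
    "Triage this ticket: checkout page times out for EU users",
]
benchmark_inputs = ["fix the off-by-one error in the pagination helper"]

reused = find_verbatim_reuse(suite_inputs, benchmark_inputs)
print(reused)
```

An empty result is not a clean bill of health; it only rules out the most obvious form of reuse.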
A judgment call from real work
The PL course-content scoring pipeline faced this directly when evaluating which model to use for competency fit scoring — the task of deciding how well a lesson covers each of eight PM skill competencies.
The available leaderboards pointed clearly toward one model family on "reasoning" tasks. The internal task suite, built from real PL course content and scored by a human reviewer against the eight-competency framework, returned a different answer. Three of the eight competencies showed significantly lower fit scores with the "stronger" model. On unpacking this, the issue was that the competency rubric for India-context product work did not map well to the reasoning patterns the benchmark tasks rewarded. The benchmark had no India-specific product cases. PL's content was largely built from them.
The team chose the model the internal suite preferred, not the one the leaderboard ranked higher. It was the right call, confirmed by six months of production fit-score quality.
The leaderboard was not wrong. It was answering a question the team was not actually asking.
This pattern repeats across every model selection decision where the team reaches for a leaderboard score as a shortcut. The shortcut is tempting because running an internal suite takes time and the leaderboard is already there. The cost of the shortcut is proportional to how different your task distribution is from the benchmark's — which is usually quite different, and usually more different than it looks from the surface description.
The practice worth building: before any model upgrade or model switch decision, run your internal suite on both the old and new model, on your own inputs, scored against your own expected outputs. The leaderboard is orientation. The suite is the decision. Keeping that distinction sharp saves the team from chasing benchmark improvements that do not translate into product improvements.
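The old-versus-new comparison can be as simple as a per-task-type diff of suite scores. The sketch below assumes each model's suite results are already available as a mapping from task type to pass rate. The function name `find_regressions`, the `tolerance` threshold of 0.02, and all scores shown are made up for illustration.

```python
from typing import Dict


def find_regressions(
    old_scores: Dict[str, float],
    new_scores: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return task types where the new model is worse by more than `tolerance`."""
    missing = set(old_scores) - set(new_scores)
    if missing:
        raise ValueError(f"new model was not run on: {sorted(missing)}")
    regressions = {}
    for task_type, old_score in old_scores.items():
        delta = new_scores[task_type] - old_score
        if delta < -tolerance:
            regressions[task_type] = delta
    return regressions


# Made-up per-task-type pass rates, for illustration only.
old_scores = {"multi_turn": 0.80, "tool_use": 0.70, "hindi_content": 0.90}
new_scores = {"multi_turn": 0.85, "tool_use": 0.71, "hindi_content": 0.60}

regressions = find_regressions(old_scores, new_scores)
print(sorted(regressions))  # prints ['hindi_content']
```

The comparison is kept per task type because an average across task types can improve while one load-bearing task type regresses, which is the kind of result the leaderboard cannot show you.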
In the next lesson, we go one level deeper on SWE-bench specifically — because it is the benchmark most teams encounter first, it is the one most often misread, and understanding exactly what it measures and what it does not is a 2026 PM literacy item.
Rules from this lesson
- The leaderboard is for comparability across vendors. The suite is for ship/don't-ship decisions on your product. Do not conflate them.
- Your task suite is built from your inputs, graded against your expected outputs. The leaderboard's inputs are not yours.
- Score the suite separately from the leaderboard; never average them into one quality number. Mixing them destroys the signal in both.