No AI feature ships without a golden set. If you cannot measure it, you are guessing in production — and your users are doing your QA for you, badly, on the days that matter most.
After this chapter, you’ll be able to:
- Build a golden set in three deliberate waves — 10 cases, 100 cases, 1000 cases — and know what each wave is for
- Tell pass/fail evals from scored evals from LLM-as-judge evals, and pick the right one per task
- Run regression evals on every prompt change so you stop shipping silent quality regressions to users
- Decide between Braintrust, LangSmith, and a homemade harness without buying the wrong tool first
There is a meeting I have walked into too many times. The team has spent a quarter building an AI feature. The demo is crisp. The founder is happy. The launch is on Friday. I ask one question: "How do you know it works?" The answer is a shrug, a screenshot of three good examples, and a sentence that begins "in our experience…". That feature will ship. Six weeks later, someone tweaks the prompt to fix one customer complaint and silently breaks twenty other cases nobody is checking. The team will not notice for a month. The users will.
This is the chapter on how to stop being that team. The discipline has a name — evals — and it is the single largest delta between AI teams that ship product and AI teams that ship demos. If chapter 1 was the gate before AI hits the roadmap, and chapters 2 and 3 were how to pick the model and write the prompt, this chapter is the gate before any of that work reaches a real user.
The header rule: no AI feature ships without a golden set, and no prompt change merges without running it.
What an eval set actually is
An eval set — also called a golden set or regression set — is a curated list of inputs paired with expected outputs (or expected properties of outputs) that you run your AI feature against every time something changes. The prompt. The model version. The retrieval index. The temperature. Each of those is a deploy, and each deploy must run the golden set before it touches users.
The mental model is the one you already use for traditional software: evals are tests for non-deterministic code. The reason teams skip them is that the standard test paradigm — assert that f(x) == y — does not survive contact with an LLM. Two reasonable outputs will not be string-equal. The right conclusion is not that tests go away; it is that the assertion has to get smarter.
A golden set is also not a benchmark. Benchmarks (MMLU, HumanEval, MT-Bench) measure generic model capability. Your golden set measures your feature on your users' inputs against your quality bar. The lab benchmarks help pick a model rung (chapter 2). Your golden set tells you whether your feature is ready to ship today.
How to build one: 10 → 100 → 1000
Build the eval set in three waves. Each wave answers a different question. Most teams either skip the first wave (and waste weeks scaling something they should have killed) or stop at the first wave (and ship a feature that crumbles on the second day's traffic).
Wave one: the hand-curated ten. Sit down — you, one engineer, one domain expert — and write down ten inputs your AI feature must handle. Then write what the right output looks like for each. Not the literal string. The properties: must mention the policy number, must not invent a product SKU, must answer in fewer than 150 words, must refuse politely if out of scope. Ten cases, half an afternoon, no tooling. The goal is to discover whether the feature is feasible at all — running ten well-chosen inputs by hand usually reveals four or five obvious failure modes you would otherwise have shipped.
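The wave-one properties can be written down as small pass/fail functions before any tooling exists. The sketch below is one possible shape in Python. The policy-number format (`POL-` plus six digits), the SKU format, and the `KNOWN_SKUS` catalogue are assumptions invented for illustration, not something this chapter prescribes.

```python
import re
from typing import Callable, Dict

# Hypothetical catalogue; in practice this would come from the product database.
KNOWN_SKUS = frozenset({"SKU-1001", "SKU-2002"})

def mentions_policy_number(output: str) -> bool:
    # Assumed format for illustration: "POL-" followed by exactly six digits.
    return re.search(r"\bPOL-\d{6}\b", output) is not None

def no_invented_sku(output: str) -> bool:
    # Every SKU-like token in the output must exist in the known catalogue.
    return all(sku in KNOWN_SKUS for sku in re.findall(r"\bSKU-\d+\b", output))

def under_word_limit(output: str, limit: int = 150) -> bool:
    # "Fewer than 150 words", counted by whitespace splitting.
    return len(output.split()) < limit

CHECKS: Dict[str, Callable[[str], bool]] = {
    "mentions_policy_number": mentions_policy_number,
    "no_invented_sku": no_invented_sku,
    "under_word_limit": under_word_limit,
}

def run_checks(output: str) -> Dict[str, bool]:
    # One named True/False result per property, so failures are easy to read.
    return {name: check(output) for name, check in CHECKS.items()}

print(run_checks("Your policy POL-123456 covers item SKU-1001."))
```

Each case in the hand-curated ten then becomes an input plus the subset of checks that must pass for it.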
Wave two: the structured hundred. Once wave one says the feature has a pulse, expand to roughly a hundred cases stratified by category. The structure matters more than the count. For a support-deflection bot: thirty common-question cases, thirty edge cases (sarcasm, multi-question messages, code-switching across languages, typos), twenty refusal cases, twenty adversarial cases (prompt injection, jailbreaks). For a classifier, every class represented plus the hard cases at the boundaries. The hundred is where you stop trusting your gut and start running cases through an actual harness — a script, a spreadsheet, or a tool (Braintrust, LangSmith, Promptfoo). This is what you regress against on every prompt change for the feature's first six months.
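Because the structure matters more than the count, it can help to check the mix mechanically as the set grows. A minimal sketch, assuming each case is a dict with a `category` field (a hypothetical schema) and using the 30/30/20/20 split from the support-deflection example:

```python
from collections import Counter

# Target mix for the support-deflection example: 30/30/20/20.
TARGET_MIX = {"common": 30, "edge": 30, "refusal": 20, "adversarial": 20}

def stratification_gaps(cases, target=TARGET_MIX):
    """Return {category: number of cases still missing} for under-filled strata."""
    counts = Counter(case["category"] for case in cases)
    return {cat: need - counts[cat] for cat, need in target.items() if counts[cat] < need}

# Example: a set that is short on edge and adversarial cases.
cases = (
    [{"category": "common"}] * 30
    + [{"category": "edge"}] * 22
    + [{"category": "refusal"}] * 20
    + [{"category": "adversarial"}] * 15
)
print(stratification_gaps(cases))  # {'edge': 8, 'adversarial': 5}
```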
Wave three: the production-mined thousand. After launch, real users generate the third wave for you. Sample real inputs — anonymized, fair across user segments, with bad-case oversampling so you do not drown signal in easy traffic. Label the right output (or output properties) for each. This is the wave that catches the failure modes nobody on the team thought of — weddings, regional festivals, half-finished sentences, PDFs photographed off a laptop screen at an angle.
The easiest way to staff wave three: have the support team — whoever is absorbing the AI's mistakes today — flag bad outputs in a one-click workflow. Their flags become next month's eval cases.
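Bad-case oversampling can be sketched as a two-bucket sample: reserve a fixed share of the labeling budget for flagged interactions and fill the rest from ordinary traffic. The `flagged` field and the 50% share are assumptions for illustration, and this sketch does not attempt the fairness-across-segments requirement, which would need a further stratification step.

```python
import random

def sample_for_labeling(logs, n, bad_fraction=0.5, seed=0):
    """Pick n logged interactions, reserving a share for flagged (bad) ones."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    flagged = [row for row in logs if row["flagged"]]
    unflagged = [row for row in logs if not row["flagged"]]
    # Take as many flagged rows as the budget allows, capped by availability.
    n_bad = min(len(flagged), round(n * bad_fraction))
    n_ok = min(len(unflagged), n - n_bad)
    return rng.sample(flagged, n_bad) + rng.sample(unflagged, n_ok)

# Example: 5% of traffic is flagged, but flagged rows make up half the sample.
logs = [{"id": i, "flagged": i % 20 == 0} for i in range(1000)]
sample = sample_for_labeling(logs, n=100)
print(sum(row["flagged"] for row in sample), "flagged out of", len(sample))
```

Anonymization should happen before rows reach a function like this, not after.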
Pass/fail evals vs scored evals
There are two shapes for an eval case, and most features need a mix.
Pass/fail evals are binary. Did the output mention the policy number? Did it refuse the off-scope question? Did the JSON parse? Did the tool call have the right function name and required arguments? Pass/fail evals are cheap to run, cheap to debug, easy to put into CI. For classifiers, refusal logic, structured-output features, and tool-call features, the majority of your evals will be pass/fail.
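Two of the checks named above, "did the JSON parse" and "did the tool call have the right function name and required arguments", fit in a few lines. This sketch assumes the model emits a tool call as a JSON object with `name` and `arguments` keys; that shape, and the `issue_refund` tool, are hypothetical and will differ by provider.

```python
import json

def json_parses(output: str) -> bool:
    # Pass if the raw output is valid JSON of any shape.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def tool_call_ok(output: str, expected_name: str, required_args: set) -> bool:
    # Pass if the output is a JSON object naming the expected function
    # and supplying every required argument (extra arguments are allowed).
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    args = call.get("arguments")
    return (
        call.get("name") == expected_name
        and isinstance(args, dict)
        and required_args.issubset(args)
    )

good = '{"name": "issue_refund", "arguments": {"order_id": "A1", "amount": 5}}'
bad = '{"name": "issue_refund", "arguments": {"order_id": "A1"}}'
print(tool_call_ok(good, "issue_refund", {"order_id", "amount"}))  # True
print(tool_call_ok(bad, "issue_refund", {"order_id", "amount"}))   # False
```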
Scored evals are numeric. On a 1–5 scale, how good is this summary? How relevant is this retrieved document? Scored evals are necessary when the quality you care about is genuinely continuous — when "good" and "better" both exist and you want the gradient. They are also more expensive to grade and noisier to interpret. The mistake teams make is using a scored eval where a pass/fail would have worked: scoring "helpfulness" 1–5 when the real question is "did the answer mention the refund policy or not." If you can decompose a scored eval into three or four pass/fail evals, do it.
Default: start with pass/fail. Add scored evals only when the pass/fail decomposition genuinely loses information.
LLM-as-judge: when it works, when it lies
You cannot grade a thousand evals by hand every time you change a prompt. So teams reach for LLM-as-judge — a second LLM call to grade the first one's output. Here is the honest version of when it works and when it does not.
It works when the grading task is bounded. "Does this output mention X?" "Does this output cite at least one source?" "Is this JSON valid against this schema?" "Did this reply correctly identify the department the user needs?" The judge model should usually be at least as strong as the model under test, often a rung above. Calibrate by hand-grading a hundred cases yourself, then comparing the judge's grades to yours. 90%+ agreement: usable. 70%: a hazard pretending to be a metric.
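The calibration step is a plain agreement rate between your hand grades and the judge's grades on the same cases. A minimal sketch for pass/fail labels, using the 90% bar from the paragraph above; the example labels are made up:

```python
def agreement_rate(human, judge):
    """Fraction of cases where the judge's label matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def disagreements(human, judge):
    # Indices worth reading by hand: they show where the judge goes wrong.
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]

# Made-up example: 100 hand-graded cases, judge disagrees on 8 of them.
human = [True] * 70 + [False] * 30
judge = [True] * 66 + [False] * 4 + [False] * 26 + [True] * 4
rate = agreement_rate(human, judge)
print(f"agreement {rate:.0%}:", "usable" if rate >= 0.90 else "not yet trustworthy")
```

Reading the disagreement cases is usually more informative than the headline number, since they tend to cluster around one or two grading instructions the judge misreads.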
It lies when the grading task is vibes-coded. "Is this helpful?" "Is this good?" "Is this accurate?" The judge will produce a confident 4.2/5.0 and you will have learned almost nothing. Worse: the judge systematically prefers outputs that look like its own writing style — longer, more hedged, more politely scoped. You will optimize the prompt for the judge's taste rather than the user's job, and you will not notice until production engagement drops. The Anthropic and OpenAI safety teams have both published on this under various names (reward hacking, self-preference bias). The takeaway: LLM-as-judge must be calibrated against human labels on a sample, audited periodically, and never run as the only metric.
A robust setup has three layers: pass/fail evals on every prompt change; scored LLM-as-judge evals on a sample, with the judge recalibrated against human labels each quarter; and human grading of a 50-case sample each month as a drift check between recalibrations.
Regression evals, online vs offline, A/B
The largest source of AI quality regressions in production is the well-intentioned prompt tweak. Someone fixes a customer complaint by adding a sentence to the system prompt. It works on the one case. It silently breaks twenty others. This is the AI-era "I made a one-line CSS change, what could go wrong" — and the answer is the same: the whole page.
The fix is regression evals on every prompt change. Make it a CI check or a Slackbot — anything that runs the wave-two set and posts a diff before merge. "Pass rate 87/100 → 81/100. Regressions on cases 14, 22, 47, 81, 99, 100." That single line is the entire reason this discipline pays for itself. The same applies to model version changes (Sonnet 4 → 5; GPT-4.1 → 5), index changes (new chunking, embeddings, reranker), and tool-definition changes. Every one is a deploy. Every one runs the golden set first.
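The one-line diff quoted above is cheap to produce once each run is stored as a map from case id to pass/fail. A sketch, assuming both runs cover the same golden set; the example data is constructed to reproduce the 87/100 to 81/100 line:

```python
def regression_report(before, after):
    """Compare two runs of the same golden set; both map case id -> passed."""
    regressions = sorted(cid for cid in before if before[cid] and not after.get(cid, False))
    fixes = sorted(cid for cid in after if after[cid] and not before.get(cid, False))
    summary = (
        f"Pass rate {sum(before.values())}/{len(before)} -> "
        f"{sum(after.values())}/{len(after)}."
    )
    if regressions:
        summary += " Regressions on cases " + ", ".join(map(str, regressions)) + "."
    if fixes:
        summary += " Newly passing: " + ", ".join(map(str, fixes)) + "."
    return summary

# Example: cases 1-13 already failed; the prompt change breaks six more.
before = {cid: cid > 13 for cid in range(1, 101)}
newly_broken = {14, 22, 47, 81, 99, 100}
after = {cid: ok and cid not in newly_broken for cid, ok in before.items()}
print(regression_report(before, after))
# Pass rate 87/100 -> 81/100. Regressions on cases 14, 22, 47, 81, 99, 100.
```

Listing the regressed case ids matters more than the rate: a change that fixes six cases and breaks six others leaves the pass rate unchanged.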
Offline evals are predictive — your golden set, run in CI. They tell you whether the feature should ship. Online evals are ground truth — deflection rate, thumbs-up rate, escalation rate, edit distance between the AI's draft and the human's final, refund rate after an AI-handled case. They tell you whether the offline prediction held. You need both, and they feed each other: online failures become wave-three eval cases next month.
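Of the online metrics listed, edit distance between the AI's draft and the human's final is the least obvious to compute. One common choice is word-level Levenshtein distance, normalized by the longer text so that 0.0 means "sent unchanged" and 1.0 means "fully rewritten". The word-level tokenization and the normalization are assumptions here; character-level distance works the same way.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(draft: str, final: str) -> float:
    # Word-level distance divided by the longer text's word count.
    a, b = draft.split(), final.split()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

draft = "Your refund will arrive in 3 days"
final = "Your refund will arrive in 5 business days"
print(round(normalized_edit_distance(draft, final), 3))
```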
Once offline evals say two prompts are roughly equivalent, A/B test at the prompt layer — same statistical discipline as any growth experiment, you are just testing prompt variants instead of button colors. One trap: do not A/B test if offline evals already say one prompt is clearly worse. You are not running a science experiment; you are deciding what to ship. Letting half your users see a known-worse experience is a tax, not a learning.
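For a binary online metric such as thumbs-up rate, the usual growth-experiment tool is a two-proportion z-test. The sketch below uses only the standard library and assumes independent users and reasonably large samples; the counts are made up.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in rates between prompt A and prompt B."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # every outcome identical in both arms
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up example: 500/1000 thumbs-up on prompt A vs 560/1000 on prompt B.
z, p = two_proportion_z_test(500, 1000, 560, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

As with any experiment, fix the sample size and the metric before looking at results; checking repeatedly and stopping at the first significant reading inflates false positives.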
The natural endpoint is eval-driven prompt engineering — the AI-era analogue of TDD. Write the eval cases first. Write the prompt to pass them. Refactor knowing the eval catches you. Within a quarter it is faster than vibe-prompting, because vibe-prompting collapses the day someone asks "why did we change this prompt and what does it now do better and worse than before?"
Tooling: Braintrust, LangSmith, or homemade?
Homemade. Python script, a JSONL of cases, a CI job. Half a day. Best for a single feature and a set of up to roughly 100 cases. As the set grows you will want a UI for inspecting failures and diffing prompt versions, and you will end up rebuilding one badly.
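A homemade harness really can be this small. In the sketch below the cases are inlined as a JSONL string so the example is self-contained; in practice they would live in a file next to the prompt. The case schema (`id`, `input`, `must_contain`) is an assumption, and `call_model` is a hypothetical stand-in for the real model call.

```python
import json

# One JSON object per line, as it would appear in a cases.jsonl file.
CASES_JSONL = """\
{"id": "refund-status", "input": "where is my refund", "must_contain": ["refund"]}
{"id": "stock-tips", "input": "which stock should I buy", "must_contain": ["cannot give investment advice"]}
"""

def load_cases(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def call_model(user_input):
    # Hypothetical stand-in: replace with the real prompt + model call.
    canned = {
        "where is my refund": "Your refund was issued on Monday.",
        "which stock should I buy": "Buy ACME, it only goes up.",
    }
    return canned.get(user_input, "")

def run_suite(cases, model=call_model):
    """Return {case id: passed} using case-insensitive substring checks."""
    results = {}
    for case in cases:
        output = model(case["input"]).lower()
        results[case["id"]] = all(s.lower() in output for s in case["must_contain"])
    return results

def exit_code(results):
    # A CI job would pass this to sys.exit so any failing case blocks the merge.
    return 0 if all(results.values()) else 1

results = run_suite(load_cases(CASES_JSONL))
print(results, "exit code:", exit_code(results))
```

Substring checks are the crudest possible assertion; the point of the sketch is the loop and the exit code, which stay the same as the checks get smarter.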
Braintrust. Hosted eval platform, opinionated about datasets/experiments/scorers, strong UI for diffs and regressions, integrates with the major model APIs. Fastest path from zero to working eval pipeline if you are willing to add a vendor. Per-trace pricing; budget accordingly.
LangSmith. LangChain's eval and tracing platform. Strong if you are already in LangChain; less compelling otherwise because the data model leans on LangChain primitives. Tracing is its real strength — production traces are exactly the wave-three candidates.
Worth knowing: Promptfoo (open-source, terminal-first, good for CI pass/fail), Helicone, Arize Phoenix.
Default: homemade for the first thirty cases, switch to Braintrust or Promptfoo when the set crosses 100 or a second team needs to read the results. Do not buy tooling before you have a discipline.
The "trust score" pattern
A pattern that composes everything in this chapter: for each output, compute a single confidence number — combining retrieval similarity scores, model log-probabilities, a refusal-detection check, a citation-presence check, possibly an LLM-as-judge spot grade — and expose it to product logic. High score: show the output with full confidence. Medium: show it with a "draft" badge or "verify before sending" affordance. Low: do not show it; escalate to a human, ask the user to clarify, or fall back to a deterministic path.
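One simple way to combine these signals is a weighted sum with two thresholds. Everything numeric in this sketch is an assumption for illustration: the weights, the 0.8 and 0.5 cut-offs, and the idea that each signal is already scaled to the range 0 to 1. Real values would be tuned against the golden set.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    retrieval_similarity: float  # top retrieval score, scaled to 0..1
    mean_token_prob: float       # exp(mean token log-probability), 0..1
    has_citation: bool           # citation-presence check
    looks_like_refusal: bool     # refusal-detection check

# Hypothetical weights (sum to 1.0) and thresholds; tune against labeled data.
WEIGHTS = {"retrieval": 0.4, "token_prob": 0.3, "citation": 0.2, "not_refusal": 0.1}
HIGH, MEDIUM = 0.8, 0.5

def trust_score(s: Signals) -> float:
    return (
        WEIGHTS["retrieval"] * s.retrieval_similarity
        + WEIGHTS["token_prob"] * s.mean_token_prob
        + WEIGHTS["citation"] * float(s.has_citation)
        + WEIGHTS["not_refusal"] * float(not s.looks_like_refusal)
    )

def route(score: float) -> str:
    # Map the score onto the three product behaviours described above.
    if score >= HIGH:
        return "show"
    if score >= MEDIUM:
        return "show_as_draft"
    return "escalate"

for s in (Signals(0.9, 0.9, True, False), Signals(0.6, 0.6, False, False), Signals(0.2, 0.3, False, True)):
    print(round(trust_score(s), 2), route(trust_score(s)))
```

A weighted sum is the easiest version to explain and audit; a small calibrated classifier over the same signals is a natural next step once there is enough labeled production data.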
The trust score is the bridge between offline eval discipline and online product behaviour. It is the cleanest way to ship AI in high-stakes domains — finance, health, legal, tax — where the cost of a wrong answer is asymmetric and the right move is to gate the answer on a measurable signal rather than hope. See Hallucination as a Product Problem for the UI side; this chapter is the measurement substrate that makes the trust score honest.
Three worked examples
Customer-support deflection bot (B2C fintech). Wave one is ten cases the head of support writes down in an afternoon: "where is my refund," "I lost my card," "how do I change my UPI ID," "is my data sold to advertisers," "this charge is fraud," "cancel my subscription," "speak to a human," and so on. Wave one immediately reveals that the prompt confidently invents fee policies the company does not have, so grounding gets fixed before anything else. Wave two stratifies a hundred cases: 40 routine, 20 edge (Hindi-English code-switching, multi-question messages), 20 refusal (investment advice, account fraud claims), 20 adversarial (jailbreaks, prompt injection). Pass/fail evals: did the bot ground its answer in the help-centre corpus, refuse correctly, and escalate on the four cases that require escalation? LLM-as-judge covers tone. A full regression run takes about two minutes and costs a few cents. Wave three samples 1000 anonymized chats monthly; thumbs-down chats are automatically promoted into next month's eval set. The team ships prompt changes weekly and has not had a silent regression in six months.
Code-gen assistant (internal dev-productivity team, TypeScript monorepo). Wave one: ten real tickets where developers asked an LLM for code. The "expected output" is properties, not strings: must compile, must use the team's internal logging library rather than console.log, must not invent imports that do not exist in the monorepo, must include the team's error-handling pattern. Wave two: 100 cases from the last six months of merged PRs, with the PR description as prompt and the merged code as one reference (not the only acceptable answer). Pass/fail: compile, lint, import-resolution, test-pass. Scored (judge calibrated against senior engineers): idiomatic style 1–5. Wave three: every accepted vs rejected completion is logged; accepted completions count as negative signal if edited heavily before commit (high edit distance = "the AI got it wrong, the dev fixed it"). That edit-distance metric is the team's most important online number.
Classification feature (B2B SaaS ticket router, 14 categories). Wave one: ten tickets per category, picked by the support manager. Running them through the prompt immediately reveals that two of the fourteen categories conceptually overlap; the AI flips between them on near-identical inputs. The fix is a taxonomy change, not a prompt change — the eval surfaced a product-design bug, not a model bug. This is the under-appreciated value of an eval set: it forces specification clarity. Wave two: 50 tickets per category with human labels. Pass/fail: correct category? Confusion matrix as scored eval: which categories does the AI confuse with which? Wave three: live tickets sampled, labeled, published as a per-category pass rate on the team dashboard. A drop in any category triggers a Slack alert with a 4-hour SLA. The result: the support team trusts the feature enough to auto-route 70% of tickets without human review.
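For the classifier example, the per-category pass rate and the "which categories does the AI confuse with which" view both fall out of the same list of (human label, predicted label) pairs. A standard-library sketch with made-up category names:

```python
from collections import Counter

def per_category_pass_rate(labels, predictions):
    """Accuracy per true category, e.g. for a dashboard or an alert threshold."""
    total, correct = Counter(), Counter()
    for truth, pred in zip(labels, predictions):
        total[truth] += 1
        correct[truth] += truth == pred
    return {cat: correct[cat] / total[cat] for cat in total}

def top_confusions(labels, predictions, k=3):
    """Most frequent (true category, predicted category) mistakes."""
    errors = Counter((t, p) for t, p in zip(labels, predictions) if t != p)
    return errors.most_common(k)

# Made-up example: "billing" tickets keep getting routed to "refunds".
labels      = ["billing", "billing", "billing", "refunds", "refunds", "login"]
predictions = ["refunds", "refunds", "billing", "billing", "refunds", "login"]
print(per_category_pass_rate(labels, predictions))
print(top_confusions(labels, predictions, k=1))  # [(('billing', 'refunds'), 2)]
```

A pair of categories that dominates the confusion list in both directions is the signature of the taxonomy overlap described above, as opposed to a prompt problem.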
What to do on Monday morning
If you have an AI feature in production right now without a golden set, get ten inputs your feature must handle and the properties their outputs must have. Run them by hand. Be honest about what you find. That afternoon's discomfort is the cheapest discomfort you will buy this quarter; the alternative is finding out from a user, on Twitter, on the day your CEO is in front of investors.
If a feature is on the roadmap, do not ship it until wave two exists, the regression eval runs on every prompt change, and the team has agreed in writing on the pass-rate threshold the feature must clear. Writing down the threshold before you measure is what stops the team from negotiating the goalposts down to whatever number the demo produces.
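Committing the agreed threshold to the repository as a constant is one way to make "in writing" enforceable. A sketch, using 85% purely as an example value and a strict comparison so that the feature must exceed the bar rather than merely meet it:

```python
# Agreed before the first measurement; changing it should require a reviewed commit.
LAUNCH_THRESHOLD = 0.85  # example value, not a recommendation

def launch_gate(results, threshold=LAUNCH_THRESHOLD):
    """results maps case id -> passed. Returns (ship?, observed pass rate)."""
    if not results:
        raise ValueError("cannot gate a launch on an empty eval set")
    rate = sum(results.values()) / len(results)
    return rate > threshold, rate

# Example: 86 of 100 wave-two cases pass.
results = {cid: cid <= 86 for cid in range(1, 101)}
ship, rate = launch_gate(results)
print(f"pass rate {rate:.0%}, ship: {ship}")
```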
If your AI strategy has no eval discipline anywhere, somebody senior needs to own that discipline across every AI feature, the same way somebody senior owns security or accessibility. Without that owner, every feature team will skip the eval step under deadline pressure and you will discover the cumulative quality debt at exactly the wrong moment.
The next chapter (Hallucination as a Product Problem) is about designing the product around the failure modes the eval set just found. The two chapters together are how AI features stop being demos and start being products.
Rules
No AI feature ships without a golden set, and no prompt change merges without running it. This is the bar. Everything else in this chapter is implementation detail.
Build the eval set in three waves: ten hand-curated cases to test feasibility, one hundred stratified cases as your regression suite, one thousand production-mined cases as your distribution mirror.
Default to pass/fail evals. Add scored evals only when the pass/fail decomposition genuinely loses information. Most "is this helpful" scored evals should have been three crisp pass/fail checks.
LLM-as-judge works on bounded grading tasks. It lies on vibes. Calibrate the judge against a hundred human labels before you trust a single number it produces.
Regression evals run on every prompt change, every model version change, every retrieval index change. If it is a deploy, the golden set runs first. Make it a CI check or it will not happen under deadline pressure.
Offline evals predict. Online evals confirm. You need both. A team running one without the other is half-blind, and the half they are missing is the half users see.
Write down the launch threshold before you measure. "Ship when the wave-two pass rate exceeds 85%" is a decision. "Ship when the team feels good about it" is a vibe and it always passes.
Do not buy eval tooling before you have an eval discipline. Start with a script and thirty cases. Buy a vendor when the discipline outgrows the script, not the other way around.
Where to go next
- Chapter 1 — When AI is the right answer: the gate before this one. Eval discipline does not save a feature that should not exist. (When AI Is the Right Answer (and When It Isn't))
- Chapter 2 — The model-selection ladder: the eval set is how you decide which rung clears your bar at which cost. (The Model-Selection Ladder)
- Chapter 3 — Prompt design as product design: the prompt is the artifact. The eval set is what tells you whether the artifact is any good. (Prompt Design as Product Design)
- Chapter 5 — Hallucination as a product problem: the trust-score pattern composes the evals from this chapter into UI affordances. (Hallucination as a Product Problem)
- Companion: Working with Engineers — the eval set is a shared artifact across PM and engineering; treat it like a PRD, not like a test file.