Building a golden set from real user data — Eval Harnesses — how you know your agent isn't lying to itself

Spend one focused afternoon turning real production inputs into a fixed, hand-graded set you will never edit casually.

That constraint — "never edit casually" — is the discipline that makes a golden set valuable.

Most teams treat their eval inputs as a working document: add examples when a bug surfaces, delete ones that seem outdated, adjust expected outputs when the system changes. Each edit feels reasonable in isolation. Collectively, they destroy the set's value as a longitudinal signal.

If v17 scores 91% and v3 scored 84%, you can only compare those numbers if the set has not changed between them. An edited set is not a measurement instrument. It is a wishlist.
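
One way to make this rule mechanical is to fingerprint the set and store the fingerprint next to every score. What follows is a minimal sketch, not a prescribed implementation. The names `set_fingerprint`, `ScoreRecord`, and `score_delta` are hypothetical, and it assumes each example is a JSON-serializable dict.

```python
import hashlib
import json
from dataclasses import dataclass


def set_fingerprint(examples):
    """Return a SHA-256 hex digest of the golden set's exact contents."""
    # Canonical JSON (sorted keys, fixed separators) so identical content
    # always hashes identically. Example order is treated as significant
    # here, so reordering counts as a change (an assumption of this sketch).
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ScoreRecord:
    release: str
    accuracy: float    # fraction of examples passed, 0.0 to 1.0
    fingerprint: str   # fingerprint of the set that produced this score


def score_delta(old, new):
    """Return new.accuracy - old.accuracy, refusing to compare across sets."""
    if old.fingerprint != new.fingerprint:
        raise ValueError(
            f"{old.release} and {new.release} were scored on different sets; "
            "the numbers are not comparable"
        )
    return new.accuracy - old.accuracy


if __name__ == "__main__":
    # Hypothetical one-example set, for illustration only.
    golden = [{"id": "ex-001", "input": "cancel my order", "expected": "refund_flow"}]
    fp = set_fingerprint(golden)
    v3 = ScoreRecord("v3", 0.84, fp)
    v17 = ScoreRecord("v17", 0.91, fp)
    print(f"{score_delta(v3, v17):+.2f}")  # prints +0.07
```

If anyone edits an example in place, the fingerprint changes and the comparison fails loudly instead of quietly producing a number.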

The picture

A two-layer funnel.

Layer one (the frozen set): thousands of raw production interactions → filtered for diversity and difficulty → down-sampled to 100–200 examples → each one hand-graded with a target behavior → frozen under version control. This is the set that never changes except through a deliberate append or branch. Every release is scored against this version.

Layer two (the live set): refreshed monthly by pulling the last 30 days of production traffic and running it through the same grader. Not frozen. Used for catching drift — the failure modes the frozen set will miss because users changed their behavior or the product evolved in ways the original examples do not cover.

The frozen set answers: "is this version better or worse than the last version we measured?" The live set answers: "are there new failure modes the frozen set is not testing?"

Both are necessary. Teams usually build only the first, notice drift too late, and respond by adding to the frozen set in a way that breaks historical comparability. Design for both from the start.

Why it matters now

Most teams do not have a golden set. They have a list of failing examples they have been re-running since the last incident.

That is reactive evaluation. It is useful — you should absolutely re-run known failures. But it will not catch the failure modes you have not seen yet. The golden set is the proactive complement: a curated, stable sample that lets you measure prompt changes against a fixed target and notice capability loss before customers do.

The difference compounds. A team with a golden set built in Q1 can compare Q4 performance against Q1 and detect degradation over the full arc — including degradation caused by incremental prompt tweaks that each looked safe individually but accumulated into regression. A team without a golden set has no Q1 baseline to compare against. They are navigating without a map.

The investment is one afternoon.

Not a sprint. Not a research project.

One focused afternoon with two people and access to production logs. The cost of not doing it is borne across every release cycle that follows.

Sources you should trust

Anthropic's prompt-engineering documentation on building evaluations. Practical, operator-grade guidance on what expected outputs should look like and how to write graders that capture the right decision. The section on distinguishing "correct behavior" from "preferred output" is worth re-reading carefully — the confusion between the two is the source of most inconsistent grading.
Hamel Husain's "Your AI Product Needs Evals." The most cited shipping-team perspective on golden set construction. His point about "labeling is a skill, not a task" — that grading outputs is judgment work that requires training and calibration, not checkbox work that can be delegated without context — changed how many teams structure the grading session.
Anthropic's model card methodology documentation. Useful for understanding how a lab thinks about stratification — the discipline of ensuring a test set covers the important dimensions of variation, not just the dimensions that are easy to sample.

A recipe

A one-afternoon golden-set protocol:

Pull 500 real user inputs spanning at least one week of production traffic. If the system is pre-launch, pull from the closest adjacent live surface — user interviews, beta-tester sessions, internal dogfooding. Synthetic is a last resort; synthetic inputs are cleaner than production inputs and will miss your actual failure modes.
Stratify by JTBD (job-to-be-done — the goal the user is trying to accomplish), input length, language, and likely difficulty. The stratification does not have to be formal. Write down four categories and make sure you have examples in each. The goal is to catch the failure modes that a random sample would miss because they are rare but load-bearing.
Down-sample to 100–200 examples that span the strata. Prefer diversity over raw count. Fifty diverse examples beat two hundred repetitive ones.
For each example, write the expected behavior in the simplest form that works. A target string for structured outputs. A set of required elements that must appear. A three-level rubric (acceptable / good / excellent) for quality judgments. Not a paragraph of nuance. A decision rule.
Two people grade independently, recording any disagreement. Reconcile disagreements with a written resolution note. The resolution note is more valuable than the final grade — it captures where the task boundary lies in the ambiguous cases.
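Freeze the set. Commit it to version control with a date-stamped tag. Never edit examples in place. Appending is fine when it produces a new tagged version and every score names the version it was measured on. If a factual error is discovered, correct it in a new tagged version and record why.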
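
The stratify and down-sample steps can be sketched in a few lines. This is one possible approach under stated assumptions, not the only way to do it. `stratified_downsample` and the `stratum_of` callback are hypothetical names, and the round-robin draw is a choice made here to keep rare strata from being crowded out.

```python
import random
from collections import defaultdict


def stratified_downsample(examples, stratum_of, target_size, seed=0):
    """Pick up to target_size examples, spreading picks evenly across strata."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group examples by stratum, then shuffle within each group.
    groups = defaultdict(list)
    for example in examples:
        groups[stratum_of(example)].append(example)
    for members in groups.values():
        rng.shuffle(members)

    # Round-robin over strata: take one example from each in turn, so rare
    # strata are represented before common ones fill the remaining slots.
    picked = []
    queues = [groups[key] for key in sorted(groups)]
    while len(picked) < target_size and any(queues):
        for queue in queues:
            if queue and len(picked) < target_size:
                picked.append(queue.pop())
    return picked


if __name__ == "__main__":
    # Hypothetical pool: 500 inputs, 10% of them in a rare "refund" job.
    pool = [
        {"id": i, "job": "refund" if i % 10 == 0 else "faq", "lang": "en"}
        for i in range(500)
    ]
    sample = stratified_downsample(pool, lambda e: (e["job"], e["lang"]), 100)
    print(len(sample))  # prints 100
```

A purely random draw of 100 from this pool would contain roughly ten refund examples. The round-robin draw takes fifty, which is the point: the rare but load-bearing stratum gets real coverage.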

The smell of it going wrong

The golden set reflects one person's intuition with no second grader. You have no idea whether a different evaluator would grade the same examples the same way. Unmeasured grader disagreement is a common source of phantom regressions: score changes that come from inconsistent grading, not from the system.
The set was built from synthetic inputs — either generated by an LLM or written by an engineer trying to imagine what users would ask. It is cleaner than production and will miss the patterns that make production hard.
Eighty percent of examples are happy-path, short, English queries. The set will never catch the failure modes in long multi-turn inputs, non-English queries, or edge-case task variants.
The set has been edited in place across three releases. v17's score is 91%. v3's score was 84%. Those numbers cannot be compared because the set changed between them. The longitudinal signal is gone.
There is no version-controlled record of what the set contained at any given release. The "v3 baseline" exists only as institutional memory in whoever wrote the first version.
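
The first smell, a single grader, is the easiest to check with numbers. The lesson does not name an agreement measure; Cohen's kappa is one common chance-corrected choice and is used here as an illustration. The function names and the pass/fail labels are assumptions of this sketch.

```python
from collections import Counter


def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two graders who labeled the same examples in order."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(grades_a)

    # Observed agreement: fraction of examples with identical labels.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n

    # Chance agreement: probability of a match if each grader labeled at
    # random according to their own label frequencies.
    freq_a, freq_b = Counter(grades_a), Counter(grades_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:  # both graders used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(example_ids, grades_a, grades_b):
    """List (id, grade_a, grade_b) for every example needing a resolution note."""
    return [
        (i, a, b) for i, a, b in zip(example_ids, grades_a, grades_b) if a != b
    ]


if __name__ == "__main__":
    ids = ["ex-1", "ex-2", "ex-3", "ex-4"]
    grader_1 = ["pass", "pass", "fail", "fail"]
    grader_2 = ["pass", "fail", "fail", "fail"]
    print(cohens_kappa(grader_1, grader_2))   # prints 0.5
    print(disagreements(ids, grader_1, grader_2))
```

The `disagreements` list is the worklist for the reconciliation session: every row in it should end up with a written resolution note.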

A judgment call from real work

The PL course-content fit scoring pipeline was built around a hand-graded set of approximately 60 lessons from the ELP-PM program, scored against eight competency dimensions. The grading sessions happened over two afternoons, with two people scoring independently and a reconciliation document capturing every disagreement.

One early decision shaped the entire trajectory: whether to add new examples to the frozen set as new courses launched, or to start a separate set for each course family.

The first option was tempting because it would give a larger set faster. It would also mix the difficulty and style distributions from different course families — ELP-PM lessons are structured differently from the shorter-form reading-the-discourse lessons — in a way that would make it impossible to compare scores across course types.

The team chose separate sets for each course family. More work up front, cleaner signal long-term.

The moment that validated this decision came six months later, when a prompt change improved fit scores on the reading-the-discourse family by four points and degraded them on the ELP-PM family by two points in the same release. If the sets had been merged, the two changes would have blended into a small net improvement (one point if the families carried equal weight) and the regression on the flagship course would have been invisible. Because they were separate, the team caught the ELP-PM degradation in CI and investigated before shipping.

The two-afternoon investment and the discipline of keeping the sets separate protected six months of measurement validity.
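
The arithmetic behind that story is worth seeing once. The sketch below is illustrative only: the set sizes are assumptions (the reading-the-discourse size is not stated anywhere in this lesson), and `pooled_delta` and `regressions` are hypothetical helper names. With equal-sized sets the merged change comes out to plus one point. Other size ratios give other values, but any positive merged number hides the drop.

```python
def pooled_delta(family_deltas, family_sizes):
    """Example-weighted mean of per-family score changes, in points."""
    total = sum(family_sizes.values())
    return sum(family_deltas[f] * family_sizes[f] for f in family_deltas) / total


def regressions(family_deltas, tolerance=0.0):
    """Families whose score dropped by more than `tolerance` points."""
    return sorted(f for f, d in family_deltas.items() if d < -tolerance)


if __name__ == "__main__":
    # Per-family changes from the release described above, in points.
    deltas = {"reading-the-discourse": 4.0, "ELP-PM": -2.0}
    # Assumed sizes: 60 examples per family, chosen for illustration.
    sizes = {"reading-the-discourse": 60, "ELP-PM": 60}

    print(f"{pooled_delta(deltas, sizes):+.1f}")  # prints +1.0
    print(regressions(deltas))                    # prints ['ELP-PM']
```

A CI gate built on `regressions` fails the release even though the merged number went up, which is the behavior the separate sets bought.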

A note on the temptation to use synthetic data when real examples are hard to collect. The argument for synthetic is speed: an LLM can generate a hundred plausible-looking examples in minutes. The problem is that LLM-generated examples are drawn from the LLM's prior over what reasonable inputs look like, not from your actual user population's behavior. Real production inputs are stranger, more specific, more adversarial, and more varied than any LLM's prior. The failure modes they surface are precisely the ones that synthetic data misses — which is why the golden set built from synthetic data produces a false sense of coverage until the first production incident reveals the gap.

Use real data. If real data is not available pre-launch, use the closest adjacent real data you have access to: internal dogfooding sessions, user interviews, beta-user sessions. The test for "real enough" is whether the inputs would surprise you. If every example in the set is exactly what a reasonable user would ask, you have not included enough tail behaviors.

In the next lesson, we cover the grader — the component that decides whether each golden-set example passes or fails. The golden set is useless without a grader that makes the right decision. Most teams get the grader wrong in the same direction, and it costs them months of measurement credibility before they notice.

Rules from this lesson

The golden set is drawn from real user data, not synthetic prompts. Real data is harder to collect and dramatically more reliable as a measurement instrument.
Two graders agree before an example enters the set. Unilateral grading produces phantom regressions and makes reconciliation impossible retroactively.
The set is version-controlled and append-only. Never edit examples in place. When you need to correct the set, branch it with a date-stamped tag and document why.