Building the Golden Eval Set -- Twelve Sharp Examples Beat Ten Thousand Fuzzy Ones

Reading time

7 min

7 min left0%

building the golden eval set -- twelve sharp examples beat ten thousand fuzzy ones0%

7 min left

Most AI teams collect examples the way anxious students collect notes. They keep adding more because more feels safer.

That instinct is wrong.

The point of an eval set is not to look comprehensive. The point is to be diagnostic. A dozen sharp examples that expose the real edge cases in your feature are worth more than ten thousand fuzzy examples that mostly prove the system can handle easy traffic. This is the discipline at the center of Eval Before Launch, and it is one of the fastest taste tests for whether a team is building product or performing diligence theater.

If you remember one sentence from this lesson, make it this one: start by collecting examples that can embarrass the feature, not examples that flatter it.

Why do teams get this wrong? Because bulk feels objective. A huge spreadsheet of examples looks like rigor. A carefully chosen set of twelve tough cases looks subjective. In reality, the opposite is often true. Bulk datasets in early AI product work are usually unlabeled, weakly defined, and dominated by routine inputs. They tell you the system works on the traffic that was never going to kill it anyway. Sharp examples force the team to say what "good" means and where failure would actually matter.

Here is the right build order.

Start with the scar tissue. Pull examples from the places where a wrong answer would create support pain, compliance risk, user embarrassment, or visible loss of trust. The first eval rows should come from customer complaints, from near-miss cases, from the weird documents the team keeps referencing in meetings, from the multilingual or messy edge cases that everyone knows are hard. The goal is not representativeness yet. The goal is pressure.

This is why Harvey's legal AI is such a useful case to study. In a domain like legal work, the impressive demo is not the point. The dangerous edge case is the point. A legal drafting system earns trust by surviving the nasty cases -- citation risk, ambiguity, inconsistent precedent, auditability -- not by performing well on a hundred generic contract summaries. High-stakes products force you to see what all AI products should have been doing anyway: build the eval around the cost of being wrong.

Your first twelve examples should usually cover four buckets.

One: canonical wins. The system must succeed on the obvious core jobs or the feature does not deserve to exist.

Two: edge ambiguity. Inputs where a reasonable but sloppy model will make the wrong leap.

Three: refusal or abstain cases. Situations where the system should not answer confidently, should ask for clarification, or should hand off to a human.

Four: adversarial or messy inputs. Typos, contradictory instructions, irrelevant context, formatting noise, prompt injection attempts if the feature touches retrieved or user-provided text.

Notice what is happening here. You are not collecting "examples." You are defining the boundaries of the product.

That is why the golden set is not a testing artifact alone. It is a product artifact. It tells the team what the feature is for, what it is not for, and what kinds of wrongness are unacceptable.

Another important discipline: write expected outcomes as properties before you get attached to any single wording. This matters especially for generative features. If your eval expects an exact sentence match, you will penalize healthy variation and overfit to one phrasing. Instead write things like:

The answer must mention the refund timeline.

The answer must not invent a discount policy.

The answer must cite the relevant clause.

The answer must ask one clarifying question before making a recommendation.

The answer must refuse to answer and route the user to support.

These are product expectations, not string expectations.

The quality of those properties is where product judgment shows up. Weak teams write vague properties like "be helpful" or "give a good answer." Strong teams write observable ones. "Helpful" is not measurable. "Includes all action items and avoids invented dates" is.

Klarna's support rollout is a good cautionary example here. If your eval rows overindex on successful deflection and underweight resolution quality, you will teach the system the wrong lesson. A bot that closes tickets fast while quietly increasing unresolved cases will pass the wrong eval and fail the actual product. The golden set has to reflect the user outcome you care about, not the operational shortcut the business wishes counted as success.

Teams also underrate how few examples it takes to expose category confusion. In ticket routing, for instance, twelve carefully chosen borderline tickets can tell you more than five thousand routine ones. In summarization, six examples with subtle but crucial details often reveal more about hallucination risk than a giant corpus of repetitive notes. In code generation, a handful of repository-specific examples can expose style drift or nonexistent import habits that a broad benchmark would never catch.

This is why I prefer starting with twelve rather than one hundred. Twelve forces taste. One hundred tempts bureaucracy.

The next move after the first twelve is not "scale mindlessly." It is "stratify deliberately." Once the sharp set reveals the obvious failure modes, expand by category. Build the first real golden set around the product's important slices: easy cases, hard cases, abstain cases, adversarial cases, segment-specific cases, and cases from your highest-value users. Growth in eval size should follow clarity, not precede it.

There is a second reason sharpness beats size early: teams learn faster when each row has a story. "This is the enterprise customer whose contract summary omitted the indemnity clause." "This is the bilingual support message where the model answered the wrong half." "This is the invoice where the OCR noise made the total look like tax." Named rows create organizational memory. Giant anonymous datasets create distance.

You should also treat every sharp example as a forcing function for specification quality. If the team cannot agree on what the correct output looks like for a high-value edge case, the problem is not that the model is hard to evaluate. The problem is that the product itself is underspecified. Good eval work reveals product ambiguity early, which is one reason it feels uncomfortable. It removes the illusion that the team knows what "good" means.

This is also why PMs should be personally involved in the first eval wave. Not forever, not in every row, but in the first serious set. If you delegate the whole exercise to engineering or data science, you often end up with technically neat checks that miss the core judgment of the feature. The PM usually knows which failures will become roadmap pain, support pain, or executive pain. Put that judgment into the set.

One more rule that saves time: do not let the first eval set become a generic AI benchmark for your company. Keep it product-specific. The moment someone says "let's also measure creativity" or "let's compare against public leaderboards," you are drifting away from the job to be done. Your feature does not need to win the internet. It needs to clear your bar for your users.

By the time the golden set reaches fifty or a hundred rows, the team should be able to answer three questions cleanly.

What kinds of cases are included and why?

What properties define success for each case type?

What kinds of failure should block launch versus simply create a backlog item?

If you cannot answer those, the set is still too fuzzy no matter how many rows it contains.

The through-line is simple. The golden set is not about volume. It is about honesty. Twelve sharp cases force the team to confront the product they are actually shipping. Ten thousand fuzzy ones let them hide inside aggregate pass rates and feel progress where there is only noise.

And once you have the sharp set, you have the seed of the real operating asset: a regression suite that can catch when the system gets worse after a seemingly innocent change. That is the next lesson.

Rules from this lesson

Build the first eval set from the cases that can embarrass the feature, not the cases that make the demo look strong.
Twelve sharp examples with clear expected properties are more valuable than large fuzzy datasets that mostly contain easy traffic.
Write expected outcomes as observable properties, not vague judgments or exact strings.
Treat the golden set as a product artifact. If the team cannot agree on the right outcome for a hard case, the product is underspecified.
Expand the eval set only after the first sharp rows have clarified the real failure modes.

Building the Golden Eval Set -- Twelve Sharp Examples Beat Ten Thousand Fuzzy Ones

Reading time

7 min

7 min left0%

building the golden eval set -- twelve sharp examples beat ten thousand fuzzy ones0%

7 min left

Most AI teams collect examples the way anxious students collect notes. They keep adding more because more feels safer.

That instinct is wrong.

If you remember one sentence from this lesson, make it this one: start by collecting examples that can embarrass the feature, not examples that flatter it.

Here is the right build order.

Your first twelve examples should usually cover four buckets.

One: canonical wins. The system must succeed on the obvious core jobs or the feature does not deserve to exist.

Two: edge ambiguity. Inputs where a reasonable but sloppy model will make the wrong leap.

Three: refusal or abstain cases. Situations where the system should not answer confidently, should ask for clarification, or should hand off to a human.

Four: adversarial or messy inputs. Typos, contradictory instructions, irrelevant context, formatting noise, prompt injection attempts if the feature touches retrieved or user-provided text.

Notice what is happening here. You are not collecting "examples." You are defining the boundaries of the product.

That is why the golden set is not a testing artifact alone. It is a product artifact. It tells the team what the feature is for, what it is not for, and what kinds of wrongness are unacceptable.

The answer must mention the refund timeline.

The answer must not invent a discount policy.

The answer must cite the relevant clause.

The answer must ask one clarifying question before making a recommendation.

The answer must refuse to answer and route the user to support.

These are product expectations, not string expectations.

This is why I prefer starting with twelve rather than one hundred. Twelve forces taste. One hundred tempts bureaucracy.

By the time the golden set reaches fifty or a hundred rows, the team should be able to answer three questions cleanly.

What kinds of cases are included and why?

What properties define success for each case type?

What kinds of failure should block launch versus simply create a backlog item?

If you cannot answer those, the set is still too fuzzy no matter how many rows it contains.

Rules from this lesson

Build the first eval set from the cases that can embarrass the feature, not the cases that make the demo look strong.
Twelve sharp examples with clear expected properties are more valuable than large fuzzy datasets that mostly contain easy traffic.
Write expected outcomes as observable properties, not vague judgments or exact strings.
Treat the golden set as a product artifact. If the team cannot agree on the right outcome for a hard case, the product is underspecified.
Expand the eval set only after the first sharp rows have clarified the real failure modes.