A lot of teams think eval work is finished once the launch meeting goes well.
That is when the real work starts.
The reason Eval Before Launch matters is not only that it helps you decide whether to ship. It is that AI features keep changing after launch in ways traditional software teams underestimate. Prompts change. Models get swapped. retrieval indexes get rebuilt. Tool definitions evolve. Safety rules get tightened. Temperature settings get nudged. Every one of those is a product change, whether or not anybody writes a release note for it.
If you do not run the golden set every time one of those changes happens, you are shipping regressions invisibly.
This is the biggest qualitative difference between "AI feature in demo mode" and "AI feature in production mode." Demo-mode teams think the prompt is just text. Production-mode teams treat the prompt, retrieval behavior, and routing logic as code that can break things.
Start with a simple principle: if a change could alter output quality, it must run against the eval set before reaching users.
That includes prompt edits.
That includes swapping from one model to another.
That includes changing chunking, ranking, or document-selection logic in a retrieval system.
That includes adding a tool, removing a tool, or changing the instructions around tool use.
That includes context-window trimming rules, output formatting rules, and confidence thresholds.
In ordinary software terms, the eval set is your regression suite. In AI product terms, it is the thing standing between "we fixed one annoying edge case" and "we silently broke fourteen core flows."
The classic failure pattern is painfully predictable. A customer complains that the assistant is being too verbose. Someone adds a sentence to the system prompt telling it to be concise. The system becomes shorter, but it also stops including one important caveat in the legal or policy answers. Nobody notices because the team only re-ran the one example that triggered the complaint. Two weeks later support volume rises and the connection is invisible.
This is not a hypothetical. Versions of it happen constantly in internal copilots, support bots, search assistants, and AI drafting products. The surface reason changes. The structural reason does not: the team treated a prompt tweak like copywriting instead of like a deploy.
Regression discipline becomes even more important when retrieval is involved. Retrieval-augmented generation, usually shortened to RAG, means the model answers using documents fetched from a knowledge source. Teams often focus their eval attention on the final answer and forget that retrieval changes can be just as dangerous as prompt changes. A different chunking strategy, a new embedding model, or an adjusted ranking threshold can quietly change what information reaches the model at all. If the golden set does not run on index changes, you are missing half the problem.
This is one reason Anthropic's Claude app is worth studying beyond the model itself. The lab did not just ship raw capability. It shipped product boundaries, safety behaviors, and operational discipline around those behaviors. Labs know the wrapper matters because the wrapper is where regressions become user pain.
Here is the minimum viable operating loop.
One: define a stable eval set that covers the important product slices.
Two: run it on the current production version and store the results.
Three: any proposed change runs against the same set before release.
Four: compare not just the overall pass rate but the per-category shifts.
Five: block release if launch-critical categories regress beyond the threshold you agreed in advance.
That last clause matters. If you do not define the threshold in advance, the team will negotiate with the result. They will explain why the drop is acceptable, why the new behavior is directionally better, why the old metric was imperfect. Some of that may even be true. It is still how discipline dies.
A strong regression suite usually uses at least three layers of checks.
The first layer is structural. Did the output parse? Did the JSON validate? Did the tool call use the correct schema? Did the answer stay within the expected format? These checks are cheap and should run constantly.
The second layer is product-property checks. Did the answer include the refund timeline? Did it avoid inventing a feature? Did it abstain when required? These are the core of your golden set.
The third layer is category-level reporting. Even if the global pass rate looks flat, did the bilingual cases regress? Did the enterprise-document cases improve while the short-answer cases got worse? AI systems often trade one slice for another. You need to see the trade clearly.
This is especially important in domains where errors are not symmetrical. Harvey's legal AI is a useful mental anchor again. A small regression in formatting may be tolerable. A regression in citation fidelity is not. If the suite reports a single overall score without weighting failure classes properly, it will tell a dangerously incomplete story.
Another discipline advanced teams use: maintain a "recent regressions" bucket inside the eval set. Every real production failure that mattered should be turned into a permanent row. Once a bug hurts a user badly enough to be remembered, it should never be allowed to sneak back in quietly. Over time this bucket becomes one of the highest-value parts of the whole suite because it reflects reality rather than design imagination.
The suite should also be part of review culture, not a hidden technical ritual. PMs, engineers, and whoever owns the domain should all be able to read the diff between the current version and the proposed version. Not every person needs to inspect every row, but the idea that evals are only "for ML people" is a category error. If the change affects user quality, product leadership should understand what improved and what got riskier.
There is a political benefit here too. Regression discipline turns model debates into evidence debates. Instead of arguing abstractly about whether a cheaper model is "good enough," you can show that it passes 47 out of 50 required cases, fails the same two hard legal cases every time, and degrades one multilingual row that matters for a key segment. Now the conversation is precise. Precision de-escalates drama.
The same is true for vendors and version upgrades. Every provider will keep releasing new versions. Some will be better globally and worse on your exact product. Without a regression suite you will default to hype or fear. With one, you can decide cleanly whether the new version is actually better where it counts.
A final point on speed. Teams sometimes resist regression gating because they think it will slow iteration. In the very short term, yes, it adds a step. In the medium term, it makes iteration faster because the team stops rediscovering old failures and stops arguing from memory. The suite becomes shared memory. Shared memory is what allows rapid change without chaos.
If you are serious about shipping AI as product rather than decoration, this has to become a release muscle. A change that affects prompt, model, retrieval, or tool use should feel incomplete until the eval diff is visible. Not because process is fashionable. Because otherwise your users are doing QA for you in production, and they are doing it on the cases that matter most.
Rules from this lesson
- Treat prompts, model swaps, retrieval changes, and tool-definition changes as deploys. If it changes output quality, it must run against the eval set.
- A launch-time eval set becomes valuable only when it is used as a regression suite after launch.
- Measure category-level shifts, not just one overall pass rate. AI regressions often hide in slices.
- Turn every meaningful production failure into a permanent eval row so the same bug cannot return quietly.
- Regression gating may add a step, but it speeds product learning by replacing vague memory with shared evidence.