Eval drift — when your eval set stops catching regressions — Eval Harnesses — how you know your agent isn't lying to itself

Audit your eval set on a calendar. If it has not been updated in three months, it is probably lying to you.

Not lying maliciously — the examples are still graded correctly, the graders still run, the CI gate still triggers. The lie is more subtle: the set is passing, the team feels protected, and the failures accumulating in production are of a kind the set was never designed to catch.

The set is telling you the old product is working.

The new product is shipping without coverage.

This is eval drift. It is almost always invisible until a production incident makes it obvious in retrospect.

The picture

A timeline over twelve months.

The top track shows real production failure modes appearing month by month — some are continuations of patterns the original set covered, but new ones accumulate as the product evolves: a new task type, a new user segment with different query patterns, a new content domain, an edge case that did not exist when the set was built.

The bottom track shows which production failures the eval set would have caught if it had been running when they happened. In month one, the coverage is near-complete — the set was just built. By month six, there is a visible gap: one new failure class appears that the set does not test. By month twelve, two more have accumulated. The set is passing green CI on every release while those failure classes silently grow.

The drift is the area between the two tracks. The drift audit is the process of measuring and closing that gap on a regular schedule.
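One way to put a number on that area is to count, month by month, the failure classes that are live in production but absent from the eval set. The sketch below is illustrative only: the `FailureClass` type, the `drift_by_month` function, and the failure-class names are hypothetical, chosen to mirror the timeline described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureClass:
    name: str
    first_seen_month: int  # month the class first appeared in production


def drift_by_month(classes, covered, months=12):
    """Return {month: count of live failure classes the eval set does not cover}."""
    gap = {}
    for month in range(1, months + 1):
        live = {c.name for c in classes if c.first_seen_month <= month}
        gap[month] = len(live - covered)
    return gap


# Hypothetical data: two classes known at build time, three that appear later.
classes = [
    FailureClass("wrong_citation", 1),
    FailureClass("truncated_answer", 1),
    FailureClass("new_task_type", 6),
    FailureClass("new_user_segment", 9),
    FailureClass("new_content_domain", 12),
]
covered = {"wrong_citation", "truncated_answer"}  # set frozen in month one

gap = drift_by_month(classes, covered)
print(gap[1], gap[6], gap[12])  # 0 1 3
```

The count says nothing about how often each uncovered class occurs. Weighting by incident volume would be a natural extension if the production data supports it.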

Why it matters now

The combination of fast-moving products, evolving model behavior, and shifting user populations means an eval set built in Q1 2026 will be partially obsolete by Q3.

Model providers update their models mid-stride. User behavior shifts as the product's surface area grows. The underlying task distribution drifts away from what was representative when the set was constructed.

Most teams discover this the day a high-profile production incident slips through a green CI run. The team looks at the CI logs, sees a 91% pass rate on the most recent release, and cannot explain why a feature is broken in production. The investigation reveals that the broken behavior was never covered by the eval set: it is a new failure mode that emerged after the set was frozen.

The failure mode is not in the system under test.

It is in the measurement instrument.

A source you should trust

Dataset drift and covariate shift literature from ML-Ops (broadly, from the Chip Huyen, Evidently AI, and WhyLabs communities). The ML training-data-drift playbook translates almost directly to eval sets. Distribution shift, covariate shift, and label shift are the same phenomena; the eval set is just a smaller, human-curated dataset experiencing the same dynamics.
Hamel Husain's drift-detection posts. Shipping-team signal on how to notice eval staleness before it becomes a crisis — specifically the pattern of "always-improving pass rate as a staleness indicator," which is counterintuitive until you encounter it.
Anthropic's model card refresh methodology. Labs update their evals when capability claims change. The discipline of "re-grounding evaluations after significant product or model changes" is the right abstraction.

A recipe

A quarterly drift-audit protocol, designed to fit in two hours with one person owning it:

Pull the last quarter's production failures from support tickets, error logs, and user feedback. Categorize them by failure type, not by individual incident. The goal is a set of categories, not a list of tickets.
For each failure category, run a thought experiment: would the existing eval set have caught an instance of this failure before it shipped? If yes, which examples cover it? If no, this is a drift gap.
The "would-not-have-caught" categories are the drift. For each, write one representative example and add it to the eval set with the appropriate grader. One example per new category is enough to establish coverage. Add more if the category is high-frequency.
Re-baseline. After adding new examples, run the suite and record the new baseline. Old absolute numbers are not comparable after this update — the set changed, so the denominator changed. Document the rebase in the version history.
Decay-prune the oldest examples that have not caused a failure or been manually reviewed in eighteen months. A large set with stale examples is not more rigorous than a tight set with fresh ones. Remove examples where the underlying behavior no longer exists in the product, or where the task type has been deprecated. Keep the set sharp, not large.
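The mechanical parts of the protocol above (finding uncovered categories and listing prune candidates) can be sketched in a few lines. This is a minimal sketch under stated assumptions: `EvalExample`, `find_drift_gaps`, `prune_candidates`, and the category labels are all hypothetical, and matching on a category label is only a stand-in for the "would the set have caught it?" judgment, which still needs a human.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class EvalExample:
    example_id: str
    category: str                   # failure category this example is meant to catch
    last_failed_or_reviewed: date   # most recent failure or manual review


def find_drift_gaps(failure_categories, eval_set):
    """Count production failures per category that no eval example covers."""
    covered = {ex.category for ex in eval_set}
    counts = Counter(failure_categories)
    return {cat: n for cat, n in counts.items() if cat not in covered}


def prune_candidates(eval_set, today, max_age_days=548):
    """Examples with no failure or review in roughly eighteen months (548 days)."""
    cutoff = today - timedelta(days=max_age_days)
    return [ex for ex in eval_set if ex.last_failed_or_reviewed < cutoff]


# Hypothetical eval set and one quarter of categorized production failures.
eval_set = [
    EvalExample("ex-001", "wrong_citation", date(2024, 1, 10)),
    EvalExample("ex-002", "truncated_answer", date(2025, 11, 2)),
]
failures = ["wrong_citation", "stale_pricing", "stale_pricing", "truncated_answer"]

gaps = find_drift_gaps(failures, eval_set)
stale = prune_candidates(eval_set, today=date(2026, 1, 15))

print(gaps)                            # {'stale_pricing': 2}
print([ex.example_id for ex in stale]) # ['ex-001']
```

The per-category counts give a rough signal for where one example is enough and where a high-frequency category deserves more. Prune candidates are a review list, not an automatic delete: an old example can still guard behavior that exists in the product.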

The smell of it going wrong

The eval set's pass rate is monotonically climbing release over release. This should not happen. A set that is genuinely testing hard things should plateau: improvements to the system should eventually run into the ceiling imposed by the set's difficulty. A rate that keeps climbing means the set is not growing to match the system's evolving failure modes.
A production failure is "surprising" — the team cannot identify which eval example should have caught it. Surprise is a diagnostic: the failure mode was not covered. Every surprise is a drift gap that should generate a new example.
The eval set has not been audited in six months. In a product that ships weekly, that means roughly 26 releases have shipped without a drift check.
Nobody owns the drift audit. "The team" owns it, which means nobody does. The quarterly audit requires a person's name attached to it.
The drift audit happens informally in a team retrospective and produces action items that never get executed. The audit needs to produce a concrete artifact — new examples committed, baseline updated, drift report documented — not just discussion.
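Two of these smells are cheap to check automatically. The sketch below is one possible approach, not an established tool: the function names, the five-release window, and the sample pass rates are assumptions made for illustration, and "monotonically climbing" is interpreted here as strictly increasing over the window.

```python
from datetime import date


def is_monotonically_climbing(pass_rates, window=5):
    """True if the last `window` pass rates are strictly increasing."""
    recent = pass_rates[-window:]
    if len(recent) < window:
        return False  # not enough history to call it a trend
    return all(a < b for a, b in zip(recent, recent[1:]))


def weeks_since_audit(last_audit, today):
    """Whole weeks elapsed since the last drift audit."""
    return (today - last_audit).days // 7


# Hypothetical pass-rate histories, oldest release first.
climbing = [0.81, 0.84, 0.86, 0.89, 0.91]
noisy = [0.81, 0.84, 0.83, 0.89, 0.91]

print(is_monotonically_climbing(climbing))  # True
print(is_monotonically_climbing(noisy))     # False

weeks = weeks_since_audit(date(2025, 7, 1), date(2026, 1, 1))
```

For a product that ships weekly, `weeks` approximates the number of releases that have gone out without a drift check. Neither check proves drift. Each is a prompt to run the audit.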

A judgment call from real work

The PL content-fit eval set was built in early 2026 to cover the ELP-PM course audience: experienced product managers in mid-to-senior roles at technology companies. The eight competency dimensions were calibrated against that audience. The graders and rubrics were validated against examples drawn from that population's learning behaviors.

Six months later, PL's reinvention vision broadened the target audience substantially — from experienced PMs to "anyone who builds or decides products." This was a deliberate strategic move, and the right one. But the eval set had not been updated to reflect it.

The drift showed up as a class of failures specific to non-PM learners: career-switchers, founders, engineers stepping into product responsibilities. The existing eval set had no examples from this population, because they had not been users when the set was built. The fit-score graders rated content as high-quality for this population using rubrics calibrated for a different population. The scores were confident and wrong.

The drift audit that caught it was the first quarterly review with an explicit "pull production failures by user segment" step. The reviewer noticed that fit scores for non-PM users were systematically higher than human-reviewer scores on the same content — a 12-point gap on the "accessibility to non-specialists" competency.

The fix was to add non-PM learner examples to the eval set, recalibrate the rubric on that competency, and rebase. The rebase produced a 3-point aggregate score drop — not because the product had gotten worse, but because the measurement instrument was now more accurate.

The audit took about three hours. The fix took a day. Finding it six months later, through user feedback, would have taken much longer and cost more credibility.
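The segment comparison that surfaced the gap can be sketched as a grouped difference between grader scores and human-reviewer scores. Everything in the block is an assumption made for illustration: the function name, the row layout, the segment labels, and the scores are invented and are not the PL data.

```python
from collections import defaultdict
from statistics import mean


def score_gap_by_segment(rows):
    """rows: iterable of (segment, grader_score, human_score).

    Returns {segment: mean(grader - human)}. A positive value means the
    automated grader scores that segment higher than human reviewers do.
    """
    diffs = defaultdict(list)
    for segment, grader, human in rows:
        diffs[segment].append(grader - human)
    return {segment: mean(values) for segment, values in diffs.items()}


# Hypothetical scores on one competency, same content rated both ways.
rows = [
    ("pm", 82, 81),
    ("pm", 75, 76),
    ("non_pm", 80, 68),
    ("non_pm", 74, 62),
]

gaps_by_segment = score_gap_by_segment(rows)
print(gaps_by_segment)  # {'pm': 0, 'non_pm': 12}
```

An aggregate average over all four rows would show a gap of 6 points and hide the fact that one segment is fine and the other is badly mis-scored, which is why the audit step splits by segment before comparing.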

There is a meta-lesson here about the relationship between product evolution and measurement evolution.

Most teams treat the product as the thing that evolves and the measurement system as the thing that stays fixed.

The right model is that both evolve — the measurement system must grow alongside the product, in a deliberate and documented way, or it will progressively lose its ability to protect the product it was designed to measure.

The quarterly drift audit is the mechanism that keeps the measurement system current. Its output — new examples, updated rubrics, a rebase record — is not overhead. It is the maintenance required to keep a precision instrument calibrated. A thermometer that has not been calibrated in a year may still display a number. Whether that number is accurate is a different question.

Rules from this lesson

Eval sets go stale by default. Freshness is a calendar item, not a reactive response to incidents.
The drift audit needs an owner with a name. If "the team" owns it, nobody does. A named calendar event with a named reviewer is the minimum viable governance structure.
Re-baseline after every meaningful drift update. Old absolute numbers are not comparable to post-rebase numbers. Document the rebase and reset expectations before reporting pass rates externally.
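The re-baselining rule can be enforced in code by tagging every recorded baseline with the eval-set version and refusing to compare across versions. This is a minimal sketch, assuming a hypothetical `Baseline` record and `pass_rate_delta` helper. The version strings and pass rates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    set_version: str   # bumped whenever examples are added or pruned
    num_examples: int
    pass_rate: float
    note: str = ""


def pass_rate_delta(current, previous):
    """Pass-rate change, or None when the two runs used different set versions."""
    if current.set_version != previous.set_version:
        return None  # different denominators, so the numbers are not comparable
    return current.pass_rate - previous.pass_rate


# Hypothetical history: v4 is the rebase after a drift audit.
old_baseline = Baseline("v3", 120, 0.91)
new_baseline = Baseline("v4", 126, 0.88, note="added examples for uncovered categories")
latest_run = Baseline("v4", 126, 0.90)

print(pass_rate_delta(latest_run, old_baseline))            # None
print(round(pass_rate_delta(latest_run, new_baseline), 2))  # 0.02
```

Returning `None` instead of a number keeps a dashboard or report from silently charting a cross-version drop as a regression. The `note` field is where the rebase gets documented.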

In the next lesson, we look at the Ostronaut retrieval harness end to end — a real production system that was built wrong, failed, was rebuilt around measurable metrics, and became the canonical example of what a mature eval harness looks like in practice.