Regression testing as a culture, not a one-time event — Eval Harnesses — how you know your agent isn't lying to itself

Wire your eval set into CI the same way you wire unit tests in.

This sentence sounds obvious after the fact. Before you have done it, the default assumption is usually that evals are different from tests — more expensive, more subjective, and therefore appropriately handled through manual review before major releases rather than automated gates on every commit.

That assumption is wrong in the ways that matter most.

The cost of a missed regression scales with how late it is caught. A regression that fails CI is a 10-minute investigation. A regression that reaches staging is a 2-hour investigation. A regression that reaches production is a half-day investigation with a customer email attached.

The difference between "evals in CI" and "evals by feel before release" is the difference between those three costs, repeated across every release cycle for the lifetime of the product.

The cultural shift — treating evals as infrastructure rather than research artifacts — is the second-order discipline this course is building toward.

Lessons 1–5 gave you the artifact. This lesson is about making the artifact run without anyone remembering to run it.

The picture

A CI pipeline with eval gates woven into the review surface.

A code change is pushed to a branch. The standard checks run: type check, unit tests, build. Then the eval suite runs. The smoke set (ten examples, exact match, under two minutes) runs first as a hard gate; only if it passes is the pull request opened. The full golden set (one hundred examples, mixed graders, under ten minutes) runs on PR creation. The PR template shows the pass rate diff: not just the current number, but the delta from the baseline. Green means "no significant regression." Yellow means "small movement, reviewer judgment required." Red blocks merge.

The reviewer sees: baseline pass rate, current pass rate, which task types moved, and whether any failures are in known-regression categories. The diff is the artifact; the absolute number is only the headline.
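To make the reviewer's view concrete, here is a minimal sketch of a per-task-type diff report. It assumes pass rates are stored as percentages keyed by task type; the task type names, the `TaskDiff` class, and both function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDiff:
    task_type: str
    baseline: float  # pass rate in percent
    current: float   # pass rate in percent

    @property
    def delta(self) -> float:
        # Positive means improvement, negative means regression.
        return round(self.current - self.baseline, 1)


def diff_by_task_type(baseline: dict[str, float],
                      current: dict[str, float]) -> list[TaskDiff]:
    """Pair baseline and current pass rates, largest drop first.

    Only task types present in both runs are compared; a real report
    would probably also flag types that appear on one side only.
    """
    shared = sorted(set(baseline) & set(current))
    diffs = [TaskDiff(t, baseline[t], current[t]) for t in shared]
    return sorted(diffs, key=lambda d: d.delta)


def render_report(diffs: list[TaskDiff]) -> str:
    """Plain-text table suitable for pasting into a PR description."""
    lines = [f"{'task type':<20}{'baseline':>10}{'current':>10}{'delta':>8}"]
    for d in diffs:
        lines.append(
            f"{d.task_type:<20}{d.baseline:>9.1f}%{d.current:>9.1f}%{d.delta:>+8.1f}"
        )
    return "\n".join(lines)
```

Sorting by delta puts the worst regression on the first row, so the reviewer reads what moved before reading the aggregate.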

On merge to the main branch, the same suite runs again as a post-merge verification. A nightly run covers the broader suite. A quarterly audit brings a human into the loop.

Why it matters now

In 2024 the dominant frame was "evals are a research artifact" — something labs do before releasing a model, not something product teams maintain continuously. In 2026 the dominant frame is "evals are part of CI." The shift happened because teams that had not made it started discovering regressions through customer reports, and the gap between the customer report and the root cause was measured in weeks, not hours.

The shift is not technically difficult. A CI step that runs a Python script against a frozen eval set is the same CI step that runs a test suite. The engineering lift is low. The cultural lift — convincing the team that a failed eval gate is not an obstacle to shipping but a protection — is where most of the work is.
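As an illustration of how small that script can be, the sketch below runs an exact-match smoke set and turns the result into an exit code. It assumes the frozen set is stored as JSON Lines with `input` and `expected` fields; that file layout, the function names, and the `system` callable standing in for your agent are all assumptions, not a prescribed format.

```python
import json
from pathlib import Path
from typing import Callable


def load_examples(path: Path) -> list[dict]:
    """Read a frozen eval set: one {"input": ..., "expected": ...} object per line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_smoke(examples: list[dict], system: Callable[[str], str]) -> list[dict]:
    """Return the examples whose output does not exactly match the expected string."""
    failures = []
    for ex in examples:
        actual = system(ex["input"])
        # Exact match after trimming surrounding whitespace.
        if actual.strip() != ex["expected"].strip():
            failures.append({**ex, "actual": actual})
    return failures


def exit_code(failures: list[dict]) -> int:
    """A nonzero exit code is what makes the CI step fail, as with a test runner."""
    return 1 if failures else 0
```

In a CI entry point this would end with something like `sys.exit(exit_code(run_smoke(load_examples(Path("evals/smoke.jsonl")), call_agent)))`, where the path and `call_agent` are placeholders for your own eval file and system under test.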

The teams that have made this shift report the same pattern: the first three times an eval gate blocks a merge, there is friction.

The fourth time, someone says "oh, that would have been bad in production" and the friction stops.

The gate becomes part of how the team reads its own work.

A source you should trust

The Aider repo's CI integration. Aider (an open-source AI coding assistant) runs its eval suite on every PR. The implementation is public and the design is clean: separate smoke gates from full gates, surface the diff prominently, and make the override mechanism deliberate and logged. Worth reading as a worked example before designing your own.
LangSmith's documentation on regression testing and prompt version tracking. Vendor documentation, but operator-grade on the integration patterns — specifically the design of baseline-diff reporting and the traceability between prompt versions and eval results.
"Continuous Delivery" (Humble and Farley). The foundational text on wiring quality gates into deployment pipelines. The chapters on test automation discipline translate almost directly to eval suite design. The core argument — that quality gates only work if they are fast enough not to be bypassed — applies here verbatim.

A recipe

A minimum-viable CI integration for eval gates:

Create a baseline. Run the golden set on the current production version of the system. Record the pass rate by task type. Commit this as eval-baseline.json. This is the reference point for every future run.
Add a CI step that runs the smoke set (ten examples, exact match, under two minutes) on every commit. A red smoke run blocks the PR from opening. Keep the smoke set fast: if it takes more than two minutes, people will route around it.
Add a CI step that runs the full golden set on PR creation. Surface the pass rate diff against the baseline in the PR template. Red (more than 5 percentage points below baseline) blocks merge. Yellow (1 to 5 points below baseline) requires the reviewer to explicitly confirm the regression is acceptable and document why. Anything less than a 1-point drop is green.
Log every merge-override reason. The log is reviewed weekly. If the same override reason appears twice, it is a signal that either the eval set is too sensitive on that dimension or the underlying regression is real and needs a fix, not a pass.
Update the baseline when a deliberate quality change ships. A new prompt version that improves structured output by 8 points should update the baseline. Otherwise every future run shows a phantom improvement against a stale reference, and a later regression of up to 8 points on that task type would still read as no change. The baseline update is part of the release checklist, not an afterthought.
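The thresholds in the recipe above can be sketched as a small gate function. The 5-point and 1-point cutoffs and the eval-baseline.json file come from the recipe; applying them per task type and taking the worst status as the overall result is an assumption, as are the function names.

```python
import json
from pathlib import Path

RED_DROP = 5.0     # more than 5 points below baseline blocks merge
YELLOW_DROP = 1.0  # 1 to 5 points below baseline needs reviewer sign-off


def classify(baseline: float, current: float) -> str:
    """Map one pass-rate pair (in percent) to green, yellow, or red."""
    drop = baseline - current
    if drop > RED_DROP:
        return "red"
    if drop >= YELLOW_DROP:
        return "yellow"
    return "green"  # includes improvements and drops under 1 point


def gate(baseline: dict[str, float], current: dict[str, float]) -> str:
    """Overall status is the worst per-task-type status (an assumption).

    A task type missing from the current run is scored as 0, so it
    fails loudly instead of silently disappearing from the report.
    """
    order = {"green": 0, "yellow": 1, "red": 2}
    statuses = [classify(baseline[t], current.get(t, 0.0)) for t in baseline]
    return max(statuses, key=order.__getitem__, default="green")


def write_baseline(path: Path, pass_rates: dict[str, float]) -> None:
    """Persist pass rates by task type, e.g. to eval-baseline.json."""
    path.write_text(
        json.dumps(pass_rates, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )
```

Keeping the baseline as a committed file means a baseline update shows up in code review like any other change, which fits the release-checklist step.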

The smell of it going wrong

Evals run "when someone remembers." That person is busy this week. The eval has not run in three weeks. Nobody knows if the last three releases regressed anything.
The eval set passes 100% on every run. This means either the set is too easy (it is only testing happy paths) or it has not been updated to track the evolving product (it is testing behaviors that no longer reflect the system). On a meaningful eval set for a product that is still changing, a 100% pass rate should rarely last more than a few weeks.
Regressions are caught in production through customer reports and support tickets. The CI gate exists but is configured as "advisory only" — it cannot block merge, so nobody treats a yellow or red result as an action item.
The eval pass rate is reported as a single aggregate number with no breakdown by task type. A reviewer sees "87%" and cannot tell whether structured output degraded or citation quality improved. The diff artifact has no resolution.
The CI step takes 45 minutes and developers have learned to merge without waiting for it. A gate that is too slow to be respected is not a gate; it is noise.
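Several of these smells can be checked mechanically from the run history. The sketch below is one possible health check, assuming each run is recorded with a finish time, pass rate, duration, and whether the gate was blocking. The record shape and the default thresholds (seven days, ten minutes, twenty consecutive perfect runs) are illustrative assumptions, not values from this lesson.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EvalRun:
    finished_at: datetime
    pass_rate: float         # percent, 0 to 100
    duration_minutes: float
    blocking: bool           # False means the gate was advisory only


def find_smells(runs: list[EvalRun], now: datetime,
                max_age: timedelta = timedelta(days=7),
                max_minutes: float = 10.0,
                perfect_streak: int = 20) -> list[str]:
    """Return a human-readable warning for each smell the history shows."""
    if not runs:
        return ["no eval runs recorded"]
    runs = sorted(runs, key=lambda r: r.finished_at)
    latest = runs[-1]
    smells = []
    if now - latest.finished_at > max_age:
        smells.append("stale: last run is older than max_age")
    recent = runs[-perfect_streak:]
    if len(recent) == perfect_streak and all(r.pass_rate == 100.0 for r in recent):
        smells.append("saturated: every recent run passed 100%")
    if latest.duration_minutes > max_minutes:
        smells.append("slow: latest run exceeded the time budget")
    if not latest.blocking:
        smells.append("advisory: latest gate could not block merge")
    return smells
```

A check like this does not detect the aggregate-only reporting smell or regressions found by customers; those still need a person looking at the report and the support queue.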

A judgment call from real work

The PL courses-content pipeline wired eval gates into the fit-scoring workflow in two phases. The first phase made all gates advisory — failures were logged and surfaced in a dashboard, but nothing blocked merge. The reasoning was pragmatic: the team did not yet trust the eval set enough to let it block publishing decisions, and the overhead of investigating every yellow signal felt too high for a team of three.

The advisory phase lasted about four months. During that period, the dashboard accumulated a log of advisory failures that nobody read consistently. Two regressions slipped through that, in hindsight, were clearly visible in the advisory log — they just were not blocking anything, so they were not treated as urgent.

The tipping point was a content update that improved one course's fit scores on five competencies and silently degraded fit scores on the "strategic framing" competency for a different course. The advisory failure had been in the log for eleven days before a user noted that the strategic framing lessons in that course "felt less grounded" than they used to. The investigation took two days and traced back to a prompt change that had landed almost two weeks earlier.

After that incident, two of the seven task types were promoted to blocking gates. The other five remained advisory but were added to the weekly review agenda with a named owner. The number of regressions that reached production dropped immediately. The number of gate-blocked merges in the first month was three, all legitimate, all investigated within a business day.

The design — some gates blocking, some advisory, all reviewed — was more workable than the binary choice of "block everything" or "block nothing."
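A mixed design like this can be expressed as a per-task-type policy. The following is a sketch under stated assumptions: the `GatePolicy` shape, the function names, and any task type or owner names you pass in are hypothetical, since the pipeline's actual configuration is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GatePolicy:
    task_type: str
    blocking: bool
    owner: Optional[str] = None  # named reviewer for an advisory gate


def unowned_advisory(policies: list[GatePolicy]) -> list[str]:
    """Advisory gates without an owner are the ones nobody will read."""
    return [p.task_type for p in policies if not p.blocking and not p.owner]


def merge_allowed(policies: list[GatePolicy], regressed: set[str]) -> bool:
    """Merge is blocked only when a blocking task type has regressed."""
    return not any(p.blocking and p.task_type in regressed for p in policies)


def weekly_agenda(policies: list[GatePolicy],
                  regressed: set[str]) -> list[tuple[str, str]]:
    """Advisory regressions, paired with their owner, for the weekly review."""
    return [
        (p.task_type, p.owner or "UNOWNED")
        for p in policies
        if not p.blocking and p.task_type in regressed
    ]
```

The point of the `owner` field is that an advisory failure always lands on someone's agenda, which is what separates this design from the first-phase dashboard.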

The broader cultural shift this lesson is pointing at: eval gates become part of the engineering culture when they consistently provide value — when they catch something real, prevent something bad, or surface a signal that leads to a better decision. The first time an eval gate catches a regression that would have reached production, the team's relationship to the gate changes. It stops being an obstacle and starts being an ally. Until that first catch, the gate is overhead. After it, the gate is infrastructure.

This is why the initial investment in making gates fast and diff-visible matters so much.

A slow gate that nobody reads never gets its first catch. A fast gate with a clean diff will catch something real within weeks, and that catch is what converts the team from tolerating evals to relying on them.

Rules from this lesson

Eval gates belong in CI, wired the same way unit tests are. Advisory-only gates with no owner and no review cadence are not gates; they are dashboards that no one reads consistently.
The diff is the artifact. Surface pass rate movement by task type, not just the aggregate number. A reviewer who cannot see what moved cannot make a good override decision.
Override reasons feed a known-regressions log. The log is reviewed, not just stored. If it is not reviewed, it is not a control; it is paperwork.
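The override log rule can be sketched as an append-only JSON Lines file plus a weekly check for repeated reasons. The field names, file format, and function names below are assumptions for illustration; the threshold of two matches the recipe's "appears twice" signal.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def log_override(path: Path, pr: str, task_type: str,
                 reason: str, reviewer: str) -> None:
    """Append one override record as a JSON line; existing entries are never rewritten."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "pr": pr,
        "task_type": task_type,
        "reason": reason,
        "reviewer": reviewer,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def repeated_reasons(entries: list[dict], threshold: int = 2) -> dict[str, int]:
    """Reasons seen at least `threshold` times, compared case-insensitively."""
    counts = Counter(e["reason"].strip().lower() for e in entries)
    return {reason: n for reason, n in counts.items() if n >= threshold}
```

Free-text reasons rarely match character for character, so in practice a short fixed list of reason codes plus an optional comment would likely make the weekly review more reliable.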

In the next lesson, we cover the failure mode that eventually hits every well-run eval system: drift — the process by which a good eval set gradually stops catching the real failures as the product moves and the set does not.