The cost of evals — when not to grade everything — Eval Harnesses — how you know your agent isn't lying to itself

Tier your eval suite. Fast checks on every commit, a broader suite on every PR, the full suite nightly, deep audits quarterly.

This is not a concession to laziness. It is the discipline that makes eval infrastructure sustainable. An eval gate that takes 45 minutes will be routed around by the third week. A gate that takes two minutes will be respected indefinitely. The discipline is designing for the human behavior you expect, not the ideal behavior you hope for.

The cost of evals has three components that are usually treated as one.

The token bill (inference cost to run LLM-as-judge graders at scale) is the obvious one and usually the smallest.

The latency cost (how long the CI gate blocks a developer) is the one that drives bypass behavior.

The attention cost (how much cognitive load the eval report requires from the reviewer) is the one most teams underestimate — a complex eval report that requires ten minutes to interpret will eventually stop being interpreted, and the signal in it will stop reaching decisions.

Design for all three.

The picture

A pyramid, with cost and rigor increasing upward.

Base (every commit): ten golden examples, exact-match graders, two-minute run. Catches the obvious: did this change break a structured output format? Did a JSON schema get corrupted? Does the smoke test for the primary task type still pass? Blocks merge immediately if red. Never grows larger than thirty examples — if it starts taking longer than two minutes, it gets pruned.

Middle (every PR): a hundred golden examples, mixed graders (exact match, Python predicates, rubric on a sample). Ten-minute run. Surfaces quality movement by task type. Red (pass rate drops more than five points) blocks merge. Yellow (a smaller drop of one to five points) triggers reviewer judgment; the reviewer can approve with a documented reason.

Top (nightly): a thousand-example coverage suite, including LLM-as-judge on open-ended quality. Two-hour run. Posts a daily diff report. No blocking; humans triage the flags. The nightly tier exists to catch slow-moving regressions that the PR suite cannot see, because a hundred examples do not give enough statistical power to separate a small drop from noise.

Apex (quarterly): human-graded audit and drift review. The most expensive tier, done least often. Produces: new examples for the golden set, updated baselines, a drift report, and a decision on whether any existing graders need recalibration.

Each tier is designed to answer a specific question. Smoke (base): is this commit obviously broken? PR (middle): did this change move quality? Nightly (top): are there slow-moving regressions in the long tail? Quarterly (apex): is the measurement instrument still accurate?
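One way to keep the pyramid honest is to write it down as data rather than as tribal knowledge. The sketch below is illustrative only: the `Tier` class, its field names, and the trigger strings are hypothetical, and the numbers are simply the targets stated above. The quarterly tier has no example count or runtime target in this lesson's pyramid, so both are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    trigger: str                    # hypothetical event name that starts the tier
    examples: Optional[int]         # target example count, None if not specified
    budget_minutes: Optional[int]   # runtime target, None if not specified
    blocks_merge: bool
    question: str                   # the one question this tier answers


PYRAMID = (
    Tier("smoke", "commit", 10, 2, True,
         "Is this commit obviously broken?"),
    Tier("pr", "pull_request", 100, 10, True,
         "Did this change move quality?"),
    Tier("nightly", "nightly", 1000, 120, False,
         "Are there slow-moving regressions in the long tail?"),
    Tier("quarterly", "quarterly", None, None, False,
         "Is the measurement instrument still accurate?"),
)


def tiers_for(trigger: str) -> list:
    """Return the tiers that should run for a given trigger."""
    return [tier for tier in PYRAMID if tier.trigger == trigger]


for tier in tiers_for("commit"):
    print(f"{tier.name}: {tier.examples} examples, "
          f"{tier.budget_minutes} min budget, blocking={tier.blocks_merge}")
```

Keeping the question next to the budget makes it easier to challenge a new example: if it does not help answer that tier's question, it belongs in a different tier.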

Why it matters now

A naïve eval gate that runs the full suite on every commit has two predictable consequences.

First, developers learn to bypass it. Not through malice — through rational self-interest. A gate that takes 45 minutes and blocks a 5-line prompt fix becomes a friction point. The first bypass is justified ("it's just a typo fix"). The second is slightly less justified. By the tenth bypass, the bypass is the default and the gate is theater.

Second, the monthly inference bill spikes with no meaningful safety gain. Running a thousand-example LLM-as-judge suite on every commit adds inference costs that are real money and buys almost nothing beyond what the hundred-example PR suite already provides. The marginal regressions caught by running the full suite on every commit, rather than nightly, are vanishingly rare. The cost is not.
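A back-of-envelope calculation makes the gap concrete. Everything in this sketch is an assumption chosen for illustration: the per-call price is derived from the case study later in this lesson ($2 for 800 judge calls), one judge call per example is assumed, and the commit volume (15 commits per working day, 22 working days) is hypothetical. Substitute your own numbers.

```python
def monthly_judge_cost(examples: int, runs_per_month: int,
                       usd_per_call: float, calls_per_example: int = 1) -> float:
    """Monthly inference cost of an LLM-as-judge suite, in USD."""
    return examples * calls_per_example * runs_per_month * usd_per_call


USD_PER_CALL = 2.00 / 800        # assumption: taken from the case study's $2 per 800 calls
COMMITS_PER_MONTH = 15 * 22      # assumption: hypothetical team commit volume
NIGHTS_PER_MONTH = 30

per_commit = monthly_judge_cost(1000, COMMITS_PER_MONTH, USD_PER_CALL)
nightly = monthly_judge_cost(1000, NIGHTS_PER_MONTH, USD_PER_CALL)

print(f"full suite on every commit: ${per_commit:.0f} per month")
print(f"full suite nightly:         ${nightly:.0f} per month")
print(f"ratio: {per_commit / nightly:.0f}x")
```

Under these assumptions the per-commit schedule costs about eleven times as much as the nightly one, and the ratio depends only on commit volume, not on the price per call.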

Tiering solves both problems. Fast gates stay fast. Thorough coverage exists where it adds marginal value. The quarterly human audit catches what automation cannot.

A source you should trust

Aider's tiered CI design. Aider (an AI coding assistant) maintains a public CI pipeline that separates a fast smoke test from a full eval run, with clear documentation of what each tier is designed to catch. The design decisions are annotated and worth reading as a worked example before you design your own tiers.
Mike Cohn's test pyramid framework. The conceptual ancestor of eval tiering. Unit tests at the base (fast, cheap, many), integration tests in the middle, end-to-end tests at the top (slow, expensive, few). Evals are tests in a different costume; the pyramid logic applies verbatim.
"Accelerate" (Forsgren, Humble, Kim). The research-grounded argument that fast feedback loops are the most important determinant of engineering throughput. The measurement that "time from commit to feedback" is a key predictor of team performance is the quantitative backing for "keep the smoke gate under two minutes."

A recipe

A four-tier eval pyramid template you can implement in a week:

Smoke tier (target: under two minutes, every commit). Ten examples, exact-match graders only. Cover the primary task type and two edge cases that have broken the system before. No LLM-as-judge graders. If a change to the smoke set is needed, it should take fifteen minutes to implement. If it takes longer, the set is over-engineered.
PR tier (target: under ten minutes, every PR). Hundred examples from the golden set, mixed graders. The subset should cover all major task types. Surface the diff by task type in the PR template. Yellow (1–5 point drop) requires documented reviewer justification. Red (more than 5 points) blocks merge.
Nightly tier (target: under two hours, every night). Thousand examples if available; otherwise the full golden set run with LLM-as-judge graders on the open-ended task types. Posts a Slack message or email with the daily diff. Someone reads it — not everyone, but one named person per week on rotation.
Quarterly tier (target: half-day, once a quarter). Pull last quarter's production failures, run the drift audit, add representative new examples, rebase the golden set if needed, human-spot-check the LLM-as-judge graders, and write a one-page drift report. Own this with a named person and a calendar event.
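The PR-tier thresholds in the recipe can be sketched as a small decision function. This is a minimal illustration, not a reference implementation: the names `Verdict`, `classify_drop`, `can_merge`, and `diff_by_task_type` are hypothetical, pass rates are assumed to be percentages, and treating a drop of less than one point as green is an assumption, since the recipe only defines yellow and red.

```python
from enum import Enum


class Verdict(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify_drop(baseline: float, current: float) -> Verdict:
    """Classify an overall pass-rate change; both values are percentages."""
    drop = baseline - current  # positive means quality went down
    if drop > 5.0:
        return Verdict.RED     # more than 5 points: blocks merge
    if drop >= 1.0:
        return Verdict.YELLOW  # 1 to 5 points: reviewer must justify
    return Verdict.GREEN       # assumption: under 1 point is treated as noise


def can_merge(verdict: Verdict, justification: str = "") -> bool:
    """Red never merges; yellow merges only with a written justification."""
    if verdict is Verdict.RED:
        return False
    if verdict is Verdict.YELLOW:
        return bool(justification.strip())
    return True


def diff_by_task_type(baseline: dict, current: dict) -> dict:
    """Per-task pass-rate change in points, for the PR template."""
    return {task: round(current[task] - baseline[task], 1)
            for task in baseline if task in current}


# Hypothetical task types and pass rates, for illustration only.
baseline = {"summarize": 92.0, "extract": 88.0}
current = {"summarize": 90.0, "extract": 81.0}
print(diff_by_task_type(baseline, current))
print(classify_drop(90.0, 85.5), can_merge(classify_drop(90.0, 85.5)))
```

Requiring a non-empty justification string for yellow is the cheapest way to make an override leave a trace; where that string is stored (PR comment, commit trailer) is left open here.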

The smell of it going wrong

The eval gate takes 45 minutes and developers route around it. This is not a people problem; it is a design problem. The gate is too slow for its position in the pyramid.
All tiers run on the same trigger. Every commit triggers the full suite. The marginal safety gain over the PR tier is small; the cost is large; the bypass rate climbs.
The nightly run lives in a notebook. Nobody is subscribed to the output. The daily diff exists and is not read. An unread report is not a control.
The quarterly tier is "supposed to happen" but did not last quarter. There is no named owner, no calendar event, and no artifact produced when it does happen. It is not a real tier; it is an aspiration.
The smoke tier grew over time and now contains 80 examples and takes 12 minutes. It has become the PR tier in disguise. The smoke tier must be ruthlessly pruned; if it grows, it stops being fast and stops being smoke.
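The last smell is the easiest to catch mechanically, because the limits are already stated in this lesson: thirty examples and two minutes. A minimal sketch of a guard that fails loudly when the smoke tier outgrows them follows; the function name and constants are hypothetical, and how you measure elapsed time depends on your CI setup.

```python
MAX_SMOKE_EXAMPLES = 30    # ceiling stated for the smoke tier
MAX_SMOKE_SECONDS = 120    # two-minute runtime budget


def check_smoke_budget(example_count: int, elapsed_seconds: float) -> list:
    """Return a list of budget violations; an empty list means the tier is healthy."""
    problems = []
    if example_count > MAX_SMOKE_EXAMPLES:
        problems.append(
            f"smoke set has {example_count} examples "
            f"(limit {MAX_SMOKE_EXAMPLES}): prune it"
        )
    if elapsed_seconds > MAX_SMOKE_SECONDS:
        problems.append(
            f"smoke run took {elapsed_seconds:.0f}s "
            f"(limit {MAX_SMOKE_SECONDS}s): prune it"
        )
    return problems


# The failure described above: 80 examples, 12 minutes.
for problem in check_smoke_budget(80, 12 * 60):
    print(problem)
```

Running a check like this as part of the smoke tier itself turns pruning from a good intention into a failing build.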

A judgment call from real work

The PL content pipeline's fit-scoring eval system went through exactly this evolution.

The first version ran every content change against the full eight-competency rubric, with LLM-as-judge graders for all eight dimensions. A single content update triggered 800 LLM calls — 100 examples times 8 competencies — taking about 35 minutes and costing roughly $2 per run. For a content pipeline that updated several times per week, this was immediately unsustainable: too slow for developers to wait, too expensive to run on every minor fix, and too complex for the output report to be read consistently.

Two months in, the eval was running "when someone remembered to trigger it manually" — which happened about once a week, less frequently after busy periods. The gate had collapsed.

The redesign took the problem seriously as a system design issue, not a discipline issue.

The smoke tier was created: ten examples, exact match on structured outputs, under two minutes. It caught the obvious failures — JSON schema corruption, prompt injection in content, broken citation formatting.

The PR tier used the hundred-example golden set with exact match and rubric-based graders on a human-sampled 20%. No LLM-as-judge on the PR tier. Runtime dropped to eight minutes.

The LLM-as-judge tier was moved to a weekly cron job — not nightly, because the content update frequency was weekly. It ran on the full golden set, posted a Slack summary, and was owned by one person per week.

The quarterly human audit was scheduled on the team calendar with a named reviewer and a template for the drift report.

After the redesign, the bypass rate dropped to near zero. The weekly LLM-as-judge run was read consistently. Over the next six months, the PR tier caught two regressions of the kind that had slipped through under the "run when someone remembers" regime.

The cost reduction was substantial. Monthly inference cost for the eval suite dropped from approximately $80 to approximately $12. The safety improvement was also real — not because the redesigned system ran more evals, but because it ran the right evals consistently.

Rules from this lesson

Tier your eval suite. One-tier-fits-all either makes the gate too slow to respect or makes it too shallow to catch real regressions. Both failures are preventable by design.
The smoke tier is fast and shallow on purpose. Never let it grow into a full PR-tier suite. Prune it when it grows.
The quarterly tier needs a named owner and a calendar event. An unscheduled, unowned audit is not a tier; it is a good intention that will not survive a busy quarter.