Graders — string match, LLM-as-judge, human-in-the-loop — Eval Harnesses — how you know your agent isn't lying to itself

Pick the simplest grader that works. Escalate to the next rung only when the simpler rung lies to you.

Most teams invert this. They reach for the most flexible grader — LLM-as-judge — because it is the easiest to write and the hardest to argue with. You describe what you want in plain language, the judge model evaluates it, and you get a pass/fail signal in minutes. The setup takes an hour.

The problem with the approach only surfaces later: the judge model has systematic biases, those biases drift across model releases, and the pass rates the judge produces can move significantly without any underlying change in the system under test.

When that happens, you have built an eval that measures the judge's opinions as much as the system's quality.

The number is confident and wrong.

The discipline is the ladder.

The picture

A grader ladder, lowest cost to highest cost, with the failure mode that forces you up:

Exact match — the output must equal a target string exactly. Works for: entity extraction, structured field values, classification labels, tool-call parameter values. Failure mode that forces escalation: the correct output has legitimate surface variation (different word order, synonyms, minor formatting differences).

Regex / structured match — a pattern or schema constraint must hold. Works for: outputs that must contain required elements, numeric ranges, date formats, JSON with required fields. Failure mode: the structure is valid but the semantics are wrong.

Python predicate — a small function encodes the decision logic. Works for: unit-test-style output verification, numeric accuracy within tolerance, presence of required concepts. Failure mode: the concept you care about is not expressible as a clean predicate.

Rubric with human grader — a written rubric, scored by a person on a sample. Works for: open-ended response quality, tone, completeness on complex tasks. Failure mode: human grading is too slow to run on every PR; sampling is required.

LLM-as-judge — a judge model evaluates outputs against a rubric. Works for: high-volume quality assessment where human sampling is insufficient and exact match cannot apply. Failure mode: the judge has systematic biases, is correlated with the system under test, or drifts silently across model versions.

Human review — a person reads every output. Works for: high-stakes decisions, novel task types where automated graders have not been validated. Failure mode: cost and latency are prohibitive at scale.

Each rung costs more than the one below it, whether in reviewer time, latency, or the validation work needed before you can trust its numbers. Stay on the lowest rung that works.
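To make the three cheapest rungs concrete, here is a minimal sketch in Python. The function names, the JSON fields (`label`, `date`), the whitespace trimming, and the 1% tolerance are illustrative assumptions, not part of any particular framework.

```python
import json
import math
import re


def exact_match(output: str, target: str) -> bool:
    """Rung 1: pass only if output equals target.

    Trimming surrounding whitespace is an assumption made here;
    drop the strip() calls if whitespace is significant for your task.
    """
    return output.strip() == target.strip()


# Checks the YYYY-MM-DD shape only, not whether the date is real.
ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def structured_match(output: str) -> bool:
    """Rung 2: output must be a JSON object with a string 'label'
    and a 'date' shaped like YYYY-MM-DD (hypothetical schema)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    label, date = data.get("label"), data.get("date")
    if not isinstance(label, str) or not isinstance(date, str):
        return False
    return ISO_DATE_SHAPE.fullmatch(date) is not None


def within_tolerance(output: str, expected: float, rel_tol: float = 0.01) -> bool:
    """Rung 3: a predicate that parses a number and accepts it
    if it is within a relative tolerance of the expected value."""
    try:
        value = float(output)
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)
```

Note that `structured_match` would accept a date such as "2024-13-99". That is the regex rung's failure mode in miniature: the structure is valid while the semantics are wrong.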

Why it matters now

LLM-as-judge graders became fashionable in 2023–2024. They produced a wave of pass-rate numbers that turned out, on inspection, to be measuring the judge model's biases as much as the system under test.

The bias patterns were systematic. Longer responses were rated higher, independent of quality (length bias, which Zheng et al. call verbosity bias). Responses resembling the judge model's own outputs were preferred (self-similarity bias, or self-enhancement bias in the same paper). Confident, assertive responses were scored above hedged but accurate ones (confidence bias). Responses appearing first in pairwise comparisons were often preferred (position bias).

None of these are random noise. They are systematic distortions that can mask real regressions. If your system under test gets noisier and more verbose, an LLM-as-judge grader with length bias may register a quality improvement while your users are experiencing degradation.
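One rough way to probe for length bias is to compare how strongly judge scores and human scores track response length on the same sample. The sketch below assumes you already have numeric scores from both graders for each response; the function names and the word-count proxy for length are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has zero variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_bias_signal(responses, judge_scores, human_scores):
    """Correlate word count with judge scores and with human scores.

    A judge correlation well above the human one on the same
    responses is a hint (not proof) of length bias.
    """
    lengths = [len(r.split()) for r in responses]
    return {
        "judge_vs_length": pearson(lengths, judge_scores),
        "human_vs_length": pearson(lengths, human_scores),
    }
```

A gap between the two correlations is a prompt to read the long, highly rated responses by hand. It is not a verdict on its own, since longer answers are sometimes genuinely better.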

The solution is not to avoid LLM-as-judge. It is to use it deliberately, validate it against a human-graded subset, and re-validate it every time the judge model changes.
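For pairwise comparisons, one deliberate use is to ask the judge twice with the answer order swapped and count a win only when both verdicts agree, in the spirit of the swap check discussed by Zheng et al. The `Judge` callable below is a hypothetical interface standing in for whatever judge call you use; it is not a real library API.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(judge: Judge, prompt: str,
                            answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie".

    A win is recorded only if the same answer is preferred in both
    presentation orders; any inconsistency is treated as a tie.
    """
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"
```

A judge that always prefers whichever answer it sees first will produce only ties under this scheme, which makes position bias visible as a tie rate rather than hiding it inside a win rate.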

A source you should trust

"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., 2023). The paper that surfaced systematic biases in LLM-as-judge grading and gave the community the methodology for auditing and calibrating them. The taxonomy of bias types is required reading before you deploy any judge-based eval.
Hamel Husain's writing on graders. Operator-grade discipline on when to use which rung, with worked examples from shipping teams. His framing that "the grader is a hypothesis about what quality means" — and therefore a thing that can be wrong and needs to be tested — is the right mental model.
OpenAI Evals framework documentation. The framework was designed to make grader choice explicit and auditable. Even if you do not use the framework directly, the design decisions baked into it are instructive about what a well-designed grader looks like.
HELM (Holistic Evaluation of Language Models). Notable for decomposing quality into multiple measurable axes — accuracy, calibration, robustness, fairness — rather than a single pass/fail. Useful reference for thinking about multi-dimensional grader design when a single number is insufficient.

A recipe

A grader-selection protocol for each task type in your suite:

1. Write the decision rule for this task type in one sentence. What makes an output correct? If you cannot write the rule in one sentence, the task is probably underspecified, and the grader cannot be designed until the specification is tightened.
2. Can a string match make that decision? If yes, use that. Test it against five known-correct and five known-incorrect examples to confirm it works.
3. Can a small Python predicate make that decision? If yes, use that. Predicates are unit-testable and version-controllable. They are usually the right answer for anything more complex than exact match and less complex than semantic judgment.
4. Is the judgment semantic (about meaning, tone, completeness) in a way that predicate logic cannot capture? Then write a rubric. Have people score a sample against the rubric to validate it before automating it.
5. Only escalate to LLM-as-judge when steps 1–4 cannot decide, and only after validating the judge against a human-graded subset of at least 20 examples. If the judge's agreement rate with humans is below 80%, recalibrate before using the judge's numbers as a gate.
6. For every LLM-as-judge grader in production, sample 10% of outputs for human review on every release cycle. If the human-judge agreement rate drops below the 80% threshold from step 5, recalibrate the judge before the next release. Never let a judge drift silently through multiple model upgrades.
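The validation gate and the release-cycle sampling in the last two steps can be sketched as follows. The function names are hypothetical; the defaults (20 examples, 80% agreement, 10% sample) come from the recipe above, and the fixed random seed is an assumption added so the sample is reproducible.

```python
import random


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def judge_is_usable_as_gate(judge_labels, human_labels,
                            min_examples=20, threshold=0.80):
    """True only if the validation set is large enough and agreement meets the threshold."""
    if len(human_labels) < min_examples:
        return False
    return agreement_rate(judge_labels, human_labels) >= threshold


def sample_for_human_review(output_ids, rate=0.10, seed=0):
    """Pick a reproducible random subset of outputs for human review."""
    ids = list(output_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))
    return random.Random(seed).sample(ids, k)
```

Raw agreement is the measure the recipe names. If your labels are heavily imbalanced (for example, almost everything passes), a judge can reach 80% agreement by always saying "pass", so it is worth looking at the disagreements directly as well.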

The smell of it going wrong

The team uses LLM-as-judge for every task type because it is the easiest to write. The grader ladder has been skipped entirely.
Pass rate climbed four points on the last release. Nobody checked whether the judge model was updated between the baseline run and the current run. The four points may be measuring a change in the judge, not a change in the system.
The grader is the same model as the system under test. This is a clear conflict of interest. A model that generates more verbose, confident outputs will score better on a judge with length and confidence bias, creating a feedback loop that rewards the wrong qualities.
A human spot-check catches obvious failures that the automated grader is passing. The grader has become optimistic — it is no longer calibrated to human quality standards.
The rubric was written by one person, never validated against human grading, and has been in production for eight months. Nobody knows whether it still reflects what the team means by "quality."
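Several of these smells come down to comparing numbers that were graded under different conditions. A small guard, sketched below, records the judge model and rubric version with every run and refuses to compute a pass-rate delta across a change. The `EvalRun` fields and version strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    system_version: str   # version of the system under test
    judge_model: str      # exact judge model identifier used for grading
    rubric_version: str   # version of the rubric given to the judge
    pass_rate: float      # fraction of cases passed, 0.0 to 1.0


def pass_rate_delta(baseline: EvalRun, current: EvalRun) -> float:
    """Return current minus baseline pass rate, but only if both
    runs were graded by the same judge with the same rubric."""
    if baseline.judge_model != current.judge_model:
        raise ValueError(
            f"judge changed: {baseline.judge_model} -> {current.judge_model}; "
            "re-validate the judge and re-grade the baseline before comparing"
        )
    if baseline.rubric_version != current.rubric_version:
        raise ValueError(
            f"rubric changed: {baseline.rubric_version} -> {current.rubric_version}"
        )
    return current.pass_rate - baseline.pass_rate
```

Failing loudly here is a design choice: a missing delta prompts a question, whereas a delta computed across a judge upgrade looks like an answer.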

A judgment call from real work

The PL course-content fit scoring originally used GPT-4-as-judge to score lesson quality against the eight-competency PM rubric. The judge was set up on a Friday afternoon, validated against a small human-graded sample, and deployed as the primary grader. For several months, the scores looked reasonable and the team trusted them.

The trouble began when a model upgrade was applied to the judge without a revalidation step. The new version of the judge was, on most tasks, more capable and more consistent. It was also noticeably more lenient on India-context content — a class of examples involving local product cases, Hinglish terminology, and India-specific market dynamics. The new judge rated these examples as high quality at a rate roughly 15 percentage points above the human graders' historical scores.

The effect was not immediately visible in aggregate pass rates, because India-context content was about 20% of the suite. The aggregate number climbed by about 3 points — easily attributed to genuine quality improvement from a prompt refinement that had happened in the same release window.

The catch came during a quarterly human spot-check. A reviewer noticed that several India-context lessons were receiving excellent fit scores on the "competitive analysis" competency but contained recommendations that were generic and not grounded in Indian market conditions. The judge was rating them excellent because they discussed competitive analysis correctly in structure. A human reviewer marked them acceptable at best.

The recalibration that followed involved updating the rubric to explicitly anchor India-context quality expectations, re-validating the judge against the updated rubric, and adding a dedicated India-context example stratum to the human spot-check rotation.

The cost of that recalibration was about two person-days. The cost of not catching it sooner was six months of course recommendations slightly over-weighted toward India-context content that had been given inflated fit scores. Both were recoverable. The lesson was cheaper to learn from that example than from a public quality problem.
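The arithmetic behind the masking is worth spelling out: a 15-point inflation on a stratum that is 20% of the suite moves the aggregate by about 3 points. The sketch below reproduces that figure and shows a per-stratum breakdown that would have exposed it. The stratum names and the assumption that all other content shifted by zero are illustrative, not taken from the original data.

```python
from collections import defaultdict


def aggregate_shift(strata):
    """strata maps name -> (share_of_suite, shift_in_points).

    Returns the share-weighted shift in the aggregate score.
    """
    total_share = sum(share for share, _ in strata.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("stratum shares must sum to 1.0")
    return sum(share * shift for share, shift in strata.values())


def pass_rate_by_stratum(records):
    """records: iterable of (stratum, passed) pairs -> {stratum: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in records:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


# Assumed figures: 20% of the suite inflated by 15 points, the rest unchanged.
shift = aggregate_shift({
    "india_context": (0.20, 15.0),
    "other": (0.80, 0.0),
})
print(f"aggregate shift: {shift:.1f} points")  # aggregate shift: 3.0 points
```

Reporting `pass_rate_by_stratum` alongside the aggregate turns a 3-point blip that is easy to explain away into a 15-point jump in one stratum that is hard to ignore.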

Rules from this lesson

Pick the lowest rung of the grader ladder that works. Escalate only when the simpler rung cannot make the decision correctly.
Never deploy an LLM-as-judge grader without validating it against a human-graded subset. A judge you have not validated is a hypothesis, not a measurement.
Re-validate judges on every model release. Judges drift. Unvalidated drift can mask real regressions or manufacture phantom improvements.

In the next lesson, we cover the cultural shift: wiring evals into CI so they run without anyone remembering to run them, and making the diff — not the absolute number — the artifact that reviewers actually read.