Structured eval logging — turning production into a regression source — Production Harnesses — observability, recovery, the bill

The move. Log production interactions in the format your eval suite already understands.

Most teams that build eval suites build them from synthetic data. They write cases that cover the happy path, the obvious edge cases, and the scenarios the team could imagine during a design session. Then they ship.

The first production failure almost always involves a case nobody imagined.

This is not a design failure; it is a sampling problem.

Your synthetic eval set samples from your imagination. Your production traffic samples from the real distribution of users, inputs, and environmental conditions.

These distributions overlap but are not the same. The tail of the real distribution is where most production incidents live.

The fix is obvious once stated: use production traffic to grow your eval set.

The obstacle is equally obvious: teams do not capture production interactions in a format their eval suite can read, so each production failure must be hand-translated into a test case.

Hand-translation at 2am after an incident, or the following week when the urgency has faded, is a step that almost always gets skipped.

Structured eval logging removes the translation step. Production logs and eval rows share a schema from day one.

The picture

A pipeline with five stations.
Station one: a production interaction happens. The agent takes an input and produces an output, and the interaction is logged as a structured row with input, actual_output, expected_output (null), grader_hints, user_feedback, and metadata (model version, timestamp, session ID, feature name).
Station two: an automated classifier or a human reviewer scans flagged rows. Rows are flagged by a low confidence score, by user feedback, or by a heuristic that marks unusual outputs.
Station three: the reviewer fills in expected_output for flagged rows and promotes them to the eval-candidate pool.
Station four: a second reviewer confirms the candidate is correctly labeled and promotes it to the frozen golden set.
Station five: the frozen entry runs in CI on every deploy, forever.

The pipeline's distinguishing feature is that a shared schema removes the per-interaction overhead from stations one through three: logging costs nothing extra, flagging reads the rows as they are, and the reviewer only has to fill in expected_output rather than reformat anything. The cost of turning a production failure into a regression test is low; without the shared schema, it is high enough that teams skip it.
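The five stations can be read as a small state machine over a row's status. The following is a minimal sketch, not a prescribed design: the stage names and the rule that a row moves forward exactly one station at a time are assumptions made for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical names for the five stations; the numeric order matters."""
    LOGGED = 1      # station one: raw production row, expected_output is null
    FLAGGED = 2     # station two: picked out by a classifier or a reviewer
    CANDIDATE = 3   # station three: expected_output filled in by a reviewer
    GOLDEN = 4      # station four: confirmed by a second reviewer and frozen
    IN_CI = 5       # station five: runs on every deploy


def advance(current: Stage, target: Stage) -> Stage:
    """Move a row forward exactly one station, or raise ValueError."""
    if target != current + 1:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Forbidding skipped stations is one way to make the two-reviewer step hard to bypass: a row cannot reach GOLDEN without first having been a CANDIDATE.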

Why it matters now

The largest source of high-quality eval data is your own production traffic. Synthetic data from your imagination will always undersample the long tail.

The gap between synthetic and real is widest in the early months of a product, when user behavior is most surprising, and narrows over time as you understand the real distribution better.

By 2026 several vendors (LangSmith, Braintrust, Langfuse) have built "datasets from traces" workflows that formalize this pipeline. The tooling is available; the discipline of using it is what separates teams that compound their eval quality from teams that plateau at the synthetic-data ceiling.

The other driver is model version changes. When you upgrade a model or change a prompt, your eval suite tells you whether the outputs you care about have changed. But it can only tell you about the inputs it has seen. Production-sourced cases ensure your eval suite includes the inputs that actually matter to your users.
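A regression run over the golden set might look like the sketch below. It assumes each golden entry is a dict with input and expected_output keys, and that the agent and the grader are plain callables; run_regression, agent, and grade are illustrative names, not part of any vendor API.

```python
from typing import Callable


def run_regression(
    golden: list[dict],
    agent: Callable[[str], str],
    grade: Callable[[str, str], bool],
) -> list[dict]:
    """Return the golden entries whose fresh output fails the grader.

    agent is the new model or prompt under test; grade(actual, expected)
    returns True when the actual output is acceptable.
    """
    failures = []
    for entry in golden:
        actual = agent(entry["input"])
        if not grade(actual, entry["expected_output"]):
            # Keep the failing output next to the entry for later review.
            failures.append({**entry, "actual_output": actual})
    return failures
```

The runner can only report on inputs that are in golden, which is the point of the paragraph above: the coverage of the check is exactly the coverage of the set.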

A source you should trust

LangSmith's "datasets from traces" workflow is a worked example of converting production traces into eval datasets. The workflow is documented with concrete UI steps and API calls. Read it before you design your own schema: borrowing their field names makes it easier to keep your schema compatible with the vendor surface.

Hamel Husain's writing on production-driven evals covers the operator's perspective on why synthetic evals are insufficient and how to build the review queue discipline that keeps production-sourced cases clean. His framing of "evals as the institutional memory of incidents" is the most precise description of why this pipeline compounds.

A recipe

A structured-log-to-eval pipeline you can implement in a day:

Define the eval-row schema: input (the full input to the agent), expected_output (null in production logs), actual_output, grader_hints (free-text notes for a future reviewer), metadata (model version, timestamp, feature, session ID), user_feedback (thumbs up/down, retry, abandonment signal). This schema lives in one place and is used by both the production logger and the eval runner.
Wire production logging to emit rows in this schema. expected_output is always null; everything else is filled from the interaction.
Define the flagging criteria that route rows to the review queue. Start with three: user feedback below a threshold, model confidence below a threshold (if available), and a random sample rate for general quality monitoring. More criteria can be added later; these three cover most cases.
Assign a named human reviewer to the queue. Not "the team" — a person with a calendar block. The queue stagnates without an owner.
The reviewer fills in expected_output and annotates disagreements. Disagreements are more valuable than confirmations; they are the cases where the model and the human diverge.
A second reviewer confirms each candidate and promotes it to the frozen golden set. Two-reviewer discipline keeps the golden set clean; without it, the set drifts toward the first reviewer's idiosyncrasies.
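The first three steps of the recipe can be sketched as follows. This is a minimal illustration under stated assumptions: the field names follow the schema described above, but the feedback labels, the default thresholds, the choice to store model confidence under metadata, and the function names log_interaction and should_flag are all hypothetical. User feedback is treated here as a categorical label rather than a numeric score.

```python
import json
import random
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EvalRow:
    """One schema shared by the production logger and the eval runner."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None   # always None at logging time
    grader_hints: str = ""
    metadata: dict = field(default_factory=dict)
    user_feedback: Optional[str] = None     # e.g. "thumbs_down", "retry"


def log_interaction(prompt: str, output: str, *, model_version: str,
                    feature: str, session_id: str,
                    user_feedback: Optional[str] = None) -> str:
    """Serialize one production interaction as a JSON line."""
    row = EvalRow(
        input=prompt,
        actual_output=output,
        metadata={
            "model_version": model_version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "feature": feature,
            "session_id": session_id,
        },
        user_feedback=user_feedback,
    )
    return json.dumps(asdict(row))


# Hypothetical labels for negative feedback signals.
NEGATIVE_FEEDBACK = {"thumbs_down", "retry", "abandoned"}


def should_flag(row: EvalRow, *, min_confidence: float = 0.5,
                sample_rate: float = 0.01,
                rng: Optional[random.Random] = None) -> bool:
    """Apply the three starting criteria: feedback, confidence, random sample."""
    if row.user_feedback in NEGATIVE_FEEDBACK:
        return True
    # Assumption: confidence, when the model exposes it, lives in metadata.
    confidence = row.metadata.get("confidence")
    if confidence is not None and confidence < min_confidence:
        return True
    draw = rng.random() if rng is not None else random.random()
    return draw < sample_rate
```

Because the eval runner reads the same EvalRow, promoting a flagged row means setting expected_output and nothing else. Passing a seeded random.Random as rng makes the sampling reproducible in tests.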

The golden set is append-only. Entries are not removed; they are annotated as "superseded by" when a newer case covers the same ground more precisely.
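One way to enforce the append-only rule in code is to offer no delete operation at all. The sketch below assumes an in-memory store and hypothetical names (GoldenEntry, GoldenSet, superseded_by); a real golden set would more likely live in version control or a database. It uses built-in generic types, so it needs Python 3.9 or later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GoldenEntry:
    entry_id: str
    input: str
    expected_output: str
    superseded_by: Optional[str] = None  # id of the newer, more precise case


class GoldenSet:
    """Append-only store: entries can be added or annotated, never deleted."""

    def __init__(self) -> None:
        self._entries: dict[str, GoldenEntry] = {}

    def append(self, entry: GoldenEntry) -> None:
        if entry.entry_id in self._entries:
            raise ValueError(f"{entry.entry_id} already exists")
        self._entries[entry.entry_id] = entry

    def supersede(self, old_id: str, new_id: str) -> None:
        """Annotate old_id as superseded; both entries stay in the set."""
        if new_id not in self._entries:
            raise KeyError(new_id)
        self._entries[old_id].superseded_by = new_id

    def active(self) -> list[GoldenEntry]:
        """Entries that have not been marked as superseded."""
        return [e for e in self._entries.values() if e.superseded_by is None]

    def __len__(self) -> int:
        return len(self._entries)
```

Whether CI should keep running superseded entries is not settled by the rule itself; active() is offered only as a convenience for teams that choose to filter.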

The smell of it going wrong

Production logs and eval datasets have different schemas. Columns are named differently, the input format differs, metadata fields are missing. Every conversion from production log to eval row is a manual mapping task. The mapping is documented nowhere; each engineer who does it does it slightly differently.

Flagged failures sit in a review queue nobody works. The queue was set up three months ago; it has 847 entries. Nobody has a calendar block for it. The last entry was reviewed six weeks ago.

Production rows enter the golden set without a second reviewer. The golden set has accumulated cases from one engineer's Friday-afternoon review sessions. That engineer's definition of "correct" is embedded in 200 cases in ways nobody has audited.

The schema does not capture user feedback. Thumbs down, immediate retry, session abandonment — these are the cheapest and most reliable signals of a production failure, and they are not in the log.
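The stagnant-queue smell above is easy to detect mechanically. The following is a minimal sketch, assuming each queue item records when it was flagged and when, if ever, it was reviewed; the field names, the queue_health function, and the seven-day limit are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class QueueItem:
    flagged_at: datetime
    reviewed_at: Optional[datetime] = None


def queue_health(items: list[QueueItem], now: datetime,
                 max_age: timedelta = timedelta(days=7)) -> dict:
    """Summarize backlog size and the age of the oldest unreviewed item."""
    pending = [i for i in items if i.reviewed_at is None]
    oldest = min((i.flagged_at for i in pending), default=None)
    oldest_age = (now - oldest) if oldest is not None else timedelta(0)
    return {
        "pending": len(pending),
        "oldest_age_days": oldest_age.days,
        "stale": oldest_age > max_age,
    }
```

A check like this could run on a schedule and notify the named queue owner when stale is true, so that a backlog surfaces after days rather than months.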

A judgment call from real work

PL's course-fit scoring pipeline logs every score as a structured row: the input lesson content, the rubric breakdown (clarity score, depth score, practice-alignment score), the overall fit score, the model version used, and the timestamp. When a course owner reviews a scored lesson and disagrees with the overall score, they enter a disagreement through a simple form: their human-graded score and a sentence of justification.

That disagreement goes into the eval-candidate queue automatically. It carries the original input, the model output, and the human-graded expected output. A second reviewer — usually the course director — looks at the disagreement and either confirms it as a valid eval case or resolves it as a misunderstanding of the rubric.
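The conversion from a logged score plus a disagreement form into a queue entry can be sketched as below. All class, field, and function names here are illustrative assumptions, not the actual pipeline's code; the sketch only shows that no manual reformatting is needed when the logged row already carries the input and the model output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredLesson:
    """A logged score row; field names are illustrative."""
    lesson_content: str
    rubric: dict           # e.g. {"clarity": 4, "depth": 3, "practice_alignment": 5}
    overall_score: float
    model_version: str
    timestamp: str


@dataclass
class Disagreement:
    """What the course owner submits through the form."""
    human_score: float
    justification: str


@dataclass
class EvalCandidate:
    input: str
    actual_output: float
    expected_output: float
    grader_hints: str
    metadata: dict
    confirmed: Optional[bool] = None   # set by the second reviewer


def to_candidate(row: ScoredLesson, d: Disagreement) -> EvalCandidate:
    """Combine the logged row and the form entry into a queue item."""
    return EvalCandidate(
        input=row.lesson_content,
        actual_output=row.overall_score,
        expected_output=d.human_score,
        grader_hints=d.justification,
        metadata={
            "model_version": row.model_version,
            "timestamp": row.timestamp,
            "rubric": row.rubric,
        },
    )


def second_review(c: EvalCandidate, is_valid_case: bool) -> EvalCandidate:
    """Record the second reviewer's verdict: valid case or rubric misunderstanding."""
    c.confirmed = is_valid_case
    return c
```

Storing the justification in grader_hints keeps the course owner's reasoning attached to the case, which helps when a disagreement has to be reconstructed later.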

The effect has been a slow but steady growth in the golden set toward the cases that actually matter. The earliest cases in the golden set were synthetic — lessons we imagined being edge cases. The most valuable cases are the ones that came from real disagreements between the model and a course owner who knows the material. The shape of those disagreements tells us where the rubric is ambiguous and where the model is systematically biased.

The queue needs an owner. When the course director had a heavy week and the queue sat for ten days, we lost the thread on three disagreements that we later could not reconstruct. The lesson: reviewer ownership is as important as schema design.

The next lesson turns from capturing what the system does to tracking what it costs. Cost logging follows the same discipline: a shared schema, a named reviewer, and a dashboard number the team can recite from memory.

Rules from this lesson

Production logs and eval datasets share one schema; two formats that require manual translation mean the translation never happens.
Reviewer queues need named owners with calendar blocks; a queue without an owner is an aesthetic object.
User feedback — thumbs, retries, abandonment — is the cheapest signal of an eval candidate and the first thing to add to the schema.
The golden set is append-only; mark superseded entries rather than deleting them, so the history of what you learned is preserved.
Disagreements between the model and a domain expert are more valuable than confirmations; the cases where they diverge are where the rubric is ambiguous and the model is most likely to be systematically wrong.