Postmortems as regression tests — every incident becomes an eval — Production Harnesses — observability, recovery, the bill

The move. Every postmortem ends with a new entry in the eval suite. No exceptions.

There is a well-understood reason why postmortems exist.

The knowledge that a failure exposes — about where the system breaks, under what inputs, for what reason — is expensive to acquire. It cost an incident to get it. If that knowledge evaporates, the cost will be paid again.

The blameless postmortem ritual is the mechanism for not letting the knowledge evaporate.

This is correct as far as it goes. The problem is that knowledge in a document is not the same as knowledge that runs in CI.

A postmortem document records what happened and what was changed. It lives in Notion or Confluence. Some number of people read it in the week after it is published. After that, it is searched for infrequently and referenced almost never.

It does not prevent recurrence. It only preserves the memory of the incident after the fact.

A regression test in the eval suite is different. It runs on every deploy. It fails if the failure condition is reproduced.

It is not a memory aid — it is an automated check. The incident that generated it cannot recur silently; the eval fails loudly before the code reaches production.

The discipline of turning every postmortem into an eval entry is the difference between a team that learns from incidents and a team that has a folder of incident documents.
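
A minimal sketch of the difference, assuming a hypothetical `EvalCase` record and `run_suite` helper rather than any particular eval framework. The incident ID, the input, and the two stand-in systems are invented for illustration. Each case carries the input that triggered its incident and a check on the output; the suite reports which incidents reproduced, and CI turns that into a failing exit code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    """One eval entry derived from one incident (hypothetical structure)."""
    incident_id: str               # link back to the postmortem document
    input_text: str                # the exact input that triggered the incident
    passes: Callable[[str], bool]  # True when the output is acceptable


def run_suite(system: Callable[[str], str], cases: List[EvalCase]) -> List[str]:
    """Run every case; return the incident IDs whose failure reproduced."""
    failed = []
    for case in cases:
        output = system(case.input_text)
        if not case.passes(output):
            failed.append(case.incident_id)
    return failed


def ci_exit_code(failed: List[str]) -> int:
    """A non-zero exit code is what makes the deploy fail loudly."""
    return 1 if failed else 0


# Hypothetical incident: the system returned an empty answer for this input.
CASES = [
    EvalCase(
        incident_id="INC-EXAMPLE-1",
        input_text="refund policy for annual plans",
        passes=lambda out: out.strip() != "",
    ),
]


def fixed_system(text: str) -> str:
    # Stand-in for the system after the fix.
    return f"answer for: {text}"


def regressed_system(text: str) -> str:
    # Stand-in for a later change that reintroduces the failure.
    return ""
```

Running `run_suite(regressed_system, CASES)` reports the incident by ID, so the failing check points straight back to the postmortem that produced it.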

The picture

A pipeline with five stations.

Station one: an incident occurs, gets triaged, and gets resolved via the on-call playbook.
Station two: the postmortem is written within 48 hours. It follows the four-part template (what happened, why, what changed, what blocks recurrence).
Station three: the "what blocks recurrence" section requires a specific answer: which eval entry was added, with the exact input that would have caught this incident before it reached production.
Station four: the eval entry enters the CI pipeline. It runs on every PR and every deploy.
Station five: six months later, a code change reintroduces the failure condition. The eval catches it before the PR merges. The incident never recurs.

The pipeline closes the loop. An incident is not finished when the on-call engineer goes back to sleep. It is finished when the eval suite runs the input that would have caught it.

Why it matters now

The discipline of blameless postmortems is decades old; Google, Netflix, and Etsy wrote widely cited accounts of it. Many teams already practice it well.

The gap — which is newer — is the extension to eval entries.

By 2026 the best agent teams treat the eval suite as the institutional memory of every incident, not just as a pre-launch quality gate.

The distinction matters because agent systems fail in distribution-tail ways that synthetic evals do not anticipate. The production traffic that finds the failure is the only reliable source of the input that catches it.

If that input is not preserved in the eval suite, it is preserved only in the postmortem document.

And documents are not automated.

A source you should trust

Google's SRE book, chapter on postmortems, is the canonical frame for blameless retrospection — the discipline of separating incident analysis from blame so that engineers can describe failures accurately without fear of consequence. The blameless frame is the prerequisite; without it, postmortems produce sanitized accounts that omit the useful details.

Hamel Husain's writing on production-driven evals connects the postmortem discipline to the eval-entry extension. His framing of the eval suite as institutional memory is the clearest articulation of why "document it" is insufficient and why "add it to CI" is the higher standard.

A recipe

A four-part postmortem template that ends with an eval entry:

What happened. A factual timeline in neutral language. Not "the agent misbehaved" — "at 03:14, the agent entered a tool-call loop on input X. The loop continued for 23 turns before the per-session cost cap triggered a pause. User Y's session produced output Z, which was incorrect." Specific enough that someone who was not on-call can reconstruct the incident from the document.
Why it happened. Root cause, not symptom. Not "the agent looped" but "the stopping condition in the planner checked for 'task_complete' in the output, and the tool call that triggered the loop never produced output containing that string because the API response schema changed in version 3.2." One cause. Not a list of contributing factors unless the failure genuinely had multiple independent causes.
What we changed. Code changes (PR numbers), configuration changes (commit SHAs), process changes (playbook updates, monitoring additions). Specific. Linkable.
What blocks recurrence. This section is the forcing function. It cannot be answered with "we'll be more careful" — that is not a blocking mechanism. It must name: the eval entry that was added (with the input that reproduces the failure), the CI check that runs it, and the earliest point in the deployment pipeline where the check would catch a regression. If this section cannot be filled in, the postmortem is not done.
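
The tool-call loop used as the example in this template could become an eval entry along the following lines. This is a sketch under stated assumptions: `run_agent` is a toy planner loop, the response shape in `tool_new_schema` is a guess at what a changed schema might look like, and `MAX_TURNS` is an assumed budget. None of these names come from a real system.

```python
from typing import Callable, Dict

MAX_TURNS = 5  # assumed turn budget for the eval, far below the 23 turns in the incident


def run_agent(
    tool: Callable[[str], Dict],
    is_done: Callable[[Dict], bool],
    task: str,
    hard_cap: int = 50,
) -> int:
    """Toy planner loop: call the tool until the stop condition holds.

    Returns the number of turns used. The hard cap only keeps the eval
    itself from running forever when the stop condition never fires.
    """
    turns = 0
    while turns < hard_cap:
        turns += 1
        response = tool(task)
        if is_done(response):
            break
    return turns


def tool_new_schema(task: str) -> Dict:
    # Stub of the changed API response: completion is signalled by a
    # status field, not by a marker string in the output (hypothetical shape).
    return {"status": "complete", "output": "report generated"}


def stop_on_string(response: Dict) -> bool:
    # The stop condition from the incident: looks for a literal marker.
    return "task_complete" in response.get("output", "")


def stop_on_status(response: Dict) -> bool:
    # A repaired stop condition that also accepts the new field (hypothetical fix).
    return response.get("status") == "complete" or stop_on_string(response)


def eval_no_tool_loop(is_done: Callable[[Dict], bool]) -> bool:
    """Eval entry: the incident input must finish within the turn budget."""
    return run_agent(tool_new_schema, is_done, task="input X") <= MAX_TURNS
```

The eval asserts on the behavior (the session terminates within budget), not on the specific fix, so it would also catch a different future change that reintroduces the loop.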

If the failure mode does not translate directly into an eval entry — for example, an infrastructure failure rather than a model behavior failure — the "what blocks recurrence" section should name the alert or monitoring check that was added, not an eval entry. The principle is the same: the blocker must be automated, not documentary.
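
One way to enforce the "automated, not documentary" rule is to lint the section itself. The sketch below assumes the "what blocks recurrence" section is captured as structured fields; the `RecurrenceBlocker` record and its field names are hypothetical, not part of any postmortem tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecurrenceBlocker:
    """Structured form of the 'what blocks recurrence' section (hypothetical)."""
    eval_entry_id: Optional[str] = None   # eval case added to the suite
    ci_check: Optional[str] = None        # CI job that runs the eval
    pipeline_stage: Optional[str] = None  # earliest stage where it would catch a regression
    alert_name: Optional[str] = None      # monitoring check, for infrastructure failures
    notes: str = ""                       # free text; never counts as a blocker


def blocker_problems(b: RecurrenceBlocker) -> List[str]:
    """Return the reasons this section is not yet done (empty list means done)."""
    problems = []
    if not b.eval_entry_id and not b.alert_name:
        problems.append("no automated blocker: name an eval entry or an alert")
    if b.eval_entry_id and not b.ci_check:
        problems.append("eval entry is named but no CI check runs it")
    if b.eval_entry_id and not b.pipeline_stage:
        problems.append("earliest pipeline stage for the check is not stated")
    return problems
```

A section containing only `notes="we'll be more careful"` fails this check, while an infrastructure postmortem that names an alert passes without an eval entry.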

The smell of it going wrong

Postmortems are written but never read. The team has a Notion page called "Incident Postmortems" with twelve entries. The last one was read by three people in the week it was published. Nobody has read the ones from six months ago.

The "what blocks recurrence" section is "we'll be more careful." This is the most common failure mode. It is not a blocking mechanism. The team is three months away from the same incident recurring, and the postmortem will read almost identically.

No eval entry is added. The postmortem is complete from a document standpoint. The failure input that triggered the incident is described in the document. It is not in the eval suite. A deploy six weeks later reintroduces the condition; the eval suite runs and passes; the regression ships.

The same incident recurs within six months. This is the signal that the postmortem ritual is not working. Not that incidents happen — incidents always happen — but that the specific failure mode from a previous incident recurs. This is the preventable failure.

A judgment call from real work

The Ostronaut named-vector retrieval P0 incidents produced exactly this pipeline over several months. The retrieval system uses named vector spaces to separate content types (course lessons, case studies, manual chapters), and the early versions had a quality gap: the orphan_gap_pct metric (the percentage of source document sections that were embedded but did not appear in any retrieved result for reasonable queries) was running at 36%, meaning roughly a third of the content was effectively invisible to the retrieval system.

Each P0 incident followed the same pipeline. A retrieval failure was reported: a query returned no results, or returned results from the wrong content type. The postmortem traced the failure to a specific input and a specific retrieval behavior. The eval entry consisted of the input that reproduced the failure, the expected retrieved content, and the orphan_gap_pct threshold that should have caught the embedding quality problem before the content reached the production index.

Over four incidents across three months, the eval suite grew from twelve retrieval cases to forty-one. The orphan_gap_pct threshold moved from "checked occasionally" to "blocking gate on every pipeline run." The time-to-detect-regression on retrieval quality dropped from "days after a user report" to "during the pipeline run before production promotion."
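
A sketch of how a metric like orphan_gap_pct might be computed and used as a blocking gate, following the definition given earlier in this section. The function names, the section IDs, and the 10% threshold are assumptions for illustration; the actual Ostronaut implementation and threshold are not described here.

```python
from typing import Dict, Iterable, List, Set

MAX_ORPHAN_GAP_PCT = 10.0  # assumed gate value, not the real threshold


def orphan_gap_pct(
    embedded_sections: Iterable[str],
    retrieved_by_query: Dict[str, List[str]],
) -> float:
    """Percentage of embedded sections that no query in the set retrieved."""
    embedded: Set[str] = set(embedded_sections)
    if not embedded:
        return 0.0
    seen: Set[str] = set()
    for section_ids in retrieved_by_query.values():
        seen.update(section_ids)
    orphans = embedded - seen
    return 100.0 * len(orphans) / len(embedded)


def gate_passes(pct: float, max_pct: float = MAX_ORPHAN_GAP_PCT) -> bool:
    """Blocking gate: the pipeline run fails when too much content is invisible."""
    return pct <= max_pct


# Toy data: four embedded sections, two of which never appear in any result.
EMBEDDED = ["lesson/1", "lesson/2", "case/1", "manual/1"]
RETRIEVED = {
    "how do I start the course": ["lesson/1", "lesson/2"],
    "what comes after the intro": ["lesson/2"],
}
```

With the toy data, `orphan_gap_pct(EMBEDDED, RETRIEVED)` is 50.0 and the gate fails, which is the behavior a blocking check needs: the run stops before production promotion instead of waiting for a user report.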

The moment the team noticed the eval suite was growing in a useful direction was when a model upgrade was proposed. Before the upgrade, the team ran the eval suite against the new model's embeddings. The suite caught two regressions that would not have been caught by any other mechanism. The upgrade was approved with modifications. The institutional memory had done its job.
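
The pre-upgrade check can be sketched as a comparison of the same eval cases under two retrievers. Everything named here is hypothetical: the `Retriever` type, the queries, the section IDs, and the in-memory indexes standing in for the current and candidate embedding models.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]  # query -> ranked section IDs


def failing_queries(retrieve: Retriever, expected: Dict[str, str], k: int = 3) -> List[str]:
    """Queries whose expected section is missing from the top-k results."""
    return sorted(q for q, section in expected.items() if section not in retrieve(q)[:k])


def new_regressions(
    baseline: Retriever,
    candidate: Retriever,
    expected: Dict[str, str],
    k: int = 3,
) -> List[str]:
    """Queries that pass with the baseline but fail with the candidate."""
    already_failing = set(failing_queries(baseline, expected, k))
    return [q for q in failing_queries(candidate, expected, k) if q not in already_failing]


def from_index(index: Dict[str, List[str]]) -> Retriever:
    """Wrap a precomputed query -> results mapping as a retriever (test stub)."""
    return lambda query: index.get(query, [])


# Each expected entry comes from an incident: the query and the section it must return.
EXPECTED = {
    "how do I reset a device": "manual/ch2",
    "pricing case study": "case/07",
}
BASELINE_INDEX = {
    "how do I reset a device": ["manual/ch2", "lesson/3"],
    "pricing case study": ["case/07"],
}
CANDIDATE_INDEX = {
    "how do I reset a device": ["lesson/3", "lesson/9"],  # wrong content type ranks first
    "pricing case study": ["case/07"],
}
```

Comparing against the baseline, instead of only counting candidate failures, separates regressions introduced by the upgrade from cases that were already failing.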

The next lesson is the Ostronaut batch pipeline, a system where the postmortem-to-eval pipeline ran through a series of P0 incidents and left the eval suite at forty-one entries after three months. That history is the clearest demonstration in this course of the eval suite operating as institutional memory under real operational pressure.

Rules from this lesson

Every postmortem ends with an automated blocking mechanism — an eval entry in CI, or a monitoring check; documents that require humans to remember and re-apply are not blocking mechanisms.
The "what blocks recurrence" section is not complete if it says "we'll be more careful"; a mechanism that relies on human care degrades under stress, exactly when it is most needed.
The eval suite is the institutional memory of every incident; protect it the way you protect production data — with access controls, naming conventions, and a changelog.
The same incident recurring within six months is the signal that the postmortem ritual is not producing automated blockers; audit the "what blocks recurrence" sections of the previous postmortems.
Schedule the postmortem within 48 hours of resolution; "when we have time" does not arrive, and the incident details that make the eval entry accurate fade quickly.