Replay — debugging in slow motion — Production Harnesses — observability, recovery, the bill

The move. If you cannot replay yesterday's incident in your dev environment, you cannot debug it.

There is a particular kind of debugging frustration that only agent systems produce.

You have a production incident. You have the logs. The logs show the agent's outputs, the tool call names, the error message at the end.

But the agent's behavior across twenty turns is not obvious from a flat log. You want to understand why the agent chose tool B at turn 12 instead of tool A, and the logs tell you what it did, not why.

For a single-function bug, you can reproduce it: call the function with the same inputs and step through the debugger.

For a multi-turn agent failure, reproduction requires re-running the entire session with the original context at each turn — and doing so without triggering the production side effects of the tool calls that ran during the original session.

This is the replay problem. It is not unique to AI agents; distributed systems engineers and game developers have been solving it for years.

What is new is the combination of probabilistic outputs (the same prompt may produce different outputs on re-run due to temperature) and expensive tool calls (a replay that actually sends emails or deletes files is not a safe debugging environment).

The teams that debug agent failures fastest have a replay surface. The teams that do not are left reasoning from logs, which works up to about a dozen turns and fails badly after that.

The picture

A replay UI mockup with three sections. Top: a timeline of the recorded run, showing each turn as a node with timestamp and cost attribute. A scrubber control allows stepping forward and backward. Center: at each step, the full context visible to the agent at that turn — conversation history, tool call results, system prompt. This is the context window the model saw when it made its decision. Bottom: a "what if" editor. The debugger can modify one element of the context at a chosen step — change the tool call result, alter a message — and run the session forward from that point. The modified run and the original run are shown in split view, with divergence highlighted.

This is not a luxury UI feature. It is the debugging surface that makes multi-turn agent failures tractable.

Why it matters now

The complexity of multi-turn agent runs makes "stare at the logs and reason about what happened" intractable past about twenty turns.

Twenty turns means twenty model decisions, each conditioned on the outputs of the previous ones. The failure at turn 18 may have been caused by a context error at turn 4.

Tracing that dependency chain from logs requires holding the whole session in working memory simultaneously — which a human engineer cannot do reliably for complex sessions.

By 2026 several vendors ship replay surfaces as a core feature: LangSmith's trace explorer, Phoenix's span viewer, Braintrust's evaluation replay.

Rolling your own replay is more expensive than adopting a vendor. The question is not whether to have replay; it is which surface to use and how to wire it correctly.

The irreversibility of agent actions also makes replay critical. If an agent deleted a file, sent a message, or updated a record, you need to understand why it took that action before the next session runs.

Replay gives you the why. Logs give you the what.

A source you should trust

LangSmith's replay documentation covers a representative production replay implementation. The workflow is: trace a production session, open it in LangSmith, step through turns, inspect context at each step, optionally re-run from any point with modified inputs. The implementation detail that matters is isolation: re-runs in LangSmith's playground use mock tool responses by default, so you can debug without triggering real tool side effects.

Honeycomb's tracing-replay writeups represent the non-AI observability literature on what makes replay useful. Honeycomb has been building trace analysis tooling since before LLMs were a topic; their thinking about what makes a trace navigable in a postmortem is more mature than most AI-native vendors' writing. Read them to understand what you are aiming for.

A recipe

A pre-launch replay readiness check — four questions that need affirmative answers:

Can you re-execute any production run from its trace? This requires your tracing setup to record enough information to reconstruct the context at each turn. If traces only record outputs and not the full context window, replay is not possible. Verify by attempting to replay the last three production sessions in your dev environment.
Can you modify one input and see the divergence? This is the "what if" capability. It requires the ability to inject modified tool responses into a replayed session. If your replay infrastructure re-calls production tools, you cannot safely modify inputs — you need a mock layer.
Can you step turn-by-turn and inspect context and tool calls? This is the step-debugger capability. A replay that only lets you run forward is less useful than one that lets you step. The debugger's mental model requires being able to pause and inspect.
Is the replay environment isolated from production? This is the safety requirement. A replay that sends real emails, makes real database writes, or calls real external APIs is not a debugging environment — it is a production environment with extra steps. Isolation can be achieved with mock tool responses, a separate test database, or a dry-run mode for external APIs.

If any of these questions has a "no" answer, address it before you need to debug a production incident.

The smell of it going wrong

Debugging an incident requires reading raw logs. The team opens a Kubernetes log stream, copies the output to a text file, and begins reading. This is the absence of replay. It works for single-turn failures; it becomes intractable at twenty turns.

The team has never re-executed a recorded run. The tracing infrastructure exists and traces are stored, but nobody has walked through the replay workflow. The first time the team uses replay is during an active incident, when time pressure makes learning a new tool painful.

Replay is possible in theory but nobody has practiced it. Similar to above, but with more self-awareness. The team knows replay exists in their vendor's UI; they have watched a demo. They have not done it themselves with a real session from their own system.

Replay re-executes against production tools, causing duplicate side effects. The team discovered they could replay sessions, but the replay triggered real external API calls — a message was sent twice, a database record was updated twice. The replay was turned off and not revisited.

A judgment call from real work

The Claude Code session transcript is itself a replayable artifact. Every tool call, assistant response, user message, and system message is recorded in the local JSONL file. For PL's agent work, this means any session can be reconstructed by reading the transcript: the team can scroll back through parent and subagent transcripts and reconstruct every decision in sequence.

This is read-only replay — the team can re-read the transcript and understand what happened, but they cannot re-execute the session with modified inputs and see the divergence. For most PL debugging tasks, read-only replay is sufficient. The failure mode is usually clear enough from the transcript that the team knows what happened and why.

The cases where true re-execution would have been valuable were the ones where the cause was subtle. In the polish-foundation-sprint deploy-mismatch incident, the transcript showed the tool calls the agent made but not the environmental state that caused those calls to produce unexpected results. A replay that could inject modified environment state — specifically, the correct URL for the staging app — would have reproduced the failure faster and made the fix more obvious earlier.

The lesson is: read-only replay is better than nothing and covers most debugging tasks. True re-execution with input modification is necessary for failures caused by environmental state that is not captured in the transcript. Design your tracing to capture enough environment state that the two categories are distinguishable.

The next lesson looks at PL's full observability stack — Sentry, Checkly, and Playwright — as a complete working example. Replay is one layer in that stack; understanding how it fits alongside error tracking and synthetic monitoring gives the complete picture of what "production observability" means in practice.

Rules from this lesson

Replay is an observability feature that pays its investment back on the first complex incident — treat it as non-optional, not as a luxury.
Adopt a vendor replay surface rather than building your own; the vendors have invested more than a team building for the first time can justify.
Practice replay in calm conditions, not during an incident; the first time you use a tool under pressure is not the time to learn it.
Ensure replay is isolated from production tools; a debug session that triggers real side effects is not a safe debugging environment.
Design tracing to capture enough environmental state that the difference between "agent misbehaved" and "agent saw wrong environment" is diagnosable from the trace alone.