Recovery and checkpoints — resuming work, not restarting it — Production Harnesses — observability, recovery, the bill

The move. If you cannot resume, you cannot run safely past hour two.

The first long agent run your team operates will probably crash.

This is not pessimism; it is statistics. Long runs encounter more tail conditions. A tool call that succeeds 99% of the time will fail once every hundred calls.

If your agent makes fifty tool calls in a session, the probability of at least one failure is roughly 40%. Add infrastructure unreliability, model rate limits, and transient network conditions, and "the six-hour run will crash at some point" is a reasonable planning assumption.

The question is not whether your run will crash. The question is what happens when it does.

If the answer is "it starts over from the beginning," you have a problem that scales with run length.

A two-hour run that restarts costs two hours. A six-hour run that restarts at hour five costs the five hours already spent plus six more.

At some point — and that point arrives faster than most teams expect — the expected cost of a session includes the cost of recovery, and recovery without checkpoints makes long runs economically unviable.

Checkpoint design solves this. It is the boring, unglamorous engineering work that makes long-running agents tractable.

The picture

A horizontal timeline representing a six-hour agent run. Diamond markers at regular intervals: checkpoint at T+30min, T+1hr, T+1.5hr, and so on. At T+5hr, a crash marker. An arrow from the crash marker back to the T+4.5hr checkpoint diamond, labeled "resume from here." Below the timeline, two cost comparisons: "restart from zero: 11 hours of compute, 11 hours of calendar time." "Resume from checkpoint: 1.5 hours of compute, 1.5 hours of calendar time."

Below that, a storage cost line. Checkpoints are small relative to the compute they save. A checkpoint that serializes working memory and turn history might be 50KB. A six-hour run with checkpoints every thirty minutes writes 12 checkpoints, totaling 600KB. The compute cost of that restart is thousands of times higher than the storage cost of the checkpoints.

Why it matters now

Long autonomous runs are common enough by 2026 that "restart from zero" is no longer a valid failure recovery strategy for anything past two hours.

The cost compounds across three dimensions: tokens (you re-run the same inference), calendar time (the user waits again), and user trust (they submitted a task, the system crashed, and now they are told to submit it again).

Checkpoint design is the cheap path. It requires engineering effort before launch, but that effort is bounded and predictable.

The alternative — "be more careful about not crashing" — is not a strategy. It is optimism with no implementation.

A source you should trust

LangGraph's checkpointer documentation covers the current best production primitive for agent-loop checkpointing in Python-based agent systems. LangGraph has invested significantly in this infrastructure because their users' agents are long-running by design. The documentation is concrete: it explains the Serde protocol, the storage backends (Postgres, Redis, SQLite), and the resume semantics.

Temporal.io's workflow-resume documentation represents the state of the art from workflow engine design, which has been solving this problem for distributed systems for over a decade. Agent loops are a specific case of the general problem Temporal solved for microservice orchestration. Reading their approach to exactly-once execution, idempotency keys, and saga patterns will raise your design vocabulary even if you do not adopt Temporal directly.

A recipe

A checkpoint-design protocol for any long-running agent:

Decide checkpoint frequency. The options are: every turn (highest safety, highest storage and I/O cost), every N minutes (time-based, simple but can have large gaps), at named state transitions (highest design effort, lowest storage cost, most meaningful resume points). For most systems, every turn is right for short sessions and every 5-10 turns or at major state transitions is right for long sessions.
Decide what is serialized. Minimum: the conversation history up to this point, the working memory the agent is maintaining, the current tool call queue. Optional: a summary of completed work (cheaper than full history, useful for very long runs). Do not serialize the full raw context of tool call responses — those can be re-fetched from source if needed, and serializing them bloats the checkpoint.
Decide where checkpoints are stored. Separating checkpoint storage from production data is good practice. Options: a separate Postgres table, a Redis keyspace, a file per session in object storage. The choice depends on retention requirements and access patterns during recovery.
Decide resume semantics. Exactly-once means the agent never re-executes a turn it has already completed. At-least-once means a turn may re-execute if the crash happened after the turn completed but before the checkpoint was written. Idempotent re-execution (same input always produces same output) makes at-least-once semantically equivalent to exactly-once for most purposes. Design for idempotent re-execution where possible; it is the safest semantic with the lowest implementation complexity.
Test recovery from each checkpoint level before launch. Not from a designed crash scenario — from a forced kill of the process during a live run. The resume path has bugs that only appear under real recovery conditions.

The smell of it going wrong

The team has never tested a recovery from a forced crash. The checkpoint system was designed and deployed; it has never been exercised.

This is the most common failure pattern. The first time the system resumes from a checkpoint is during a production incident, when the stakes are high and the on-call engineer is under pressure.

Checkpoints are written but resume has never been exercised. Slightly different from above: the team has run tests that verify checkpoints are being written, but they have never run the resume path. The resume path has three bugs they will discover at 3am.

Checkpoint storage is not garbage-collected. Every session writes checkpoints. Sessions are abandoned, completed, and retried. The checkpoint storage grows unboundedly. The bill for checkpoint storage eventually exceeds the bill for inference. (This is not hypothetical; it has happened.)

Resume semantics are implicit. The team is not sure whether re-running a checkpoint will produce the same result.

Some tool calls have side effects — they send emails, update databases, make API calls. If those tool calls re-execute on resume, the side effects happen twice. Nobody knows which tool calls are idempotent and which are not.

A judgment call from real work

The PL parallel-triage incident is a checkpoint failure in slow motion. During the polish-foundation-sprint, a batch of subagent tasks was submitted in parallel. When a rate-limit (429) error hit, several subagents were mid-task. They had no checkpoint protocol; their progress was not serialized at any turn boundary. When the rate limit resolved, the subagents had to start over from their initial prompts.

The cost was measured in hours, not in money — each affected subagent re-ran a task that had already been mostly completed. But the design lesson was clear: any system that runs subagents in parallel needs per-subagent checkpoints at each turn boundary, not just at job submission and job completion.

The fix added checkpoint files scoped to each worktree: after each tool call, the subagent writes its current state (conversation history, working memory, completed-step list) to a file at a predictable path. If the subagent is interrupted and restarted, it reads the checkpoint file, reconstructs its state, and resumes from the last completed step.

What is serialized: the conversation history (compressed to turn summaries after the first ten turns to keep the file small), the list of completed steps with their outputs, and the current step being attempted. What is not serialized: the full raw content of file reads and web fetches (those are re-fetched if needed). Resume semantic: at-least-once, with idempotent steps so re-execution is safe.

The garbage-collection policy: checkpoint files older than 72 hours are deleted by a cron job. Sessions that old are either completed or abandoned; in either case, the checkpoint has no value.

Checkpoints are the precondition for replay, which is the subject of the next lesson. You cannot replay a session from the middle if there is no serialized state at the middle. The two disciplines compound: checkpoints make replay useful, and replay makes checkpoints worth maintaining.

Rules from this lesson

Test recovery from forced crashes before launch, not during the first incident; the resume path has bugs that only appear under real recovery conditions.
Design for idempotent re-execution — same input, same output — because it makes at-least-once semantics safe and eliminates a class of double-action bugs.
Garbage-collect old checkpoints; storage grows unboundedly without a retention policy, and the bill is real.
Document which tool calls have side effects before wiring them into a resumable agent; non-idempotent tool calls require exactly-once semantics, which is harder to implement.
Separate checkpoint storage from production data; the separation simplifies the retention policy and prevents checkpoint noise from polluting production query patterns.