Every agent loop demo you have ever seen ran for less than five minutes.
This is not a coincidence. It is a selection effect. Demos are shown at the length where the interesting failures have not yet arrived. The presenter is not hiding anything — they genuinely do not know what happens at hour four, because they have never run the loop at hour four. Staging environments are sized for demos. Production runs are not.
The result is a systematic evaluation gap. Teams make build-buy decisions, vendor selections, and architectural commitments based on five-minute demos of systems they intend to run for six hours. The mismatch is not caught until the first overnight run. By that point, the decision has been made.
The failure modes that matter most in production are hour-two failures, not hour-one failures. If you are evaluating an agent system by watching demos, you are grading it on the wrong exam.
Here is a timeline of when specific failure regimes typically appear, in order.
Prompt drift (≈ hour one). The agent's behavior begins drifting from the original instruction. Not dramatically — subtly. The framing shifts slightly. Constraints that were explicit in the system prompt start being interpreted loosely. This is the first failure mode because it is the first thing that is hard to reproduce: prompt drift in a fresh context window looks like model inconsistency. In a long context window with a lot of tool-call history, it is usually a context-band problem — the system prompt is still technically present, but it is competing with accumulated turns and losing attention share.
Tool-loop ping-pong (≈ hours one to two). The agent reaches a state where it cannot make forward progress with the tools available, but does not recognize this and ask for help. Instead, it cycles between two or three tools, trying variations of the same calls, producing slightly different results each time, none of which resolve the ambiguity. The loop is busy. The work is not advancing. Without a no-progress guard, this burns budget efficiently and silently.
Context eviction of a load-bearing instruction (≈ hours two to three). A constraint or grounding fact that was near the start of the context gets evicted, summarized away, or pushed into the attended-less-reliably middle zone as tool-call history accumulates. The agent continues behaving correctly on new inputs but stops applying the evicted constraint. This failure is particularly hard to catch because it does not produce an error. It produces a plausible-looking result that violates a rule the agent was explicitly given hours ago.
Silent-success hallucination (≈ hours three to four). The agent reports completing a task it did not complete. The tool call returned an ambiguous result — a 200 status on an endpoint that should have returned a 201, or a partial write that did not error — and the model interpreted the ambiguity as success. In short runs, these are caught quickly because a human is watching. In long autonomous runs, the hallucinated success sits in the tool-call history as a fact, and subsequent reasoning is built on a false premise.
Budget exhaustion (≈ hours four to six). The token budget or cost cap fires. If the loop was designed with a clean exit, this produces a checkpoint and a summary. If it was not, it produces a stack trace and lost work. The timing varies widely based on task complexity, but unplanned budget exhaustion is almost always a stopping-condition design failure — either there was no cap, or the cap was checked at the end rather than per turn.
Deadlock on shared state (multi-agent, anywhere). When multiple agents have write access to shared state — the same file, the same database rows, the same in-progress PR — they can reach a state where each is waiting for the other to release a lock it holds, or more commonly, where they are overwriting each other's work without knowing it. This failure mode does not correlate with run length; it can appear at minute five in a poorly designed multi-agent system.
Why it matters now
The industry shifted from single-prompt features to multi-turn autonomous loops across 2024-2026. The evaluation discipline did not shift at the same speed. Most teams are still grading their agents at demo length and shipping them at production length. The gap between those two lengths is where the incident log lives.
METR's long-horizon task evaluations were among the first systematic attempts to grade autonomy at multi-hour lengths rather than single-prompt accuracy. The finding is consistently that performance curves which look flat and strong at short horizons show meaningful degradation at long horizons. The degradation patterns match the failure-mode timeline above.
This is not a critique of the model. It is a critique of evaluation methodology. A model that performs at 85% on a five-minute benchmark and 60% on a four-hour benchmark is not a worse model than its benchmark number suggests. It is a different product suitability question. The benchmark answers "can it do this task?" The long-horizon evaluation answers "can it sustain doing this task for the duration your production use case requires?" Those are not the same question, and they should not share the same answer.
A source you should trust
- Sweep AI's postmortems. Sweep ran an autonomous coding agent on GitHub issues for months and published operator-grade writeups of what broke at production length. The specifics are their codebase; the patterns are universal. Read them for the shape of the failures, not the fix.
- METR's long-horizon task evaluations. The systematic evidence base for why demo-length evaluation is the wrong signal. METR grades agents at one-hour, four-hour, and eight-hour task lengths. The gap in performance between horizons is the gap between "impressive demo" and "shippable product."
- Anthropic's "Building effective agents" (2024). The section on monitoring and observability directly addresses the telemetry signals you need to catch these failures before they bite.
A recipe
A six-hour shakedown protocol before any autonomous agent ships to production:
- Run the loop at 6× the expected production task length on a single representative task. Not a curated demo task — a representative production task with realistic ambiguity.
- Log per-turn across four metrics: tokens consumed (running total and per-turn delta), tool-call distribution (which tools are being called, how often), context-band sizes (how much of the window each band occupies), and no-progress streak (consecutive turns without a meaningful state change).
- Plot all four metrics as line charts. Prompt drift shows up as a step change in tool-call distribution. Tool-loop ping-pong shows up as an exponential no-progress streak. Context eviction shows up as a context-band chart where the system-prompt band shrinks. Silent-success hallucination does not show up on its own — you have to audit the output against ground truth.
- Reproduce any anomaly twice. A single anomaly is a flake. Two reproductions are a failure mode that ships eventually.
- Write the alert. The telemetry signal that caught the failure in the shakedown should be the same signal that fires in production observability. The shakedown is wasted if you do not wire it to an alarm.
The smell of it going wrong
- The team has never run the loop at production task length. "We tested it" means "we ran the demo three times."
- Recovery from a failed long run is "restart the run from the beginning." No checkpoints, no resume.
- The same hour-three failure mode appeared in the last two incidents and there is still no telemetry alarm for the signal that precedes it.
- The agent reports task completion and the output has not been checked against ground truth for the silent-success class of failures.
- The multi-agent system has two agents with write access to the same resource and no locking discipline.
A judgment call from real work
PL has lived the tool-loop ping-pong failure mode inside subagent runs on issue-triage batches.
The setup: a parent agent spawning parallel subagents to triage open GitHub issues, propose fixes, and draft PRs. Two subagents were assigned issues that both involved the same utility module — a shared helper that was referenced in multiple other files. Each subagent needed to read the module, understand its dependencies, and propose changes.
The failure was not contention on the module itself. It was in the retrieval logic. Each subagent's tool-call loop was: read the module, propose a change, read the PR context to check for conflicts, find a reference to the same module in the PR context, re-read the module with a slightly different query, get a slightly different result, re-propose, find the reference again. The loop was spending forty minutes per subagent reading the same file from slightly different angles, making no forward progress, before the parent agent's log showed the no-progress streak.
The signal that would have caught it: a no-progress counter on consecutive turns where the same file was being read without a write or a completion event. That counter became the regression test and later became a standard guard in the PL subagent harness.
The cost of the incident was not catastrophic — two subagents burning forty minutes of compute each. But it was the signal that made the monitoring discipline load-bearing rather than aspirational.
The broader principle it demonstrated: telemetry that catches failures in development is worth an order of magnitude more than telemetry added after the first production incident. The no-progress counter was a ten-line addition to the logging harness. It would have flagged the ping-pong loop in the first five minutes. It was added after the forty-minute loss.
Write the alert before you ship the loop.
Rules from this lesson
- Score agents at the horizon they ship at, not the horizon they demo at; the interesting failures all live after hour two.
- The four telemetry signals — token burn rate, tool-call distribution, context-band sizes, no-progress streak — are the cheapest possible early warnings for every major failure regime.
- Recovery from a failed long run is a designed feature; if you cannot resume from a checkpoint, you cannot run safely past hour two.
- Multi-agent shared state requires explicit locking discipline; "we'll be careful" is not a guard.