The move
Treat agent production failures as a different species, not as "feature bugs that happen to involve AI."
If you have shipped single-prompt AI features, you have a mental model of what can go wrong.
Hallucination. Latency spike. A refusal on a valid input. Prompt injection from a creative user.
These failures are well-documented, and most teams have developed instincts for them.
Then you ship an autonomous agent, and you discover that this mental model is only half the map.
The second half is unfamiliar territory, and because its regimes have no names yet, teams regularly mislabel them as the failures they already know.
A runaway loop is not a latency spike. A cost explosion is not a hallucination. A multi-agent deadlock is not a refusal.
Treating all of them as "feature bugs that happen to involve AI" means you instrument the wrong things, miss the real failure, and surface it only when a customer files a ticket or the invoice arrives.
The picture
Two columns.
On the left: single-prompt feature failures. Hallucination. Latency. Refusals. Prompt injection. Token-limit truncation. These are bounded — a single call goes in, something goes wrong, the call ends. Damage is local and synchronous.
On the right: agent production failures. Start with everything on the left, then add the new taxonomy.
- Runaway loops: the agent re-plans and re-executes indefinitely because its stopping condition is never met.
- Cost spikes: a routine session balloons because a tool call returns unexpectedly large context.
- Tool flapping: the agent toggles between two tool calls without converging, spending tokens on each cycle.
- Context blow-up on long-tail input: a document nobody tested turns a 2k-token context into a 40k-token context.
- Multi-agent deadlock: two subagents wait on each other's outputs with no timeout.
- Memory write-amplification: an agent writes to shared memory in a loop, growing the store without bound.
- Checkpoint corruption: a crash leaves partial state, and the resume path produces worse output than starting over would.
These are additional failure modes, not relabeled versions of the old ones. They require different telemetry, different kill switches, and different runbooks.
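A kill switch for the first two regimes can be as small as a per-session budget. The sketch below is illustrative only: `RunBudget`, `BudgetExceeded`, and the limits (20 turns, $2.00) are assumptions made for this example, not part of any particular agent framework, and the "agent loop" is simulated by a list of per-turn costs.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an agent session crosses a hard limit."""


@dataclass
class RunBudget:
    # Illustrative limits (assumptions); tune them to your own system.
    max_turns: int = 20
    max_cost_usd: float = 2.00
    turns: int = 0
    cost_usd: float = 0.0

    def charge(self, turn_cost_usd: float) -> None:
        """Record one completed turn and stop the session if a limit is crossed."""
        self.turns += 1
        self.cost_usd += turn_cost_usd
        if self.turns > self.max_turns:
            raise BudgetExceeded(
                f"runaway loop: {self.turns} turns > {self.max_turns}"
            )
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"cost spike: ${self.cost_usd:.2f} > ${self.max_cost_usd:.2f}"
            )


def run_session(turn_costs, budget: RunBudget) -> str:
    """Drive a simulated agent loop; each item stands in for one turn's cost."""
    try:
        for cost in turn_costs:
            budget.charge(cost)
    except BudgetExceeded as exc:
        return f"killed: {exc}"
    return "completed"
```

The point of the design is that the two limits are named after the regimes they guard against, so the kill message already tells the on-call engineer which runbook to open.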
Why it matters now
By 2026 the playbook for single-prompt feature operations is well-documented. Sentry captures uncaught exceptions. A timeout kills runaway requests. Token counts show up in most major tracing tools.
The infrastructure is there; teams just wire it in.
The playbook for autonomous-agent operations is being written in real time by every team shipping one.
Naming the regimes precisely is the prerequisite to instrumenting them. You cannot define an alert threshold for a runaway loop until you have defined "runaway loop" in your system's terms. You cannot wire cost-spike telemetry until you know which cost number triggers concern.
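One way to make a definition such as "runaway loop" concrete is to write it as a rule over a sliding time window. The following is a minimal sketch, assuming a hypothetical `WindowedThreshold` class and caller-supplied timestamps in seconds; the limit and the window are placeholders for numbers only your system can provide.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WindowedThreshold:
    """Fires when the sum of a signal over a sliding window exceeds a limit."""

    limit: float           # the value that triggers concern
    window_seconds: float  # how far back the rule looks
    _events: deque = field(default_factory=deque)

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one observation; return True if the rule is now firing."""
        self._events.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        # Drop observations that have aged out of the window.
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return sum(v for _, v in self._events) > self.limit


# Hypothetical definition: "runaway loop" means more than 30 agent turns
# in one session within 5 minutes. Call observe(now, 1) after every turn.
runaway_loop = WindowedThreshold(limit=30, window_seconds=300)
```

The same class covers a cost-spike rule if the observed value is dollars per turn rather than a turn count.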
The teams that are ahead are the ones that wrote the inventory before the first incident, not after.
A source you should trust
Sweep AI's published postmortems are operator-grade documentation of agents failing in production over extended runs. They are specific, unglamorous, and representative of what actually happens when autonomous systems meet real users. Read the postmortems before you write the first line of agent code.
Aider's "things that went wrong" writeups are practical and recent. Aider ships a coding agent used by thousands of developers; their failures are a reasonable sample of what long-running coding-agent systems encounter in the wild. The postmortems cover both single-session failures and cross-session patterns.
A recipe
The production-failure inventory every agent system should maintain before launch:
- List the production failures you have already seen in development and testing, with date and resolution. Even pre-launch failures belong here.
- List the failures you expect but have not seen, organized by category: runaway loop, cost spike, tool flapping, context blow-up, memory write-amplification, multi-agent deadlock, checkpoint corruption.
- For each expected-but-unseen failure category, write down the telemetry that would catch it. "We would know from X exceeding Y within Z minutes."
- Triage: which categories must be instrumented before launch, and which can wait for first incident? Write the triage. Implicit deferrals evaporate; written deferrals get tracked.
- Assign an owner to each pre-launch category. "The team" is not an owner.
The inventory is a living document. After each incident, the entry for that failure mode gets updated with what actually happened versus the prediction. That gap — between expected and actual — is where the most useful learning lives.
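The recipe above can live in a wiki page, but keeping it as data makes the gaps checkable. A minimal sketch follows, assuming a hypothetical `InventoryEntry` record and a `launch_blockers` helper; the field names are illustrative, and the category list is copied from the recipe.

```python
from dataclasses import dataclass
from typing import Optional

CATEGORIES = [
    "runaway loop", "cost spike", "tool flapping", "context blow-up",
    "memory write-amplification", "multi-agent deadlock", "checkpoint corruption",
]


@dataclass
class InventoryEntry:
    category: str
    seen: bool = False               # already observed in dev or testing?
    telemetry: Optional[str] = None  # "X exceeding Y within Z minutes"
    pre_launch: bool = True          # the written triage decision
    owner: Optional[str] = None      # a named person, not "the team"


def launch_blockers(entries):
    """Return human-readable gaps that should block launch."""
    problems = []
    by_category = {e.category: e for e in entries}
    for category in CATEGORIES:
        entry = by_category.get(category)
        if entry is None:
            problems.append(f"{category}: no inventory entry (implicit deferral)")
            continue
        if not entry.pre_launch:
            continue  # deferred in writing: tracked, not blocking
        if not entry.telemetry:
            problems.append(f"{category}: no telemetry defined")
        if not entry.owner or entry.owner.strip().lower() == "the team":
            problems.append(f"{category}: no named owner")
    return problems
```

Running `launch_blockers` in CI is one option; the more important property is that a missing entry is reported as an implicit deferral instead of passing silently.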
The smell of it going wrong
The team has only ever debugged short-run failures. Every incident it has responded to was bounded within a single API call. It has no intuition for what a twelve-turn runaway loop looks like in traces.
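To build that intuition, it helps to see how small the signature of a loop is once the trace is reduced to one entry per turn. The sketch below assumes each turn is summarized as a hashable signature such as `(tool_name, args_digest)`; the function name `repeating_tail` and the example trace are made up for illustration.

```python
def repeating_tail(calls, max_period=3, min_repeats=3):
    """Return the period of a pattern repeated at the end of `calls`, or None.

    `calls` is a list of hashable turn signatures, oldest first.
    A period of 1 is a stuck call; a period of 2 is classic tool flapping.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(calls) < needed:
            continue
        tail = calls[-needed:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return period
    return None


# A hypothetical twelve-turn trace: two productive turns, then the agent
# alternates between the same edit and the same failing test five times.
trace = [("search", "q1"), ("read", "doc1")] + [("edit", "f"), ("test", "f")] * 5
print(repeating_tail(trace))  # prints 2
```

This only catches exact repeats. A loop whose arguments drift slightly each cycle needs a fuzzier signature, which is a design decision for your own system.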
"We have monitoring" means page-load times and HTTP error rates. It does not mean agent-specific signals. The monitoring stack was designed for a request-response service and reused without modification.
A production incident's root cause is a failure category the team had never named. The first postmortem includes a long section about "we didn't even know this could happen." That is the tell.
The same failure mode hits twice because the first incident's telemetry was not added. The second incident looks slightly different on the surface — different inputs, different tool — but the root cause is the same unlabeled regime.
A judgment call from real work
PL's polish-foundation-sprint phase included a long stretch of failed deploys that were effectively invisible. The deploy-dev workflow was failing on lockfile drift — the Docker build was producing a different lockfile than the committed one — but the user-visible site continued to serve traffic from a Vercel project belonging to an earlier era of the stack. Talvinder's QA reports were being read as "there may be an issue with the dev branch," but because the stale Vercel instance was serving content, the reports looked ambiguous rather than definitive.
The failure was not surfaced until someone asked: which URL is the user actually testing? The answer was the Vercel URL, not the Fly staging URL. The monitoring stack had never been updated to point at the new infrastructure.
The lesson for agent systems is the same as for any production system, but the stakes are higher because an agent can spend money during the confusion. Production-failure observability includes asking "which production are we watching?" That question should be on the pre-launch checklist, answered with a URL and a screenshot of the current monitoring dashboard's target list.
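The checklist question can also be asked mechanically. Here is a minimal sketch, assuming you can export two lists: the URLs users and testers actually hit, and the URLs your monitors target. The function names are hypothetical, and the code assumes every URL includes a scheme such as `https://`.

```python
from urllib.parse import urlsplit


def normalize_host(url: str) -> str:
    """Reduce a URL to its lowercase hostname for comparison."""
    return (urlsplit(url).hostname or "").lower()


def unwatched_targets(user_facing_urls, monitored_urls):
    """Return user-facing hosts that no monitor points at."""
    monitored = {normalize_host(u) for u in monitored_urls}
    user_facing = {normalize_host(u) for u in user_facing_urls}
    return sorted(user_facing - monitored)
```

For example, `unwatched_targets(["https://app.example.com"], ["https://legacy.example.net"])` returns `["app.example.com"]`: the host users are on is not the host being watched. A non-empty result before launch is the signal to stop and reconcile the two lists.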
The second thing to learn from this pattern is sequencing. In a single-prompt feature, a misfired deploy is a nuisance. In an autonomous agent, a misfired deploy against the wrong environment can mean real user interactions go to a system that is not instrumented, not safe, and not recoverable. The stakes of the "watching the wrong production" failure compound with autonomy.
The next lesson establishes the instrumentation foundation that turns the failure-mode inventory into something operational: span-per-turn tracing, the minimum schema for capturing what an agent does at each step. Without that foundation, the inventory is a wish list.
Rules from this lesson
- Agent production failures are their own taxonomy; name them before launch, not after the first incident.
- Maintain a written inventory of expected failure modes and the telemetry that catches each; implicit deferrals evaporate.
- Among the most expensive incidents in agent systems are those where the team is watching the wrong production; verify the URL before you trust the signal.
- Assign a named owner to each pre-launch observability gap; "the team" is not an owner.
- The inventory is a living document; update it after every incident with what actually happened versus what was predicted — that gap is where the most useful learning lives.