The most reliable predictor of a "$400 overnight incident" is an agent loop with no explicitly designed stopping condition.
Every agent loop eventually stops. The question is whether it stops because your code says so, or because a cloud billing alarm fires at 6am, or because the API returns a 429 and the run dies without a checkpoint. Those three outcomes are not equivalent. Only the first one is under your control.
The stopping condition is the component most often treated as an afterthought because it is invisible during successful runs. When the loop works, it stops because the work is done. The stopping condition code never fires. This creates a dangerous selection bias: teams evaluate their stopping conditions only on the failure cases they expected, not on the failure cases they did not. The ones they did not expect are the ones that produce incidents.
The stopping condition is the component most often treated as an afterthought because it is invisible during the demo. Demos stop when the presenter clicks stop. Production runs stop when something makes them stop. In the absence of a designed stopping condition, that something is usually a billing surprise.
There are four stopping conditions every production loop should declare explicitly, before the loop ships.
Token budget. A maximum token count for the entire run, checked per turn. Not checked at the end — per turn. Checking at the end is how you discover you spent $400. Checking per turn, with a budget remaining below some threshold, gives the loop an opportunity to summarize work, write a checkpoint, and exit cleanly instead of running until the API refuses it.
Wall-clock timeout. A maximum duration regardless of activity. An agent can consume zero tokens for thirty minutes because it is stuck waiting on an external API that is not responding. The wall-clock timeout catches this class of hang that the token budget misses.
No-progress timeout. A count of consecutive turns without meaningful state change. This catches the class of failure where the agent is technically running, consuming tokens, making tool calls — but not making progress. The most common variant is tool-loop ping-pong: the agent calls tool A, gets a result it does not know how to interpret, calls tool B, gets an ambiguous result, calls tool A again. The loop is active. The work is not advancing. Without a no-progress guard, this runs until the token budget or the wall-clock fires. With a no-progress guard, it fires early and asks for human input.
Goal predicate. A small, testable function that returns true when the JTBD is done. This is the clean exit — the loop is allowed to end before hitting any of the three defensive guards above because the work is genuinely finished. The goal predicate should be explicit code, not the model's judgment that it is done. "The model said it was finished" is a wish. A function that checks the state of the world against the definition of done is a guard.
These four conditions form a hierarchy. The goal predicate fires first when the work succeeds. The no-progress timeout fires when the agent is stuck. The token budget fires when the run is too expensive. The wall-clock timeout fires when the run is too slow or hung. Each guard is covering a failure mode the preceding guard does not reach.
A common mistake is to implement only the token budget and declare the stopping conditions "done." The token budget catches overspend. It does not catch a hung loop with slow I/O that costs nothing per turn but wastes hours. It does not catch a tool-loop ping-pong that is making progress-shaped noise without producing output. All four guards are needed because agent failure modes are not correlated.
Why it matters now
By 2026, most frameworks ship a default timeout. Almost none of them ship a default cost cap. The gap between "the framework has a timeout" and "I have designed stopping conditions" is where most incidents live.
Default timeouts are also usually measured in wall-clock time and set to values that make sense for interactive use, not for long autonomous runs. A 30-second HTTP timeout is reasonable for a chat interface. A 30-minute autonomous research task needs a stopping condition that is aware of work completion, not just elapsed time.
The proliferation of multi-agent patterns — parent agents spawning children, children spawning tools that themselves take meaningful compute — has made stopping conditions harder to reason about and more load-bearing. In a single-agent loop, a missing stopping condition burns one run's budget. In a parent-with-thirty-children pattern, a missing stopping condition on each child burns thirty runs' budgets simultaneously.
The math is worth doing explicitly. If each child runs without a cap and spends 10K tokens on average, thirty children spend 300K tokens. If the quota limit is 200K, the run hits the limit at child twenty, mid-work, with no clean exit. The remaining ten children never start. The first twenty never finish. All of it is recoverable if children have checkpoint protocols; none of it is recoverable if they do not. The stopping condition architecture determines which world you live in.
A source you should trust
- LangGraph state-machine documentation. The current best primitive for declarative stopping conditions in a multi-turn loop. LangGraph's explicit node/edge model forces you to think about stopping states as named objects, which is the right mental model. Even if you are not using LangGraph, read the stopping-condition section for the vocabulary.
- Aider's cost-tracking writeups. Aider is a coding agent that ships to end users; the team has written openly about per-run budget management. Operator-grade discipline from a team that had to solve this for a consumer product, not a demo.
- Anthropic's "Building effective agents" (2024). The section on human-in-the-loop patterns covers the design of confirmation gates and timeout escalations. Directly applicable to no-progress and goal-predicate design.
A recipe
Four stopping conditions every production loop should declare explicitly:
- Token budget. Set a maximum for the run. Check remaining budget at the start of each turn. Below 15% remaining: write a checkpoint and exit cleanly. Below 5%: write an emergency summary and exit. Never run to zero.
- Wall-clock timeout. Set a maximum duration. Wire it to a checkpoint write before exit — not just a hard kill. A hard kill loses the work; a checkpoint write enables resume.
- No-progress timeout. Define "meaningful state change" for your domain — a file written, a row inserted, a decision logged, a message sent. Count consecutive turns without one. At N consecutive no-progress turns, escalate to human input rather than continuing to burn budget.
- Goal predicate. Write the exit condition as a function before you write the loop body. If you cannot write the predicate, you have not fully specified the JTBD.
The smell of it going wrong
- The framework's default timeout is the only stopping condition the loop has. Nobody has read what that default is.
- Cost is checked after the run completes, not turn-by-turn. The first signal that a run went wrong is the invoice.
- "Done" is implicit in the model's output — the loop exits when the model stops producing tool calls. This is the goal predicate delegated to model behavior, which is not a designed guard.
- The agent enters a tool-loop ping-pong — two tools alternating without progress — and runs for forty minutes before a human notices. There is no no-progress guard.
- A parent agent spawns children with no per-child token cap and no knowledge of aggregate spend. The first 429 is also the moment the parent learns its children have been expensive.
A judgment call from real work
The PL quota-429 incident is the canonical case for why stopping conditions are load-bearing, not optional.
The setup: a parent Claude Code session spawning thirty-plus parallel subagents on an issue-triage batch. Each subagent was working on an independent issue — reading code, proposing a fix, writing a draft PR. The parent had no per-child token cap. The children had no individual stopping conditions beyond the framework's default timeout.
The quota pool was shared. The parent and all thirty children drew from the same bucket. As children accumulated context — long code files, tool-call history, multiple retrieval passes — the burn rate accelerated. The quota hit the rate limit around the forty-minute mark.
At that point, none of the children had committed. The progress from all thirty parallel runs was in memory and in-flight tool calls. It was lost.
The fix had two parts. First, per-child token caps: each spawned subagent received an explicit budget that was a fraction of the parent's remaining quota, calculated at spawn time. The parent tracked aggregate spend and refused to spawn new children when the remaining quota fell below a safety threshold. Second, checkpoint protocol: children were required to commit work-in-progress to git before advancing past a certain complexity threshold. A 429 mid-work became a delay and resume, not a total loss.
The parent also got a deadman switch: a goal predicate that could terminate the batch cleanly if quota fell below the minimum needed to recover. "Run until something stops us" became "run within designed bounds, exit gracefully when bounds are reached."
Both components — the per-child cap and the checkpoint protocol — were direct implementations of stopping condition design principles. They were not novel engineering. They were the application of the four guards above to a multi-agent topology.
Rules from this lesson
- Every loop needs a token budget, a wall-clock timeout, a no-progress timeout, and a goal predicate — all four, not just one.
- Stopping conditions are designed objects, not emergent properties; write them before you write the loop body.
- Budget checks happen per turn, not at the end; checking at the end is how you discover you spent $400.
- In multi-agent topologies, per-child caps and aggregate parent meters are both required; shared quota is shared blast radius.