Most advice about multi-agent systems is written without an incident behind it. Here is a lesson with the incident attached.
The design was sound in principle. PL had accumulated a backlog of GitHub issues: bug reports, feature requests, content gaps flagged by users. The task — read each issue, draft a response or a fix, open a PR — is exactly the embarrassingly parallel fan-out shape described in Lesson 3.
Each issue is independent. The variance is high. The synthesis step (reviewing the resulting PRs) is cheap relative to the per-issue work. The fan-out was justified.
The implementation was missing four things. Not one of them was obvious from the architecture diagram. All four were visible in retrospect. The incident was the teacher.
The picture
A timeline. T=0: thirty subagents spawned simultaneously, each assigned one issue, each running in its own worktree.
T=1 to T=11 minutes: steady-state consumption. Each subagent is reading the codebase, running searches, drafting changes. The aggregate token consumption is the sum of thirty active agents, each doing meaningful work.
T=12 minutes: the aggregate consumption approaches the account quota cliff. No signal.
T=14 minutes: the cliff is crossed. The Anthropic API returns 429 responses to every subagent simultaneously.
T=14 to T=15 minutes: cascade. Every subagent retries, hits the same 429, retries again. The queue of in-flight requests amplifies the problem rather than resolving it.
T=15 minutes: all thirty subagents are in a retry loop or stalled.
T+? recovery attempt: the parent tries to restart the batch with the same fan-out count, hits the same cliff, because the recovery attempt was not designed with the incident in mind.
At the moment of the cascade: zero subagents had committed a checkpoint. The work done in the first fourteen minutes — across thirty agents — was not persisted anywhere. Every subagent that had been making progress would restart from zero.
Why it matters now
Most multi-agent advice in 2024–2026 was written from research contexts where quota is generous and incidents are rare.
This incident happened in a production context with a real account quota and real work at stake. Specificity is what makes lessons stick.
The four misses in this incident are not exotic. They are the four most common omissions in fan-out system designs. Any team building a fan-out system for the first time is likely to miss at least three of them. The goal of this lesson is to let you miss zero.
A source you should trust
- This incident. It is the primary source. Read the timeline, the diagnosis, and the fixes as a postmortem, not as a cautionary tale. Postmortems are the source code of harness engineering.
- Lesson 7 of this course. The conceptual framework for quota pooling. This lesson is the incident that validates the framework; they are designed to be read together.
A recipe
A four-step incident-derived checklist, applied before any fan-out of more than five agents:
- Aggregate quota observability at the parent. Before spawning, wire up a running sum of tokens consumed across all active subagents. The threshold to trigger a spawn pause is 80% of account quota. The first miss in the PL incident was the absence of this signal; the cascade was the first indication that quota was constrained.
- Per-subagent checkpoint protocol. Before spawning, require each subagent to write a structured checkpoint after completing each meaningful unit of work. The checkpoint contains enough information to resume from that point. If the batch is killed, recovery reads checkpoints and skips completed work. The second miss: no checkpoints existed, so recovery restarted every subagent from zero, guaranteeing another collision with the quota cliff.
- Staggered spawn. Do not spawn all N subagents at once. Spawn the first five, wait for initial consumption to stabilize, spawn five more, and continue. Simultaneous spawn creates a consumption spike in the first thirty seconds that is not representative of steady-state consumption and can exceed quota before steady state is reached. The third miss: thirty simultaneous spawns.
- Pre-flight quota check. Before any large fan-out, compute the safe fan-out count: floor(0.6 * account_quota / per_agent_worst_case). The 0.6 factor leaves 40% of quota as headroom even if every agent hits its worst case. If the safe fan-out count is less than the intended count, reduce the count or stagger across multiple runs. The fourth miss: no pre-flight check. The fan-out count was chosen by intuition, not by calculation.
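The pre-flight step can be sketched in a few lines. This is a minimal illustration, not the parent's actual spawn logic: the function names `safe_fanout` and `plan_runs` are hypothetical, the token figures in the usage note are invented, and the safety factor is passed as an integer percentage so the arithmetic stays exact.

```python
def safe_fanout(account_quota: int, per_agent_worst_case: int, safety_pct: int = 60) -> int:
    """Return floor(safety_pct/100 * account_quota / per_agent_worst_case).

    Both token figures must be in the same unit and cover the same quota window.
    Integer arithmetic avoids floating-point surprises right at the boundary.
    """
    if account_quota < 0 or per_agent_worst_case <= 0:
        raise ValueError("quota must be >= 0 and worst case must be > 0")
    if not 0 < safety_pct <= 100:
        raise ValueError("safety_pct must be in (0, 100]")
    return (account_quota * safety_pct) // (per_agent_worst_case * 100)


def plan_runs(intended_count: int, safe_count: int) -> list[int]:
    """Split an intended fan-out into consecutive runs of at most safe_count agents."""
    if safe_count < 1:
        raise ValueError("quota does not cover even one worst-case agent")
    runs = []
    remaining = intended_count
    while remaining > 0:
        size = min(safe_count, remaining)
        runs.append(size)
        remaining -= size
    return runs
```

With an invented quota of 1,000,000 tokens and a worst case of 100,000 tokens per agent, `safe_fanout(1_000_000, 100_000)` gives 6, and `plan_runs(30, 6)` splits a thirty-issue backlog into five runs of six.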
The smell of it going wrong
"It worked with three" is the most reliable precursor to this failure. The inference, "if three worked, thirty will work ten times as well," ignores that aggregate consumption grows in proportion to fan-out count while the account quota is a fixed ceiling. Nothing degrades gradually as the count rises; the batch fails all at once when the ceiling is crossed.
Three agents at 10% quota each leave 70% unused. Thirty agents at 10% quota each require 300% of quota. The math is not subtle; the failure comes from not doing it.
The parent having no aggregate consumption view is both a smell and a structural cause. In the PL incident, the parent knew the status of each subagent (running, complete, failed) but not the aggregate token consumption. That gap meant the parent had no way to detect the approaching cliff. The first signal was the cliff itself.
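A minimal sketch of the missing aggregate view, assuming each subagent reports the tokens it has used after every API call. The class name `QuotaLedger` and its methods are hypothetical; how usage figures reach the parent is outside the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaLedger:
    """Running sum of tokens consumed across all subagents (illustrative only)."""

    account_quota: int
    pause_pct: int = 80  # pause spawning at 80% of account quota
    used_by_agent: dict[str, int] = field(default_factory=dict)

    def record(self, agent_id: str, tokens: int) -> None:
        """Add the tokens reported by one subagent to its running total."""
        self.used_by_agent[agent_id] = self.used_by_agent.get(agent_id, 0) + tokens

    @property
    def total_used(self) -> int:
        """Aggregate consumption: the number the parent lacked in the incident."""
        return sum(self.used_by_agent.values())

    def should_pause_spawning(self) -> bool:
        """True once aggregate consumption reaches the pause threshold."""
        return self.total_used * 100 >= self.pause_pct * self.account_quota
```

The point of the design is that the threshold is checked against the sum, not against any single subagent: thirty agents can each look healthy while the total is already past the line.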
Subagents without checkpoints guarantee that any mid-batch failure requires a full restart. The cost of adding checkpoints is low — a file write at each milestone. The cost of not having them is the full compute cost of every in-flight subagent at the time of failure.
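The "file write at each milestone" can be as small as the sketch below. It is an assumption-laden illustration: the JSON layout, the function names, and the milestone name `pr_opened` are all hypothetical, since the lesson does not specify the checkpoint format. The write goes to a temporary file first and is then renamed, so a subagent killed mid-write does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

FINAL_MILESTONE = "pr_opened"  # hypothetical name for "this issue is done"


def write_checkpoint(directory: Path, issue_id: str, milestone: str, state: dict) -> Path:
    """Persist one subagent's progress; replaces any earlier checkpoint for the issue."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{issue_id}.json"
    payload = {"issue_id": issue_id, "milestone": milestone, "state": state}
    fd, tmp_name = tempfile.mkstemp(dir=str(directory), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)  # atomic rename within the same directory
    return target


def load_checkpoint(directory: Path, issue_id: str) -> Optional[dict]:
    """Return the last checkpoint for an issue, or None if none was written."""
    target = directory / f"{issue_id}.json"
    if not target.exists():
        return None
    return json.loads(target.read_text(encoding="utf-8"))


def pending_issues(directory: Path, issue_ids: list[str]) -> list[str]:
    """Issues recovery still has to work on: no checkpoint, or not yet final."""
    pending = []
    for issue_id in issue_ids:
        checkpoint = load_checkpoint(directory, issue_id)
        if checkpoint is None or checkpoint["milestone"] != FINAL_MILESTONE:
            pending.append(issue_id)
    return pending
```

On recovery, the parent would call `pending_issues` to skip completed work and hand each remaining subagent its last `state` instead of restarting it from zero.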
A judgment call from real work
The decision to run thirty agents in parallel was not reckless. It was based on a correct analysis of the task shape — embarrassingly parallel, high variance, cheap synthesis — and an incorrect assumption about resource constraints.
The design was coherent at the task level and wrong at the infrastructure level. That distinction matters for how you learn from the incident.
The lesson is not "be more careful about resource limits." That is a feeling, not a fix.
The lesson is: aggregate quota observability is a prerequisite for any fan-out system, not an optional instrument. Checkpoint protocol is a prerequisite, not a nice-to-have. Staggered spawn is a default pattern, not an optimization. Pre-flight quota calculation is a step in the design process, not a post-incident retrofit.
The post-incident changes: a pre-flight quota check was added to the parent's spawn logic. Stagger interval was set at thirty seconds per new subagent. Checkpoints were added at three points in each subagent's workflow. The parent now logs aggregate token consumption and pauses spawning at 80% of account quota.
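A sketch of how a stagger interval and a spawn pause might combine in the parent, under stated assumptions: `spawn` and `should_pause` are hypothetical callbacks supplied by the caller (the second could be a check of aggregate consumption against the 80% threshold), and the `sleep` parameter exists only so the loop can be exercised without waiting in real time. It is not the parent's actual spawn logic.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def staggered_spawn(
    issue_ids: Iterable[str],
    spawn: Callable[[str], Awaitable[object]],
    should_pause: Callable[[], bool],
    interval_s: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> list:
    """Start one subagent per interval, holding off while should_pause() is true."""
    tasks = []
    for index, issue_id in enumerate(issue_ids):
        if index > 0:
            await sleep(interval_s)  # stagger: one new subagent per interval
        while should_pause():
            await sleep(interval_s)  # quota is tight: wait, then check again
        tasks.append(asyncio.create_task(spawn(issue_id)))
    # Results come back in spawn order once every subagent has finished.
    return await asyncio.gather(*tasks)
```

Already-running subagents keep working while the loop waits, so a pause slows the ramp without stopping the batch.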
Since these changes were made, no quota incident has recurred across dozens of multi-issue fan-out batches.
That is the measure of a postmortem that worked: not "we understood what happened" but "the failure mode has not recurred because the system now prevents it structurally."
Rules from this lesson
- Read incidents like source code; they contain causal chains that no architecture diagram will show.
- Every fan-out system requires all four safeguards: aggregate quota observability, per-subagent checkpoints, staggered spawn, and a pre-flight quota calculation. The PL incident missed all four; each one alone would have reduced the loss.
- The fix is never "be more careful"; the fix is observable telemetry and designed recovery.