Quota pooling and blast radius — the under-discussed cost of fan-out — Multi-Agent Orchestration — fan-out, swarms, and the cost of doing both badly

Shared state is the source of most multi-agent coordination problems. Lesson 6 covered the visible shared state: files, database rows, memory stores.

This lesson covers the invisible shared state that the literature almost never discusses: quota.

Every agent in a fan-out system shares the parent's API quota. Tokens per minute, requests per minute, concurrent connections — these limits are enforced at the account or organization level. From the platform's perspective, ten subagents spawned by one parent are ten threads from one identity.

The quota is not multiplied by the number of agents. It is divided among them.

This means that fan-out has a hard upper bound that is invisible from inside any individual agent.

No single subagent is consuming more than its share. But the parent's account is consuming the sum of all their shares, and the sum can cross the rate-limit cliff long before anyone notices.

When the cliff is crossed, the platform does not throttle whichever subagent happened to push the total over. It throttles the account.

Every active subagent receives a 429. The work that has not yet been committed is lost.

The picture

Draw a quota pool as a bucket on the left. The bucket has a capacity: some number of tokens per minute. On the right, draw N subagents, each connected to the same bucket with a drain. Each drain pulls tokens at a rate proportional to that subagent's workload. The total drain rate is the sum across all subagents.

Now draw a cliff line at the bucket's capacity. When the total drain rate exceeds that line, the platform enforces the limit — not gradually, but as a hard stop. Every subsequent request from every subagent receives a 429.
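The picture can be modeled in a few lines. This is a minimal sketch that assumes a fixed one-minute window on tokens per minute. Real platforms may meter differently, for example with a continuously refilled token bucket. `SharedQuota` and the 100,000 TPM figure are hypothetical. The point is structural: the admission check reads one shared counter and never looks at which agent is asking.

```python
from dataclasses import dataclass


@dataclass
class SharedQuota:
    """Hypothetical model of one account-level tokens-per-minute limit.

    Every subagent draws from the same counter; the platform sees only the sum.
    Assumption: a fixed one-minute window, not any specific platform's algorithm.
    """
    tokens_per_minute: int
    used_this_minute: int = 0

    def request(self, agent_id: str, tokens: int) -> int:
        """Return 200 if the request fits under the shared limit, else 429."""
        if self.used_this_minute + tokens > self.tokens_per_minute:
            # agent_id is never consulted: the limit belongs to the account.
            return 429
        self.used_this_minute += tokens
        return 200

    def new_minute(self) -> None:
        """Start a fresh one-minute window."""
        self.used_this_minute = 0


# Ten subagents, each asking for a "reasonable" 15% of a 100,000 TPM quota.
quota = SharedQuota(tokens_per_minute=100_000)
statuses = [quota.request(f"agent-{i}", 15_000) for i in range(10)]
print(statuses)  # [200, 200, 200, 200, 200, 200, 429, 429, 429, 429]
print(quota.request("agent-0", 15_000))  # 429: agent-0 succeeded earlier, but the pool is full
```

Each agent asked for 15% of the quota, which looks reasonable from inside that agent. The seventh request crosses the line. From then on, every further 15,000-token request is refused until the window resets, including requests from agents that succeeded earlier.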

The key observation: no single subagent is at fault. Each one is consuming a reasonable share. The problem is the aggregate, and the aggregate is only visible at the parent level.

Individual subagents have no view of total consumption. The parent, if it is not instrumenting aggregates, also has no view. The first signal of a problem is the 429 itself.

The second key observation: the work lost is proportional to the number of agents with uncommitted progress at the moment of the 429.

This is the checkpoint problem. It is addressed in Lesson 8. But it starts here, with quota. If thirty agents have each been running for twelve minutes and none has committed a checkpoint, the loss is thirty times twelve minutes: 360 agent-minutes, or six hours of work.
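Here is the loss arithmetic as a minimal sketch. It assumes the parent records when each subagent started and when it last committed a checkpoint. `AgentProgress` and `lost_agent_minutes` are hypothetical names, and the checkpoint at minute ten in the second case is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentProgress:
    """Hypothetical per-subagent record kept by the parent."""
    started_at_min: float                 # when the subagent began, minutes into the batch
    last_checkpoint_min: Optional[float]  # time of last committed checkpoint; None if never


def lost_agent_minutes(agents: list[AgentProgress], failure_at_min: float) -> float:
    """Work discarded if every active subagent is cut off at failure_at_min."""
    lost = 0.0
    for a in agents:
        # Without a checkpoint, everything since the agent started is uncommitted.
        if a.last_checkpoint_min is not None:
            committed_at = a.last_checkpoint_min
        else:
            committed_at = a.started_at_min
        lost += max(0.0, failure_at_min - committed_at)
    return lost


no_checkpoints = [AgentProgress(0.0, None) for _ in range(30)]
print(lost_agent_minutes(no_checkpoints, 12.0))  # 360.0 agent-minutes

checkpointed = [AgentProgress(0.0, 10.0) for _ in range(30)]
print(lost_agent_minutes(checkpointed, 12.0))    # 60.0 agent-minutes
```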

Why it matters now

The 2024–2026 era's API rate limits are real constraints, not theoretical ones. Anthropic, OpenAI, and Google all enforce per-account and per-organization quotas.

The limits are generous enough that small fan-out experiments — N equals three, N equals five — rarely hit them.

That is exactly why the problem is discovered late: it does not appear until someone tries a larger fan-out, and the aggregate consumption pattern that was fine at small scale is catastrophic at large scale.

The literature on multi-agent design is largely written from a research context where compute is abundant and quota management is not a first-class concern.

The production context is different. Quota is the most-shared resource in any multi-agent system and the least likely to appear in any design doc.

A source you should trust

Platform rate-limit documentation. For Anthropic, for OpenAI, for whatever platform you are building on: read the rate-limit documentation before you design the fan-out, not after the first incident. The limits are per-account, not per-agent.
The PL parallel-triage incident. Lesson 8 covers it in full. This lesson exists to give you the conceptual framework before you read the incident report.

A recipe

A quota-aware fan-out protocol, applied before any large fan-out is spawned:

Know your quota. Not the per-agent quota: the account-level quota. Tokens per minute, requests per minute, concurrent connections. Find it in the platform documentation or the account dashboard.
Estimate per-subagent consumption at worst case, not average. If a typical subagent uses 10,000 tokens per minute in steady state but can spike to 40,000 on a complex task, use 40,000 in the calculation.
Cap fan-out at: floor(account_quota / per_agent_worst_case * 0.6). The 0.6 safety factor leaves buffer for variance and for the parent's own consumption.
Stagger spawns. Do not fire all N subagents simultaneously. Spawn the first batch, wait for initial tool calls to complete and consumption to stabilize, then spawn the next batch.
Instrument aggregate consumption at the parent level. Track total tokens consumed across all active subagents, not just per-subagent. If aggregate consumption approaches 80% of quota, pause spawning new subagents.
Design per-subagent checkpoints before any large fan-out. If a 429 kills the batch, recovery starts from the last checkpoint, not from zero.
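The first three steps reduce to arithmetic the parent can run before spawning anything. This is a minimal sketch that assumes tokens per minute is the binding limit; the same check applies to requests per minute and concurrent connections. `fanout_cap`, `plan_batches`, and the 400,000 TPM account limit are hypothetical. How long to wait between batches is left to the orchestrator.

```python
import math
from fractions import Fraction


def fanout_cap(account_tpm: int, worst_case_tpm: int, safety: float = 0.6) -> int:
    """floor(account_quota / per_agent_worst_case * safety).

    Fraction(str(safety)) keeps 0.6 exact, so binary floating-point
    rounding cannot nudge the result just below an integer before the floor.
    """
    if worst_case_tpm <= 0:
        raise ValueError("worst_case_tpm must be positive")
    return math.floor(Fraction(account_tpm, worst_case_tpm) * Fraction(str(safety)))


def plan_batches(requested: int, cap: int, batch_size: int) -> list[list[int]]:
    """Clamp the fan-out to the cap, then split agent indices into spawn batches."""
    n = min(requested, cap)
    return [list(range(start, min(start + batch_size, n))) for start in range(0, n, batch_size)]


ACCOUNT_TPM = 400_000  # hypothetical account-level tokens-per-minute limit
print(fanout_cap(ACCOUNT_TPM, 10_000))  # sized on the average: 24
print(fanout_cap(ACCOUNT_TPM, 40_000))  # sized on the worst case: 6
print(plan_batches(requested=30, cap=6, batch_size=3))  # [[0, 1, 2], [3, 4, 5]]
```

Sizing on the 10,000-token average would allow 24 agents. If those 24 spiked together, they would ask for 960,000 tokens per minute, 2.4 times the limit. The worst-case cap of 6 avoids that.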

The smell of it going wrong

"It worked with three, let's try thirty" is the most common precursor to a quota incident. The logic is not unreasonable on its face. But three agents at 10% of quota each leave 70% unused, while thirty agents at 10% each require 300% of quota. Consumption scales linearly with N. The outcome does not, because the cliff is hard: once the aggregate passes 100%, the whole account is throttled, not just the excess.
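The same arithmetic in code, using integer percentages for simplicity. `headroom_pct` and `first_n_over_limit` are hypothetical helpers. Headroom falls in a straight line as N grows. The failure does not: it switches on at a single value of N.

```python
def headroom_pct(n_agents: int, pct_each: int) -> int:
    """Percent of the account quota left unused; negative means past the limit."""
    return 100 - n_agents * pct_each


def first_n_over_limit(pct_each: int) -> int:
    """Smallest fan-out whose combined consumption exceeds 100% of quota."""
    return 100 // pct_each + 1


print(headroom_pct(3, 10))     # 70
print(headroom_pct(30, 10))    # -200
print(first_n_over_limit(10))  # 11: ten agents fit exactly, the eleventh crosses the cliff
```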

The second tell is a parent with no aggregate view. If the orchestrator is only tracking per-subagent status and not summing consumption, it cannot warn before the cliff is reached.
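An aggregate view can be as small as a sliding-window sum. This is a minimal sketch that assumes each subagent reports its token usage to the parent with a timestamp. `AggregateMeter` is hypothetical. Its 60-second window and 80% pause threshold mirror the recipe rather than any platform's own metering.

```python
from collections import deque


class AggregateMeter:
    """Parent-side tally of tokens used by ALL subagents in a sliding window.

    Hypothetical helper: subagents report usage to the parent, which holds
    the only view of the total.
    """

    def __init__(self, tokens_per_minute: int, pause_at: float = 0.8, window_s: float = 60.0):
        self.limit = tokens_per_minute
        self.pause_at = pause_at
        self.window_s = window_s
        self._events = deque()  # (timestamp_s, tokens), oldest first
        self._total = 0

    def _expire(self, now_s: float) -> None:
        # Drop usage that has aged out of the window.
        while self._events and self._events[0][0] <= now_s - self.window_s:
            _, tokens = self._events.popleft()
            self._total -= tokens

    def record(self, now_s: float, tokens: int) -> None:
        """Call whenever any subagent reports usage; timestamps must not go backwards."""
        self._events.append((now_s, tokens))
        self._total += tokens
        self._expire(now_s)

    def utilization(self, now_s: float) -> float:
        self._expire(now_s)
        return self._total / self.limit

    def may_spawn(self, now_s: float) -> bool:
        """False once aggregate use reaches the pause threshold (80% by default)."""
        return self.utilization(now_s) < self.pause_at


meter = AggregateMeter(tokens_per_minute=100_000)
meter.record(0.0, 30_000)     # agent A
meter.record(10.0, 30_000)    # agent B
meter.record(20.0, 25_000)    # agent C
print(meter.may_spawn(20.0))  # False: 85% of quota used in the last minute
print(meter.may_spawn(61.0))  # True: agent A's usage has aged out, 55% remains
```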

The third tell is the absence of checkpoints. When a 429 hits a batch without checkpoints, every in-flight subagent starts over from scratch on retry. The recovery attempt often hits the same cliff because the retry fires the same number of agents simultaneously.
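A recovery plan that avoids hitting the same cliff resumes each agent from its own checkpoint and brings the agents back in batches. This is a minimal sketch that assumes the parent has already waited out the limit window and knows each agent's last committed step. `Checkpoint` and `retry_plan` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical record of a subagent's last committed step (0 = nothing committed)."""
    agent_id: int
    step: int


def retry_plan(checkpoints: list[Checkpoint], batch_size: int,
               gap_s: float) -> list[tuple[float, int, int]]:
    """Plan a recovery after a 429: (spawn_offset_s, agent_id, resume_from_step).

    Each agent resumes from its own checkpoint instead of from zero, and the
    agents re-enter in batches spaced gap_s apart instead of all at once.
    """
    ordered = sorted(checkpoints, key=lambda cp: cp.agent_id)
    return [((i // batch_size) * gap_s, cp.agent_id, cp.step) for i, cp in enumerate(ordered)]


failed = [Checkpoint(0, 3), Checkpoint(1, 0), Checkpoint(2, 7), Checkpoint(3, 2), Checkpoint(4, 5)]
for offset, agent, step in retry_plan(failed, batch_size=2, gap_s=30.0):
    print(f"t+{offset:.0f}s  agent {agent} resumes at step {step}")
```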

A judgment call from real work

The canonical PL incident is described in full in Lesson 8. The setup: thirty parallel subagents, one account quota, no aggregate instrumentation, no checkpoints, simultaneous spawn.

The incident was not the result of carelessness. It was the result of a gap between the design model — each subagent is independent — and the operational reality: all subagents share one quota pool.

The rule that emerged: stagger spawn, cap fan-out, and treat quota as shared state with all the discipline that entails. The quota budget check — account quota divided by worst-case per-agent consumption, times 0.6 — is now run before any fan-out of more than five agents.

The stagger rate, one new subagent every thirty seconds rather than all at once, is the default pattern, not the exception. On a thirty-agent batch it delays the last spawn by twenty-nine intervals, about fourteen and a half minutes. That is a small price compared to losing the batch entirely.
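The cost of a one-at-a-time stagger is easy to compute. With N agents spaced `interval_s` apart, the last one starts (N - 1) × interval_s after the first. `stagger_offsets` is a hypothetical helper.

```python
def stagger_offsets(n_agents: int, interval_s: float) -> list[float]:
    """Spawn times, in seconds after the first spawn, for one agent every interval_s."""
    return [i * interval_s for i in range(n_agents)]


offsets = stagger_offsets(30, 30.0)
print(offsets[:3])       # [0.0, 30.0, 60.0]
print(offsets[-1] / 60)  # 14.5 minutes until the last agent starts
```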

Apply lever, risk, rollback. Lever: a pre-flight quota check and staggered spawn take twenty minutes to implement and prevent the class of failure that can lose hours of work. Risk: conservative fan-out caps reduce parallelism benefit on large batches. Rollback: quota caps can be raised as you build confidence in the consumption model; the cost of starting conservative is opportunity cost, not incident cost.

Rules from this lesson

Shared quota is shared blast radius; treat quota as explicitly as any other shared state.
Cap fan-out before spawning: account quota divided by per-agent worst-case consumption times 0.6.
Stagger spawns and instrument aggregate consumption at the parent; the first signal of a quota problem should be a metric, not a 429.