Multi-agent systems sound abstract until you have read one that ships.
The PL subagent worktree pattern is a nested agent system that is small enough to explain in one lesson and concrete enough to have produced real production incidents. It is also the system that scaffolded this course — several of these lessons were deep-authored by subagents working in parallel in isolated worktrees. The failure modes it surfaced are not theoretical.
This is the property that makes it worth studying: it is a production system that you can inspect end to end, in a repository you have access to, using the same SDK the lessons describe. The gap between "understanding the pattern" and "reading the actual implementation" is one file-read away.
The architecture has two layers.
The parent is a Claude Code session with its own loop, context window, toolbelt, memory, and stopping condition. The parent has a quota pool — a budget shared across all activity in the session. The parent's job is to decompose a task, spawn children, track their progress, and rejoin their outputs.
The children are spawned subagents, each assigned to an isolated git worktree. A worktree is a checked-out copy of the repository at a specific branch, in its own directory, with its own working tree. Each child has its own loop, its own context window, its own tool access, and its own stopping conditions. What the children share is the parent's quota pool.
That shared quota is the architectural seam where the system breaks first.
The spawn mechanism is the Agent tool — the parent calls it with a task description, a worktree path, and instructions. The child inherits the parent's toolbelt configuration and MCP server connections. It does not inherit the parent's context window; it starts fresh with only the task description and the project CLAUDE.md as context.
The rejoin mechanism is SendMessage — the structured protocol for a child to signal completion back to the parent, passing a summary of what was done and where the output lives. This is not "the child writes files and the parent reads them." That pattern produces race conditions and lost work. SendMessage is an explicit handoff: the child signals, the parent acknowledges, the output is integrated.
The abort mechanism is also SendMessage, in the reverse direction: the parent can signal to a running child that it should stop, checkpoint its work, and exit cleanly. This is the parent's manual kill switch for the children, separate from the automatic deadman switch each child carries.
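The rejoin and abort handoffs can be made concrete as plain data. The sketch below is illustrative only: the source does not specify a payload format for SendMessage, so `CompletionMessage`, `AbortMessage`, `ChildStatus`, and every field name are hypothetical, not SDK types.

```python
from dataclasses import dataclass
from enum import Enum


class ChildStatus(Enum):
    DONE = "done"        # task finished, output committed
    PARTIAL = "partial"  # stopped early, but checkpointed work exists
    FAILED = "failed"    # nothing usable was produced


@dataclass(frozen=True)
class CompletionMessage:
    """Hypothetical child -> parent payload (an assumption, not an SDK type)."""
    child_id: str
    status: ChildStatus
    summary: str        # what was done
    worktree_path: str  # where the output lives
    branch: str
    tokens_used: int


@dataclass(frozen=True)
class AbortMessage:
    """Hypothetical parent -> child payload asking for a clean stop."""
    child_id: str
    reason: str
    checkpoint_first: bool = True  # child should commit before exiting


def should_integrate(msg: CompletionMessage) -> bool:
    """Parent-side rule: integrate only when the child reports usable output."""
    return msg.status in (ChildStatus.DONE, ChildStatus.PARTIAL)
```

Whatever the real payload looks like, the point of writing it down as a type is that "done", "partial", and "failed" become distinct cases the parent has to handle, rather than states it infers from the filesystem.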
Why it matters now
By 2026, parallel subagent patterns are the primary way to scale autonomous work. A single agent loop is bounded by its own sequential nature: one turn, one action, one result. A parent-with-children pattern breaks that bound. In the ideal case, twenty children working in parallel can produce in one hour what a single agent loop would produce in twenty hours.
The bound shifts from sequential latency to quota economics and coordination complexity. Both of those bounds require explicit design; they do not self-manage.
The PL pattern is small enough to reason about in one sitting and concrete enough to have every design decision documented in the repo's MEMORY.md files. Reading it is the fastest available path to understanding what a production nested-agent system actually involves.
A source you should trust
- PL's MEMORY.md entries: project_pl_pr_workflow, feedback_subagent_quota_sharing, feedback_dont_clean_locked_worktrees, feedback_subagent_worktree_base. These are the lived record of the decisions and incidents. They are not polished documentation; they are the actual operational knowledge accumulated through failures. Read them in the order they were written to see how the pattern evolved.
- Anthropic's parallel agents documentation in the Claude Agent SDK guide. The platform-level guarantees the pattern relies on: what isolation the worktree provides, what the Agent tool's spawning contract is, what SendMessage guarantees.
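- Git worktree documentation. Understanding the git-level isolation is important for understanding what the children can and cannot affect. A worktree is not a clone; it shares the object store and refs with the parent repo while keeping its own HEAD, index, and working directory. Commits on a child's branch are immediately visible from the parent repository, although the parent's own checked-out files do not change until it merges them.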
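For readers who have not used worktrees, here is a minimal sketch of how one child's worktree might be created. `git worktree add -b <branch> <path> <base>` is the real git command; the directory layout, the `agent/<child_id>` branch naming, and the function names are assumptions made for illustration, not PL's actual conventions.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, child_id: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` call for one child.

    The sibling "<repo>-worktrees" directory and the branch name are
    illustrative assumptions.
    """
    path = repo.parent / f"{repo.name}-worktrees" / child_id
    branch = f"agent/{child_id}"
    # -b creates a new branch starting at `base` and checks it out in `path`.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, child_id: str, base: str = "main") -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_command(repo, child_id, base), check=True)
```

Because the worktree shares the parent repository's object store, a commit the child makes on `agent/child-1` can be read from the parent with an ordinary `git log agent/child-1`, with no fetch or push involved.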
A recipe
A nested-agent system checklist before spawning at scale:
- Write down explicitly what is isolated per child (filesystem, git branch, environment, secrets) and what is shared (quota pool, MCP server connections, parent state, output rendezvous). If the "shared" column has anything that is not quota, examine it carefully — shared mutable state is where multi-agent deadlocks and overwrites live.
- Calculate the per-child budget before spawning. Take the parent's remaining quota, subtract a safety reserve for the parent's own continued operation, and divide by the number of children you intend to spawn. That is each child's token cap. Pass it explicitly at spawn time.
- Stagger spawning. Do not fire all N children simultaneously. Spawn in batches of three to five, verify the first batch is making progress, then spawn the next. The parent should retain enough quota at all times to abort a runaway child and checkpoint the session.
- Define the rejoin contract before spawning. What does a "done" message from a child look like? What constitutes a partial result? What should the parent do if a child fails to send a completion signal within the wall-clock timeout? Write this down before the first child spawns.
- Every child gets a deadman switch — a stopping condition that fires if the child exceeds its token budget, its wall-clock timeout, or its no-progress limit. The parent gets a quota meter — an alert that fires when aggregate spend across all children crosses a threshold that would leave the parent unable to operate.
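The last checklist item, the deadman switch and the parent's quota meter, reduces to a pair of small predicates. This is a sketch under stated assumptions: the type names, field names, and the idea of counting "idle turns" as the no-progress measure are all hypothetical, and how a real child would obtain its token count is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChildLimits:
    token_cap: int        # per-child budget passed at spawn time
    wall_clock_s: float   # maximum run time in seconds
    max_idle_turns: int   # turns allowed without measurable progress


@dataclass
class ChildState:
    tokens_used: int
    elapsed_s: float
    idle_turns: int


def deadman_reason(state: ChildState, limits: ChildLimits) -> Optional[str]:
    """Return why the child must stop, or None if it may continue."""
    if state.tokens_used >= limits.token_cap:
        return "token budget exceeded"
    if state.elapsed_s >= limits.wall_clock_s:
        return "wall-clock timeout"
    if state.idle_turns >= limits.max_idle_turns:
        return "no progress"
    return None


def parent_meter_tripped(spent_by_children: int, pool: int, parent_reserve: int) -> bool:
    """True once aggregate child spend has eaten into the parent's reserve."""
    return spent_by_children > pool - parent_reserve
```

A child would evaluate `deadman_reason` once per turn and checkpoint before exiting when it returns a reason; the parent would evaluate `parent_meter_tripped` whenever it updates its spend total.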
The smell of it going wrong
- Children are spawned in a tight loop from the parent with no delay and no quota math. At high fanout the parent is likely to hit the rate limit before the last child finishes spawning.
- The parent has no mechanism to abort a running child. "Cancel" means killing the session and losing all in-flight work.
- Two or more children have write access to the same file path. The last writer wins, silently, and whoever reviews the work later cannot tell which child's version survived.
- The logic is "it worked with three children; let's try thirty" with no shakedown at the higher fanout. The failure mode at three children and the failure mode at thirty children are different. The thirty-child failure is the one in the incident log.
- The parent's quota state is not logged. When the 429 fires, nobody can reconstruct which child was responsible for the final burst.
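The last smell, unlogged quota state, has a cheap remedy: an append-only ledger of spend keyed by child. The sketch below is hypothetical throughout (`QuotaLedger` and its methods are not part of any SDK), and it assumes the parent can observe per-child token usage at all, which the source does not confirm.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class QuotaLedger:
    """Append-only record of token spend, keyed by child id (hypothetical)."""
    pool: int
    events: List[Tuple[float, str, int]] = field(default_factory=list)

    def record(self, child_id: str, tokens: int, now: Optional[float] = None) -> None:
        # `now` is injectable so the ledger can be replayed or tested.
        ts = time.time() if now is None else now
        self.events.append((ts, child_id, tokens))

    def remaining(self) -> int:
        return self.pool - sum(tokens for _, _, tokens in self.events)

    def top_spenders(self, since: float) -> List[Tuple[str, int]]:
        """Children ranked by spend at or after `since`, largest first."""
        totals = defaultdict(int)
        for ts, child_id, tokens in self.events:
            if ts >= since:
                totals[child_id] += tokens
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

With a ledger like this, the question "which child caused the final burst" becomes a call to `top_spenders` with a timestamp a few minutes before the 429.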
A judgment call from real work
The canonical PL quota-429 incident is worth walking end to end because it produced most of the design discipline in the checklist above.
The task was an issue-triage batch: approximately thirty open GitHub issues needed to be read, evaluated for scope, and have draft PR descriptions written. The parent spawned thirty-plus subagents in rapid succession, each assigned one issue, each working in its own worktree. The issues were genuinely independent — no shared files, no coordination needed between children. The spawning logic looked correct.
What was missing: per-child token caps, staggered spawning, and aggregate quota tracking on the parent.
The children began accumulating context rapidly. Each issue required reading multiple source files, understanding the surrounding code, and drafting a structured proposal. Some children were retrieving large files. Some were doing multiple retrieval passes. The context windows were growing faster than expected because the source files were larger than the toy examples the pattern had been tested on.
Around the forty-minute mark, the quota pool hit the rate limit. The API returned 429s to every in-flight call, children and parent simultaneously. No child had committed yet, and the parent had no checkpoint of which children had made progress and which had not. As far as the system was concerned, all work was lost.
The recovery was manual: read the per-child logs, identify which children had reached a state worth preserving, extract the partial work by hand, and restart those children with the partial work as context.
The redesign addressed each missing piece:
Per-child token caps, calculated from the parent's quota at spawn time, reduced per-child spend and forced children to checkpoint before hitting their limit rather than running until the platform stopped them.
Staggered spawning in batches of four, with progress confirmation between batches, kept the parent's quota meter visible throughout the run and prevented the all-at-once burst.
A commit checkpoint requirement — each child must commit work-in-progress to git before advancing past a complexity threshold — meant a 429 mid-batch became a delay and resume, not a restart. The parent could inspect the git log, identify completed and in-progress work, and spawn only the unfinished children.
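The resume step, inspecting the git log and respawning only unfinished children, can be sketched as a pure function. The `done:` and `wip:` commit-subject prefixes are an assumed convention invented for this example, as are the function and key names; the source says only that children commit work-in-progress and that the parent reads the log.

```python
from typing import Dict, List


def plan_resume(tasks: List[str], commits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Sort tasks into finished, resumable, and untouched.

    `commits` maps a task id to the commit subjects found on its branch,
    for example parsed from `git log --format=%s main..<branch>`.
    The "done:" prefix marking a final commit is an assumed convention.
    """
    plan: Dict[str, List[str]] = {"finished": [], "resume": [], "restart": []}
    for task in tasks:
        subjects = commits.get(task, [])
        if any(s.startswith("done:") for s in subjects):
            plan["finished"].append(task)
        elif subjects:
            plan["resume"].append(task)   # checkpointed work exists
        else:
            plan["restart"].append(task)  # nothing committed before the 429
    return plan
```

Only the `resume` and `restart` groups need new children, and the `resume` group can be handed its existing branch as a starting point instead of beginning from scratch.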
The entire failure and recovery required one afternoon to live through and one hour to design against. The checklist above takes ten minutes to apply. The asymmetry is the reason the checklist exists.
One additional design principle surfaced by the incident: the "locked worktree" convention. When a subagent is actively working in a worktree, that worktree is considered locked — the parent should not clean it up, reassign it, or inspect it during the run. Cleaning an in-progress worktree destroys work in a way that looks, to the child agent, exactly like a file-system failure mid-task. The abort mechanism for a running child is SendMessage, not worktree removal. This is a design rule that only becomes obvious after violating it once.
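One way to keep the locked-worktree rule from depending on memory is to route every cleanup through a guard. This is a minimal sketch; `WorktreeRegistry` and `WorktreeLockedError` are hypothetical names, and the source does not say how PL actually tracks which worktrees are in use.

```python
from dataclasses import dataclass, field
from typing import Set


class WorktreeLockedError(RuntimeError):
    """Raised when cleanup is attempted on a worktree with a live child."""


@dataclass
class WorktreeRegistry:
    """Parent-side record of which worktrees have a running child (hypothetical)."""
    locked: Set[str] = field(default_factory=set)

    def lock(self, path: str) -> None:
        # Called when a child is spawned into `path`.
        self.locked.add(path)

    def release(self, path: str) -> None:
        # Called only after the child's completion or abort is acknowledged.
        self.locked.discard(path)

    def assert_cleanable(self, path: str) -> None:
        if path in self.locked:
            raise WorktreeLockedError(
                f"{path} has a running child; send an abort first"
            )
```

Git itself offers `git worktree lock`, which protects a worktree from being pruned, moved, or removed; whether PL relies on it is not stated in the source, so the in-memory registry above should be read as one possible approach rather than the documented one.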
Rules from this lesson
- Shared quota is shared blast radius; calculate per-child budgets before spawning, not after the 429.
- Spawn staged, not all-at-once; the parent must retain enough quota to abort cleanly at every point in the batch.
- Every child needs a checkpoint protocol so a rate-limit hit is a delay, not a loss.
- Define the rejoin contract before spawning; "the child writes files" is not a contract, it is a hope.
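The first two rules are simple arithmetic, sketched here so the numbers are concrete. The function names are hypothetical, and the figures in the usage note are made up for illustration; they are not PL's real quota values.

```python
from typing import List, Sequence


def per_child_cap(remaining_quota: int, parent_reserve: int, n_children: int) -> int:
    """Token cap per child: (remaining quota - parent reserve) split evenly."""
    if n_children <= 0:
        raise ValueError("n_children must be positive")
    spendable = remaining_quota - parent_reserve
    if spendable <= 0:
        raise ValueError("no quota left after the parent's reserve")
    return spendable // n_children


def batches(tasks: Sequence[str], size: int = 4) -> List[List[str]]:
    """Split tasks into spawn batches; the parent checks progress between them."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tasks[i:i + size]) for i in range(0, len(tasks), size)]
```

For example, with 1,000,000 tokens remaining, a 100,000-token reserve, and thirty issues, `per_child_cap` gives each child 30,000 tokens, and `batches` with a size of four yields seven batches of four followed by one batch of two.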