Distributed systems have a body of hard-won knowledge about concurrent state management. It took the field decades and a long list of expensive production incidents to develop.
Language models did not inherit that knowledge. Multi-agent systems, however, do inherit the problems.
When two agents read and write the same resource — a file, a database row, a memory store, a configuration value — you have a distributed system. The agent framework does not change this. The fact that the agents are language models does not change this.
Lamport's happened-before relation does not care about the technology stack. Race conditions do not become less real because the concurrent processes are talking in natural language.
The 2024 wave of multi-agent systems largely escaped this problem by avoiding shared state entirely. Fan-out designs with isolated contexts — one worktree per agent, one database row per task, one thread per subagent — sidestepped concurrency almost completely.
That works for fan-out tasks.
The moment you build a multi-agent system where agents need to collaborate on a shared artifact, every distributed-systems problem comes back.
Isolation is not available as a strategy when the task requires agents to build on each other's work. That is when you need the full toolkit.
The picture
Draw two agents. Both have a read-write connection to a shared resource — call it a working document.
Agent A reads the document at time T=1, reasons for a while, and writes an update at T=5.
Agent B reads the same document at T=2, reasons, and writes an update at T=4.
Agent B's write commits first. Agent A's write commits second and silently overwrites Agent B's changes, because Agent A read a version that predates Agent B's write.
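The timeline above can be replayed in a few lines. This is a minimal sketch, assuming a hypothetical in-memory dictionary stands in for the working document and that each agent writes back its whole private snapshot; the names `read`, `write`, and `document` are illustrative, not part of any framework.

```python
# Hypothetical in-memory stand-in for the shared working document.
document = {"text": "draft"}


def read(doc):
    # Each agent works from a private snapshot taken at read time.
    return dict(doc)


def write(doc, snapshot):
    # Blind write: replace the document with whatever the agent holds,
    # without checking whether anything changed since the read.
    doc.clear()
    doc.update(snapshot)


# T=1: Agent A reads. T=2: Agent B reads. Both see "draft".
snap_a = read(document)
snap_b = read(document)

# T=4: Agent B commits its edit.
snap_b["text"] += " +B's section"
write(document, snap_b)

# T=5: Agent A commits an edit built on the stale T=1 snapshot.
snap_a["text"] += " +A's section"
write(document, snap_a)

lost = "B's section" not in document["text"]
print(document["text"])  # draft +A's section
print(lost)              # True: B's write was acknowledged, then overwritten
```

Neither call to `write` raises, which is why the failure is silent.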
This is a lost-update problem — one of the oldest known failure modes in concurrent systems. The fix is a concurrency strategy: optimistic locking (each writer checks that the version it read has not changed before writing), pessimistic locking (exclusive access before reading), or a single-writer-with-queue (one agent owns all writes, others submit requests).
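Optimistic locking is the easiest of the three to show in isolation. The sketch below assumes a hypothetical `VersionedDoc` with an integer version counter; in a real store the version check and the write would have to be a single atomic operation (for example a conditional update), which this single-threaded illustration does not need.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when the version a writer read is no longer current."""


@dataclass
class VersionedDoc:
    # Hypothetical shared document carrying a version counter.
    text: str = ""
    version: int = 0

    def read(self):
        return self.text, self.version

    def write(self, new_text, expected_version):
        # Compare-and-set: accept the write only if nothing changed
        # since this writer's read.
        if expected_version != self.version:
            raise VersionConflict(
                f"read v{expected_version}, current is v{self.version}"
            )
        self.text = new_text
        self.version += 1


def update_with_retry(doc, edit, max_attempts=3):
    # Re-read and re-apply the edit on conflict. For a language-model
    # agent, "edit" means re-reasoning against the fresh text.
    for attempt in range(1, max_attempts + 1):
        text, version = doc.read()
        try:
            doc.write(edit(text), version)
            return attempt
        except VersionConflict:
            continue
    raise VersionConflict("gave up after retries")


# Replay the timeline: A reads, B reads, B writes, A writes.
doc = VersionedDoc("draft")
text_a, ver_a = doc.read()
text_b, ver_b = doc.read()
doc.write(text_b + " +B", ver_b)
try:
    doc.write(text_a + " +A", ver_a)
    conflict = False
except VersionConflict:
    conflict = True
    update_with_retry(doc, lambda t: t + " +A")

print(conflict)  # True
print(doc.text)  # draft +B +A
```

The cost shows up in the retry: Agent A's first round of reasoning is discarded, so this strategy suits resources where conflicts are rare.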
Below the diagram, draw the same resource under each strategy and annotate the tradeoff.
Optimistic concurrency: Agent B's write succeeds, Agent A's write fails and retries with the current version.
Pessimistic locking: Agent A takes the lock when it reads at T=1; Agent B waits for Agent A to release before reading.
Single writer: both agents submit change requests to a writer agent, which serializes them.
Each strategy has a cost; the right choice depends on conflict rate and latency tolerance.
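As an illustration of the pessimistic variant, here is a minimal sketch that assumes both agents run as threads in one process and that a hypothetical `LockedDoc` guards the document with `threading.Lock`. Agents in separate processes or on separate machines would need a lock service or database row locks instead, which this sketch does not cover.

```python
import threading
import time


class LockedDoc:
    # Hypothetical shared document guarded by an exclusive lock.
    def __init__(self, text=""):
        self.text = text
        self._lock = threading.Lock()

    def update(self, edit):
        # The lock is held from the read through the write, so no other
        # agent can read a version that is about to be replaced.
        with self._lock:
            current = self.text
            time.sleep(0.01)  # stand-in for the agent's reasoning time
            self.text = edit(current)


doc = LockedDoc("draft")
agents = [
    threading.Thread(target=doc.update, args=(lambda t, tag=tag: t + f" +{tag}",))
    for tag in ("A", "B")
]
for thread in agents:
    thread.start()
for thread in agents:
    thread.join()

# Both edits survive; their order depends on which thread got the lock first.
print(doc.text)
```

The `sleep` is where the cost lives: the second agent is blocked for the full duration of the first agent's reasoning, which for a language model can be seconds rather than milliseconds.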
Why it matters now
The 2024 wave of multi-agent systems papered over shared-state problems by avoiding shared state. That approach works for the task shapes in Lesson 3.
As teams build more collaborative, longer-horizon multi-agent systems — systems where agents are incrementally building shared artifacts rather than dividing a corpus — the avoidance strategy stops working.
The teams that encounter this problem without having thought about it will spend hours debugging behavior that looks non-deterministic: the system works on Monday and produces wrong output on Thursday, with no clear pattern. The cause is a race condition that manifests only when two agents happen to act within a narrow time window.
Without a concurrency strategy in place, the fix is reactive and fragile. With one, it is designed-in and preventive.
Sources you should trust
- The classical distributed-systems literature. Lamport's clocks, Brewer's CAP theorem, the PACELC model. The abstractions are old and the applicability is direct. Any team building collaborative multi-agent systems should have at least one engineer who has read this literature seriously.
- Postgres MVCC documentation. Multi-version concurrency control is a practical reference for how concurrent readers and writers coexist on the same rows, including the update conflicts that optimistic patterns have to detect and retry. Postgres implements it cleanly and the documentation explains the tradeoffs with operational clarity. The concepts transfer to any system with concurrent writes.
A recipe
A shared-state design protocol for any multi-agent system where agents write to shared resources:
- List every resource that more than one agent might read or write. Be exhaustive: files, database rows, config values, memory stores, external API state.
- For each resource, classify the access pattern: read-only for all agents (no problem), read-write for exactly one agent (no problem), read-write for multiple agents (concurrency problem; choose a strategy).
- For each read-write-by-multiple resource, choose a concurrency strategy: isolated copies, single writer with queue, optimistic concurrency with retry, or pessimistic locking.
- Apply the simplest strategy that works. Isolated copies are cheapest: each agent gets its own copy and differences are merged at the end. Explicit locks are most expensive: they add latency, require lock management, and create deadlock risk. Start simple.
- Make write operations idempotent wherever possible. An idempotent write produces the same result whether applied once or multiple times. Idempotent writes survive retries and duplicate delivery without producing duplicate or inconsistent state. They do not by themselves prevent lost updates between different writers, so they complement the concurrency strategy rather than replace it.
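Two steps of the recipe lend themselves to a short sketch: classifying the inventory and making a write idempotent. Everything named here (`classify`, `inventory`, `IdempotentLog`, the resource and agent names) is hypothetical, and the request-ID approach is one common way to get idempotency, not the only one.

```python
def classify(writers):
    """Return the recipe's verdict for one resource, given its writers."""
    if len(writers) == 0:
        return "read-only: no problem"
    if len(writers) == 1:
        return "single writer: no problem"
    return "multiple writers: choose a strategy"


# Hypothetical inventory: resource name -> set of agents that write to it.
inventory = {
    "style_guide.md": set(),
    "task_log": {"planner"},
    "working_document": {"author", "critic"},
}
needs_strategy = [name for name, writers in inventory.items() if len(writers) > 1]
print(needs_strategy)  # ['working_document']


class IdempotentLog:
    # Append-only log that ignores a request it has already applied.
    def __init__(self):
        self.lines = []
        self._applied = set()

    def append(self, request_id, line):
        # The caller supplies a stable request_id and reuses it on retry.
        if request_id in self._applied:
            return False  # duplicate delivery: no effect
        self._applied.add(request_id)
        self.lines.append(line)
        return True


log = IdempotentLog()
first = log.append("req-1", "fix typo in section 2")
retry = log.append("req-1", "fix typo in section 2")  # retried after a timeout
print(first, retry, len(log.lines))  # True False 1
```

The request ID has to be chosen by the caller before the first attempt; generating a fresh ID on each retry would defeat the check.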
The smell of it going wrong
The most insidious symptom is intermittent correctness: the system produces right output most of the time and wrong output occasionally, with no clear pattern. This is the signature of a race condition that manifests only when two agents act within a narrow window.
The failure is real and reproducible, but the reproduction window is small enough that it escapes testing. Testing environments typically have lower concurrency than production. A race condition that requires two agents to act within 200 milliseconds of each other will be vanishingly rare in a sequential test suite and common in production.
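One way to stop relying on luck is to force the interleaving in a test. The sketch below assumes the agents can be driven as threads and uses `threading.Barrier` to guarantee that every agent has read before any agent writes; `run_forced_race` and the counter store are hypothetical stand-ins for a real agent step and a real resource.

```python
import threading


def run_forced_race(n_agents=2):
    # Hypothetical shared counter; each agent is supposed to add one.
    store = {"count": 0}
    barrier = threading.Barrier(n_agents)

    def agent():
        seen = store["count"]      # read
        barrier.wait()             # nobody writes until everyone has read
        store["count"] = seen + 1  # write based on the stale read

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store["count"]


result = run_forced_race()
print(result)  # 1, not 2: one increment was lost on every run
```

With the barrier in place the race fires on every run instead of occasionally, so the same test can later confirm that a chosen concurrency strategy actually closes the window.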
Another symptom is silent data loss: a write happens, is acknowledged, and then is not reflected in the next read. This is the lost-update pattern. It is silent because neither agent errors; both believe they succeeded. The inconsistency only surfaces when downstream behavior diverges.
A judgment call from real work
PL parallel-triage isolates state by design: each subagent operates in its own git worktree, a separate filesystem copy of the repository. Agents cannot write to each other's worktrees. The parent coordinates by spawning agents and reading their output via the Agent tool's message protocol, not by reading a shared file.
The tradeoff was made explicitly. Worktree overhead is real: creating a new worktree for each issue takes a few seconds and consumes disk space proportional to the number of active agents. That overhead was accepted because the alternative — shared worktree with locking — would require every subagent to coordinate file access with every other, creating a single point of contention and a source of deadlocks.
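For readers who have not used worktrees, a rough sketch of per-issue isolation follows. It is not the project's actual code: the function names, the `issue-<id>` directory layout, and the `triage/issue-<id>` branch naming are all assumptions made for illustration. The only external dependency is the standard `git worktree add` command.

```python
import subprocess
from pathlib import Path


def worktree_command(repo, issue_id, base_dir):
    # Build the git invocation for one isolated checkout per issue.
    # Directory and branch naming are hypothetical conventions.
    path = Path(base_dir) / f"issue-{issue_id}"
    branch = f"triage/issue-{issue_id}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo, issue_id, base_dir):
    # Runs git; raises CalledProcessError if the branch or path already exists.
    cmd = worktree_command(repo, issue_id, base_dir)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


# Printing the command only; create_worktree needs a real repository.
cmd = worktree_command("/path/to/repo", 42, "/tmp/worktrees")
print(" ".join(cmd))
```

Each subagent then receives its own path as its working directory, so no two agents ever hold a write handle on the same files.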
Where this design would not work: a multi-agent system where agents need to build a single shared artifact incrementally, such as a code review where one agent is the author and another is the critic. For that task shape, isolation is not viable. The right answer is probably single-writer: the author agent owns all writes, the critic agent submits comments as requests. That preserves coherence without requiring locks on the shared artifact.
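The author-and-critic arrangement can be sketched with a queue. This is a minimal illustration under stated assumptions: both agents run as threads in one process, a hypothetical `AuthorAgent` is the only code that touches the artifact, and "addressing" a comment is reduced to appending a line.

```python
import queue
import threading


class AuthorAgent:
    # Single writer: the only code path that mutates the artifact.
    def __init__(self, text):
        self.text = text
        self.inbox = queue.Queue()

    def submit(self, comment):
        # Called by other agents; they request changes, never write.
        self.inbox.put(comment)

    def run(self):
        # Requests are applied strictly one at a time, in arrival order.
        while True:
            comment = self.inbox.get()
            if comment is None:  # sentinel: no more requests
                break
            self.text += f"\n[addressed] {comment}"


def critic(author):
    # Hypothetical critic agent: reviews and submits comments.
    author.submit("rename the helper function")
    author.submit("add a test for the empty case")


author = AuthorAgent("def helper(): ...")
writer_thread = threading.Thread(target=author.run)
critic_thread = threading.Thread(target=critic, args=(author,))
writer_thread.start()
critic_thread.start()
critic_thread.join()
author.inbox.put(None)  # all requests are queued; tell the writer to stop
writer_thread.join()

print(author.text)
```

Because every change passes through one queue, the artifact has a single ordered history and no locks are needed on it; the price is that the author becomes the throughput bottleneck.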
Apply lever, risk, rollback. Lever: isolating state by default eliminates an entire class of distributed-systems failures at the cost of some resource overhead. Risk: isolation does not compose; if agents genuinely need to build on each other's work, isolation creates divergence that must be reconciled at synthesis. Rollback: moving from isolation to a locking strategy is a design change, not a rebuild; the agents do not change, the state management does.
Rules from this lesson
- Multi-agent systems are distributed systems; the distributed-systems literature applies regardless of the technology or abstraction layer.
- Prefer isolated state to locked shared state when the task allows; isolation is cheaper to reason about, cheaper to debug, and eliminates an entire class of failure modes.
- Make writes idempotent by design; idempotent writes survive retries and duplicate delivery without producing corrupt state.