Debugging multi-agent systems — why it's hard and what compensates — Multi-Agent Orchestration — fan-out, swarms, and the cost of doing both badly

Single-agent debugging is hard. Multi-agent debugging is a different problem category.

When a single agent produces unexpected output, you read its transcript. You find the turn where the reasoning diverged. You understand why. The causal chain is linear, in one document, in chronological order.

It is slow but tractable.

When a multi-agent system produces unexpected output, you have N transcripts.

They interleave in time. The handoff between agent A and agent B happened at some point, but the timestamp in A's transcript and the timestamp in B's transcript may not be synchronized.

Agent B's behavior depends on what it received from agent A, but agent A's transcript and the handoff payload are not the same document.

The synthesis agent's output depends on what all N workers reported, and the workers reported in a sequence determined by when each one finished, not by any natural ordering of the task.

Building a trace viewer before you build the second agent is not a nice-to-have. It is the difference between debugging taking thirty minutes and debugging taking three days.

The picture

A trace viewer with one swim-lane per agent, laid out horizontally in time. Each swim-lane shows tool calls as boxes — file reads, API calls, write operations — with duration and token count annotated. Handoff arrows connect swim-lanes at the points where one agent passes a result to another.

Shared state changes appear as vertical annotations crossing all lanes simultaneously, marking when the shared resource was modified and by whom.

Below the timeline, the structured-log schema each agent emits: span ID, parent span ID (for linking child work to parent dispatch), agent identifier, tool call name, input hash, output hash, token count, timestamp, and handoff event fields.

The span ID / parent span ID relationship is what makes the timeline reconstructable. Every child agent's spans carry a reference to the parent span that spawned them, allowing the entire call tree to be reconstructed from a flat log stream.

Without that schema in place from the first deployment, the flat log stream is just a chronological list of events with no causal structure. Reconstructing causality from unstructured logs after the fact is almost always harder than collecting structured spans from the start.
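
The span ID / parent span ID relationship can be illustrated with a small sketch. It assumes each log record is a dict with the hypothetical keys `span_id`, `parent_span_id`, `agent_id`, `tool_call`, and `timestamp`; these names are illustrative, not a fixed schema. It shows one way to turn an out-of-order flat log back into a call tree.

```python
from collections import defaultdict

def build_span_tree(spans):
    """Index a flat list of span dicts by parent span ID."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_span_id")].append(span)
    # Order siblings by time so the rendered tree reads chronologically.
    for siblings in children.values():
        siblings.sort(key=lambda s: s["timestamp"])
    return children

def render_tree(children, parent_id=None, depth=0):
    """Walk the index depth-first and return one indented line per span."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f'{span["agent_id"]}: {span["tool_call"]}')
        lines.extend(render_tree(children, span["span_id"], depth + 1))
    return lines

# A flat log stream, deliberately out of order (hypothetical data).
flat_log = [
    {"span_id": "s3", "parent_span_id": "s1", "agent_id": "worker-b",
     "tool_call": "api_call", "timestamp": 3.0},
    {"span_id": "s1", "parent_span_id": None, "agent_id": "orchestrator",
     "tool_call": "dispatch", "timestamp": 1.0},
    {"span_id": "s2", "parent_span_id": "s1", "agent_id": "worker-a",
     "tool_call": "file_read", "timestamp": 2.0},
    {"span_id": "s4", "parent_span_id": "s2", "agent_id": "worker-a",
     "tool_call": "write", "timestamp": 2.5},
]

tree_lines = render_tree(build_span_tree(flat_log))
print("\n".join(tree_lines))
```

Root spans are the ones with no parent. If the parent reference is dropped from the records, the same data collapses into a time-sorted list and the tree cannot be rebuilt.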

Why it matters now

LangSmith, Helicone, Phoenix, Langfuse — a category of trace-viewer products has emerged in direct response to the multi-agent observability problem. They exist because the problem is real and encountered by many teams.

The tools have converged on similar designs: the swim-lane timeline, the span tree, the per-call cost attribution. That convergence is signal: this is the minimal interface for making multi-agent behavior legible.

The bad news is that most teams discover they need these tools after an incident, not before. The PL quota incident from Lesson 8 would have been caught much earlier with aggregate consumption telemetry in place. The handoff failures from Lesson 5 are much faster to diagnose with a trace view that shows the payload at each boundary.

OpenTelemetry has published GenAI semantic conventions — a standard vocabulary for what fields spans should carry in AI workloads. It is worth tracking even if you do not adopt it immediately. When your internal trace schema converges with an emerging standard, integrating vendor tools becomes dramatically cheaper.

A source you should trust

LangSmith's tracing documentation. A representative vendor implementation with clear explanations of the span model and what it captures. Use it to understand the design even if you do not use LangSmith.
OpenTelemetry GenAI semantic conventions. The emerging open standard. The vocabulary is worth adopting early. Renaming span fields later is annoying rather than expensive, but it is avoidable work.

A recipe

A pre-multi-agent observability checklist, completed before the second agent is deployed:

Define your trace schema. At minimum: span ID, parent span ID, agent identifier, tool call name, token count, wall-clock timestamp, handoff event type (spawn, message, result), and cost in currency if available. Write this schema down before you write any agent code.
Choose trace storage. A vendor (LangSmith, Langfuse, Helicone) is faster to adopt. Self-hosted OpenTelemetry writing to Postgres or ClickHouse is cheaper at scale. Either is better than no trace storage.
Wire spans into every agent at instantiation — not "add tracing later." The agent that runs without a span is the agent whose failure will be impossible to diagnose.
Practice reading a multi-agent trace on a working system before you need to read one on a failing system. Run a small fan-out with tracing enabled, look at the resulting timeline, and verify you can answer: which agent took the longest, where was the handoff, what was in the payload.
Define what "an incident" looks like in trace terms before an incident happens. Which span field tells you that a handoff failed? Which field tells you that quota was approaching the cliff?
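
As a rough sketch of the first and last checklist items, the minimum schema can be written down as a dataclass, and "a handoff failed" can be expressed as a query over spans. The field names, the `cost_usd` unit, and the convention that a result span's `parent_span_id` points at the spawn span it answers are all assumptions made for illustration, not an established standard.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class Span:
    agent_id: str
    tool_call: str
    token_count: int
    parent_span_id: Optional[str] = None
    handoff_event: Optional[str] = None  # "spawn", "message", "result", or None
    cost_usd: Optional[float] = None     # left as None when cost is unavailable
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)  # wall-clock seconds

    def to_json_line(self) -> str:
        """Serialize to one JSON line, suitable for a flat log stream."""
        return json.dumps(asdict(self), sort_keys=True)

def unanswered_spawns(spans):
    """Return spawn spans that no result span refers back to.

    Assumed convention: a result span sets parent_span_id to the span ID
    of the spawn it answers.
    """
    answered = {s.parent_span_id for s in spans if s.handoff_event == "result"}
    return [s for s in spans
            if s.handoff_event == "spawn" and s.span_id not in answered]

# Two dispatches, only one of which ever reports back (hypothetical data).
spawn_a = Span("orchestrator", "dispatch", 120, handoff_event="spawn")
spawn_b = Span("orchestrator", "dispatch", 110, handoff_event="spawn")
result_a = Span("worker-a", "report", 450,
                parent_span_id=spawn_a.span_id, handoff_event="result")

stuck = unanswered_spawns([spawn_a, spawn_b, result_a])
print(spawn_a.to_json_line())
print(len(stuck), "spawn(s) without a result")
```

A spawn with no matching result is only one possible definition of a failed handoff. The point of the exercise is to pick a definition that can be evaluated from span fields alone.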

The smell of it going wrong

"We'll add traces later" is the most reliable predictor of painful multi-agent debugging. It is what teams say shortly before they spend days debugging a multi-agent failure by hand-correlating transcript files.

The barrier to adding traces at the start is low. The barrier to retroactively reconstructing what happened from unstructured logs is high.

Per-agent traces that are not joined into a timeline. If each agent emits its own log file and those files are not linked by span IDs or task IDs, you have N separate debugging sessions rather than one multi-agent debugging session.
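
A minimal sketch of the join, assuming each agent writes JSON-lines records that carry hypothetical `task_id`, `timestamp`, and `event` keys. It folds the per-agent logs for one task into a single time-ordered list.

```python
import json

def merge_agent_logs(logs_by_agent, task_id):
    """Merge per-agent JSON-lines logs into one timeline for a single task."""
    timeline = []
    for agent_id, lines in logs_by_agent.items():
        for line in lines:
            event = json.loads(line)
            if event.get("task_id") == task_id:
                # Tag each event with the agent whose log it came from.
                timeline.append({**event, "agent_id": agent_id})
    timeline.sort(key=lambda e: e["timestamp"])
    return timeline

# Hypothetical log contents for a parent and one worker.
logs = {
    "parent": [
        '{"task_id": "t-1", "timestamp": 1.0, "event": "spawn"}',
        '{"task_id": "t-2", "timestamp": 1.5, "event": "spawn"}',
        '{"task_id": "t-1", "timestamp": 4.0, "event": "result_received"}',
    ],
    "worker": [
        '{"task_id": "t-1", "timestamp": 2.0, "event": "file_read"}',
        '{"task_id": "t-1", "timestamp": 3.0, "event": "result_sent"}',
    ],
}

timeline = merge_agent_logs(logs, "t-1")
for e in timeline:
    print(e["timestamp"], e["agent_id"], e["event"])
```

Sorting by timestamp assumes the agents' clocks are comparable. Where they are not, the task ID still groups the events correctly, but the ordering across agents needs the handoff events themselves as anchors.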

Trace storage that expires in 24 hours. Multi-agent incidents often surface from behavior that has been silently wrong for days before it becomes noticeable. If traces expire before the postmortem begins, the evidence is gone.

Unstructured log strings with no per-tool attribution. If the log says "agent B ran for 8 minutes" but does not break down which tool calls consumed which fraction of that time, you cannot tell whether the problem is in the agent's reasoning or in a specific tool that is slow or failing.
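
To make per-tool attribution concrete, here is a small sketch that assumes every span records a hypothetical `duration_s` field alongside `agent_id` and `tool_call`, and that time spent in the model is logged as its own `model_call` span. It breaks one agent's wall-clock time down by tool.

```python
from collections import defaultdict

def time_by_tool(spans, agent_id):
    """Return {tool: (seconds, share of total)} for one agent, slowest first."""
    totals = defaultdict(float)
    for span in spans:
        if span["agent_id"] == agent_id:
            totals[span["tool_call"]] += span["duration_s"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {
        tool: (secs, secs / grand_total if grand_total else 0.0)
        for tool, secs in ranked
    }

# Hypothetical spans adding up to the 8 minutes that agent B ran.
spans = [
    {"agent_id": "agent-b", "tool_call": "model_call", "duration_s": 60.0},
    {"agent_id": "agent-b", "tool_call": "api_call", "duration_s": 200.0},
    {"agent_id": "agent-b", "tool_call": "api_call", "duration_s": 190.0},
    {"agent_id": "agent-b", "tool_call": "file_read", "duration_s": 30.0},
    {"agent_id": "agent-a", "tool_call": "api_call", "duration_s": 5.0},
]

breakdown = time_by_tool(spans, "agent-b")
for tool, (secs, share) in breakdown.items():
    print(f"{tool}: {secs:.0f}s ({share:.0%})")
```

In this made-up data most of the eight minutes sits in one tool rather than in the agent's reasoning, which is the distinction an unstructured "ran for 8 minutes" line cannot make.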

A judgment call from real work

Claude Code provides a transcript per session automatically. That is valuable. The gap is in the parent-child join: a subagent running in its own session produces its own transcript, and the parent's transcript records the spawn and the result but not the subagent's intermediate reasoning.

Debugging a subagent issue that is not a final-output failure — something that went wrong in the middle of the subagent's work and influenced the final output without producing an obvious error — requires correlating two transcript files by hand, matching timestamps, and reconstructing what the subagent was doing at the moment the parent reached a relevant milestone.

The operating discipline that compensates: every subagent spawn includes a structured task ID (the issue number and a timestamp hash) that appears in both the parent's log and the subagent's prompt. Every PR opened by a subagent includes a reference to that task ID. When debugging, the task ID is the join key across all artifacts: parent transcript, subagent transcript, PR, and git history.

Without it, correlation is guesswork. With it, debugging a cross-agent issue takes minutes, not hours.
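
One possible shape for the task ID described above, offered as a sketch rather than the actual implementation: the `issue-<number>-<hash>` format, the 8-character SHA-256 prefix, the bracketed prompt header, and the issue number 412 are all hypothetical choices.

```python
import hashlib
from datetime import datetime, timezone

def make_task_id(issue_number, spawned_at):
    """Combine an issue number with a short hash of the spawn timestamp."""
    stamp = spawned_at.astimezone(timezone.utc).isoformat()
    digest = hashlib.sha256(stamp.encode("utf-8")).hexdigest()[:8]
    return f"issue-{issue_number}-{digest}"

def subagent_prompt(task_id, instructions):
    """Put the task ID at the top of the prompt so it lands in the subagent transcript."""
    return f"[task_id: {task_id}]\n{instructions}"

# Hypothetical spawn: the parent logs the ID and passes it to the subagent.
spawned_at = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
task_id = make_task_id(412, spawned_at)
prompt = subagent_prompt(task_id, "Fix the failing test and open a PR.")

print(f"parent log: spawned subagent for {task_id}")
print(prompt)
```

Because the same string appears in the parent's log, the subagent's prompt, and the PR description, a plain text search for the task ID is enough to collect every artifact for one piece of work.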

This is not a full trace implementation. It is the minimum viable join mechanism for a system where full tracing has not been built yet. Use it to get unstuck; build the real trace infrastructure before the system scales.

Rules from this lesson

Tracing is a prerequisite for multi-agent systems, not an afterthought; the agent that runs without a span is the agent that will be impossible to diagnose when it fails.
Structured spans with span IDs, parent span IDs, and handoff event fields are the minimum schema; unstructured logs produce N debugging sessions, not one.
Practice reading a multi-agent trace on a working system; the first time you read one should not be during an incident.

Apply lever, risk, rollback. Lever: structured spans with a consistent schema mean postmortems take hours instead of days, and the discipline of reading traces before incidents means you recognize failure patterns faster. Risk: trace storage costs money at scale; retention and sampling policies need active management. Rollback: trace schema changes can be versioned; the cost of changing a field name is a data migration, not a rewrite.