The move. Every agent turn is a span. Every tool call is a child span. Every cost is an attribute. Wire this on day one.
The single most common failure mode in agent observability is treating an agent like a service.
Services get request-scoped spans. An HTTP call comes in, a span starts, the call returns, the span ends. The trace is complete.
An agent is not a request. It is a session.
It contains multiple turns, each of which may contain multiple tool calls, each of which may itself kick off downstream calls.
If you wire an agent like a service — one span for the whole run — you have a trace that tells you the agent took four minutes and cost three dollars. You have no idea which turn spent the money or which tool call added the latency.
When something goes wrong, you are left staring at a single opaque rectangle on a timeline.
The discipline is simple to state and requires deliberate wiring: one span per turn, one span per tool call, cost and token attributes hanging off each.
The picture
Picture a waterfall diagram on your tracing vendor's UI. The root span is the agent run — call it run.agent. Its start timestamp is the moment the session began; its end timestamp is when the agent returned control to the caller. Duration might be six minutes.
Below it, a series of child spans: run.turn.1, run.turn.2, run.turn.3, each offset from the start. Each turn span has attributes: model, prompt_tokens, completion_tokens, cost_usd, turn_index. Under turn 2, you see two grandchild spans: tool.call.web_search and tool.call.write_file. Each has tool_name, input_hash, return_shape_hash, latency_ms.
Now imagine you have a cost spike in production. You open the trace for that run. You expand turn 3. You see tool.call.load_document with input_tokens: 38000. The document was unexpectedly long. One span, thirty seconds of triage. That is the payoff for wiring this correctly on day one.
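The triage step above can be sketched in a few lines. This is a stand-in rather than any vendor's SDK: the `Span` class and the `costliest_span` helper are hypothetical, the attribute names are the ones used in this lesson, and the token values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    """Hypothetical in-memory span; a real tracer supplies its own type."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def costliest_span(root: Span, key: str) -> Span:
    """Return the span with the largest value for attribute `key`."""
    candidates = [s for s in root.walk() if key in s.attributes]
    return max(candidates, key=lambda s: s.attributes[key])


# Illustrative trace mirroring the waterfall described in the text.
run = Span("run.agent", {"session_id": "s-123"}, [
    Span("run.turn.1", {"turn_index": 1, "prompt_tokens": 900}),
    Span("run.turn.2", {"turn_index": 2, "prompt_tokens": 1200}, [
        Span("tool.call.web_search", {"tool_name": "web_search", "input_tokens": 40}),
        Span("tool.call.write_file", {"tool_name": "write_file", "input_tokens": 310}),
    ]),
    Span("run.turn.3", {"turn_index": 3, "prompt_tokens": 39500}, [
        Span("tool.call.load_document", {"tool_name": "load_document", "input_tokens": 38000}),
    ]),
])

culprit = costliest_span(run, "input_tokens")
print(culprit.name, culprit.attributes["input_tokens"])
```

With a single opaque root span, the same question has no answer: there is nothing below the root to walk.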
Why it matters now
OpenTelemetry's GenAI semantic conventions are converging in 2025–2026. The working group has published attribute names for model, prompt tokens, completion tokens, and several vendor-specific extensions.
The direction is clear: the industry is standardizing on a schema, and early adopters who wire to that schema will have traces that survive vendor migration.
The vendor space is mature. LangSmith, Helicone, Phoenix, Langfuse, and Braintrust each have production-grade tracing surfaces. Each offers the waterfall view, the cost aggregation, and the session-replay feature described in Lesson 7.
Picking one and wiring it on day one costs a few hours. Building your own schema and later migrating to a standard costs days. Do not roll your own tracing schema in 2026.
The other reason this matters now: the cost of missing a span grows with agent complexity.
Every turn and every tool call is one more place a failure can hide. A two-turn agent that fails has a handful of candidate spans to inspect. A twenty-turn agent that fails has dozens. You are not going to debug dozens of candidates by reading raw logs.
A source you should trust
OpenTelemetry GenAI semantic conventions are the emerging open standard for LLM and agent observability. The draft specification is public; following it means your traces are vendor-portable and readable by any compliant tool. Start here before you design your attribute schema.
LangSmith's documentation gives a concrete worked example of span-per-turn wiring with the attributes that matter in practice. If you are on a LangChain stack, it is the natural home. If you are not, the documentation is still a useful reference for what the production discipline looks like.
Helicone's documentation covers the cost-attribution discipline specifically — how to record cost per span, how to aggregate it by user and by feature, and how to set alert thresholds. The cost layer is where most teams underinvest.
A recipe
A minimum-viable tracing setup for any production agent:
- Pick one vendor or open-source tracer. Make the pick on day one. Changing vendors later is a migration project; using nothing until you need it is an incident waiting to happen.
- Wrap every agent invocation in a root span tagged with session_id, user_id, feature, and agent_version. These four tags are how you find the traces that matter.
- Emit a child span per turn with at minimum: model, prompt_tokens, completion_tokens, cost_usd, turn_index, latency_ms. Add stop_reason, which tells you whether the turn ended cleanly or was truncated.
- Emit a child span per tool call with: tool_name, input_hash (not the full input; hash it for privacy), return_shape_hash, latency_ms, and error if the tool call failed.
- Set a retention policy before launch. Traces should be retained long enough to cover your typical incident-discovery time. If your team discovers incidents at T+3 days, retain for at least 30 days. Truncating at 7 days and having an incident at day 10 is a real pattern.
- Run one complete session manually against your tracing vendor before shipping. Confirm the waterfall looks right. Confirm cost attributes are non-null. This takes fifteen minutes and catches the most common wiring mistakes.
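The recipe above can be sketched without committing to a vendor. The `Tracer` class below is a hypothetical stand-in for whichever tracer you pick, `stable_hash` is an illustrative helper, and the model name, token counts, cost, and stop_reason value are placeholders. Hashing the sorted result keys is one assumed way to produce a return_shape_hash.

```python
import hashlib
import json
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Optional


class Tracer:
    """Hypothetical stand-in for a vendor or OpenTelemetry tracer."""

    def __init__(self) -> None:
        self.finished: List[Dict[str, Any]] = []
        self._stack: List[Dict[str, Any]] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Dict[str, Any]]:
        # The innermost open span, if any, becomes the parent.
        parent: Optional[str] = self._stack[-1]["name"] if self._stack else None
        record = {"name": name, "parent": parent, "attributes": dict(attributes)}
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["attributes"]["error"] = repr(exc)
            raise
        finally:
            record["attributes"]["latency_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(record)


def stable_hash(value: Any) -> str:
    """Hash a JSON-serializable value so raw inputs never reach the trace."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


tracer = Tracer()
with tracer.span("run.agent", session_id="s-1", user_id="u-1",
                 feature="search", agent_version="0.1.0"):
    with tracer.span("run.turn.1", model="example-model", turn_index=1) as turn:
        tool_input = {"query": "quarterly report"}
        with tracer.span("tool.call.web_search", tool_name="web_search",
                         input_hash=stable_hash(tool_input)) as tool:
            result = {"hits": 3}  # placeholder for the real tool call
            # Hash only the result's keys: the shape, not the content.
            tool["attributes"]["return_shape_hash"] = stable_hash(sorted(result))
        # Placeholder numbers; a real turn reads these from the model response.
        turn["attributes"].update(prompt_tokens=120, completion_tokens=30,
                                  cost_usd=0.0012, stop_reason="end_turn")

# Pre-launch check: every turn span must carry a non-null cost.
turn_spans = [s for s in tracer.finished if s["name"].startswith("run.turn.")]
missing_cost = [s["name"] for s in turn_spans if s["attributes"].get("cost_usd") is None]
print(missing_cost)
```

The final check is the scripted version of the manual session in the last bullet: an empty list means no turn shipped with a null cost attribute.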
The smell of it going wrong
Traces exist per-service but not per-agent. The application monitoring shows service health; the agent's behavior is invisible inside it. You know the service is up; you have no idea what the agent is doing.
Token counts are recorded but cost is not. Cost requires a model-name lookup against current pricing and varies per model version. Teams skip this step because it requires maintaining a price table. The result is that cost incidents are discovered in the billing portal, not in the trace.
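A price table does not need to be elaborate. The sketch below assumes per-million-token pricing split into input and output rates. The model names and prices are placeholders, not real vendor pricing, and `Price`, `PRICE_TABLE`, and `cost_usd` are hypothetical names.

```python
from typing import Dict, NamedTuple


class Price(NamedTuple):
    """USD per one million tokens (hypothetical structure)."""
    input_per_mtok: float
    output_per_mtok: float


# Placeholder model names and prices, NOT real vendor pricing.
PRICE_TABLE: Dict[str, Price] = {
    "example-small": Price(input_per_mtok=1.00, output_per_mtok=4.00),
    "example-large": Price(input_per_mtok=10.00, output_per_mtok=40.00),
}


def cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the cost of one turn; raise if the model has no price entry."""
    try:
        price = PRICE_TABLE[model]
    except KeyError:
        raise KeyError(f"no price entry for model {model!r}; update PRICE_TABLE") from None
    return (prompt_tokens * price.input_per_mtok
            + completion_tokens * price.output_per_mtok) / 1_000_000


turn_cost = cost_usd("example-large", prompt_tokens=38_000, completion_tokens=500)
print(f"{turn_cost:.4f}")
```

Raising on an unknown model is a deliberate choice in this sketch: a loud failure during testing is preferable to a silently null cost attribute in production.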
Tool calls are buried in log strings rather than emitted as structured spans. The agent logs "calling web_search with query='what is the current date'" at INFO level. This is not a trace. It is a hint that something happened. You cannot aggregate it, alert on it, or compare it across runs.
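To make the contrast concrete, here is a minimal sketch of what structured tool-call records allow that a log string does not. The record layout and the `tool_span` and `summarize` helpers are hypothetical, and the latencies are invented for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def tool_span(run_id: str, tool_name: str, latency_ms: float,
              error: Optional[str] = None) -> Dict[str, Any]:
    """Build a structured tool-call record (hypothetical field layout)."""
    return {"run_id": run_id, "name": f"tool.call.{tool_name}",
            "tool_name": tool_name, "latency_ms": latency_ms, "error": error}


def summarize(spans: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Aggregate call count, error count, and mean latency per tool."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for span in spans:
        grouped[span["tool_name"]].append(span)
    return {
        tool: {
            "calls": len(items),
            "errors": sum(1 for s in items if s["error"] is not None),
            "mean_latency_ms": sum(s["latency_ms"] for s in items) / len(items),
        }
        for tool, items in grouped.items()
    }


spans = [
    tool_span("run-a", "web_search", 420.0),
    tool_span("run-a", "write_file", 35.0),
    tool_span("run-b", "web_search", 980.0, error="timeout"),
]
summary = summarize(spans)
print(summary["web_search"])
```

The same question asked of INFO-level strings would require a regular expression per tool and would still lack latency and error fields.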
Trace retention is shorter than typical incident-discovery time. The alert fires on day 8; the traces expired on day 7. This is one of the most painful forms of observability failure because everything was working — you just lost the data before you needed it.
A judgment call from real work
PL traces Claude Code sessions through per-session transcript files. Each transcript records every tool call, every assistant response, and every user message in a locally-stored JSONL artifact. The tracing is rich for ad-hoc replay — you can reconstruct any decision by reading the transcript — but it is local, not aggregated. You cannot ask "across all sessions this week, which tool call had the highest latency?" without parsing all the transcripts manually.
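The manual parsing that question requires looks roughly like the sketch below. The transcript schema here is an assumption made for illustration: the `type`, `tool_name`, and `latency_ms` fields and the file names are hypothetical, and real transcript files may be shaped differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator, Optional


def iter_tool_calls(transcript_dir: Path) -> Iterator[Dict[str, Any]]:
    """Yield tool-call records from every *.jsonl file in a directory."""
    for path in sorted(transcript_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                # Assumed schema: tool calls are marked with type == "tool_call".
                if record.get("type") == "tool_call":
                    yield {"session": path.stem, **record}


def slowest_tool_call(transcript_dir: Path) -> Optional[Dict[str, Any]]:
    """Return the highest-latency tool call, or None if there are none."""
    return max(iter_tool_calls(transcript_dir),
               key=lambda r: r["latency_ms"], default=None)


# Build two tiny transcripts in a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "session-1.jsonl").write_text(
        json.dumps({"type": "user_message", "text": "hi"}) + "\n"
        + json.dumps({"type": "tool_call", "tool_name": "read_file", "latency_ms": 12}) + "\n",
        encoding="utf-8")
    (root / "session-2.jsonl").write_text(
        json.dumps({"type": "tool_call", "tool_name": "web_search", "latency_ms": 850}) + "\n",
        encoding="utf-8")
    slowest = slowest_tool_call(root)

print(slowest["session"], slowest["tool_name"], slowest["latency_ms"])
```

This works for a handful of sessions on one machine. It stops working once the transcripts live on many machines or the question needs to be asked every day.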
Ostronaut services trace via OpenTelemetry into a self-hosted backend. Spans are emitted per turn, per retrieval call, and per embedding call. Cost is attributed per call using a model-to-price table that is updated when pricing changes. The aggregate view is available: you can query cost by model, by feature, by day.
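The aggregate view amounts to grouping turn spans by one or more dimensions. A minimal in-memory sketch, assuming each exported span carries a day, a model, a feature, and a cost: `TurnSpan` and `cost_by` are hypothetical names, and a real backend would answer this with its own query language rather than Python.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple


class TurnSpan(NamedTuple):
    """Hypothetical flattened turn span as exported from a trace store."""
    day: date
    model: str
    feature: str
    cost_usd: float


def cost_by(spans: Iterable[TurnSpan], *dimensions: str) -> Dict[Tuple, float]:
    """Sum cost_usd grouped by the named span fields."""
    totals: Dict[Tuple, float] = defaultdict(float)
    for span in spans:
        key = tuple(getattr(span, dim) for dim in dimensions)
        totals[key] += span.cost_usd
    return dict(totals)


# Illustrative data; model names and costs are placeholders.
spans = [
    TurnSpan(date(2026, 1, 5), "example-small", "search", 0.25),
    TurnSpan(date(2026, 1, 5), "example-large", "summarize", 1.25),
    TurnSpan(date(2026, 1, 6), "example-small", "search", 0.50),
]

by_model = cost_by(spans, "model")
by_feature_day = cost_by(spans, "feature", "day")
print(by_model)
```

Each dimension you want to group by has to exist as a span attribute first, which is why the root-span tags in the recipe matter.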
The design choice between the two patterns is real. Local transcripts are the right call for development and for systems where privacy is paramount and aggregate analytics are not needed. Centralized trace stores are the right call when you need cross-session analysis, alert thresholds, and postmortem replay against a production baseline.
Most teams start with local logs and migrate to a centralized store after the first incident where they needed aggregate visibility. The migration is not expensive, but the incident it follows usually is. Wire the centralized tracer earlier than feels necessary.
The next lesson covers the downstream use of trace data: structured eval logging, which turns production interactions into regression sources. Tracing captures what happened; eval logging decides which of those interactions deserves to be frozen into a test that runs forever.
Rules from this lesson
- Tracing is non-optional for production agents; local logs are not a substitute for structured spans.
- Span-per-turn, span-per-tool-call, and cost attributes are the minimum schema — any less and you cannot triage a cost incident.
- Set retention before launch; traces that expire before the incident is discovered have no value.
- Tag every run with session, user, feature, and version; these are the dimensions you will filter on during every postmortem.
- Verify your tracing wiring with a manual session before launch; the most common mistakes — null cost attributes, tool calls logged as strings instead of spans — are invisible until you look.