An agent is just a loop with tools. Most product teams don't need a loop. They need a well-designed prompt. The cost of confusing the two is six months of debugging and a feature nobody trusts.
After this page, you'll understand:
- What the agent loop actually is — the mechanism, not the marketing
- What tool use means and what it costs (latency, reliability, cost)
- When a chain of prompts beats a full agent — the decision framework
- What a PM must spec before handing off agentic work to an engineer
- Observability in agentic systems — what to instrument and why
"Agents" became the dominant AI product narrative in 2025-2026. Every product roadmap has one. Most of them are either not agents (they're chains of prompts with if-statements) or are agents that shouldn't be. This page gives you the mental model to tell the difference — and to spec the ones that genuinely need to be agentic.
What an agent actually is
An agent is a system where an LLM iteratively decides what to do next, based on a goal, a set of available tools, and the results of previous actions. The loop:
1. LLM receives a goal and current context
2. LLM decides the next action (which tool to call, or whether to respond)
3. Tool executes, returns result
4. Result added to context
5. Loop from step 2 until goal is reached or a stopping condition fires
The key property: the LLM controls the sequence. It's not a pre-defined pipeline where you call tool A, then tool B, then tool C. The model decides whether to call tool A or tool B first, whether to call tool C at all, and whether it has enough information to respond or needs to gather more.
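The loop can be sketched in a few lines. This is a minimal illustration, not a production design: `fake_model`, `TOOLS`, and `run_agent` are hypothetical names, and the model is replaced by a hard-coded stand-in so the example runs without any API.

```python
from typing import Callable

# Hypothetical tools: plain Python functions keyed by name.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_docs": lambda query: f"3 documents found for '{query}'",
    "lookup_customer": lambda customer_id: f"customer {customer_id}: enterprise plan",
}

def fake_model(goal: str, context: list[str]) -> dict:
    """Stand-in for an LLM call. A real model would choose the action itself."""
    if not context:
        return {"action": "search_docs", "input": goal}
    return {"action": "respond", "input": f"Answer based on: {context[-1]}"}

def run_agent(goal: str, decide=fake_model, max_steps: int = 5) -> str:
    context: list[str] = []                      # results of previous actions
    for _ in range(max_steps):                   # stopping condition: step budget
        decision = decide(goal, context)         # the model picks the next action
        if decision["action"] == "respond":      # the model decides it is done
            return decision["input"]
        tool = TOOLS[decision["action"]]         # the chosen tool executes
        context.append(tool(decision["input"]))  # the result joins the context
    return "Stopped: step budget exhausted"

print(run_agent("refund policy"))
```

Note that nothing in `run_agent` fixes the order of tool calls. The sequence comes entirely from what `decide` returns, which is the property that separates an agent from a pipeline.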
This is genuinely powerful. It is also genuinely risky. The failure modes of sequential LLM pipelines are well understood. The failure modes of open-ended agent loops are not: they depend on what the model decides to do, which you cannot fully anticipate.
Tools, tool use, and what they cost
A tool is a function the LLM can call. Standard tools include web search, code execution, database queries, API calls, file read/write, calendar access, and email sending. In frameworks like LangChain and CrewAI, and in model APIs such as Claude's tool use, a tool is defined as a typed function signature, and the model invokes it by generating a JSON object containing the tool name and its arguments.
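As a rough illustration of what "typed function signature plus JSON call" means, the sketch below defines one tool and dispatches a model-generated call to it. The field names (`name`, `description`, `input_schema`, `input`) and the `get_order_status` tool are assumptions chosen for the example; each framework or API has its own exact shape.

```python
import json

# Hypothetical tool definition: a name, a description, and a JSON Schema for inputs.
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

def get_order_status(order_id: str) -> str:
    # Stub implementation; a real tool would query an order system.
    return f"Order {order_id}: shipped"

def dispatch(tool_call_json: str) -> str:
    """Check a model-generated tool call and run the matching function."""
    call = json.loads(tool_call_json)
    if call["name"] != GET_ORDER_STATUS["name"]:
        raise ValueError(f"unknown tool: {call['name']}")
    required = GET_ORDER_STATUS["input_schema"]["required"]
    missing = [key for key in required if key not in call["input"]]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    return get_order_status(**call["input"])

# What the model might generate when it decides to call the tool.
model_output = '{"name": "get_order_status", "input": {"order_id": "A-1001"}}'
print(dispatch(model_output))
```

The model never runs code itself. It only emits the JSON; your application decides whether and how to execute it, which is where validation and guardrails belong.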
Tool use adds three costs:
Latency. Each tool call adds at minimum one round-trip (the LLM decides to call the tool, the tool executes, results return to the LLM). Complex agents with 5-10 tool calls per task have P95 latencies in the 15-30 second range. That is not acceptable for interactive UX. It may be acceptable for async background tasks.
Cost. Every LLM call in the loop is billed by the token. A five-step agent makes five calls, so it costs at least 5x the tokens of a single call, and usually more, because each call re-sends the accumulated context. At GPT-4o launch pricing (about $5 per 1M input tokens and $15 per 1M output tokens), a complex agent task can cost $0.10-0.50 per invocation. At scale, this is material.
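A back-of-the-envelope estimate makes the cost point concrete. The sketch below assumes each step re-sends the prompt plus everything accumulated so far; all token counts and prices are hypothetical placeholders, not a quote for any specific model.

```python
def agent_cost_usd(steps: int,
                   base_prompt_tokens: int,
                   tokens_added_per_step: int,
                   output_tokens_per_step: int,
                   input_price_per_m: float,
                   output_price_per_m: float) -> float:
    """Estimate the cost of one agent run, assuming context grows every step."""
    total_input = 0
    total_output = 0
    for step in range(steps):
        # Each call re-sends the base prompt plus all earlier tool results.
        total_input += base_prompt_tokens + step * tokens_added_per_step
        total_output += output_tokens_per_step
    return (total_input * input_price_per_m
            + total_output * output_price_per_m) / 1_000_000

# Assumed numbers for illustration only.
single = agent_cost_usd(1, 2000, 1500, 300, 5.0, 15.0)
five_step = agent_cost_usd(5, 2000, 1500, 300, 5.0, 15.0)
print(f"single call: ${single:.4f}")
print(f"five steps:  ${five_step:.4f} ({five_step / single:.1f}x)")
```

Under these assumptions the five-step run costs about ten times the single call, not five, because the input side grows with every step. Plug in your own token counts and prices before using the result in a business case.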
Reliability. Every tool call is a failure surface. The tool may return an error, a malformed result, or a result that confuses the model into a bad next step. Multi-step agent tasks compound failure probabilities multiplicatively. A task with ten steps that each succeed 95% of the time has a 40% chance of failing somewhere along the way. Agent reliability engineering is a real discipline; most teams underinvest in it.
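The compounding is simple arithmetic: if each step succeeds independently with probability p, a task with n steps succeeds with probability p to the power n. The short sketch below assumes independent failures, which is a simplification, since real failures often correlate.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps

for steps in (1, 5, 10, 20):
    p = task_success_probability(0.95, steps)
    print(f"{steps:>2} steps: {p:.0%} success, {1 - p:.0%} failure")
```

At ten steps the success rate is about 60%, which is the 40% failure rate quoted above. At twenty steps the task fails more often than it succeeds.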
When NOT to use an agent
This deserves its own section because the default in 2025-2026 is to overuse the agent pattern.
Don't use an agent if the task has a fixed, known sequence of steps. If you always retrieve documents, then summarize, then format — that's a pipeline. A pipeline is simpler to build, test, and debug than an agent. Use an agent only when the sequence of steps genuinely needs to vary based on what the model discovers.
An agent is not a pipeline with extra steps. A pipeline runs a fixed sequence; an agent lets the model decide the sequence. Use an agent only when the sequence genuinely needs to vary based on what the model discovers at runtime. If it does not, build the pipeline.
Don't use an agent if the task can be done in a single, well-crafted prompt. Many tasks that teams reach for agents to solve can be handled with a carefully structured prompt with structured output. Complex data extraction, multi-section document generation, analysis with embedded reasoning — all of these often work better with a single long-context call than an agent loop.
Don't use an agent for user-facing, synchronous interactions. If a user is waiting for a response, agent latency (typically 5-30 seconds for a non-trivial task) is a UX failure. Either pipeline the task, reduce the number of steps, or design the UX around async delivery (show intermediate results, allow the user to proceed while the agent works).
Don't use an agent if errors are catastrophic. Agents that write to databases, send emails, execute code, or make API calls in the real world need careful human-in-the-loop gates. An agent that can take actions with irreversible or high-cost consequences needs an approval step before execution. "The agent made a mistake and sent 5,000 emails" is not a recoverable product situation.
Any agent that can take irreversible or high-cost actions — writing to databases, sending emails, spending money, executing code in production — requires a human-in-the-loop approval step before execution. Autonomous does not mean unreviewed.
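One way to picture the approval step is a gate that sits between the agent's proposed action and its execution. The sketch below is a minimal illustration: `REQUIRES_APPROVAL`, `ProposedAction`, and `execute_with_gate` are hypothetical names, and the `approve` callback stands in for whatever review UI or workflow your product uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action names treated as irreversible or high-cost.
REQUIRES_APPROVAL = {"send_email", "write_database", "spend_money"}

@dataclass
class ProposedAction:
    name: str
    payload: dict

def execute_with_gate(action: ProposedAction,
                      run: Callable[[ProposedAction], str],
                      approve: Callable[[ProposedAction], bool]) -> str:
    """Run low-risk actions directly; ask a human before risky ones."""
    if action.name in REQUIRES_APPROVAL and not approve(action):
        return f"blocked: {action.name} was not approved"
    return run(action)

def run_action(action: ProposedAction) -> str:
    # Stub executor; a real one would call the email or database API.
    return f"executed {action.name}"

bulk_email = ProposedAction("send_email", {"to": "all-customers"})
print(execute_with_gate(bulk_email, run=run_action, approve=lambda a: False))
```

The design choice worth noting is that the list of gated actions lives in code the agent cannot change, rather than in the prompt.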
Sprint planning. Team is discussing implementation approach for a 'smart email triage' feature.
Engineer: “I want to build this as a full agent — it reads the email, decides if it's urgent, decides what action to take, and then either replies, escalates, or archives.”
PM: “Walk me through a decision the agent makes that a fixed pipeline couldn't.”
Engineer: “Well... if the email is ambiguous, it might need to check the customer's history first.”
PM: “So: classify urgency → if ambiguous, look up customer history → then decide action. That's three steps with one conditional. That's not an agent, that's a conditional pipeline. Let's build that first and see if we hit a case where the model needs open-ended autonomy before we commit to an agent architecture.”
The pipeline shipped in two weeks. The edge cases that required true agent behavior turned out to be 4% of emails. They built the agent path three months later, for only those cases.
Most agentic behavior can be decomposed into conditionals. Validate the assumption before paying the agent complexity tax.
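The PM's "three steps with one conditional" might look like the sketch below. Everything here is a stand-in: `classify_urgency` replaces a single LLM classification call with keyword rules, and `lookup_customer_history` returns canned data, so the example only illustrates the control flow.

```python
def classify_urgency(email: str) -> str:
    """Stand-in for one LLM call; returns 'urgent', 'routine', or 'ambiguous'."""
    text = email.lower()
    if "outage" in text or "asap" in text:
        return "urgent"
    if "?" in text:
        return "ambiguous"
    return "routine"

def lookup_customer_history(sender: str) -> dict:
    # Stub lookup; a real one would query a CRM.
    escalations = 1 if sender.endswith("@bigcustomer.example") else 0
    return {"open_escalations": escalations}

def triage(email: str, sender: str) -> str:
    urgency = classify_urgency(email)            # step 1: classify
    if urgency == "ambiguous":                   # step 2: the one conditional
        history = lookup_customer_history(sender)
        urgency = "urgent" if history["open_escalations"] > 0 else "routine"
    if urgency == "urgent":                      # step 3: decide the action
        return "escalate"
    return "reply" if "?" in email else "archive"

print(triage("Can you check my invoice?", "ops@bigcustomer.example"))
```

The order of steps is fixed in code, so every path through `triage` can be enumerated and tested. That is the property you give up when the model chooses the sequence.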
Planning, memory, and multi-agent coordination
Planning is how an agent breaks a complex goal into sub-tasks. Reasoning-capable models (Claude's extended thinking mode, OpenAI's o3 and GPT-5 reasoning models) and open-source orchestration frameworks (LangGraph, AutoGen) can generate an explicit plan before executing. This improves reliability for complex tasks and makes the agent's reasoning inspectable.
The PM implication: a planning step adds one LLM call and 2-5 seconds. For tasks where the agent frequently goes down the wrong path, a planning step reduces total turns and total cost. For simple tasks, it adds overhead. Profile your agents to understand whether planning is net positive for your specific use case.
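The trade-off can be framed as a simple expected-value comparison. The model below is deliberately crude and every number in it is hypothetical: it assumes a detour wastes a fixed number of LLM calls and that planning lowers the chance of a detour.

```python
def expected_llm_calls(p_wrong_path: float,
                       calls_on_track: int,
                       calls_wasted_per_detour: int,
                       planning_calls: int = 0) -> float:
    """Expected LLM calls per task under a one-detour simplification."""
    return planning_calls + calls_on_track + p_wrong_path * calls_wasted_per_detour

# Complex task: detours are common without a plan (assumed 40% vs 10%).
complex_no_plan = expected_llm_calls(0.40, 5, 4)
complex_with_plan = expected_llm_calls(0.10, 5, 4, planning_calls=1)

# Simple task: detours are rare either way (assumed 5% vs 2%).
simple_no_plan = expected_llm_calls(0.05, 5, 4)
simple_with_plan = expected_llm_calls(0.02, 5, 4, planning_calls=1)

print(f"complex: {complex_no_plan:.2f} vs {complex_with_plan:.2f} with planning")
print(f"simple:  {simple_no_plan:.2f} vs {simple_with_plan:.2f} with planning")
```

With these assumed rates, planning pays for itself on the complex task and is pure overhead on the simple one. The detour rates are the numbers to measure from your own traces.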
Memory in agents refers to how the agent retains context across turns. Three kinds:
- In-context memory: the full conversation is in the prompt. Simple, but consumes context window fast in long agent tasks.
- External memory: key facts are written to a structured store (a vector DB or key-value store) and retrieved when needed. Scales better, but retrieval adds latency.
- Summarization: periodically summarize and compress the conversation history. Lossy but efficient.
For most product agents in 2026, in-context memory is sufficient for task-level context. External memory matters for user-level persistence across sessions ("remember that I prefer formal tone") and for long-running background tasks.
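The summarization strategy above can be sketched as a function that keeps recent messages verbatim and compresses everything older. `compact_history` and `naive_summarize` are hypothetical names; in a real system the summarizer would itself be an LLM call.

```python
from typing import Callable

def naive_summarize(messages: list[str]) -> str:
    # Placeholder summarizer; a real one would call a model.
    return f"[summary of {len(messages)} earlier messages]"

def compact_history(messages: list[str],
                    keep_recent: int,
                    summarize: Callable[[list[str]], str] = naive_summarize) -> list[str]:
    """Replace all but the most recent messages with a single summary entry."""
    if keep_recent <= 0:
        raise ValueError("keep_recent must be positive")
    if len(messages) <= keep_recent:
        return list(messages)                    # nothing to compress yet
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent

history = [f"turn {i}" for i in range(1, 9)]
print(compact_history(history, keep_recent=3))
```

The lossy part is explicit here: anything not captured by `summarize` is gone for later steps, so facts the agent must not forget belong in external memory instead.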
Multi-agent coordination is the pattern where multiple specialized agents work on sub-tasks in parallel or sequence. Useful when tasks genuinely decompose into parallel workstreams (e.g., a research agent, a coding agent, and a writing agent working on different parts of a report simultaneously). Complex to debug. Multiply all the reliability caveats above. Only use when single-agent approaches genuinely fail at the task.
What to spec before handing off to an engineer
Most engineering teams will build what you ask for. If you ask for "an agent," you'll get an agent. The real question is whether you've specified the right thing.
Before any agentic feature enters development, the spec must answer:
1. What is the goal definition? An agent needs a crisp, testable definition of success. "Help the user complete their task" is not a goal. "Find and summarize the three most relevant policy documents for the user's question, then draft a response that cites all three" is a goal.
2. What tools are available? List every tool by name, describe its inputs/outputs, and state what happens when it errors. Underspecified tools lead to agents that handle errors poorly.
3. What is the stopping condition? When does the agent stop looping? Maximum steps? User confirmation? A confidence threshold? Without a stopping condition, agents loop indefinitely on ambiguous tasks.
4. What can the agent NOT do? Negative constraints matter more than positive ones. Can the agent send emails on the user's behalf without confirmation? Can it modify records? Can it spend money (API calls with per-use fees)? Specify the guardrails explicitly.
5. What does failure look like and what does the user see? Agent failure is not binary — it may be partial success, a stuck loop, or a confident wrong answer. Design the failure UX before you design the success UX.
6. How is it observed? What do you instrument to understand whether the agent is performing well? Every tool call should be logged with timing, inputs, outputs, and success/failure. Every agent session should have a trace you can inspect.
A production agent without a full trace — goal, every tool call with inputs and outputs, every LLM call with token counts, total cost and wall time — is a black box that costs money and produces outputs you cannot explain. Require the trace in the spec.
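One lightweight way to keep the six questions honest is to treat the spec as structured data and check it for gaps before development starts. The sketch below is only an illustration: `AgentSpec`, `ToolSpec`, and `open_questions` are hypothetical names, and the stopping condition is modeled as a step limit only, although confirmation or confidence thresholds are equally valid.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolSpec:
    name: str
    inputs: str          # description of typed inputs
    outputs: str         # description of typed outputs
    on_error: str        # what happens when the tool fails

@dataclass
class AgentSpec:
    goal: str = ""
    tools: list = field(default_factory=list)
    max_steps: Optional[int] = None
    forbidden_actions: list = field(default_factory=list)
    failure_ux: str = ""
    trace_required: bool = False

def open_questions(spec: AgentSpec) -> list:
    """Return the spec questions that are still unanswered."""
    gaps = []
    if not spec.goal.strip():
        gaps.append("goal definition")
    if not spec.tools:
        gaps.append("tool list")
    if any(not tool.on_error for tool in spec.tools):
        gaps.append("tool error behavior")
    if spec.max_steps is None:
        gaps.append("stopping condition")
    if not spec.forbidden_actions:
        gaps.append("negative constraints")
    if not spec.failure_ux.strip():
        gaps.append("failure UX")
    if not spec.trace_required:
        gaps.append("observability")
    return gaps

draft = AgentSpec(goal="Find and summarize the three most relevant policy documents",
                  max_steps=8)
print(open_questions(draft))
```

A draft that still returns gaps is a signal to keep specifying, not to start building.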
Observability in agentic systems
A production agent without observability is a black box that costs money and produces outputs you can't explain. Observability is not optional.
The minimum viable agent trace records, per agent session:
- The initial user goal
- Every tool call (name, inputs, outputs, duration)
- Every LLM call (model, prompt tokens, completion tokens, duration)
- The final output
- Total wall time and total cost
Tools like LangSmith (LangChain), Weights & Biases Traces, and Helicone provide this out of the box. The PM's job is to require this instrumentation in the spec, not discover its absence after launch.
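If it helps to see the minimum trace as a data shape, the sketch below models it with plain dataclasses. This is not the schema of any of the tools named above; the class and field names are assumptions that mirror the list of required fields.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    inputs: dict
    outputs: str
    duration_s: float
    ok: bool = True

@dataclass
class LLMCall:
    model: str
    prompt_tokens: int
    completion_tokens: int
    duration_s: float
    cost_usd: float

@dataclass
class AgentTrace:
    goal: str                                    # the initial user goal
    tool_calls: list = field(default_factory=list)
    llm_calls: list = field(default_factory=list)
    final_output: str = ""
    wall_time_s: float = 0.0

    @property
    def total_cost_usd(self) -> float:
        return sum(call.cost_usd for call in self.llm_calls)

    @property
    def total_tokens(self) -> int:
        return sum(c.prompt_tokens + c.completion_tokens for c in self.llm_calls)

# Hypothetical session, for illustration only.
trace = AgentTrace(goal="Summarize refund policy")
trace.llm_calls.append(LLMCall("example-model", 1200, 150, 1.8, 0.01))
trace.tool_calls.append(ToolCall("search_docs", {"q": "refund"}, "3 docs", 0.4))
trace.llm_calls.append(LLMCall("example-model", 2100, 300, 2.5, 0.02))
trace.final_output = "Refunds are available within 30 days."
print(trace.total_tokens, round(trace.total_cost_usd, 2))
```

Whatever tool produces it, a trace with this shape lets you answer "what did it do, how long did it take, and what did it cost" for any single session.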
What to review in production:
- Task completion rate (did the agent reach the stopping condition successfully?)
- Mean and P95 step count (are agents taking more steps than expected?)
- Tool error rate (which tools fail most often?)
- Cost per session (are certain task types disproportionately expensive?)
- User satisfaction signals correlated with trace characteristics
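Most of the review metrics above fall out of simple aggregation over session records. The sketch below assumes each session has already been reduced to a small dictionary; the field names and sample values are hypothetical, and the tool error rate is computed across all tools rather than per tool to keep the example short.

```python
import math
from statistics import mean

def p95(values: list) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def review_metrics(sessions: list) -> dict:
    total_tool_calls = sum(s["tool_calls"] for s in sessions)
    return {
        "completion_rate": mean(1 if s["completed"] else 0 for s in sessions),
        "mean_steps": mean(s["steps"] for s in sessions),
        "p95_steps": p95([s["steps"] for s in sessions]),
        "tool_error_rate": sum(s["tool_errors"] for s in sessions) / total_tool_calls,
        "mean_cost_usd": mean(s["cost_usd"] for s in sessions),
    }

# Made-up sessions for illustration.
sessions = [
    {"completed": True,  "steps": 4,  "tool_calls": 3, "tool_errors": 0, "cost_usd": 0.08},
    {"completed": True,  "steps": 6,  "tool_calls": 5, "tool_errors": 1, "cost_usd": 0.15},
    {"completed": False, "steps": 10, "tool_calls": 9, "tool_errors": 4, "cost_usd": 0.41},
    {"completed": True,  "steps": 4,  "tool_calls": 3, "tool_errors": 0, "cost_usd": 0.08},
]
metrics = review_metrics(sessions)
print(metrics)
```

In this toy data the one failed session also has the highest step count, error count, and cost. That kind of correlation is what the trace review is meant to surface.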
What to do this week
- Audit your current "agent" ideas. For each one, ask: does the sequence of steps genuinely need to vary based on what the model discovers, or is it a fixed pipeline with conditionals? If the latter, spec a pipeline. Save the agent complexity for cases that actually need it.
- Write a tool spec for one planned agent tool. Name, inputs (typed), outputs (typed), error cases, latency SLA, cost per call. If you can't write this, you aren't ready to build the agent.
- Define your stopping conditions. For any agentic feature in planning, state explicitly when the agent stops and what the user sees if it fails.
Where to go next
- Eval Design — how to test agentic pipelines systematically
- Latency and Cost — the economics of multi-step agent calls
- Safety and Auditability — what a PM owns when an agent makes a mistake
- Writing PRDs — how to extend the PRD format for agentic features