Tool Use, Function Calling, Agents — The Maturity Ladder — the pm manual

Most products that say they are 'agents' are tool-call wrappers — and that is fine. The actual agent loop is harder than the demo videos let on, and the gap between rung three and rung six is where careers, quarters, and seed rounds go to die.

Talvinder Singh

The most expensive sentence in AI product strategy in 2026 is "let's make it an agent." It sounds like ambition. It is usually scope creep wearing a hoodie. The teams that ship working AI features are not the ones with the most elaborate agent diagrams on the whiteboard — they are the ones who knew, before kickoff, which rung of the ladder the feature actually needed, and refused to climb a rung higher than the user's job demanded.

Chapter 1 was the gate (is this an AI problem at all?). Chapter 2 picked the model. Chapter 3 wrote the prompt. Chapter 4 built the eval. Chapter 5 designed around hallucination. This chapter is the next decision: how much autonomy does this feature need, and what does each rung cost in production?

Six rungs. Not steps you graduate through — a menu. Pick the one the job needs.

The maturity ladder, plainly

Rung 1 — Single-shot prompt. One input, one output, one model call. No tools, no loop. Notion AI rewriting a paragraph, ChatGPT answering a question, a "summarise this ticket" button. Good for: bounded, language-shaped tasks where the model has everything it needs in the prompt. Costs: cents and seconds. Fails: when the answer needs facts the model does not have, or when it needs to do something in the world rather than say something about it. This is the rung 80% of shipped AI features should sit on. Most do not, because the rung above demos better.

Rung 2 — Prompt with tool / function calling. One round trip from the user's perspective, but the model can call out — fetch a fact, look up a row, hit one API — and fold the result into its answer. "What's my account balance?" routed through a getBalance function. Good for: grounding answers in fresh, user-specific, or proprietary data without a full retrieval system. Costs: 2-3x rung 1, plus the tool's own latency and failure modes. Fails: when the tool returns garbage and the model dresses it up confidently, or when the model picks the wrong tool because the schema description was loose.

Rung 3 — Multi-step tool use (the read-write loop). The model calls a tool, looks at the result, decides what to do next, and keeps going until it finishes or hits a stop condition. This is where "tool calling" becomes "agentic" — there is now a loop with the model in the steering seat. Claude Code, Cursor agent mode, Copilot Workspace, and the useful side of Operator all live mostly here. Good for: tasks where the path is not knowable up front — searching a codebase, debugging a test, drafting and revising a doc. Costs: 5-20x rung 2, plus real risk of runaway loops if you do not set budgets. Fails: silently. The loop produces a plausible final answer that is wrong in a way you cannot detect without trace-level eval.

Rung 4 — Plan-then-execute with reflection. The model writes a plan, executes step by step, and at intervals checks its own work against the plan and decides whether to revise. The reflection step is the difference between a smart loop and a hopeful one. Devin promised this rung; Claude Code does a stripped-down version. Good for: tasks long enough that a flat loop loses the thread — multi-file refactors, multi-day research. Costs: another 2-5x on top of rung 3. Fails: when the plan is wrong in step one and reflection rubber-stamps it, or when the model "reflects" itself into a worse approach than the first try.

Rung 5 — Multi-agent with explicit roles. Multiple model instances — researcher, writer, critic, executor — talking through a coordinator. The architecture every framework vendor wants you to use because it produces the most impressive demo. Good for: a small set of problems where the roles are genuinely different and the work parallelises. Costs: the sum of every rung-3 agent plus the coordinator plus the context passed between them. Fails: agents debate, critic keeps sending writer back to revise, coordinator loses the thread, wall-clock blows past anything the user will sit through. The honest 2026 read: marginal value over a well-designed rung 3 is small for most products, and the marginal cost is large.

Rung 6 — Long-horizon agent with memory and recovery. Runs for hours or days, holds state across sessions, can be interrupted and resumed, recovers from its own mistakes. The rung the demo videos promised in 2023-24, and the rung almost nothing has actually delivered to product-ready quality in 2026. Devin was the most public attempt; the candid eighteen-month review is that it works in narrow tracks and breaks outside them. Operator sits closer to rung 3-4 with screen-control bolted on. Good for: experimentation, research preview, internal tooling where humans audit the trace. Fails in ways that are hard to even classify, because the failure happened six steps ago and the agent has been compensating ever since.

Read the ladder as a forcing function. Every additional rung adds at least one of three things — cost, latency, or blast radius — and very often all three. The judgment is not "how impressive is the architecture." It is "what is the smallest rung that clears the user's bar, and is the gap from there to the next rung worth what it buys?"

What the 2026 product landscape actually shipped

The marketing makes the ladder look flatter than it is. Where the named products actually sit:

Claude Code (Anthropic) sits at rung 3 with deliberate rung-4 behaviour when you ask for a plan. The cleanest example of a tool-using loop done right in 2026 — bounded domain (your repo), small set of high-leverage tools (read, edit, run, search), loop tuned until failure modes are predictable. It is what people mean when they say "agent" and actually deliver value.

Cursor Agent mode sits at rung 3 with the same shape — IDE chrome on top, same read-edit-run loop underneath, scoped to a workspace.

Devin (Cognition) was pitched at rung 6. The 2024 demos showed a long-horizon engineer that could pick up a Jira ticket and ship a PR. The 2025 follow-through showed a system that works in a narrow envelope — small, well-specified, isolated tasks — and degrades outside it the way every rung-6 attempt has so far. The lesson is not "Devin failed." The lesson is that rung 6 is hard enough that even a well-funded specialist team is still climbing it.

GitHub Copilot Workspace sits at rung 4 — plan, then execute — over a GitHub-native context. It works because the domain is structured, the tools are bounded, and the user reviews the plan before execution.

OpenAI Operator is rung 3-4 wearing rung-6 clothing. It can drive a browser, click buttons, fill forms. When it works it looks magical. When it fails, it fails in the most product-damaging way possible — by confidently doing the wrong thing on a real website with the user's session. The product question Operator forces is not "is the model good enough?" It is "what blast-radius budget did you give it, and who pays when it spends that budget wrong?"

The products that ship value in 2026 picked a low rung honestly and tuned it well. The ones that pitched a high rung are still climbing in public.

Agents need a budget, four of them

Every agent — every rung 3 and up — must run inside four budgets. If you cannot state all four before you ship, you do not have an agent, you have a runaway loop with a press release.

1. Wall-clock. How many seconds before the user gives up? For an in-product feature, the bar is under 30 seconds before you need a streaming "I am working" affordance, under 3 minutes before the user closes the tab. For a background task, hours — but a deadline beyond which the agent stops and reports what it has rather than spinning forever.

2. Dollars. A multi-step tool loop on a frontier model can spend $1-5 per task. If your unit economics assume $0.02 per use because that was the rung-1 cost, a rung-3 loop will eat your margin in a week. Hard token cap per session, no exceptions.

3. Tool-call count. "Maximum 20 tool calls before you stop and ask for help" is a far more useful guardrail than "be efficient." Loops fail by ballooning, not by being slow per step. Pick a number, log it, alert on near-misses.

4. Side-effect blast radius. What is the worst thing this agent is allowed to do without a human? "Read anything in the repo" is one blast radius. "Edit files" is bigger. "Push to a remote branch" is bigger again. "Merge to main" is a different planet. The single highest-leverage product decision for any agent is: which actions are auto-approved, which require confirmation, which are off-limits entirely. (Chapter 5 lives next to this — if a wrong answer is bad, a wrong action is worse.)

These four are not nice-to-haves. They are the spec. Write them in the PRD, alongside the eval set, before any code is written.

MCP changes the math

The Model Context Protocol — MCP, the open spec for how a model talks to external tools, originated by Anthropic in late 2024 and broadly adopted across the major labs by mid-2025 — is the biggest shift in the agent stack since function calling itself. The first-order effect is boring: a standard wire format for tool definitions. The second-order effect is the one that matters: interoperability flips build-vs-buy for tools.

Before MCP, every tool was a custom integration. Building "give the agent access to the company wiki" was a project. Buying it was impossible — nobody shipped a generic "wiki for agents." Everyone built; everyone built badly; everyone re-built.

With MCP — and the 2025-26 explosion of servers for Notion, Slack, Linear, Postgres, GitHub, Stripe, Drive — the question changes. For most "give the agent access to X" requirements in 2026, somebody has already shipped a server. Your job is to evaluate it like any third-party dependency: trust the vendor? Trust the maintenance cadence? Does the schema match the actions you want? Is the auth model compatible with how your users sign in? Yes to all four, you save weeks. No on any, you build — to the MCP spec, so the next team can reuse it.

The product implication: agent capability is now a procurement decision, not just an engineering one. PMs who win at agent products in 2026 know which MCP servers exist, which are production-grade, and which are vapor.

Two caveats. MCP makes it cheap to add tools, which makes it cheap to bloat — curate the toolset like you curate features. And security review for any MCP server is a real exercise: you are letting a model call into another vendor's surface, with credentials, on the user's behalf. The blast-radius budget applies doubly here.

What to do on Monday morning

Pull up the AI feature highest on your roadmap. Answer four questions in writing, in this order.

Which rung does it actually need? Be honest. If a rung-2 function call clears the bar, ship rung 2.

What are the four budgets — wall-clock, dollars, tool-calls, blast radius — and which budget owns the kill switch? Pick one of the four as the master, the one that automatically halts the loop. The tool-call count is usually the right choice because it is the cheapest to enforce.

Which tools does this agent need, and how many of them already exist as MCP servers your team can adopt rather than build? If more than half are buy-not-build, you just saved a sprint. If none are, ask why — it usually means the domain is new or the tools are too privileged for off-the-shelf.

How will you eval the loop, not just the steps? Chapter 4's eval set has to include end-to-end traces, not just single prompts. A rung-3 system that scores 95% on individual tool calls can score 60% end-to-end because errors compound. Build the trace eval before the trace ships.

The next chapter (RAG, Fine-Tune, or Context Window?) is the companion to this one — how to give your model the right data is half the question; how to let it act on the world is the other half. Read them as a pair.

Rules

Where to go next

Chapter 7 — RAG, fine-tune, or context window: how to give the model the right data, the companion to "how to let it act." (RAG, Fine-Tune, or Context Window?)
Chapter 4 — Eval before launch: the trace-level evaluation an agent needs, not just prompt-level. (Eval Before Launch)
Chapter 5 — Hallucination as a product problem: wrong answers are bad, wrong actions are worse. (Hallucination as a Product Problem)
Chapter 9 — Cost & latency as first-class product constraints: the budgets in this chapter become line items there.
Companion: Working with Engineers — agent specs are where this seam matters most.