The most common expensive mistake in agent design is a model-selection meeting that happens before anyone has written down what verbs the agent needs.
Model choice matters. Toolbelt design matters more for most product surfaces. This is not a provocative claim; it follows directly from how the agent loop works. The model reads the available tools on every turn and selects among them. The quality of that selection is bounded by the quality of the vocabulary. A clear, well-named, minimal toolbelt reduces the selection problem to something tractable. A crowded, confusable toolbelt makes it intractable regardless of model capability. A capable model with a poorly designed toolbelt will misfire on confusable pairs, call irreversible tools when it meant read-only ones, and accumulate tool-call history that crowds the context window. A less capable model with a tight, well-named toolbelt will often outperform it on the same tasks.
The toolbelt is the agent's vocabulary for acting on the world. If the vocabulary is imprecise, the sentences will be imprecise.
Start with the side-effect spectrum. Every tool belongs to one of four bands: read (fetches data, no mutations), write (mutates local state), external (calls a third-party service that may have its own side effects), irreversible (deletes, sends, charges, publishes — actions that cannot be cleanly undone). Knowing which band a tool lives in tells you how much human oversight it requires and what the blast radius of a misfire is.
The four-band taxonomy is not bureaucracy. It is the information you need to answer the most important architecture question for any agent action: can the agent both decide to do this and execute it in one call, or should a human be in the loop?
Read-band tools can almost always be autonomous. Irreversible-band tools almost never should be. The most dangerous design pattern in agent toolbelts is collapsing the decision and the execution of an irreversible action into a single tool. The model calls send_email, and the email goes. There is no confirmation step, no preview, no undo. This is not a model failure when it misfires. It is an architectural choice you made.
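A minimal sketch of the four bands as data, with a default oversight policy layered on top. The tool names and the policy function are illustrative assumptions; in particular, the text only commits to read being autonomous and irreversible being gated, so how write and external tools are treated here is a placeholder choice.

```python
from enum import Enum


class Band(Enum):
    """The four side-effect bands, ordered by blast radius."""
    READ = 1          # fetches data, no mutations
    WRITE = 2         # mutates local state
    EXTERNAL = 3      # calls a third-party service
    IRREVERSIBLE = 4  # deletes, sends, charges, publishes


def requires_human_confirmation(band: Band) -> bool:
    """Hypothetical default policy: only irreversible tools are gated.

    A real system may also choose to gate some WRITE or EXTERNAL tools.
    """
    return band is Band.IRREVERSIBLE


# Illustrative toolbelt (assumed names): tool name -> band.
TOOLBELT = {
    "search_lessons_by_topic": Band.READ,
    "save_draft_note": Band.WRITE,
    "send_email": Band.IRREVERSIBLE,
}

gated = sorted(
    name for name, band in TOOLBELT.items()
    if requires_human_confirmation(band)
)
print(gated)  # ['send_email']
```

Recording the band next to the tool definition means the oversight question is answered once, at design time, rather than rediscovered after a misfire.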
The second critical dimension is confusability. Two tools are confusable when their names, descriptions, or parameter schemas are similar enough that a model under load (at turn thirty, with a partially evicted context) would have a meaningful probability of calling one when it meant the other.
The model does not browse your source code when choosing a tool. It reads the tool names and descriptions that arrive in the context. If you have getUser, lookupUser, and findUserById — all doing slightly different things — the model will mix them up. Not because the model is weak. Because the vocabulary is ambiguous. A tired human reading those three names would also mix them up.
The confusability test is simple: read your tool list to a colleague who has not seen it. Ask them to describe what each tool does. Where they hesitate or guess wrong, you have a confusability problem.
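The colleague test can be preceded by a crude automated pre-screen. The sketch below flags two names when their leading verbs are near-synonyms and they act on the same object. The synonym groups and the "verb, then object" naming assumption are hypothetical; it only looks at names, not descriptions or parameter schemas, and it does not replace reading the list to a person.

```python
import re
from itertools import combinations

# Hypothetical synonym groups: verbs a reader would treat as interchangeable.
VERB_GROUPS = [
    {"get", "fetch", "lookup", "find", "retrieve", "search"},
    {"log", "record", "save", "store"},
    {"annotate", "tag", "label"},
]


def tokens(name: str) -> list[str]:
    """Split a camelCase or snake_case name into lowercase words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name).replace("_", " ")
    return spaced.lower().split()


def canonical_verb(verb: str) -> str:
    """Map a verb to one representative of its synonym group, or itself."""
    for group in VERB_GROUPS:
        if verb in group:
            return min(group)
    return verb


def confusable(a: str, b: str) -> bool:
    """Flag names whose verbs are synonyms and whose objects match."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) < 2 or len(tb) < 2:
        return False  # a one-word name has no object to compare
    return canonical_verb(ta[0]) == canonical_verb(tb[0]) and ta[1] == tb[1]


names = ["getUser", "lookupUser", "findUserById", "send_email"]
pairs = [(a, b) for a, b in combinations(names, 2) if confusable(a, b)]
print(pairs)
# [('getUser', 'lookupUser'), ('getUser', 'findUserById'),
#  ('lookupUser', 'findUserById')]
```

Every pair the screen flags is a candidate to bring to the colleague test; pairs it misses may still be confusable through their descriptions.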
Why it matters now
ReAct-style tool use — the pattern of interleaved reasoning and acting introduced by Yao et al. in 2022 — became the dominant agent architecture by 2024. Every major framework implements a variant of it. The pattern is solid. The shipping discipline did not keep up.
Most production toolbelts grew organically. Someone needed one tool, then three, then ten, then twenty. Nobody ran the confusability audit. Nobody enforced the irreversibility gate. By the time the toolbelt had twenty-five entries, the model was making tool-selection errors that looked like model failure but were actually vocabulary failure.
Toolbelt curation is a product decision that gets treated as an engineering backlog. That misclassification is expensive.
The correct sequencing is: define the jobs to be done, or JTBD (what must the agent be able to do?); derive the minimum verb set (what tool does each job require?); name each tool with the specificity of an API contract (not search, but search_lessons_by_topic); audit for confusable pairs; gate irreversible actions. That sequence takes about two hours for a five-tool system and can save a week of debugging for a twenty-five-tool system.
Sources you should trust
- Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (2022). The paper that formalized interleaved reasoning and tool use. Read it for the architectural pattern, not for the specific benchmark numbers — those have been superseded. The core insight about the reasoning trace being load-bearing for correct tool selection is still directly applicable.
- Anthropic's tool-use best practices documentation. Operator-grade guidance on naming conventions, description writing, parameter schema design. The section on avoiding confusable names is practical and immediately applicable. Read it before designing any toolbelt larger than five tools.
- Toolformer (Schick et al., 2023). The paper that showed models can learn when to call tools and when not to — the "knowing when to stay silent" capability. Relevant context for understanding that over-equipped toolbelts harm the model's ability to choose the right moment to act.
A recipe
A toolbelt audit for any agent system in flight:
- List every tool. Name, one-sentence purpose, side-effect band (read / write / external / irreversible). If you cannot write the one-sentence purpose without using "and," the tool is doing two things and should be split.
- Draw the confusability graph. For every pair of tools that share a first word, similar descriptions, or overlapping parameters, draw a line. Tools with two or more lines are your misfire candidates. Rename or merge them.
- For every irreversible-band tool, ask: should the agent be able to call this unilaterally, or should there be a separate confirmation step? The default answer is that the agent should not call it unilaterally. Build the confirmation as a separate read-band tool that returns a preview; build the irreversible action as a separate execute tool that requires confirmation state.
- Remove every tool that has not been called in the last twenty production runs. Dead tools are cognitive load that inflates the model's selection task without adding capability.
- Target five tools. Ten is acceptable with careful naming. Twenty is a smell. Twenty-five is the toolbelt that will cause the incident that brings you back to this lesson.
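The audit steps above can be sketched as one pass over a tool inventory. Everything here is an illustrative assumption: the Tool fields, the sample tools, and the "growing" label for the eleven-to-nineteen range that the recipe does not name. The confusability graph implements only the shared-first-word rule; similar descriptions and overlapping parameters would need their own checks.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Tool:
    name: str
    purpose: str       # one-sentence purpose
    band: str          # "read" | "write" | "external" | "irreversible"
    recent_calls: int  # calls seen in the last twenty production runs


def audit(tools: list[Tool]) -> dict:
    # Step 1: a purpose that needs "and" suggests two tools in one.
    split = [t.name for t in tools if re.search(r"\band\b", t.purpose, re.I)]

    # Step 2: confusability graph, one edge per pair sharing a first word.
    edges = defaultdict(set)
    for a, b in combinations(tools, 2):
        if a.name.split("_")[0] == b.name.split("_")[0]:
            edges[a.name].add(b.name)
            edges[b.name].add(a.name)
    misfire = sorted(n for n, linked in edges.items() if len(linked) >= 2)

    # Step 3: irreversible tools that need a separate confirmation surface.
    gate = [t.name for t in tools if t.band == "irreversible"]

    # Step 4: dead tools.
    dead = [t.name for t in tools if t.recent_calls == 0]

    # Step 5: size verdict ("growing" is an assumed label for 11 to 19).
    n = len(tools)
    if n <= 5:
        size = "target"
    elif n <= 10:
        size = "acceptable"
    elif n < 20:
        size = "growing"
    else:
        size = "smell"

    return {"split": split, "misfire": misfire, "gate": gate,
            "dead": dead, "size": size}


tools = [
    Tool("get_user", "Fetch one user by id.", "read", 14),
    Tool("get_user_orders", "Fetch the orders for one user.", "read", 9),
    Tool("get_invoice", "Fetch one invoice by id.", "read", 6),
    Tool("send_invoice", "Render and email an invoice.", "irreversible", 3),
    Tool("archive_report", "Move a report to cold storage.", "write", 0),
]
report = audit(tools)
print(report)
```

The shared-first-word rule is deliberately blunt: it flags every get_ tool in this sample, which is the prompt to check whether their names still distinguish them at a glance.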
The smell of it going wrong
- The agent keeps calling one member of a confusable pair when the task clearly requires the other. This will be filed as a model regression. It is a naming problem.
- Tool-call history shows a pattern of the agent calling a tool, getting a result, calling the same tool again with different parameters, and repeating. The agent is using retrieval-style reasoning to disambiguate a vocabulary problem it should not have.
- There are tools in the definition that appear zero times in the call log. The model has no reliable signal for when to use them, or cannot distinguish them from their confusable peers.
- A single tool has a description that includes the word "or" — it does one thing or another depending on parameter values. This is two tools wearing a trench coat.
- The irreversible tool has the same API shape as the read tool beside it, with no confirmation surface.
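Several of these smells can be checked mechanically against the tool definitions and a call log. The sketch below assumes a log that is simply an ordered list of tool names (parameters omitted), a hypothetical threshold of three consecutive calls for a retry loop, and made-up tool names and descriptions.

```python
import re
from itertools import groupby


def never_called(defined: list[str], call_log: list[str]) -> list[str]:
    """Tools present in the definition but absent from the call log."""
    called = set(call_log)
    return [name for name in defined if name not in called]


def retry_loops(call_log: list[str], min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (tool, run_length) for each run of consecutive repeat calls."""
    runs = [(name, len(list(group))) for name, group in groupby(call_log)]
    return [(name, n) for name, n in runs if n >= min_repeats]


def or_descriptions(descriptions: dict[str, str]) -> list[str]:
    """Tools whose description contains the word "or"."""
    return [name for name, text in descriptions.items()
            if re.search(r"\bor\b", text, re.I)]


# Illustrative definitions and log (assumed, not from a real system).
descriptions = {
    "search_lessons_by_topic": "Return lessons whose topic matches the query.",
    "get_model": "Return a mental model by id.",
    "fetch_model": "Return a mental model by name.",
    "export_card": "Export the card as PDF or publish it to the feed.",
}
call_log = ["search_lessons_by_topic", "get_model", "get_model",
            "get_model", "export_card"]

print(never_called(list(descriptions), call_log))  # ['fetch_model']
print(retry_loops(call_log))                       # [('get_model', 3)]
print(or_descriptions(descriptions))               # ['export_card']
```

A hit from any of these checks is a reason to look at the vocabulary before opening a model-regression ticket.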
A judgment call from real work
The PL pl-judgment MCP server is being designed explicitly against this discipline. The server exposes structured judgment operations — retrieving relevant mental models, annotating decisions, logging decisions for learning — as a small set of precisely named tools.
The early design had eleven tools. The confusability audit cut it to six. Three pairs were confusable: get_model / fetch_model, annotate_decision / tag_decision, and log_judgment / record_judgment. Each pair was merged into a single tool under the clearer name, because the distinction between the two did not survive a one-sentence description.
The irreversible tool, publish_judgment_card, which exports an annotated decision to the shared PL learning feed, was split into two: preview_judgment_card (read-band, returns the card as it would appear) and publish_judgment_card (irreversible-band, takes the preview result as a required parameter). If the model cannot produce a preview, it cannot publish. The gate is structural, not advisory.
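A minimal sketch of what such a structural gate could look like on the server side. This is not the actual pl-judgment implementation: the class, the in-memory storage, the preview_id field, and the single-use rule are all assumptions made for illustration. Only the two tool names and the rule that publishing requires a preview result come from the design described above.

```python
import json
import secrets


class JudgmentCardServer:
    """Hypothetical in-memory server with a preview-then-publish gate."""

    def __init__(self) -> None:
        self._pending: dict[str, dict] = {}  # preview_id -> card as previewed
        self.feed: list[dict] = []           # stand-in for the shared feed

    def preview_judgment_card(self, card: dict) -> dict:
        """Read-band: returns the card as it would appear. Nothing is published."""
        preview_id = secrets.token_hex(8)
        self._pending[preview_id] = dict(card)
        rendered = json.dumps(card, indent=2, sort_keys=True)
        return {"preview_id": preview_id, "rendered": rendered}

    def publish_judgment_card(self, preview: dict) -> dict:
        """Gated action: publishes exactly the card that was previewed.

        The only input is a preview result, so there is no way to call
        this tool without having called the preview tool first.
        """
        card = self._pending.pop(preview.get("preview_id"), None)
        if card is None:
            raise PermissionError("publish requires an unused preview result")
        self.feed.append(card)
        return card


server = JudgmentCardServer()
preview = server.preview_judgment_card({"title": "Gate irreversible tools"})
print(server.feed)   # [] because previewing publishes nothing
server.publish_judgment_card(preview)
print(len(server.feed))  # 1
```

Because the publish call accepts only a preview result and consumes it, a fabricated or reused preview fails with an error instead of reaching the feed. Whether a human must also approve the preview before it is passed on is a separate policy choice that this sketch does not model.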
Six tools, zero confusable pairs, one structural irreversibility gate. That is the discipline in practice.
The audit took ninety minutes. The resulting toolbelt will prevent entire classes of misfire that would otherwise have been filed as model regressions and investigated at far greater cost. Toolbelt curation is the cheapest performance improvement available before a system ships.
Rules from this lesson
- The size and shape of the toolbelt is a product decision, not a backlog accumulation.
- Confusable tool names cause confusable tool calls; rename before reaching for a bigger model.
- Irreversible actions deserve their own confirmation surface; never collapse decide-and-execute into one tool call.
- A tool that does two things is two tools that have not been split yet.