Stop asking "what tools does the agent need?" and start asking "what verbs is the agent allowed to use in our domain?"
The distinction is not pedantic. When you frame the question as tool needs, you end up with a list assembled from whatever is available — every API the product touches, every operation the system supports. When you frame it as verb design, you end up with a grammar: a small, coherent set of operations the agent can compose to do work. One produces a junk drawer. The other produces a language.
Toolformer (Schick et al. 2023) was an early rigorous demonstration that language models can teach themselves to call external tools when those tools are named and described well. What the paper also suggests, without making much of it, is that models learn tool use the same way they learn language: from the shape of the vocabulary. A bloated vocabulary degrades the learning signal. This is not a 2023 finding you can dismiss as pre-frontier; the same dynamic holds in 2026 harnesses built on GPT-4 or Claude 3.5. The model is a reasoning engine, not a router. It reads the toolbelt the same way it reads text: it forms expectations about what each verb means, and it gets confused when the verbs are ambiguous or overlapping.
The picture
Think of the toolbelt as a grammar diagram. The model is the speaker. The toolbelt is the verb list. The parameters are the grammar of each verb — the cases, the objects, the modifiers. The return shape is the sentence the verb produces. If you have ever tried to learn a language from a phrasebook, you know the failure mode: fifty phrases covering specific situations, none of them compositional, none of them load-bearing when the situation changes by an inch. A good toolbelt reads like a chef's knife set. Four knives, each with a name, a purpose, and a shape you can hold. Not forty knives, most of which are decorative.
The test: can you write the agent's job description in one sentence, then immediately name the verbs it needs without consulting the codebase? If you have to look them up in the codebase, the verbs are accidental rather than designed.
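The grammar picture can be made concrete with a minimal sketch. Everything here is hypothetical and for illustration only: the `Verb` class, the four triage verbs, and `render_for_model` are assumed names, not part of any library or of a real harness. The point is only that each verb carries a name, a purpose, its parameters (the grammar), and a return shape (the sentence it produces).

```python
from dataclasses import dataclass


@dataclass
class Verb:
    """One toolbelt entry: a name, a purpose, parameters, and a return shape."""
    name: str                   # the verb itself
    purpose: str                # one sentence the model reads to decide when to call it
    parameters: dict[str, str]  # the verb's grammar: argument name -> meaning
    returns: dict[str, str]     # the sentence the verb produces: field -> meaning


# Hypothetical four-verb toolbelt for a message-triage agent.
TOOLBELT = [
    Verb("read_message", "Fetch one incoming customer message by id.",
         {"message_id": "id of the message"},
         {"sender": "who wrote it", "body": "message text"}),
    Verb("respond", "Send a reply to the customer.",
         {"message_id": "message being answered", "text": "reply body"},
         {"sent": "whether the reply went out"}),
    Verb("escalate", "Hand the message to a human queue.",
         {"message_id": "message to hand off", "reason": "why a human is needed"},
         {"queue": "queue that received it"}),
    Verb("close", "Mark the conversation as resolved.",
         {"message_id": "message to close"},
         {"closed": "whether it is now closed"}),
]


def render_for_model(toolbelt: list[Verb]) -> str:
    """Render the toolbelt as the short vocabulary list the model would read."""
    lines = []
    for verb in toolbelt:
        args = ", ".join(verb.parameters)
        outs = ", ".join(verb.returns)
        lines.append(f"{verb.name}({args}) -> {{{outs}}}: {verb.purpose}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_for_model(TOOLBELT))
```

If the rendered list cannot be read aloud as a short, non-overlapping vocabulary, that is a sign the verbs were collected rather than designed.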
Why it matters now
Toolformer and ReAct established tool use as the dominant pattern for capable agents. In late 2024, MCP arrived as a distribution standard (more on that in Lesson 2). The consequence is that mounting a tool became cheap. You can add a tool to a Claude harness in twenty minutes. The cost asymmetry changed: adding a tool is now cheap, while the cost of carrying too many has grown. Every added tool increases the probability of a misfire, because the model's uncertainty about which verb to call scales with the size of the vocabulary.
The discipline of toolbelt design has not kept pace with the ease of toolbelt growth.
A source you should trust
Toolformer (Schick et al. 2023) is the foundational paper on letting language models teach themselves to use tools in a self-supervised way, starting from only a handful of demonstrations per tool rather than a large human-annotated dataset. Read Section 3 on how tools are described to the model; the naming and description choices made there turn out to be more important than the tools themselves.
Anthropic's tool-use cookbook covers operator-grade discipline: how to write tool descriptions that reduce model confusion, how to structure the parameters of each tool, and when to split vs. merge tools. It is readable in an afternoon and directly applicable.
A recipe
A toolbelt design protocol for a new feature:
- Write the agent's job description in one sentence. ("Triage incoming customer messages and respond, escalate, or close.") If you cannot write it in one sentence, the agent's scope is under-defined and no toolbelt will fix that.
- List the minimum set of verbs the agent must speak to do that job. Set a cap of seven before you write down the first one.
- For each verb, decide its side-effect band: read-only, write-reversible, write-irreversible, external. This is not bureaucracy — it determines your confirmation and audit requirements.
- Pair every irreversible verb with an explicit confirmation step. The agent should not both decide to close a ticket and close it in a single call.
- Remove every verb that fails the "would this verb ever be the right answer?" test. If you cannot construct a scenario where calling this tool is the correct agent move, cut it.
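The mechanical parts of the protocol above (the cap of seven, the side-effect bands, and the confirmation pairing) can be checked in code. The following is a minimal sketch under stated assumptions: `Band`, `Verb`, `confirmed_by`, `check_toolbelt`, and the triage verbs are all hypothetical names chosen for illustration. The one-sentence job description and the "would this ever be the right answer?" test remain human judgment and are not automated here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

MAX_VERBS = 7  # the cap from step 2 of the protocol


class Band(Enum):
    """The four side-effect bands from step 3."""
    READ_ONLY = "read-only"
    WRITE_REVERSIBLE = "write-reversible"
    WRITE_IRREVERSIBLE = "write-irreversible"
    EXTERNAL = "external"


@dataclass
class Verb:
    name: str
    band: Band
    # Hypothetical field: name of the verb that must run before this one.
    confirmed_by: Optional[str] = None


def check_toolbelt(verbs: list[Verb]) -> list[str]:
    """Return a list of protocol violations; an empty list means the belt passes."""
    problems = []
    names = {v.name for v in verbs}
    if len(verbs) > MAX_VERBS:
        problems.append(f"{len(verbs)} verbs exceeds the cap of {MAX_VERBS}")
    for v in verbs:
        if v.band is not Band.WRITE_IRREVERSIBLE:
            continue
        if v.confirmed_by is None:
            problems.append(f"{v.name} is irreversible but has no confirmation step")
        elif v.confirmed_by not in names:
            problems.append(f"{v.name} is confirmed by an unknown verb: {v.confirmed_by}")
    return problems


# Hypothetical toolbelt for the triage job description in step 1.
TRIAGE = [
    Verb("read_message", Band.READ_ONLY),
    Verb("respond", Band.EXTERNAL),
    Verb("escalate", Band.WRITE_REVERSIBLE),
    Verb("propose_close", Band.WRITE_REVERSIBLE),
    Verb("close_ticket", Band.WRITE_IRREVERSIBLE, confirmed_by="propose_close"),
]

if __name__ == "__main__":
    print(check_toolbelt(TRIAGE) or "toolbelt passes")
```

Splitting the close into `propose_close` and `close_ticket` is one possible way to keep the decision and the execution in separate calls; a human approval step between them would serve the same purpose.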
The smell of it going wrong
- The toolbelt has more than ten tools before the first production deploy.
- Two tools have overlapping descriptions ("send_email" and "notify_user") such that neither description tells the model when to prefer one over the other.
- One tool does two unrelated things, joined by an "and" in its description. That "and" is a seam — split it.
- The team adds tools more often than they remove them. Toolbelts that only grow are closets.
- The agent misfires between two tools consistently, and the proposed fix is a model upgrade rather than a toolbelt redesign.
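Three of the smells above are static properties of the tool list, so a rough lint can catch them before deploy. This is a sketch only: the word-overlap (Jaccard) measure and the 0.6 threshold are assumptions standing in for a proper review of the descriptions, and the tool names and descriptions are made up for the example. The last two smells (tools only ever added, misfires answered with a model upgrade) are team habits and cannot be linted this way.

```python
import re
from itertools import combinations

MAX_TOOLS = 10           # from the first smell above
OVERLAP_THRESHOLD = 0.6  # assumption: tune against your own descriptions


def words(text: str) -> set[str]:
    """Lowercased set of alphabetic words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of words two descriptions have in common (0.0 to 1.0)."""
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def smells(tools: dict[str, str]) -> list[str]:
    """Flag an oversized belt, compound verbs, and overlapping descriptions."""
    found = []
    if len(tools) > MAX_TOOLS:
        found.append(f"too many tools: {len(tools)}")
    for name, description in tools.items():
        if "and" in words(description):
            found.append(f"compound verb: {name}")
    for (n1, d1), (n2, d2) in combinations(tools.items(), 2):
        if jaccard(d1, d2) >= OVERLAP_THRESHOLD:
            found.append(f"overlapping descriptions: {n1} / {n2}")
    return found


# Hypothetical descriptions that reproduce two of the smells.
EXAMPLE = {
    "send_email": "Send a message to the user.",
    "notify_user": "Send a notification message to the user.",
    "analyze_and_decide": "Run an analysis and commit a decision.",
    "read_message": "Fetch one incoming customer message by id.",
}

if __name__ == "__main__":
    for smell in smells(EXAMPLE):
        print(smell)
```

A flagged "and" is a prompt to look, not a verdict: some descriptions use the word innocently, and some overlapping tools share few words. The lint narrows where to read closely.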
A judgment call from real work
When the PL pl-judgment MCP server was first scoped, the natural-feeling pass produced nine tools. The job description was "help a PM evaluate and decide on product judgment calls." Nine felt right because the domain is genuinely varied.
Then we ran the five-step protocol above. Three tools turned out to be one: "add_to_watchlist," "flag_for_review," and "mark_uncertain" were all write operations with the same side-effect band (reversible write) and nearly identical descriptions. We merged them into a single verb, "defer_with_note," which takes a decision-type parameter distinguishing the three cases. The agent had been calling the wrong one of the three about a third of the time; after the merge, the misfire rate went to zero.
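A sketch of what the merged verb might look like follows. The actual pl-judgment signature is not shown in this lesson, so the parameter names (`item_id`, `kind`, `note`), the `DeferKind` values, and the `Deferral` record are assumptions made for illustration. The idea it demonstrates is that the choice among three near-identical tools becomes a choice among three values of one enumerated parameter.

```python
from dataclasses import dataclass
from enum import Enum


class DeferKind(Enum):
    """Hypothetical decision types, one per former tool."""
    WATCHLIST = "watchlist"  # formerly add_to_watchlist
    REVIEW = "review"        # formerly flag_for_review
    UNCERTAIN = "uncertain"  # formerly mark_uncertain


@dataclass
class Deferral:
    item_id: str
    kind: DeferKind
    note: str


def defer_with_note(item_id: str, kind: str, note: str) -> Deferral:
    """Single reversible write that replaces the three former tools.

    `kind` arrives as a plain string from the model; DeferKind(kind) raises
    ValueError for anything outside the three allowed values.
    """
    if not note.strip():
        raise ValueError("a deferral needs a note explaining why")
    return Deferral(item_id=item_id, kind=DeferKind(kind), note=note.strip())


if __name__ == "__main__":
    print(defer_with_note("D-17", "review", "Pricing data is two quarters old."))
```

With one verb, a wrong `kind` is a recoverable parameter error on a reversible write rather than a call to the wrong tool.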
One tool was split: "analyze_and_decide" was a compound verb — it ran an analysis and committed a decision in a single call. Splitting it into "analyze" (read-only, returns structured observations) and "record_decision" (write, requires the analysis as input) added a human review window between observation and commitment.
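The split can be sketched as two functions where the second cannot be called without the output of the first. The types and field names below (`Analysis`, `Decision`, `observations`, `choice`) are hypothetical; the lesson only states that "analyze" is read-only and returns structured observations, and that "record_decision" is a write that requires the analysis as input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """Structured observations returned by the read-only verb."""
    item_id: str
    observations: tuple[str, ...]


@dataclass(frozen=True)
class Decision:
    """The committed result; keeps a reference to the analysis it rests on."""
    item_id: str
    choice: str
    based_on: Analysis


def analyze(item_id: str, evidence: list[str]) -> Analysis:
    """Read-only: turns raw evidence into observations, commits nothing."""
    cleaned = tuple(e.strip() for e in evidence if e.strip())
    return Analysis(item_id=item_id, observations=cleaned)


def record_decision(analysis: Analysis, choice: str) -> Decision:
    """Write: refuses to run without a non-empty analysis as input."""
    if not isinstance(analysis, Analysis):
        raise TypeError("record_decision requires the output of analyze")
    if not analysis.observations:
        raise ValueError("cannot record a decision on an empty analysis")
    return Decision(item_id=analysis.item_id, choice=choice, based_on=analysis)


if __name__ == "__main__":
    result = analyze("D-17", ["Churn rose after the price change.", "  "])
    # The gap between these two calls is where a human can review `result`.
    print(record_decision(result, "roll back pricing"))
```

Because the analysis is a value that exists between the two calls, it can be shown to a reviewer, logged, or rejected before anything is written.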
The final five tools are sharp. Each one has a job description that the other four cannot fulfill. That is the test.
Rules from this lesson
- The toolbelt is a vocabulary; design it like one, not like an API integration list.
- Fewer, sharper verbs beat more, blunter ones at every model rung; the discipline pays off most on smaller models, where context windows are tighter and model uncertainty is higher.
- Irreversible verbs are paired with confirmation verbs; never decide and execute in the same call.