Write tool names and schemas the way you write API docs for a tired junior engineer on their first day — explicit, unambiguous, hard to misuse.
This is the cheapest reliability work you will do on an agent system. A model upgrade costs money, latency, and months of waiting. A toolbelt redesign costs an afternoon and frequently cuts the misfire rate in half. The reason teams reach for the model upgrade first is that it feels like a capability problem. It is not. The model is faithfully reading the vocabulary you gave it. If two tools sound alike, the model will confuse them, because you designed in a confusion point. The fix is not a smarter model — it is a less ambiguous vocabulary.
The picture
Consider two toolbelts serving the same user-lookup function. The bad toolbelt has three tools: getUser, findUser, and lookupUserById. The model must infer from context which of the three to call, and the descriptions are not different enough to guide that inference reliably. The good toolbelt has one tool: lookup_user(by: 'id' | 'email', value: string). One verb, two modes, explicit enum. The model never has to choose between three overlapping verbs.
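As a rough illustration, the consolidated tool could be declared and dispatched as in the sketch below. The name/description/input_schema layout, the description wording, and the in-memory user data are all assumptions made for the example, not a prescribed format.

```python
from typing import Optional

# Hypothetical tool definition. The layout is an assumption modeled on
# JSON Schema style tool-calling formats.
LOOKUP_USER_TOOL = {
    "name": "lookup_user",
    "description": (
        "Look up exactly one user record. Set 'by' to say which kind of "
        "identifier 'value' holds."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "by": {
                "type": "string",
                "enum": ["id", "email"],
                "description": "Which identifier 'value' contains: 'id' or 'email'.",
            },
            "value": {
                "type": "string",
                "description": "The identifier to search for, of the kind named in 'by'.",
            },
        },
        "required": ["by", "value"],
        "additionalProperties": False,
    },
}

# Toy in-memory data so the sketch runs on its own.
_USERS = [
    {"id": "u-1", "email": "ada@example.com"},
    {"id": "u-2", "email": "grace@example.com"},
]


def lookup_user(by: str, value: str) -> Optional[dict]:
    """Single entry point replacing getUser, findUser, and lookupUserById."""
    allowed = LOOKUP_USER_TOOL["input_schema"]["properties"]["by"]["enum"]
    if by not in allowed:
        raise ValueError(f"'by' must be one of {allowed}, got {by!r}")
    # Return the first matching record, or None when nothing matches.
    return next((user for user in _USERS if user[by] == value), None)
```

Because the enum is the only place the lookup mode is chosen, an invalid mode fails loudly at the boundary instead of silently routing to a near-duplicate tool.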
Now imagine you have ten tools that each have one or two naming issues of this kind. The misfire surface compounds. At five tools with clean schemas, the agent calls the right tool about 97% of the time on the test bench. At ten tools with messy naming, that rate can drop to 80%. That 17-point difference is not the model getting worse — it is the vocabulary getting worse. The model is doing exactly what it should: making a reasonable inference from ambiguous text.
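One way to put numbers like these on your own toolbelt is to log, for each test-bench case, which tool was expected and which tool the model actually called. The sketch below assumes that record shape; the sample records are made up for illustration and are not measurements.

```python
from collections import Counter


def selection_accuracy(records):
    """Fraction of bench cases where the called tool matched the expected one.

    records is a list of (expected_tool, called_tool) pairs.
    """
    if not records:
        return 0.0
    correct = sum(1 for expected, called in records if expected == called)
    return correct / len(records)


def confused_pairs(records):
    """Count misfires per unordered pair of tool names, most frequent first."""
    counts = Counter()
    for expected, called in records:
        if expected != called:
            # Sort so that A->B and B->A misfires land in the same bucket.
            counts[tuple(sorted((expected, called)))] += 1
    return counts.most_common()


# Illustrative records only.
records = [
    ("getUser", "getUser"),
    ("findUser", "getUser"),
    ("getUser", "findUser"),
    ("lookupUserById", "lookupUserById"),
]
print(selection_accuracy(records))  # 0.5
print(confused_pairs(records))      # [(('findUser', 'getUser'), 2)]
```

A pair that dominates the misfire count is a candidate for a rename or a merge before anyone considers a model change.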
Why it matters now
Naming and schema discipline has always mattered in software engineering — OpenAPI and JSON Schema both exist because the industry learned this the hard way over thirty years of API design. What changed in the agent era is that the consumer of your interface is now a language model, not a human developer. A human developer can read a confusing tool description and mentally resolve the ambiguity. A language model reads it and takes the text at face value. It cannot ask you what you meant.
The implication is that naming and schema discipline becomes more important, not less, as models get more capable. A highly capable model faithfully executes a confusing schema, which can produce confidently wrong tool calls. The failure is quieter and harder to catch than a model that simply says "I don't know which tool to use."
A source you should trust
Anthropic's tool-use cookbook has the most directly applicable guidance, especially the sections on naming conventions and parameter descriptions. The recommendations are grounded in observed tool-call patterns across production deployments, not theory.
OpenAPI and JSON Schema discipline represents thirty years of industry learning on how to write machine-readable interfaces that are also human-readable. The rule "every field has a type and a one-sentence description" predates LLMs and remains exactly right. Borrow this practice without modification.
A recipe
Four naming and schema rules that cover the majority of preventable misfires:
- One verb per tool, no conjunctions. A tool named get_and_update_user is two tools pretending to be one. Split it. The model must decide whether the action is a get or an update, and a tool name that contains both verbs gives it no guidance.
- Full words, no abbreviations. customer, not cust; identifier, not id, if the type is ambiguous. Abbreviations save characters; they cost inference reliability.
- Parameter descriptions are one full sentence, including when to set the value. "The user's email address (required); pass null if the email is not known at call time" is a parameter description. "user email" is a label. Labels require inference. Sentences constrain inference.
- Enum-typed parameters list every valid value with a one-phrase description of each. by: 'id' | 'email' with a description that says "use 'id' for database lookups, 'email' for identity verification flows" eliminates the class of errors where the model guesses the wrong enum value because the values are individually ambiguous.
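The four rules are mechanical enough to check before a tool ever reaches the model. Below is a minimal lint sketch under stated assumptions: tools are dicts with an input_schema.properties map, names use snake_case, and the conjunction list, abbreviation list, and four-word threshold are illustrative heuristics rather than established standards.

```python
# Assumed, illustrative word lists; extend them for a real toolbelt.
CONJUNCTIONS = {"and", "or"}
ABBREVIATIONS = {"cust": "customer", "usr": "user", "acct": "account"}
MIN_DESCRIPTION_WORDS = 4  # heuristic cutoff between a label and a sentence


def lint_tool(tool):
    """Return a list of human-readable problems found in one tool definition."""
    problems = []
    name = tool["name"]
    name_words = name.lower().split("_")
    properties = tool["input_schema"]["properties"]

    # Rule 1: one verb per tool, no conjunctions in the name.
    if CONJUNCTIONS & set(name_words):
        problems.append(f"{name}: name joins two actions; split the tool")

    # Rule 2: full words, no abbreviations, in tool and parameter names.
    all_words = list(name_words)
    for param in properties:
        all_words.extend(param.lower().split("_"))
    for word in all_words:
        if word in ABBREVIATIONS:
            problems.append(f"{name}: spell out '{word}' as '{ABBREVIATIONS[word]}'")

    for param, spec in properties.items():
        description = spec.get("description", "")
        # Rule 3: descriptions are sentences, not labels.
        if len(description.split()) < MIN_DESCRIPTION_WORDS:
            problems.append(f"{name}.{param}: description is a label, not a sentence")
        # Rule 4: every enum value is named (in quotes) in the description.
        for value in spec.get("enum", []):
            if f"'{value}'" not in description:
                problems.append(f"{name}.{param}: enum value '{value}' is not described")
    return problems
```

A check like this only catches the mechanical cases; whether a description actually tells the model when to set a value still needs a human read.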
The smell of it going wrong
- A pair of tools is consistently confused by the model across test runs. This is a naming problem, not a model problem.
- A parameter description is shorter than the parameter name. If the description is just the name restated, it adds nothing and the model will infer rather than read.
- Boolean flags carry implicit semantic meaning. force: true is a flag that means different things in different contexts and forces the model to infer what "force" means for this particular operation. Name the intent: skip_confirmation: true.
- Tool descriptions are auto-generated from the function docstring with no curation. Function docstrings are written for developers who know the codebase; tool descriptions are written for a model that does not.
- The team's debugging approach to misfire issues is to add more examples to the system prompt rather than to redesign the schema. Examples patch the symptom; schema redesign fixes the cause.
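Two of the smells above, descriptions no longer than the parameter name and vaguely named boolean flags, can be flagged statically. The sketch below assumes the same dict shape for tool definitions (an input_schema.properties map); the list of vague flag names is an illustrative guess and should be tuned per codebase.

```python
# Assumed list of flag names whose meaning depends on context.
VAGUE_FLAGS = {"force", "strict", "safe", "fast"}


def schema_smells(tool):
    """Return a list of smell messages for one tool definition."""
    smells = []
    for param, spec in tool["input_schema"]["properties"].items():
        description = spec.get("description", "").strip()
        # A description no longer than the name is at best the name restated.
        if len(description) <= len(param):
            smells.append(f"{param}: description is no longer than the name")
        # A boolean whose name hides its intent makes the model guess.
        if spec.get("type") == "boolean" and param in VAGUE_FLAGS:
            smells.append(f"{param}: boolean flag does not name its intent")
    return smells
```

The other smells in the list, such as uncurated docstrings or patching misfires with prompt examples, are process problems and need review rather than a script.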
A judgment call from real work
During the development of the PL pl-judgment MCP server, the initial schema had two tools that caused a consistent confusion pattern: evaluate_claim and assess_claim. The descriptions differed in nuance — evaluate was for claims with multiple dimensions, assess was for single-dimension pass/fail checks — but the model misfired between them at a rate that was unacceptable for a reliability-critical judgment system.
The fix required two moves. First, a rename: evaluate_claim became multi_dimension_evaluation and assess_claim became binary_assessment. The change made the model's selection criteria visible in the name itself. Second, a description rewrite: each tool's description was rewritten to include an explicit "do NOT use this tool if..." clause, naming the other tool as the right choice in that case. Cross-references in tool descriptions are unusual, but they work — they convert the selection problem from a single-tool read into a two-tool comparison, which is easier for the model to resolve.
The call-error rate for the pair dropped from approximately 23% to under 3% after the rename and description rewrite. No model changes. No prompt engineering. A vocabulary fix.
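The cross-reference pattern can be sketched as follows. The two tool names come from the rename described above; the description wording is illustrative and is not the server's actual text, and the helper function is a hypothetical check added for the example.

```python
# Tool names are from the rename above; the descriptions are illustrative.
TOOL_DESCRIPTIONS = {
    "multi_dimension_evaluation": (
        "Score a claim along several dimensions at once. Do NOT use this tool "
        "if the check is a single pass/fail question; call binary_assessment "
        "instead."
    ),
    "binary_assessment": (
        "Return pass or fail for a claim on exactly one dimension. Do NOT use "
        "this tool if the claim must be scored on several dimensions; call "
        "multi_dimension_evaluation instead."
    ),
}


def missing_cross_references(descriptions):
    """Return (tool, other) pairs where tool's description never names other.

    Intended for a small group of known-confusable tools, not a whole toolbelt.
    """
    missing = []
    for name, description in descriptions.items():
        for other in descriptions:
            if other != name and other not in description:
                missing.append((name, other))
    return missing
```

Running a check like this in CI over each known-confusable group keeps the cross-references from silently disappearing when someone later edits one description.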
Rules from this lesson
- Tool names are read by the model; write them as documentation for an intelligent reader who has no other context, not as function names for a compiler.
- Confusable pairs are guaranteed to misfire at some rate; rename or merge them at design time before the rate becomes a production problem.
- Schema descriptions are full sentences; one-word labels are a smell that will produce inference-dependent behavior in the field.