Capstone — design the toolbelt and memory layer for your agent — Tool & Memory Design — when the agent's effective IQ depends on its toolbelt

Design the cage before you build the bird.

The output of this course is a design artifact, not a summary of concepts. Each of the preceding nine lessons has described a design discipline: vocabulary over abundance, distribution versus capability, naming as reliability, four nouns for memory, context versus persistence, three indexes not one, write rules before schemas, working examples over theory, failure-mode diagnosis over model upgrades. The capstone is where those disciplines produce a concrete deliverable for a real system on your roadmap.

The centerpiece of that artifact is a design memo, not a technical specification. It is written in plain prose for a leadership audience. The engineering team will build from it; the memo is how you and leadership align on what the agent is, what it can and cannot do, and what failure looks like before anyone has spent a day writing code.

Most teams skip this. They go directly from "we'll add an agent" to the first sprint of implementation. The cost is three to five months of rework when the toolbelt turns out to be wrong, the memory collapses into a single regime, and the failure-mode plan is "we'll see what happens." Producing the memo and the artifacts behind it takes two and a half hours. The rework costs much more.

The brief

Pick one agent system you are scoping, either active or planned. Produce five artifacts — they build on each other, so produce them in order.

Artifact 1: The toolbelt design. Apply Lessons 1 through 3. Name the agent's job in one sentence. List the verbs, capped at seven to start. Assign a side-effect band to each. Flag every irreversible verb and name its paired confirmation step. Write a one-sentence schema description for every parameter in every tool. If any two tools are confusable, rename or merge before you continue.
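The checks in Artifact 1 are mechanical enough to sketch in code. The following is a minimal illustration, not a prescribed format: the Tool record, the band labels ("read", "write", "irreversible"), and the use of name similarity as a stand-in for confusability are all assumptions made for the example. Real confusability depends on descriptions and usage, so treat the similarity check as a first-pass prompt for a human audit.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

MAX_VERBS = 7  # the starting cap named in Artifact 1


@dataclass
class Tool:
    name: str                                # the verb the agent sees
    side_effect_band: str                    # assumed labels: "read", "write", "irreversible"
    params: dict                             # parameter name -> one-sentence description
    confirmation_step: Optional[str] = None  # paired step for an irreversible verb


def audit_toolbelt(tools, similarity_threshold=0.8):
    """Return a list of human-readable problems found in the toolbelt."""
    problems = []
    if len(tools) > MAX_VERBS:
        problems.append(f"{len(tools)} tools exceeds the cap of {MAX_VERBS}")
    for tool in tools:
        if tool.side_effect_band == "irreversible" and not tool.confirmation_step:
            problems.append(f"{tool.name}: irreversible with no confirmation step")
        for param, description in tool.params.items():
            if not description.strip():
                problems.append(f"{tool.name}.{param}: missing schema description")
    # Crude proxy for confusability: flag pairs whose names are nearly identical.
    for i, first in enumerate(tools):
        for second in tools[i + 1:]:
            ratio = SequenceMatcher(None, first.name, second.name).ratio()
            if ratio >= similarity_threshold:
                problems.append(f"{first.name} / {second.name}: confusable names")
    return problems
```

An empty result does not mean the toolbelt is right; it only means none of the listed smells were detected by these particular checks.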

Artifact 2: The memory map. Apply Lessons 4 through 7. List every fact the agent needs to remember. Label each by regime: working, episodic, semantic, procedural. For each, write the write rule (who writes, when, automatically or with confirmation), the read rule (when retrieved and from what trigger), and the decay policy (expires, gets superseded, or persists). Note which facts are candidates for the semantic index, which for the episodic index, which for lookup.
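One way to keep the memory map honest is to record each fact as a row with the fields Artifact 2 asks for, then reject rows that leave a field blank or use a label outside the four regimes. The sketch below assumes a hypothetical MemoryEntry record; the field names and the free-text rules are illustrative choices, not a required schema.

```python
from dataclasses import dataclass
from typing import Optional

REGIMES = {"working", "episodic", "semantic", "procedural"}
DECAY_POLICIES = {"expires", "superseded", "persists"}
INDEXES = {"semantic", "episodic", "lookup"}


@dataclass(frozen=True)
class MemoryEntry:
    fact: str
    regime: str                  # one of REGIMES
    write_rule: str              # who writes, when, automatic or confirmed
    read_rule: str               # when it is retrieved and from what trigger
    decay: str                   # one of DECAY_POLICIES
    index: Optional[str] = None  # None means the fact never enters a retrieval index


def validate_entry(entry):
    """Return the problems found in one row of the memory map."""
    problems = []
    if entry.regime not in REGIMES:
        problems.append(f"{entry.fact}: unknown regime '{entry.regime}'")
    if not entry.write_rule.strip():
        problems.append(f"{entry.fact}: no write rule")
    if not entry.read_rule.strip():
        problems.append(f"{entry.fact}: no read rule")
    if entry.decay not in DECAY_POLICIES:
        problems.append(f"{entry.fact}: unknown decay policy '{entry.decay}'")
    if entry.index is not None and entry.index not in INDEXES:
        problems.append(f"{entry.fact}: unknown index '{entry.index}'")
    return problems
```

A row labeled with the bare word "memory" fails the regime check, which is the point: the map should force a choice among the four regimes.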

Artifact 3: The retrieval plan. For each memory that will live in a retrieval index, name the index (semantic, episodic, or lookup), state the quality metric, and name the refresh rate. If two different memory regimes are going into the same index, flag this and explain why.
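The retrieval plan reduces to one record per index plus a rule: an index that holds more than one regime needs a written reason. A minimal sketch follows, assuming a hypothetical IndexPlan record in which the metric and refresh rate are free text (for example "recall@5 on a labeled query set", which is an illustrative metric and not one the course prescribes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexPlan:
    name: str                # "semantic", "episodic", or "lookup"
    regimes: tuple           # memory regimes stored in this index
    quality_metric: str      # how retrieval quality will be judged
    refresh_rate: str        # how often the index is rebuilt or updated
    shared_reason: str = ""  # required when more than one regime shares the index


def review_retrieval_plan(plans):
    """Return one message per missing metric, refresh rate, or sharing reason."""
    problems = []
    for plan in plans:
        if not plan.quality_metric.strip():
            problems.append(f"{plan.name}: no quality metric")
        if not plan.refresh_rate.strip():
            problems.append(f"{plan.name}: no refresh rate")
        distinct = sorted(set(plan.regimes))
        if len(distinct) > 1 and not plan.shared_reason.strip():
            problems.append(f"{plan.name}: holds {distinct} with no stated reason")
    return problems
```

The check does not forbid a shared index. It only insists that the choice is written down, so a single-index design passes once its reason is recorded.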

Artifact 4: The failure-mode plan. For each of the three failure modes — forgetting, repeating, hallucinating — name the most likely instance in your specific system (what fact is most likely to be forgotten? what pattern is most likely to repeat? where does the model have enough training coverage to confabulate without retrieval?). For each, name the telemetry you will ship on day one to catch it, and the response protocol when it is detected.
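A failure-mode plan is complete only when each of the three modes has a named instance, day-one telemetry, and a response protocol. The sketch below shows one way to check that; the FailureModePlan record and its field names are assumptions for illustration, and the telemetry is described in free text, not wired to any real logging system.

```python
from dataclasses import dataclass

FAILURE_MODES = ("forgetting", "repeating", "hallucinating")


@dataclass(frozen=True)
class FailureModePlan:
    mode: str               # one of FAILURE_MODES
    likely_instance: str    # the most likely instance in this specific system
    telemetry: str          # what ships on day one to catch it
    response_protocol: str  # what happens when it is detected


def uncovered_modes(plans):
    """Return the failure modes that lack an instance, telemetry, or a response."""
    covered = {
        plan.mode
        for plan in plans
        if plan.likely_instance.strip()
        and plan.telemetry.strip()
        and plan.response_protocol.strip()
    }
    return [mode for mode in FAILURE_MODES if mode not in covered]
```

A mode with an instance and a response but no telemetry still counts as uncovered, because detection is the step that makes the other two usable.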

Artifact 5: The one-page memo. Plain prose, leadership audience, no jargon without a gloss. Covers: what the agent does, what its vocabulary is, what it remembers and how, what failure looks like and how you catch it. If the memo is longer than two pages, the design is not clear enough yet.

The picture

The published capstone will show one anonymized PL or Ostronaut system walked end to end — the five-artifact package produced before the first implementation sprint. The canonical example is the pl-judgment MCP server design: a five-verb toolbelt, a procedural-memory split, a single-index retrieval choice with reasoning, and a failure-mode plan with telemetry pointers.

What makes it instructive is not that it is sophisticated. It is that it is legible. Every decision traces to a lesson. Every constraint is named. Every failure mode has a telemetry hook. When the first production anomaly appeared — a confusable pair misfiring under load — the debug path was twenty minutes, because the toolbelt design document named the pair and the telemetry was already logging call selection.

Why it matters now

The design memo is the artifact that makes agentic system development reviewable. Code can be reviewed; code in the absence of a design memo can only be reviewed for correctness, not for whether it is building the right thing. A toolbelt with twelve tools and no schema discipline looks like code. A toolbelt design memo that shows the original twelve, the three that were collapsed, the one that was split, and the resulting five with schemas is legible to anyone who has read the preceding lessons.

Agentic systems are increasingly the surface where product strategy becomes user behavior. The toolbelt is not an implementation detail — it is the grammar of what the agent is allowed to do. The memory layer is not a feature — it is the mechanism by which the product learns from every interaction. Both deserve design documentation before they deserve code.

A source you should trust

The preceding nine lessons are the source for this capstone. Every artifact maps to at least one lesson; if an artifact does not reference a specific lesson's discipline, you probably skipped a step.

AI Manual Chapter 6 (tool use, function calling, agents) provides the practitioner grounding for the toolbelt design if you need to go deeper on the agentic loop itself.

AI Manual Chapter 7 (RAG, fine-tune, or context window) is the upstream decision tree for the retrieval plan. Before you design the three indexes, decide whether retrieval is the right answer for each memory regime or whether fine-tuning or context expansion serves better.

A recipe

A two-and-a-half-hour working session. Do not compress it — the value is in the friction at each step, not in producing an artifact quickly.

Pick the system. (10 minutes.) Write the job description in one sentence. If you cannot, do not start the toolbelt design yet. Go back and scope the agent.
Toolbelt design. (30 minutes.) One-sentence job description. Verb list capped at seven. Side-effect bands. Irreversible verb flags. Schema descriptions. Confusable pair audit.
Memory map and write rules. (30 minutes.) Full fact inventory. Type labels. Write rules, read rules, decay policies. Flag candidates for each index type.
Retrieval plan. (20 minutes.) Index assignments. Quality metrics. Refresh rates. Flags for any two-regime-one-index decisions.
Failure-mode plan. (20 minutes.) Named instance of each mode in your system. Telemetry for each. Response protocol for each.
One-page memo. (30 minutes.) Write it last, after the artifacts are complete. It should be easier to write than you expect, because the artifacts contain the logic and the memo is the synthesis.
Sleep on it. Revisit the next day. The overnight gap almost always surfaces one toolbelt decision that needs revision and one memory type that was labeled wrong.
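As a quick check on the time budget, the timed steps above can be totaled against the two-and-a-half-hour session. The step names and minutes are taken from the recipe; the variable names are only for this sketch, and the untimed overnight revisit is left out.

```python
SESSION_MINUTES = 150  # two and a half hours

# Minutes per timed step, in the order the recipe gives them.
steps = {
    "Pick the system": 10,
    "Toolbelt design": 30,
    "Memory map and write rules": 30,
    "Retrieval plan": 20,
    "Failure-mode plan": 20,
    "One-page memo": 30,
}

allocated = sum(steps.values())      # 140 minutes across the six timed steps
slack = SESSION_MINUTES - allocated  # 10 minutes left over in the session
```

The timed steps account for 140 of the 150 minutes, which leaves about ten minutes of slack for the step that runs long.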

The smell of it going wrong

The toolbelt has more than seven tools or any irreversible action without a confirmation step. Cap and add confirmation before continuing.
The memory map uses "memory" as a single label for more than one regime. Name each one.
The retrieval plan uses one index for everything. If that is a deliberate choice — the system is simple enough — write down why. If it is not a deliberate choice, separate the indexes.
The failure-mode plan is missing telemetry for any of the three modes. Telemetry is not a post-ship concern; it is a day-one concern.
The memo is more than one page. If it is two pages, the design is probably solid but the writing needs compression. If it is four pages, the design has unresolved ambiguities that the prose is papering over.

A judgment call from real work

The pl-judgment MCP server design was scoped using this five-artifact protocol, and the process produced one outcome that was not in the original plan.

The toolbelt design surfaced a gap in the memory design: the system needed to remember how a particular PM tends to reason about tradeoffs, not as a user fact (who they are) but as a procedural fact (how to engage their reasoning specifically). This is procedural memory. It was missing from the original memory map because that map was organized around what the system needs to know, not around how the system needs to behave. That distinction is the point of Lesson 4: procedural memory is different from semantic memory, and forgetting the distinction means you design a facts store when you needed a behavior store.

The procedural memory split — user facts in a semantic index, behavioral patterns in a procedural layer with a different write protocol — did not come from the first pass of the design. It came from auditing the memory map against the taxonomy, noticing that two items with different decay policies and different write rules had been labeled with the same type, and correcting before implementation.
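The audit described above can be approximated mechanically: group the memory map by type label and flag any label whose items disagree on write rule or decay policy. The sketch below is a heuristic for surfacing review candidates, not the procedure actually used on the pl-judgment design; the MapItem record and the sample rules in the usage note are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MapItem:
    fact: str
    regime: str      # the type label currently assigned in the memory map
    write_rule: str
    decay: str


def mislabel_candidates(items):
    """Return labels whose items do not share one write rule and decay policy."""
    by_regime = defaultdict(list)
    for item in items:
        by_regime[item.regime].append(item)
    flagged = {}
    for regime, group in by_regime.items():
        profiles = {(item.write_rule, item.decay) for item in group}
        if len(profiles) > 1:  # same label, different behavior: review it
            flagged[regime] = [item.fact for item in group]
    return flagged
```

For example, if "who the PM is" and "how the PM weighs tradeoffs" are both labeled semantic but one is confirmed by the user and superseded on change while the other is inferred over sessions and persists, the semantic label is flagged with both facts. A flag is a prompt to re-check the taxonomy; two items in one regime can legitimately differ.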

That is the value of the design memo: the correction happened in two and a half hours of design time rather than after three months of wondering why the system never seemed to get better at engaging a specific user's reasoning style.

Rules from this lesson

The five-artifact design package is the course output — produce it in order, before the first implementation sprint, not after.
Every artifact should reference at least one of the preceding nine lessons; if it does not, a design discipline was skipped.
The memo is the deliverable that leadership reads and that makes the design reviewable; budget the writing time as seriously as the artifact time.

Design the cage before you build the bird.
