Capstone -- Write the Model-Selection Memo and the Eval Doc for a Real Feature

This course ends the right way: not with a quiz, but with two documents you can carry into a real product review.

If the earlier lessons did their job, you should now have a different instinct. You should be less impressed by model theater, more disciplined about the user bar, more alert to unit economics, and far less willing to ship without an eval set. The capstone is where that instinct becomes operating practice.

Your task is to pick one real feature and write two artifacts:

One, a model-selection memo.

Two, a golden-eval document.

Pick something real. Not "AI assistant" in the abstract. Not "customer support chatbot" as a category. Pick a specific user job inside a product you own, are planning, or can describe with enough seriousness that someone could challenge your reasoning. The best capstones are narrow. "Summarize customer feedback themes for the PM after 50 interviews." "Draft the first response to common refund questions in the payments app." "Suggest the next code edit for a developer inside a repository-aware coding assistant." Narrow is what makes the artifacts honest.

Before writing anything, run the feature through the screen from When AI Is the Right Answer. You should be able to state clearly why the job wants AI rather than a deterministic workflow. Then use The Model-Selection Ladder to force the real decision: what is the smallest model that could plausibly clear the bar, and what evidence would justify climbing?

The model-selection memo should fit on one page if written tightly. If it becomes five pages, you are usually hiding uncertainty behind volume.

The memo needs seven sections.

First, the feature and the user job. One paragraph. Who is the user, what are they trying to get done, and where in the workflow does this AI behavior sit?

Second, why AI is appropriate here. One paragraph. Explain why the input or output is judgment-shaped enough to deserve a model. If the feature barely passes this test, say so. Ambivalence written honestly is better than fake certainty.

Third, the bar. Write the product threshold in user terms. What does good enough mean? What errors are unacceptable? What kinds of failure are tolerable if the value is high enough?

Fourth, the candidate ladder. List the rungs you considered -- mini, workhorse, frontier, or specific providers if relevant -- and the reason each is in scope.

Fifth, the decision. Which rung is the starting default, and why? If you are using routing or escalation, say so plainly. "Mini by default, workhorse on low-confidence cases" is a decision. "We may route dynamically" is a dodge.

Sixth, the economics. What do you believe one useful outcome costs on this path? What tier of user is this intended for? What assumptions would make the economics fail?

Seventh, the reversibility note. What are you locked into, what is the fallback path, and what would trigger a serious re-evaluation?

That is the whole memo. If it cannot survive being this short, your thinking is probably not sharp enough yet.
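The economics section is easier to challenge when the arithmetic is written down. What follows is a minimal sketch, not a pricing model: it assumes a "mini by default, workhorse on low-confidence cases" path, and every name and number in it (Rung, cost_per_call, success_rate, escalation_rate, the dollar figures) is a hypothetical placeholder you would replace with your own estimates. It also assumes that escalated requests pay for both calls and that each rung's success rate holds for the traffic it actually serves.

```python
from dataclasses import dataclass


@dataclass
class Rung:
    name: str
    cost_per_call: float  # assumed blended cost of one call, in dollars
    success_rate: float   # assumed share of calls that produce a useful outcome


def cost_per_useful_outcome(default: Rung, escalated: Rung, escalation_rate: float) -> float:
    """Expected spend per request divided by expected useful outcomes per request."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    # Every request pays for the default call; escalated requests pay for both calls.
    expected_cost = default.cost_per_call + escalation_rate * escalated.cost_per_call
    # Requests that stay on the default rung succeed at its rate;
    # escalated requests succeed at the escalated rung's rate.
    expected_success = (
        (1 - escalation_rate) * default.success_rate
        + escalation_rate * escalated.success_rate
    )
    return expected_cost / expected_success


# Made-up illustrative numbers, not real provider pricing.
mini = Rung("mini", cost_per_call=0.002, success_rate=0.80)
workhorse = Rung("workhorse", cost_per_call=0.02, success_rate=0.95)

for rate in (0.0, 0.25, 0.5):
    cost = cost_per_useful_outcome(mini, workhorse, rate)
    print(f"escalation {rate:.0%}: ${cost:.4f} per useful outcome")
```

Sweeping the escalation rate like this is one way to answer "what assumptions would make the economics fail": the rate at which the cost per useful outcome crosses what the user tier can support is the number to put in the memo.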

Now the second artifact: the golden-eval document.

This is where most teams get exposed, because it forces them to stop talking about quality in slogans and start defining it row by row.

The eval document should have five sections.

First, the eval objective. One paragraph on what the set is trying to prove. Not "the model is good." Something specific like "the assistant can draft a correct first response for the top refund-policy questions while abstaining on exception cases."

Second, the buckets. Name the categories that matter: core wins, edge ambiguity, abstain cases, adversarial cases, high-value segment cases, and multilingual cases if relevant. The buckets matter because they tell reviewers how to read pass rates.

Third, the first twelve rows. This is the sharp set from lesson four. For each row, write the input, the category, and the expected properties of a good output. Again, avoid vague criteria. Use concrete properties like "must mention the refund window," "must not invent exceptions," "must route to human review," or "must preserve the action items and dates."

Fourth, the gating rule. What threshold must this feature clear before launch, and what kinds of failures are blocking regardless of overall score? This is where Eval Before Launch becomes a real product discipline instead of a reading assignment. If you cannot write the gating rule in advance, you are planning to negotiate with the result later.

Fifth, the regression plan. What changes trigger a rerun? Prompt changes, model swaps, retrieval changes, tool changes, policy updates, context-window adjustments. List them. Make it impossible for a future teammate to pretend they did not know this counted as a deploy.
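One way to make the regression plan hard to ignore is to fingerprint everything that counts as a deploy and compare it with the fingerprint recorded at the last eval run. The sketch below assumes the triggers listed above can be captured in a single configuration dictionary; the key names and values (prompt, model, retrieval, tools, policy_version, context_window) are hypothetical and would need to match however your team actually stores these settings.

```python
import hashlib
import json


def eval_fingerprint(config: dict) -> str:
    """Stable hash over every setting that should trigger an eval rerun."""
    # sort_keys makes the hash independent of the order keys were written in.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_rerun(current_config: dict, last_evaluated_fingerprint: str) -> bool:
    """True when anything in the config changed since the last recorded eval run."""
    return eval_fingerprint(current_config) != last_evaluated_fingerprint


# Hypothetical config covering the triggers named in the regression plan.
config = {
    "prompt": "refund-first-response-v3",
    "model": "mini",
    "retrieval": {"index": "refund-policy", "top_k": 5},
    "tools": ["order_lookup"],
    "policy_version": "2024-06",
    "context_window": 8000,
}

recorded = eval_fingerprint(config)           # stored alongside the eval results
changed = dict(config, model="workhorse")     # a model swap

print(needs_rerun(config, recorded))   # nothing changed since the last run
print(needs_rerun(changed, recorded))  # the model swap counts as a deploy
```

The design choice here is deliberate bluntness: any change to any listed field forces a rerun, so nobody has to argue about whether a particular prompt tweak was "big enough" to count.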

Your capstone improves dramatically if you include one "recent nightmare" row even if the feature has not launched yet. What is the case that would create the worst executive or support pain if the system got it wrong? Put that case in the set early. Sharp rows create honest teams.
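The rows and the gating rule described above can be written down in a form a script can check. This is a minimal sketch under loud assumptions: EvalRow, gate, the example inputs, the "30 days" refund window, and the substring checks are all hypothetical illustrations. In practice the property checks would usually be human review or a more careful grader, not string matching. The nightmare row is marked blocking, so it fails the gate regardless of the overall score.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalRow:
    input: str
    bucket: str                          # e.g. "core win", "abstain", "adversarial"
    checks: List[Callable[[str], bool]]  # expected properties of a good output
    blocking: bool = False               # a failure here blocks launch on its own


def row_passes(row: EvalRow, output: str) -> bool:
    """A row passes only if every expected property holds."""
    return all(check(output) for check in row.checks)


def gate(rows: List[EvalRow], outputs: List[str], threshold: float) -> dict:
    """Apply a gating rule written in advance: overall threshold plus blocking rows."""
    if len(rows) != len(outputs) or not rows:
        raise ValueError("need exactly one output per row, and at least one row")
    results = [row_passes(row, out) for row, out in zip(rows, outputs)]
    pass_rate = sum(results) / len(results)
    blocking_failures = [
        row.input for row, ok in zip(rows, results) if row.blocking and not ok
    ]
    return {
        "pass_rate": pass_rate,
        "blocking_failures": blocking_failures,
        "launch": pass_rate >= threshold and not blocking_failures,
    }


# Two hypothetical rows; a real first set would have twelve.
rows = [
    EvalRow(
        input="Can I get a refund on an order from last week?",
        bucket="core win",
        checks=[lambda out: "30 days" in out],  # must mention the refund window
    ),
    EvalRow(
        input="I was charged twice after my account was hacked. Refund everything now.",
        bucket="abstain (recent nightmare)",
        checks=[lambda out: "human review" in out.lower()],  # must route to human review
        blocking=True,
    ),
]

outputs = [
    "Yes, refunds are available within 30 days of purchase.",
    "I have refunded both charges to your card.",  # invents an action instead of abstaining
]

print(gate(rows, outputs, threshold=0.5))
```

With these sample outputs the overall pass rate meets the threshold, yet the gate still refuses launch because the blocking row failed. That is the behavior a gating rule written in advance is meant to guarantee.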

There are a few common ways people fail this capstone.

One is writing a memo that is really a vendor justification memo rather than a product memo. If the document reads like "we chose Provider X because it is state of the art," you have missed the point. The memo should be about the user bar and the economics first, provider identity second.

A second is writing an eval set made of easy cases because those are easier to label. That gives false confidence and guarantees a worse product. The first rows should be uncomfortable. If every example looks like the happy path, you have built a demo set, not an eval set.

A third failure mode is skipping the abstain cases. Teams love testing whether the model can answer. Mature teams spend equal time testing whether it knows when not to answer. This matters even more if your product is closer in spirit to Harvey's legal AI than to GitHub Copilot. Cheap wrong answers and expensive wrong answers should not share a philosophy.

A fourth is refusing to write the economics because "we do not know usage yet." Forecasting uncertainty is fine. Pretending economics do not exist until launch is not. Even rough ranges force the right conversation earlier. That is the point.

One last habit will make your capstone materially better: have another person attack both documents. Ask them where the memo is hand-wavy, where the eval rows are too easy, where the product bar is vague, and where the fallback story sounds fake. Good AI product decisions survive hostile reading better than warm-room consensus.

If you want a clean review checklist for your own capstone, use this one.

Can a skeptical PM tell what user job this feature serves?

Can an engineer tell which rung should be tested first and why?

Can finance or a founder see the margin risk clearly enough to challenge it?

Can support or domain experts see whether the worst cases are represented?

Can the team rerun the eval after a prompt or model change without inventing a new process?

If the answer to any of those is no, tighten the artifacts.

The point of this course was never to make you sound informed about AI. Too many people already do. The point was to make you harder to fool -- by vendors, by benchmarks, by your own demo, and by the seductive idea that intelligence alone creates product value.

When you finish this capstone well, you will have something rare: not just a point of view, but a repeatable operating standard. That is what strong AI product teams actually need.

Rules from this lesson

Pick a narrow real feature for the capstone. Broad AI concepts produce vague artifacts and vague artifacts produce bad decisions.
The model-selection memo should be short enough to force clarity: user job, bar, starting rung, economics, and reversibility.
The eval document should begin with twelve sharp rows that pressure the feature rather than flatter it.
Write launch thresholds and regression triggers in advance. If you leave them unwritten, the team will negotiate with the result later.
A good capstone survives hostile reading from product, engineering, and business stakeholders alike.
