The Model-Selection Ladder — the pm manual

Most teams pay frontier prices for mini-model work — and the gap is bigger than they think. The cost of that habit is not the bill; it is the price point you will never reach.

Talvinder Singh

There is a slide that shows up in every AI product review I sit in. It compares four models. The biggest one wins by two percentage points on whatever benchmark the team chose. The slide ends with a recommendation to use the biggest model. The cost column, when it appears at all, is in a smaller font. Six months later the team is on a call with finance trying to explain why their inference bill is eating the gross margin of the feature.

This chapter assumes the previous one, When AI Is the Right Answer (and When It Isn't), is behind you and the verdict was yes: AI is the right answer. The next decision is which rung of the ladder you start on. Get this wrong and you will not notice for a quarter; you will pay for it for the entire life of the product.

The ladder, in 2026

Three rungs. The names change every nine months; the structure does not.

Rung 1 — the mini tier. Claude Haiku 4.5. GPT-5 mini and GPT-5 nano. Gemini 2.5 Flash and Flash-Lite. Open-weights models like Llama 3.3-70B served via Groq or Fireworks. Cheap enough to be wasteful with, fast enough that latency rarely gates you, and capable enough to clear the bar on a surprising fraction of real product tasks — classification, extraction, routing, drafting, summarising bounded text, single-step tool calls.

Rung 2 — the workhorse tier. Claude Sonnet 4.6. GPT-5 (standard tier). Gemini 2.5 Pro. The models that win most production deployments when rung 1 doesn't quite clear the bar — multi-step reasoning over messy input, long-document synthesis, tool-use chains of three to seven steps, agentic loops where you can't afford to be wrong twice.

Rung 3 — the frontier tier. Claude Opus 4.7. GPT-5 with high reasoning effort. Reasoning-heavy variants. The work the other two can't do — research-grade analysis, novel problem decomposition, long-horizon agents, code-generation at the limit. 10× to 100× more per call than rung 1 and slow enough that you will notice.

The naive read is "use the cheapest one that works." Correct but useless: it leaves open what "works" means. The useful read: start at rung 1, write an eval, climb only when the eval says you must. Chapter 4 (Eval Before Launch) makes this operational.

The price spread is the whole story

The reason this ladder matters is the size of the gap between rungs. The per-million-token rates below are pulled from each provider's published pricing page in May 2026 (anthropic.com/pricing, openai.com/api/pricing, ai.google.dev/pricing). These are list prices, not negotiated enterprise rates.

Model	Input $/M tok	Output $/M tok	Rough ratio vs Haiku-class
Claude Haiku 4.5	~$1	~$5	1×
Gemini 2.5 Flash	~$0.30	~$2.50	0.5×
GPT-5 mini	~$0.25	~$2	0.4×
Claude Sonnet 4.6	~$3	~$15	3×
GPT-5	~$1.25	~$10	~1.5×
Gemini 2.5 Pro	~$1.25	~$10	~1.5×
Claude Opus 4.7	~$15	~$75	15×
GPT-5 (high reasoning)	~$10–$15 effective	~$40–$60 effective	~8–12×

(Numbers are rounded; treat them as rung-level estimates. Re-check live pricing before you sign anything; the labs adjust quarterly. Output tokens cost 4–8× the input rate in the table above, which matters more than people think for generation-heavy workloads.)

Within a single provider, the gap between rung 1 and rung 3 is roughly 15× on both input and output (Haiku 4.5 to Opus 4.7 in the table above). Mix providers, with Gemini Flash at the bottom and Opus 4.7 at the top, and the gap is 50× on input and 30× on output.

Every model-selection conversation in your team is downstream of this number. A feature that costs 0.4¢ per call at rung 1 costs 20¢ at rung 3. Multiply by 100,000 calls a day. The frontier-default habit turns a $400/day bill into a $20,000/day bill — at which point the feature has to clear a different bar to survive a board review.
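The ratios and the daily-bill arithmetic above can be checked in a few lines. This is a minimal sketch, assuming the rounded list prices from the table; the dictionary keys and function names are hypothetical labels, not provider model IDs.

```python
# Rounded list prices from the table above: (input, output) in USD per million tokens.
# The keys are illustrative labels, not official model identifiers.
PRICES = {
    "claude-haiku-4.5": (1.00, 5.00),
    "gemini-2.5-flash": (0.30, 2.50),
    "claude-opus-4.7": (15.00, 75.00),
}


def price_gap(cheap: str, expensive: str) -> tuple:
    """Return (input ratio, output ratio) between two models."""
    cheap_in, cheap_out = PRICES[cheap]
    exp_in, exp_out = PRICES[expensive]
    return exp_in / cheap_in, exp_out / cheap_out


def daily_bill(cost_per_call_usd: float, calls_per_day: int) -> float:
    """Daily spend for a feature at a given per-call cost."""
    return cost_per_call_usd * calls_per_day


same_provider = price_gap("claude-haiku-4.5", "claude-opus-4.7")
mixed_providers = price_gap("gemini-2.5-flash", "claude-opus-4.7")
print("same provider:", [round(r, 1) for r in same_provider])
print("mixed providers:", [round(r, 1) for r in mixed_providers])

# 0.4 cents vs 20 cents per call at 100,000 calls a day.
print(round(daily_bill(0.004, 100_000)), round(daily_bill(0.20, 100_000)))
```

Swap in your own per-call costs and traffic before drawing conclusions; the point of the sketch is that the ratio, not the absolute price, drives the decision.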

Cost per error, not cost per token

The counter-argument: "the bigger model costs 15× more per call, but makes half as many mistakes, and each mistake costs me a support ticket. Net I save money." Sometimes this is true. Often it is theatre. The way to tell is to measure cost per error alongside cost per token.

Cost per error, as used here, is the model bill plus the clean-up bill over the same set of calls: (calls × cost per call) + (errors × cost of handling one error).

Worked example. Invoice classification. Rung 1: 92% accuracy at 0.1¢ per call. Rung 3: 98% accuracy at 2¢ per call. Each error costs your support team ₹150 (~$1.80) to clean up.

Rung 1: (100 × $0.001) + (8 × $1.80) = $14.50 per 100 calls.
Rung 3: (100 × $0.02) + (2 × $1.80) = $5.60 per 100 calls.

Frontier wins. The error-handling cost swamped the 20× gap in model price.

Flip it. You're summarising help-centre articles for an internal search feature. A bad summary costs you nothing — the user rephrases. Error cost ≈ 0, so price-per-token is the only term that survives. Rung 1 wins by 15×.

The rule: cost per error is what matters; cost per token is what's on the invoice. They are not the same. Compute the one that matters before defaulting to the one that's easier to read.
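The invoice arithmetic can be reproduced, and extended to a break-even point, with a short sketch. The function names are hypothetical, and the break-even calculation is derived here from the same figures rather than taken from the text.

```python
def cost_per_100_calls(accuracy: float, cost_per_call_usd: float,
                       error_handling_usd: float, calls: int = 100) -> float:
    """Model spend plus error clean-up for a batch of calls."""
    expected_errors = (1.0 - accuracy) * calls
    return calls * cost_per_call_usd + expected_errors * error_handling_usd


def break_even_error_cost(acc_cheap: float, cost_cheap: float,
                          acc_strong: float, cost_strong: float) -> float:
    """Error-handling cost above which the stronger model is cheaper overall.

    Derived by setting the two per-call totals equal:
    cost_cheap + (1 - acc_cheap) * E == cost_strong + (1 - acc_strong) * E
    """
    return (cost_strong - cost_cheap) / (acc_strong - acc_cheap)


# Figures from the invoice-classification example.
rung1 = cost_per_100_calls(accuracy=0.92, cost_per_call_usd=0.001, error_handling_usd=1.80)
rung3 = cost_per_100_calls(accuracy=0.98, cost_per_call_usd=0.02, error_handling_usd=1.80)
print(round(rung1, 2), round(rung3, 2))

# With these accuracies and prices, rung 3 pays for itself once an error
# costs more than roughly 32 cents to handle.
print(round(break_even_error_cost(0.92, 0.001, 0.98, 0.02), 3))
```

The break-even figure is the useful number to bring to the conversation: if a mistake costs less than that to handle, stay on rung 1.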

Escalate on failure

The pattern that wins in production is escalate on failure, not pick-once-and-forget.

1. Send the request to rung 1.
2. The mini model returns an answer plus a confidence signal: either the model self-rates, or you compute it downstream (did the structured output validate? did retrieval support the claim? did the tool call return success?).
3. High confidence: return. You paid 0.1¢.
4. Low confidence: escalate to rung 2 or 3 with the same input.
5. Log the disagreements to your eval set so next quarter you can re-tune the threshold.

In a typical real-world distribution, 80–90% of requests are easy and clear rung 1. The 10–20% that don't get the bigger model. Every request pays for the rung-1 attempt, and the escalated ones pay for rung 3 on top, so at a 15% escalation rate the blended cost is $0.001 + (0.15 × $0.02) = $0.004 per call: roughly 5× cheaper than defaulting to rung 3 everywhere, with comparable quality on the cases that actually mattered.
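The escalation steps can be sketched as a small router. This is an illustration under stated assumptions, not a production design: the models are plain callables you would wire to your own provider clients, the stub models and the 0.8 threshold are hypothetical, and the cost function charges the rung-1 attempt on every request, including the ones that escalate, so it lands at about $0.004 per call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ModelResult:
    answer: str
    confidence: float  # 0.0 to 1.0, self-rated or computed downstream


ModelFn = Callable[[str], ModelResult]


def answer_with_escalation(request: str, cheap: ModelFn, strong: ModelFn,
                           threshold: float = 0.8,
                           disagreements: Optional[List[tuple]] = None) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when confidence is low."""
    first = cheap(request)
    if first.confidence >= threshold:
        return first.answer, "rung1"
    second = strong(request)
    # Keep disagreements so the threshold can be re-tuned against the eval set.
    if disagreements is not None and second.answer != first.answer:
        disagreements.append((request, first.answer, second.answer))
    return second.answer, "escalated"


def blended_cost(cheap_cost: float, strong_cost: float, escalation_rate: float) -> float:
    """Every request pays for rung 1; escalated requests also pay for the big model."""
    return cheap_cost + escalation_rate * strong_cost


# Hypothetical stand-ins for real model calls.
def stub_cheap(request: str) -> ModelResult:
    return ModelResult("cheap:" + request, 0.95 if len(request) < 20 else 0.4)


def stub_strong(request: str) -> ModelResult:
    return ModelResult("strong:" + request, 0.99)


log: List[tuple] = []
print(answer_with_escalation("short question", stub_cheap, stub_strong, disagreements=log))
print(answer_with_escalation("a much longer and messier question", stub_cheap, stub_strong, disagreements=log))
print(round(blended_cost(0.001, 0.02, 0.15), 4))
```

In practice the confidence signal is the hard part; a validation check on structured output is usually a more trustworthy trigger than a self-rated score.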

This pattern is not exotic. It is how the credible production AI products work. The teams that don't use it haven't measured what fraction of their traffic actually needs the big model — and the answer is almost always "less than you assumed."

Small models with tool use vs. frontier reasoning

A sub-decision people get wrong: do I need a frontier model, or a small model with tools?

Frontier reasoning models are good at problems where the work happens inside the model's head. "Here is a 50-page contract; find the ambiguities." "Here is a math olympiad question; solve it." Closed input, deep thinking.

Small models with tool use are good at problems where the answer is in the world. "What is the current GST rate on this HSN code?" — call a lookup tool. "Has this customer paid their last invoice?" — call your database. The reasoning each step requires is shallow; the model just picks the right tool, calls it, passes the result on.

For most product features — not research products, everyday features — the second pattern is dramatically cheaper and more reliable. You don't need GPT-5 with high reasoning to answer "is this user on the Pro plan?" You need any half-decent model with a lookup_subscription_status tool. The model IQ is barely the constraint; the tools are.

The rule: if the problem is solvable by reading an authoritative source, you need a small model and a good tool definition. If it's only solvable by thinking about the question itself, that's when you reach for the frontier rung.
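To make "a good tool definition" concrete, here is a minimal sketch of the lookup_subscription_status tool and the code that executes a model's tool call. The JSON-Schema-style shape is an assumption: exact field names differ by provider, so check your provider's tool-use documentation. The in-memory table and user IDs are hypothetical stand-ins for a real billing database.

```python
# Hypothetical in-memory stand-in for the billing database.
SUBSCRIPTIONS = {"user_123": "pro", "user_456": "free"}


def lookup_subscription_status(user_id: str) -> dict:
    """Authoritative lookup: the answer lives in the data, not in the model."""
    plan = SUBSCRIPTIONS.get(user_id)
    if plan is None:
        return {"found": False}
    return {"found": True, "plan": plan}


# Tool description handed to the model. Field names are an assumption and
# vary by provider; the description text is what the model reads to decide
# when to call the tool.
TOOL_DEFINITION = {
    "name": "lookup_subscription_status",
    "description": (
        "Return the current subscription plan for a user ID. "
        "Use this for any question about which plan a user is on."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string", "description": "Internal user identifier."}
        },
        "required": ["user_id"],
    },
}

TOOL_HANDLERS = {"lookup_subscription_status": lookup_subscription_status}


def run_tool_call(name: str, arguments: dict) -> dict:
    """Execute the tool the model asked for and return the result to pass back."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": "unknown tool: " + name}
    return handler(**arguments)


print(run_tool_call("lookup_subscription_status", {"user_id": "user_123"}))
```

Notice how little reasoning is left for the model: it only has to map the question to a tool name and one argument, which is rung-1 work.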

Batching and caching — the levers you forget

Two technical levers turn a borderline-affordable feature into a comfortable one. Most teams discover them after launch — the wrong order.

Prompt caching. Anthropic, OpenAI, and Google all support caching the prefix of your prompt: system message, tool definitions, the long document the user is asking about. The first call pays full price or slightly more (Anthropic charges a premium to write the cache); subsequent calls reusing the prefix pay roughly 10% of the input rate for the cached portion (per Anthropic's published cache pricing; the other two publish similar discounts). If your prompts have a 5K-token system message and a 20K-token document, that 25K-token prefix is the bulk of your input bill, and once cached it costs roughly 90% less on every repeat call. Caching frequently turns inference cost from "scary" to "rounding error."

Batch APIs. Anthropic's Message Batches API and OpenAI's Batch API process requests asynchronously over a 24-hour window and charge 50% of the synchronous rate (per both providers' batch pricing pages). For nightly classification, document enrichment, embedding refresh — anything that doesn't need a real-time response — not using batch leaves half your budget on the table.

Together these levers cut the bill by 5–8× on the right workload. If your eng team has not implemented either, that is the first conversation, not the model-swap conversation.
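A rough sketch of how the two levers combine on one workload. Assumptions, all of them worth checking against your provider's pricing page: Haiku-class list prices from the table, cached prefix tokens billed at 10% of the input rate, batch billed at 50% of the total, the two discounts stacking, and no cache-write premium. The token counts for the user question and the output are hypothetical.

```python
def request_cost(prefix_tokens: int, fresh_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float,
                 cache_hit: bool = False, batch: bool = False,
                 cached_read_fraction: float = 0.10,
                 batch_fraction: float = 0.50) -> float:
    """Cost in USD of one request under the stated discount assumptions."""
    prefix_rate = input_per_m * (cached_read_fraction if cache_hit else 1.0)
    cost = (prefix_tokens * prefix_rate
            + fresh_tokens * input_per_m
            + output_tokens * output_per_m) / 1_000_000
    # Assumption: the batch discount applies on top of the cache discount.
    return cost * (batch_fraction if batch else 1.0)


# 5K system message + 20K document as the stable prefix,
# plus a hypothetical 1K-token question and 1K-token answer.
workload = dict(prefix_tokens=25_000, fresh_tokens=1_000, output_tokens=1_000,
                input_per_m=1.00, output_per_m=5.00)

baseline = request_cost(**workload)
cached = request_cost(**workload, cache_hit=True)
cached_and_batched = request_cost(**workload, cache_hit=True, batch=True)

print(round(baseline, 5), round(cached, 5), round(cached_and_batched, 5))
print(round(baseline / cached, 1), round(baseline / cached_and_batched, 1))
```

Under these assumptions caching alone is worth about 3.6× and both levers together about 7.3×. The result is sensitive to output length: the more generation-heavy the workload, the less the input-side cache discount helps.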

Fine-tune vs. few-shot

A common over-correction: rung 1 with the default prompt doesn't clear the bar, so the team jumps to fine-tuning. Six weeks later they have a brittle training pipeline, a model that drifts from base with every lab release, and accuracy barely better than what a 200-line few-shot prompt would have produced in an afternoon.

Order of operations:

1. Write a careful prompt. (Chapter 3: Prompt Design as Product Design.)
2. Add 5–20 in-context examples. For most classification, extraction, and formatting jobs, this gets you 80% of the way to fine-tuned quality at 0% of the operational cost.
3. Move up a rung. A rung-2 model with a good prompt usually beats a rung-1 model with a fine-tune on tasks that need real reasoning.
4. Only if 1–3 are exhausted, fine-tune, and fine-tune the smallest model that fits. Fine-tuning a frontier model is almost never the answer; you lose the price advantage that justified the work.

Fine-tuning earns its operational cost in three cases: regulated outputs where you need bounded behaviour, narrow domains where the few-shot context window would be enormous, and edge deployment where every parameter matters. Outside those, prompt-and-rung-up first.
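The in-context-examples step is cheap enough to sketch in full. This is one plausible layout, not a prescribed format: the "Input:"/"Output:" labels, the function name, and the sample invoices are all hypothetical.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Assemble instruction, labelled examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts.append("Input: " + text)
        parts.append("Output: " + label)
        parts.append("")
    # End on an open "Output:" so the model completes the label for the query.
    parts.append("Input: " + query)
    parts.append("Output:")
    return "\n".join(parts)


examples = [
    ("Invoice 1042 from Acme Supplies, due in 30 days", "vendor_invoice"),
    ("Your monthly cloud hosting statement is ready", "subscription_bill"),
    ("Reimbursement request: taxi to airport", "expense_claim"),
]

prompt = build_few_shot_prompt(
    "Classify each document into exactly one category.",
    examples,
    "Receipt for team lunch, to be reimbursed",
)
print(prompt)
```

Because the examples live in the prompt rather than in model weights, changing the label set is an edit and a re-run of the eval, with no training pipeline to maintain.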

Three shipped examples

Cursor. The code editor that ate developer mindshare in 2024–25 doesn't run a frontier model on every keystroke. Cursor's "Tab" auto-complete is a small, specialised custom model, optimised for inline completion latency (typical p95 well under 100ms). The frontier models, Claude and GPT, only get invoked when the user explicitly asks via chat or Composer. Route by user intent: cheap fast model for the work that happens 200 times an hour, expensive smart model for the work that happens twice a day.

Perplexity. Perplexity has openly discussed model routing as core to how they ship — "Pro Search" picks among their own Sonar variants (fine-tuned Llama-derivatives), GPT-class, Claude, and others, depending on query type and user selection (docs.perplexity.ai). Reason: most queries are "summarise these three search results" — rung 1 work. A minority are "research this technical question across 20 sources and reconcile contradictions" — rung 3 work. Routing lets the unit economics survive the long tail.

GitHub Copilot. Per GitHub's engineering blog (github.blog), inline completions use a smaller specialised model (the Codex-line descendant), while Copilot Chat and the agent flows use GPT-class frontier models. Same pattern: the completion model fires hundreds of times per developer per day; chat fires a handful. You don't pay frontier rates for what happens 200 times an hour.

Three products, three providers, same architecture. Route by task, not by prestige.

What to do on Monday morning

If your team's current AI feature is on rung 2 or 3 by default, this is the drill:

1. Build a 50–100-row eval set of real user inputs with labelled correct outputs. (Chapter 4 again.)
2. Run it against rung 1. Score.
3. If rung 1 is within 3–5 points of rung 3 on the metric that matters, ship rung 1. If it isn't, look at the failures: half are usually prompt problems, not model problems.
4. Build the escalate-on-failure path so the 10–20% hard cases still get rung 3 quality.
5. Turn on prompt caching for any prompt with a stable prefix > 1,000 tokens. Turn on batch for anything that doesn't need a real-time response.

Do all five and the inference bill drops 5–20× without touching the user-visible product. The Indian-context rule from Chapter 1 (ai-7) lands hardest here: at ₹1,000–₹5,000/user/month ARPU, you have at most ₹30–₹150/user/month of inference budget, which is the rung-1 budget by default. The discipline isn't optional; it's the only way the feature ships profitably.
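The first three steps of the drill fit in a small harness. This is a sketch under assumptions: models are plain callables returning a label, exact-match accuracy is the metric that matters, and the four-row eval set and keyword model are hypothetical placeholders for your 50–100 real rows and a real rung-1 call.

```python
from typing import Callable, List, Tuple

EvalSet = List[Tuple[str, str]]  # (input, expected output)


def accuracy(model: Callable[[str], str], eval_set: EvalSet) -> float:
    """Exact-match accuracy of a model over the eval set."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(1 for text, expected in eval_set if model(text) == expected)
    return correct / len(eval_set)


def failures(model: Callable[[str], str], eval_set: EvalSet) -> List[Tuple[str, str, str]]:
    """Rows the model got wrong, as (input, expected, got), for manual review."""
    results = [(text, expected, model(text)) for text, expected in eval_set]
    return [row for row in results if row[1] != row[2]]


def choose_rung(acc_rung1: float, acc_rung3: float, tolerance_points: float = 5.0) -> str:
    """Ship rung 1 when it is within the tolerance of rung 3, else inspect failures."""
    gap_points = (acc_rung3 - acc_rung1) * 100
    if gap_points <= tolerance_points:
        return "ship rung 1"
    return "inspect rung-1 failures"


# Hypothetical stand-ins: a toy eval set and a keyword "model".
EVAL_SET: EvalSet = [
    ("Invoice #42 attached", "invoice"),
    ("Please find the INVOICE enclosed", "invoice"),
    ("Lunch on Friday?", "other"),
    ("Bill for March services", "invoice"),
]


def keyword_model(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"


print(accuracy(keyword_model, EVAL_SET))
print(failures(keyword_model, EVAL_SET))
print(choose_rung(0.95, 0.98), "|", choose_rung(0.80, 0.98))
```

The failures list is the part to read by hand: it is where you find out whether a miss is a prompt problem or a genuine capability gap.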

Where to go next

Chapter 3, Prompt Design as Product Design: half the failures you attribute to "the model is too small" are actually prompt failures. Fix that first.
Chapter 4, Eval Before Launch: the eval set is the lever that makes the ladder operational. Without it, every model-selection conversation is opinion.
Chapter 9, Cost & Latency as First-Class Product Constraints: the numbers from this chapter on the dashboard you review every week.
Companion: Working with Engineers — most model-selection decisions happen at the PM-engineer seam.
Back to: When AI Is the Right Answer (and When It Isn't) — the gate before this one.