If your AI feature's unit economics aren't on the roadmap, you haven't shipped it — you've launched a science fair project. The bill arrives whether you forecasted it or not.
After this page, you’ll be able to:
- Treat $/MAU and p95 latency like CAC and conversion rate — numbers you defend in every product review
- Use prompt caching, batch APIs, streaming, and per-session cost ceilings as design tools, not optimisations
- Design escalation budgets so a free user, a paid user, and an enterprise user each get the right rung of the model ladder for what they pay
Every AI product I have reviewed in the last eighteen months has the same blind spot in its first three monthly business reviews. The team can recite DAU, retention, NPS. Ask what the feature costs per active user per month, and you get a pause. Ask for last week's p95 latency, and you get a different number than the one on the SLO dashboard. Ask what happens when usage doubles tomorrow, and you get a face.
This is the gap between shipped and shipped responsibly. The AI era added two numbers to every product dashboard, and most teams have not wired them in. Cost — dollars per active user per month, more sharply dollars per use. Latency — p50 and p95 from action to first useful pixel, and to full response. Both are levers. Both compound. Both are invisible until they aren't, at which point they are sometimes catastrophic and almost always avoidable. This chapter is about putting them on the wall before launch, not after the first finance review.
Why cost is now a product constraint, not a finance problem
For most of the SaaS era, marginal cost of usage was approximately zero. Pricing was a packaging decision; cost was a capacity-planning decision. The two lived in different rooms. Inference broke that. Every call to a frontier model has a variable cost that scales linearly with usage. Heavy users cost more than light users. A power feature can cost more per call than the entire stack used to cost per user-month. The 2026 pricing snapshot below shows list prices per million tokens, input / output (anthropic.com/pricing, openai.com/api/pricing, ai.google.dev/pricing, May 2026):
- Rung 1 (cheap): Gemini 2.5 Flash-Lite ~$0.10 in / ~$0.40 out. GPT-5 nano ~$0.10 / ~$0.80. GPT-5 mini ~$0.25 / ~$2. Claude Haiku 4.5 ~$1 / ~$5.
- Rung 2 (workhorse): GPT-5 ~$1.25 / ~$10. Claude Sonnet 4.6 ~$3 / ~$15. Gemini 2.5 Pro ~$1.25–$2.50 / ~$10–$15.
- Rung 3 (frontier): Claude Opus 4.7 ~$15 / ~$75. GPT-5 with high reasoning, effective ~$10–$15 / ~$40–$60.
Chapter 2 (The Model-Selection Ladder) walks through how to pick a rung. This chapter is what you do after you have picked — because the wrong defaults compound silently.
Three multipliers turn a manageable bill into a runaway one. Context compounds: a naive RAG pipeline that stuffs 30K tokens of retrieved chunks into every call pays for 30K tokens on every call, even when the user asked a one-line follow-up. With Sonnet at $3/M input, that is 9¢ of input alone per call; 100,000 calls a day is $9,000 a day, or about $63,000 a week. Output compounds harder: output rates run 4–8× input rates in the snapshot above, so a chatty system prompt that elicits 2K tokens where 200 would have sufficed is a 10× silent overspend on the most expensive part of the bill. Retries compound on top of both: every retry, every tool-use loop, every agentic step is a fresh call against the full context. A three-step agent on a 20K-token context is paying for 60K tokens of input across the trajectory. (Chapter 6, on tool use, function calling, and agents, explains why each rung up the agent ladder costs more than it looks.)
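The arithmetic behind the three multipliers is short enough to sketch. This is an illustration, not a billing tool: the rates are the approximate Sonnet list prices quoted above, and `call_cost_usd` is a hypothetical helper, not any provider's API.

```python
def call_cost_usd(input_tokens, output_tokens, in_rate_per_m, out_rate_per_m, calls=1):
    """USD cost of `calls` identical calls, given per-million-token rates."""
    per_call = (input_tokens * in_rate_per_m + output_tokens * out_rate_per_m) / 1_000_000
    return per_call * calls


# Approximate Sonnet list rates quoted in the text, USD per 1M tokens.
SONNET_IN, SONNET_OUT = 3.00, 15.00

# Context: 30K tokens of retrieved chunks on every call (input only).
rag_per_call = call_cost_usd(30_000, 0, SONNET_IN, SONNET_OUT)
rag_per_day = call_cost_usd(30_000, 0, SONNET_IN, SONNET_OUT, calls=100_000)

# Output: a 2K-token answer where 200 tokens would have done.
verbose_answer = call_cost_usd(0, 2_000, SONNET_IN, SONNET_OUT)
terse_answer = call_cost_usd(0, 200, SONNET_IN, SONNET_OUT)

# Agent steps: three calls, each resending a 20K-token context.
agent_input = call_cost_usd(20_000, 0, SONNET_IN, SONNET_OUT, calls=3)

print(f"RAG input: ${rag_per_call:.2f} per call, ${rag_per_day:,.0f} per day at 100,000 calls")
print(f"Output overspend: {verbose_answer / terse_answer:.0f}x")
print(f"Three-step agent input: ${agent_input:.2f}")
```

Swap in your own token counts and rates; the point is that each multiplier is a single line of arithmetic you can run before launch.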
The two levers that earn back 10× — and why teams still skip them
Two provider-side features do most of the cost-savings work, and most teams in their first six months of shipping AI do not turn either of them on.
Prompt caching. Anthropic's prompt caching (anthropic.com/news/prompt-caching, GA since late 2024) charges roughly 10% of the input rate for cached tokens after a one-time write premium. OpenAI's "cached input" (openai.com/api/pricing) discounts the cached portion similarly. Gemini's context caching (ai.google.dev/pricing) publishes the same shape. If your prompt has a 5K-token system message, a 10K-token few-shot block, and an 8K-token document the user is asking three questions about, that is 23K tokens of stable prefix. Without caching, you pay full input on every question; with caching, the stable 23K costs 10% after the first call. On Sonnet 4.6 that turns a 7¢ call into ~1.2¢. A 5–10× compression of the input bill for the cost of a configuration flag. If your team is not running caching on any prompt with a >1K-token stable prefix, that is the first conversation, not the model-swap conversation.
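To see how the write premium amortises, here is a rough sketch of the prefix cost with and without caching. It counts only the 23K-token stable prefix, not the user's question or the output. The 10% read rate comes from the text; the 1.25× write premium and the assumption that the cache stays warm between calls are mine, so check both against your provider's current pricing and cache lifetime. `prefix_cost_usd` is a hypothetical helper.

```python
def prefix_cost_usd(prefix_tokens, calls, in_rate_per_m, read_discount=None, write_premium=1.0):
    """Input cost of a stable prefix sent on `calls` consecutive calls.

    read_discount=None means no caching: every call pays the full input rate.
    Otherwise the first call writes the cache at write_premium x the input rate
    and every later call reads it at read_discount x the input rate.
    Assumes the cache never expires between calls.
    """
    full = prefix_tokens * in_rate_per_m / 1_000_000  # one uncached pass over the prefix
    if read_discount is None or calls == 0:
        return full * calls
    return full * write_premium + full * read_discount * (calls - 1)


PREFIX_TOKENS = 5_000 + 10_000 + 8_000  # system message + few-shot block + document
SONNET_IN = 3.00       # USD per 1M input tokens, as quoted in the text
READ_DISCOUNT = 0.10   # cached reads at roughly 10% of the input rate, per the text
WRITE_PREMIUM = 1.25   # ASSUMPTION: one-time write premium; verify on the pricing page

results = {}
for calls in (3, 50):
    uncached = prefix_cost_usd(PREFIX_TOKENS, calls, SONNET_IN)
    cached = prefix_cost_usd(PREFIX_TOKENS, calls, SONNET_IN, READ_DISCOUNT, WRITE_PREMIUM)
    results[calls] = (uncached, cached)
    print(f"{calls} calls: ${uncached:.3f} uncached, ${cached:.3f} cached, {uncached / cached:.1f}x")
```

Under these assumptions the saving grows with the number of calls that hit the same warm prefix, which is why a system message and few-shot block shared across users is usually the first thing worth caching.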
Batch APIs. Anthropic's Message Batches and OpenAI's Batch API both charge 50% of synchronous for jobs processed within a 24-hour SLA (anthropic.com/api Message Batches; openai.com/api/pricing Batch). Anything that does not need a real-time response — nightly enrichment, embedding refresh, eval runs (Eval Before Launch), classification of yesterday's tickets — belongs on the batch endpoint. Not using batch on async work is leaving half the budget on the floor.
Together these are a 4–10× lever on the bill. Not optimisations — defaults.
Latency is a product feature, not an engineering metric
Three seconds is the number where users switch tabs. Two seconds is where they start feeling the wait. Sub-500ms feels instant. The web-latency literature from the last twenty years did not get repealed when LLMs arrived; if anything, AI raised the floor on what users expect, because ChatGPT taught them responses stream.
Two latency numbers belong on the dashboard. Time to first token (TTFT) — from user action to the first chunk of useful output. This is the latency the user feels. Streaming makes total response time almost irrelevant up to about 10 seconds, as long as TTFT is under one second. If you are not streaming user-facing generative output in 2026, you are choosing to feel slower than you are. Total p50 and p95 — p50 is the median user; p95 is the 5% having a worse day. The gap tells you whether you have a tail problem (retries, cold cache, long-context users) or a body problem (the whole feature is slow). A feature can have a fine p50 and a brutal p95, and p95 users churn first because they are usually the heaviest users — exactly the cohort you cannot afford to lose.
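A minimal sketch of how those four numbers might be computed from raw request logs. The nearest-rank percentile is one common convention, not the only one, and the sample values are made up to show a tail problem; `percentile` and `latency_summary` are hypothetical names.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(ttft_ms, total_ms):
    """p50, p95, and the p95-minus-p50 tail for TTFT and total latency."""
    summary = {}
    for name, samples in (("ttft", ttft_ms), ("total", total_ms)):
        p50, p95 = percentile(samples, 50), percentile(samples, 95)
        summary[name] = {"p50": p50, "p95": p95, "tail": p95 - p50}
    return summary


# Made-up samples in milliseconds: nine healthy requests and one slow one.
ttft = [400, 450, 500, 520, 480, 470, 510, 430, 460, 2900]
total = [3000, 3200, 3100, 2900, 3300, 3050, 3150, 2950, 3250, 9800]

summary = latency_summary(ttft, total)
for name, stats in summary.items():
    print(name, stats)
```

In this toy data the median looks fine and the tail does not, which is the pattern to look for: a healthy p50 with a p95 several times larger points at retries, cold caches, or long-context users rather than a slow feature overall.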
For agentic features (Tool Use, Function Calling, Agents — The Maturity Ladder), latency budgets multiply across steps. A four-step agent at 2 seconds per call is an 8-second user experience before tool latency. Compress the chain, parallelise tool calls, or show progress at each step — a Thinking… indicator with the current step is honest and buys patience. Silent waiting kills.
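Parallelising tool calls only helps when the calls do not depend on each other's results. Under that assumption, a sketch with Python's `asyncio` shows the difference; `fake_tool` is a stand-in that sleeps instead of calling a real tool, and the tool names and durations are invented.

```python
import asyncio
import time


async def fake_tool(name, seconds):
    """Stand-in for a tool call; sleeps instead of doing real work."""
    await asyncio.sleep(seconds)
    return name


async def run_sequential(tools):
    # Each call waits for the previous one to finish.
    return [await fake_tool(name, seconds) for name, seconds in tools]


async def run_parallel(tools):
    # Independent calls run concurrently; results come back in input order.
    return await asyncio.gather(*(fake_tool(name, seconds) for name, seconds in tools))


def timed(coro_fn, tools):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(tools))
    return list(result), time.perf_counter() - start


TOOLS = [("search", 0.1), ("lookup", 0.1), ("calc", 0.1)]
seq_result, seq_seconds = timed(run_sequential, TOOLS)
par_result, par_seconds = timed(run_parallel, TOOLS)

print(f"sequential: {seq_seconds:.2f}s, parallel: {par_seconds:.2f}s")
```

Sequential latency is roughly the sum of the calls; parallel latency is roughly the slowest single call. Steps that genuinely depend on earlier output still have to run in order, which is where the progress indicator earns its keep.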
The "cost ceiling per session" pattern
The most useful pattern I have shipped in the last year is a hard cap on what one session can spend, enforced in the application layer, with a graceful UX when it trips.
Pick a number that maps to the business model. Free-tier consumer might cap at ₹2 (~$0.025) per session. Paid B2B at $0.50. Enterprise at $5. The cap is not the typical use — it is the ceiling that catches the abusive case, the runaway loop, the prompt-injected agent, the user who pasted a novel into the input.
When the cap is hit, the feature does one of three things, in order of preference: fall back to a cheaper rung (remaining steps complete on Haiku instead of Sonnet — quality drops slightly, user keeps moving); truncate context (older turns summarised or dropped); or hard-stop with an honest message ("This conversation has reached the limit for this session. Start a new one." — better than a silent bill).
The pattern matters because a tiny fraction of users will burn a large fraction of your bill if you let them. Without a cap, worst-case spend per user is unbounded — and the worst case is the user who pastes their Slack history into the chat and asks for a summary every five minutes. With a cap, the worst case is bounded and forecastable.
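One way the ceiling and its three fallbacks might look in the application layer. This is a sketch under stated assumptions: `SessionBudget` and `estimate_cost` are hypothetical, "sonnet" and "haiku" are labels rather than real model identifiers, the rates are the approximate list prices quoted earlier, and the truncation step only computes a token budget (summarising or dropping older turns is left to the caller).

```python
from dataclasses import dataclass

# Approximate list rates from the snapshot, USD per 1M tokens: (input, output).
RATES = {"sonnet": (3.00, 15.00), "haiku": (1.00, 5.00)}


def estimate_cost(model, input_tokens, output_tokens):
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


@dataclass
class SessionBudget:
    cap_usd: float
    spent_usd: float = 0.0

    def plan(self, input_tokens, max_output_tokens, min_context_tokens=1_000):
        """Return (action, model, input_tokens) for the next call, mildest fallback first."""
        remaining = self.cap_usd - self.spent_usd
        # 1. The preferred rung, if the worst-case call still fits under the cap.
        if estimate_cost("sonnet", input_tokens, max_output_tokens) <= remaining:
            return ("proceed", "sonnet", input_tokens)
        # 2. Fall back to the cheaper rung with the full context.
        if estimate_cost("haiku", input_tokens, max_output_tokens) <= remaining:
            return ("downgrade", "haiku", input_tokens)
        # 3. Truncate context to whatever the remaining budget can pay for.
        in_rate, out_rate = RATES["haiku"]
        affordable = int((remaining * 1_000_000 - max_output_tokens * out_rate) / in_rate)
        if affordable >= min_context_tokens:
            return ("truncate", "haiku", affordable)
        # 4. Hard stop; the UI shows the honest "start a new session" message.
        return ("stop", None, 0)

    def record(self, model, input_tokens, output_tokens):
        """Add the actual cost of a completed call to the session total."""
        self.spent_usd += estimate_cost(model, input_tokens, output_tokens)


budget = SessionBudget(cap_usd=0.50)  # the paid B2B ceiling from the text
for already_spent in (0.0, 0.45, 0.49, 0.499):
    budget.spent_usd = already_spent
    print(already_spent, budget.plan(input_tokens=20_000, max_output_tokens=1_000))
```

Planning against `max_output_tokens` rather than the expected output is a deliberate choice here: the cap exists for the worst case, so the check should price the worst case.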
Kill switches and circuit breakers
Two operational controls every shipping AI feature needs. A per-feature kill switch — a flag that disables the feature without a deploy. When the bill spikes 5× overnight, when a provider has an outage, when a prompt-injection compromise hits the news, you need to turn the feature off in under five minutes. If turning it off requires a deploy, you do not have a kill switch — you have a runbook. A spend circuit breaker — a daily/hourly threshold that, when exceeded, triggers either a rung downgrade (Sonnet reroutes to Haiku) or a hard pause with an alert. Provider dashboards offer this primitively; production-grade means wiring it into your own telemetry against your own budgets. The first Saturday morning it saves you $20,000, it has paid for itself for the year.
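A sketch of the two controls as one guard that every request checks before calling a model. The names are hypothetical, and the two-stage design (downgrade at 80% of the daily budget, pause at 100%) is my assumption about how to combine the downgrade and pause options; in production the `enabled` flag would be read from a feature-flag service so it can change without a deploy.

```python
from dataclasses import dataclass


@dataclass
class FeatureGuard:
    daily_budget_usd: float
    downgrade_at: float = 0.8     # ASSUMPTION: fraction of budget that triggers the rung downgrade
    enabled: bool = True          # kill switch; read from a flag service in production
    spent_today_usd: float = 0.0  # reset by a daily job, fed by your own telemetry

    def record_spend(self, usd):
        self.spent_today_usd += usd

    def decide(self):
        """Return 'off', 'paused', 'downgrade', or 'normal' for the next request."""
        if not self.enabled:
            return "off"  # kill switch wins over everything else
        if self.spent_today_usd >= self.daily_budget_usd:
            return "paused"  # hard pause; this is also where the alert fires
        if self.spent_today_usd >= self.downgrade_at * self.daily_budget_usd:
            return "downgrade"  # reroute to the cheaper rung
        return "normal"


guard = FeatureGuard(daily_budget_usd=1_000.0)
states = []
for spend in (500.0, 350.0, 200.0):
    guard.record_spend(spend)
    states.append(guard.decide())
guard.enabled = False
states.append(guard.decide())
print(states)
```

The same shape works for an hourly threshold; the part that matters is that the decision is made per request from your own counters, not from a provider dashboard someone has to open.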
Escalation budgets — free, paid, enterprise
The cleanest pricing pattern in shipped 2026 AI products is rung-by-tier, not feature-by-tier. The free user gets the same feature the enterprise user gets, but the free user runs on Haiku, the paid user runs on Sonnet, and the enterprise user routes to Opus when the system decides the request needs it. Marginal cost on the free tier stays trivial: Haiku at ~$1/M input puts a typical 2K-token call at 0.2¢ of input, so 100 calls a month per free user is 20¢ a month, sustainable as a CAC line. (At 100 calls a day the same user costs about $6 a month, which is why the free tier also needs a usage cap.) Paid users absorb Sonnet rates if the paid plan is ≥10× the implied value of the free tier. Enterprise users buy latency and quality SLAs at a price point that covers Opus plus reserved capacity. The mistake to avoid is the inverted version: frontier inference for free users because the demo was crisp. That is the road to the "we capped at 10,000 users because the unit economics broke" obituary. Plan the rung escalation at the same time as the pricing tiers, not after.
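Rung-by-tier routing can be as small as a lookup table plus one escalation rule. In this sketch the tier names, the rung labels, and the `needs_frontier` signal are all hypothetical, and the enterprise default of "sonnet" for requests that do not need the frontier rung is my assumption. The cost helper counts input tokens only, at the ~$1/M Haiku rate quoted above.

```python
TIER_DEFAULT_RUNG = {"free": "haiku", "paid": "sonnet", "enterprise": "sonnet"}


def pick_rung(tier, needs_frontier=False):
    """Choose a model rung from the pricing tier; only enterprise may escalate."""
    if tier not in TIER_DEFAULT_RUNG:
        raise ValueError(f"unknown tier: {tier}")
    if tier == "enterprise" and needs_frontier:
        return "opus"
    return TIER_DEFAULT_RUNG[tier]


def monthly_input_cost_usd(calls_per_month, tokens_per_call, in_rate_per_m):
    """Input-only inference cost per user per month."""
    return calls_per_month * tokens_per_call * in_rate_per_m / 1_000_000


HAIKU_IN = 1.00  # USD per 1M input tokens, approximate list rate from the snapshot

light_free_user = monthly_input_cost_usd(100, 2_000, HAIKU_IN)       # 100 calls a month
heavy_free_user = monthly_input_cost_usd(100 * 30, 2_000, HAIKU_IN)  # 100 calls a day, 30-day month

print(pick_rung("free", needs_frontier=True))        # free never escalates
print(pick_rung("enterprise", needs_frontier=True))
print(f"light free user: ${light_free_user:.2f}/month, heavy free user: ${heavy_free_user:.2f}/month")
```

Keeping the table next to the pricing page definition makes the inverted version hard to ship by accident: a free tier that maps to the frontier rung is visible in one line of review.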
What to measure — the AI dashboard
Six numbers, weekly review, every product with an AI feature in it:
- $/MAU on AI features. Inference spend ÷ MAUs on AI surfaces. Compare to ARPU. Climbing share of ARPU is a problem you can see before finance does.
- $/feature-use. Spend on a feature ÷ uses. Surfaces which features are cost drivers and where caching/batching would pay off.
- p50 and p95 TTFT and total latency. Four numbers, one chart. p95-minus-p50 is the tail.
- Cache hit rate. Cached tokens ÷ cacheable tokens. Below 70% on a stable-prefix prompt means the cache key is wrong or the prompts are churning.
- Retry rate. Retried calls ÷ total calls. Structural failures (bad schema, bad tool defs) vs. intermittent provider issues.
- Cap-hit rate. Sessions that hit the per-session ceiling ÷ total sessions. Under 1% is healthy; above 5% means the cap is too tight or you have abuse to look at.
Put these on the same dashboard as DAU and retention. Nobody can argue for shipping an AI feature whose cost line climbs faster than its retention.
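The ratio metrics above reduce to a handful of divisions over counters you probably already log. A minimal sketch, assuming one feature and one reporting period; the field names are hypothetical, the sample numbers are invented, and the two alert thresholds (cache hit rate below 70%, cap-hit rate above 5%) are the ones stated in the list. Latency percentiles are computed separately from request timings.

```python
from dataclasses import dataclass


@dataclass
class AICounters:
    """Raw counters for one AI feature over one reporting period."""
    inference_spend_usd: float
    ai_maus: int
    feature_uses: int
    cached_tokens: int
    cacheable_tokens: int
    retried_calls: int
    total_calls: int
    capped_sessions: int
    total_sessions: int


def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def dashboard(c):
    return {
        "usd_per_mau": ratio(c.inference_spend_usd, c.ai_maus),
        "usd_per_use": ratio(c.inference_spend_usd, c.feature_uses),
        "cache_hit_rate": ratio(c.cached_tokens, c.cacheable_tokens),
        "retry_rate": ratio(c.retried_calls, c.total_calls),
        "cap_hit_rate": ratio(c.capped_sessions, c.total_sessions),
    }


def flags(metrics):
    """Return the thresholds from the text that this period's metrics breach."""
    found = []
    if metrics["cache_hit_rate"] < 0.70:
        found.append("cache hit rate below 70%: check the cache key or prompt churn")
    if metrics["cap_hit_rate"] > 0.05:
        found.append("cap-hit rate above 5%: cap too tight or abuse to look at")
    return found


# Invented numbers for illustration only.
period = AICounters(5_800.0, 20_000, 116_000, 840, 1_000, 30, 1_000, 8, 1_000)
metrics = dashboard(period)
print(metrics)
print(flags(metrics))
```

Comparing `usd_per_mau` to ARPU is a join against billing data and is left out here; the sketch covers only the numbers that come from inference telemetry.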
Three worked examples
Example 1 — A free-tier consumer feature capped at 50 messages per day
A consumer journaling app shipped an "ask your past self" feature — a RAG-backed chatbot over the user's own entries. The free tier was the dominant user base. The first version had no per-day cap; within three weeks, 2% of users were consuming 60% of the inference budget. Retention data showed the feature was working, so removal was the wrong move. Instead: cap free users at 50 AI messages per day, route all free traffic to Haiku, and cache the user's recent-entries prefix. Cost per free MAU dropped from $0.41 to $0.06 — 85% compression. The 50-message cap surfaced as an upgrade prompt: "Pro members get 500/day and faster responses on Sonnet." Free-to-paid conversion lifted 3.2 percentage points the following quarter. The cap was the wedge.
Example 2 — A B2B feature where $/use was the wedge
A finance-ops SaaS company shipped an invoice-anomaly explanation feature on Opus 4.7 because the demo flowed beautifully. The first month's bill was $47,000 against $12,000 of attributable ARR uplift. Three changes over six weeks: re-test on Sonnet 4.6 (accuracy dropped from 96% to 94%, still well above the 90% eval bar; see Eval Before Launch); batch the daily anomaly pass overnight at the 50% rate; cache the chart-of-accounts and 30-day invoice context as a stable prefix (cache hit rate stabilised at 84%). Cost dropped from $47,000 to $5,800 a month, roughly 8×. Same accuracy bar. The eval set made the model swap defensible; the dashboard made the team notice in time.
Example 3 — An enterprise feature where latency mattered more than cost
A regulated-industry research platform shipped a long-document Q&A feature for enterprise legal users. Seat prices were in the low four figures; documents were 50–200 pages. The team's initial instinct was to optimise cost — route to Haiku, compress context aggressively. Pilot customers reported the answers were "good enough" but the 14–20-second wait per question felt sluggish. The team reversed the optimisation: Sonnet 4.6 on full context, prompt caching of the document prefix (91% hit rate within a session), streaming from the first token, and a Thinking… indicator while waiting. TTFT dropped to 800ms, total latency to 6 seconds for a 200-page document, and the user-reported "feels fast enough" went from 23% to 81%. Cost per question rose ~4×, to about 12¢. Renewal at end-of-pilot: 100%. For an enterprise tier, latency was the wedge and cost was the rounding error — the reverse trade from Example 1, and the right one for the tier.
What to do on Monday morning
If you have an AI feature in production and the dashboard above does not exist, that is the next sprint. Six numbers, one chart, weekly review. Until then, the team is shipping AI features without operating them. If you have a feature shipping next quarter, the per-session cap, the kill switch, the rung-by-tier routing, and the caching/batch defaults are not v2 work — they are launch work. The version of these features you can defend in front of a finance review is the version that has them on at launch, not the version that adds them after the first scary bill.
Rules
$/MAU and p95 latency belong on the same dashboard as DAU and retention. If you cannot recite both numbers from last week without looking, you do not have an AI product; you have an AI demo.
Turn on prompt caching for any prompt with a stable prefix over 1,000 tokens, and turn on batch APIs for any workload that does not need a real-time response. These are 4–10× levers; not using them is not an optimisation backlog, it is a launch defect.
Stream every user-facing generative response. Optimise time-to-first-token aggressively and total latency second. p95-minus-p50 is the tail you watch for churn.
Every AI feature ships with a per-session cost ceiling, a per-user-per-day cap, and a graceful fallback (cheaper rung, truncated context, or honest stop). Unbounded spend per user is a bug, not an architecture.
Every AI feature ships with a kill switch that disables it without a deploy, and a spend circuit breaker that downgrades the rung or pauses the feature when a daily/hourly budget is exceeded.
Plan rung-by-tier at the same time as pricing. Free users get Haiku-class. Paid users get Workhorse-class. Enterprise users get Frontier-class on demand. Inverting this kills the unit economics of the free tier.
For enterprise tiers, latency is often the wedge and cost is the rounding error. For free tiers, cost is the wedge and latency is the constraint. Know which tier you are designing for before you choose the optimisation.
Measure cost per error and cost per use, not just cost per token. The bill is a lagging indicator; the per-use number is the leading one. Put both on the dashboard.
Where to go next
- Chapter 2 — The model-selection ladder: the rung-by-rung price spread these constraints sit on top of.
- Chapter 1 — When AI is the right answer: the gate before any of this matters.
- Chapter 4 — Eval before launch: the bar that makes a cost-driven model swap defensible.
- Chapter 6 — Tool use, function calling, agents: why each rung up the agent ladder compounds latency and cost.
- Chapter 7 — RAG, fine-tune, or context window? (forthcoming) — the architecture choice that determines how much of this bill is even cacheable.
- Companion: Product Prioritization — the framework for putting cost work above feature work when the dashboard says so.