A feature that costs $0.08 per query looks cheap at 1,000 queries a month ($80) and costs $80,000 a month at 1,000,000. Most AI features are shipped without a cost model, and most AI teams face an uncomfortable conversation with finance six months after launch.
After this page, you'll understand:
- How to read a model pricing page and build a cost-per-query model
- The four cost levers: model tier, prompt length, caching, and cascading
- What prompt caching actually is and when it saves significant money
- Model cascading — the architecture that routes cheap vs. expensive model by task complexity
- Real P50/P95 latency targets for common AI product patterns
Token economics are real. They don't feel real when you're building — the OpenAI dashboard is abstract, the numbers are small, and the early user volume is tiny. They become real when you scale. This page gives you the cost model you should have before you ship, not after you get the finance email.
Reading a model pricing page
Model pricing is almost universally quoted as cost per million tokens, with separate prices for input (the prompt) and output (the generated response). Output tokens typically cost several times more than input tokens (3-5x in the table below) because they are generated one at a time, while the prompt is processed in a single parallel pass.
2026 reference pricing (approximate, check current pages):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | High-volume classification, simple extraction |
| Claude 3.5 Haiku | $0.25 | $1.25 | High-volume tasks, good reasoning/cost ratio |
| GPT-4o | $2.50 | $10.00 | Complex reasoning, generation quality |
| Claude 4.5 Sonnet | $3.00 | $15.00 | Complex reasoning, long context |
| GPT-5 | $10.00 | $30.00+ | Frontier reasoning, complex agent tasks |
| Claude 4.5 Opus | $15.00 | $75.00 | Maximum quality, low-volume high-stakes |
A token is roughly 0.75 words. 1,000 words ≈ 1,333 tokens.
The cost-per-query calculation:
cost_per_query = (avg_input_tokens / 1M × input_price) + (avg_output_tokens / 1M × output_price)
For a customer support draft generator with a 500-token system prompt, a 200-token user message, and a 300-token output, using GPT-4o:
Input cost: (700 tokens / 1M) × $2.50 = $0.00175
Output cost: (300 tokens / 1M) × $10.00 = $0.003
Total per query: $0.00475
At 100k queries/month: $475/month. At 1M queries/month: $4,750/month.
Now model the same feature at 10M queries/month (a successful scaled product): $47,500/month. This is where you want to have already switched to a cheaper model for the cases that don't require GPT-4o quality.
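The arithmetic above is easy to script so it can be rerun whenever prices or token counts change. This is a minimal sketch using the GPT-4o prices from the table; the function names `cost_per_query` and `monthly_cost` are illustrative and not part of any library.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, given prices quoted per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


def monthly_cost(queries_per_month: int, per_query: float) -> float:
    """Projected monthly spend at a given query volume."""
    return queries_per_month * per_query


# Support-draft example: 500-token system prompt + 200-token user message in,
# 300 tokens out, priced at GPT-4o's $2.50 / $10.00 per 1M tokens.
per_query = cost_per_query(700, 300, input_price_per_m=2.50, output_price_per_m=10.00)

for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} queries/month -> ${monthly_cost(volume, per_query):,.2f}")
```

Run as written, this should reproduce the $475, $4,750, and $47,500 figures from the worked example.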
The trap: teams estimate cost at their current volume and fail to model what happens at 10x. Build the cost model at 1x, 10x, and 100x volume before committing to an architecture. A feature that looks affordable at launch can become a finance emergency at scale, so have the conversation before the sprint, not after the bill.
The four cost levers
You have four levers for controlling AI inference cost. Use them in order of simplicity.
1. Model tier
The most impactful lever. At the prices above, GPT-4o-mini is roughly 17x cheaper than GPT-4o on both input and output tokens. If it performs equally well on your specific task, switching cuts the model bill by the same factor. Many tasks don't need frontier-model quality.
The practical test: benchmark your task on mini-tier and full-tier models against your golden eval set. If the quality difference is < 5% on dimensions users care about, use the cheaper model. "We use GPT-4o because quality matters" is not a cost argument — it's a default that hasn't been tested.
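The 5% rule can be written down as a small check. This is a sketch under stated assumptions: quality is a single 0-1 score per golden-set example, and "difference" means the relative drop in mean score. The scores and the names `relative_quality_drop` and `cheaper_model_is_acceptable` are hypothetical.

```python
from statistics import mean


def relative_quality_drop(full_scores: list[float], mini_scores: list[float]) -> float:
    """Fractional drop in mean score when moving from the full tier to the mini tier."""
    full_avg = mean(full_scores)
    if full_avg <= 0:
        raise ValueError("full-tier mean score must be positive")
    return (full_avg - mean(mini_scores)) / full_avg


def cheaper_model_is_acceptable(full_scores: list[float], mini_scores: list[float],
                                max_drop: float = 0.05) -> bool:
    """True when the mini tier loses less than max_drop (5% by default)."""
    return relative_quality_drop(full_scores, mini_scores) < max_drop


# Hypothetical per-example scores on the same golden-set items.
full_tier = [0.90, 0.85, 0.95, 0.80, 0.90]
mini_tier = [0.88, 0.84, 0.90, 0.79, 0.87]

print(f"relative drop: {relative_quality_drop(full_tier, mini_tier):.1%}")
print("switch to mini tier:", cheaper_model_is_acceptable(full_tier, mini_tier))
```

In practice you would likely run this per quality dimension rather than on one blended score, since a small average drop can hide a large drop on the dimension users care about most.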
2. Prompt length
Every token in your prompt costs money, and every unnecessary token in a long system prompt is a tax on every query. A 2,000-token system prompt costs $0.005 per query on GPT-4o ($2.50 per 1M input tokens). At 1M queries/month, that's $5,000/month in system prompt overhead alone.
Audit your prompts. Remove instructions that don't change behavior. Move static context into fine-tuning if query volume is high enough. Use prompt compression (summarizing or trimming examples to the minimum that preserves quality).
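To see what a prompt audit is worth, it helps to price the system prompt on its own. This sketch uses the 2,000-token GPT-4o example; the trimmed length of 1,200 tokens is a hypothetical audit result, and `prompt_overhead_per_month` is an illustrative name.

```python
def prompt_overhead_per_month(prompt_tokens: int, queries_per_month: int,
                              input_price_per_m: float) -> float:
    """Monthly dollars spent re-sending a fixed prompt segment on every query."""
    return prompt_tokens * queries_per_month / 1_000_000 * input_price_per_m


GPT_4O_INPUT_PRICE = 2.50  # dollars per 1M input tokens, from the pricing table

before = prompt_overhead_per_month(2_000, 1_000_000, GPT_4O_INPUT_PRICE)
after = prompt_overhead_per_month(1_200, 1_000_000, GPT_4O_INPUT_PRICE)  # hypothetical trim

print(f"before audit: ${before:,.0f}/month")
print(f"after audit:  ${after:,.0f}/month")
print(f"saved:        ${before - after:,.0f}/month")
```

The saving scales linearly with both tokens removed and query volume, which is why the same audit matters far more at 1M queries than at 10k.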
3. Prompt caching
Prompt caching (available from Anthropic and OpenAI as of 2025) lets the provider reuse the work it has already done on a prompt prefix. Anthropic asks you to mark cacheable blocks explicitly; OpenAI caches long prefixes automatically. When the same prefix is sent again (same bytes, same position), the API doesn't re-process it and charges a reduced rate for those tokens (typically 50-90% lower than full input pricing).
When caching saves significant money:
- You have a long system prompt that is static across requests and sits above the provider's cache minimum (roughly 1,024 tokens)
- You inject large reference documents into every prompt (documentation, product catalog, knowledge base context)
- You have multi-turn conversations where earlier turns repeat on each API call
When caching saves little:
- Your prompts are short: below the provider's minimum cacheable length (typically 1,024 tokens on both Anthropic and OpenAI, higher on some models), nothing is cached
- Your prompts change significantly per request (high dynamic content)
- Low query volume — cache benefits scale with volume
A concrete example: if your RAG system injects 3,000 tokens of retrieved context that is the same across a session, caching those tokens at Anthropic's 90% discount drops your input cost from $3.00 to $0.30 per 1M tokens for that segment. At 100k queries/month with 3k-token context, that saves roughly $810/month, before accounting for cache misses and the surcharge Anthropic applies when a segment is first written to the cache.
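The savings estimate can be sketched as a one-function model. The `hit_rate` parameter is an assumption added here to show how misses erode the saving; the function ignores cache-write surcharges, and `cached_segment_savings` is an illustrative name.

```python
def cached_segment_savings(segment_tokens: int, queries_per_month: int,
                           input_price_per_m: float, cache_discount: float,
                           hit_rate: float = 1.0) -> float:
    """Monthly dollars saved on one cached prompt segment.

    cache_discount is the fraction taken off the input price on a cache hit
    (0.90 for a 90% discount). Cache-write surcharges are not modeled.
    """
    tokens_in_millions = segment_tokens * queries_per_month / 1_000_000
    full_cost = tokens_in_millions * input_price_per_m
    return full_cost * cache_discount * hit_rate


# 3,000-token context, 100k queries/month, $3.00 per 1M input tokens, 90% discount.
best_case = cached_segment_savings(3_000, 100_000, 3.00, cache_discount=0.90)
with_misses = cached_segment_savings(3_000, 100_000, 3.00, cache_discount=0.90,
                                     hit_rate=0.80)  # hypothetical 80% hit rate

print(f"every request hits the cache: ${best_case:,.0f}/month")
print(f"80% of requests hit:          ${with_misses:,.0f}/month")
```

Measuring your real hit rate from API usage data is worth doing before quoting a savings number, since short sessions and frequently changing context both lower it.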
4. Model cascading (routing)
Cascading is the architecture where you route requests to different model tiers based on estimated complexity. Simple requests go to a cheap model; complex requests escalate to an expensive model.
A basic cascade:
- Run the request through a cheap classifier or the cheaper model
- If the output confidence is high and the task is simple, return that output
- If the output is flagged as low-confidence, complex, or a sensitive topic, escalate to the expensive model
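The three steps above can be sketched as a small router. Everything here is illustrative: the `Draft` type, the confidence threshold, the sensitive-term list, and the stub models stand in for whatever classifier and API calls you actually use. How to estimate confidence is left open on purpose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    text: str
    confidence: float  # 0-1, estimated however your system chooses


Model = Callable[[str], Draft]

SENSITIVE_TERMS = ("legal", "compliance", "refund policy")  # hypothetical list


def needs_escalation(query: str, draft: Draft, min_confidence: float = 0.8) -> bool:
    """Escalate when the topic is sensitive or the cheap draft is low-confidence."""
    sensitive = any(term in query.lower() for term in SENSITIVE_TERMS)
    return sensitive or draft.confidence < min_confidence


def answer(query: str, cheap_model: Model, expensive_model: Model) -> tuple[str, Draft]:
    """Try the cheap tier first; fall back to the expensive tier when flagged."""
    draft = cheap_model(query)
    if needs_escalation(query, draft):
        return "expensive", expensive_model(query)
    return "cheap", draft


# Stub models so the sketch runs without any API access.
def stub_cheap(query: str) -> Draft:
    confidence = 0.95 if len(query.split()) < 12 else 0.5
    return Draft(f"[cheap] {query}", confidence)


def stub_expensive(query: str) -> Draft:
    return Draft(f"[expensive] {query}", 0.9)


tier, _ = answer("How do I reset my password?", stub_cheap, stub_expensive)
print(tier)
```

Note that an escalated request pays for both calls. A separate lightweight classifier in front of the models avoids that at the price of one more component to evaluate.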
Example routing logic for a customer support feature:
- FAQ / known-answer questions → GPT-4o-mini (about $0.0003/query at the same 700-input/300-output token profile)
- Complex multi-step questions → GPT-4o ($0.00475/query)
- Policy / compliance questions → GPT-4o with explicit grounding + human review
If 70% of queries are FAQ-level, this cascade cuts average cost by roughly 65% vs. running everything through GPT-4o, before counting the cost of the routing step itself.
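The blended cost of a cascade is a weighted average of the per-tier costs. This sketch computes it from the table prices, assuming both tiers see the same 700-input/300-output token profile and ignoring the cost of the routing step. The names `PRICES_PER_M` and `blended_cost` are illustrative.

```python
# (input, output) dollars per 1M tokens, taken from the reference pricing table.
PRICES_PER_M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def blended_cost(traffic_share: dict[str, float],
                 input_tokens: int, output_tokens: int) -> float:
    """Average cost per query when traffic is split across model tiers."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_query(model, input_tokens, output_tokens)
               for model, share in traffic_share.items())


baseline = cost_per_query("gpt-4o", 700, 300)
cascade = blended_cost({"gpt-4o-mini": 0.7, "gpt-4o": 0.3}, 700, 300)
saving = 1 - cascade / baseline

print(f"all GPT-4o:    ${baseline:.5f}/query")
print(f"70/30 cascade: ${cascade:.5f}/query")
print(f"saving:        {saving:.0%}")
```

Under these assumptions the saving comes out somewhat above 60%. Rerun it with your own traffic split and token counts, since FAQ-style queries often have shorter outputs than complex ones.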
The engineering cost: cascading adds a classification step, increases system complexity, and requires eval infrastructure for both tiers. It's justified once your monthly AI cost exceeds ~$5,000 and you have a large, heterogeneous query distribution. Below that threshold, optimize prompt length first.
Real latency targets
Latency is the third corner of the cost/quality triangle. Here are real-world P50/P95 targets for common patterns in 2026:
| Pattern | P50 target | P95 target | Notes |
|---|---|---|---|
| Single-call chat response (GPT-4o, ~300 output tokens, streaming) | 1.2s to first token, 3-4s complete | 2s to first token, 8s complete | Streaming first-token latency is what users feel |
| RAG pipeline (retrieval + generation) | 2-3s to first token | 4-6s | Retrieval adds ~500ms-1s depending on vector store |
| Single-step agent tool call | 5-8s | 15s | Each additional tool call adds ~2-4s |
| 5-step agent task | 15-25s | 45s | Push to async UX at this range |
| Document analysis (20-page PDF, single call) | 8-15s | 25s | Acceptable for async; too slow for interactive |
First-token latency is what users perceive as "response time" in streaming interfaces. The total generation time matters for how long users wait overall, but the first-token latency determines whether the interface feels responsive. A response that starts streaming in 0.8 seconds and takes 6 seconds to complete feels faster than one that starts streaming in 3 seconds and completes in 4 seconds.
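Checking a feature against targets like these means computing percentiles from measured latencies, not averages. This is a minimal sketch using the nearest-rank method. The sample values are hypothetical first-token times in seconds, and the 1.2s / 2s targets come from the single-call chat row of the table.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(samples: list[float], p50_target: float,
                  p95_target: float) -> dict[str, bool]:
    return {
        "p50": percentile(samples, 50) <= p50_target,
        "p95": percentile(samples, 95) <= p95_target,
    }


# Hypothetical first-token latencies (seconds) from ten logged requests.
first_token_s = [0.8, 0.9, 1.0, 1.1, 1.1, 1.2, 1.3, 1.4, 1.9, 3.2]

result = meets_targets(first_token_s, p50_target=1.2, p95_target=2.0)
print(percentile(first_token_s, 50), percentile(first_token_s, 95), result)
```

In this sample the median is within target but the tail is not, which is the usual shape of a latency problem: one slow request in ten is invisible in the average and obvious to the user who hits it. Ten samples is far too few for a real P95; use logged production traffic.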
The latency-cost-quality tradeoff. Smaller models are cheaper AND faster. The decision is usually: does the quality loss from a cheaper, faster model matter for this use case? Test empirically against your golden set rather than assuming quality requires the expensive model.
Building a real cost model
Before you ship an AI feature, build this table:
| Scenario | Monthly query volume | Avg input tokens | Avg output tokens | Model | Cost/query | Monthly cost |
|---|---|---|---|---|---|---|
| Current (MVP) | 10,000 | 700 | 300 | GPT-4o | $0.00475 | $47.50 |
| 10x growth | 100,000 | 700 | 300 | GPT-4o | $0.00475 | $475 |
| 100x growth | 1,000,000 | 700 | 300 | GPT-4o | $0.00475 | $4,750 |
| 100x with cascade | 1,000,000 | varies | varies | 70% mini / 30% GPT-4o | ~$0.002 | ~$2,000 |
Add: embedding costs (if RAG), reranker costs, external tool API costs, your vector DB hosting cost. AI features often have a stack of costs beyond the LLM call itself.
The business model check: at your target scale, does the AI feature cost fit within your unit economics? If you're charging ₹499/month for a product and the AI feature costs ₹120/user/month at 10x growth, you have a margin problem that price increases or cascade architecture must solve before you hit that scale.
AI inference cost is a unit economics problem, not an infrastructure problem. If the AI feature cost at P90 usage does not fit within your pricing, no amount of engineering optimization fixes a broken business model. Price the tier before you ship the feature.
We launched our AI summary feature at ₹99/month add-on. It felt like margin at 2,000 users. At 20,000 users we were losing money on every AI user — the feature was being used 40+ times per month by power users who were essentially running document analysis sessions. We hadn't modeled the power-user distribution. We capped usage at 20 summaries/month in an emergency patch, got a wave of complaints, and spent two months redesigning the pricing. Build your cost model at P90 usage, not average usage. The power users are the ones who determine whether your economics work.
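The difference between average and P90 usage is easy to show with a small model. All numbers below are hypothetical: the ₹99 price matches the story above, but the ₹3 cost per summary and the usage sample are invented for illustration, and the function names are not from any library.

```python
import math
from statistics import mean


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Smallest value with at least pct% of the sample at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def margin_per_user(price: float, uses_per_month: float, cost_per_use: float) -> float:
    """Monthly price minus inference cost for one user at a given usage level."""
    return price - uses_per_month * cost_per_use


PRICE = 99.0            # rupees per month for the add-on
COST_PER_SUMMARY = 3.0  # rupees per summary (hypothetical)

# Hypothetical summaries per month for ten users, including two power users.
monthly_uses = [2, 3, 4, 5, 6, 8, 10, 15, 42, 60]

avg_uses = mean(monthly_uses)
p90_uses = nearest_rank_percentile(monthly_uses, 90)

print(f"margin at average usage ({avg_uses}): {margin_per_user(PRICE, avg_uses, COST_PER_SUMMARY):+.1f}")
print(f"margin at P90 usage ({p90_uses}):     {margin_per_user(PRICE, p90_uses, COST_PER_SUMMARY):+.1f}")
print(f"break-even uses per month:        {PRICE / COST_PER_SUMMARY:.0f}")
```

With this sample the add-on looks profitable at average usage and loses money at P90. The break-even line is the natural input to a usage cap or a metered tier, ideally set before launch instead of in an emergency patch.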
What to do this week
- Run the cost model for one AI feature. Measure your actual average input and output tokens from logs or test runs. Apply current pricing. Project at 10x and 100x current volume. Write down the number.
- Check whether you are getting any benefit from prompt caching. If you have system prompts over 1,024 tokens and you're running on Anthropic or OpenAI, caching is likely available. OpenAI applies it automatically to repeated prompt prefixes; Anthropic requires marking the cacheable blocks, which is usually a small change.
- Run a quality comparison: mini vs. full model. Take 20 examples from your golden set. Run both models. Score quality on your dimensions. If the quality delta is < 5%, you have a case for switching or cascading.
Where to go next
- Build / Buy / Wrap — cost is one input to the make/buy decision
- Financial Modeling — build the cost-per-query model into your unit economics
- Eval Design — the quality benchmarking you need before switching model tiers