The 2026 Model Landscape — the pm manual

The frontier is converging. The interesting work in 2026 isn't picking a model — it's picking what to commoditize and what to keep proprietary. Everything else is a vendor preference.

Talvinder Singh

This chapter has the shortest half-life in the manual. By design. The other eleven chapters are about judgment — judgment ages slowly. This one is about vendors, and vendors age in quarters. If you are reading this in late 2026 or 2027, treat the names below as period detail and the framing as the part that still applies.

The framing is the point. The names are scaffolding.

The state of the frontier, May 2026

Six labs matter at the frontier today: three closed-weights US labs, one closed-weights wildcard, one Chinese lab that the US labs publicly pretend not to benchmark against, and one open-weights ecosystem that has gotten close enough to matter for most jobs. Here is the honest read on each.

Anthropic — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5. The reasoning and long-horizon agent leader as of this writing. Opus 4.7 is the model I reach for when a long agent run has to finish, when the spec is ambiguous, or when the cost of a wrong answer is high enough to pay the premium. Sonnet 4.6 is the workhorse — most of the production code in this manual's own repo runs on Sonnet because the price-to-quality curve sits on a sweet spot nobody else has matched on coding and tool use. Haiku 4.5 is the cheap rung that still feels like a Claude. The lag: native multimodal is weaker than Google's, and Anthropic has no consumer surface as sticky as ChatGPT.

OpenAI — GPT-5 family. Still the distribution leader and still the model your non-technical stakeholders mean when they say "AI." GPT-5 is strong at general reasoning and at the long tail of niche knowledge where the larger training corpus shows up. The mini variants are aggressively priced and have eaten the "summarize, classify, rewrite" market. The lag: tool-use reliability over long horizons is a real second to Anthropic, and the churn-and-rebrand of model names (GPT-5, mini, pro, thinking) makes routing brittle for teams that pin to specific SKUs.

Google — Gemini 2.5 Pro and Flash. The multimodal leader. If your product takes images, video, or long PDFs as primary input, you owe yourself a Gemini eval. The two-million-token context window is no longer the headline differentiator — Anthropic and OpenAI have closed most of the gap — but Gemini's use of long context (retrieval inside the window, cross-document reasoning) is still the strongest. Flash is the latency leader at the cheap tier. The lag: developer ergonomics are rougher; the API surface keeps moving; the consumer brand is permanently behind ChatGPT.

xAI — Grok 4. The wildcard. Strong at code, strong at math, surprisingly capable on agentic benchmarks. Real engineering work under the marketing noise. The lag: enterprise trust. Most CISOs will not let Grok into a regulated workflow regardless of capability — the brand and data-handling story do not survive a procurement review. If you are a regulated B2B product, this is a hard ceiling.

DeepSeek-V3 and the China cost curve. The story most Western AI coverage underplays. DeepSeek-V3, released in December 2024 at a publicly claimed training cost of roughly $5.6 million (a figure that covers GPU time for the final training run only, not prior research or ablations), matched or beat GPT-4 on most public benchmarks (DeepSeek's own paper; replicated by third parties). The 2025 follow-ups closed the gap further. The point is not that DeepSeek replaces Anthropic in your stack tomorrow. The point is that the assumed cost curve at the frontier has been falsified. Frontier-adjacent capability is no longer a billion-dollar artifact. That changes the calculus for everyone downstream.

Open weights — Llama 4, Mistral, Qwen. The gap has closed to the point that, for most jobs that used to need GPT-4, an open-weight model running on your own infrastructure now clears the bar. Llama 4 (Meta, 2025) and Mistral's mid-2025 MoE releases are good enough that the question stopped being "is open-weights capable" and became "is it worth the operational drag for your specific job." For most teams the answer is still no. For some, it is the obvious yes.

The pattern across the six: the frontier is converging on most tasks, the price floor has collapsed, and the differentiation is moving from raw capability to ergonomics, trust, and integration surface.

What's already commoditized

If you are building on these capabilities, you are renting a commodity. Treat them as such. Multi-source, eval-on-swap, optimize for price.

Chat. A reasonable chat experience is now a hundred-line wrapper. The frontier here moved years ago.

Single-turn classification and extraction. "Is this email a complaint or a request?" "Pull the line items from this invoice." Every lab does this competently. The smallest model that clears your eval wins. (Chapter 2, The Model-Selection Ladder, is the playbook.)
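The "smallest model that clears your eval wins" rule can be sketched as a walk up a price-ordered ladder. This is a minimal illustration, not the Chapter 2 playbook itself: the model names, prices, and scores are hypothetical placeholders, and `score` stands in for a run of your private eval set.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rung:
    name: str                      # hypothetical model identifier
    usd_per_million_tokens: float  # hypothetical blended price


def cheapest_passing(
    ladder: Sequence[Rung],
    score: Callable[[str], float],
    bar: float,
) -> Optional[Rung]:
    """Return the cheapest rung whose eval score meets the bar, else None."""
    for rung in sorted(ladder, key=lambda r: r.usd_per_million_tokens):
        if score(rung.name) >= bar:
            return rung
    return None


# Placeholder scores; in practice `score` would run the private eval set.
fake_scores = {"mini": 0.81, "mid": 0.93, "frontier": 0.97}
ladder = [
    Rung("frontier", 15.0),
    Rung("mini", 0.5),
    Rung("mid", 3.0),
]

choice = cheapest_passing(ladder, fake_scores.__getitem__, bar=0.90)
print(choice.name if choice else "no model clears the bar")
```

The design choice worth noting is that the bar is fixed before the models are scored, so price decides among the models that pass rather than quality deciding among the models you can afford.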

Summarization, rewriting, translation. Genuinely commoditized. Mini-tier models clear the bar on most content; the frontier premium only shows up on long, technical, or stylistic edge cases.

Basic code completion. Tab-complete-the-next-line is solved. The differentiation has moved upstack to multi-file edits, repo-aware refactors, and the IDE integration itself.

Single-turn tool use. "Look up the weather. Reply with the answer." Every frontier model does this. The interesting tool-use work has moved to long chains and agent loops, which is a different game (see Chapter 6, Tool Use, Function Calling, Agents — The Maturity Ladder).

If your AI product's core demo is one of the above, the foundation labs will absorb you in a release cycle. The only reasons to be in those markets are workflow integration, proprietary data, or distribution. Read Chapter 11 (Building With AI vs. Building AI Products) before you raise another round on those positioning slides.

What's not (yet) commoditized

These are where the strategic action is, and where you should be willing to pay frontier prices on frontier providers.

Long-horizon reliable agents. Twenty-, fifty-, hundred-step agent runs that have to actually finish without a human grabbing the steering wheel. The capability gap between the best (Claude Opus 4.7, GPT-5-thinking on a good day) and the mini-tier models is still wide enough here that you cannot route a real agent loop to a commodity model and survive. Chapter 6's budget-and-blast-radius framing is the lens.
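One way to see why small per-step differences decide long runs: if each step succeeds independently with probability p, an n-step run finishes with probability p^n. This is a simplifying assumption (real failures are correlated and good agents recover from some of them), and the per-step rates below are illustrative, not measured figures for any model.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance an n-step run finishes, assuming independent steps and no recovery."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


for p in (0.99, 0.95):  # illustrative per-step success rates
    for n in (20, 50, 100):
        finish = run_success_probability(p, n)
        print(f"per-step={p:.2f} steps={n:3d} finish={finish:.1%}")
```

Under these assumptions, a 99% per-step rate finishes a hundred-step run only a little over a third of the time, which is why a gap that looks small on single-turn benchmarks becomes decisive in an agent loop.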

Multimodal native reasoning. Reasoning over video, audio, and image jointly with text — not just transcribing or captioning, but treating the modalities as one input the model thinks across. Gemini 2.5 Pro is the current leader; Claude and GPT are catching up but not there yet. If your product's value is in seeing, the lab choice matters.

Frontier coding. Multi-file edits, repo-aware reasoning, large refactors that have to compile and pass tests. Sonnet 4.6 is the current taste leader for most teams; GPT-5 and Gemini 2.5 Pro are real alternatives; nobody at the mini tier clears the bar.

Eval-resistance. The capability that nobody talks about: models that perform well on your private eval, not just the public ones. Public benchmarks are now training targets. Your real differentiation is the eval set the lab has not seen. (Chapter 4, Eval Before Launch, is the chapter that survives this whole landscape unchanged.)

The Bezos "what won't change" frame, applied to AI

Jeff Bezos has long argued that strategy is more durable when it is built on what will not change than on what will. Customers will always want lower prices, faster delivery, and more selection, so build for those, not for whichever competitor is hot this quarter.

The same frame, applied to AI products, gives you a near-complete strategy spine:

Users will always want lower price per inference. Bet on this. Architect for it. Today's frontier prices are tomorrow's mid-tier prices.
Users will always want lower latency. A three-second response is not a feature; it is a churn driver (Chapter 9, Cost & Latency as First-Class Product Constraints). Design for the latency curve to keep dropping.
Users will always want fewer hallucinations and more verifiability. Build grounding, citations, and audit trails into the product surface (Chapter 5, Hallucination as a Product Problem). These investments compound.
Users will always want AI that respects their data. Default to data minimization, regional residency, and clear retention policies, regardless of what the labs currently allow (Chapter 10, Safety, Privacy, Compliance for Shipping Teams).

Notice what is not on the list: "users want the most advanced model." Nobody outside our industry cares which model is in the dropdown. They care about the answer being right, fast, cheap, and safe.

Platform risk: when to bet hard, when to hedge

Building deeply on one provider is a strategic choice with a cost. The choice is defensible. The pretense that it has no cost is not.

The cost of a hard bet: when the provider raises prices, deprecates a model, ships a policy change that cuts against your use case, or has an outage during your peak hour, you have no second move. The cost is real and you have to be honest about it in your strategy doc.

The cost of a hedge: every layer of abstraction is engineering you are doing instead of building features. Multi-provider routing is not free; eval-on-swap is not free; the lowest-common-denominator API surface costs you the provider-specific features that may have been the reason you picked that provider in the first place. A team that hedges by default ships less.

Three patterns are worth knowing.

Abstraction layer. A thin internal interface (our_llm_call(prompt, model_class)) with provider adapters underneath. Cost: a week of engineering plus ongoing maintenance. Benefit: swapping a provider is a config change, not a refactor. Do this if you can afford one engineer-week to buy a future option. Most teams should.

Eval-on-swap. Your private eval set (Chapter 4, Eval Before Launch) runs on every candidate provider before any swap. Cost: building and maintaining the eval. Benefit: you can compare providers on numbers, not on a vibes argument. Do this regardless of whether you hedge. The eval is the asset.
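A sketch of what an eval-on-swap run might look like, assuming each eval case pairs a prompt with a grader that accepts or rejects the output. The two cases and both stub "providers" are toy placeholders; a real eval set would be far larger and the callables would hit actual provider adapters.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]
EvalCase = Tuple[str, Callable[[str], bool]]  # (prompt, grader for the output)


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases whose output the grader accepts."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, grade in cases if grade(model(prompt)))
    return passed / len(cases)


def compare(candidates: Dict[str, Model], cases: List[EvalCase]) -> Dict[str, float]:
    """Score every candidate on the same eval set."""
    return {name: pass_rate(model, cases) for name, model in candidates.items()}


# Toy eval set and stub providers so the sketch runs offline.
cases: List[EvalCase] = [
    ("Classify: 'My order never arrived.'", lambda out: out == "complaint"),
    ("Classify: 'Can I change my address?'", lambda out: out == "request"),
]
candidates: Dict[str, Model] = {
    "incumbent": lambda p: "complaint" if "never" in p else "request",
    "challenger": lambda p: "complaint",
}

results = compare(candidates, cases)
for name, rate in results.items():
    print(f"{name}: {rate:.0%}")
```

Because every candidate runs against the same cases and graders, the swap decision reduces to comparing two numbers against a threshold you set in advance.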

Multi-provider routing in production. Live traffic split across providers based on task, latency, or cost. Cost: significantly more operational complexity, on-call coverage for every provider's distinct failure modes, and a UX that has to be valid across the worst-case output of every provider. Benefit: real resilience and real price arbitrage. Do this only if your scale or your reliability requirement justifies it. Most early-stage products should not. (Chapter 9 has the cost math.)
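Routing by task, latency, and cost can be sketched as a filter followed by a cheapest-first pick. This is one possible policy among many, and every provider name, price, and latency figure below is a hypothetical placeholder rather than a real quote.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class Provider:
    name: str
    tasks: FrozenSet[str]   # task types this provider has passed the eval for
    usd_per_call: float     # hypothetical average cost
    p95_latency_s: float    # hypothetical 95th-percentile latency


def route(providers: Sequence[Provider], task: str, max_latency_s: float) -> Provider:
    """Pick the cheapest provider that handles the task within the latency budget."""
    eligible = [
        p for p in providers
        if task in p.tasks and p.p95_latency_s <= max_latency_s
    ]
    if not eligible:
        raise LookupError(f"no provider can serve {task!r} within {max_latency_s}s")
    return min(eligible, key=lambda p: p.usd_per_call)


PROVIDERS = [
    Provider("lab_a", frozenset({"agent", "classify"}), usd_per_call=0.040, p95_latency_s=4.0),
    Provider("lab_b", frozenset({"classify", "summarize"}), usd_per_call=0.002, p95_latency_s=0.8),
    Provider("self_hosted", frozenset({"classify"}), usd_per_call=0.001, p95_latency_s=2.5),
]

print(route(PROVIDERS, "classify", max_latency_s=1.0).name)
print(route(PROVIDERS, "classify", max_latency_s=3.0).name)
```

Even this toy version shows where the operational complexity comes from: the `tasks` sets only stay trustworthy if the eval is rerun per provider, and the latency and cost figures have to be kept current.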

The default I recommend in 2026: pick one frontier provider for primary, one open-weights or alternative as the eval-validated fallback, build the abstraction layer, run the eval on every model release. Hedge cheaply by default; bet hard only when the provider-specific feature is the moat.
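The primary-plus-fallback default might look like the sketch below, assuming each provider is wrapped as a callable that raises a shared error type on failure. `ProviderError`, `call_with_fallback`, and both stub providers are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Raised by an adapter when its provider fails (timeout, outage, rate limit)."""


def call_with_fallback(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try providers in priority order; return the first successful response."""
    last_error: Optional[ProviderError] = None
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError as exc:
            last_error = exc  # remember the failure and try the next provider
    raise ProviderError("all providers failed") from last_error


# Stubs so the sketch runs offline: the primary is "down", the fallback answers.
def primary(prompt: str) -> str:
    raise ProviderError("primary outage")


def fallback(prompt: str) -> str:
    return f"[fallback] {prompt}"


print(call_with_fallback("Draft a reply.", [primary, fallback]))
```

The fallback only earns its place in the list after it has cleared the same private eval as the primary; otherwise an outage silently becomes a quality regression.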

Three worked examples

A startup that bet hard and won — Cursor. Cursor bet early and deeply on Anthropic for their primary completion and chat surfaces, layered proprietary models on top for specific in-IDE jobs (the "Tab" predictive cursor, fast-apply edits), and made no apology for the lock-in. The bet paid: by 2025 they were the fastest-growing developer tool of the decade, and the depth of Claude integration was a feature developers chose them for, not a liability. The lesson: when the provider-specific capability is the product, hedging is the wrong answer. You are buying focus.

A startup that hedged and won — Notion. Notion AI runs over multiple providers under the hood, routes by task and price, and is allergic to any single-provider dependency the PM cannot explain in a quarterly review. The hedge has let them ride the cost curve down aggressively while keeping the product surface stable. The lesson: when AI is a feature in a product whose value lives elsewhere, the right posture is commodity sourcing and eval-on-swap discipline. The user does not care which lab is in the dropdown.

A startup that hedged and lost focus — name withheld. A Series-B B2B SaaS company I advised in 2025 went multi-provider on day one. Three labs in production, an abstraction layer consuming two engineers full-time, and an eval suite that had drifted into a benchmark of the abstraction layer rather than of the user job. They shipped less than half the AI features their single-provider competitors did, and lost the category to a focused incumbent. The lesson: hedging is a real cost. If you cannot articulate the threat model that justifies it, you are paying complexity tax for a future you may not need.

The diagnostic is honest. Read your own roadmap. If the multi-provider architecture is shipping value to users this quarter, it is a moat. If it is engineering that exists for its own sake, it is a tax.

The full-stack reading order

This is the closing chapter. The decisions it touches require every chapter that came before:

Chapter 1, When AI Is the Right Answer (and When It Isn't): before you pick a model, decide if the problem wants one.
Chapter 2, The Model-Selection Ladder: start at the smallest model that clears the bar; the landscape only matters once you know the rung.
Chapter 3, Prompt Design as Product Design: a prompt that works on Sonnet works less well on a Llama; the prompt is part of the bet.
Chapter 4, Eval Before Launch: the asset that makes every claim in this chapter testable for your product.
Chapter 5, Hallucination as a Product Problem: hallucination is a permanent property; provider choice barely moves it.
Chapter 6, Tool Use, Function Calling, Agents — The Maturity Ladder: the rung where provider choice matters most today.
Chapter 7, RAG, Fine-Tune, or Context Window?: data strategy, which outlives any model.
Chapter 8, AI UX Patterns That Work: the surface where commoditized inference becomes a differentiated product.
Chapter 9, Cost & Latency as First-Class Product Constraints: the economic frame for every decision in this chapter.
Chapter 10, Safety, Privacy, Compliance for Shipping Teams: the regulatory frame that constrains which providers you can use.
Chapter 11, Building With AI vs. Building AI Products: the staffing-and-business-model frame that determines whether platform risk is existential or annoying.

Everything in those eleven chapters survives a model release. This chapter does not. That is the deal.

Where to go next

This is the closing chapter. There is no "next" page in the AI manual — you have the full spine.

Take a single decision on your current roadmap and walk it through the twelve chapters in order. If you can answer all twelve — from is-it-an-AI-problem through platform-risk — you have done the work.

Companion reading from the rest of the manual:

Product Vision & Strategy — the strategic substrate AI bets sit on.
Idea to Launch Process — the loop AI features ship through.
Talvinder's Observations — where the landscape gets updated faster than this chapter does.