You don't need to understand the calculus to fly a plane. But you do need to know that lift comes from airspeed and wing shape — or you'll make decisions that kill you.
After this page, you'll understand:
- What a transformer actually does — in terms a PM can act on
- Why context windows are a product constraint, not an engineering detail
- The difference between pre-training, fine-tuning, RLHF, and LoRA — and which one is your decision
- Where LLMs fail, and why those failure modes are stable enough to design around
There are two ways to be ignorant about how LLMs work. The first is to know nothing and treat the model as pure magic. The second — more dangerous in 2026 — is to have absorbed just enough transformer vocabulary to feel informed while missing the two or three things that actually govern product decisions.
This page is the depth layer under ai/ai-fundamentals. It assumes you have already read that page. It will not make you an ML engineer. It will make you capable of having a productive argument with one.
What a transformer actually does
An LLM is a function that takes a sequence of tokens and predicts what token comes next. That is the entire mechanism. Everything else (the apparent reasoning, the code generation, the long-form essays) is the emergent result of training that function on a very large share of the text humans have published, until the prediction is very, very good.
A token is roughly 3/4 of an English word. "Product management" is two or three tokens, depending on the tokenizer. A 128k-context model can hold roughly 95,000 words in working memory, about a 350-page book.
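The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate only: the 0.75 ratio is the rule of thumb from the text, and the words-per-page figure is an assumption chosen for illustration, not a property of any tokenizer.

```python
# Rule of thumb from the text: one token is roughly 3/4 of an English word.
WORDS_PER_TOKEN = 0.75  # assumption: English prose; code and other languages differ
WORDS_PER_PAGE = 275    # assumption: a typical printed book page


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_pages(words: int) -> int:
    """Estimate how many book pages a word count corresponds to."""
    return round(words / WORDS_PER_PAGE)


context_tokens = 128_000
words = tokens_to_words(context_tokens)
print(f"{context_tokens} tokens is about {words} words, or {words_to_pages(words)} pages")
```

For a real budget, count tokens with the tokenizer your provider ships rather than relying on the ratio.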
The part of the transformer that matters most to PMs is attention. Attention is the mechanism that lets the model relate any word in the input to any other word, regardless of distance. Before transformers, models read text linearly — like a human reading a book and forgetting the beginning by the time they reach the end. Attention lets the model hold the entire context simultaneously and weight which parts matter for predicting the next token.
This is why context windows matter so much: in a standard transformer, the attention computation scales quadratically with context length. Doubling the context roughly quadruples the cost of the attention step, and at long contexts that step becomes a major share of total compute. This is a large part of why model providers charge more for their 1M-context variants.
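To make the quadratic relationship concrete, here is a minimal sketch. It models only the attention step under pure quadratic scaling; the rest of the model scales closer to linearly, and providers apply optimisations, so treat the output as an upper-bound intuition rather than a pricing formula.

```python
def relative_attention_cost(context_tokens: int, baseline_tokens: int) -> float:
    """Cost of the attention step relative to a baseline context length.

    Assumes pure O(n^2) scaling, which ignores provider-side optimisations.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (context_tokens / baseline_tokens) ** 2


# Doubling the context quadruples the attention cost.
print(relative_attention_cost(256_000, 128_000))

# Going from 125k to 1M tokens is 8x the length but 64x the attention cost.
print(relative_attention_cost(1_000_000, 125_000))
```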
The practical product implication: Long context is not free, and it is not magic. A model with a 1M-context window does not "read" all of that with equal attention — it tends to attend strongly to the beginning and end of the context and to lose track of material buried in the middle. This phenomenon — called the "lost in the middle" effect — has been consistently documented across model generations. When you're designing a RAG system or a document-analysis feature, you cannot assume that feeding the model more context always improves quality.
Long context is not free and not magic. A 1M-token window does not read every token with equal attention — material buried in the middle routinely gets lost. More context is not always better context.
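One mitigation teams sometimes try in RAG pipelines is to reorder retrieved chunks so the most relevant ones sit at the start and end of the prompt, where attention tends to be strongest. The sketch below illustrates that idea only. The function name is hypothetical, it assumes your retriever already returns chunks sorted from most to least relevant, and whether it helps on your task is something your evals have to confirm.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the edges of the context.

    Input is assumed to be sorted most-relevant first. Chunks are dealt
    alternately to the front and the back, so the least relevant material
    ends up in the middle of the prompt.
    """
    front: list[str] = []
    back: list[str] = []
    for index, chunk in enumerate(chunks_by_relevance):
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-most-relevant chunk comes last.
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant chunk
print(reorder_for_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```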
The training stack: what each layer is
Most LLMs you'll work with in 2026 have been trained in three stages. Understanding the distinction matters because different stages are appropriate for different product problems.
Pre-training is where the model learns from raw text — billions of documents, code repositories, books. This is where the model's core knowledge and capabilities come from. Pre-training costs tens of millions of dollars and takes months. You do not do pre-training. You buy it. When you choose GPT-5, Claude 4.5, Gemini 2.5, or Llama 4, you are choosing a pre-trained base.
Instruction tuning and RLHF (Reinforcement Learning from Human Feedback) is where the base model is shaped to be helpful, follow instructions, and avoid harmful outputs. In RLHF, human annotators rate model outputs, and those ratings train a "reward model" that in turn guides the main model. This is the stage that turns a raw GPT model into the ChatGPT product. The model providers handle this. What it means for you: the "personality," safety boundaries, and instruction-following behavior of a model are RLHF artifacts. When a model consistently refuses certain requests or has a particular conversational style, that's RLHF.
Fine-tuning is where you take a pre-trained (and usually instruction-tuned) model and train it further on your own data. Full fine-tuning updates all the model's weights — expensive, requires ML infra. LoRA (Low-Rank Adaptation) updates a small set of adapter weights layered on top, dramatically reducing cost and making fine-tuning accessible to teams without a dedicated ML platform. Fine-tuning is the right call when:
- You have a narrow, well-defined task with thousands of labeled examples
- The task requires style, format, or vocabulary consistency the base model doesn't provide
- You want to reduce prompt length (fine-tuning can "bake in" instructions, reducing tokens and cost)
- Latency is critical and you need a smaller, faster model that punches above its class on your specific task
Fine-tuning is the wrong call when you don't have labeled data, when your task changes frequently, or when the general capability of the base model is what you actually need.
Fine-tuning is a cost-and-latency play for narrow tasks with labeled data. It is not a way to add reasoning capabilities the base model lacks, and it is not a moat — fine-tuned weights do not store your users' data in extractable form.
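The cost gap between full fine-tuning and LoRA comes from how few weights LoRA trains. The sketch below counts trainable parameters for a single weight matrix. The layer size and rank are illustrative assumptions, not figures for any specific model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for a LoRA adapter on the same matrix.

    LoRA learns two thin matrices, B (d_out x rank) and A (rank x d_in),
    and leaves the original matrix frozen.
    """
    return rank * (d_out + d_in)


# Illustrative assumption: a 4096 x 4096 projection and a rank-8 adapter.
full = full_finetune_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {adapter:,}  ratio: {adapter / full:.2%}")
```

Under these assumptions the adapter trains well under 1% of the weights in that layer, which is why LoRA fits on much smaller hardware.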
Your product is a customer support tool for a B2B SaaS company. The support team wants AI to auto-classify incoming tickets (bug / billing / feature-request / account) and draft a first response. You have 4,000 historical tickets with human classifications. An ML engineer suggests fine-tuning Llama-3.1-8B. A product engineer suggests using GPT-4o with a well-crafted system prompt. The engineering lead wants a decision by EOD.
The call: Which approach do you back, and what is the deciding factor?
Your reasoning:
Quantisation, model size, and what they mean for you
In 2026, most production AI teams are not running GPT-5 for every query. The real pattern is a cascade: a smaller, cheaper model handles the majority of requests, and the expensive model is reserved for the cases where quality genuinely requires it.
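A cascade can be sketched as a simple routing function. Everything here is hypothetical: the `ModelResult` type, the confidence score, the threshold, and the stub models standing in for real API calls. How you obtain a trustworthy confidence signal is the hard part and is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ModelResult:
    text: str
    confidence: float  # hypothetical 0-to-1 score; its source is up to you


ModelFn = Callable[[str], ModelResult]


def cascade(
    prompt: str,
    small_model: ModelFn,
    large_model: ModelFn,
    threshold: float = 0.8,
) -> Tuple[str, ModelResult]:
    """Try the cheap model first; escalate only when it is not confident."""
    first_try = small_model(prompt)
    if first_try.confidence >= threshold:
        return "small", first_try
    return "large", large_model(prompt)


# Stub models standing in for real API calls (hypothetical behaviour).
def stub_small(prompt: str) -> ModelResult:
    # Pretend the small model is only confident on short prompts.
    confidence = 0.9 if len(prompt) < 40 else 0.5
    return ModelResult(text="small-model answer", confidence=confidence)


def stub_large(prompt: str) -> ModelResult:
    return ModelResult(text="large-model answer", confidence=0.95)


print(cascade("Reset my password", stub_small, stub_large)[0])
print(cascade("Explain why invoice 1182 differs from the contract terms",
              stub_small, stub_large)[0])
```

The threshold is a product decision: lowering it saves money and raises the share of answers that come from the weaker model.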
Model size (1B, 7B, 70B, 400B parameters) governs capability ceiling and inference cost. The relationship is not linear: a 70B model does not cost 70x more than a 1B model, but it costs much more. More importantly, for many tasks a well-prompted 7B model can reach around 90% of the quality of a 70B model at around 15% of the cost. Treat those figures as a rule of thumb and measure them on your own task.
Quantisation is the process of compressing model weights from 16- or 32-bit floats to 8-bit or 4-bit integers. This reduces memory footprint and inference cost at a small quality cost. A 4-bit (Q4) quantisation of a 70B model needs roughly 35 GB for its weights, so it fits on a single 80 GB data-centre GPU such as an A100 instead of a multi-GPU server. The quality trade-off is usually acceptable for tasks that don't require top-tier reasoning.
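The memory saving is simple arithmetic: parameters times bits per weight. The sketch below covers weights only. Real deployments also need memory for the KV cache and activations, so treat these numbers as a floor, not a hardware spec.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes.

    One billion parameters at 8 bits each is one gigabyte.
    """
    return params_billions * bits_per_weight / 8


for bits in (32, 16, 8, 4):
    print(f"70B model at {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
```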
For PM purposes: model tier choice is one of your most important cost decisions. A team that defaults to GPT-5 for every query without considering whether GPT-4o-mini or Claude Haiku would serve equally well is burning cost and latency budget. See latency-cost for the full budgeting framework.
Where LLMs fail — and why it matters for product design
These failure modes are not going away. They are properties of how the current generation of models is built and trained, not defects awaiting a patch. Design around them rather than waiting for them to be fixed.
Hallucination. The model generates plausible-sounding text that is factually wrong. This is not a bug — it is the direct consequence of training a next-token predictor on pattern-matched text. The model does not "know" facts; it has learned statistical patterns of which tokens follow which. When asked about something it has no strong pattern for, it confabulates. Mitigation: retrieval-augmented generation (RAG), grounding prompts, citation requirements. See rag-architecture and safety-and-auditability.
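A citation requirement is only useful if something checks it. As a minimal sketch, assume the model is asked to return verbatim quotes alongside its answer; the hypothetical helper below flags any quote that does not appear in the retrieved sources. It catches fabricated quotes, not wrong conclusions drawn from real ones.

```python
def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())


def unsupported_quotes(quotes: list[str], sources: list[str]) -> list[str]:
    """Return the quotes that cannot be found verbatim in any source document."""
    normalised_sources = [normalise(source) for source in sources]
    missing = []
    for quote in quotes:
        needle = normalise(quote)
        if not any(needle in source for source in normalised_sources):
            missing.append(quote)
    return missing


sources = ["Refunds are available within 30 days of purchase."]
quotes = [
    "refunds are available within 30 days",  # present in the source
    "Refunds are available within 90 days",  # not present: flag it
]
print(unsupported_quotes(quotes, sources))
```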
Context sensitivity. The model's outputs change based on prompt framing in ways that are non-obvious. A small change in wording can dramatically alter the output. This is why eval design matters: you cannot test a prompt once and declare it robust. See eval-design.
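A first-pass robustness check is to run several phrasings of the same request and measure how often the outputs agree. The sketch below uses a stub in place of a real model call; the function names and the example prompts are hypothetical.

```python
from collections import Counter
from typing import Callable


def agreement_rate(model: Callable[[str], str], prompt_variants: list[str]) -> float:
    """Share of prompt variants that produce the most common output."""
    if not prompt_variants:
        raise ValueError("need at least one prompt variant")
    outputs = [model(prompt).strip().lower() for prompt in prompt_variants]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Stub classifier standing in for a real model call (hypothetical behaviour):
# it keys on one word, so a paraphrase without that word flips the label.
def stub_classifier(prompt: str) -> str:
    return "billing" if "invoice" in prompt.lower() else "bug"


variants = [
    "Classify this ticket: my invoice is wrong",
    "Ticket: 'my invoice is wrong'. Category?",
    "What category is this? The invoice amount is wrong",
    "Classify: I was charged the wrong amount",
]
print(agreement_rate(stub_classifier, variants))  # 0.75
```

An agreement rate well below 1.0 on paraphrases that should be equivalent is a signal that the prompt is not yet robust.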
Reasoning limitations. Current LLMs, including GPT-5 and Claude 4.5, can still fail on long chains of logical reasoning, multi-step arithmetic, and tasks requiring precise symbolic manipulation. Chain-of-thought prompting (asking the model to reason step by step) improves performance substantially but does not eliminate the limitation. For tasks where reasoning errors are consequential, such as financial calculations, medical dosing, and legal interpretation, the model output must be treated as a draft, not a decision.
Knowledge cutoff. Every pre-trained model has a training data cutoff, usually several months to a year before release. The exact date for models such as GPT-5 and Claude 4.5 is listed in the provider's model documentation, so check it rather than assuming. Events, prices, personnel, and facts that changed after the cutoff are unknown to the base model. RAG is the standard solution; tool-use (web search) is another.
Sycophancy. RLHF-tuned models have a strong tendency to agree with the user and to confirm their existing beliefs. If you tell the model "I think our conversion rate dropped because of the new onboarding flow," it will often validate that hypothesis rather than challenging it. This is an RLHF artifact — annotators tended to rate agreeable responses higher. The practical implication: do not use an LLM to validate strategic decisions without explicitly prompting it to steelman the opposing view.
Hallucination, context sensitivity, reasoning limits, knowledge cutoff, and sycophancy are not bugs. They are properties of how current LLMs are built and trained. Design around them. Do not wait for them to be fixed.
We spent three months fine-tuning a model for document summarisation before we understood the lost-in-the-middle problem. Our documents were long — 80+ pages of legal text. The model would produce summaries that were accurate for the beginning and end of the document and just missed entire sections in the middle. We had to completely restructure our chunking strategy, and we discovered it by reading the evals carefully, not by understanding the architecture. We could have saved two months if someone had told us about attention attenuation before we started.
The capability trajectory — calibrating 2026 judgment
The practical capability threshold has moved significantly since 2023. A few anchors:
- Coding: GPT-5 and Claude 4.5 can complete most single-function coding tasks autonomously. Multi-file refactors are achievable but unreliable. Full agentic coding (Cursor, Devin) works for well-defined tasks and fails for ambiguous ones.
- Reasoning: Extended thinking / o3-class models have substantially improved multi-step reasoning. Arithmetic is largely solved at the level needed for business logic. Legal and scientific reasoning remain error-prone enough to require human review.
- Long context: 1M-token contexts are available and mostly work. The "lost in the middle" effect is reduced but not eliminated. Don't trust long-context models to surface a critical piece of information buried on page 350 of a PDF without explicit retrieval.
- Multimodal: Vision (image + text), audio, and document inputs are standard in the major API models. Quality is high enough for most PM use cases. The UX design challenge is larger than the capability gap.
What to do this week
- Take one AI feature you own or are speccing. Write down which stage of training matters for it: do you need pre-training capability, fine-tuning for format, or RLHF-style instruction following? Be specific.
- Find your context window cost. Check the pricing page for the model you're using. Calculate the cost per 1,000 queries at your average prompt length. Then calculate it at 4x your current context. That's your scaling ceiling.
- Test for hallucination on your specific task. Design five prompts that probe the factual accuracy of your use case. Run them. Note what the model gets wrong and whether those errors are recoverable by your UX or policy.
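The context window cost exercise above can be sketched as a small calculation. The prices and token counts below are hypothetical placeholders; substitute the numbers from your provider's pricing page and your own logs.

```python
def cost_per_1k_queries(
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of 1,000 queries at a given prompt and output length."""
    per_query = (
        prompt_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return per_query * 1_000


# Hypothetical figures for illustration only.
INPUT_PRICE = 3.00    # dollars per million input tokens (assumption)
OUTPUT_PRICE = 15.00  # dollars per million output tokens (assumption)
AVG_PROMPT = 2_000    # average prompt length in tokens (assumption)
AVG_OUTPUT = 300      # average response length in tokens (assumption)

today = cost_per_1k_queries(AVG_PROMPT, AVG_OUTPUT, INPUT_PRICE, OUTPUT_PRICE)
at_4x = cost_per_1k_queries(4 * AVG_PROMPT, AVG_OUTPUT, INPUT_PRICE, OUTPUT_PRICE)
print(f"today: ${today:.2f} per 1k queries, at 4x context: ${at_4x:.2f}")
```

Note that only the input side grows with context, so the total rises by less than 4x when output length stays fixed.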
Where to go next
- Eval Design — how to systematically test whether your AI feature works
- RAG Architecture — the standard architecture for grounding model outputs in facts
- Build / Buy / Wrap — the decision framework for choosing your capability layer
- Latency and Cost — token economics and model cascading