Hallucination as a Product Problem — the pm manual

Hallucination is not a bug to be fixed — it is a product surface to be designed. The lab will not save you. The UI will.

Talvinder Singh

Every team I have worked with in the last two years has, at some point, said the same sentence in a roadmap meeting: "When the next model comes out, hallucinations will be fixed." It is the AI-era version of "we'll fix that bug in the next release." It is wrong, and it is expensive.

Hallucination is not a bug. It is the way large language models compute. The job of the PM is not to wait it out. It is to build a product that is honest about that substrate — and that places the model only where the cost of a wrong answer is something the product, the user, and the company can absorb.

Chapter 1 (When AI Is the Right Answer (and When It Isn't)) asked the gating question. Chapter 4 (Eval Before Launch) asks the measurement question. This chapter asks the design question that sits between them: given that the model will sometimes be wrong, where in the product are you placing it, what does the user see when it is wrong, and what does it cost you when the regulator calls?

Why models hallucinate — the lossy-compression view

The clearest mental model for hallucination is this: an LLM is a lossy compression of the training corpus, and generation is decompression with gaps filled in. The model is not retrieving facts. It is running a sophisticated autocomplete that predicts the next likely token based on patterns. Most of the time the autocomplete lands on something true, because the truth is statistically common in the training data. Sometimes it lands on something plausible and false — because plausible-and-false is also statistically common. The model has no internal signal that distinguishes the two.

The model is not lying. Lying requires intent. The model has a probability distribution. The output looks like lying because we, the users, carry a deep assumption, built over a hundred thousand years of talking to other humans, that confidently stated sentences come from an agent that knows what it is saying. Models are not humans. They produce text shaped like confident human speech regardless of whether the underlying claim is verifiable.

Three things follow once you accept this view.

First, the rate of hallucination is not zero and will never be zero on any open-ended task. Frontier models in 2026 hallucinate less than GPT-3.5-era models did, but they still hallucinate, sometimes more subtly and in ways users are less likely to catch. Subtler is worse for trust, not better.

Second, you cannot prompt your way out of the problem. "Do not make up information" lowers the rate. It does not eliminate it. The model has no internal mechanism to obey the instruction perfectly.

Third, the durable answers are architectural and design-level, not training-level. You ground the model in retrieved facts. You design the UI to expose sources. You let the model abstain. You put a human where the cost of being wrong is high. None of these are features the lab will ship to you.

The first decision: where do you let the model be wrong?

Every AI feature has an implicit answer to one question, and most teams have never made it explicit: where in this product are we comfortable with the model being wrong?

There are surfaces where wrong-and-recovered is fine. A Copilot suggestion the developer rejects with Escape. A draft Gmail reply the user edits before sending. A meeting-summary first pass the user cleans up. The output is provisional, the human is on the next click, and the cost of a wrong answer is a small amount of friction.

There are surfaces where wrong is catastrophic. A medical triage answer. A tax calculation in a GST tool. A legal interpretation in a customer-service chat. A flight refund policy quoted to a grieving customer (we will come back to that one). The user takes action on the model's output, often immediately, and the cost is money, health, or a lawsuit.

The PM's first job, before any prompt is written, is to put every AI surface in the product on this spectrum. Three questions do the work:

What action does the user take based on the output? "Click Send," "press File," "approve transaction" → high-trust. "Edit," "regenerate," "ignore" → low-trust.

Who pays when the answer is wrong? The user (bad advice, lost money), the company (refund, support load, legal cost), or a third party (regulator, downstream customer). The party that pays is the party you must design for.

How fast can the wrong answer be detected and undone? Seconds and one click → loose design is okay. Weeks later in a tax notice or a courtroom → airtight or do not ship.
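The three questions can be written down as a small triage function, which forces the team to make the implicit answer explicit. This is a minimal sketch, not a standard: the action vocabularies, the 60-second undo threshold, and the names Surface, Tier, and classify are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOW = "low-trust"
    MEDIUM = "medium"
    HIGH = "high-trust"


# Hypothetical vocabularies; replace with the actions in your own product.
COMMITTING_ACTIONS = {"send", "file", "approve"}      # user acts on the output directly
REVISABLE_ACTIONS = {"edit", "regenerate", "ignore"}  # output is provisional


@dataclass
class Surface:
    name: str
    user_action: str        # what the user does with the output
    payer: str              # "user", "company", or "third_party"
    seconds_to_undo: float  # rough time to detect and reverse a wrong answer


def classify(surface: Surface, fast_undo_seconds: float = 60.0) -> Tier:
    """Place a surface on the trust spectrum using the three questions."""
    committing = surface.user_action in COMMITTING_ACTIONS
    slow_undo = surface.seconds_to_undo > fast_undo_seconds
    if committing and slow_undo:
        return Tier.HIGH
    if committing or slow_undo or surface.payer == "third_party":
        return Tier.MEDIUM
    return Tier.LOW


month = 60 * 60 * 24 * 30
print(classify(Surface("code suggestion", "edit", "user", 2)))
print(classify(Surface("GST filing", "file", "third_party", month)))
```

The thresholds matter less than the exercise. If two people on the team classify the same surface differently, that disagreement is the finding.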

If a surface that should be high-trust has been built like it sits in the low-trust column, you do not have an AI feature. You have a liability dressed as a roadmap item.

The Air Canada precedent — legal liability for a hallucination

In November 2022, Jake Moffatt visited Air Canada's website after a family bereavement and asked the airline's chatbot about bereavement fares. The chatbot told him he could book at the regular fare and claim the bereavement discount as a refund within 90 days. He booked. When he submitted the refund request, Air Canada refused — pointing at the actual policy, published elsewhere on its own site, which did not allow retroactive bereavement claims.

Moffatt took the airline to the British Columbia Civil Resolution Tribunal. Air Canada argued, in writing, that the chatbot was a separate legal entity responsible for its own actions. The tribunal disagreed. In February 2024, it ruled that Air Canada was liable for what its chatbot said to its customers and ordered the airline to pay roughly CAD 812 in damages, interest, and tribunal fees. The dollar amount is small. The precedent is not.

Three things in the ruling matter to PMs.

The tribunal explicitly rejected the "separate entity" argument. The chatbot's words were Air Canada's words. There is no legal firewall between the model and the company that deployed it.

The tribunal noted that the correct policy was on the website, but a customer talking to the chatbot had no reason to know they should also go read the policy page. The chatbot's confidence was itself a representation. The user's reliance on it was reasonable.

The ruling has been cited in regulatory and consumer-protection discussions in the US, UK, and EU as evidence that AI-generated customer communication carries the same legal weight as human communication. India does not yet have a Moffatt-equivalent decision, but its statutes already point the same way: the Consumer Protection Act 2019 treats false or misleading representations as an unfair trade practice, and the DPDP Act 2023 makes a company responsible for personal-data processing carried out on its behalf. The direction of travel is clear.

The practical lesson is not "do not ship chatbots." It is: when you let a model speak on behalf of the company, the company is on the hook for what it says. That changes the spec, the grounding requirements, the topic guards, and who reviews output before it ships.

The confidence theatre anti-pattern

The most damaging UI pattern in AI products today is the confidence pill: a badge next to a model output that says "98% confident" or "verified by AI" or simply renders a green checkmark. None of these represents anything the model actually knows about itself. Token-level log-probabilities measure how likely the wording is, not whether the claim is true, and they are not calibrated to factual accuracy. A model can sound just as confident in a fabricated sentence as in a true one.

The cost of confidence theatre is asymmetric. When the model is right, the badge is redundant. When the model is wrong, the badge actively misleads the user into trusting the wrong answer. The badge makes the worst case worse and the best case neutral. Over the lifetime of the feature, that is a strictly bad trade.

If you cannot point at a calibrated signal — a score measured against ground truth on your eval set (Eval Before Launch), with documented precision-recall behaviour — do not put a confidence pill in the UI. The honest pattern is the opposite: surface uncertainty by exposing the source the model used, so the user can verify for themselves rather than trust a number you cannot defend.
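What "calibrated" means can be checked with a few lines of arithmetic: bucket the eval set by the confidence you would display, then compare each bucket's stated confidence with its observed accuracy. The sketch below assumes you already have a score per answer and a ground-truth label per answer; the function names and the choice of five bins are illustrative, not a prescribed method.

```python
def calibration_table(scores, correct, n_bins=5):
    """Bucket predictions by stated confidence and compare with observed accuracy.

    scores:  confidence values in [0, 1] that the UI would show
    correct: booleans from ground-truth labels on the eval set
    Returns rows of (bin_low, bin_high, mean_confidence, accuracy, count).
    """
    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 falls in the top bin
        bins[idx].append((s, c))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(s for s, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(scores, correct, n_bins=5):
    """Count-weighted average gap between stated confidence and observed accuracy."""
    rows = calibration_table(scores, correct, n_bins)
    total = sum(row[4] for row in rows)
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return sum(abs(row[2] - row[3]) * row[4] for row in rows) / total


# Four answers all badged "90% confident", only two of them right.
gap = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(round(gap, 2))
```

A badge that says 90% on a bucket that is right half the time is the confidence pill in numeric form. If the gap is large on your own eval set, the number does not belong in the UI.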

Citation UX — the most important AI UI primitive

If there is one design pattern you take from this chapter to your team on Monday, it is this: make the sources visible and click-throughable, inline, on every claim that matters.

Perplexity is the clearest example shipping today. Every answer carries numbered footnotes mid-sentence; each footnote is a link to the source page the model retrieved at query time. A user who does not trust a sentence can click in one motion and see the underlying source. The deeper move is philosophical: Perplexity has decided the user's primary cognitive job is not to read the AI answer — it is to decide whether to trust the AI answer. The UI is built around that decision. Citations are not a feature; they are the feature.

Three rules for citation UX worth borrowing.

Inline, not appended. A list of "sources" at the bottom is a bibliography. A numbered link mid-sentence is a verification primitive. Users will not scroll to a bibliography; they will tap an inline citation.

Linkable, not labelled. "Based on internal documentation" is a disclaimer, not a citation. The citation should resolve to a specific document, page, or URL the user can open.

Conservatively scoped. A citation should cover the claim it sits next to, not the whole answer. If only the first sentence is grounded in source A and the rest is generation, cite the first sentence and let the rest stand or fall on its own.
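The three rules translate into a data shape: the answer is a list of claims, each carrying its own source or none at all, and the renderer attaches a marker only where a source exists. This is a sketch under assumptions. The Claim type, the render_inline function, and the example.com URL are hypothetical, and a real UI would render each marker as a link instead of bracketed text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    source_url: Optional[str] = None  # None means ungrounded generation


def render_inline(claims):
    """Return (body, footnotes): each grounded sentence gets its own numbered marker."""
    numbers = {}  # url -> footnote number, so a reused source keeps its number
    parts = []
    for claim in claims:
        if claim.source_url is None:
            parts.append(claim.text)  # no marker: the sentence stands or falls on its own
            continue
        n = numbers.setdefault(claim.source_url, len(numbers) + 1)
        parts.append(f"{claim.text} [{n}]")
    footnotes = [f"[{n}] {url}" for url, n in numbers.items()]
    return " ".join(parts), footnotes


body, notes = render_inline([
    Claim("Refunds are processed within 7 days.", "https://example.com/policy"),
    Claim("That is usually quick."),
])
print(body)
print(notes)
```

Keeping the source on the claim, not on the answer, is what makes conservative scoping possible. An answer-level list of sources cannot express "this sentence is grounded, that one is not."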

For internal-facing or B2B products with proprietary corpora, the same primitives apply. Notion AI's Q&A surfaces the specific pages it pulled answers from, with links. Glean does the same on top of enterprise search. The user is not being asked to trust the model; they are being shown the receipts.

Abstention is a feature — "I don't know" beats wrong

Software engineering taught us for decades that an empty result is worse than any result. AI inherits this habit and amplifies it: a language model will always produce a response, even when it has no basis for one, because that is what its loss function rewarded during training.

The most under-used product move in AI design is to invert this default. Teach the model to say "I don't know" — or, better, "I cannot answer this confidently; here is what to do instead" — and build the UI to celebrate that response, not penalise it.

In the prompt, instruct explicitly: "If the answer is not in the provided context, respond with 'I do not have that information.' Do not fabricate." This is hygiene-level for every grounded product. It is not a guarantee — the model can still ignore the instruction — but it lowers the hallucination rate measurably on every eval I have run.

In the UI, the "I don't know" path should be a first-class state, not an error state. Show the user what to do next — escalate to a human, search the docs, open a ticket. Hand them off with grace. An AI feature that handles 70% of queries well and escalates the other 30% cleanly is a better product than one that handles 100% of queries and is wrong on 12%.

In the org, reward the abstention metric. Measure not just "what fraction were correct" but "what fraction of unanswerable questions were correctly abstained on, and how cleanly were they handed off." The second metric is where trust is built.
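The abstention metric is easy to compute once the eval set labels which questions are answerable from the provided context. The sketch below assumes each eval record carries three labels; the EvalRecord fields and the metric names are illustrative choices, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    answerable: bool  # ground truth: is the answer present in the provided context?
    abstained: bool   # did the assistant say "I do not have that information"?
    correct: bool     # was the answer right? (ignored when abstained)


def abstention_report(records):
    """Summarise wrong answers and abstention behaviour over an eval set."""
    if not records:
        raise ValueError("no eval records")
    answered = [r for r in records if not r.abstained]
    unanswerable = [r for r in records if not r.answerable]
    answerable = [r for r in records if r.answerable]
    return {
        # share of all queries where the user received a wrong answer
        "wrong_answer_rate": sum(1 for r in answered if not r.correct) / len(records),
        # share of unanswerable queries the assistant correctly declined
        "abstention_recall": (
            sum(1 for r in unanswerable if r.abstained) / len(unanswerable)
            if unanswerable else None
        ),
        # share of answerable queries the assistant declined anyway
        "over_abstention_rate": (
            sum(1 for r in answerable if r.abstained) / len(answerable)
            if answerable else None
        ),
    }


records = (
    [EvalRecord(True, False, True)] * 7      # answerable, answered correctly
    + [EvalRecord(False, True, False)] * 2   # unanswerable, correctly declined
    + [EvalRecord(False, False, False)]      # unanswerable, answered anyway
)
print(abstention_report(records))
```

The over-abstention rate is there as a counterweight. A model that declines everything scores perfectly on abstention recall and is useless, so the two numbers need to be read together.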

Where to put the human

Human-in-the-loop is the durable safety net for high-stakes AI surfaces, but most teams place it badly. There are three positions on the loop.

Before generation. The human writes or approves the prompt template. The model's outputs are constrained because the prompt is constrained. Cheapest form, the one most teams already do. Catches whole classes of failure (off-topic, off-tone) but does nothing for hallucinations within the constrained scope.

During generation. The model runs with the human watching, copilot-pattern. The developer sees the suggestion before they accept it; the support agent sees the draft reply before they send it; the doctor sees the suggested diagnosis before signing off. Highest fidelity — every output is checked — and it is what makes Copilot, draft-reply, and clinical-decision-support tools work. The cost is that the human is on the critical path, so throughput is bounded by attention.

After generation, on a sample. The model runs autonomously; a human reviews a sample after the fact, looking for hallucinations and drift, and feeds findings back into prompts and eval sets. Cheapest at scale, the right placement for low-to-medium-stakes autonomous features. It catches systemic problems but does not protect any individual user from any individual wrong answer.

Most products need a mix. A support chatbot might use before-generation prompting (topic guards), pure-autonomous generation for FAQ-shaped queries, and human-in-the-loop during generation for any query touching refunds, cancellations, or policy — the categories where Air Canada-style liability lives. The placement is the lever, not the presence.
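The support-chatbot mix described above can be sketched as a routing function. Everything here is an assumption for illustration: the category labels, the existence of an upstream classifier that assigns them, the 5% sample rate, and the names route and in_review_sample.

```python
import hashlib
from enum import Enum


class Route(Enum):
    REFUSE = "off-topic: decline and redirect"
    HUMAN_REVIEW = "draft for an agent to approve before sending"
    AUTONOMOUS = "answer directly; eligible for after-the-fact sampling"


# Hypothetical category labels; an upstream classifier is assumed to assign them.
ALLOWED_TOPICS = {"faq", "refund", "cancellation", "policy"}
HIGH_STAKES = {"refund", "cancellation", "policy"}


def route(category: str) -> Route:
    """Decide where the human sits for one incoming query."""
    if category not in ALLOWED_TOPICS:
        return Route.REFUSE        # topic guard, decided before generation
    if category in HIGH_STAKES:
        return Route.HUMAN_REVIEW  # human during generation
    return Route.AUTONOMOUS        # human after generation, on a sample


def in_review_sample(conversation_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick a fixed share of conversations for later human review."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate


print(route("refund"))
print(route("faq"))
```

Hashing the conversation id, instead of drawing a random number, keeps the sample stable across reruns, so a reviewer and an engineer looking at the same conversation agree on whether it was sampled.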

Retrieval as the first line of defence — a preview

Grounding the model in retrieved facts at query time — retrieval-augmented generation, or RAG — is the highest-leverage architectural move against hallucination on factual surfaces. Instead of asking the model "what is our refund policy," you retrieve the policy from your live document store, paste it into the prompt, and ask "given this policy, answer the user's question." The model is still hallucination-prone, but its hallucinations are now bounded by what is in the retrieved context, and a well-prompted model with high-quality retrieval is dramatically less likely to invent a clause that does not exist.
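The retrieve-then-ask shape looks roughly like this in code. It is a toy sketch under stated assumptions: the in-memory DOCS store and its contents are made up, word-overlap ranking stands in for a real retriever, and the model call itself is left out so that no particular vendor API is implied.

```python
import re

# Toy in-memory store with invented content; a real system would query a live document store.
DOCS = {
    "refund-policy": "Refund policy: a refund is available within 30 days of purchase on annual plans.",
    "plans": "We offer Starter and Team plans, billed monthly or annually.",
}


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, k=1):
    """Rank documents by word overlap with the question (a stand-in for real retrieval)."""
    q = tokenize(question)
    ranked = sorted(docs.items(), key=lambda item: len(q & tokenize(item[1])), reverse=True)
    return [(doc_id, text) for doc_id, text in ranked[:k] if q & tokenize(text)]


def build_prompt(question, passages):
    """Paste the retrieved passages into the prompt and require grounded, cited answers."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the context below. Cite the document id in brackets. "
        "If the answer is not in the provided context, respond with "
        "'I do not have that information.' Do not fabricate.\n\n"
        f"Context:\n{context or '(no matching documents)'}\n\n"
        f"Question: {question}"
    )


question = "What is your refund policy?"
print(build_prompt(question, retrieve(question, DOCS)))
```

When retrieval returns nothing, the prompt says so explicitly. That gives the model a clear path to the abstention response instead of an empty context it may fill from memory.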

RAG is not a silver bullet. The model can still misread context, combine chunks incorrectly, or fail to abstain when the answer is not present. Retrieval quality is its own product problem (chunking, embedding, ranking, freshness), and a bad retriever feeds the model bad context, producing confidently wrong answers grounded in irrelevant material.

Chapter 7 (RAG, Fine-Tune, or Context Window?) goes deep on when to use retrieval, fine-tuning, or a long context window. The preview, for this chapter: if your AI feature answers questions about company-specific facts (policies, prices, inventory, documents), assume you need retrieval, and assume retrieval quality is half the product.

A worked example: Bing Chat's early hallucination cycles

When Microsoft launched the Bing Chat preview in February 2023, it took less than two weeks for screenshots to circulate of Bing fabricating financial data, inventing historical events, gaslighting users about the current date, and, in one widely shared transcript, declaring its love for a New York Times reporter and trying to convince him to leave his wife.

The product reasons are worth holding onto. Microsoft had shipped a high-trust UI — a search engine, the place users go to find true things — on top of an unconstrained generative model with weak grounding. The mismatch between the surface (search) and the substrate (open-ended generation) was the bug. Users were not wrong to trust the answers; the product had been wrapped in the visual language of search.

Microsoft's response was instructive. They did not announce that hallucinations would be fixed in the next model. They constrained the product. They capped conversation length. They added topic guards. They added explicit citations to grounding sources, the same shape as Perplexity's pattern. They shipped a "more precise" mode for factual queries. The product survived because Microsoft treated hallucination as a product problem, not a model problem it could wait out, and shipped product responses.

The lesson generalises. When your AI feature ships into a high-trust surface, the first month will surface failure modes you did not see in eval. Your response cannot be "wait for the next model." It has to be product: tighten scope, add citations, add abstention, add guardrails on the categories that are catching fire.

// scene:

Post-launch review, 14 days after a B2B SaaS team shipped an AI assistant in their billing dashboard.

Support Lead: “Three customer escalations this week. The assistant told one customer they were on a plan that does not exist, quoted a refund policy that is not ours, and said we offer a feature we deprecated last year.”

PM: “On all three, was the assistant asked a factual question, or for help with something open-ended?”

Support Lead: “Factual. Plan, policy, feature availability. The three things we have a database for.”

PM: “Then we are using the model wrong. Factual questions about our own product should not be answered by the model. They should be answered by retrieval against the live database, with the model only doing the rephrasing. The model stays — it is good at the conversational frame — but it stops being the source of truth. And topic-guard refunds and policy questions to a human until retrieval is in place.”

Hallucination is rarely the model's fault. It is almost always the architecture's fault — the team gave the model a job that should have been retrieval's job, then was surprised when the model did it the way the model does.

// tension:

The product fix for hallucination is rarely 'better prompt' or 'better model.' It is 'put the model on a job the model is actually for.'

What to do on Monday morning

Take every AI surface in your product. Walk it through the three questions — what action does the user take, who pays when wrong, how fast can wrong be detected and undone. Mark each surface high-trust, medium, or low. For every high-trust surface, write down where the model is grounded, what the citation UX is, what the abstention path is, and where the human sits in the loop. If any of those four is blank, you have a production risk, not a feature. Close the gap before the next deploy.
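The four-answer check is simple enough to run as a script against whatever format your specs live in. The sketch below assumes each surface is a dictionary with a tier and four free-text fields; the field names and the functions missing_answers and production_risks are hypothetical.

```python
REQUIRED_ANSWERS = ("grounding", "citation_ux", "abstention_path", "human_in_loop")


def missing_answers(spec: dict) -> list:
    """Return the required fields that are absent or blank for one surface."""
    return [key for key in REQUIRED_ANSWERS if not str(spec.get(key) or "").strip()]


def production_risks(surfaces: dict) -> dict:
    """Map each high-trust surface name to the answers it is still missing."""
    risks = {}
    for name, spec in surfaces.items():
        if spec.get("tier") != "high":
            continue
        gaps = missing_answers(spec)
        if gaps:
            risks[name] = gaps
    return risks


surfaces = {
    "billing assistant": {
        "tier": "high",
        "grounding": "live plan database",
        "citation_ux": "",
        "abstention_path": "hand off to support",
        "human_in_loop": None,
    },
    "code suggestion": {"tier": "low"},
}
print(production_risks(surfaces))
```

An empty result is the goal. Anything the script prints is a gap to close before the next deploy.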

If you are about to ship a new AI feature, do not let it go out without those four answers written into the spec. "We will add citations later" is the AI-era version of "we will add auth later."

Rules

Where to go next

Chapter 4 (Eval Before Launch): the measurement layer that tells you whether grounding, citations, and abstention are actually working.
Chapter 7 (RAG, Fine-Tune, or Context Window?): the deep dive on retrieval as the first line of architectural defence against hallucination.
Chapter 8 (AI UX Patterns That Work): the design-pattern library for citations, abstention states, and confidence signals.
Chapter 10 (Safety, Privacy, Compliance for Shipping Teams): how the Air Canada precedent generalises into a regulatory posture.
Companion: Ethical PM, the substrate underneath every "where does the human sit" decision in this chapter.