Most AI UX is a chatbot, and most chatbots are a regression from the button they replaced. The job of AI UX is not to look intelligent — it is to keep the user in charge while the model does the work.
After this page, you’ll be able to:
- Pick the right surface — inline, diff, suggest-then-confirm, chat — for the job in front of you, instead of defaulting to the chat box
- Spec streaming, citations, undo, and cancelability the way you spec validation rules in a form
- Recognise the five AI UX anti-patterns that erode trust faster than a bad model ever could
Somewhere in the last three years, "AI feature" became synonymous with "chat box in the bottom right corner." A team decides the product needs AI, opens a meeting, and within an hour somebody has shipped a Figma frame with a floating circle, a sparkle icon, and a prompt that says "Ask me anything." The model behind it might be very good. The UX is almost always worse than the button it replaced.
AI UX is the set of design decisions that determine whether your users stay in control, trust the output, and come back tomorrow — or whether they hit your feature once, get burned, and tell their team to ignore it. The model decisions in chapters 2 through 7 buy you a chance at a working product. The UX decisions in this chapter convert that chance into retention.
The frame: every AI interaction is a negotiation between three parties — the user, the model, and the product. Good AI UX makes the model do the work, the user keep the agency, and the product hold the trust. Bad AI UX collapses two of those three into one and asks the user to live with the result.
The working patterns, in the order you will reach for them
1. Streaming — and when not to stream
Streaming output token-by-token is the highest-impact UX pattern of the LLM era. A three-second wait for a paragraph feels slow; the same paragraph rendering as it generates feels fast, even though total time is identical. Default to it for any response longer than a sentence.
The "and when not to stream" is what teams miss. Stream when the output is for reading. Do not stream when the output is for parsing: a JSON blob the UI will deserialize, a tool call the agent will execute, a code change you will apply to a file. Rendering half-formed JSON breaks the UI; acting on a half-formed function call risks firing on a partial argument. The rule: stream prose, buffer structure. If the output has both, stream the prose chunk and buffer the structured chunk with a visible boundary between them.
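A minimal sketch of "stream prose, buffer structure", assuming the model's output already arrives as tagged chunks. The `("prose", text)` / `("json", text)` tagging and the `render_stream` name are hypothetical; a real provider's stream will look different.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def render_stream(chunks: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, object]]:
    """Forward prose chunks immediately; hold JSON fragments until they parse."""
    pending: List[str] = []  # fragments of the structured payload seen so far

    def flush() -> Tuple[str, object]:
        raw = "".join(pending)
        pending.clear()
        try:
            return ("structured", json.loads(raw))
        except json.JSONDecodeError:
            # Never hand a half-formed payload to the UI or to a tool runner.
            return ("error", "structured output was incomplete")

    for kind, text in chunks:
        if kind == "json":
            pending.append(text)  # buffer structure
            continue
        if pending:
            yield flush()  # the structured block ended; emit it whole
        yield ("prose", text)  # stream prose as it arrives
    if pending:
        yield flush()


events = list(render_stream([
    ("prose", "Here is "),
    ("prose", "the plan."),
    ("json", '{"action": "cre'),
    ("json", 'ate", "id": 7}'),
]))
```

The reader sees the two prose chunks as they arrive; the UI only ever receives the JSON as one parsed object, or an explicit error if the stream was cut short.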
2. Inline citations, not a citations panel
If your AI makes claims about the world (a summary, a knowledge-base answer, a recommendation grounded in user data), cite the sources. The pattern that works is inline: a footnote-style marker next to the sentence that makes the claim, expanding to the source on hover or tap. The pattern that does not work is a "Sources" panel at the bottom listing five URLs the user has to map back to claims themselves. The user will not do the mapping. They will trust the whole answer or none of it; both are the wrong default.
Perplexity made the inline pattern famous; the reason it works is not the visual, it is the cognitive contract. A claim with a citation right next to it is auditable. A claim without one is a vibe. Users trust the system more even when they never click. (See Hallucination as a Product Problem.)
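One way to spec the inline pattern is to require that the answer arrives as claims with source IDs attached, so the marker can sit next to the sentence. The sketch below is illustrative only: the `Claim` shape, the `kb-…` IDs, and the source titles are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Claim:
    text: str
    source_ids: List[str] = field(default_factory=list)  # empty means uncited


def render_with_citations(claims: List[Claim], sources: Dict[str, str]) -> Tuple[str, List[str]]:
    """Attach footnote-style markers to each claim and build the footnote list."""
    numbering: Dict[str, int] = {}  # source id -> footnote number, in order of first use
    parts: List[str] = []
    for claim in claims:
        markers = ""
        for sid in claim.source_ids:
            if sid not in sources:
                continue  # skip IDs we cannot resolve rather than show a dead marker
            number = numbering.setdefault(sid, len(numbering) + 1)
            markers += f"[{number}]"
        parts.append(claim.text + markers)
    footnotes = [f"[{n}] {sources[sid]}" for sid, n in numbering.items()]
    return " ".join(parts), footnotes


text, footnotes = render_with_citations(
    [
        Claim("Refunds take 5 days.", ["kb-12"]),
        Claim("Premium users skip the queue.", ["kb-7", "kb-12"]),
        Claim("Thanks for asking."),
    ],
    {"kb-12": "Refund policy", "kb-7": "Premium plan FAQ"},
)
```

Because the mapping from claim to source is in the data, the UI can render the marker on hover or tap without asking the user to do any matching.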
3. Edit-in-place plus undo
For any AI feature that modifies the user's content — rewrites a paragraph, refactors a function, recolors a layer — the right shape is: the AI proposes the change in the document, the user sees the new version where the old one was, and undo is one keystroke away. Edit-in-place keeps the user in the surface they already know. Undo keeps the model from being a one-way door. The cognitive cost of trying an AI suggestion drops to roughly zero, and that cheap reversibility is doing most of the trust work.
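The reversibility contract is small enough to sketch. This assumes a toy in-memory `Document` class (hypothetical, not any editor's real API) where every AI edit snapshots the previous text first.

```python
from typing import List


class Document:
    """Toy document: AI edits land in place and every one is undoable."""

    def __init__(self, text: str) -> None:
        self.text = text
        self._history: List[str] = []  # snapshots taken before each AI edit

    def apply_ai_edit(self, old: str, new: str) -> None:
        if old not in self.text:
            raise ValueError("the passage to rewrite is no longer in the document")
        self._history.append(self.text)  # snapshot first, so undo is always possible
        self.text = self.text.replace(old, new, 1)  # rewrite where the old text was

    def undo(self) -> bool:
        """Restore the text from before the last AI edit. Returns False if nothing to undo."""
        if not self._history:
            return False
        self.text = self._history.pop()
        return True


doc = Document("The meeting is on Monday. Bring snacks.")
doc.apply_ai_edit("Bring snacks.", "Please bring snacks to share.")
after_edit = doc.text
doc.undo()
after_undo = doc.text
```

A production editor would hook into its existing undo stack rather than keep full-text snapshots; the point of the sketch is only that the snapshot happens before the edit, never after.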
4. Suggest-then-confirm for destructive or irreversible actions
The default for any action that costs money, sends a third-party message, deletes data, or otherwise cannot be undone in one click is: the AI proposes, the user confirms, then it executes. Never auto-apply. The cost of an extra "Confirm" click is two seconds. The cost of the AI auto-charging the wrong card, emailing the wrong person, or dropping the wrong table is unbounded. Pick the cheap loss.
This is chapter 6's "auto-approved, requires confirmation, off-limits" call, rendered at the UX layer. Any action whose blast radius exceeds its convenience win gets a gate. The convenience case is the user clicking through in two seconds. The disaster case is the Friday afternoon the model misreads the intent and there is nothing between it and production.
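The three-way call can be written down as a policy table the UI consults before anything runs. The action names and the `execute` helper below are hypothetical; the one deliberate design choice is that an action missing from the table defaults to requiring confirmation.

```python
from enum import Enum
from typing import Callable, Dict


class Gate(Enum):
    AUTO = "auto-approved"
    CONFIRM = "requires confirmation"
    OFF_LIMITS = "off-limits"


# Illustrative policy table; real action names depend on the product.
POLICY: Dict[str, Gate] = {
    "draft_reply": Gate.AUTO,
    "send_email": Gate.CONFIRM,
    "charge_card": Gate.CONFIRM,
    "drop_table": Gate.OFF_LIMITS,
}


def execute(action: str, run: Callable[[], None], confirmed: bool = False) -> str:
    """Run the action only if its gate allows it."""
    gate = POLICY.get(action, Gate.CONFIRM)  # unknown actions are gated, not trusted
    if gate is Gate.OFF_LIMITS:
        return "blocked"
    if gate is Gate.CONFIRM and not confirmed:
        return "needs_confirmation"  # UI shows the proposal and a Confirm button
    run()
    return "executed"


log = []
first = execute("send_email", lambda: log.append("sent"))
second = execute("send_email", lambda: log.append("sent"), confirmed=True)
```

The first call returns without sending anything; only the second, made after the user confirms, reaches `run`.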
5. Diff-view for proposed changes
When the AI proposes a non-trivial change to existing content — a code edit, a contract redline, a multi-paragraph rewrite — show the diff. A diff is the difference between "trust me" and "here is exactly what I changed." Red for removed, green for added, an Accept button per hunk so the user can take some changes and leave others.
The diff does two things at once. It reduces cognitive load (the user inspects only what changed) and it preserves agency (the user picks which hunks land). It also surfaces mistakes immediately — a confidently wrong rewrite is obvious in red-and-green where it would have hidden in clean prose. Where it falls short: diffs are great for code, tolerable for text, broken for layouts and visuals, where "what changed" is not a line-level question.
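Per-hunk accept can be sketched with Python's standard `difflib`. This is one possible approach, not how any particular editor implements it: each non-equal opcode from `SequenceMatcher` is treated as a hunk, and `apply_accepted` (a hypothetical helper) keeps the old lines for any hunk the user did not accept.

```python
import difflib
from typing import List, Set


def hunks(old: List[str], new: List[str]) -> list:
    """Each non-equal opcode is one hunk the user can accept or reject."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


def apply_accepted(old: List[str], new: List[str], accepted: Set[int]) -> List[str]:
    """Build the result line by line, taking new lines only for accepted hunks."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    result: List[str] = []
    hunk_index = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old[i1:i2])
            continue
        if hunk_index in accepted:
            result.extend(new[j1:j2])  # accepted: the proposed lines land
        else:
            result.extend(old[i1:i2])  # rejected: the original lines stay
        hunk_index += 1
    return result


old_lines = ["def total(xs):", "    t = 0", "    for x in xs: t += x", "    return t"]
new_lines = ["def total(xs):", "    total = 0", "    for x in xs: t += x", "    return sum(xs)"]
partial = apply_accepted(old_lines, new_lines, accepted={1})
```

Here the user rejects the first hunk (a rename that would have broken the loop) and accepts the second, which is exactly the kind of mixed outcome an all-or-nothing Accept button cannot express.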
6. Multi-suggestion presentation
When the output is one of several reasonable choices — a subject line, a tagline, a category — show three or four side by side instead of one. The user's task shifts from "is this right?" (high-stakes, slow) to "which is best?" (low-stakes, fast). It forecloses the worst failure mode of single-suggestion UX: the user accepting a mediocre output because it was the only one offered. Three is usually right. Two looks like a coin flip. Five becomes a menu.
7. "Show your work" disclosures
For any output the user might dispute — a classification, a routing decision, a recommendation — make the reasoning available, but do not put it in the primary view. A "Why this?" link or expandable panel the user opens if they want. Hiding the reasoning destroys auditability; foregrounding it destroys the feature.
8. Latency UX — skeleton, partial render, cancelability
While the model is working: show a skeleton so the user sees the shape of the answer that is coming, stream partial content the moment it is available, and give the user a visible Cancel or Escape so a slow generation does not feel like a hostage situation.
Cancelability is the piece most teams skip. A generation the user cannot interrupt is worse UX than a slower model the user can stop. If your feature has a "thinking…" spinner with no Cancel button, you have shipped a hostage taker.
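Cancelability is cheap to build if the generation runs as a task the UI can cancel. A sketch using `asyncio`, with a fake token generator standing in for the model call; the `pressed_escape` event simulates the user hitting Escape after two tokens have rendered.

```python
import asyncio
from typing import Callable, List, Tuple


async def generate(tokens: List[str], on_token: Callable[[str], None]) -> None:
    """Fake generation: each loop iteration stands in for waiting on the model."""
    for token in tokens:
        await asyncio.sleep(0)  # yield to the event loop, as a network wait would
        on_token(token)


async def demo() -> Tuple[List[str], bool]:
    shown: List[str] = []
    pressed_escape = asyncio.Event()

    def on_token(token: str) -> None:
        shown.append(token)  # partial render: show each token as it arrives
        if len(shown) == 2:
            pressed_escape.set()  # simulate the user pressing Escape here

    task = asyncio.create_task(generate(["The", "quick", "brown", "fox"], on_token))
    await pressed_escape.wait()
    task.cancel()  # the Cancel button maps to exactly this call
    try:
        await task
    except asyncio.CancelledError:
        pass  # cancellation is the expected outcome, not an error to surface
    return shown, task.cancelled()


shown_tokens, was_cancelled = asyncio.run(demo())
```

What was already rendered stays on screen and nothing further arrives. Whether the provider also stops billing for the unfinished generation is a separate question this sketch does not answer.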
9. Failure UX — graceful degradation and honest framing
The model will fail. It will rate-limit, time out, hit a content filter, return malformed output, or refuse. The framing is a product decision, not an engineer's 500-string. "We couldn't generate a response right now. Try again, or write it yourself" is graceful degradation. "Error 500" is a churn driver. "I cannot help with that" with no follow-up is worse: it leaves the user staring at a refusal with no path forward. State the failure, propose an alternative (manual entry, a different prompt, a retry), and never blame the model out loud.
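Speccing the failure path can be as plain as a table from failure kind to user-facing copy and next steps. The kinds, wording, and action names below are placeholders to show the shape, not recommended copy.

```python
from typing import Dict, List, TypedDict


class FailureView(TypedDict):
    message: str
    actions: List[str]  # what the user can do next; never empty


# Hypothetical failure kinds and copy, for illustration only.
FAILURE_UX: Dict[str, FailureView] = {
    "timeout": {
        "message": "This is taking longer than expected.",
        "actions": ["retry", "write_it_yourself"],
    },
    "rate_limit": {
        "message": "We couldn't generate a response right now.",
        "actions": ["retry_later", "write_it_yourself"],
    },
    "refusal": {
        "message": "We can't help with that request as written.",
        "actions": ["rephrase", "contact_support"],
    },
    "malformed_output": {
        "message": "We couldn't generate a usable response.",
        "actions": ["retry", "write_it_yourself"],
    },
}

GENERIC: FailureView = {
    "message": "Something went wrong generating a response.",
    "actions": ["retry", "write_it_yourself"],
}


def failure_view(kind: str) -> FailureView:
    """Every failure, including ones nobody anticipated, gets a message and a way forward."""
    return FAILURE_UX.get(kind, GENERIC)
```

The useful constraint is structural: a failure kind cannot be added to the table without also deciding what the user does next.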
10. Trust calibration — confidence shown sparingly
Showing confidence scores ("78% sure") looks rigorous in a demo and lies in production. A model's self-reported confidence is often poorly calibrated; an answer tagged 90% can be wrong as often as one tagged 60%. The pattern to use: show confidence categorically (High / Medium / Low, or Confident / Uncertain / Refusing) only when you have calibrated the threshold against an eval set, and only when the user can act on the signal by escalating to a human, double-checking the source, or regenerating. Confidence the user cannot act on is confidence theatre.
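"Calibrated against an eval set" can be made concrete. In the sketch below, an eval set is a list of `(score, was_correct)` pairs, and the threshold for the top label is the lowest candidate score at which observed accuracy meets a target. The data, candidate thresholds, and targets are invented for illustration.

```python
from typing import List, Optional, Sequence, Tuple

EvalRecord = Tuple[float, bool]  # (model score, was the answer actually correct)


def accuracy_at_or_above(records: Sequence[EvalRecord], threshold: float) -> Optional[float]:
    hits = [correct for score, correct in records if score >= threshold]
    return sum(hits) / len(hits) if hits else None


def calibrate_threshold(
    records: Sequence[EvalRecord], target_accuracy: float, candidates: Sequence[float]
) -> Optional[float]:
    """Lowest candidate threshold whose observed accuracy meets the target, else None."""
    for threshold in sorted(candidates):
        accuracy = accuracy_at_or_above(records, threshold)
        if accuracy is not None and accuracy >= target_accuracy:
            return threshold
    return None


def label(score: float, high: Optional[float], medium: Optional[float]) -> Optional[str]:
    """Categorical label, or None when no calibrated threshold exists (show nothing)."""
    if high is None or medium is None:
        return None
    if score >= high:
        return "High"
    return "Medium" if score >= medium else "Low"


eval_set: List[EvalRecord] = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, False), (0.55, True), (0.50, False),
]
high = calibrate_threshold(eval_set, target_accuracy=0.9, candidates=[0.5, 0.7, 0.9])
medium = calibrate_threshold(eval_set, target_accuracy=0.75, candidates=[0.5, 0.7, 0.9])
```

On this toy data the score 0.85 was wrong, so "High" only starts at 0.9. If no candidate meets the target, `calibrate_threshold` returns `None` and `label` shows nothing, which is the honest default.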
11. Keyboard-first interactions for power users
The users who get the most value use the feature twenty times a day. They will not move their hand to a mouse twenty times a day. Every AI surface inside a working tool — editor, doc, design tool, ticketing system — earns its keep on the keyboard: a shortcut to invoke, Tab or Enter to accept, Escape to reject, Cmd-Z to undo. The feature that requires a click to open, a click to type, a click to send, and a click to apply has lost its power users before it shipped.
The anti-patterns — five things to stop doing this quarter
Bare chatbot when a button works. If the user's job has three discrete intents, give them three buttons. Forcing them to type "summarize this" into a chat input is making the user do work the UI should have done. The chatbot is the right surface for open-ended intent (Claude.ai, ChatGPT). It is the wrong surface for any task with a finite menu of jobs, which is most product features.
Auto-apply destructive actions. Already covered in pattern 4, repeated because it keeps shipping. If the action sends an email, files a return, charges a card, deletes content, or modifies another user's state, the model proposes and the user confirms. A 1% wrong-charge rate at scale is a regulatory event.
Confidence theatre. Showing precise confidence scores on uncalibrated systems. Showing "AI is thinking…" when the model already finished but the network is slow. Animating a sparkle for 1200ms to make a 200ms response feel "more AI." All of it makes the product feel less trustworthy to the users who matter, while feeling more magical to the users in the demo.
Magic that hides agency. Features that "just do it" without showing what was done, what was changed, or what the AI inferred about intent. The user does not feel served; they feel overruled. Surface the inference, the action, and the reversal. "I assumed you meant X — was that right?" is stronger UX than "Done."
AI that creates work instead of removing it. The clearest tell of a bad AI feature is that, after it ships, the user has more clicks, decisions, or output to review than before. If the AI generates five things the user now has to triage, you have shipped a delegation problem dressed as productivity. The bar is that human work goes down. If it does not, the AI is the wrong tool for the job. (Chapter 1 covered this from the strategy side; it is also a UX failure.)
Three worked examples — what each gets right, where each falls short
Cursor's diff-then-accept
Cursor pairs a code editor with an AI that proposes edits across files. The shipped UX combines almost every working pattern in this chapter: streaming response, inline edit, red/green diff, per-hunk Accept and Reject, Cmd-Z undo, Escape to cancel, keyboard-first throughout. The developer sees exactly what changed before any of it lands.
Where it falls short: the UX assumes the user can read code well enough to evaluate the diff. For senior developers, that is fine. For junior developers, or non-engineers on Bolt, Lovable, or v0, the diff is overhead they cannot evaluate, and "accept all" quietly becomes the default. A diff is only as useful as the user's ability to read it.
Notion AI's edit-in-place plus undo
You highlight a paragraph, ask for a rewrite, the paragraph mutates in place, Cmd-Z reverses it. The cognitive cost of trying a rewrite is approximately zero, which is the whole point — the user can ask three times in thirty seconds and keep the best version. It is the canonical example of trust through reversibility.
Where it falls short: undo is all-or-nothing. If the AI rewrote four paragraphs and you liked two, Notion AI does not let you take those two and keep working — you accept the whole rewrite or reject it. Cursor solved this for code with per-hunk accept. Prose has not been solved. The frontier is chunked accept/reject for unstructured content; whoever ships it well sets the new standard.
Linear's AI-summary inside the comment thread
Linear added an AI-generated summary that sits inside a long comment thread, so a teammate joining late can catch up in two sentences instead of reading thirty comments. The summary is embedded in the surface where the work already happens — no separate panel, no separate prompt. That is the right call.
Where it falls short: the summary does not cite which comments it drew from. "The team decided to ship Friday" does not link back to the comment where that decision happened. For low-stakes catch-up, the gap is acceptable. For a manager scanning summaries to decide what to escalate, or a customer-facing PM treating the summary as the truth of what was promised, the missing per-claim grounding is the difference between a useful summary and an unauditable one. The fix is the inline-citation pattern from earlier in this chapter.
The five questions before any AI UX ships
- What is the user's job, and is the chat input the right surface for it? If the job has finite intents, you want buttons.
- What can the model change that cannot be undone in one click? Map every irreversible action and gate it.
- Where do the model's claims come from, and can the user audit them inline?
- Where does the model fail (timeout, refusal, rate limit, malformed output), and is that path specced?
- Has the AI reduced the human's work, or moved it?
If three of those five come back weak, do not ship. Iterate the surface, then ship.
Rules
The chatbot is the right surface for open-ended intent. For finite intent, ship buttons. Defaulting to chat is conceding the UX argument before you started.
Stream prose, buffer structure. Token-by-token rendering is a perceived-latency win for reading and a parsing hazard for JSON, tool calls, and structured output.
Citations live inline, next to the claim they support — never in a sources panel at the bottom. The user will not map five URLs back to ten sentences. Make the audit free.
Edit-in-place plus one-keystroke undo is the cheapest trust contract in AI UX. Reversibility makes the user willing to try; trying is most of the loop.
Any AI action whose blast radius exceeds its convenience win gets a confirm step. Auto-applying destructive operations is the failure mode most teams ship and most teams regret. (Companion: Tool Use, Function Calling, Agents — The Maturity Ladder.)
Show the diff. For any non-trivial change to existing content, the user inspects what changed, not the whole result. Per-hunk accept/reject is the gold standard.
The Cancel button is the feature. A model generation the user cannot interrupt is a hostage situation, no matter how good the output is.
The bar for any AI feature is that the human's work goes down. If the user has more to triage, review, or decide after it ships, the feature is the wrong shape.
Where to go next
- Chapter 5 — Hallucination as a product problem: inline citations and confidence calibration in this chapter are also hallucination controls. (Hallucination as a Product Problem)
- Chapter 6 — Tool use, function calling, agents: the confirm-before-destructive rule is the same rule that gates agent tool calls. (Tool Use, Function Calling, Agents — The Maturity Ladder)
- Chapter 9 — Cost & latency as first-class product constraints: streaming, skeletons, and cancelability are how UX absorbs the latency the model cannot remove.
- Chapter 4 — Eval before launch: every UX claim here (confidence thresholds, refusal rates, citation accuracy) needs an eval to back it. (Eval Before Launch)
- Companion: Working with Engineers — most of these patterns live at the PM-engineer seam; spec them like validation rules, not vibes.