Multimodal UX — the pm manual

Multimodal is table stakes in 2026. The question isn't 'can we add vision' — it's 'does multimodal actually improve the user's job, or does it just make the demo impressive?'

Talvinder Singh, Pragmatic Leaders

In 2026, the major model APIs (GPT-5, Claude 4.5, Gemini 2.5) all accept image and document inputs as standard, with audio input available on some of them. That leaves every product team with the same question: should we use this? The answer is usually not "yes because we can". It's "yes because it makes a specific user job materially easier than the text-only alternative."

This page is about designing AI UX well. The modality question is part of it. The larger part is the interaction patterns that turn probabilistic, slow, sometimes-wrong outputs into an experience users trust.

When multimodal earns its complexity cost

Multimodality adds: integration cost, latency (vision models run slower than text-only), testing complexity (vision inputs require different eval infrastructure), and accessibility obligations (you now need to handle inputs you can't perfectly interpret).

The question that justifies those costs: does the multimodal input materially reduce friction for the user compared to text? Not "is it impressive" — is the job easier?

Vision earns its cost when:

The user's primary artifact is visual (photos, diagrams, documents, screenshots)
The user would otherwise describe the visual artifact in text, losing precision
Manual transcription or re-entry is the current friction (a user photographing a receipt is faster than typing expense details)

Voice earns its cost when:

The user's hands or eyes are occupied (driving, cooking, examining a physical object)
The user thinks in conversation rather than typing (customer service, interview practice, brainstorming)
Latency is acceptable and the use case is genuinely conversational (not a search replacement with voice input bolted on)

Document understanding (PDF, spreadsheet) earns its cost when:

The document is the source of truth the user is working from
Parsing the document yourself (OCR + structure extraction) would produce lower-quality results than the model does
The user's workflow already involves navigating the document (contracts, reports, financial statements)

When multimodal doesn't earn its cost:

You're adding it because the competition added it
The user could describe the input in a sentence with equal precision
The modality is optional and most users won't use it (adding voice input to a form nobody asked to fill in by voice)

Learn the judgment

Your product is an expense tracking app for small businesses. An engineer proposes adding receipt scanning (photo-to-expense form). Marketing is excited. The current flow requires users to manually type merchant name, amount, category, and date. You have 60 days to ship before Q3.

The call: Do you prioritize receipt scanning? What information do you need before deciding?

Designing for streaming text

Most LLM APIs can return text as a stream: tokens arrive incrementally as the model generates them. This is almost always better UX than showing a spinner for 5 seconds and then displaying the full answer. But streaming introduces its own design requirements.

The basics:

Render tokens as they arrive. Don't buffer the full response before displaying.
Show a blinking cursor or typing indicator to signal that generation is in progress.
Disable the submit button while streaming to prevent re-submission.
Provide a stop/cancel button. Users who see the model going in the wrong direction want to interrupt.
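
The four basics above fit into one small piece of view state. Here is a minimal sketch, assuming a hypothetical StreamingView class (the names are illustrative, not from any UI framework) and any iterable of tokens standing in for the model's stream:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class StreamingView:
    """Hypothetical view state for one streaming response."""

    shown: List[str] = field(default_factory=list)
    streaming: bool = False
    stop_requested: bool = False

    @property
    def submit_enabled(self) -> bool:
        # Submit stays disabled while tokens are still arriving.
        return not self.streaming

    @property
    def show_cursor(self) -> bool:
        # The blinking cursor is visible only during generation.
        return self.streaming

    def request_stop(self) -> None:
        # Wired to the stop/cancel button.
        self.stop_requested = True

    def consume(
        self,
        tokens: Iterable[str],
        on_token: Optional[Callable[["StreamingView"], None]] = None,
    ) -> str:
        """Render tokens one by one until the stream ends or the user stops it."""
        self.streaming = True
        self.stop_requested = False
        try:
            for token in tokens:
                if self.stop_requested:
                    break
                self.shown.append(token)  # render immediately, no full-response buffer
                if on_token is not None:
                    on_token(self)  # e.g. trigger a repaint
        finally:
            self.streaming = False  # re-enable submit even if the stream errors
        return "".join(self.shown)
```

In a real client the stop request would also cancel the underlying network call, which this sketch leaves out.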

The less obvious patterns:

Preserve scroll position during streaming. If a user has scrolled up to read earlier content, don't autoscroll them to the bottom as new tokens arrive. Most chat UIs get this wrong. The rule: autoscroll only if the user is already at the bottom.
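
The autoscroll rule comes down to one comparison. A rough sketch, assuming the three measurements are in pixels and that a tolerance of 24 px counts as "at the bottom" (the function name and threshold are illustrative choices, not a standard):

```python
def should_autoscroll(
    scroll_top: float,
    viewport_height: float,
    content_height: float,
    threshold_px: float = 24.0,
) -> bool:
    """Return True only if the user is already at (or very near) the bottom.

    Call this BEFORE appending the new token: once the content grows,
    a user who was at the bottom will no longer measure as being there.
    """
    distance_from_bottom = content_height - (scroll_top + viewport_height)
    return distance_from_bottom <= threshold_px
```

A user who has scrolled up 600 px to reread an earlier answer fails the check and stays where they are. A user pinned to the bottom passes it and keeps following the stream.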

Handle the "bad start" problem. The first few tokens often determine whether the response is going to be useful. Some users will want to cancel and retry immediately if the model starts with a boilerplate opener ("Of course! I'd be happy to help..."). Consider streaming with a short initial buffer — display the first 50 tokens all at once rather than one by one, to reduce the jarring experience of watching the model warm up.
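
One way to sketch the initial buffer is as a wrapper around the token stream. The helper name with_initial_buffer is hypothetical, and the default of 50 tokens is the figure from the paragraph above, a starting point to tune rather than a tested value:

```python
from typing import Iterable, Iterator, List


def with_initial_buffer(tokens: Iterable[str], first_chunk_size: int = 50) -> Iterator[str]:
    """Hold back the first N tokens, release them as one chunk, then stream normally."""
    buffer: List[str] = []
    released = False
    for token in tokens:
        if released:
            yield token  # normal token-by-token streaming
            continue
        buffer.append(token)
        if len(buffer) >= first_chunk_size:
            yield "".join(buffer)  # the opening appears all at once
            released = True
    if not released and buffer:
        # Short responses never fill the buffer; flush what we have.
        yield "".join(buffer)
```

The tradeoff is a slightly later first paint, so keep showing a typing indicator while the buffer fills.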

Streaming structured output requires care. If you're streaming JSON or markdown tables, the structure is invalid until complete. Either: (1) stream the prose and render structured elements after generation completes, or (2) use partial rendering (show table rows as they arrive). Option 2 is more complex but better UX for long outputs.
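
Option 2 can be sketched for a markdown table as "only render a row once its line is complete." This assumes the model emits pipe-delimited rows separated by newlines. The helper names are illustrative, and a production parser would need to handle escaped pipes and malformed rows:

```python
from typing import Iterable, Iterator, List, Optional


def parse_row(line: str) -> Optional[List[str]]:
    """Parse one complete markdown table line into cells, or None if not a data row."""
    line = line.strip()
    if len(line) < 2 or not (line.startswith("|") and line.endswith("|")):
        return None
    cells = [cell.strip() for cell in line[1:-1].split("|")]
    # Skip the header separator row, e.g. |---|:---:|
    if all(cell and set(cell) <= set("-:") for cell in cells):
        return None
    return cells


def complete_rows(chunks: Iterable[str]) -> Iterator[List[str]]:
    """Yield table rows as soon as their line has fully arrived."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            row = parse_row(line)
            if row is not None:
                yield row
    # The stream may end without a trailing newline.
    row = parse_row(pending)
    if row is not None:
        yield row
```

Chunk boundaries rarely line up with row boundaries, which is why the sketch accumulates text and waits for a newline before it parses anything.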

Partial states and the "thinking" moment

Every AI feature has a latency gap. The user asked something; the model hasn't answered yet. What does the user see?

Most teams show a spinner. Spinners communicate "something is happening" but not "here's where I am and why this is worth waiting for." For tasks longer than 2-3 seconds, you have a UX design problem that a spinner doesn't solve.

The pattern that works: progressive disclosure of intermediate state. Show the user what the model is working on, not just that it's working. Cursor shows which file the agent is editing. Perplexity shows the sources being retrieved before the answer appears. GitHub Copilot streams code as it's generated. Each of these turns a waiting experience into a watching experience.

For longer tasks (agent workflows, complex document analysis), a step-by-step progress indicator with brief descriptions is more honest UX than a progress bar: "Retrieving relevant policies... Drafting response... Checking for compliance..." Users who understand what's happening are more patient and more likely to trust the result.
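
The step-by-step indicator can be modeled as a list of step labels plus the index of the step currently running. A small sketch, assuming your backend can report which step it is on (the function and the status wording are illustrative):

```python
from typing import List


def render_progress(steps: List[str], current: int) -> List[str]:
    """Return one display line per step, marking each done, in progress, or pending."""
    lines = []
    for index, label in enumerate(steps):
        if index < current:
            status = "done"
        elif index == current:
            status = "in progress"
        else:
            status = "pending"
        lines.append(f"[{status}] {label}")
    return lines
```

Unlike a progress bar, this never has to invent a percentage. It only reports steps the system has actually started or finished.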

The magic moment problem. AI features often generate an impressive output the first time, leading a PM to feel the UX is strong. Then you watch ten users actually use it and discover: users don't know what to do with the output. They read it, maybe copy it, and leave — without taking the action you designed the feature to enable.

This is the magic moment problem: the model delivered something impressive, but the product didn't connect that output to a user action. The fix is to design the next-step UX at the same time as the generation UX. What can the user do with this output? Edit, apply, save, share, confirm, reject? Those actions should be immediately available, not buried in a menu.

Trust and correction UX

AI outputs are sometimes wrong. Every PM who has shipped an AI feature knows this. The UX design question is: when the model is wrong, how expensive is the error for the user, and how easy is correction?

Design for correction from the start. Every AI output that can be acted upon should have an inline correction path. This isn't a fallback — it's the primary UX. Users who trust a product enough to use AI-generated outputs are the users who know they can quickly fix errors. Users who don't trust the product do everything manually.

Confidence surfacing. Where the model's confidence genuinely varies, surface it. But be honest about what you know: "I'm not sure about this" is better than a spurious confidence percentage. The model does not output calibrated probabilities for most tasks. If you display "72% confident" without a principled basis, you're manufacturing false precision.

The inline edit pattern (from Cursor, Notion AI, Grammarly) is the highest-trust AI UX pattern: the model proposes a change, the user sees the before/after diff, and accepts or rejects with one click. This pattern works because:

It's transparent (the user can see exactly what changed)
It's low-commitment (rejection is one click)
It builds trust incrementally (users get comfortable with the model's judgment by repeatedly seeing it be right)
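
The mechanics of the pattern are simple: keep both versions, show the diff, and let one decision pick between them. A minimal sketch using Python's standard difflib, where the Proposal class is a hypothetical name for illustration and not how any of the products above implement it:

```python
import difflib
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    """A model-suggested edit that the user can accept or reject."""

    before: str
    after: str

    def diff(self) -> List[str]:
        """Line-level diff: '- ' removed, '+ ' added, '  ' unchanged."""
        lines = difflib.ndiff(self.before.splitlines(), self.after.splitlines())
        # Drop ndiff's '? ' hint lines; they are noise in a user-facing diff.
        return [line for line in lines if not line.startswith("?")]

    def resolve(self, accepted: bool) -> str:
        """One click either way: accepting applies the edit, rejecting keeps the original."""
        return self.after if accepted else self.before
```

Because the original text is never overwritten until the user accepts, rejection costs nothing, and that is where the low commitment comes from.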

For high-stakes outputs (financial data, medical information, legal language, code that runs in production), require explicit user confirmation before acting on the output. The UX pattern: display the output, add a "Confirm and apply" step, and design the confirmation so the user has to actually read the content (not just click through). This is slower, and that's the point.
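
A confirmation gate can be sketched as a check that refuses to apply high-stakes output until the user has explicitly confirmed it. The category names, the PendingOutput class, and the idea of restating the content in the button label are all assumptions made for illustration:

```python
from dataclasses import dataclass

# Assumed categories, mirroring the examples in the paragraph above.
HIGH_STAKES = {"financial", "medical", "legal", "production_code"}


@dataclass
class PendingOutput:
    kind: str          # e.g. "financial" or "draft_email"
    summary: str       # short restatement of what will happen
    confirmed: bool = False

    def requires_confirmation(self) -> bool:
        return self.kind in HIGH_STAKES

    def confirm_label(self) -> str:
        # Restating the content on the button makes a blind click-through harder.
        return f"Confirm and apply: {self.summary}"


def apply_output(output: PendingOutput) -> str:
    """Apply the output, refusing high-stakes output that was never confirmed."""
    if output.requires_confirmation() and not output.confirmed:
        raise PermissionError("explicit user confirmation required")
    return f"applied: {output.summary}"
```

Low-stakes output passes straight through, so the extra step only slows users down where an error is expensive.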

Voice UX specifics

Voice input has its own UX requirements separate from the underlying model capability.

Latency budgets are stricter. For turn-taking conversation, users expect a response in under 1.5 seconds. Beyond 2 seconds, the conversational frame breaks down and the interaction stops feeling like a conversation. Current chained voice pipelines (speech-to-text → LLM → text-to-speech) typically land in the 1-3 second range depending on utterance length and model tier. Plan your architecture against this budget.
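
Planning against the budget is mostly addition: sum the per-stage latencies and compare the total to the 1.5 second target. A back-of-the-envelope sketch, where the stage names and the millisecond figures in the usage note are made-up placeholders to replace with your own measurements:

```python
from typing import Dict


def budget_report(stages_ms: Dict[str, int], budget_ms: int = 1500) -> Dict[str, object]:
    """Sum per-stage latencies (in ms) and compare against the turn-taking budget."""
    total = sum(stages_ms.values())
    slowest = max(stages_ms, key=lambda name: stages_ms[name])
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,  # negative means over budget
        "within_budget": total <= budget_ms,
        "slowest_stage": slowest,  # the first place to look for savings
    }
```

For example, hypothetical stages of 300 ms speech-to-text, 600 ms to the first LLM token, 300 ms to the first synthesized audio, and 100 ms of network overhead total 1,300 ms. That leaves 200 ms of headroom, and the LLM stage is the one to optimize first.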

Handling ambiguity in voice input. Speech-to-text errors are more common than typing errors and less predictable. Your AI feature needs to handle mishearings and misparses gracefully: "I think I heard [X]. Is that right?" is better than proceeding confidently on a misheard input.
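
One simple way to decide when to ask is a threshold on the transcription confidence. This sketch assumes your speech-to-text service reports a confidence score between 0 and 1 for each utterance. Not every service does, and the 0.85 cutoff is an arbitrary starting point to tune against real recordings:

```python
from typing import Optional


def confirmation_prompt(
    transcript: str,
    stt_confidence: float,
    threshold: float = 0.85,
) -> Optional[str]:
    """Return a read-back question when the transcript is uncertain, else None."""
    if stt_confidence >= threshold:
        return None  # confident enough to proceed without interrupting the user
    return f'I think I heard "{transcript}". Is that right?'
```

Set the threshold too high and every turn becomes a confirmation. Set it too low and misheard commands slip through. Tune it against how costly the downstream action is.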

Visual confirmation for voice-triggered actions. If a voice command triggers an action (send message, add item to cart, schedule meeting), show the action on screen before confirming. Voice-only confirmation is error-prone. The UX rule: voice as the trigger, visual as the confirmation.

What to do this week

Audit your AI features for partial state UX. Pick one feature. Open it, submit a query, and watch what happens in the 0-5 seconds before the response arrives. Is that experience honest and informative, or just a spinner?
Define your correction flow. For one AI output your product generates, write down: what can the user do if the output is wrong? How many steps does correction take? If it takes more than two steps, you have UX debt.
Test the magic moment. Watch three users use your AI feature end-to-end (in recorded or moderated sessions). Note: what do they do immediately after seeing the AI output? If they pause and look confused, you have found the gap between generation and next-step UX.

Where to go next

Safety and Auditability — the policy and UX patterns for when the model is wrong
Latency and Cost — the engineering tradeoffs behind the UX latency experience
UX Principles — the baseline design judgment that underlies every AI UX decision