AI-Native UX: Streaming, Optimistic UI, and Designing for Uncertainty
LLM features break the request/response habits of the web. Here is how to stream tokens, give instant feedback with rollback, and design honestly for a model that is sometimes wrong.
You wire up an LLM endpoint, fire a request, and show a spinner. Eight seconds later a wall of text appears. It works in the demo. Then real users arrive, the spinner feels broken, people click again, costs double, and nobody trusts the answer because they cannot tell where it came from. The model is not the problem here; your interaction model is.
Who this is for
Frontend and full-stack engineers shipping their first (or fifth) LLM-powered feature. You know React. You have an [LLM API working](/blog/working-with-the-llm-api). Now you need a UI that handles latency, non-determinism, and the fact that the model is sometimes confidently wrong.
AI-native UX is the set of patterns that make slow, probabilistic, occasionally-wrong responses feel fast, trustworthy, and in the user's control. Three pillars: stream so latency disappears, give optimistic feedback so actions feel instant, and design for uncertainty so people can verify, edit, and undo.
A new mental model: thinking out loud vs the vending machine
A traditional API call is a vending machine: you pay, you wait, you either get the exact item or an error. An LLM is a person thinking out loud: they start talking before they have finished the thought, they might be wrong, and you can interrupt them.
| Analogy | In the UI |
| --- | --- |
| A colleague starts answering before they have the full thought, words arriving one at a time | Token streaming over SSE: render each chunk as it arrives instead of waiting for the full response |
| You raise a hand and say 'stop, that is not what I meant' | An abort/stop button that cancels the in-flight stream (and the billing) |
| They cite where they read something so you can check | Showing sources, retrieval chunks, or tool calls alongside the answer |
| They say 'I think it was Tuesday, but double-check' | Confidence signals and edit/undo affordances instead of presenting output as fact |
| A vending machine: pay, wait, get the item or a flashing error | The old request/response model: one blocking call, one final payload, a spinner in between |
Stop designing AI features like vending machines. Design them like conversations.
The streaming flow, end to end
Before any code, hold the whole pipeline in your head. The browser opens one request to your own server route. Your route calls the model provider, which streams tokens back. Your route relays those tokens to the browser as Server-Sent Events, and the UI appends each chunk to the screen as it lands.
Tokens flow left to right and render progressively; nothing waits for the full answer.
1. **User submits.** The hook POSTs the full message history to your own route. Never call the provider directly from the browser; that leaks your API key.
2. **Route opens a model stream.** `streamText` starts the provider call and returns immediately with a readable stream, not a finished string.
3. **Tokens relay as SSE.** Each model chunk is forwarded to the browser over a long-lived `text/event-stream` response. One HTTP request, many events.
4. **UI appends per chunk.** The hook concatenates each delta into the assistant message and re-renders, so words appear as they are generated.
5. **Stream closes.** On the final chunk the connection ends, the stop button flips back to send, and you persist the completed message.
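Steps 3 and 4 can be sketched without any SDK. The block below is a minimal illustration of generic SSE framing, not the AI SDK's own stream protocol: `formatSseEvent` and `createSseParser` are hypothetical helper names, and the parser only handles simple `data: ` lines separated by a blank line (real SSE also allows comments, `\r\n` line endings, and other fields).

```typescript
// Server side: frame one chunk of text as a Server-Sent Event.
// JSON-encoding the payload keeps newlines inside a token from breaking the frame.
function formatSseEvent(delta: string): string {
  return `data: ${JSON.stringify(delta)}\n\n`;
}

// Client side: an incremental parser. Network chunks can split an event
// anywhere, so incomplete text is buffered until its blank-line terminator arrives.
function createSseParser() {
  let buffer = '';
  return function feed(chunk: string): string[] {
    buffer += chunk;
    const frames = buffer.split('\n\n');
    buffer = frames.pop() ?? ''; // the last piece is incomplete (or empty)
    const deltas: string[] = [];
    for (const frame of frames) {
      for (const line of frame.split('\n')) {
        if (line.startsWith('data: ')) {
          deltas.push(JSON.parse(line.slice('data: '.length)) as string);
        }
      }
    }
    return deltas;
  };
}

// Usage: two events arrive split across three network chunks.
const wire = formatSseEvent('Hello') + formatSseEvent(', world');
const feed = createSseParser();
let assistantText = '';
for (const chunk of [wire.slice(0, 9), wire.slice(9, 20), wire.slice(20)]) {
  for (const delta of feed(chunk)) assistantText += delta; // step 4: append per chunk
}
console.log(assistantText); // Hello, world
```

The buffering is the part that is easy to miss: a chunk boundary can fall in the middle of an event, so the UI should only append text once a full event has arrived.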
Why SSE and not WebSockets
Streaming an answer is one-directional server-to-client text, which is exactly what SSE is built for: it is plain HTTP, auto-reconnects, and needs no extra infra. Reach for WebSockets only when you need true bidirectional, low-latency exchange. See [realtime APIs: WebSockets, SSE, and long polling](/blog/realtime-apis-websockets-sse) for the full comparison.
A streaming chat UI with the Vercel AI SDK
The Vercel AI SDK collapses the entire pipeline above into two pieces: a useChat hook on the client and a streamText call in the route. The hook manages the message list, the streaming append, the input state, and the loading flag. Here is a complete client component, including the all-important stop button.
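As a rough, framework-free picture of the state the hook is described as managing, the message list, streaming append, input state, and loading flag can be modeled as a small reducer. This is a sketch for intuition, not the SDK's implementation; `ChatState`, `ChatAction`, `chatReducer`, and the action names are all hypothetical.

```typescript
type Message = { id: string; role: 'user' | 'assistant'; content: string };

type ChatState = { messages: Message[]; input: string; isLoading: boolean };

type ChatAction =
  | { type: 'input'; value: string }
  | { type: 'submit'; userId: string; assistantId: string }
  | { type: 'delta'; text: string }
  | { type: 'finish' } // stream closed normally
  | { type: 'stop' }; // user pressed stop; the partial answer is kept

const initialChat: ChatState = { messages: [], input: '', isLoading: false };

function chatReducer(state: ChatState, action: ChatAction): ChatState {
  switch (action.type) {
    case 'input':
      return { ...state, input: action.value };
    case 'submit':
      // Ignore empty input and repeat clicks while a stream is in flight.
      if (state.isLoading || state.input.trim() === '') return state;
      return {
        input: '',
        isLoading: true,
        messages: [
          ...state.messages,
          { id: action.userId, role: 'user', content: state.input },
          { id: action.assistantId, role: 'assistant', content: '' },
        ],
      };
    case 'delta': {
      if (!state.isLoading) return state; // late chunk after stop: drop it
      const last = state.messages[state.messages.length - 1];
      return {
        ...state,
        messages: [
          ...state.messages.slice(0, -1),
          { ...last, content: last.content + action.text },
        ],
      };
    }
    case 'finish':
    case 'stop':
      return { ...state, isLoading: false };
  }
}

// Usage: one turn with an impatient second click and a stop mid-answer.
const actions: ChatAction[] = [
  { type: 'input', value: 'Hi' },
  { type: 'submit', userId: 'u1', assistantId: 'a1' },
  { type: 'submit', userId: 'u2', assistantId: 'a2' }, // ignored: already loading
  { type: 'delta', text: 'Hel' },
  { type: 'delta', text: 'lo' },
  { type: 'stop' },
  { type: 'delta', text: '!!!' }, // arrives after stop: dropped
];
const finalChat = actions.reduce(chatReducer, initialChat);
console.log(finalChat.messages.map((m) => `${m.role}: ${m.content}`));
// [ 'user: Hi', 'assistant: Hello' ]
```

Modeling the loading flag explicitly is what makes the double-submit guard and the send/stop toggle cheap to get right.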
The route is where tokens are born. streamText kicks off the provider call and hands back a stream you turn straight into a streaming HTTP response. Note it is an Edge-friendly handler returning a Response, not JSON.
app/api/chat/route.ts
```typescript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export const runtime = 'edge';
export const maxDuration = 30;

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: 'You are concise. If unsure, say so and suggest how to verify.',
    messages,
    // abort propagates: if the client disconnects, stop billing
    abortSignal: req.signal,
  });

  // relays tokens to the browser as an SSE-style stream
  return result.toDataStreamResponse();
}
```
The system prompt is part of your UX
Telling the model to admit uncertainty ("if unsure, say so") changes what the user sees. Pair it with the security posture from [prompt injection and LLM security](/blog/prompt-injection-and-llm-security): never trust message content as instructions, and never render model output as raw HTML.
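To make the "never render model output as raw HTML" rule concrete: React already escapes text placed in JSX, so the risk appears when you bypass that (for example with `dangerouslySetInnerHTML`) or build HTML strings by hand. In those cases the output has to be escaped first. A minimal sketch follows; `escapeHtml` is a hypothetical helper, and it does not replace a real sanitizer if you need to allow some markup through.

```typescript
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Escape the five characters that let text break out into markup.
function escapeHtml(untrusted: string): string {
  return untrusted.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Usage: model output that tries to smuggle in an event handler.
const modelOutput = 'Sure! <img src=x onerror="alert(1)">';
console.log(escapeHtml(modelOutput));
// Sure! &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```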
Optimistic UI with rollback
Streaming hides latency for the answer. Optimistic UI hides it for the side effects: saving a generated draft, applying an AI edit, accepting a suggestion. The pattern: update local state immediately as if the server already said yes, fire the request in the background, and if it fails, roll the state back and tell the user.
The single most common optimistic-UI bug is mutating state in place so there is nothing to roll back to. Capture the pre-change value first, then update. React 19's `useOptimistic` hook automates the snapshot/rollback dance when the mutation lives in a transition.
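The snapshot-then-update order can be sketched without React. This is a hand-rolled illustration of the pattern, not `useOptimistic` itself; `createDraftStore`, `saveOptimistically`, `save`, and `notify` are hypothetical names standing in for your own state and request code.

```typescript
type Draft = { title: string; body: string };

// A tiny store standing in for component state (hypothetical, for illustration).
function createDraftStore(initial: Draft) {
  let current = initial;
  return {
    get: () => current,
    // Apply a change immediately and return a handle that can undo it.
    applyOptimistic(patch: Partial<Draft>) {
      const snapshot = current; // 1. capture the pre-change value first
      current = { ...current, ...patch }; // 2. replace the object, never mutate in place
      return {
        rollback() {
          current = snapshot;
        },
      };
    },
  };
}

// The full pattern: update now, save in the background, roll back on failure.
async function saveOptimistically(
  store: ReturnType<typeof createDraftStore>,
  patch: Partial<Draft>,
  save: (draft: Draft) => Promise<void>, // hypothetical request function
  notify: (message: string) => void, // hypothetical toast or banner
): Promise<boolean> {
  const change = store.applyOptimistic(patch);
  try {
    await save(store.get());
    return true;
  } catch {
    change.rollback();
    notify('Could not save. Your previous version was restored.');
    return false;
  }
}

// Usage: the synchronous half, showing what the catch branch relies on.
const store = createDraftStore({ title: 'Untitled', body: '' });
const change = store.applyOptimistic({ title: 'AI-suggested title' });
console.log(store.get().title); // AI-suggested title
change.rollback();
console.log(store.get().title); // Untitled
```

Restoring a whole snapshot is the simplest rollback, but it also discards any other edit made while the request was in flight; if that can happen in your feature, roll back only the fields the failed change touched.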
Traditional UX vs AI-native UX
The shift is not cosmetic. Four assumptions baked into classic web UX break the moment an LLM is in the loop.
| Dimension | Traditional | AI-native |
| --- | --- | --- |
| Latency | Sub-second; a spinner is fine | Seconds; stream tokens, never block |
| Determinism | Same input, same output | Same input, varying output; design for variance |
| Errors | 4xx/5xx with clear states | Plausible-but-wrong; needs verification UI |
| Trust | Implicit; the system is right | Earned; show sources, allow edit/undo |
What changes when the response is slow, probabilistic, and fallible.
Designing for uncertainty
The hardest part is not technical. A model that is wrong 5% of the time but presented as authoritative erodes trust faster than one that is wrong 15% of the time but honest about it. Your UI is where that honesty lives.
Three rules for honest AI UX
**Show your work.** Surface the sources, retrieval chunks, or tool calls behind an answer so users can verify instead of believing.
**Make it editable.** Every generated artifact should be tweakable and undoable; AI output is a draft, not a verdict.
**Never hide that it is AI.** Label generated content clearly; disguising the model as a human is both a trust and a compliance risk.
Structured output deserves special paranoia. When you ask the model for JSON to drive your UI, treat that JSON as hostile input: it can be truncated mid-stream, contain extra prose, or simply omit a required field. Validate before you render.
lib/parseStructured.ts
```typescript
import { z } from 'zod';

const Suggestion = z.object({
  title: z.string().min(1),
  confidence: z.number().min(0).max(1),
});

// model output is untrusted: never assume the shape
export function parseSuggestion(raw: string) {
  try {
    const json = JSON.parse(raw);
    const result = Suggestion.safeParse(json);
    if (!result.success) {
      return { ok: false as const, reason: 'invalid shape' };
    }
    return { ok: true as const, value: result.data };
  } catch {
    // partial / malformed during stream, show a 'still thinking' state
    return { ok: false as const, reason: 'incomplete' };
  }
}
```
Render a low-confidence result with a visible badge, not silently as fact.
If parsing fails mid-stream, keep the skeleton; do not flash an error for a response that is simply not finished.
Give the user a one-click way to regenerate when the structured output is unusable.
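The three points above can be folded into one small decision function that sits between the parser and the renderer. This is a sketch under assumptions: `ParseResult` mirrors the shape returned by `parseSuggestion`, while `chooseView`, the `View` type, and the `0.6` cut-off are hypothetical and should be tuned for your own feature.

```typescript
type Suggestion = { title: string; confidence: number };

type ParseResult =
  | { ok: true; value: Suggestion }
  | { ok: false; reason: string };

type View =
  | { kind: 'skeleton' } // still streaming
  | { kind: 'suggestion'; title: string; badge: 'low confidence' | null }
  | { kind: 'regenerate'; message: string }; // unusable, offer a retry

// Hypothetical threshold, not a recommended value.
const LOW_CONFIDENCE_BELOW = 0.6;

function chooseView(result: ParseResult, streamDone: boolean): View {
  if (result.ok) {
    const low = result.value.confidence < LOW_CONFIDENCE_BELOW;
    return {
      kind: 'suggestion',
      title: result.value.title,
      badge: low ? 'low confidence' : null,
    };
  }
  // A parse failure while tokens are still arriving just means "not finished yet".
  if (!streamDone) return { kind: 'skeleton' };
  // The stream ended and the output is still unusable: let the user retry.
  return { kind: 'regenerate', message: 'That suggestion came back unusable.' };
}

// Usage
console.log(chooseView({ ok: false, reason: 'incomplete' }, false).kind); // skeleton
console.log(chooseView({ ok: false, reason: 'incomplete' }, true).kind); // regenerate
console.log(chooseView({ ok: true, value: { title: 'Rename file', confidence: 0.4 } }, true));
// { kind: 'suggestion', title: 'Rename file', badge: 'low confidence' }
```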
Accessibility of streaming content
Text that appears a token at a time is invisible to screen-reader users unless you tell assistive tech to announce it. The tool is `aria-live`, but the wrong setting is worse than none: `assertive` on a streaming region interrupts the user on every single token.
Use aria-live="polite", not assertive
Wrap the message list in `aria-live="polite"` with `aria-atomic="false"` so the screen reader queues updates and reads new text after the current utterance, instead of stuttering on every chunk. Announce the start of generation ("Assistant is responding") and completion so users know the boundaries. Keep the stop button keyboard-reachable and clearly labelled at all times.
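As a small sketch of those settings in code: the attribute objects below can be spread onto the message list wrapper and the stop button, and `boundaryAnnouncement` produces text only when streaming starts or ends. The names are hypothetical, and the "Response complete" wording and the `aria-label` text are assumptions. One common place to put the returned message is a separate visually hidden element with `role="status"`, which is itself a polite live region.

```typescript
// Attributes for the element that wraps the message list.
const liveRegionProps = { 'aria-live': 'polite', 'aria-atomic': 'false' } as const;

// Attributes for the stop button. A native button is keyboard-focusable by
// default, and the label names the action rather than the icon.
const stopButtonProps = { type: 'button', 'aria-label': 'Stop generating' } as const;

// Returns a status message only at the boundaries of a response, never per token.
function boundaryAnnouncement(wasStreaming: boolean, isStreaming: boolean): string | null {
  if (!wasStreaming && isStreaming) return 'Assistant is responding';
  if (wasStreaming && !isStreaming) return 'Response complete'; // wording is an assumption
  return null;
}

// Usage: one streaming flag per render; only the two transitions announce.
const streamingFlags = [false, true, true, true, false];
const announcements: string[] = [];
for (let i = 1; i < streamingFlags.length; i++) {
  const message = boundaryAnnouncement(streamingFlags[i - 1], streamingFlags[i]);
  if (message !== null) announcements.push(message);
}
console.log(announcements); // [ 'Assistant is responding', 'Response complete' ]
```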
Cancellation and cost
Every streamed token is a billed token. If a user navigates away or hits stop, an un-cancelled stream keeps generating, and keeps charging you, until the answer finishes, which in the worst case means the full max-tokens budget. Wiring up abort is both a UX feature and a cost control.
Cancelling the browser fetch only stops rendering; the model keeps generating server-side unless you forward the signal. Pass `req.signal` into `streamText` (as in the route above) so the abort propagates all the way to the provider and the meter stops.
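The difference between forwarding the signal and dropping it can be shown with a toy simulation. Nothing here talks to a real provider: `fakeProviderCall` and `runRequest` are hypothetical stand-ins, and the token counts are illustrative only. `AbortController` and `AbortSignal` are the same standard primitives that back `req.signal`.

```typescript
// Stand-in for the provider's generation loop. It produces tokens until the
// answer is done (here: maxTokens) or until the signal it was given is aborted.
function fakeProviderCall(
  signal: AbortSignal,
  maxTokens: number,
  onToken: (count: number) => void,
): number {
  let generated = 0;
  while (generated < maxTokens && !signal.aborted) {
    generated++;
    onToken(generated); // relay the token toward the client
  }
  return generated; // tokens generated, and therefore billed
}

// Simulates one request in which the user presses stop after `stopAfter` tokens.
function runRequest(forwardSignal: boolean, stopAfter: number, maxTokens: number): number {
  const client = new AbortController(); // aborted when the user presses stop
  const unrelated = new AbortController(); // never aborted: the signal was not forwarded
  const providerSignal = forwardSignal ? client.signal : unrelated.signal;
  return fakeProviderCall(providerSignal, maxTokens, (count) => {
    if (count === stopAfter) client.abort();
  });
}

console.log(runRequest(true, 3, 500)); // 3
console.log(runRequest(false, 3, 500)); // 500
```

In both runs the user stopped watching after three tokens; only the run that forwarded the signal stopped paying.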
Common mistakes that cost hours (and dollars)
A blocking spinner instead of streaming. An 8-second spinner reads as broken; the same 8 seconds with tokens flowing reads as fast. Stream from day one, because retrofitting it later means rewriting the data layer.
Trusting structured output blindly. `JSON.parse(modelOutput)` will throw in production on a truncated or chatty response. Validate with a schema and handle the partial-stream case explicitly.
No stop or cancel. Users feel trapped watching a wrong answer generate, click again out of frustration, and you pay for two full completions. A stop button that aborts the server stream fixes the UX and the bill at once.
Presenting output as fact. With no sources, no edit, and no confidence signal, the first hallucination the user catches destroys trust in the whole feature.
`aria-live="assertive"` on the stream. It interrupts screen-reader users on every token. Use `polite`, and announce start and end.
Takeaways
The whole article in seven lines
LLM features are conversations, not vending machines: slow, probabilistic, sometimes wrong.
Stream tokens over SSE so latency disappears; a blocking spinner is a UX bug.
The Vercel AI SDK reduces it to `useChat` on the client and `streamText` in the route.
Always ship a stop button, and propagate the abort to the provider to stop the bill.
Use optimistic UI with a captured snapshot so failures roll back cleanly.
Design for uncertainty: show sources, allow edit/undo, never hide that it is AI.
Validate structured output as hostile input, and announce streams politely for screen readers.
Where to go next
AI-native UX sits on top of three layers worth deepening. Get the transport right, get the API call right, and lock down the security boundary; the front-end patterns here only shine when those are solid.
Practice the request/response fundamentals in the networking lab so the streaming layer has a solid base.