Your demo worked, but the bill and the p95 latency did not. A practical playbook for making LLM apps cheap and fast enough to ship: budgeting, caching, model routing, and streaming.
The demo was magic. You wired up a model, pasted in a fat system prompt, stuffed the whole knowledge base into the context, and it answered beautifully. Then you shipped it to real traffic. Two things broke at once: the monthly bill climbed past what the feature is worth, and the p95 latency, the slow-tail experience your impatient users actually feel, crept toward six, eight, ten seconds. Suddenly the magic feels like a liability.
Here is the good news: almost none of this requires a smarter model. It requires *engineering*. The same disciplines you already apply to databases and APIs (caching, routing, trimming, batching) map cleanly onto LLM calls. This article is the playbook: the levers that make an LLM app cheap and fast enough to keep in production, in the order you should reach for them.
Who this is for
Engineers who have an LLM feature *working* and now need it to be **affordable and responsive** at scale. You should be comfortable reading Python and thinking about percentiles and per-request cost. No ML background is required: this is systems work, not model training. New to how these models tick? Start with [How LLMs Actually Work](/blog/how-llms-actually-work).
The principle: don't send the token in the first place
The cheapest, fastest token is the one you never send.
Every optimization here is a variation on that idea. You pay, in dollars *and* in milliseconds, roughly in proportion to the tokens that flow in and out of the model. Input tokens cost money to send and add to the time-to-first-token. Output tokens cost more per unit *and* are generated one at a time, so they dominate latency. So the entire game is: send fewer tokens, generate fewer tokens, and avoid the call entirely when you can.
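To make that proportionality concrete, here is a back-of-the-envelope estimator. It is a rough sketch under a simple linear assumption (it ignores network overhead and queueing), and every price and speed in it is an invented placeholder, not any provider's real number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model numbers; replace with your provider's real ones."""
    usd_per_1k_input: float      # price per 1,000 input tokens
    usd_per_1k_output: float     # price per 1,000 output tokens
    prefill_tokens_per_s: float  # how fast the prompt is processed
    decode_tokens_per_s: float   # how fast output tokens are generated


def estimate_cost_usd(p: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Cost grows linearly with tokens in and tokens out."""
    return (input_tokens / 1000) * p.usd_per_1k_input + (output_tokens / 1000) * p.usd_per_1k_output


def estimate_latency_s(p: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Prompt processing delays the first token; output is generated token by token."""
    time_to_first_token = input_tokens / p.prefill_tokens_per_s
    generation_time = output_tokens / p.decode_tokens_per_s
    return time_to_first_token + generation_time


# Placeholder profile, invented for illustration only.
example = ModelProfile(usd_per_1k_input=0.001, usd_per_1k_output=0.004,
                       prefill_tokens_per_s=5000, decode_tokens_per_s=50)

# A "stuff everything in" request versus a trimmed, capped one.
fat = (estimate_cost_usd(example, 8000, 500), estimate_latency_s(example, 8000, 500))
lean = (estimate_cost_usd(example, 1500, 150), estimate_latency_s(example, 1500, 150))
```

With these made-up numbers the fat request comes to about $0.010 and 11.6 s, and the lean one to about $0.0021 and 3.3 s. Note where the latency goes: in both cases most of it is output generation, not prompt processing.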
Crucially, optimizing cost and optimizing latency are not always the same move. Some levers buy you both (a cache hit skips the model call, so it is close to free *and* close to instant). Some buy one at the expense of the other (batching is cheaper but slower per request). Knowing which is which is what lets you tune deliberately instead of flailing.
| Analogy | In an LLM app |
| --- | --- |
| A pricey consultant who bills by the word | The LLM: every input and output token costs money and time |
| Checking your notes before booking the meeting | A cache lookup before calling the model at all |
| Asking the junior analyst the easy questions first | Routing simple requests to a small, cheap model |
| Sending a one-page brief instead of the whole filing cabinet | Trimming the prompt and context to what's relevant |
| The consultant talking while they think, not after | Streaming tokens so the user sees output immediately |
Treat the model like an expensive specialist consultant, not a search box you spam.
The picture: a request's journey to the cheapest answer
Before reaching for any single model, a well-built LLM service runs each request through a funnel. Every stage is a chance to answer *cheaper* or *not at all*. Only the requests that survive every filter reach the expensive large model.
A request flows through a semantic cache and a router before ever touching a model; streaming carries the answer back token-by-token.
1. Check the semantic cache. Embed the incoming request and look for a near-identical question answered before. On a hit, return the stored answer almost instantly: zero generation tokens, and a lookup that takes milliseconds instead of seconds.
2. Route on a cache miss. A lightweight classifier (rules, a tiny model, or a heuristic) decides whether this is an easy request or a hard one. Most production traffic is easy.
3. Try the small model first. Send easy requests to a small, cheap, fast model. It handles the long tail of simple lookups, classifications, and short rewrites for a fraction of the cost.
4. Escalate only when needed. Hard requests, or small-model answers that fail a confidence/validation check, escalate to the large model. You pay the premium only for the requests that truly need it.
5. Stream the answer back. Whichever model answers, stream tokens to the client as they generate. Total time is unchanged, but perceived latency collapses because the user sees words immediately.
6. Write back to the cache. Store the answer keyed by the request embedding so the next similar question skips the whole funnel.
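The first and last steps of the funnel (check the cache, write back to it) can be sketched in a few lines. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the cache is an in-memory list with a linear scan, and the `0.85` threshold and the `call_model` callable are hypothetical placeholders.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold  # assumption: tune this on your own traffic
        self.entries: list[tuple[Counter, str]] = []

    def get(self, question: str) -> Optional[str]:
        query = toy_embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:  # linear scan; use a vector index at scale
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((toy_embed(question), answer))


def answer_with_cache(question: str, cache: SemanticCache,
                      call_model: Callable[[str], str]) -> str:
    cached = cache.get(question)
    if cached is not None:
        return cached            # hit: no model call at all
    fresh = call_model(question)  # miss: pay for the call once
    cache.put(question, fresh)    # write back for the next similar question
    return fresh
```

The threshold is the whole trade-off: set it too low and users get a confident answer to a different question; set it too high and the cache never hits. A real deployment would also need expiry so stale answers age out.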
The levers, and what each one actually buys you
These are the seven moves worth knowing, ranked roughly by impact-per-effort. The honest framing matters: each lever has a trade-off, and stacking them blindly can hurt. Read the table as a menu, not a checklist.
| Lever | Saves cost? | Saves latency? | Trade-off |
| --- | --- | --- | --- |
| Exact / prompt caching | Yes, cached input is heavily discounted | Yes, skips reprocessing | Only helps on repeated prefixes; needs stable prompt ordering |
| Semantic caching | Yes, full call avoided on hit | Yes, instant on hit | Risk of serving a stale or subtly-wrong match; tune the threshold |
| Model routing (small first) | Yes, most traffic on cheap model | Yes, small models respond faster | Routing mistakes send hard queries to a weak model; needs a fallback |
| Prompt / context trimming | Yes, fewer input tokens | Yes, less to process | Trim too aggressively and you cut the context the answer needed |
| Batching requests | Yes, better throughput per dollar | No, adds queueing delay | Wrong for interactive UX; great for offline / bulk jobs |
| Smaller model outright | Yes, lower per-token price | Yes, faster generation | Quality drop on complex tasks; verify on your eval set |
| Streaming | No, same tokens billed | Perceived only, same total time | More complex client code; partial output can mislead if it errors mid-stream |
Pick levers by what you're optimizing (cost, latency, or both) and budget for the trade-off.
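The "stable prompt ordering" caveat for exact and prompt caching is easy to miss, so here is a small sketch of what it means in practice. It assumes a fixed set of reference documents and a provider that discounts repeated prompt prefixes; `SYSTEM_PROMPT`, `build_prompt`, `exact_cache_key`, and `shared_prefix_len` are hypothetical names for illustration.

```python
import hashlib

# Hypothetical system prompt; the point is that it never changes between calls.
SYSTEM_PROMPT = "You are a support assistant. Answer briefly and cite the policy section."


def build_prompt(reference_docs: list[str], question: str) -> str:
    """Static content first, volatile content last, so repeated calls share a prefix."""
    stable_docs = "\n\n".join(sorted(reference_docs))  # deterministic order
    return f"{SYSTEM_PROMPT}\n\nReference:\n{stable_docs}\n\nQuestion: {question}"


def exact_cache_key(model: str, prompt: str) -> str:
    """Key for an exact-match response cache: same model and prompt, same key."""
    return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Two requests with the same documents in a different order still produce prompts that are identical up to the question, which is the part a prefix cache can reuse. Put the question or a timestamp at the top instead and that shared prefix shrinks to nothing.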
Pro tip
Reach for the levers that buy **both** axes first: caching, routing, trimming, and a smaller model. Streaming and batching are special-purpose: streaming fixes *perceived* latency for interactive apps; batching trades latency for cost in offline pipelines.
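To see why streaming only changes *perceived* latency, it helps to measure time-to-first-token separately from total time. The sketch below uses `fake_token_stream` as a stand-in for a streaming model client (an assumption; real clients expose their own streaming iterators) and a hypothetical `on_token` callback that would write to the HTTP response.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(tokens: list[str], delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming model client: yields one token per delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def consume_stream(stream: Iterable[str], on_token: Callable[[str], None]) -> dict:
    """Forward tokens as they arrive and record first-token versus total time."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        on_token(token)  # e.g. write the token to the client connection
        parts.append(token)
    total = time.perf_counter() - start
    return {"text": "".join(parts), "ttft_s": first_token_at, "total_s": total}
```

The user starts reading at `ttft_s`; without streaming they would wait the full `total_s`. The total is the same either way, which is why the two numbers deserve separate lines on a dashboard.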
A minimal router and token-budget trimmer
Here is a stripped-down version of the funnel as code: a token-budget trimmer that keeps context under a hard cap, and a router that tries a small model first and escalates only when the answer looks weak. The point is the *shape*; swap in your own client, models, and validation.
router.py
python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def token_len(text: str) -> int:
    return len(enc.encode(text))


def trim_to_budget(chunks: list[str], budget: int) -> list[str]:
    """Keep the most relevant chunks first; drop the rest once the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:  # chunks arrive pre-sorted by relevance
        cost = token_len(chunk)
        if used + cost > budget:
            break  # the cheapest token is the one you never send
        kept.append(chunk)
        used += cost
    return kept


SMALL, LARGE = "small-fast-model", "large-strong-model"


def looks_weak(answer: str) -> bool:
    """Cheap heuristics to decide if we must escalate."""
    flags = ("i'm not sure", "i cannot", "as an ai")
    return len(answer) < 20 or any(f in answer.lower() for f in flags)


def answer(question: str, context_chunks: list[str], client) -> str:
    context = "\n\n".join(trim_to_budget(context_chunks, budget=1500))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # 1) try the small model first
    draft = client.complete(model=SMALL, prompt=prompt, max_tokens=300)
    # 2) escalate only if the cheap answer is weak
    if looks_weak(draft):
        return client.complete(model=LARGE, prompt=prompt, max_tokens=300)
    return draft
Two details earn their keep. First, trim_to_budget enforces a hard input ceiling; without it, a few oversized retrieved chunks can silently double your bill on every call. Second, looks_weak is deliberately crude; in production you'd validate against the task (did it return valid JSON? did it cite a source?) rather than sniffing for phrases. The architecture is identical to what you'd build for an AI agent: a cheap default path with a deliberate escalation rule.
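As one illustration of validating against the task instead of sniffing for phrases, the check below escalates whenever the draft is not valid JSON or is missing a cited source. The schema (`answer` and `source` keys) and the name `needs_escalation` are assumptions made up for this sketch; your own task defines what a valid draft looks like.

```python
import json

REQUIRED_KEYS = {"answer", "source"}  # hypothetical schema for illustration


def needs_escalation(raw: str) -> bool:
    """Escalate when the draft is not valid JSON or misses required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return True  # not JSON at all
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return True  # wrong shape or missing keys
    # An empty answer or an empty citation also counts as a failed draft.
    return not all(isinstance(data[k], str) and data[k].strip() for k in REQUIRED_KEYS)
```

This would slot in where looks_weak is called: the small model's draft is accepted only when it parses and carries both fields, and everything else goes to the large model.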
Common mistakes that quietly drain the budget
Always using the biggest model. The flagship is the most expensive and the slowest. Most production traffic is easy and a small model handles it fine; you just never measured the split. Route first; reserve the big model for the requests that earn it.
No caching of any kind. Real traffic is repetitive: the same questions, the same system prompt, the same retrieved chunks. Without exact, prompt, or semantic caching you re-pay for identical work all day. Caching is usually the single highest-leverage change you can ship.
Ignoring output tokens. Teams obsess over shrinking the prompt and forget that output tokens cost more *and* dominate latency. A max_tokens cap and a prompt that says "answer in two sentences" often save more than any input trimming.
Letting context grow unbounded. Stuffing the whole knowledge base "just in case" inflates every single call. Retrieve and trim to a token budget; relevance beats volume.
Optimizing latency you can't feel. Teams chase total generation time when the real fix is *perceived* latency. Stream the response and the same model feels far faster, even though the total time is unchanged.
Tuning blind. Shipping levers without per-request cost and p50/p95 latency dashboards. If you can't see the bill and the tail, you can't tell whether a change helped or hurt.
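The last mistake is the easiest to fix with very little code. A minimal sketch of the numbers worth tracking, assuming you log one record per request; the `RequestRecord` fields, the nearest-rank percentile method, and the `"small-fast-model"` label are illustrative choices, not a prescribed schema.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_s: float
    cost_usd: float
    cache_hit: bool
    model: str


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list[RequestRecord]) -> dict:
    """Roll per-request logs up into the handful of numbers a dashboard needs."""
    n = len(records)
    latencies = [r.latency_s for r in records]
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "avg_cost_usd": sum(r.cost_usd for r in records) / n,
        "cache_hit_rate": sum(r.cache_hit for r in records) / n,
        "small_model_share": sum(r.model == "small-fast-model" for r in records) / n,
    }
```

Run `summarize` before and after each change. If p95 or average cost does not move, the lever did not help on your traffic, whatever it did in the demo.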
Takeaways
The whole article in seven lines
The cheapest, fastest token is the one you never send; that single idea drives every lever.
You pay in dollars and milliseconds roughly in proportion to input and output tokens.
Run requests through a funnel: semantic cache, then router, then small model, then large model only on escalation.
Caching, routing, trimming, and a smaller model save **both** cost and latency; reach for these first.
Batching trades latency for cost (offline only); streaming trades nothing but transforms *perceived* latency.
Output tokens cost more and dominate latency, so cap them and ask for shorter answers.
You cannot optimize what you cannot see: instrument per-request cost and p50/p95 before tuning.
Where to go next
Cost and latency optimization is the final discipline that turns an LLM demo into a shippable product. The strongest next step is to wire a real dashboard (per-request cost, cache hit rate, and the small-vs-large routing split) so every lever you add is measured, not guessed.
Solidify the fundamentals these levers rest on with How LLMs Actually Work, which explains why tokens cost what they cost.
See the routing-and-escalation pattern in a fuller system in Building AI Agents.
Put it all together along the AI Engineer career path, where cost and latency sit alongside evaluation and reliability.