LLM Cost & Latency Optimization

Your demo worked. The bill and the p95 did not.

The demo was magic. You wired up a model, pasted in a fat system prompt, stuffed the whole knowledge base into the context, and it answered beautifully. Then you shipped it to real traffic. Two things broke at once: the monthly bill climbed past what the feature is worth, and p95 latency (the slow-tail experience your impatient users actually feel) crept toward six, eight, even ten seconds. Suddenly the magic feels like a liability.

Here is the good news: almost none of this requires a smarter model. It requires *engineering*. The same disciplines you already apply to databases and APIs (caching, routing, trimming, batching) map cleanly onto LLM calls. This article is the playbook: the levers that make an LLM app cheap and fast enough to keep in production, in the order you should reach for them.

Who this is for

Engineers who have an LLM feature *working* and now need it to be **affordable and responsive** at scale. You should be comfortable reading Python and thinking about percentiles and per-request cost. No ML background is required: this is systems work, not model training. New to how these models tick? Start with [How LLMs Actually Work](/blog/how-llms-actually-work).

The principle: don't send the token in the first place

The cheapest, fastest token is the one you never send.
The whole article in one line

Every optimization here is a variation on that idea. You pay, in dollars *and* in milliseconds, roughly in proportion to the tokens that flow in and out of the model. Input tokens cost money to send and add to the time-to-first-token. Output tokens cost more per unit *and* are generated one at a time, so they dominate latency. So the entire game is: send fewer tokens, generate fewer tokens, and avoid the call entirely when you can.
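To make that proportionality concrete, here is a back-of-the-envelope estimator. Treat it as a sketch under stated assumptions: the `ModelProfile` fields and every number in the example are made up for illustration, not any provider's real prices or speeds, and network and queueing time are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical pricing and speed figures; replace with your own measurements."""
    usd_per_1m_input: float      # price per million input tokens
    usd_per_1m_output: float     # price per million output tokens
    prefill_tokens_per_s: float  # how fast the model ingests the prompt
    decode_tokens_per_s: float   # how fast it generates output, one token at a time


def estimate_cost_usd(p: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: both token counts weighted by their unit price."""
    return (input_tokens * p.usd_per_1m_input
            + output_tokens * p.usd_per_1m_output) / 1_000_000


def estimate_latency_s(p: ModelProfile, input_tokens: int,
                       output_tokens: int) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds."""
    ttft = input_tokens / p.prefill_tokens_per_s
    total = ttft + output_tokens / p.decode_tokens_per_s
    return ttft, total


# Made-up profile, for illustration only.
example = ModelProfile(usd_per_1m_input=1.0, usd_per_1m_output=4.0,
                       prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0)

cost = estimate_cost_usd(example, input_tokens=4000, output_tokens=500)
ttft, total = estimate_latency_s(example, input_tokens=4000, output_tokens=500)
print(f"cost=${cost:.4f}  ttft={ttft:.1f}s  total={total:.1f}s")
```

With these assumed figures, the 500 output tokens account for 10 of the 10.8 seconds. Halving the output would save about 5 seconds; halving the input would save 0.4.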

Crucially, optimizing cost and optimizing latency are not always the same move. Some levers buy you both (a cache hit is free *and* instant). Some buy one at the expense of the other (batching is cheaper but slower per request). Knowing which is which is what lets you tune deliberately instead of flailing.

Analogy	What it maps to
A pricey consultant who bills by the word	The LLM: every input and output token costs money and time
Checking your notes before booking the meeting	A cache lookup before calling the model at all
Asking the junior analyst the easy questions first	Routing simple requests to a small, cheap model
Sending a one-page brief instead of the whole filing cabinet	Trimming the prompt and context to what's relevant
The consultant talking while they think, not after	Streaming tokens so the user sees output immediately

Treat the model like an expensive specialist consultant, not a search box you spam.

The picture: a request's journey to the cheapest answer

Before reaching for any single model, a well-built LLM service runs each request through a funnel. Every stage is a chance to answer *cheaper* or *not at all*. Only the requests that survive every filter reach the expensive large model.

A request flows through a semantic cache and a router before ever touching a model; streaming carries the answer back token-by-token.

1. Check the semantic cache. Embed the incoming request and look for a near-identical question that has been answered before. On a hit, return the stored answer immediately: no generation tokens are spent, and latency drops to the cost of one embedding and one vector lookup.
2. Route on a cache miss. A lightweight classifier (rules, a tiny model, or a heuristic) decides whether this is an easy request or a hard one. Most production traffic is easy.
3. Try the small model first. Send easy requests to a small, cheap, fast model. It handles the long tail of simple lookups, classifications, and short rewrites for a fraction of the cost.
4. Escalate only when needed. Hard requests, or small-model answers that fail a confidence or validation check, escalate to the large model. You pay the premium only for the requests that truly need it.
5. Stream the answer back. Whichever model answers, stream tokens to the client as they are generated. Total time is unchanged, but perceived latency collapses because the user sees words immediately.
6. Write back to the cache. Store the answer keyed by the request embedding so the next similar question skips the whole funnel.
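The cache check and write-back at the two ends of the funnel can be sketched as follows. This is a minimal illustration, not a production design: `SemanticCache`, `toy_embed`, and `answer_with_cache` are hypothetical names, the letter-count embedding is a stand-in for a real embedding model, and the linear scan would normally be replaced by a vector index.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache keyed by embedding similarity (illustrative only)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold  # higher = fewer hits, fewer wrong matches
        self.entries: list[tuple[Vector, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return the best stored answer if it clears the threshold, else None."""
        query = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (letter counts). Use a real embedding model in practice."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def answer_with_cache(question: str, cache: SemanticCache,
                      call_model: Callable[[str], str]) -> str:
    """Step 1 (cache check) and step 6 (write-back) wrapped around any model call."""
    cached = cache.get(question)
    if cached is not None:
        return cached              # hit: the model is never called
    fresh = call_model(question)   # miss: pay for the call once
    cache.put(question, fresh)
    return fresh
```

The threshold is the knob to watch: set it too low and users get answers to questions they did not ask, set it too high and the cache rarely hits.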

The levers, and what each one actually buys you

These are the seven moves worth knowing, ranked roughly by impact-per-effort. The honest framing matters: each lever has a trade-off, and stacking them blindly can hurt. Read the table as a menu, not a checklist.

Lever	Saves cost?	Saves latency?	Trade-off
Exact / prompt caching	Yes, cached input is heavily discounted	Yes, skips reprocessing	Only helps on repeated prefixes; needs stable prompt ordering
Semantic caching	Yes, full call avoided on hit	Yes, instant on hit	Risk of serving a stale or subtly-wrong match; tune the threshold
Model routing (small first)	Yes, most traffic on cheap model	Yes, small models respond faster	Routing mistakes send hard queries to a weak model; needs a fallback
Prompt / context trimming	Yes, fewer input tokens	Yes, less to process	Trim too aggressively and you cut the context the answer needed
Batching requests	Yes, better throughput per dollar	No, adds queueing delay	Wrong for interactive UX; great for offline / bulk jobs
Smaller model outright	Yes, lower per-token price	Yes, faster generation	Quality drop on complex tasks; verify on your eval set
Streaming	No, same tokens billed	Perceived only, same total time	More complex client code; partial output can mislead if it errors mid-stream

Pick levers by what you're optimizing (cost, latency, or both), and budget for the trade-off.
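To see how the levers compound, and how stacking them blindly can backfire, here is a rough expected-cost calculation for a cache plus try-small-then-escalate setup. It is a sketch under simplifying assumptions: a cache hit is treated as free (embedding cost ignored), an escalated request pays for both the small and the large call, and all rates and prices in the example are invented.

```python
def expected_cost_per_request(cache_hit_rate: float, escalation_rate: float,
                              small_cost_usd: float, large_cost_usd: float) -> float:
    """Average cost of one request through cache -> small model -> optional escalation.

    Every cache miss pays for the small model; a fraction `escalation_rate`
    of misses additionally pays for the large model.
    """
    for rate in (cache_hit_rate, escalation_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    miss_rate = 1.0 - cache_hit_rate
    return miss_rate * (small_cost_usd + escalation_rate * large_cost_usd)


# Invented numbers: 30% cache hits, 20% of misses escalate.
funnel = expected_cost_per_request(0.30, 0.20, small_cost_usd=0.001, large_cost_usd=0.02)
baseline = 0.02  # every request goes straight to the large model
print(f"funnel=${funnel:.4f} vs baseline=${baseline:.4f}")

# Failure mode: no cache hits and everything escalates, so you pay for both models.
worst = expected_cost_per_request(0.0, 1.0, small_cost_usd=0.001, large_cost_usd=0.02)
```

Under these assumptions the funnel averages $0.0035 per request against $0.02 for large-model-only. The `worst` case comes out at $0.021, slightly more than the baseline, which is why the escalation rate belongs on your dashboard.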

Pro tip

Reach for the levers that buy **both** axes first: caching, routing, trimming, and a smaller model. Streaming and batching are special-purpose: streaming fixes *perceived* latency for interactive apps; batching trades latency for cost in offline pipelines.
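If you stream, measure time to first token separately from total time, because that is the number the user feels. A minimal sketch, assuming your client exposes the response as an iterable of text chunks; `stream_with_timing` and its parameters are hypothetical names, and the clock is injectable so the function can be tested without real delays.

```python
import time
from typing import Callable, Iterable, Optional


def stream_with_timing(tokens: Iterable[str],
                       on_token: Callable[[str], None],
                       clock: Callable[[], float] = time.perf_counter) -> dict:
    """Forward each chunk to `on_token` as it arrives and record both latencies."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for token in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the moment the user first sees output
        on_token(token)               # e.g. write to a websocket or stdout
        count += 1
    end = clock()
    return {
        "tokens": count,
        "time_to_first_token_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
    }


# Usage with a stand-in stream; a real one would come from your model client.
stats = stream_with_timing(["Hel", "lo", "!"], on_token=lambda t: print(t, end=""))
```

Log both numbers: streaming should shrink `time_to_first_token_s` while leaving `total_s` roughly where it was.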

A minimal router and token-budget trimmer

Here is a stripped-down version of the funnel as code: a token-budget trimmer that keeps context under a hard cap, and a router that tries a small model first and escalates only when the answer looks weak. The point is the *shape*: swap in your own client, models, and validation.

router.py

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    return len(enc.encode(text))

def trim_to_budget(chunks: list[str], budget: int) -> list[str]:
    """Keep the most relevant chunks first; drop the rest once the budget is spent."""
    kept, used = [], 0
    for chunk in chunks:  # chunks arrive pre-sorted by relevance
        cost = token_len(chunk)
        if used + cost > budget:
            break  # the cheapest token is the one you never send
        kept.append(chunk)
        used += cost
    return kept

SMALL, LARGE = "small-fast-model", "large-strong-model"

def looks_weak(answer: str) -> bool:
    """Cheap heuristics to decide if we must escalate."""
    flags = ("i'm not sure", "i cannot", "as an ai")
    return len(answer) < 20 or any(f in answer.lower() for f in flags)

def answer(question: str, context_chunks: list[str], client) -> str:
    context = "\n\n".join(trim_to_budget(context_chunks, budget=1500))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # 1) try the small model first
    draft = client.complete(model=SMALL, prompt=prompt, max_tokens=300)

    # 2) escalate only if the cheap answer is weak
    if looks_weak(draft):
        return client.complete(model=LARGE, prompt=prompt, max_tokens=300)
    return draft
```

Two details earn their keep. First, `trim_to_budget` enforces a hard ceiling on the retrieved context; without it, a few oversized chunks can silently double your bill on every call. Second, `looks_weak` is deliberately crude; in production you'd validate against the task (did it return valid JSON? did it cite a source?) rather than sniffing for phrases. The architecture is identical to what you'd build for an AI agent: a cheap default path with a deliberate escalation rule.
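As one illustration of task-based validation, here is a sketch of an escalation check for a task where the model is asked to reply with a JSON object. The function name `needs_escalation` and the required keys `answer` and `source` are assumptions made for the example; use whatever contract your own prompt defines.

```python
import json


def needs_escalation(raw_answer: str,
                     required_keys: frozenset[str] = frozenset({"answer", "source"})) -> bool:
    """Escalate unless the draft is a JSON object with every required field non-empty."""
    try:
        parsed = json.loads(raw_answer)
    except (json.JSONDecodeError, TypeError):
        return True  # not JSON at all: the small model ignored the format
    if not isinstance(parsed, dict):
        return True  # valid JSON, but not the object we asked for
    # A missing key or an empty value ("" / null / []) counts as a failed draft.
    return any(not parsed.get(key) for key in required_keys)


print(needs_escalation('{"answer": "Reset it from Settings.", "source": "kb-12"}'))
print(needs_escalation("I'm not sure, but probably Settings?"))
```

A check like this is stricter than phrase sniffing in both directions: a confident but malformed answer escalates, and a short but well-formed one does not.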

Common mistakes that quietly drain the budget

Always using the biggest model. The flagship is the most expensive and the slowest. Most production traffic is easy and a small model handles it fine; you just never measured the split. Route first; reserve the big model for the requests that earn it.
No caching of any kind. Real traffic is repetitive: the same questions, the same system prompt, the same retrieved chunks. Without exact, prompt, or semantic caching you re-pay for identical work all day. Caching is usually the single highest-leverage change you can ship.
Ignoring output tokens. Teams obsess over shrinking the prompt and forget that output tokens cost more *and* dominate latency. A max_tokens cap and a prompt that says "answer in two sentences" often save more than input trimming does.
Letting context grow unbounded. Stuffing the whole knowledge base "just in case" inflates every single call. Retrieve and trim to a token budget; relevance beats volume.
Optimizing latency you can't feel. Teams chase total generation time when the real fix is *perceived* latency. Stream the response and the same model feels much faster, even though the total time is unchanged.
Tuning blind. Shipping levers without per-request cost and p50/p95 latency dashboards. If you can't see the bill and the tail, you can't tell whether a change helped or hurt.
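A dashboard can start as something very small. The sketch below keeps per-request latency and cost in memory and reports p50/p95 using the nearest-rank method; `RequestLog` and `percentile` are hypothetical names, and a real service would export these figures to its metrics system rather than hold them in a list.

```python
import math
from dataclasses import dataclass, field


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


@dataclass
class RequestLog:
    """In-memory record of what each request cost and how long it took."""
    latencies_s: list[float] = field(default_factory=list)
    costs_usd: list[float] = field(default_factory=list)

    def record(self, latency_s: float, cost_usd: float) -> None:
        self.latencies_s.append(latency_s)
        self.costs_usd.append(cost_usd)

    def summary(self) -> dict:
        """Raises ValueError if nothing has been recorded yet."""
        total_cost = sum(self.costs_usd)
        return {
            "requests": len(self.latencies_s),
            "p50_s": percentile(self.latencies_s, 50),
            "p95_s": percentile(self.latencies_s, 95),
            "mean_cost_usd": total_cost / len(self.costs_usd),
            "total_cost_usd": total_cost,
        }


log = RequestLog()
for seconds in range(1, 21):  # twenty fake requests taking 1s..20s
    log.record(latency_s=float(seconds), cost_usd=0.01)
report = log.summary()
```

Capture a summary like this before and after each change; a lever that improves p50 while p95 or mean cost gets worse is not a win.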

Takeaways

The whole article in seven lines

The cheapest, fastest token is the one you never send; that single idea drives every lever.
You pay in dollars and milliseconds roughly in proportion to input and output tokens.
Run requests through a funnel: semantic cache, then router, then small model, then large model only on escalation.
Caching, routing, trimming, and a smaller model save **both** cost and latency; reach for these first.
Batching trades latency for cost (offline only); streaming leaves cost and total time unchanged but transforms *perceived* latency.
Output tokens cost more and dominate latency, so cap them and ask for shorter answers.
You cannot optimize what you cannot see: instrument per-request cost and p50/p95 before tuning.

Where to go next

Cost and latency optimization is the final discipline that turns an LLM demo into a shippable product. The strongest next step is to wire up a real dashboard (per-request cost, cache hit rate, and the small-vs-large routing split) so every lever you add is measured, not guessed.

Solidify the fundamentals these levers rest on with [How LLMs Actually Work](/blog/how-llms-actually-work), which explains why tokens cost what they cost.
See the routing-and-escalation pattern in a fuller system in Building AI Agents.
Put it all together along the AI Engineer career path, where cost and latency sit alongside evaluation and reliability.
