A practical mental model for shipping with LLMs: tokens, the context window, next-token prediction, temperature and top-p sampling, and what inference actually costs.
You don't need the math; you need a model that predicts behavior
You wire up an LLM API, send a prompt, and get back something brilliant. Then you change one word and it falls apart. Or it invents a function that doesn't exist. Or your bill triples overnight and you have no idea why. The model feels like a slot machine, and you're the one holding the lever.
Here's the thing: you don't need transformer math or attention diagrams to stop being surprised. You need a working mental model, one that lets you reason about why the model behaves the way it does, what it costs, and where its limits are. That's what this article gives you. No gradients, no softmax derivations. Just the moving parts an engineer touches.
Who this is for
Engineers who are calling, or about to call, an LLM API and want to predict its behavior instead of being surprised by it. If you can read a JSON request body, you're ready. No ML background required.
The one-sentence definition
A large language model is a function that, given a sequence of tokens, predicts the next token.
That's it. Everything else (chat, code generation, summarization, agents) is that single trick repeated. The model looks at everything so far and produces a probability distribution over what comes next. It picks one token, appends it, and looks again. Repeat until done. The "intelligence" is an emergent property of practicing this prediction billions of times over a vast amount of text during training.
Phone keyboard suggesting the next word → Next-token prediction
It learned from everything you've typed → Trained on a huge text corpus
Tapping a suggestion, then seeing new suggestions → Autoregressive generation (one token feeds the next)
A keyboard that read the whole internet and never gets tired → A large language model
If you've ever let your phone keyboard finish a sentence, you already understand the core loop.
Autocomplete on steroids is a genuinely good first model. The differences are scale and memory: instead of guessing one word from the last three, an LLM weighs thousands of prior tokens to guess the next one, and it does so with a sense of grammar, facts, and reasoning patterns absorbed during training.
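To make the keyboard analogy concrete, here is a minimal sketch of the predict, append, repeat loop. It is a toy: the "model" is just a table of which word followed which in a tiny made-up corpus, and the names `train_bigram` and `generate` are invented for illustration. A real LLM replaces the lookup table with a neural network and works on tokens rather than whole words.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, max_words=5):
    """Greedy autoregressive loop: predict the next word, append it, look again."""
    output = [start]
    for _ in range(max_words):
        candidates = follows.get(output[-1])
        if not candidates:              # nothing ever followed this word: stop
            break
        next_word = candidates.most_common(1)[0][0]
        output.append(next_word)        # the prediction becomes part of the input
    return " ".join(output)

corpus = "the cat sat on the mat and the cat sat by the door"
model = train_bigram(corpus)
print(generate(model, "the", max_words=2))
```

The toy only ever looks at the last word. The jump to an LLM is that the prediction is conditioned on thousands of prior tokens instead of one.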
The inference pipeline, end to end
When you send a prompt, it travels through a fixed sequence of stages before you get text back. Knowing these stages tells you exactly where cost is incurred, where text gets cut off, and which knobs change behavior.
One request through an LLM: your text becomes tokens, the model scores every possible next token, the sampler picks one, and the loop repeats until it stops.
1. Prompt. You send raw text: a system prompt, the conversation history, and the user's latest message, all concatenated together.
2. Tokenizer. The text is split into tokens (sub-word chunks) and mapped to integer IDs. "tokenizer" might become two tokens; " the" is usually one. This is the unit everything is counted and billed in.
3. Model forward pass. The token IDs run through the network once. The output isn't a word; it's a score for every token in the vocabulary (tens of thousands of them).
4. Probabilities. Those scores are turned into a probability distribution: token X has a 40% chance of being next, token Y 12%, and so on.
5. Sampler. This is where temperature and top-p live. The sampler decides how to pick from the distribution: always the top choice, or a weighted roll of the dice.
6. Detokenize. The chosen token ID is mapped back to text and streamed to you, which is why responses appear word-by-word.
7. Append & repeat. The new token is appended to the input and the whole loop runs again. Generating 500 tokens means 500 forward passes, one after another. That's why output is slower and pricier than input.
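The seven stages can be sketched in a few lines of Python. Everything here is a stand-in: the four-entry `VOCAB`, the `tokenize` helper, and especially `fake_forward_pass`, which is a hypothetical rule that simply favors the next ID, not a neural network. The point is only the shape of the loop and the fact that each generated token costs one more pass.

```python
import math

VOCAB = ["<end>", "hello", " world", "!"]      # toy vocabulary: token ID = list index

def tokenize(text):
    """Toy tokenizer: repeatedly match the longest vocabulary entry."""
    ids = []
    while text:
        match = max((t for t in VOCAB[1:] if text.startswith(t)), key=len)
        ids.append(VOCAB.index(match))
        text = text[len(match):]
    return ids

def fake_forward_pass(ids):
    """Stand-in for the network: returns one score per vocabulary entry."""
    scores = [0.0] * len(VOCAB)
    scores[(ids[-1] + 1) % len(VOCAB)] = 5.0   # hypothetical rule, not a real model
    return scores

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_tokens=10):
    ids = tokenize(prompt)                       # stage 2: text -> token IDs
    pieces, passes = [], 0
    for _ in range(max_tokens):
        probs = softmax(fake_forward_pass(ids))  # stages 3-4: scores -> probabilities
        passes += 1
        next_id = probs.index(max(probs))        # stage 5: greedy sampler
        if next_id == 0:                         # "<end>" token: stop generating
            break
        pieces.append(VOCAB[next_id])            # stage 6: detokenize
        ids.append(next_id)                      # stage 7: append and repeat
    return "".join(pieces), passes

text, passes = generate("hello")
print(repr(text), passes)
```

The `passes` counter is the useful part: it grows with every output token, which is the cost behavior described in step 7.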
The three knobs you actually control
You can't change the model's weights, but you can change how the sampler reads its output and when generation stops. These three parameters cover ~90% of what you'll tune in practice.
Control: temperature
What it does: Flattens or sharpens the probability distribution before sampling. Low = almost always picks the likeliest token (predictable); high = gives unlikely tokens a real chance (creative, riskier).
Turn it up when: You want brainstorming, varied phrasing, or creative writing.
Turn it down when: You need deterministic, factual, or structured output like JSON or code.

Control: top_p
What it does: Nucleus sampling. Only considers the smallest set of top-ranked tokens whose probabilities add up to at least p, then samples from those. top_p = 0.1 keeps only the very top candidates.
Turn it up when: You want some variety but with a safety rail against truly weird tokens.
Turn it down when: You want tight, focused output. Pair a low top_p with a low temperature.

Control: max_tokens
What it does: A hard cap on how many tokens the model may generate. Hit the cap and generation stops mid-sentence: the model is not 'done', it's cut off.
Turn it up when: You expect long output (an essay, a full file) and need headroom.
Turn it down when: You want short answers and predictable cost, but never set it so low it truncates valid output.

The generation controls most LLM APIs expose, and what each one actually does.
Don't fight temperature and top_p at once
They both narrow randomness, so tuning both together is hard to reason about. Pick one as your primary dial (most teams hold top_p at its default and move temperature) and adjust the other only if you must.
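To see why the two dials overlap, it can help to look at what a sampler plausibly does with them. The sketch below is an illustration under assumptions, not any provider's actual implementation: the function names are invented, the four logits are made up, and treating `temperature == 0` as "always pick the top token" is a common convention rather than a guarantee.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide scores by temperature, then softmax. Lower temperature = sharper."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token ID using temperature first, then nucleus filtering."""
    if temperature == 0:                      # assumed convention: 0 means greedy
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]                # made-up scores for four tokens
print([round(p, 3) for p in apply_temperature(logits, 1.0)])
print([round(p, 3) for p in apply_temperature(logits, 0.2)])
print(sorted(top_p_filter(apply_temperature(logits, 1.0), top_p=0.5)))
```

Lowering the temperature concentrates probability on the top token, and lowering top_p discards the tail outright. Both shrink the set of tokens that can realistically be picked, which is why moving them together makes the effect hard to attribute.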
Here's a request showing all three in context. The shape is the same across providers; only the field names and endpoint differ.
generate.py
python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,    # hard cap on the OUTPUT, stops here even mid-sentence
    temperature=0.2,   # low = focused & repeatable; raise toward 1.0 for variety
    top_p=0.95,        # nucleus sampling; leave near default unless you have a reason
                       # (some models reject requests that set both temperature
                       # and top_p; if yours does, drop this line)
    system="You are a terse senior engineer. Answer in <=3 sentences.",
    messages=[
        {"role": "user", "content": "Explain what a token is, like I ship code."},
    ],
)

print(response.content[0].text)

# Tokens consumed are reported on response.usage:
# input_tokens (your prompt) + output_tokens (what it generated).
print(response.usage)
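Since hitting max_tokens cuts the answer off without any visible error, it is worth checking for it explicitly. The sketch below assumes a response shaped like the one above, where `stop_reason` is `"max_tokens"` when the cap was hit; other providers report the same idea under different field names. The helper `summarize_response` is hypothetical, and the `SimpleNamespace` object is a stand-in so the example runs without a network call.

```python
from types import SimpleNamespace

def summarize_response(response):
    """Return (was_truncated, total_tokens) for a Messages-style response object."""
    truncated = response.stop_reason == "max_tokens"
    total = response.usage.input_tokens + response.usage.output_tokens
    return truncated, total

# Stand-in for a real response so this runs without an API key.
fake_response = SimpleNamespace(
    stop_reason="max_tokens",
    usage=SimpleNamespace(input_tokens=42, output_tokens=512),
)

truncated, total = summarize_response(fake_response)
if truncated:
    print(f"Output was cut off at the cap ({total} tokens used in total)")
```

Logging both values on every call is a cheap way to catch truncated answers and surprise bills early.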
Verify model IDs and pricing before you ship
Model names, context limits, and per-token prices change often. Always confirm the current values from the provider's official docs rather than hard-coding what you read in a blog (including this one).
Why context windows and tokens matter
The context window is the maximum number of tokens the model can see at once. Your system prompt, the full conversation history, retrieved documents, and the answer it's generating all share one budget. Everything important about cost and reliability flows from this single constraint.
Cost is per token, both directions. You pay for input tokens (everything you send) and output tokens (everything generated), and output is usually several times more expensive. A chat app that resends the whole history on every turn pays for that history again every single message.
Truncation is silent and brutal. When the conversation plus the requested output exceeds the window, something has to give. Many chat apps and frameworks quietly drop the oldest tokens from the front (a raw API call will often reject the request instead). The model literally forgets the start of the chat, and it won't tell you it did. Long agent loops are where this bites hardest.
A full window can mean a cut-off answer. If input fills most of the window, there's little room left for output. The model hits the ceiling and stops mid-thought. It looks like a bug, but it's arithmetic.
Context pressure feeds hallucination. When the model can't see the relevant fact (because it was truncated, or never fit), it does what it always does: predicts a *plausible* next token. Plausible-but-wrong is exactly what a hallucination is. More context is not always better: irrelevant filler dilutes the signal too.
Tokens are not words. A rough rule of thumb for English is ~4 characters per token, or ~0.75 words per token. Code, JSON, and non-English text tokenize less efficiently. Estimate before you assume something fits.
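The arithmetic behind these points is simple enough to sketch. The helpers below are illustrative only: `estimate_tokens` uses the rough 4-characters-per-token rule for English (a real tokenizer will give different counts), the 8,000-token window is a made-up number, and `resend_cost_in_tokens` simplifies each turn to "the tokens it adds to the history".

```python
def estimate_tokens(text):
    """Rough English-only estimate: about 4 characters per token."""
    return max(1, len(text) // 4)

def resend_cost_in_tokens(turn_tokens):
    """Total input tokens billed when every turn resends all earlier turns."""
    total_input, history = 0, 0
    for tokens in turn_tokens:
        history += tokens          # this turn joins the history...
        total_input += history     # ...and the whole history is sent again
    return total_input

def room_for_output(context_window, input_tokens):
    """Tokens left for the answer once the input has been counted."""
    return max(0, context_window - input_tokens)

turns = [100] * 10                          # ten turns adding ~100 tokens each
print(sum(turns))                           # tokens of actual conversation
print(resend_cost_in_tokens(turns))         # input tokens billed across the chat
print(room_for_output(8000, 7900))          # hypothetical 8,000-token window
```

The gap between the first two numbers is the cost of resending history, and it grows roughly with the square of the number of turns.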
The practical upshot: treat the context window as a budget you actively manage. Trim history, summarize old turns, retrieve only what's relevant, and measure token usage from the response rather than guessing.
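One way to manage that budget is to keep only the most recent messages that fit. This is a minimal sketch, assuming messages are `{"role", "content"}` dicts ordered oldest first and using the same rough 4-characters-per-token estimate; `trim_history` and the 80-token budget are hypothetical. In practice you would count with the provider's tokenizer and probably summarize what you drop rather than discard it.

```python
def estimate_tokens(text):
    """Rough English-only estimate: about 4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages, budget_tokens):
    """Keep the most recent messages whose estimated tokens fit the budget.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    Returns (kept_messages_oldest_first, estimated_tokens_used).
    """
    kept, used = [], 0
    for message in reversed(messages):          # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget_tokens:
            break                               # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept)), used

history = [
    {"role": "user", "content": "a" * 400},        # ~100 tokens
    {"role": "assistant", "content": "b" * 200},   # ~50 tokens
    {"role": "user", "content": "c" * 80},         # ~20 tokens
]
trimmed, used = trim_history(history, budget_tokens=80)
print([m["role"] for m in trimmed], used)
```

The system prompt is deliberately not part of `history` here, since it usually needs to survive every trim.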
Common misconceptions that trip up new builders
"It looks things up." It doesn't, at least not by default. A base model has no live knowledge and no database; it predicts from patterns frozen at training time. Current facts come from tools, search, or documents *you* put in the context.
"Same prompt, same answer." Only at temperature 0 (and even then, not always guaranteed across infrastructure). With any randomness in the sampler, identical prompts can produce different outputs by design.
"It understands like a person." It models statistical relationships in language extraordinarily well. That produces useful, often reasoning-like behavior, but it has no goals, beliefs, or awareness. Don't anthropomorphize your debugging.
"Bigger context window = better answers." A bigger window lets you fit more in; it doesn't make the model focus. Stuffing irrelevant text in raises cost and can *hurt* accuracy. Relevance beats volume.
"It generates the whole answer at once." It generates one token at a time, each conditioned on the last. That's why it streams, why output cost scales with length, and why early tokens shape the rest of the response.
Takeaways
The whole article in seven lines
An LLM predicts the next token, one token at a time, in a loop.
Text becomes tokens; tokens are the unit you count, bill, and reason in.
The pipeline is: prompt → tokenizer → model → probabilities → sampler → detokenize → output, repeated.
Temperature and top_p control randomness; max_tokens caps output and can truncate it.
The context window is one shared budget for prompt, history, and answer.
Cost is per token in both directions, and output tokens cost more.
Hallucinations are confident next-token guesses when the right context isn't there.
Where to go next
Now that you can predict *why* the model behaves the way it does, the next steps are shaping its input, calling it cleanly, and keeping the bill sane. These build directly on this mental model.
Follow the full AI Engineer career path to go from these foundations to shipping real LLM systems.