Serving Local & Open-Weight Models

The bill (and the rule) that forced the issue

Two things tend to push a team off a hosted LLM API. The first is the invoice. A chatbot that summarizes support tickets looks cheap at $0.50 per million tokens, until it's classifying every message in a high-volume queue and the monthly bill clears five figures. The second is a sentence in a contract: "Customer data must not leave the EU" or "No third-party processing of PHI." Suddenly the cheapest, fastest model in the world is off the table because of *where* it runs, not how well it works.

When either of those hits, you start eyeing the open-weight models (Llama, Qwen, DeepSeek, Mistral) that you can download and run on your own GPUs. This article is the practical path from "I have an API key" to "I have an inference endpoint I own," without the parts that waste a weekend.

Who this is for

Engineers shipping an LLM feature who are staring at an API bill, a data-residency requirement, or a latency floor, and wondering whether self-hosting an open-weight model is worth it. You know what a token and a GPU are; you have not necessarily run `vllm serve` before.

The principle: own the weights when control or cost demands it

Use a hosted API by default. Own the weights the moment control or cost, not novelty, demands it.

Self-hosting is not a flex; it's a trade. You give up the magic of "someone else keeps the GPUs warm and the model patched" in exchange for control over cost, data path, and latency. That trade only pays off past a certain volume, or when a rule makes the hosted option illegal. Below that line, an API is almost always the right call: you're renting capability you'd otherwise babysit.

Everyday choice	LLM equivalent
Renting a car for an airport trip	Calling a hosted API: pay per use, zero ownership
Buying a delivery van because you drive all day	Self-hosting once volume makes per-call pricing hurt
A company car that legally can't leave the country	On-prem weights for data-residency / air-gapped rules
Owning the van means you also fix the van	You now own GPUs, patching, scaling, and on-call

Self-hosting an LLM is a make-vs-buy call you already understand.

The picture: from weights on disk to a response

A self-hosted endpoint is four moving parts in a line. You download an open-weight checkpoint, optionally quantize it so it fits your GPU, load it into a serving engine that handles batching and the API surface, and point your app at the resulting OpenAI-compatible endpoint.

Open-weight checkpoint → quantize → serving engine on a GPU → your app talks to an OpenAI-compatible endpoint.

1. Pick the model. Choose an open-weight checkpoint sized to your task and your GPU, e.g. an 8B for routing/classification, a 70B (quantized) for harder reasoning.
2. Quantize if needed. Drop fp16 weights to a smaller format (AWQ/GGUF/fp8) so the model fits in VRAM with room for the KV cache.
3. Load into a serving engine. vLLM, Ollama, or TGI loads the weights onto the GPU and exposes an HTTP endpoint with batching built in.
4. Swap the base URL. Point your existing OpenAI SDK at the local endpoint. The request/response shape is identical.
5. Measure. Watch tokens/sec, p95 latency, and GPU utilization under real load, and run an eval set before you trust the output.
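
The "measure" step can be sketched in a few lines. This is a minimal, illustrative aggregation, assuming you have already recorded one latency and one completion-token count per request during a load test; the `Sample` type and the function names are hypothetical, not part of any serving engine.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    latency_s: float        # wall-clock time for one request
    completion_tokens: int  # tokens generated for that request


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(samples, wall_clock_s):
    """Aggregate throughput and tail latency for one load-test window."""
    latencies = [s.latency_s for s in samples]
    total_tokens = sum(s.completion_tokens for s in samples)
    return {
        "tokens_per_s": total_tokens / wall_clock_s,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
    }


# Synthetic data: 20 requests, 50 tokens each, latencies from 0.1 s to 2.0 s.
samples = [Sample(latency_s=0.1 * i, completion_tokens=50) for i in range(1, 21)]
report = summarize(samples, wall_clock_s=2.0)
print(report)
```

Feed it samples captured under realistic concurrency; numbers from a single sequential client say little about how the endpoint behaves in production.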

Hosted API vs self-hosted open-weight

The decision rarely comes down to a single number. Lay the two options side by side across the dimensions that actually bite in production:

Dimension	Hosted API	Self-hosted open-weight
Cost shape	Per-token; cheap at low volume, brutal at scale	Fixed GPU spend; cheap per-token once utilized
Latency	Network + provider queue; good but not yours to tune	You own it; batching and locality cut p95
Privacy / residency	Data leaves your boundary; trust the contract	Data never leaves your VPC / region / rack
Model choice	Their menu, their deprecations	Any open checkpoint, pinned forever
Ops burden	Near zero; they patch and scale	Yours: GPUs, scaling, upgrades, on-call
Time to first call	Minutes	A day to stand up, longer to harden

The honest trade-off across cost, latency, privacy, and operational burden.

The rough crossover

A single A10/L4-class GPU runs roughly $0.50–$1.00/hr. If your steady token volume would cost more on a hosted API than one or two well-utilized GPUs cost to rent, self-hosting starts to win on cost, and it wins on day one if residency is a hard rule.
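
To make the crossover concrete, here is a back-of-the-envelope sketch. The $0.50 per million tokens figure comes from the opening example and the GPU rate is the midpoint of the range above; the 730 hours per month and the function names are assumptions for illustration. It ignores ops labor and whether one GPU can actually sustain the required throughput.

```python
HOURS_PER_MONTH = 730  # assumed average hours in a month


def hosted_monthly_cost(tokens_per_month, usd_per_million_tokens):
    """Per-token pricing: cost scales linearly with volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def self_hosted_monthly_cost(gpu_count, usd_per_gpu_hour):
    """Fixed GPU spend: you pay for the hours whether or not the GPU is busy."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH


def breakeven_tokens_per_month(gpu_count, usd_per_gpu_hour, usd_per_million_tokens):
    """Monthly token volume at which the hosted bill equals the GPU bill."""
    gpu_cost = self_hosted_monthly_cost(gpu_count, usd_per_gpu_hour)
    return gpu_cost / usd_per_million_tokens * 1_000_000


gpu_bill = self_hosted_monthly_cost(gpu_count=1, usd_per_gpu_hour=0.75)
crossover = breakeven_tokens_per_month(1, 0.75, usd_per_million_tokens=0.50)
print(f"One GPU: ${gpu_bill:,.2f}/month; break-even at {crossover / 1e9:.3f}B tokens/month")
```

Under these assumptions the raw break-even sits a little above a billion tokens a month per GPU. Real numbers shift with utilization, with how your provider prices input versus output tokens, and with the engineering time you spend keeping the endpoint healthy.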

Stand up a local endpoint

Two engines cover most of the spectrum. Ollama is the fastest way to a working endpoint on a laptop or a single box, great for prototyping and small CPU/GPU setups. vLLM is the production workhorse: high-throughput batching, tensor parallelism, and an OpenAI-compatible server. Both expose the same API shape, so your client code doesn't change.

ollama-quickstart.sh

bash

# Ollama: easiest path to a local endpoint (laptop / single box)
ollama pull llama3.1:8b          # downloads a quantized GGUF by default
ollama run llama3.1:8b           # interactive sanity check

# Ollama also serves an OpenAI-compatible API on :11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "One-line summary of vLLM."}]
  }'

vllm-serve.sh

bash

# vLLM: production throughput on a GPU box
pip install vllm

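# Serve the 16-bit 8B checkpoint; vLLM exposes an OpenAI-compatible server on :8000
# (for 4-bit weights, point at an AWQ-quantized checkpoint and add --quantization awq;
#  the flag fails on a checkpoint that was not quantized with AWQ)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000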

Now the payoff: your application code is the *same* OpenAI client you already use; you only change the `base_url`. That's the whole point of an OpenAI-compatible surface. No rewrite, no new SDK.

call_local.py

python

from openai import OpenAI

# Point the same SDK at your local engine instead of the hosted API
client = OpenAI(
    base_url="http://localhost:8000/v1",  # or :11434/v1 for Ollama
    api_key="not-needed-locally",          # any non-empty string
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # use "llama3.1:8b" when pointing at Ollama
    messages=[{"role": "user", "content": "Classify this ticket: 'card declined twice'"}],
    temperature=0,
)
print(resp.choices[0].message.content)
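
One way to keep the swap reversible is to read the endpoint from the environment, so the same build can talk to the hosted API, vLLM, or Ollama. A minimal sketch follows; the variable names `LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL` are hypothetical, and the defaults simply mirror the local vLLM example above.

```python
import os


def client_settings(env=None):
    """Resolve endpoint settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        # Defaults mirror the local vLLM example; override per environment.
        "base_url": env.get("LLM_BASE_URL", "http://localhost:8000/v1"),
        "api_key": env.get("LLM_API_KEY", "not-needed-locally"),
        "model": env.get("LLM_MODEL", "meta-llama/Meta-Llama-3.1-8B-Instruct"),
    }


# Example: the same code path, pointed at Ollama instead of vLLM.
ollama = client_settings({
    "LLM_BASE_URL": "http://localhost:11434/v1",
    "LLM_MODEL": "llama3.1:8b",
})
print(ollama["base_url"], ollama["model"])
```

Pass `base_url` and `api_key` to `OpenAI(...)` and `model` to `chat.completions.create(...)`. The model name has to change along with the URL, because each engine registers the model under its own identifier.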

Quantization, GPU sizing, and throughput

Quantization: fit the model on the GPU

Most open-weight checkpoints ship in 16-bit precision (fp16 or bf16), two bytes per parameter. An 8B model is ~16 GB before you've loaded a single token of context; a 70B is ~140 GB and won't fit on any single consumer card. Quantization stores weights in fewer bits (often 4), cutting memory roughly 3–4x with a small, usually acceptable quality hit. Rules of thumb: GGUF is the format Ollama and llama.cpp use (great for CPU and mixed setups); AWQ and GPTQ are 4-bit formats vLLM serves efficiently on GPU; fp8 is the modern middle ground on newer NVIDIA cards, with near-fp16 quality at half the memory.
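
To see what "fewer bits" means mechanically, here is a toy sketch of symmetric round-to-nearest 4-bit quantization over one small group of weights. It only illustrates the idea. AWQ and GPTQ pick scales and roundings far more carefully using calibration data, and the function names here are made up for the example.

```python
def quantize_4bit(weights):
    """Map one group of floats to signed 4-bit codes in [-7, 7] plus a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats; this is what the GPU kernel computes with."""
    return [c * scale for c in codes]


group = [0.12, -0.53, 0.07, 0.91, -0.30, 0.44, -0.88, 0.02]
codes, scale = quantize_4bit(group)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print("codes:", codes)
print(f"scale: {scale:.4f}, worst-case error in this group: {max_error:.4f}")
```

Each weight now costs 4 bits plus a share of the group's scale, and the reconstruction error per weight is bounded by half the scale. That bounded error is why quality drops only slightly, and why you should still confirm it on your own eval set.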

A practical sizing estimate: VRAM needed ≈ (params × bytes-per-param) + KV cache. The KV cache is the often-forgotten part: it grows with batch size and context length, and it's why a model that "fits" on paper OOMs under real concurrency. Leave headroom. vLLM's `--gpu-memory-utilization` caps the fraction of VRAM the engine may claim, and whatever remains after the weights load becomes the KV cache pool.

Precision	Bytes/param	~8B weights	Fits on
fp16	2	~16 GB	A10 (24GB) tight, L40/A100
fp8	1	~8 GB	L4 (24GB), A10 comfortably
4-bit (AWQ/GGUF)	0.5	~4–5 GB	Most GPUs, even a laptop

Approximate weight memory for an 8B model by precision (KV cache is extra).
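
The sizing estimate can be turned into a quick calculator. This is a rough sketch: the layer count, KV-head count, and head dimension are assumed values for a Llama-3.1-8B-style model and should be checked against the checkpoint's `config.json`. It ignores activations and engine overhead, so treat the result as a lower bound.

```python
GB = 1e9  # decimal gigabytes, matching the table above


def weight_bytes(n_params, bytes_per_param):
    """Memory for the weights alone."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens_per_seq,
                   concurrent_seqs, bytes_per_value=2):
    """One key and one value vector per layer, per KV head, per cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * tokens_per_seq * concurrent_seqs


def fits(total_bytes, vram_gb, utilization=0.90):
    """True if the estimate fits in the share of VRAM the engine may claim."""
    return total_bytes <= vram_gb * GB * utilization


# Assumed model shape (verify against the checkpoint's config.json).
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
weights = weight_bytes(8e9, 2)  # 8B parameters at 2 bytes each

for seqs in (1, 4, 8):
    kv = kv_cache_bytes(tokens_per_seq=8192, concurrent_seqs=seqs, **SHAPE)
    total = weights + kv
    print(f"{seqs} seqs: weights {weights / GB:.1f} GB + KV {kv / GB:.1f} GB "
          f"= {total / GB:.1f} GB, fits a 24 GB card: {fits(total, 24)}")
```

With these assumptions the fp16 model fits a 24 GB card at four full-length concurrent sequences and stops fitting at eight, which is the "fits on paper, OOMs under load" failure in miniature.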

Batching: where throughput actually comes from

A GPU running one request at a time is mostly idle: generation is memory-bound, and you're paying for silicon that's waiting. Continuous batching (vLLM's core trick) packs many in-flight requests through the GPU together, slotting new requests into the batch as old ones finish instead of waiting for a fixed batch to complete. The result is dramatically higher tokens/second on a given GPU, often reported at 5–20x over naive one-at-a-time serving, depending on the workload. There's a tension: bigger batches raise throughput but can nudge individual-request latency up, so you tune batch limits against your p95 target. The headline: throughput is a serving-engine property, not a model property. The same weights serve a handful of req/s or hundreds depending on whether batching is on.
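
A toy simulation shows why slotting requests in as others finish matters. This sketch assumes every decode step costs the same no matter how many sequences are in the batch (a simplification of memory-bound decoding), that all requests arrive at once, and that prefill is free. It is not vLLM's scheduler, only the scheduling idea.

```python
from collections import deque


def one_at_a_time_steps(lengths):
    """Each request decodes alone, one token per step."""
    return sum(lengths)


def static_batching_steps(lengths, max_batch):
    """Fixed batches: every batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Refill free slots from the queue at every decode step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Output lengths in tokens: a mix of long and short generations.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
total_tokens = sum(lengths)
for name, steps in [
    ("one at a time", one_at_a_time_steps(lengths)),
    ("static batch of 4", static_batching_steps(lengths, 4)),
    ("continuous batch of 4", continuous_batching_steps(lengths, 4)),
]:
    print(f"{name}: {steps} steps, {total_tokens / steps:.2f} tokens/step")
```

The gap between static and continuous batching comes from short requests no longer waiting on the longest one in their batch. The more your output lengths vary, the more it matters.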

Common mistakes that cost hours (and GPUs)

Under-sizing the GPU. You size for the weights and forget the KV cache, then OOM the moment two users hit it at once. Size for weights plus cache at your real batch size and context length.
No batching. Running requests one at a time leaves most of the GPU idle and makes self-hosting look expensive. Use an engine with continuous batching (vLLM) before you conclude the economics don't work.
Serving fp16 when quantized would do. You pay for a bigger GPU to hold fp16 weights when a 4-bit AWQ build would fit a cheaper card at near-identical quality. Quantize first, measure quality, and upgrade hardware only if the eval demands it.
Ignoring eval. Swapping a hosted frontier model for a local 8B without an eval set is how a silent quality regression ships to users. Build a small task-specific eval *before* the migration and gate on it.
Treating it as fire-and-forget. A self-hosted endpoint is infrastructure: it needs metrics (tokens/s, p95, GPU util), capacity planning, and on-call. Budget for the ops, not just the GPU.
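
The eval gate from the list above does not need to be elaborate. Here is a minimal sketch, assuming a classification task with labeled examples; the eval set, the keyword stub standing in for a call to your endpoint, and the 2-point tolerance are all made up for illustration.

```python
def accuracy(predict, eval_set):
    """Fraction of (text, expected_label) pairs the predictor gets right."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)


def passes_gate(candidate_acc, baseline_acc, max_drop=0.02):
    """Allow the cutover only if the candidate stays within max_drop of the baseline."""
    return candidate_acc >= baseline_acc - max_drop


EVAL_SET = [
    ("card declined twice", "billing"),
    ("can't log in after password reset", "account"),
    ("charged twice for one order", "billing"),
    ("app crashes on startup", "bug"),
]


def stub_local_model(text):
    # Stand-in for a request to your self-hosted endpoint.
    if "card" in text or "charged" in text:
        return "billing"
    if "log in" in text:
        return "account"
    return "other"


local_acc = accuracy(stub_local_model, EVAL_SET)
print(f"local accuracy: {local_acc:.2f}, "
      f"ship it: {passes_gate(local_acc, baseline_acc=1.0)}")
```

In practice the baseline accuracy comes from running the same eval set through the hosted model you are replacing, and the eval set should be large enough that a drop of a couple of points is signal rather than noise.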

Takeaways

The whole article in seven lines

Default to a hosted API; self-host when cost at scale or a data-residency rule demands it.
The pipeline is always: open weights → quantize → serving engine on a GPU → OpenAI-compatible endpoint → your app.
Ollama is the fastest path to a local endpoint; vLLM is the production throughput engine.
Your client code barely changes: swap the `base_url`, keep the OpenAI SDK.
Quantization (GGUF/AWQ/fp8) shrinks weights 3–4x so models fit cheaper GPUs with acceptable quality.
Throughput comes from continuous batching, not the model; the KV cache, not just the weights, drives VRAM.
Size the GPU for weights + cache, keep batching on, quantize before upgrading hardware, and always eval before you cut over.

Where to go next

Self-hosting is one lever in the broader cost-and-control story. Pair it with the optimization techniques that apply whether you host or rent, and with the infrastructure layer underneath the GPU.

LLM Cost & Latency Optimization: caching, routing, and prompt economics that cut spend before you ever touch a GPU.
AI/GPU Infrastructure on the Cloud: provisioning, autoscaling, and spot GPUs to run the endpoint you just designed.
Follow the full AI Engineer career path to see where serving fits among retrieval, evals, and agents.
