The AI team needs 8 GPUs by Friday. Here is what every cloud engineer now has to know: instance families, quotas, spot vs reserved, inference serving, model storage, training networking, and how to keep the bill sane.
It is Tuesday. The ML lead drops a message: "We need 8 A100s by Friday to fine-tune the new model, and an endpoint to serve it after." You open the cloud console, pick a GPU instance, hit launch, and get InsufficientInstanceCapacity. You request a quota increase and it is "under review." The one region that has capacity costs 40% more. And nobody has said whether these 8 GPUs run for three days or three months.
GPU infrastructure breaks the instincts you built on CPU clouds. The hardware is scarce, the on-demand price is brutal, idle time is pure waste, and "just autoscale it" does not work the way it does for web servers. This article is the field guide: how to get the GPUs, how to serve a model on them, how to wire up a training cluster, and how to stop the bill from quietly tripling.
Who this is for
Cloud and platform engineers who suddenly own AI workloads: provisioning GPUs, standing up inference endpoints, or supporting a training cluster. You know EC2/VMs, autoscaling, and containers. You do **not** need to know how transformers work. We cover the infra, not the math.
One sentence, then a picture
GPU infrastructure is just compute where the accelerator is the scarce, expensive resource, so every design decision becomes a question of keeping that accelerator busy.
A rentable industrial 3D printer charged by the minute → A GPU instance billed per second, whether or not it prints
Reserving a printer for a year vs walking in off the street → Reserved / committed capacity vs on-demand pricing
Using leftover off-peak printer time at a steep discount → Spot / preemptible GPU instances
A finished blueprint file you load before any print → The model artifact (weights) loaded onto the GPU before serving
A print queue that batches small jobs into one run → Inference request batching to maximize GPU utilization
GPU infra maps cleanly onto things you already run.
The shape of a GPU platform
There are two distinct workloads and they want opposite things. Inference serves a trained model to users: it wants low latency, autoscaling, and high uptime. Training builds the model: it wants raw throughput and fast inter-node networking, and it tolerates interruption as long as it checkpoints. The diagram below shows both: the inference path on the main row, with the training cluster branching off.
Model artifact and GPU nodes feed an inference service behind an autoscaler; a training cluster branch produces new artifacts.
1. A request arrives
A client calls the inference API. The request lands on the inference service running in a container on a GPU node.
2. Weights are already resident
On startup, each replica pulled the model artifact from object storage and loaded it into GPU memory once. Requests reuse that loaded model; you never reload per request.
3. Requests get batched
The server groups concurrent requests into a batch and runs them through the GPU in one pass. One GPU pass serving 16 requests is dramatically cheaper per request than 16 passes.
4. The autoscaler watches pressure
It tracks GPU utilization or queue depth, not CPU. When the queue backs up it adds a replica (and a GPU node); when traffic falls it removes them so you stop paying for idle accelerators.
5. Training runs on its own branch
Separately, the training cluster streams the dataset from storage, trains across multiple GPUs over a fast network fabric, and writes a new model artifact back to storage, which the inference service can then roll out.
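The batching step can be sketched in a few lines. This is a minimal illustration, not how any particular model server implements it: `collect_batch`, `run_model`, and the `max_wait_s` parameter are hypothetical names, and the sketch assumes requests arrive on an in-process thread-safe queue.

```python
import queue
import time


def collect_batch(pending: queue.Queue, max_batch_size: int, max_wait_s: float) -> list:
    """Drain up to max_batch_size requests, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; run with whatever has arrived
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch


def run_model(batch: list) -> list:
    """Hypothetical stand-in for one GPU forward pass over the whole batch."""
    return [f"result-for-{request}" for request in batch]
```

The wait budget is the trade-off to tune: a longer wait fills larger batches and raises GPU utilization, but every request in the batch pays that wait as added latency.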
GPU instance types and accelerators
Cloud GPUs come in families tuned for different jobs. The dimensions that matter: the accelerator model (the GPU chip itself), GPU memory (VRAM, which caps the model size you can load), how many GPUs are attached per instance, and whether those GPUs are linked by a high-speed interconnect for multi-GPU work.
Training-class GPUs (NVIDIA A100, H100; equivalently AWS p4/p5, Azure ND, GCP A3): huge VRAM (40–80GB+), connected by NVLink/NVSwitch inside the box and high-bandwidth networking between boxes. Expensive, scarce, and overkill for most inference.
Inference / general GPUs (NVIDIA L4, A10G, T4; AWS g5/g6, Azure NV, GCP G2): less VRAM, far cheaper, great latency-per-dollar for serving small and mid-size models. This is where most production endpoints belong.
Custom accelerators (AWS Trainium/Inferentia, Google TPUs): can be much cheaper per unit of work, but need a compiled/adapted model and lock you into that vendor's toolchain. Worth it at scale, friction at the start.
Fractional / MIG: an A100/H100 can be sliced (Multi-Instance GPU) so several small models share one physical card. Excellent utilization for many tiny endpoints.
Size by VRAM first
The model's weights plus its activation memory must fit in GPU memory, or it simply will not load. Estimate VRAM before you pick a family: a model needing 60GB rules out a 24GB card no matter how cheap it is.
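A rough way to run that estimate is sketched below. The 20% overhead for activations and runtime buffers is an assumption for illustration only: real overhead depends on batch size, sequence length, and framework, and training needs far more memory for gradients and optimizer state. `estimate_vram_gb` and `fits` are hypothetical helper names.

```python
def estimate_vram_gb(num_params: float, bytes_per_param: int = 2,
                     overhead_fraction: float = 0.2) -> float:
    """Estimate serving VRAM in GB: weights plus an ASSUMED fractional overhead.

    bytes_per_param: 4 for fp32, 2 for fp16/bf16, 1 for int8.
    """
    weights_gb = num_params * bytes_per_param / 1e9
    return weights_gb * (1.0 + overhead_fraction)


def fits(required_gb: float, card_vram_gb: float) -> bool:
    """True if the estimate fits on a single card of the given VRAM size."""
    return required_gb <= card_vram_gb


small = estimate_vram_gb(7e9)    # 7B parameters in fp16
large = estimate_vram_gb(25e9)   # 25B parameters in fp16
print(f"7B fp16: ~{small:.1f} GB, fits a 24GB card: {fits(small, 24)}")
print(f"25B fp16: ~{large:.1f} GB, fits a 24GB card: {fits(large, 24)}")
```

Under these assumptions, a 7B model in fp16 lands around 17GB and fits a 24GB card, while a 25B model lands around 60GB and needs either a larger card or lower-precision weights.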
Choosing: families and trade-offs
Two decisions dominate cost and reliability: which instance family, and which pricing model. The table puts the trade-offs side by side. Read the first column as the question you are answering.
| Decision | Pick this when… | Trade-off / watch-out |
| --- | --- | --- |
| Inference-class GPU (L4/T4/A10G) | Serving a model in production with latency SLAs and variable traffic | Limited VRAM caps model size; not for large-scale training |
| Training-class GPU (A100/H100) | Fine-tuning or training; large models that need big VRAM + fast interconnect | Scarce and very expensive; wasteful if left idle for serving |
| On-demand pricing | Production endpoints, anything that must not be interrupted | Highest per-hour cost; pay full rate even at low utilization |
| Spot / preemptible | Training, batch, and fault-tolerant jobs that checkpoint | Can be reclaimed with little warning; never use for stateful serving without a fallback |
| Reserved / committed (1–3 yr) | Steady, predictable baseline GPU usage you will run for months | Large up-front commitment; you pay even if usage drops |
Pick the cheapest option that meets the workload's real constraint, not the biggest GPU available.
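The table above can be read as a small decision rule. The sketch below is one hedged encoding of it, not a complete policy: the function names are hypothetical, and the 24GB default is an assumed VRAM ceiling for an inference-class card that you should replace with the card you actually have in mind.

```python
def choose_gpu_class(workload: str, required_vram_gb: float,
                     inference_card_vram_gb: float = 24.0) -> str:
    """Pick a GPU class from the workload type and its estimated VRAM need."""
    if workload == "training" or required_vram_gb > inference_card_vram_gb:
        return "training-class"
    return "inference-class"


def choose_pricing(interruptible: bool, steady_for_months: bool) -> str:
    """Pick a pricing model from interruptibility and how steady the usage is."""
    if interruptible:
        return "spot"        # checkpointed training and batch jobs
    if steady_for_months:
        return "reserved"    # known, steady baseline
    return "on-demand"       # must not be interrupted, usage not yet predictable


# A fine-tuning job that checkpoints, then a new user-facing endpoint for a small model.
print(choose_gpu_class("training", 60), choose_pricing(True, False))
print(choose_gpu_class("inference", 17), choose_pricing(False, False))
```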
Stand up a GPU inference deployment
Here is a containerized inference deployment on Kubernetes that requests a GPU, loads the model from object storage on startup, and is targeted by an autoscaler. The nvidia.com/gpu resource limit is the key line: it tells the scheduler to place this pod only on a GPU node and reserve a whole GPU for it.
inference-deployment.yaml
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
  labels: { app: model-inference }
spec:
  replicas: 2
  selector:
    matchLabels: { app: model-inference }
  template:
    metadata:
      labels: { app: model-inference }
    spec:
      # Only schedule onto GPU nodes (label your GPU node pool to match).
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      containers:
        - name: server
          image: registry.example.com/model-server:1.4.0
          ports:
            - containerPort: 8080
          env:
            # Weights are pulled from object storage once, at startup.
            - name: MODEL_URI
              value: "s3://models/intent-classifier/v7/"
            - name: MAX_BATCH_SIZE
              value: "16"
          resources:
            limits:
              nvidia.com/gpu: "1"  # reserve one whole GPU per replica
              memory: "16Gi"
              cpu: "4"
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            initialDelaySeconds: 40  # weights take time to load into VRAM
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # Scale on a GPU-aware custom metric (utilization or queue depth),
    # NOT CPU, a busy GPU often shows near-idle CPU.
    - type: Pods
      pods:
        metric: { name: gpu_duty_cycle }
        target: { type: AverageValue, averageValue: "70" }
Scale on the GPU, not the CPU
The most common autoscaling mistake is targeting CPU. A GPU pegged at 100% can show 10% CPU, so a CPU-based HPA never scales up and your endpoint falls over under load. Always scale on a GPU metric: utilization (duty cycle) or request-queue depth.
Behind this you need a GPU node pool: a node group whose machine type has the accelerator attached and the GPU device drivers installed (on managed Kubernetes, install the vendor device plugin or use the GPU node image). Configure the pool to scale to zero when no GPU pods are pending. That is how you stop paying for idle accelerators overnight.
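To see why the choice of metric matters, it helps to run the numbers through the scaling rule the Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that formula and clamps to the min/max bounds; it deliberately ignores the HPA's tolerance band and stabilization windows, and the metric readings are made-up values for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA rule, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Same overloaded endpoint, two different metrics (illustrative readings):
on_gpu = desired_replicas(2, current_metric=95, target_metric=70, min_replicas=1, max_replicas=8)
on_cpu = desired_replicas(2, current_metric=10, target_metric=70, min_replicas=1, max_replicas=8)
print(on_gpu, on_cpu)
```

With the GPU at a 95% duty cycle against a target of 70, the rule asks for a third replica. Fed the 10% CPU reading from the same pods, it asks for fewer replicas, which is exactly the wrong direction under load.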
Storage and networking
Storage for models and datasets
Treat the model artifact as an immutable, versioned object: store weights in object storage (S3/GCS/Blob) under a versioned path like models/<name>/v7/, and have each replica download it once at boot. Never bake giant weights into the container image: doing so bloats pulls and couples model rollout to image rollout. Large training corpora live in object storage too, streamed and sharded across workers. When an epoch must re-read data at high throughput, mount a fast parallel file system or a local NVMe cache so the GPUs are not starved waiting on I/O.
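The "download once, reuse for every request" pattern looks roughly like this. It is a sketch under stated assumptions: `ModelStore` and `model_prefix` are hypothetical names, and the `download` callable stands in for whatever object-storage client you use (it is injected rather than tied to a specific SDK).

```python
from typing import Callable, Dict


def model_prefix(name: str, version: int) -> str:
    """Build the immutable, versioned object-storage prefix for a model."""
    return f"models/{name}/v{version}/"


class ModelStore:
    """Fetch each model version once per process, then serve it from memory."""

    def __init__(self, download: Callable[[str], bytes]):
        self._download = download            # injected storage client call (assumption)
        self._loaded: Dict[str, bytes] = {}  # prefix -> loaded weights

    def get(self, name: str, version: int) -> bytes:
        prefix = model_prefix(name, version)
        if prefix not in self._loaded:       # only the first call hits storage
            self._loaded[prefix] = self._download(prefix)
        return self._loaded[prefix]
```

Because the version is part of the path, rolling out v8 is a configuration change that points replicas at a new prefix; the container image does not need to be rebuilt.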
Networking for distributed training
Single-GPU jobs do not care about the network. Multi-node training does, intensely. When dozens of GPUs synchronize gradients every step, the inter-node link becomes the bottleneck. This is why training instances offer high-bandwidth, low-latency fabrics (AWS EFA, InfiniBand, GCP equivalents) and why you place the cluster in a single zone with a placement group / compact placement so nodes are physically close. Spread those same nodes across zones and your expensive GPUs sit idle waiting on the network.
Co-locate training nodes
For multi-node training, request a cluster placement group and an interconnect-capable instance family. The GPUs are only as fast as their slowest synchronization; cross-zone chatter can quietly halve throughput.
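A back-of-the-envelope estimate shows how much the link matters. The sketch below uses the standard traffic volume for a ring all-reduce, in which each worker sends roughly 2 × (N − 1) / N times the gradient size per synchronization. It ignores latency, overlap of communication with compute, and hierarchical reductions, and the bandwidth figures are hypothetical, so treat the output as an order-of-magnitude guide only.

```python
def ring_allreduce_seconds(gradient_bytes: float, num_workers: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on one gradient sync with ring all-reduce."""
    if num_workers < 2:
        return 0.0  # a single worker has nothing to synchronize
    traffic_per_worker = 2 * (num_workers - 1) / num_workers * gradient_bytes
    return traffic_per_worker / link_bytes_per_s


GRADIENT_BYTES = 14e9          # e.g. 7B parameters in fp16 (illustrative)
FAST_FABRIC = 400e9 / 8        # hypothetical 400 Gb/s link, in bytes per second
SLOW_LINK = 25e9 / 8           # hypothetical 25 Gb/s link, in bytes per second

fast = ring_allreduce_seconds(GRADIENT_BYTES, 8, FAST_FABRIC)
slow = ring_allreduce_seconds(GRADIENT_BYTES, 8, SLOW_LINK)
print(f"fast fabric: {fast:.2f}s per sync, slow link: {slow:.2f}s per sync")
```

If a training step computes in well under a second, a sync that takes several seconds on the slow link means the GPUs spend most of each step waiting.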
Cost and quota control
GPU cost discipline is its own skill. The on-demand price of a single high-end GPU instance can run thousands of dollars a month, and the failure mode is not a crash but a GPU sitting idle at full price. Two levers: capacity/quotas (can you even get the hardware) and spend (are you paying only for work done).
Request quota early. Cloud accounts start with a GPU quota of zero or near-zero. Increases are reviewed by humans and can take days, so file the request the moment a GPU project is on the roadmap, per region and per family.
Match pricing to interruptibility. On-demand for serving; spot/preemptible for training and batch (it can be 60–90% cheaper); reserved/committed for a known steady baseline. Mixing all three is normal.
Scale to zero. GPU node pools and inference replicas should drop to zero when idle. A dev endpoint on a multi-GPU instance left running over a weekend can be a four-figure surprise.
Right-size the GPU. A 7B model served on an H100 wastes most of the card. Move it to an L4 or share a card via MIG.
Make spend visible. Tag every GPU resource by team/project and alert on GPU-hours, not just dollars. Utilization is the metric that catches waste a cost dashboard misses. See Cloud Cost & FinOps: Engineering for Spend.
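The levers above are easy to compare with simple arithmetic. In the sketch below, the $4.00 hourly rate and the 70% spot discount are hypothetical placeholders rather than quotes from any provider; substitute your own rates.

```python
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30,
                 discount: float = 0.0) -> float:
    """Monthly spend for one GPU instance at a given schedule and discount."""
    return hourly_rate * hours_per_day * days * (1.0 - discount)


def utilization(busy_gpu_hours: float, billed_gpu_hours: float) -> float:
    """Fraction of billed GPU-hours that did useful work."""
    if billed_gpu_hours == 0:
        return 0.0
    return busy_gpu_hours / billed_gpu_hours


HOURLY_RATE = 4.00  # hypothetical on-demand rate in dollars per hour

always_on = monthly_cost(HOURLY_RATE, 24)                    # never scales down
scale_to_zero = monthly_cost(HOURLY_RATE, 10, days=22)       # 10h on weekdays only
spot_training = monthly_cost(HOURLY_RATE, 24, discount=0.70) # assumed 70% spot discount

print(f"always on: ${always_on:.0f}, scale to zero: ${scale_to_zero:.0f}, "
      f"spot: ${spot_training:.0f}")
print(f"utilization if busy 220 of 720 billed hours: {utilization(220, 720):.0%}")
```

The utilization figure is the one worth alerting on: the always-on endpoint in this example is busy for under a third of the hours it is billed for.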
Spot is for training, not serving
Spot/preemptible GPUs can vanish with as little as 30 seconds to two minutes of notice, depending on the provider. That is fine for a training job that checkpoints and resumes. It is not fine for a user-facing endpoint unless you keep an on-demand fallback pool. Never put a stateful or latency-critical service on spot alone.
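What "checkpoints and resumes" means in practice is sketched below with a toy training loop. Everything here is illustrative: `train`, `Preempted`, and the JSON checkpoint format are hypothetical, and a real job would save model and optimizer state to durable storage rather than a step counter to local disk.

```python
import json
import os
import tempfile


class Preempted(Exception):
    """Raised to simulate the provider reclaiming a spot instance."""


def save_checkpoint(path: str, step: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)  # atomic rename: a preemption never leaves a half-written file


def train(total_steps: int, checkpoint_path: str, checkpoint_every: int,
          preempt_at: int = None) -> int:
    """Run a toy loop, resuming from the last checkpoint if one exists."""
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        if preempt_at is not None and step == preempt_at:
            raise Preempted(f"reclaimed after step {step}")
        step += 1  # one real training step would run here
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(checkpoint_path, step)
    return step


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(100, ckpt, checkpoint_every=10, preempt_at=37)
    except Preempted:
        pass  # steps 31-37 are lost; the job restarts from step 30
    finished = train(100, ckpt, checkpoint_every=10)
```

The checkpoint interval sets how much work a preemption costs: here at most 10 steps are repeated, in exchange for writing a checkpoint every 10 steps.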
Common mistakes that cost hours (and thousands)
Idle GPUs at full price. A GPU node pool or endpoint with no scale-to-zero runs nights and weekends doing nothing. This is the single biggest GPU waste, so check utilization, not just uptime.
No autoscaling, or autoscaling on CPU. A fixed replica count over-provisions off-peak and falls over at peak; a CPU-based HPA never reacts because a busy GPU looks CPU-idle. Scale on a GPU metric.
Wrong instance for the job. Serving a small model on a training-class GPU burns money; trying to fine-tune a large model on a low-VRAM card simply fails to load. Size by VRAM and workload, not by "biggest available."
No spot for training. Paying full on-demand for interruptible, checkpointed training jobs leaves the easiest 60–90% savings on the table.
Weights baked into the image. Multi-gigabyte images make pulls slow and couple model versioning to container builds. Load weights from versioned object storage at startup.
Forgetting quota until launch day. "Insufficient capacity" plus a multi-day quota review turns a Friday deadline into next week. Request quota before you need it.
Takeaways
The whole article in seven lines
Inference and training are different workloads: serving wants latency + autoscaling; training wants throughput + fast networking.
Pick the instance by VRAM and job first: inference-class GPUs (L4/T4/A10G) for serving, training-class (A100/H100) for building models.
Match pricing to interruptibility: on-demand for serving, spot for training, reserved for steady baselines.
Store weights as versioned objects and load once into VRAM; never bake them into the image or reload per request.
Batch requests and autoscale on a GPU metric (utilization/queue depth), never on CPU.
Multi-node training needs co-located nodes and a high-bandwidth fabric, or the GPUs starve.
The cardinal sin is the idle GPU at full price: scale to zero, right-size, request quota early.
Where to go next
Once the GPUs are running, the next questions are about the model itself (squeezing latency and cost out of each request) and about treating GPU spend with the same rigor as the rest of your cloud bill.
Optimize the model you are serving: LLM Cost & Latency Optimization covers batching, quantization, caching, and the latency/cost levers above the infra layer.
Practice the container layer: the kubectl lab and Docker lab to get comfortable with deployments, node selectors, and resource requests.
Follow the full track: the Cloud Engineer path for the foundation-to-senior progression these GPU topics sit on top of.