Capacity Planning for Reliability: Forecast, Load Test, and Right-Size

The 9am outage nobody forecasted

Your service runs fine for months. Then a marketing email goes out at 9am, traffic triples in ninety seconds, every pod pins CPU, latency climbs past your SLO, and the autoscaler, which you assumed would save you, is still pulling images while the queue backs up. Or the opposite failure: you survived that spike a year ago by over-provisioning, and you've been paying for 60 idle instances every night since. Both are capacity failures. One shows up in your incident channel, the other shows up in the cloud bill, and both come from the same root cause: nobody actually planned capacity.

Capacity planning is the discipline of answering one question before your users force the answer: *how much do we need, when, and how do we get it there in time?* It sits at the intersection of reliability and cost. Get it right and you absorb peaks invisibly while spending close to what average load justifies. Get it wrong in either direction and you pay, in pages or in dollars.

Who this is for

Engineers and SREs who own a service in production and have been bitten by either a traffic spike or a surprise bill. You should be comfortable reading dashboards (CPU, RPS, latency) and know roughly what autoscaling is. You do not need formal statistics. For the architectural side of growing a system, read the sibling piece on [Scalability Principles](/blog/scalability-principles) first.

The principle: plan for the peak, pay for the average

The working definition: capacity planning is the continuous process of matching available resources to forecasted demand at a chosen level of risk. Never barely enough, never far too much.

The hard part is that demand is not one number. It has a daily shape, a weekly shape, seasonal swings, and the occasional self-inflicted spike from a launch or a campaign. You cannot provision for the average, because the average never happens at the moment that matters. You provision for the peak you expect plus a safety margin, then you use elasticity to claw back the cost during the troughs.

Kitchen	Capacity planning
The restaurant's busiest dinner rush	Peak demand (P99 traffic, launch day)
Average covers across the whole week	Mean utilization: what the bill should track
Stoves, counter space, and prep stations	Provisioned baseline capacity
On-call cooks you phone in for a rush	Autoscaling: elastic, but takes time to arrive
One empty stove kept hot, ready to use	Headroom: slack you pay for deliberately
Saturday always busier than Tuesday	Seasonality in the demand forecast

A kitchen sized only for the daily average can't serve the dinner rush, and one sized only for New Year's Eve goes broke on a quiet Tuesday.

A kitchen built only for the average Tuesday turns away half the Friday crowd; that is an outage. A kitchen built for New Year's Eve every single night pays rent on cold stoves; that is waste. Capacity planning is choosing the stove count *and* the speed you can call in extra cooks, on purpose, with numbers behind it.

The capacity planning loop

Capacity planning is not a one-time spreadsheet; it is a loop. You forecast demand, measure what one unit of capacity actually buys you, add headroom, decide how to supply it, then watch real utilization and feed that back into the next forecast.

The capacity planning loop: historical metrics drive a forecast, a load test calibrates capacity-per-unit, you provision, and real utilization feeds back (dashed) into the next forecast.

1. Pull historical metrics. Gather at least one full seasonal cycle of RPS, CPU/memory utilization, and latency per service. You cannot forecast a shape you have never measured.
2. Forecast demand. Project the peak forward, separating organic growth (gradual, trend-following) from inorganic events (launches, migrations, marketing) you must add by hand.
3. Load test to find capacity-per-unit. Drive synthetic load until one instance/pod breaches your SLO. Now you know how many requests one unit safely serves: the conversion factor from demand to capacity.
4. Apply a headroom target. Pick a utilization ceiling (often 60–70%) so a node failure or a forecast miss does not instantly tip you over. Headroom is insurance you buy on purpose.
5. Provision or autoscale. Decide which portion is a fixed baseline and which flexes with traffic. Set autoscaler floors, ceilings, and triggers from the load-test numbers, not from guesses.
6. Monitor and re-forecast. Watch real utilization against the target. Drift means your conversion factor or growth assumption changed; feed it back and the loop tightens over time.
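
The monitoring step can be made mechanical. The sketch below is one hypothetical way to flag drift between a forecast peak and the observed peak; the function names are illustrative, and the 15% default mirrors the drift threshold used later in the walkthrough rather than any standard.

```python
def forecast_drift(forecast_rps: float, actual_rps: float) -> float:
    """Signed relative error of the forecast; positive means demand ran hot."""
    if forecast_rps <= 0:
        raise ValueError("forecast_rps must be positive")
    return (actual_rps - forecast_rps) / forecast_rps


def needs_reforecast(forecast_rps: float, actual_rps: float,
                     threshold: float = 0.15) -> bool:
    """True when the observed peak misses the forecast by more than the threshold."""
    return abs(forecast_drift(forecast_rps, actual_rps)) > threshold


if __name__ == "__main__":
    # Hypothetical month: forecast peak 4,160 RPS, observed peak 4,900 RPS.
    print(f"drift: {forecast_drift(4160, 4900):+.1%}")
    print("re-forecast:", needs_reforecast(4160, 4900))
```

A signed drift is kept deliberately: running hot points at a missed growth assumption, while running cold points at capacity you could hand back.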

Organic vs inorganic growth

The single biggest forecasting mistake is treating all growth as one smooth curve. There are two kinds and they need completely different handling.

Organic growth is the gradual trend: more signups, more usage per user, the slow upward creep of your weekly peak. It follows history, so you can fit a trend line and extrapolate. If you grew 4% month-over-month for the last year, 4% next month is a defensible bet.
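
As a rough illustration of extrapolating the organic trend, the sketch below derives an average compound monthly growth rate from a series of monthly peaks and projects it forward. The function names and the sample history are assumptions made for the example; a real forecast would also account for seasonality.

```python
def monthly_growth_rate(monthly_peaks: list[float]) -> float:
    """Average compound month-over-month growth across the series."""
    if len(monthly_peaks) < 2 or monthly_peaks[0] <= 0:
        raise ValueError("need at least two months and a positive first value")
    periods = len(monthly_peaks) - 1
    return (monthly_peaks[-1] / monthly_peaks[0]) ** (1 / periods) - 1


def project_peak(current_peak: float, growth: float, months: int) -> float:
    """Extrapolate the peak forward assuming the same compound growth holds."""
    return current_peak * (1 + growth) ** months


if __name__ == "__main__":
    # Hypothetical history: three monthly peaks growing about 4% per month.
    history = [3698.0, 3846.0, 4000.0]
    rate = monthly_growth_rate(history)
    print(f"growth per month: {rate:.1%}")
    print(f"projected peak in 6 months: {project_peak(history[-1], rate, 6):.0f} RPS")
```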

Inorganic growth is the step change you cause: a product launch, a Super Bowl ad, onboarding a huge customer, a region migration that doubles a fleet overnight. History says nothing about these; they are not in the trend line. You must add them to the forecast manually, sourced from the teams who own them (product, marketing, sales). The outage in our opening was an inorganic spike hitting capacity planned from an organic forecast; the two were never reconciled.

Make inorganic events a calendar, not a surprise

Keep a shared launch/event calendar that feeds capacity. Every entry should carry an expected traffic multiplier and a date. A launch with no capacity number attached is an incident with a delay timer.
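
One possible shape for such a calendar, sketched under assumptions: the `CapacityEvent` type, its field names, and the sample launch are hypothetical, and taking the largest active multiplier when events overlap is a design choice for this example, not a rule from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CapacityEvent:
    name: str
    start: date
    end: date          # inclusive
    multiplier: float  # expected traffic multiplier during the window


def forecast_for_day(day: date, organic_peak_rps: float,
                     calendar: list[CapacityEvent]) -> float:
    """Organic peak scaled by the largest multiplier active on that day."""
    active = [e.multiplier for e in calendar if e.start <= day <= e.end]
    return organic_peak_rps * max(active, default=1.0)


if __name__ == "__main__":
    # Hypothetical entry: a launch window with a 1.5x traffic estimate.
    calendar = [CapacityEvent("spring-launch", date(2025, 3, 10), date(2025, 3, 14), 1.5)]
    print(forecast_for_day(date(2025, 3, 11), 5060, calendar))  # inside the window
    print(forecast_for_day(date(2025, 3, 20), 5060, calendar))  # ordinary day
```

Requiring a multiplier and dates in the type itself is the point: an event cannot be recorded without the numbers capacity planning needs.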

Provisioned vs autoscaling: the real trade-off

Once you know how much capacity you need, you decide how to supply it. Provisioned capacity is always-on: you pay for it whether or not traffic uses it, and it is there the instant demand arrives. Autoscaling adds and removes capacity in response to load: cheaper at the trough, but it takes real wall-clock time to react, and that lag is exactly when a spike hurts.

Dimension	Provisioned	Autoscaling
Cost at average load	Higher: you pay for peak-shaped capacity all the time	Lower: capacity tracks demand
Latency to scale	Zero: capacity already exists	Seconds to minutes (schedule, pull image, warm up)
Burst handling	Excellent if sized for the burst; brittle if under-sized	Good for gradual ramps; risky for sudden spikes
Operational load	Manual resizing, periodic re-planning	Tune triggers, floors, ceilings, cooldowns; debug flapping
Failure blast radius	Predictable, fixed pool	Can mask problems by scaling out, then surprise you with a bill
Best fit	Steady baseline, latency-critical, predictable peaks	Spiky-but-not-instant traffic, batch, cost-sensitive troughs

Provisioned baseline vs autoscaling: most production systems use both, a provisioned floor plus an autoscaled flex layer.

The pragmatic answer is rarely one or the other. Provision a baseline that covers your reliable floor of traffic and absorbs the first seconds of any spike, then autoscale the layer above it to chase the rest. The baseline buys you the reaction time the autoscaler needs; the autoscaler buys back the cost the baseline would waste overnight.
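
To make the cost side of that trade-off concrete, here is a small sketch comparing pod-hours for an always-on fleet sized for the daily peak against a provisioned floor plus autoscaled flex. The hourly demand profile is invented for illustration, the 130 RPS of usable capacity per pod and the 12-pod floor are assumed inputs, and scaling lag is ignored entirely.

```python
import math

# Hypothetical peak demand (RPS) for each hour of one day: quiet night, busy daytime.
HOURLY_RPS = [1500] * 7 + [3000] * 3 + [5000] * 8 + [3000] * 3 + [1500] * 3


def pods_needed(rps: float, rps_per_pod: float) -> int:
    """Pods required to serve `rps` at the given usable capacity per pod."""
    return math.ceil(rps / rps_per_pod)


def pod_hours_provisioned(hourly_rps: list[float], rps_per_pod: float) -> int:
    """Always-on: every hour runs enough pods for the day's peak."""
    return pods_needed(max(hourly_rps), rps_per_pod) * len(hourly_rps)


def pod_hours_floor_plus_flex(hourly_rps: list[float], rps_per_pod: float,
                              floor_pods: int) -> int:
    """Floor always on; autoscaling adds pods only in hours that need more."""
    return sum(max(floor_pods, pods_needed(r, rps_per_pod)) for r in hourly_rps)


if __name__ == "__main__":
    print("provisioned for peak:", pod_hours_provisioned(HOURLY_RPS, 130), "pod-hours")
    print("floor + flex:        ", pod_hours_floor_plus_flex(HOURLY_RPS, 130, 12), "pod-hours")
```

The gap between the two totals is the overnight cost the autoscaler buys back; the floor is what you keep paying for in exchange for reaction time.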

Do a capacity plan: a walkthrough

Here is the plan end to end for a single service, with numbers, so it is concrete rather than abstract.

1. Establish the unit and its limit. Load test one pod until P99 latency breaches your SLO. Say it holds 200 RPS safely. That 200 RPS/pod is your conversion factor; everything downstream uses it.
2. Forecast the peak. Current peak is 4,000 RPS. Organic trend adds ~4%/month, so in 6 months expect 4,000 × 1.04^6 ≈ 5,060 RPS. A launch next quarter multiplies that by an estimated 1.5x during its window: plan for ~7,600 RPS at that event.
3. Choose a headroom target. Target 65% utilization at peak so one AZ/node loss or a 20% forecast miss does not tip you over. Effective capacity per pod becomes 200 × 0.65 = 130 RPS.
4. Convert demand to capacity. Steady 6-month peak: 5,060 / 130 ≈ 39 pods. Launch peak: 7,600 / 130 ≈ 59 pods. These are your autoscaler ceiling and your launch pre-scale number, respectively.
5. Split baseline and flex. Trough traffic is ~1,500 RPS → 1,500 / 130 ≈ 12 pods. Set that as the provisioned/min floor so the first burst is absorbed instantly; let autoscaling cover 12 → 39, and pre-scale to 59 manually before the launch window.
6. Write it down and set alarms. Alert when sustained utilization crosses the headroom target, when forecast vs actual drifts beyond ~15%, and when you approach 80% of any cloud quota. Re-run the whole loop monthly.
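
The arithmetic in the steps above can be reproduced in a few lines, which makes it easy to re-run when an input changes. This is a minimal sketch: the numbers come from the walkthrough, while the constant and function names are illustrative.

```python
import math

RPS_PER_POD = 200        # measured by the load test (step 1)
HEADROOM_TARGET = 0.65   # utilization ceiling (step 3)
CURRENT_PEAK_RPS = 4000
MONTHLY_GROWTH = 0.04
HORIZON_MONTHS = 6
LAUNCH_MULTIPLIER = 1.5
TROUGH_RPS = 1500


def pods_for(rps: float) -> int:
    """Pods needed to serve `rps` while staying at or below the headroom target."""
    return math.ceil(rps / (RPS_PER_POD * HEADROOM_TARGET))


organic_peak = CURRENT_PEAK_RPS * (1 + MONTHLY_GROWTH) ** HORIZON_MONTHS
launch_peak = organic_peak * LAUNCH_MULTIPLIER

plan = {
    "floor_pods": pods_for(TROUGH_RPS),               # 12
    "autoscaler_ceiling_pods": pods_for(organic_peak),  # 39
    "launch_prescale_pods": pods_for(launch_peak),      # 59
}

if __name__ == "__main__":
    print(f"organic peak: {organic_peak:.0f} RPS, launch peak: {launch_peak:.0f} RPS")
    print(plan)
```

Rounding up with `math.ceil` is deliberate: a fractional pod of demand still needs a whole pod of supply.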

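Notice the launch is handled by pre-scaling, not by trusting the autoscaler to react in real time. For inorganic events you know the date, so provision ahead of them. With an HPA in place, pre-scaling means temporarily raising minReplicas, because the HPA would otherwise scale a manually resized Deployment back down. The autoscaler is for the traffic you did not schedule.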

Configure it: load test + HPA

First, the load test that gives you the conversion factor. This drives a target service with a ramping arrival rate and prints latency percentiles so you can see exactly where the SLO breaks.

loadtest.sh

bash

#!/usr/bin/env bash
set -euo pipefail

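# Assumption: TARGET routes to exactly one pod (scale the deployment to a single
# replica or address the pod directly); otherwise the result is not per-pod.
TARGET="https://svc.internal/health-weighted-endpoint"

# Step the arrival rate up until P99 latency breaches the 250ms SLO.
# The highest rate that stays under the SLO is what one pod sustains;
# divide forecast demand by it to get pod counts.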
for RATE in 50 100 150 200 250 300; do
  echo "== ${RATE} req/s =="
  vegeta attack \
    -targets=<(echo "GET ${TARGET}") \
    -rate="${RATE}" -duration=60s \
  | vegeta report -type=text \
  | grep -E 'Latencies|Success|Status'
  echo
done

# Read the report: the last RATE where P99 stays under 250ms
# AND success is 100% is your safe RPS-per-pod (the conversion factor).
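
Reading the report by eye works, but the selection rule in the closing comment is simple enough to encode. The sketch below assumes you have already transcribed each step's P99 latency and success ratio into tuples; the function name and the sample figures are hypothetical, not real measurements.

```python
SLO_P99_MS = 250.0


def safe_rps_per_pod(results, slo_p99_ms=SLO_P99_MS):
    """Pick the conversion factor from load-test steps.

    results: iterable of (rate_rps, p99_ms, success_ratio), one per step.
    Returns the highest rate reached before the first failing step,
    or None if even the lowest rate fails.
    """
    safe = None
    for rate, p99_ms, success in sorted(results):
        if p99_ms >= slo_p99_ms or success < 1.0:
            break  # later steps do not count once the SLO has been breached
        safe = rate
    return safe


if __name__ == "__main__":
    # Hypothetical results for the six rates used by the script above.
    steps = [(50, 40, 1.0), (100, 55, 1.0), (150, 90, 1.0),
             (200, 180, 1.0), (250, 310, 1.0), (300, 900, 0.97)]
    print("safe RPS per pod:", safe_rps_per_pod(steps))
```

Stopping at the first failure, rather than taking the highest passing rate overall, guards against a lucky run at a higher rate masking a real breach below it.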

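Then the HPA that turns those numbers into policy. minReplicas is your provisioned floor; maxReplicas is the ceiling you computed; the CPU target encodes your headroom; and the scale-up/down behavior controls how aggressively it reacts so it does not flap. Note that the HPA measures CPU utilization as a percentage of each pod's CPU request, so the 65% target lines up with the load-test headroom only when requests are sized to match the per-pod limit you measured.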

hpa.yaml

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 12        # provisioned floor, absorbs the first burst
  maxReplicas: 60        # launch-peak ceiling: 59 from the capacity plan, rounded up
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65   # the headroom target, encoded
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # react fast to ramps
      policies:
        - type: Percent
          value: 100                    # at most double per step
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # shed slowly to avoid flapping
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
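
To build intuition for how this policy behaves, the sketch below models the core rule the Kubernetes documentation gives for the HPA, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), together with the effect of a percent-based scale-up policy. It is a simplification: the real controller also applies a tolerance band, stabilization windows, and pod-readiness handling, and the function names here are illustrative.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA rule, clamped to the configured floor and ceiling."""
    raw = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


def periods_to_reach(start: int, goal: int, percent_per_period: int = 100) -> int:
    """Policy periods needed when each period may add at most
    percent_per_period percent of the current replica count."""
    replicas, periods = start, 0
    while replicas < goal:
        replicas += max(1, replicas * percent_per_period // 100)
        periods += 1
    return periods


if __name__ == "__main__":
    # 12 pods running at 130% of requested CPU against a 65% target.
    print("desired:", desired_replicas(12, 130, 65, 12, 60))
    # With "at most double per 60s", how many periods from the floor to each target?
    print("periods 12 -> 39:", periods_to_reach(12, 39))
    print("periods 12 -> 59:", periods_to_reach(12, 59))
```

Each period in this manifest is 60 seconds, so the period count is a lower bound on scale-out time before cold start is even considered.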

The autoscaler is only as fast as the pod's cold start

If your image is 2GB and the app takes 90 seconds to warm caches and pass readiness, your HPA cannot save you from a 90-second spike: capacity arrives after the damage is done. Either shrink cold start (smaller image, lazy init, pre-warmed pools) or raise your provisioned floor. Measure cold start; do not assume it.
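
A back-of-the-envelope way to size the floor against cold start, sketched under simplifying assumptions: traffic ramps linearly from a base to a peak, no new pod serves traffic until the full reaction lag has elapsed, and each pod can carry its raw load-tested limit for that short window. The function names and inputs are illustrative.

```python
import math


def demand_at(t_seconds: float, base_rps: float, peak_rps: float,
              ramp_seconds: float) -> float:
    """Demand t seconds into a linear ramp from base_rps to peak_rps."""
    progress = min(1.0, max(0.0, t_seconds / ramp_seconds))
    return base_rps + (peak_rps - base_rps) * progress


def floor_for_spike(base_rps: float, peak_rps: float, ramp_seconds: float,
                    reaction_lag_seconds: float, rps_per_pod: float) -> int:
    """Pods that must already be running to carry the spike until new pods are ready."""
    rps_at_lag = demand_at(reaction_lag_seconds, base_rps, peak_rps, ramp_seconds)
    return math.ceil(rps_at_lag / rps_per_pod)


if __name__ == "__main__":
    # Traffic triples from 1,500 to 4,500 RPS in 90 seconds; pods hold 200 RPS each.
    print("floor with 90s cold start:", floor_for_spike(1500, 4500, 90, 90, 200))
    print("floor with 30s cold start:", floor_for_spike(1500, 4500, 90, 30, 200))
```

The comparison shows the two levers side by side: cutting cold start lowers the floor you have to pay for, and a floor below the computed number means the spike outruns the autoscaler.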

Common mistakes that cost hours (or dollars)

Provisioning for the average. The average moment never needs the capacity; the peak moment always does. Size for the peak plus headroom, then claw back cost with elasticity.
Trusting the autoscaler to handle instant spikes. Scaling has latency: image pull, boot, warm-up, health checks. A spike faster than that lag is an outage no matter how high your ceiling is. Keep a provisioned floor.
No headroom. Running at 95% utilization means one node loss or a small forecast miss tips you into SLO violation. Target 60–70% and treat the gap as paid-for insurance.
Forecasting only organic growth. Trend lines miss launches, campaigns, and big-customer onboarding entirely. Maintain an inorganic-events calendar with traffic multipliers and pre-scale for them.
Never load testing. Without a measured capacity-per-unit, your pod counts are guesses. Re-test after major releases; a dependency or query change silently moves the number.
Ignoring cloud quotas and dependencies. You can autoscale into an instance quota, a database connection cap, or a downstream rate limit. Capacity is the whole chain, not just your fleet. Alert at 80% of every quota.
Setting it and forgetting it. Demand shape drifts; the plan rots. Re-run the loop on a schedule and reconcile forecast against actuals every month.
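
The point about quotas and dependencies lends itself to a quick check: given the shared limits in the chain and what each pod consumes, find the constraint that binds first. This is a sketch with hypothetical limit names and figures; the 80% alert threshold is the one suggested above.

```python
def max_pods_by_dependency(limits: dict[str, float],
                           per_pod: dict[str, float]) -> tuple[str, int]:
    """Return (binding_constraint, max_pods) across all shared limits."""
    name = min(limits, key=lambda k: limits[k] // per_pod[k])
    return name, int(limits[name] // per_pod[name])


def quota_alerts(usage: dict[str, float], limits: dict[str, float],
                 threshold: float = 0.80) -> list[str]:
    """Names of limits whose current usage is at or above the threshold."""
    return sorted(k for k in limits if usage.get(k, 0) / limits[k] >= threshold)


if __name__ == "__main__":
    # Hypothetical chain: cloud vCPU quota, database connection cap, downstream rate limit.
    limits = {"vcpu_quota": 256, "db_connections": 500, "downstream_rps": 12000}
    per_pod = {"vcpu_quota": 2, "db_connections": 10, "downstream_rps": 200}
    print(max_pods_by_dependency(limits, per_pod))
    usage = {"vcpu_quota": 120, "db_connections": 410, "downstream_rps": 6000}
    print(quota_alerts(usage, limits))
```

In this made-up chain the database connection cap allows fewer pods than a 59-pod launch pre-scale would need, which is exactly the kind of ceiling a fleet-only plan never sees.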

Takeaways

Capacity planning in eight lines

Plan for the peak, pay for the average: both outages and waste are capacity failures.
It is a loop: metrics → forecast → load test → headroom → provision/autoscale → monitor → re-forecast.
Load testing gives you capacity-per-unit; that conversion factor turns demand into pod counts.
Separate organic growth (extrapolate the trend) from inorganic events (add by hand from a calendar).
Headroom (target 60–70% utilization) is insurance you buy on purpose, not slack you forgot to remove.
Provisioned = instant but always paid for; autoscaling = cheaper at the trough but lags on spikes.
Use both: a provisioned floor buys the reaction time the autoscaler needs; pre-scale for known events.
Watch utilization and quotas, reconcile forecast vs actual monthly, and the loop tightens itself.

Where to go next

Capacity planning is one pillar of running reliable systems at scale. Pair it with the architecture that makes capacity addable in the first place, then practice the tooling that provisions and inspects it.

Read the companion piece on Scalability Principles: capacity planning decides *how much*; scalability decides whether adding more even helps.
Practice fleet operations and HPA in the kubectl lab: scale deployments, watch autoscaling react, and read utilization live.
Codify provisioned baselines and autoscaler config as infrastructure in the Terraform lab so capacity decisions live in version control.
Follow the full SRE career path to connect capacity planning with SLOs, load shedding, and incident response.
