Load Balancing & Auto-Scaling Explained

How one URL serves millions

Type a popular site's address and millions of other people are doing the same thing right now. Yet no single computer on Earth could handle that: one server has finite CPU, memory, and network bandwidth. So how does one URL serve a crowd that would crush any single machine? The answer is the most important pattern in scalable systems, and it's made of exactly two ideas.

Idea one: run many identical copies of your app and spread traffic across them. That's load balancing. Idea two: automatically add more copies when it's busy and remove them when it's quiet. That's auto-scaling. Together they turn "my one server fell over under load" into "the system grew to meet demand and shrank to save money." This article builds both from scratch.

Who this is for

Anyone who's deployed an app to one server and wondered how real products handle traffic. No prior scaling experience needed. If you understand that a server can get overloaded, you're ready.

What a load balancer actually is

A load balancer is a single front door that receives all the traffic and spreads it across a pool of identical servers behind it, so no single server gets crushed, and a failing one is quietly skipped.

Users only ever see the load balancer's address. They have no idea whether there are two servers behind it or two hundred. You can add, remove, or replace servers and the public URL never changes. The everyday picture:

In the restaurant	In your system
🛎️ The host at the door	Load balancer
🍽️ Identical tables/servers	App instances
👀 "Is this table free and clean?"	Health check
🚫 Skipping a table that's a mess	Routing around an unhealthy instance
📈 Opening more tables on a busy night	Auto-scaling out

A load balancer is the host at a busy restaurant deciding which open table gets the next party.

See it as a picture

All traffic hits the load balancer. It health-checks each app instance and forwards each request to a healthy one. A separate component, the auto-scaler, watches metrics like CPU and adds or removes instances behind the load balancer as load changes.

Users reach one load balancer, which spreads requests across a pool of healthy instances. The auto-scaler watches metrics (dashed) and changes the size of the pool, adding instances when busy, removing them when idle.

The instances must be stateless for this to work, so that any instance can serve any request. If instance 2 stored your shopping cart in its own memory, your next request landing on instance 3 would lose it. Keep state in a shared place (a database, a cache) and any instance can serve anyone. That's the Twelve-Factor discipline paying off.
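To make the shopping-cart problem concrete, here is a minimal sketch in Python. The class names (`SharedStore`, `StatefulInstance`, `StatelessInstance`) are hypothetical, and `SharedStore` is only an in-process stand-in for a real shared database or cache.

```python
class SharedStore:
    """Stand-in for a shared database or cache (hypothetical, in-process)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class StatefulInstance:
    """Keeps carts in its own memory: other instances cannot see them."""

    def __init__(self, name):
        self.name = name
        self._carts = {}

    def add_to_cart(self, user, item):
        self._carts.setdefault(user, []).append(item)

    def view_cart(self, user):
        return list(self._carts.get(user, []))


class StatelessInstance:
    """Keeps nothing locally: every read and write goes to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def add_to_cart(self, user, item):
        cart = self._store.get(user, [])
        self._store.set(user, cart + [item])

    def view_cart(self, user):
        return list(self._store.get(user, []))


# Stateful: the cart written on instance 2 is invisible to instance 3.
s2, s3 = StatefulInstance("instance-2"), StatefulInstance("instance-3")
s2.add_to_cart("alice", "book")
stateful_view = s3.view_cart("alice")

# Stateless: both instances read the same shared store.
store = SharedStore()
t2, t3 = StatelessInstance("instance-2", store), StatelessInstance("instance-3", store)
t2.add_to_cart("alice", "book")
stateless_view = t3.view_cart("alice")

print(stateful_view, stateless_view)
```

With the stateful pair, the second request sees an empty cart; with the stateless pair, it sees the book no matter which instance answers.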

Health checks: the part that prevents outages

A load balancer is only as good as its health checks. Every few seconds it asks each instance "are you OK?" by hitting an endpoint (like /healthz). Healthy instances get traffic; an instance that fails its checks is pulled out of rotation automatically, with no human involved and no 3am page. When it recovers, it's added back. This is how one crashed server stops affecting users almost instantly.
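The rotation logic can be sketched in a few lines of Python. This is an illustration, not any particular load balancer's implementation: the `Pool` class is hypothetical, and the thresholds of two consecutive failures to remove an instance and two consecutive passes to restore it are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True   # what the load balancer currently believes
    failures: int = 0      # consecutive failed checks
    successes: int = 0     # consecutive passed checks


class Pool:
    def __init__(self, names, unhealthy_after=2, healthy_after=2):
        # Thresholds are assumed values, not taken from a real product.
        self.instances = {name: Instance(name) for name in names}
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after

    def record_check(self, name, ok):
        """Update one instance's state from the result of one health check."""
        inst = self.instances[name]
        if ok:
            inst.failures = 0
            inst.successes += 1
            if not inst.healthy and inst.successes >= self.healthy_after:
                inst.healthy = True
        else:
            inst.successes = 0
            inst.failures += 1
            if inst.healthy and inst.failures >= self.unhealthy_after:
                inst.healthy = False

    def in_rotation(self):
        """Names of the instances that should currently receive traffic."""
        return [i.name for i in self.instances.values() if i.healthy]


pool = Pool(["instance-1", "instance-2"])
pool.record_check("instance-2", ok=False)
pool.record_check("instance-2", ok=False)  # second failure in a row: pulled out
after_crash = pool.in_rotation()
pool.record_check("instance-2", ok=True)
pool.record_check("instance-2", ok=True)   # second pass in a row: added back
after_recovery = pool.in_rotation()
print(after_crash, after_recovery)
```

Requiring more than one consecutive result in each direction is a common way to avoid flapping on a single slow response.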

Pro tip

Your health check should verify the app can actually do its job (e.g. reach the database), not just that the process is running. A check that always returns 200 even when the database is down will keep routing users into a broken instance, defeating the whole point.
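As a rough sketch of what a check that "can actually do its job" might look like, the Python function below runs each dependency check and only reports healthy when all of them pass. The function name `healthz`, the check names, and the choice of 503 for the failing case are assumptions for illustration; the point is simply that a non-2xx status tells the load balancer to stop sending traffic.

```python
def healthz(dependency_checks):
    """Run every dependency check; return (status_code, details).

    `dependency_checks` maps a name to a zero-argument callable that
    raises an exception when the dependency is unavailable.
    """
    details = {}
    for name, check in dependency_checks.items():
        try:
            check()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"failed: {exc}"
    status = 200 if all(v == "ok" for v in details.values()) else 503
    return status, details


def database_up():
    # Hypothetical check: a real one might run a trivial query.
    return None


def database_down():
    raise ConnectionError("database unreachable")


healthy_status, healthy_details = healthz({"database": database_up})
broken_status, broken_details = healthz({"database": database_down})
print(healthy_status, broken_status)
```

A check that skipped the dependency loop and always returned 200 would be exactly the "lying" health check the tip warns about.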

How it decides where to send each request

When a request arrives, the load balancer picks an instance using an algorithm. The defaults are simple and usually right, but knowing the options is a classic interview question and occasionally matters for real performance.

Algorithm	How it picks	Best when
Round-robin	Next instance in order, one after another	Instances are equal and requests are similar (the default)
Least connections	The instance with the fewest active connections	Requests vary a lot in duration (some slow, some fast)
Weighted round-robin	Round-robin, but bigger instances get more	Your instances aren't all the same size
IP hash / sticky	Same client always hits the same instance	App keeps per-user state in memory (avoid if you can)
Least response time	The instance answering fastest right now	You want to favour healthy, fast instances automatically

Common load balancing algorithms. Round-robin is the sensible default; reach for others when you have a specific reason.
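Each row of the table fits in a few lines of Python. These are simplified sketches for intuition, not how any specific load balancer implements them: the weighted variant just repeats instances by an integer weight, and the hash function used for IP hash (SHA-256) is an arbitrary choice made for the example.

```python
import hashlib
import itertools


def round_robin(instances):
    """Yield instances in order, one after another, forever."""
    return itertools.cycle(instances)


def least_connections(active):
    """`active` maps instance name -> open connections; pick the fewest."""
    return min(active, key=active.get)


def weighted_round_robin(weights):
    """`weights` maps instance name -> integer weight; bigger gets more turns."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def ip_hash(client_ip, instances):
    """Map a client IP to a fixed instance for as long as the pool is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


def least_response_time(latency_ms):
    """`latency_ms` maps instance name -> recent latency; pick the fastest."""
    return min(latency_ms, key=latency_ms.get)


pool = ["a", "b", "c"]
rr = round_robin(pool)
rr_picks = [next(rr) for _ in range(4)]
lc_pick = least_connections({"a": 5, "b": 2, "c": 7})
wrr = weighted_round_robin({"big": 2, "small": 1})
wrr_picks = [next(wrr) for _ in range(6)]
sticky_pick = ip_hash("203.0.113.7", pool)
fastest = least_response_time({"a": 120.0, "b": 45.0, "c": 80.0})
print(rr_picks, lc_pick, wrr_picks, sticky_pick, fastest)
```

Note that `ip_hash` depends on the length of the pool, so adding or removing an instance can move clients to a different instance.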

Sticky sessions are a trap

Sticky sessions (for example, via IP hash) pin a user to one instance so that the user's in-memory state survives between requests. It feels convenient, but it skews load distribution, and losing that instance logs the user out. The real fix is to make instances stateless and store session data in a shared cache. Reach for stickiness only when you genuinely can't make the app stateless.
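The uneven-load problem is easy to reproduce. The sketch below assumes a made-up traffic mix in which one office behind a single shared IP sends 90 requests and nine other clients send one each; the IP addresses, the request counts, and the SHA-256 hash are all illustrative assumptions.

```python
import hashlib
from collections import Counter


def ip_hash(client_ip, instances):
    """Pin a client IP to one instance (hash choice is arbitrary here)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


instances = ["instance-1", "instance-2", "instance-3"]

# Hypothetical traffic: 90 requests from one shared IP, 9 from other clients.
requests = ["198.51.100.10"] * 90 + [f"203.0.113.{n}" for n in range(1, 10)]

# Sticky routing: every request from the shared IP lands on the same instance.
sticky_load = Counter(ip_hash(ip, instances) for ip in requests)
busiest, busiest_count = sticky_load.most_common(1)[0]

# Round-robin over the same requests, for comparison.
rr_load = Counter(instances[i % len(instances)] for i in range(len(requests)))

print("sticky:", dict(sticky_load))
print("round-robin:", dict(rr_load))
```

Under stickiness, one instance carries at least 90 of the 99 requests, while round-robin gives each instance 33.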

Auto-scaling: matching servers to demand

Load balancing spreads traffic across the instances you have. Auto-scaling changes how many you have. You set a metric and a target (for example, "keep average CPU around 50%"), and the auto-scaler adds instances when you're over the target and removes them when you're under, always staying between a minimum and maximum you define.

1. Set a target and bounds. For example, target 50% CPU, minimum 2 instances (for redundancy), maximum 10 (to cap the bill).
2. The scaler watches the metric. Every minute or so it checks average CPU (or request count, or queue depth) across the pool.
3. Busy → scale out. Sustained CPU above target adds instances. They boot, pass health checks, and the load balancer starts sending them traffic.
4. Quiet → scale in. Sustained low CPU removes instances so you stop paying for idle capacity.
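The four steps above boil down to one small calculation. The sketch below uses a proportional rule, desired = ceil(current × metric ÷ target), clamped to the bounds. This is one common approach (the Kubernetes Horizontal Pod Autoscaler documents the same formula), not necessarily what your platform does, and it leaves out the "sustained" part: real scalers also wait for a breach to persist and apply cooldowns. The function name is hypothetical.

```python
import math


def desired_instances(current, metric, target, minimum, maximum):
    """Return how many instances to run, given the current pool size and the
    observed average metric (e.g. CPU %), clamped to [minimum, maximum]."""
    if current < 1 or target <= 0 or minimum > maximum:
        raise ValueError("invalid scaling inputs")
    raw = math.ceil(current * metric / target)
    return max(minimum, min(maximum, raw))


# Target 50% CPU, minimum 2, maximum 10 (the example bounds from step 1).
busy = desired_instances(current=4, metric=80, target=50, minimum=2, maximum=10)
quiet = desired_instances(current=4, metric=20, target=50, minimum=2, maximum=10)
flood = desired_instances(current=8, metric=90, target=50, minimum=2, maximum=10)
idle = desired_instances(current=2, metric=10, target=50, minimum=2, maximum=10)
print(busy, quiet, flood, idle)  # 7 2 10 2
```

The last two cases show the bounds doing their job: the flood is capped at the maximum, and the idle pool never drops below the minimum.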

Always keep a minimum of two instances across two availability zones, even at zero traffic. One instance means one crash equals an outage. Two across AZs means you survive losing one. It's the cheapest reliability win there is, and the same logic Netflix uses at massive scale.

Scaling out is not instant

A new instance takes time to boot, warm up, and pass health checks, often a minute or more. If traffic spikes faster than instances come up, users still see slowness during the gap. Scale on a leading signal (rising request count) rather than a lagging one, keep some headroom, and pre-warm before known spikes (a sale, a launch).
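A back-of-the-envelope calculation shows how much headroom the boot delay demands. All the numbers below (1,000 requests per second now, growing by 300 per minute, a two-minute boot time, 200 requests per second per instance) are made up for illustration, and the function names are hypothetical.

```python
import math


def instances_needed(rps, rps_per_instance):
    """Instances required to serve `rps` requests per second."""
    return math.ceil(rps / rps_per_instance)


def headroom_instances(current_rps, growth_rps_per_min, boot_minutes, rps_per_instance):
    """Extra instances to keep running so the pool can absorb traffic growth
    during the time it takes a new instance to boot and pass health checks."""
    future_rps = current_rps + growth_rps_per_min * boot_minutes
    return (instances_needed(future_rps, rps_per_instance)
            - instances_needed(current_rps, rps_per_instance))


# Hypothetical numbers: traffic will reach 1,600 rps before new capacity arrives.
needed_now = instances_needed(1000, 200)
spare = headroom_instances(current_rps=1000, growth_rps_per_min=300,
                           boot_minutes=2, rps_per_instance=200)
print(needed_now, spare)  # 5 3
```

In this example, running exactly five instances would leave users waiting during the gap; keeping three spare covers the two minutes it takes new ones to arrive.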

Common mistakes that cost hours

Stateful instances behind a load balancer. Storing sessions or carts in an instance's memory means requests that land elsewhere lose the data. Keep state in a shared database or cache.
A health check that lies. Always returning 200 even when the app is broken keeps the load balancer routing users into failure. Check real dependencies.
Scaling on a lagging metric, too slowly. By the time CPU is pegged, users are already suffering and new instances take a minute to help. Scale earlier, on a leading signal, with headroom.
Minimum of one instance. With no redundancy, one crash is a full outage. Run a minimum of two, across two availability zones.
No maximum on the scaler. A traffic flood (or a bug, or an attack) scales you to a five-figure bill overnight. Always cap the maximum.
Forgetting scale-in. Scaling out but never back down means you pay peak prices forever. Make sure instances are removed when load drops.
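Most of the list above can be turned into a quick pre-deploy check. The sketch below is a hypothetical linter, not a real tool: the `ScalingConfig` fields are invented for the example, and it leaves out the lagging-metric mistake because that is a judgment call rather than a yes/no setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingConfig:
    # All field names are hypothetical, chosen to mirror the list above.
    min_instances: int
    max_instances: Optional[int]          # None means "no cap"
    availability_zones: int
    health_check_verifies_dependencies: bool
    sessions_in_instance_memory: bool
    scale_in_enabled: bool


def lint(config):
    """Return a list of the common mistakes found in `config`."""
    problems = []
    if config.sessions_in_instance_memory:
        problems.append("stateful instances: move sessions to a shared store")
    if not config.health_check_verifies_dependencies:
        problems.append("health check may lie: check real dependencies")
    if config.min_instances < 2 or config.availability_zones < 2:
        problems.append("no redundancy: run at least two instances across two AZs")
    if config.max_instances is None:
        problems.append("no maximum: cap the scaler to cap the bill")
    if not config.scale_in_enabled:
        problems.append("no scale-in: you will pay peak prices forever")
    return problems


good = ScalingConfig(2, 10, 2, True, False, True)
bad = ScalingConfig(1, None, 1, False, True, False)
good_problems = lint(good)
bad_problems = lint(bad)
print(good_problems)
for problem in bad_problems:
    print("-", problem)
```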

Where to go next

The whole article in 6 lines

A **load balancer** is one public front door that spreads traffic across many identical instances.
**Health checks** pull failing instances out of rotation automatically. That's how a crash doesn't become an outage.
**Round-robin** is the sensible default algorithm; pick others only for a specific reason.
**Auto-scaling** changes how many instances you run based on a metric, between a min and max you set.
Instances must be **stateless** and you should run a **minimum of two across AZs**.
Cap the **maximum** and ensure **scale-in** works, or you'll pay peak prices forever.

Load balancing and auto-scaling are where a single-server deploy grows into a real system. Keep building:

Go deeper with the interactive lesson: Load Balancers.
See these patterns at planetary scale: How Netflix Built Its Streaming Pipeline.
Start from the single-server basics this builds on: Deploying Your First Production App.
Practise the networking these sit on in the browser: the Networking Lab.

Put even a single instance behind a load balancer today. The day you add the second one with zero downtime and watch traffic split across both, this whole article will click into place.
