Cloud · 12 min read · Jun 2026

Load Balancing & Auto-Scaling Explained

How does one URL serve millions of people when a single server would fall over? Two ideas: a load balancer that spreads traffic across many identical servers, and an auto-scaler that adds and removes servers as demand changes. This is the beginner-friendly mental model, with a diagram, the real algorithms, and the mistakes that cause outages.

Load Balancing · Autoscaling · Cloud · Foundations
Sri Balaji

Founder · TheSimplifiedTech

How one URL serves millions

Type a popular site's address, and millions of other people are doing the same thing right now. Yet no single computer on Earth could handle that: one server has finite CPU, memory, and network bandwidth. So how does one URL serve a crowd that would crush any single machine? The answer is the most important pattern in scalable systems, and it's made of exactly two ideas.

Idea one: run many identical copies of your app and spread traffic across them. That's load balancing. Idea two: automatically add more copies when it's busy and remove them when it's quiet. That's auto-scaling. Together they turn "my one server fell over under load" into "the system grew to meet demand and shrank to save money." This article builds both from scratch.

Who this is for

Anyone who's deployed an app to one server and wondered how real products handle traffic. No prior scaling experience needed. If you understand that a server can get overloaded, you're ready.

What a load balancer actually is

A load balancer is a single front door that receives all the traffic and spreads it across a pool of identical servers behind it, so no single server gets crushed, and a failing one is quietly skipped.

Users only ever see the load balancer's address. They have no idea whether there are two servers behind it or two hundred. You can add, remove, or replace servers and the public URL never changes. The everyday picture:

๐Ÿ›Ž๏ธ The host at the doorLoad balancer
๐Ÿฝ๏ธ Identical tables/serversApp instances
๐Ÿ‘€ "Is this table free and clean?"Health check
๐Ÿšซ Skipping a table that's a messRouting around an unhealthy instance
๐Ÿ“ˆ Opening more tables on a busy nightAuto-scaling out
A load balancer is the host at a busy restaurant deciding which open table gets the next party.

See it as a picture

All traffic hits the load balancer. It health-checks each app instance and forwards each request to a healthy one. A separate component, the auto-scaler, watches metrics like CPU and adds or removes instances behind the load balancer as load changes.

[Diagram. Nodes: Users (millions); Load Balancer (one public address); Instance 1 (healthy); Instance 2 (healthy); Instance 3 (added on load); Auto-scaler (watches CPU/req). Edge labels: requests, when scaled, add / remove, register.]

Users reach one load balancer, which spreads requests across a pool of healthy instances. The auto-scaler watches metrics (dashed) and changes the size of the pool, adding instances when busy, removing them when idle.

The instances must be stateless for this to work: any instance must be able to serve any request. If instance 2 stored your shopping cart in its own memory, your next request landing on instance 3 would lose it. Keep state in a shared place (a database, a cache) and any instance can serve anyone. That's the Twelve-Factor discipline paying off.
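To make the shopping-cart example concrete, here is a minimal sketch in Python. The class names (`SharedStore`, `StatefulInstance`, `StatelessInstance`) are hypothetical and invented for illustration; the in-memory `SharedStore` merely stands in for a real shared database or cache.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Stand-in for a shared database or cache (hypothetical, in-memory only)."""
    data: dict = field(default_factory=dict)


@dataclass
class StatefulInstance:
    """Anti-pattern: the cart lives in this instance's own memory."""
    name: str
    carts: dict = field(default_factory=dict)

    def add_to_cart(self, user: str, item: str) -> None:
        self.carts.setdefault(user, []).append(item)

    def get_cart(self, user: str) -> list:
        return self.carts.get(user, [])


@dataclass
class StatelessInstance:
    """The cart lives in the shared store, so any instance can serve any user."""
    name: str
    store: SharedStore

    def add_to_cart(self, user: str, item: str) -> None:
        self.store.data.setdefault(user, []).append(item)

    def get_cart(self, user: str) -> list:
        return self.store.data.get(user, [])


if __name__ == "__main__":
    # Stateful: the second request lands on another instance and the cart is gone.
    i2, i3 = StatefulInstance("instance-2"), StatefulInstance("instance-3")
    i2.add_to_cart("ana", "book")
    print(i3.get_cart("ana"))

    # Stateless: both instances read and write the same shared store.
    store = SharedStore()
    s2, s3 = StatelessInstance("instance-2", store), StatelessInstance("instance-3", store)
    s2.add_to_cart("ana", "book")
    print(s3.get_cart("ana"))
```

In the stateful pair, instance 3 has no idea the cart exists; in the stateless pair, whichever instance receives the request sees the same data.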

Health checks: the part that prevents outages

A load balancer is only as good as its health checks. Every few seconds it asks each instance "are you OK?" by hitting an endpoint (like /healthz). Healthy instances get traffic; an instance that fails its checks is pulled out of rotation automatically, with no human and no 3 a.m. page. When it recovers, it's added back. This is how one crashed server stops affecting users within a few check intervals.
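The bookkeeping on the load balancer's side can be sketched roughly as follows. This is an illustration, not how any particular product implements it: the names `InstanceHealth`, `record_check`, and `healthy_pool` are hypothetical, and the thresholds (3 consecutive failures to remove, 2 consecutive passes to restore) are assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Health-check state the load balancer keeps for one instance."""
    name: str
    in_rotation: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0


def record_check(inst: InstanceHealth, passed: bool,
                 unhealthy_after: int = 3, healthy_after: int = 2) -> None:
    """Update one instance after a single health-check result.

    The thresholds are assumptions for this sketch; requiring several
    results in a row avoids flapping on one slow response.
    """
    if passed:
        inst.consecutive_successes += 1
        inst.consecutive_failures = 0
        if not inst.in_rotation and inst.consecutive_successes >= healthy_after:
            inst.in_rotation = True   # recovered: add it back
    else:
        inst.consecutive_failures += 1
        inst.consecutive_successes = 0
        if inst.in_rotation and inst.consecutive_failures >= unhealthy_after:
            inst.in_rotation = False  # failing: pull it out of rotation


def healthy_pool(instances: list) -> list:
    """Names of the instances that may currently receive traffic."""
    return [inst.name for inst in instances if inst.in_rotation]
```

With these assumed thresholds and a check every few seconds, a crashed instance stops receiving traffic after three failed checks and rejoins after two passing ones.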

Pro tip

Your health check should verify the app can actually do its job (e.g. reach the database), not just that the process is running. A check that always returns 200 even when the database is down will keep routing users into a broken instance, defeating the whole point.
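A health endpoint that checks a real dependency might look something like this sketch, which uses only Python's standard library. `database_reachable` is a hypothetical placeholder you would replace with a cheap real probe, and the port in `serve` is an arbitrary assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_reachable() -> bool:
    """Hypothetical probe: replace with a cheap real check against your database."""
    return True


def health_status(probe=database_reachable):
    """Return (HTTP status, body) based on whether the app can do its job."""
    try:
        ok = probe()
    except Exception:
        ok = False  # a probe that raises counts as unhealthy
    if ok:
        return 200, {"status": "ok"}
    return 503, {"status": "unavailable", "reason": "dependency check failed"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        code, body = health_status()
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the health endpoint locally (the port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

The important design choice is the 503 branch: when the dependency is down, the instance says so, and the load balancer can stop sending it users. Keep the probe cheap, since it runs every few seconds on every instance.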

How it decides where to send each request

When a request arrives, the load balancer picks an instance using an algorithm. The defaults are simple and usually right, but the options are a classic interview topic and occasionally matter for real performance.

| Algorithm | How it picks | Best when |
| --- | --- | --- |
| Round-robin | Next instance in order, one after another | Instances are equal and requests are similar (the default) |
| Least connections | The instance with the fewest active connections | Requests vary a lot in duration (some slow, some fast) |
| Weighted round-robin | Round-robin, but bigger instances get more | Your instances aren't all the same size |
| IP hash / sticky | Same client always hits the same instance | App keeps per-user state in memory (avoid if you can) |
| Least response time | The instance answering fastest right now | You want to favour healthy, fast instances automatically |

Common load balancing algorithms: round-robin is the sensible default; reach for others when you have a specific reason.
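Three of these algorithms fit in a few lines each. The following Python sketch is deliberately simplified, and the function names are invented for illustration. In particular, the weighted version just repeats each instance according to an integer weight; real load balancers often interleave weighted picks more smoothly.

```python
from itertools import cycle


def round_robin(instances: list):
    """Return an iterator that yields instances in order, forever."""
    return cycle(instances)


def least_connections(active: dict) -> str:
    """Pick the instance with the fewest active connections.

    `active` maps instance name -> current connection count.
    Ties go to whichever instance appears first in the mapping.
    """
    return min(active, key=active.get)


def weighted_round_robin(weights: dict):
    """Return an iterator that yields each instance in proportion to its integer weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


if __name__ == "__main__":
    rr = round_robin(["a", "b", "c"])
    print([next(rr) for _ in range(5)])

    print(least_connections({"a": 5, "b": 2, "c": 7}))

    wrr = weighted_round_robin({"big": 2, "small": 1})
    print([next(wrr) for _ in range(6)])
```

Round-robin needs no knowledge of the instances at all, which is part of why it is the default. Least connections needs the load balancer to track live connection counts, and weighting needs you to tell it how large each instance is.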

Sticky sessions are a trap

Sticky sessions (for example, IP hash) pin a user to one instance so its in-memory state survives. It feels convenient, but it skews load distribution, and losing that instance logs the user out. The real fix is to make instances stateless and store session data in a shared cache. Reach for stickiness only when you genuinely can't go stateless.
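A rough sketch of IP-hash routing shows both the appeal and the trap. The function name `pick_by_ip` and the choice of SHA-256 are assumptions for this example; actual load balancers use their own hash functions.

```python
import hashlib


def pick_by_ip(client_ip: str, instances: list) -> str:
    """Map a client IP to one instance by hashing it (illustrative only)."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


if __name__ == "__main__":
    pool = ["instance-1", "instance-2", "instance-3"]
    ip = "203.0.113.7"  # documentation-range address, used as an example

    pinned = pick_by_ip(ip, pool)
    print(pinned == pick_by_ip(ip, pool))  # the same client keeps landing on the same instance

    # The pinned instance dies: the client is rehashed onto a survivor,
    # which has none of the in-memory session the old instance held.
    survivors = [name for name in pool if name != pinned]
    print(pick_by_ip(ip, survivors) in survivors)
```

Because the mapping depends on the size of the pool, adding or removing any instance can also move other clients to a different instance, not only the ones pinned to the instance that died.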

Auto-scaling: matching servers to demand

Load balancing spreads traffic across the instances you have. Auto-scaling changes how many you have. You set a metric and a target (for example, "keep average CPU around 50%"), and the auto-scaler adds instances when you're over and removes them when you're under, between a minimum and maximum you define.

  1. Set a target and bounds. For example: target 50% CPU, minimum 2 instances (for redundancy), maximum 10 (to cap the bill).
  2. The scaler watches the metric. Every minute or so it checks average CPU (or request count, or queue depth) across the pool.
  3. Busy → scale out. Sustained CPU above target adds instances. They boot, pass health checks, and the load balancer starts sending them traffic.
  4. Quiet → scale in. Sustained low CPU removes instances so you stop paying for idle capacity.
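The scaling decision in the steps above can be sketched as one small function. This assumes a proportional rule (scale the instance count by the ratio of the observed metric to the target, then clamp to the bounds), which is one common approach and the one the Kubernetes Horizontal Pod Autoscaler documents; other auto-scalers use step or threshold policies instead. The name `desired_instances` is hypothetical, and the sketch ignores the "sustained" part, which real scalers handle with evaluation windows and cooldowns.

```python
import math


def desired_instances(current: int, metric: float, target: float,
                      minimum: int, maximum: int) -> int:
    """Proportional scaling: grow or shrink the pool so the metric returns to target.

    current: instances running now
    metric:  observed average (e.g. CPU percent across the pool)
    target:  value the scaler tries to hold (e.g. 50)
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    raw = math.ceil(current * metric / target)
    return max(minimum, min(maximum, raw))


if __name__ == "__main__":
    # Bounds from the example above: target 50% CPU, minimum 2, maximum 10.
    print(desired_instances(4, 75, 50, 2, 10))  # busy: scale out
    print(desired_instances(4, 20, 50, 2, 10))  # quiet: scale in
    print(desired_instances(8, 90, 50, 2, 10))  # would exceed the cap
```

With 4 instances at 75% CPU the rule asks for 6; at 20% it asks for 2; and with 8 instances at 90% the raw answer of 15 is clamped to the maximum of 10.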

Always keep a minimum of two instances across two availability zones (AZs), even at zero traffic. One instance means one crash equals an outage. Two across AZs means you survive losing one. It's the cheapest reliability win there is, and the same logic Netflix uses at massive scale.

Scaling out is not instant

A new instance takes time to boot, warm up, and pass health checks, often a minute or more. If traffic spikes faster than instances come up, users still see slowness during the gap. Scale on a leading signal (rising request count) rather than a lagging one, keep some headroom, and pre-warm before known spikes (a sale, a launch).
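Some back-of-the-envelope arithmetic helps with headroom and pre-warming. The sketch below is plain arithmetic under simplifying assumptions: every instance handles the same fixed request rate, and traffic jumps to the spike level instantly. Both function names and all the numbers in the demo are made up for illustration.

```python
import math


def instances_needed(expected_rps: float, rps_per_instance: float,
                     target_utilization: float) -> int:
    """Instances required to serve expected_rps while staying at the target utilization."""
    if rps_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    return math.ceil(expected_rps / (rps_per_instance * target_utilization))


def requests_over_capacity(spike_rps: float, capacity_rps: float,
                           boot_seconds: float) -> float:
    """Requests arriving above current capacity while new instances are still booting."""
    return max(0.0, spike_rps - capacity_rps) * boot_seconds


if __name__ == "__main__":
    # Pre-warming for a launch expected to peak at 3000 requests per second,
    # with instances that handle 200 rps each and a 50% utilization target.
    print(instances_needed(3000, 200, 0.5))

    # A spike to 1500 rps against 1000 rps of capacity, with a 90-second boot time.
    print(requests_over_capacity(1500, 1000, 90))
```

In the first case you would want 30 instances running before the launch. In the second, roughly 45,000 requests arrive above capacity before new instances can help, which is the gap that headroom and earlier scaling are meant to close.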

Common mistakes that cost hours

  1. Stateful instances behind a load balancer. Storing sessions or carts in an instance's memory means requests that land elsewhere lose the data. Keep state in a shared database or cache.
  2. A health check that lies. Always returning 200 even when the app is broken keeps the load balancer routing users into failure. Check real dependencies.
  3. Scaling on a lagging metric, too slowly. By the time CPU is pegged, users are already suffering and new instances take a minute to help. Scale earlier, on a leading signal, with headroom.
  4. Minimum of one instance. There's no redundancy, so one crash is a full outage. Run a minimum of two, across two availability zones.
  5. No maximum on the scaler. A traffic flood (or a bug, or an attack) scales you to a five-figure bill overnight. Always cap the maximum.
  6. Forgetting scale-in. Scaling out but never back down means you pay peak prices forever. Make sure instances are removed when load drops.
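Several of these mistakes are visible in configuration before anything breaks. As a hedged illustration, a small check like the one below could flag mistakes 4, 5, and 6. `ScalingConfig` and `lint` are hypothetical names, not part of any cloud provider's API; you would map the fields to whatever your platform actually exposes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingConfig:
    """Hypothetical summary of an auto-scaling setup."""
    min_instances: int
    max_instances: Optional[int]  # None means no cap is set
    availability_zones: int
    scale_in_enabled: bool


def lint(config: ScalingConfig) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if config.min_instances < 2:
        problems.append("minimum below 2: one crash is a full outage")
    if config.availability_zones < 2:
        problems.append("single availability zone: losing the zone loses everything")
    if config.max_instances is None:
        problems.append("no maximum: a flood, bug, or attack can run up the bill")
    elif config.max_instances < config.min_instances:
        problems.append("maximum is below minimum")
    if not config.scale_in_enabled:
        problems.append("scale-in disabled: you keep paying for peak capacity")
    return problems


if __name__ == "__main__":
    for problem in lint(ScalingConfig(1, None, 1, False)):
        print(problem)
```

Mistakes 1 to 3 (stateful instances, lying health checks, lagging metrics) live in application behaviour rather than configuration, so a check like this cannot catch them.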

Where to go next

The whole article in 6 lines

  • A **load balancer** is one public front door that spreads traffic across many identical instances.
  • **Health checks** pull failing instances out of rotation automatically; that's how a crash doesn't become an outage.
  • **Round-robin** is the sensible default algorithm; pick others only for a specific reason.
  • **Auto-scaling** changes how many instances you run based on a metric, between a min and max you set.
  • Instances must be **stateless** and you should run a **minimum of two across AZs**.
  • Cap the **maximum** and ensure **scale-in** works, or you'll pay peak prices forever.

Load balancing and auto-scaling are where a single-server deploy grows into a real system.

Put even a single instance behind a load balancer today. The day you add the second one with zero downtime and watch traffic split across both, this whole article will click into place.
