The Simplified Tech


© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

API Gateway and Load Balancing: Traffic at Scale

Every request to your backend passes through these layers — understand them deeply

🎯Key Takeaways
API gateways centralize cross-cutting concerns: auth, rate limiting, routing, observability.
Least Response Time load balancing is the production default for stateless HTTP services.
Consistent Hashing minimizes cache invalidation when adding/removing backend nodes.
Token Bucket rate limiting: allows bursts up to bucket capacity, throttles to steady-state rate.
Liveness = process alive (restart if failing). Readiness = can serve traffic (remove from LB if failing).
Graceful shutdown: stop accepting, finish in-flight requests, close connections, exit.


The API Gateway: Your System's Front Door

Before API gateways, every client connected to every backend service directly. Adding authentication meant adding it to every service. Rate limiting meant reimplementing it everywhere. An API gateway centralizes these cross-cutting concerns.

What an API Gateway Does

  • 🔐Authentication & Authorization — Verify JWT/OAuth tokens once at the edge. Services trust the gateway.
  • Rate Limiting — Protect backend services from abuse. 100 requests/minute per API key. Implemented at the edge.
  • 🗺️Request Routing — /api/v1/users → user-service. /api/v1/orders → order-service. Clients see one URL.
  • 🔄Protocol Translation — REST-to-gRPC: clients send JSON REST, gateway converts to gRPC for internal services.
  • 📊Observability — Centralized logging, metrics, and distributed tracing injection for all requests.
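The routing concern above can be sketched as a longest-prefix resolver. This is a minimal illustration, not any particular gateway's API; the service names and URLs are hypothetical:

```typescript
// Hypothetical route table: path prefix → internal service base URL
const routes: Record<string, string> = {
  "/api/v1/users": "http://user-service:8080",
  "/api/v1/orders": "http://order-service:8080",
};

// Pick the upstream whose prefix matches the request path.
// Longest prefix wins, so "/api/v1/users/42" resolves to user-service
// even if a shorter "/api" route also exists.
function resolveUpstream(path: string): string | null {
  let best: string | null = null;
  for (const prefix of Object.keys(routes)) {
    if (path.startsWith(prefix) && (best === null || prefix.length > best.length)) {
      best = prefix;
    }
  }
  return best ? routes[best] : null;
}
```

Real gateways (Kong, Envoy, AWS API Gateway) layer matching on method, host, and headers as well, but prefix routing is the core idea.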

Load Balancing: Distributing Traffic Intelligently

| Algorithm | How It Works | Best For | Limitation |
| --- | --- | --- | --- |
| Round Robin | Cycle through servers: 1, 2, 3, 1, 2, 3… | Stateless services with similar request costs | Ignores server load; slow servers still get traffic |
| Weighted Round Robin | Server A gets 2× the requests of Server B | Heterogeneous server capacities | Static weights don't adapt to real-time load |
| Least Connections | Route to the server with the fewest active connections | Long-lived connections, variable request times | Doesn't account for request cost differences |
| IP Hash | hash(client_ip) % N → same server always | Session affinity (sticky sessions) | Uneven distribution; adding servers changes mappings |
| Least Response Time | Route to the server with the lowest average latency and fewest connections | General production use (nginx, HAProxy) | Requires tracking response times |
| Consistent Hashing | Virtual nodes on a ring; minimal redistribution on changes | Caches, distributed KV stores | Complex implementation |
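The selection logic behind two of these algorithms fits in a few lines. A sketch, assuming a `Backend` shape with per-server stats; the latency-weighting formula in `leastResponseTime` is illustrative, not taken from any particular proxy:

```typescript
interface Backend {
  host: string;
  activeConnections: number;
  avgResponseMs: number;
}

// Least Connections: pick the backend with the fewest in-flight requests
function leastConnections(backends: Backend[]): Backend {
  return backends.reduce((a, b) => (b.activeConnections < a.activeConnections ? b : a));
}

// Least Response Time (simplified): combine observed latency with current
// load, so a fast-but-busy server and a slow-but-idle server compete fairly
function leastResponseTime(backends: Backend[]): Backend {
  const score = (s: Backend) => s.avgResponseMs * (s.activeConnections + 1);
  return backends.reduce((a, b) => (score(b) < score(a) ? b : a));
}
```

Production balancers additionally use EWMA smoothing on latency and eject unhealthy backends before selection; the core comparison is the same.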

Production Recommendation

Stateless HTTP services: Least Response Time (or Round Robin with health checks). Stateful services or caches: Consistent Hashing. Sticky sessions required: IP Hash (but prefer making services stateless instead).
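To make the consistent-hashing recommendation concrete, here is a toy hash ring with virtual nodes. The 32-bit FNV-1a hash and the 100-vnode count are illustrative choices; production systems rely on battle-tested ring implementations:

```typescript
// Simple 32-bit FNV-1a hash — fine for illustration, not for production
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  private ring: { point: number; node: string }[] = [];

  constructor(nodes: string[], private vnodes = 100) {
    for (const node of nodes) this.add(node);
  }

  // Each physical node gets `vnodes` points on the ring, which smooths
  // out the key distribution across nodes
  add(node: string): void {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // The first virtual node clockwise from the key's hash owns the key.
  // Adding or removing a node only remaps keys adjacent to its points —
  // the property that keeps cache invalidation minimal.
  lookup(key: string): string {
    const h = fnv1a(key);
    for (const entry of this.ring) {
      if (entry.point >= h) return entry.node;
    }
    return this.ring[0].node; // wrap around past the top of the ring
  }
}
```

Contrast with `hash(key) % N`: there, changing N remaps nearly every key; here, only the keys owned by the added or removed node's ring segments move.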

rate-limiter.ts

```typescript
// Token Bucket Rate Limiter using Redis
// Handles distributed rate limiting across multiple API gateway instances
import Redis from "ioredis";
import type { Request, Response, NextFunction } from "express";

const redis = new Redis(process.env.REDIS_URL);

async function checkRateLimit(
  identifier: string, // 'user:123' or 'ip:1.2.3.4'
  maxTokens: number,
  refillRate: number // tokens per second
): Promise<{ allowed: boolean; remaining: number }> {
  const key = `ratelimit:${identifier}`;
  const now = Date.now() / 1000;

  // Lua script executes atomically in Redis — no race condition
  // between the check and the decrement
  const luaScript = `
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or max_tokens
    local last_refill = tonumber(data[2]) or now

    -- Refill based on elapsed time. A full bucket allows a burst of up
    -- to max_tokens; steady state is limited to refill_rate tokens/sec
    local elapsed = now - last_refill
    tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    if tokens >= 1 then
      tokens = tokens - 1
      redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
      redis.call('EXPIRE', key, 86400)
      return {1, math.floor(tokens)} -- {allowed, remaining}
    else
      return {0, 0} -- denied
    end
  `;

  const [allowed, remaining] = (await redis.eval(
    luaScript, 1, key,
    maxTokens.toString(), refillRate.toString(), now.toString()
  )) as [number, number];

  return { allowed: allowed === 1, remaining };
}

// Middleware — assumes an upstream auth layer has populated req.user
export async function rateLimitMiddleware(req: Request, res: Response, next: NextFunction) {
  const result = await checkRateLimit(`user:${req.user.id}`, 100, 10);

  // Always return rate limit headers — clients need these to implement backoff
  res.setHeader('X-RateLimit-Remaining', result.remaining);

  if (!result.allowed) {
    // 429 Too Many Requests is the correct status for rate limiting (RFC 6585)
    return res.status(429).json({ error: 'Rate limit exceeded' });
  }
  next();
}
```

Health Checks and Graceful Shutdown

Health Check Types (Kubernetes)

  • 💓Liveness check — Is the process running? GET /healthz → 200. If this fails, Kubernetes restarts the container.
  • ✅Readiness check — Can the service handle requests? Checks DB connection, cache. If this fails, Kubernetes removes the pod from the load balancer (but doesn't restart it).
  • 🔬Startup check — For slow-starting services (JVM warmup): don't run liveness checks until startup complete.
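The liveness/readiness split can be sketched as two plain handlers returning HTTP status codes; the dependency checks passed to readiness (DB ping, cache ping) are stand-ins you would wire to real clients:

```typescript
type Check = () => Promise<boolean>;

// Liveness is deliberately shallow: if the event loop can answer, the
// process is alive. Deep dependency checks here cause needless restarts.
function livenessStatus(): number {
  return 200;
}

// Readiness aggregates dependency checks (DB, cache, ...). Any failure
// means "remove me from the load balancer", not "restart me".
async function readinessStatus(checks: Check[]): Promise<number> {
  const results = await Promise.all(checks.map((c) => c().catch(() => false)));
  return results.every(Boolean) ? 200 : 503;
}
```

Wire these to `GET /healthz` and `GET /readyz` in your framework of choice; keeping them separate is what lets Kubernetes take the right action for each failure mode.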

Graceful Shutdown Pattern

On SIGTERM: (1) stop accepting new connections, (2) finish processing in-flight requests, (3) close DB connections, (4) exit. Kubernetes waits terminationGracePeriodSeconds (default 30s). Without graceful shutdown, in-flight requests return 500 errors.
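A sketch of that SIGTERM sequence for a Node.js HTTP server. The 25-second force-exit deadline is an illustrative value chosen to stay under the default 30s grace period, and `closeResources` stands in for whatever DB pools or clients your service holds:

```typescript
import { createServer, Server } from "node:http";

function installGracefulShutdown(server: Server, closeResources: () => Promise<void>): void {
  process.on("SIGTERM", () => {
    // (1) Stop accepting new connections; in-flight requests keep running
    server.close(async () => {
      // (2) All in-flight requests have finished
      await closeResources(); // (3) Close DB pools, flush logs, etc.
      process.exit(0);        // (4) Exit cleanly
    });
    // Safety net: force-exit before Kubernetes follows up with SIGKILL.
    // unref() lets the process exit earlier if shutdown finishes first.
    setTimeout(() => process.exit(1), 25_000).unref();
  });
}
```

One subtlety this sketch glosses over: `server.close` waits for open keep-alive connections, so production services also track sockets and destroy idle ones after a shorter delay.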

How this might come up in interviews

Load balancing and gateway questions test system design fundamentals. Know the algorithms and tradeoffs. The sticky sessions and consistent hashing questions are classics.

Common questions:

  • Explain the difference between L4 and L7 load balancing
  • How would you implement rate limiting in a distributed system?
  • What is consistent hashing and when is it used?
  • How do you implement graceful shutdown for a Node.js service?

Strong answers include:

  • Knows consistent hashing and its use cases
  • Discusses token bucket vs leaky bucket tradeoffs
  • Mentions distributed rate limiting with Redis for horizontal scale
  • Knows graceful shutdown patterns

Red flags:

  • Only knows round-robin
  • Doesn't know the difference between liveness and readiness probes
  • Implements rate limiting in app code only (doesn't scale horizontally)



From the books

Web Scalability for Startup Engineers — Artur Ejsmont (2015)

Chapter 7: Scaling with a Load Balancer

Load balancers are not just for horizontal scaling — they provide health checking, SSL termination, and graceful deployments (rolling restarts) that would otherwise require downtime.
