Graceful Degradation & Load Shedding: Staying Partly Up When Everything Wants to Fall Down

The cascade that took everything down

It starts small. One downstream dependency, the recommendations service, gets slow. Requests that used to return in 40ms now take 4 seconds. Nobody notices yet, because every caller is patiently waiting. Then those callers run out of threads. Their own callers start timing out. Retries kick in, doubling the traffic to the already-drowning service. Within ninety seconds, a single slow dependency has turned into a full-site outage: the checkout page, the login page, and the health check are all gone. None of those needed recommendations at all.

This is the defining failure mode of distributed systems: not a clean crash, but a cascade. One component's slowness becomes everyone's slowness, retries amplify the load, thread pools and connection pools fill up, and the failure spreads outward like a flood. The cruel part is that the system had plenty of capacity to serve the *important* traffic; it just spent all of it waiting on, and retrying, the unimportant stuff.

Graceful degradation and load shedding are the discipline of refusing to let that happen. Instead of trying (and failing) to serve everything, you deliberately serve less (shed the excess, trip a breaker, fall back to a cache, drop the specials) so the core of the system stays alive. This article is the playbook.

Who this is for

Backend and platform engineers, and SREs, who own a service that talks to other services. If you've ever watched a dashboard go red top-to-bottom from one slow dependency, or you're designing a system you don't want that to happen to, this is for you. Comfort with HTTP, timeouts, and basic concurrency is assumed. Pairs with [Reliability & Resilience: Designing for Failure](/blog/reliability-resilience-design-for-failure).

A degraded service beats a dead one

When you cannot serve every request well, serve the important requests well and reject the rest fast. A degraded service that answers the critical 80% beats a perfect service that answers nothing.
The core principle of load shedding

The instinct under load is to try harder: queue the request, retry the dependency, hold the connection open a little longer. Every one of those instincts makes a cascade worse. The counterintuitive truth is that under overload, the kindest thing you can do is say no quickly. A fast rejection frees the resources a slow success would have hogged, and it lets the caller fail over or back off instead of piling on.

In the kitchen	In your service
Kitchen gets slammed on a Friday night	Service hits overload: CPU, threads, or queue depth maxed
Drop the specials and the 40-minute braise from the menu	Feature degradation: turn off recommendations, personalization
Stop seating walk-ins, honor existing reservations	Load shedding: reject low-priority traffic, protect critical paths
Pre-plated bread instead of made-to-order appetizers	Fallback: serve a cached or default response instead of the live one
Tell the table that won't stop ordering to wait	Rate limiting: cap how much any single client can demand
Stop sending orders to the broken fryer	Circuit breaker: quit calling a dependency that keeps failing

A busy kitchen doesn't refuse every customer when it's slammed; it shrinks the menu.

The picture: a request's gauntlet

Resilience patterns aren't a grab-bag; they form a pipeline. Each one is a gate a request passes through, and each gate's job is to fail cheaply so the next gate (and the expensive service behind them) is protected. Here's the path a single request takes, and where excess gets shed.

A request runs a gauntlet: rate limiter caps the client, the circuit breaker guards the dependency, the service enforces a timeout, and a fallback/cache answers on the branch. Shed traffic exits fast (dashed).

1. Client sends a request
It arrives at the edge of your service. Nothing expensive has happened yet, so this is the cheapest place to say no.
2. Rate limiter checks the quota
Is this client (by API key, IP, or user) within its allowed rate? If not, return 429 immediately on the dashed shed-excess branch. One noisy client can't starve everyone else.
3. Circuit breaker checks the dependency's health
If recent calls to the downstream service have been failing, the breaker is open: skip the call entirely and go straight to the fallback. Don't waste a thread on a known-broken dependency.
4. Service calls downstream with a timeout
The breaker is closed, so make the real call, but bound it. If it doesn't answer within the deadline, abandon it. A request that waits forever is a thread that helps no one.
5. Fallback answers on timeout, error, or open breaker
Serve a cached value, a default, or a trimmed-down response. The user gets *something* useful instead of a spinner or a 500.
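
The quota check in step 2 is often built as a token bucket. What follows is a minimal sketch under stated assumptions, not a production limiter: `TokenBucket`, `check_quota`, the in-memory `buckets` dict, and the default rate and capacity are all hypothetical names and values chosen for illustration, and a real edge would usually keep this state in a gateway or shared store.

```python
import time


class TokenBucket:
    """Per-client bucket that refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full so a new client may burst
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last check, capped at capacity.
        refill = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller should answer 429 right away


# Hypothetical in-process store, keyed by API key, IP, or user id.
buckets: dict[str, TokenBucket] = {}


def check_quota(client_id: str, rate: float = 5.0, capacity: float = 10.0) -> int:
    """Return the status the edge should use: 200 to proceed, 429 to shed."""
    bucket = buckets.get(client_id)
    if bucket is None:
        bucket = buckets[client_id] = TokenBucket(rate, capacity)
    return 200 if bucket.allow() else 429
```

The check does no I/O and touches only a couple of floats, which is what makes the edge the cheapest place to say no.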

The five techniques, and when each one fits

Each pattern solves a different shape of overload. Reach for the wrong one and you'll either drop traffic you needed or keep serving traffic you couldn't afford. Here's the cheat sheet.

Technique	What it does	When to use
Load shedding	Rejects low-priority requests fast when past capacity, protecting critical traffic	Total demand exceeds capacity and you must choose what to serve: protect checkout, drop analytics
Rate limiting	Caps how much any single client can request per window	One caller can overwhelm a shared resource; you need fairness and abuse protection at the edge
Circuit breaker	Stops calling a dependency that keeps failing, then probes to see if it recovered	A downstream dependency is down or slow and retrying it just wastes your threads
Retry + backoff + jitter	Retries transient failures with growing, randomized delays	Failures are genuinely transient (network blip, brief 503) and the call is idempotent; never for overload
Fallback / degradation	Serves a cached, default, or reduced response when the real one is unavailable	A non-critical feature fails and a stale-or-simplified answer is far better than an error

Match the technique to the failure shape. They compose, but each has a primary job.
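
To make the fallback row concrete, here is a rough sketch of a stale-cache fallback for a non-critical feature. Everything named here is an assumption for illustration: `recommendations_with_fallback`, the in-memory `_cache`, the empty `DEFAULT_RECS`, and the 300-second staleness tolerance are hypothetical, and real code would catch narrower exceptions than `Exception`.

```python
import time
from typing import Callable

_cache: dict[str, tuple[float, list[str]]] = {}   # key -> (stored_at, value)
DEFAULT_RECS: list[str] = []                       # hypothetical safe default
MAX_STALE_SECONDS = 300.0                          # assumed staleness tolerance


def recommendations_with_fallback(
    user_id: str,
    fetch_live: Callable[[str], list[str]],
    now: Callable[[], float] = time.monotonic,
) -> tuple[list[str], str]:
    """Return (value, source); source is 'live', 'stale-cache', or 'default'."""
    try:
        value = fetch_live(user_id)
        _cache[user_id] = (now(), value)           # refresh on every success
        return value, "live"
    except Exception:                              # narrow this in real code
        cached = _cache.get(user_id)
        if cached and now() - cached[0] <= MAX_STALE_SECONDS:
            return cached[1], "stale-cache"        # stale but still useful
        return DEFAULT_RECS, "default"             # nothing usable cached
```

Returning the source alongside the value makes it easy to count fallback hits, which matters once you start alerting on them.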

Retries and overload don't mix

Retry is for **transient** faults, not for an overloaded system. If a service is shedding load because it's saturated, retrying just hands it the same request again, now multiplied across every caller. Pair retries with a circuit breaker so that once failures cross a threshold, you stop retrying entirely.

How to actually roll this out

You don't bolt on all five patterns at once. Add them in the order that buys the most safety per unit of effort, and measure as you go.

1. Set timeouts on every outbound call first
This is the single highest-leverage change. An unbounded call is the seed of every cascade. Pick deadlines from your latency SLO (e.g. p99 + headroom), not from a round number.
2. Classify your traffic by priority
Tag requests as critical (login, checkout, health checks) vs. sheddable (recommendations, analytics, batch refreshes). You can't shed intelligently until you know what's expendable.
3. Add a circuit breaker around each risky dependency
Trip on an error-rate threshold over a rolling window. When open, route to the fallback instead of the dependency. Probe periodically (half-open) to detect recovery.
4. Add retries with exponential backoff and jitter, carefully
Only for idempotent, transient-failure-prone calls. Cap the attempts. Always add jitter so callers don't synchronize into a thundering herd.
5. Add a load shedder at the edge
Watch a saturation signal (queue depth, CPU, in-flight count). Past a threshold, reject sheddable traffic with 503 before it enters the system. Keep a reserved slice of capacity for critical requests.
6. Wire it all to observability
Emit metrics for shed count, breaker state changes, retry counts, and fallback hits. You can't tune what you can't see, and a silently-open breaker is its own incident.
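
Steps 2 and 5 can be sketched together as an admission check on the in-flight count that stops admitting sheddable traffic before critical traffic. This is one possible shape, not a reference implementation: `LoadShedder`, `try_admit`, and the specific limits are hypothetical, and in-flight count is just one of the saturation signals mentioned above.

```python
import threading


class LoadShedder:
    """Admit by in-flight count, reserving headroom for critical traffic."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int):
        self.max_in_flight = max_in_flight
        # Sheddable traffic is cut off earlier, leaving the reserve untouched.
        self.sheddable_limit = max_in_flight - reserved_for_critical
        self.in_flight = 0
        self.shed_count = 0         # worth exporting as a metric (step 6)
        self._lock = threading.Lock()

    def try_admit(self, critical: bool) -> bool:
        limit = self.max_in_flight if critical else self.sheddable_limit
        with self._lock:
            if self.in_flight >= limit:
                self.shed_count += 1
                return False        # caller answers 503 without doing work
            self.in_flight += 1
            return True

    def release(self) -> None:
        """Call exactly once for every admitted request, e.g. in a finally."""
        with self._lock:
            self.in_flight -= 1
```

A handler would call `try_admit(critical=...)` first, return 503 on `False`, and otherwise do its work inside `try: ... finally: shedder.release()` so a crashing request cannot leak a slot.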

Code: timeout + retry with exponential backoff and jitter

Here's the pattern you'll reach for most: bound every call with a timeout, retry only transient failures, and grow the delay exponentially with full jitter so a fleet of clients doesn't retry in lockstep. The jitter is not optional; it's what turns a synchronized retry storm into a smooth trickle.

resilient_client.py

```python
import random
import time
import httpx

# Only these are worth retrying; they're usually transient.
RETRYABLE_STATUS = {502, 503, 504}
MAX_ATTEMPTS = 4          # original try + 3 retries, then give up
BASE_DELAY = 0.2          # seconds
MAX_DELAY = 5.0           # cap so we never wait absurdly long
CALL_TIMEOUT = 1.0        # bound EVERY call; this is non-negotiable


class NonRetryable(Exception):
    """A status retrying won't fix: any 4xx/5xx outside RETRYABLE_STATUS."""


def fetch_with_resilience(client: httpx.Client, url: str) -> httpx.Response:
    last_exc: Exception | None = None

    for attempt in range(MAX_ATTEMPTS):
        try:
            # Bound the call. A request that waits forever is a thread that
            # helps no one. Note: httpx applies a float timeout to each phase
            # (connect, read, write, pool) separately, so this is not a strict
            # end-to-end deadline.
            resp = client.get(url, timeout=CALL_TIMEOUT)

            if resp.status_code < 400:
                return resp
            if resp.status_code not in RETRYABLE_STATUS:
                # Not caught below, so it propagates to the caller at once.
                raise NonRetryable(f"non-retryable status {resp.status_code}")

            last_exc = httpx.HTTPStatusError(
                "retryable", request=resp.request, response=resp
            )

        except (httpx.TimeoutException, httpx.TransportError) as exc:
            last_exc = exc  # network blip, fall through to backoff

        # If this was the final attempt, stop here.
        if attempt == MAX_ATTEMPTS - 1:
            break

        # Exponential backoff with FULL JITTER:
        #   sleep = random_between(0, min(MAX_DELAY, base * 2 ** attempt))
        # Jitter de-synchronizes callers so they don't retry as a herd.
        ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
        time.sleep(random.uniform(0, ceiling))

    raise last_exc or RuntimeError("all retry attempts exhausted")
```

Three details earn their keep here. `CALL_TIMEOUT` bounds every attempt; without it, a single hung call defeats the whole pattern. `MAX_ATTEMPTS` guarantees the loop terminates; infinite retries are how you turn a brief blip into a self-inflicted DDoS. And full jitter is what prevents the thundering herd: if a thousand clients all failed at the same instant, fixed backoff would have them all retry at the same instant too.
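
To see what those constants add up to, the short calculation below works out the backoff ceilings and a rough worst-case duration for one call to `fetch_with_resilience`. It reuses the constant values from the listing and assumes, as a simplification, that each attempt takes at most `CALL_TIMEOUT`; a per-phase timeout can let a slow response run longer, so treat the total as an estimate rather than a guarantee.

```python
BASE_DELAY = 0.2
MAX_DELAY = 5.0
MAX_ATTEMPTS = 4
CALL_TIMEOUT = 1.0

# A sleep happens only between attempts, so there are MAX_ATTEMPTS - 1 sleeps.
ceilings = [
    min(MAX_DELAY, BASE_DELAY * 2 ** attempt)
    for attempt in range(MAX_ATTEMPTS - 1)
]

worst_case_sleep = sum(ceilings)                 # every draw hits its ceiling
expected_sleep = sum(c / 2 for c in ceilings)    # mean of uniform(0, c) is c/2
worst_case_total = MAX_ATTEMPTS * CALL_TIMEOUT + worst_case_sleep

print([round(c, 2) for c in ceilings])   # [0.2, 0.4, 0.8]
print(round(expected_sleep, 2))          # 0.7
print(round(worst_case_total, 2))        # 5.4
```

With these values `MAX_DELAY` never comes into play; it only starts capping the ceiling at the sixth attempt (0.2 × 2⁵ = 6.4 seconds). The worst-case total is the number to compare against the caller's own deadline.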

Pro tip

In production you'd combine this with a circuit breaker so that once the dependency's error rate crosses a threshold, you stop retrying altogether and serve a fallback. Retry handles the **blip**; the breaker handles the **outage**. Service meshes give you both declaratively; explore that hands-on in the [Istio service mesh lab](/labs/istio-service-mesh).
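
For a feel of what the breaker half looks like in code, here is a deliberately small sketch. It is a simplification: it trips on consecutive failures rather than on an error rate over a rolling window, and after the cool-down it lets any caller probe instead of admitting exactly one. `CircuitBreaker`, `call_with_breaker`, and the default thresholds are hypothetical names and values.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> probe after cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None   # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Open: once the cool-down has passed, let a probe through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                 # close the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()     # (re)open, restart the cool-down


def call_with_breaker(breaker: CircuitBreaker, call, fallback):
    """Run `call` unless the breaker is open; use `fallback` on any failure."""
    if not breaker.allow_request():
        return fallback()                     # skip the dependency entirely
    try:
        result = call()
    except Exception:                         # narrow this in real code
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

Wrapping `fetch_with_resilience` as the `call` argument gives the split described above: the retry loop absorbs a blip, and once failures persist the breaker stops the loop from running at all.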

Common mistakes that turn a blip into an outage

Retry storms. Every layer retries the layer below it. Three layers that each make up to 3 attempts turn one user request into as many as 27 calls (3 × 3 × 3) to a dependency that's already on its knees. Pick one layer to own retries, usually the one closest to the dependency.
No jitter. Fixed or purely-exponential backoff without randomization means clients that fail together retry together. You get traffic in synchronized waves that hammer the recovering service exactly when it's most fragile. Always add jitter.
Infinite retries. A retry loop with no attempt cap is a denial-of-service weapon aimed at yourself. Cap attempts, cap total elapsed time, and give up gracefully into a fallback.
Retrying non-idempotent calls. Retrying a POST /charge can double-charge a customer. Only auto-retry idempotent operations, or use an idempotency key so the server can dedupe.
Unbounded timeouts (or none at all). A call with no timeout, or a 30-second one on a path with a 200ms SLO, is the root cause of most cascades. Derive timeouts from the SLO of the path, and make inner timeouts shorter than outer ones.
Shedding blindly. Dropping requests at random under load means you drop checkout as often as analytics. Classify traffic by priority first; shed the expendable, protect the critical.
A breaker that's open and silent. If a circuit breaker trips and nobody gets paged, you've quietly degraded a feature indefinitely. Alert on breaker-state changes and fallback-hit rates.
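
The retry-storm arithmetic is worth working through once, because it grows faster than intuition suggests. The helper below is purely illustrative (`worst_case_calls` is a hypothetical name): it multiplies the maximum attempts at each layer to get the number of calls that can reach the bottom dependency when every attempt fails.

```python
from math import prod


def worst_case_calls(attempts_per_layer: list[int]) -> int:
    """Calls hitting the last dependency if every attempt at every layer fails."""
    return prod(attempts_per_layer)


print(worst_case_calls([3, 3, 3]))   # 27: three layers, 3 attempts each
print(worst_case_calls([4, 4, 4]))   # 64: three layers, 1 try + 3 retries each
print(worst_case_calls([1, 1, 4]))   # 4: only the innermost layer retries
```

The last line is the case for letting a single layer own retries: the same policy costs 4 calls instead of 64.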

Takeaways

The whole article in seven lines

Overload's real danger is the **cascade**: one slow dependency, amplified by waiting and retries, takes down everything.
A **degraded service beats a dead one**: under overload, say no fast to free the resources a slow success would waste.
The patterns form a pipeline: **rate limit** the client, **break** the bad dependency, **time out** the call, **fall back** on failure, **shed** the excess.
**Timeouts first**: an unbounded call is the seed of every cascade. Then classify traffic, then breakers, then careful retries, then shedding.
**Retry is for transient blips, not overload**: pair it with a circuit breaker and never let it multiply load on a saturated service.
Backoff **must have jitter** and a **cap**: fixed backoff synchronizes herds, and infinite retries DDoS yourself.
Shed by **priority**, not at random: protect checkout and login, drop recommendations and analytics.

Where to go next

These patterns are most powerful when they're enforced at the platform layer instead of hand-coded in every service. A service mesh gives you timeouts, retries with backoff, and circuit breaking as configuration, and a load balancer or API gateway is where edge rate limiting and shedding usually live.

Practice mesh-level retries, timeouts, and outlier detection in the Istio service mesh lab; it's where circuit breaking goes from code to config.
Brush up on the kubectl lab to inspect pod health, readiness probes, and rollouts, the substrate that load shedding runs on top of.
Read the companion piece Reliability & Resilience: Designing for Failure for the broader picture: redundancy, blast radius, and failure-mode design.
Follow the full SRE career path to put degradation, SLOs, observability, and incident response together.
