Intermediate · Carta · Project 7 of 14 · ~7h · 6 milestones

Contain the blast radius

Continues from the last build: From rung 6 you have timeout budgets, exponential backoff with jitter, retry budgets, and idempotency keys so retries on checkout are safe.

Rung 6 made every dependency call fail fast and retry safely, so a single slow payment no longer froze a whole request.

Circuit breaker state machine (closed/open/half-open)
Graceful degradation and fallback design
Bulkhead isolation with bounded connection pools
Cache-backed read fallback (Redis)
Async order queueing for deferred work
Reading breaker and pool metrics in Prometheus/Grafana
Reasoning about blast radius and failure containment

What you'll build

You walk away with an API that stays up when a dependency dies: circuit breakers that trip and self-heal, a catalog fallback that serves stale-but-useful data, a payments fallback that queues orders instead of dropping them, and isolated connection pools proven to keep one dependency's flood from starving another. You can read a breaker's state in Grafana and explain blast-radius containment to a senior engineer.
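
The two fallbacks described above can be sketched in a few lines. This is a minimal illustration, not the project's reference solution: a plain dict stands in for Redis, an asyncio.Queue stands in for the order queue, and the names DependencyDown, get_products, checkout, and the response shapes are all hypothetical.

```python
import asyncio


class DependencyDown(Exception):
    """Hypothetical stand-in for a timeout, a 5xx, or an open breaker."""


async def get_products(fetch_live, cache):
    """Serve live catalog data, or the last cached copy flagged as stale."""
    try:
        products = await fetch_live()
    except DependencyDown:
        cached = cache.get("products")
        if cached is None:
            raise  # nothing cached yet, so the outage has to surface
        return {"products": cached, "stale": True}
    cache["products"] = products  # refresh the fallback copy on every success
    return {"products": products, "stale": False}


async def checkout(order, charge, queue):
    """Charge now if payments is up, otherwise queue the order for later capture."""
    try:
        receipt = await charge(order)
    except DependencyDown:
        await queue.put(order)
        return {"status": "queued", "order_id": order["id"]}
    return {"status": "paid", "order_id": order["id"], "receipt": receipt}


async def demo():
    cache, queue = {}, asyncio.Queue()

    async def catalog_up():
        return [{"sku": "A1", "price": 10}]

    async def catalog_down():
        raise DependencyDown("catalog")

    async def payments_down(order):
        raise DependencyDown("payments")

    fresh = await get_products(catalog_up, cache)      # warms the cache
    stale = await get_products(catalog_down, cache)    # served from the cache
    queued = await checkout({"id": "o-1"}, payments_down, queue)
    return fresh["stale"], stale["stale"], queued["status"], queue.qsize()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The stale flag is what makes the degradation visible: the caller can still render products while warning that prices may be out of date.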

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

services/api/app/breaker.py (Python)
import time

class BreakerOpen(Exception):
    """Raised when the breaker is open and the call is short-circuited."""

class CircuitBreaker:
    def __init__(self, name, fail_max=5, reset_timeout=10.0):
        self.name = name
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def _allow(self):
        # TODO milestone 1: if open and reset_timeout has passed, go half-open.
        # Return True if a call may proceed, False to short-circuit.
        return True

    def record_success(self):
        # TODO milestone 1: reset failures and close the breaker.
        pass

    def record_failure(self):
        # TODO milestone 1: count the failure and open if fail_max reached.
        pass

    async def call(self, coro_fn):
        if not self._allow():
            raise BreakerOpen(self.name)
        try:
            result = await coro_fn()
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result

Reading this file

  • class BreakerOpen(Exception):
    A distinct exception so callers can catch 'breaker open' separately from a real dependency error and choose a fallback.
  • self.state = "closed"
    The breaker starts closed (normal). Milestone 1 adds transitions to open and half-open.
  • fail_max=5, reset_timeout=10.0
    Trip after 5 failures; wait 10 seconds before probing. These are the two knobs you tune in the stretch work.
  • async def call(self, coro_fn):
    The wrapper: short-circuit when open, otherwise run the call and record success or failure. coro_fn is a zero-arg async function.

A minimal circuit breaker scaffold. You complete the state transitions in milestone 1. It is dependency-agnostic so you instantiate one per dependency.
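
One possible way to fill in the three TODOs is sketched below. It is an illustration under stated assumptions, not the project's reference solution: failures are counted consecutively, a failure during half-open reopens the breaker immediately, and the clock parameter is a hypothetical addition so the demo can advance time without sleeping. It does not limit how many concurrent probes pass while half-open.

```python
import asyncio
import time


class BreakerOpen(Exception):
    """Raised when the breaker is open and the call is short-circuited."""


class CircuitBreaker:
    def __init__(self, name, fail_max=5, reset_timeout=10.0, clock=time.monotonic):
        self.name = name
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock  # assumption: injectable clock, not in the scaffold
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def _allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let a probe call through
                return True
            return False  # still cooling down: short-circuit
        return True  # closed or half-open

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens at once; otherwise trip at fail_max.
        if self.state == "half-open" or self.failures >= self.fail_max:
            self.state = "open"
            self.opened_at = self.clock()

    async def call(self, coro_fn):
        if not self._allow():
            raise BreakerOpen(self.name)
        try:
            result = await coro_fn()
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result


async def demo():
    now = [0.0]  # fake clock the demo advances by hand
    breaker = CircuitBreaker("catalog", fail_max=2, clock=lambda: now[0])

    async def down():
        raise ConnectionError("catalog unreachable")

    async def up():
        return "ok"

    states = []
    for _ in range(2):
        try:
            await breaker.call(down)
        except ConnectionError:
            pass
    states.append(breaker.state)
    try:
        await breaker.call(up)
    except BreakerOpen:
        states.append("short-circuited")
    now[0] = 10.0  # reset_timeout has elapsed
    result = await breaker.call(up)
    states.append(breaker.state)
    return states, result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the breaker trips after two failures, rejects the next call without touching the dependency, and closes again once a probe succeeds after the timeout.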

That's 1 of 9 explained code blocks in this single project.

The build, milestone by milestone

  1. Complete the circuit breaker state machine

    4 guided steps

    When catalog is fully down, retries alone keep hammering it and drain your retry budget while in-flight requests pile up. An open breaker fails instantly, which frees resources and gives the dead dependency room to recover. This is the core mechanism the rest of the rung builds on.

  2. Give each dependency its own bounded pool (bulkheads)

    4 guided steps

    With one shared pool, a slow or flooded inventory holds every connection until timeout, and healthy payments calls block waiting. Bounded per-dependency pools cap the damage: inventory can saturate its own 5 slots and get rejected fast, while payments is untouched. This is the bulkhead.

  3. Degrade catalog to cached products with a stale flag

    4 guided steps

    Catalog data is shared and slow-changing, so a few minutes stale is far better than a blank store. This converts a hard catalog outage into a soft, visible degradation that keeps revenue flowing.

  4. Queue the order when payments is down

    4 guided steps

    Dropping a paying customer's order because a third-party provider blipped is lost revenue. Queueing accepts the order now and captures payment later, turning a hard failure into a deferred success. This is fail-open, applied to a case where it is the right business call.

  5. Expose breaker and pool metrics and add a Grafana panel

    4 guided steps

    A breaker that trips silently is an outage you find out about from customers. Exposing state lets you alert on 'breaker open' and watch recovery, and the pool metric proves your bulkhead is working during a flood.

  6. Write the degradation runbook and prove it end to end

    4 guided steps

    Containment you cannot describe is containment you cannot trust under pressure. A failure-to-behavior table turns your design into an operational contract and a test plan in one.
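
A failure-to-behavior table like the one milestone 6 describes can be kept as data, so the same rows serve as the runbook and as a checklist for tests. The sketch below is illustrative only: the RunbookRow type, the render helper, and the exact wording of each row are assumptions drawn from the milestone descriptions above, not the project's actual runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookRow:
    failure: str   # what you inject
    expected: str  # what the API should do instead of failing hard
    signal: str    # what to watch while it happens


# Assumption: rows paraphrase the milestones; adjust to your own design.
RUNBOOK = [
    RunbookRow("catalog down",
               "serve cached products with stale flag",
               "catalog breaker open"),
    RunbookRow("payments down",
               "queue order for deferred capture",
               "payments breaker open"),
    RunbookRow("inventory flooded",
               "inventory rejected fast, payments unaffected",
               "inventory pool saturated"),
]


def render(rows):
    """Render the rows as an aligned plain-text table."""
    headers = ("Injected failure", "Expected behavior", "Signal to watch")
    table = [headers] + [(r.failure, r.expected, r.signal) for r in rows]
    widths = [max(len(row[i]) for row in table) for i in range(3)]
    lines = [
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    ]
    lines.insert(1, "-+-".join("-" * w for w in widths))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(RUNBOOK))
```

Each row names something you can inject, something you can assert, and something you can see on a dashboard, which is what lets the table double as a test plan.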

What's inside when you start

3 starter files, ready to clone
6 guided milestones
6 full reference solutions
9 code blocks explained line-by-line
6 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A reusable circuit breaker (closed/open/half-open) wired around each dependency call in the api service
Per-dependency bounded connection pools (bulkheads) so one dependency cannot starve another
Catalog fallback that serves cached products from Redis with a 'prices may be stale' flag
Payments fallback that queues the order for deferred capture instead of failing checkout
Prometheus metrics for breaker state and pool saturation, plus a Grafana panel showing them
A short runbook entry mapping each injected failure to the expected degraded behavior
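
The bulkhead item in the list above can be sketched as a per-dependency slot counter that rejects immediately when full. This is a simplified model under stated assumptions: it caps concurrent calls instead of managing real connections, and the names Bulkhead, BulkheadFull, and max_concurrent are hypothetical. The 5-slot limit mirrors the inventory example in milestone 2.

```python
import asyncio


class BulkheadFull(Exception):
    """Raised when a dependency's pool has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent=5):
        self.name = name
        self.max_concurrent = max_concurrent
        self.in_use = 0  # also a natural value to export as a saturation metric

    async def call(self, coro_fn):
        # No await between the check and the increment, so on a single
        # event loop this pair cannot be interleaved with another task.
        if self.in_use >= self.max_concurrent:
            raise BulkheadFull(self.name)  # reject fast instead of waiting
        self.in_use += 1
        try:
            return await coro_fn()
        finally:
            self.in_use -= 1


async def demo():
    inventory = Bulkhead("inventory", max_concurrent=5)
    payments = Bulkhead("payments", max_concurrent=5)

    async def slow_inventory():
        await asyncio.sleep(0.05)  # simulated slow dependency
        return "stock"

    async def pay():
        return "paid"

    # Flood inventory with 8 calls while one payments call runs alongside.
    calls = [inventory.call(slow_inventory) for _ in range(8)]
    calls.append(payments.call(pay))
    results = await asyncio.gather(*calls, return_exceptions=True)

    served = sum(r == "stock" for r in results)
    rejected = sum(isinstance(r, BulkheadFull) for r in results)
    return served, rejected, results[-1]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Five inventory calls take the five slots, the remaining three are rejected at once, and the payments call completes because it draws from its own pool.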

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
