Contain the blast radius
Continues from the last build: From rung 6 you have timeout budgets, exponential backoff with jitter, retry budgets, and idempotency keys so retries on checkout are safe.
Rung 6 made every dependency call fail fast and retry safely, so a single slow payment no longer froze a whole request.
What you'll build
You walk away with an API that stays up when a dependency dies: circuit breakers that trip and self-heal, a catalog fallback that serves stale-but-useful data, a payments fallback that queues orders instead of dropping them, and isolated connection pools proven to keep one dependency's flood from starving another. You can read a breaker's state in Grafana and explain blast-radius containment to a senior.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
import time
class BreakerOpen(Exception):
"""Raised when the breaker is open and the call is short-circuited."""
class CircuitBreaker:
def __init__(self, name, fail_max=5, reset_timeout=10.0):
self.name = name
self.fail_max = fail_max
self.reset_timeout = reset_timeout
self.state = "closed"
self.failures = 0
self.opened_at = 0.0
def _allow(self):
# TODO milestone 1: if open and reset_timeout has passed, go half-open.
# Return True if a call may proceed, False to short-circuit.
return True
def record_success(self):
# TODO milestone 1: reset failures and close the breaker.
pass
def record_failure(self):
# TODO milestone 1: count the failure and open if fail_max reached.
pass
async def call(self, coro_fn):
if not self._allow():
raise BreakerOpen(self.name)
try:
result = await coro_fn()
except Exception:
self.record_failure()
raise
self.record_success()
return resultReading this file
class BreakerOpen(Exception):A distinct exception so callers can catch 'breaker open' separately from a real dependency error and choose a fallback.self.state = "closed"The breaker starts closed (normal). Milestone 1 adds transitions to open and half-open.fail_max=5, reset_timeout=10.0Trip after 5 failures; wait 10 seconds before probing. These are the two knobs you tune in the stretch work.async def call(self, coro_fn):The wrapper: short-circuit when open, otherwise run the call and record success or failure. coro_fn is a zero-arg async function.
A minimal circuit breaker scaffold. You complete the state transitions in milestone 1. It is dependency-agnostic so you instantiate one per dependency.
That's 1 of 9 explained code blocks in this single project.
The build, milestone by milestone
- 1
Complete the circuit breaker state machine
4 guided stepsWhen catalog is fully down, retries alone just keep hammering it and draining your retry budget while threads pile up. An open breaker fails instantly, which frees resources and gives the dead dependency room to recover. This is the core mechanism the rest of the rung builds on.
- 2
Give each dependency its own bounded pool (bulkheads)
4 guided stepsWith one shared pool, a slow or flooded inventory holds every connection until timeout and healthy payments calls block waiting. Bounded per-dependency pools cap the damage: inventory can saturate its own 5 slots and get rejected fast, while payments is untouched. This is the bulkhead.
- 3
Degrade catalog to cached products with a stale flag
4 guided stepsCatalog data is shared and slow-changing, so a few minutes stale is far better than a blank store. This converts a hard catalog outage into a soft, visible degradation that keeps revenue flowing.
- 4
Queue the order when payments is down
4 guided stepsDropping a paying customer's order because a third-party provider blipped is lost revenue. Queueing accepts the order now and captures payment later, turning a hard failure into a deferred success. This is fail-open for a case where it is the right business call.
- 5
Expose breaker and pool metrics and add a Grafana panel
4 guided stepsA breaker that trips silently is an outage you find out about from customers. Exposing state lets you alert on 'breaker open' and watch recovery, and the pool metric proves your bulkhead is working during a flood.
- 6
Write the degradation runbook and prove it end to end
4 guided stepsContainment you cannot describe is containment you cannot trust under pressure. A failure-to-behavior table turns your design into an operational contract and a test plan in one.
What's inside when you start
You'll walk away with
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
Start building