Make every dependency call fail fast
Continues from the last build: You left rung 5 with multi-window burn-rate SLO alerts that page only on real symptoms, so when checkout hangs you now get told, but nothing in the code stops the hang.
The payment provider had a slow day. Every checkout request to Carta opened a connection to the payments stub, waited, and waited.
What you'll build
You walk away able to design a timeout budget across a call chain, implement bounded retries with exponential backoff and jitter, protect a service from retry storms with a retry budget, and make a write path safe to retry with idempotency keys. You will have load-test evidence (k6) that checkout fails fast and recovers instead of hanging, and you will understand why a missing inner timeout is one of the most common causes of cascading outages.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
import os

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")


# NAIVE on purpose: no timeout, no retries, no idempotency.
# A slow provider makes this call hang for the full provider delay,
# which holds the worker and eventually takes /healthz down.
async def charge(amount_cents: int) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{PAYMENTS_URL}/charge",
            json={"amount_cents": amount_cents},
        )
        resp.raise_for_status()
        return resp.json()


@app.post("/checkout")
async def checkout(req: Request):
    body = await req.json()
    result = await charge(body["amount_cents"])
    return {"status": "ok", "payment": result}


@app.get("/healthz")
async def healthz():
    return {"ok": True}

Reading this file
async with httpx.AsyncClient() as client:
This client sets no timeout policy of its own. It falls back to httpx's library default (5 seconds of network inactivity), which is an inactivity limit rather than a total deadline, so a slow or trickling provider can hold the call far longer than checkout can afford. Milestone 2 replaces this with a shared, budgeted client.

result = await charge(body["amount_cents"])
The handler awaits the charge directly with no upper bound of its own, so api latency equals provider latency. This is the line that hangs checkout.

PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")
The stub address. You inject failures by setting LATENCY_MS and FAIL_RATE on this service in prod-sim.yml, not by editing code.

async def healthz():
The liveness probe shares the worker pool with checkout, which is why an unbounded checkout starves healthz. Fixing the budget keeps this green.
This is the inherited handler with the unbounded call. Every milestone changes how charge() behaves. Leave /healthz untouched so you can watch it survive a slow provider.
That's 1 of 7 explained code blocks in this single project.
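To make the "no upper bound" problem in the handler concrete, here is a minimal sketch of putting a total deadline around an awaited call with the standard library's asyncio.wait_for. It is not the project's solution: `bounded`, `slow_provider`, and `API_BUDGET_S` are hypothetical names, the 3-second value is taken from the api budget in milestone 1, and the provider is simulated with a sleep rather than a real HTTP call.

```python
import asyncio

API_BUDGET_S = 3.0  # assumption: the api hop budget from the milestone 1 cascade


async def bounded(awaitable, budget_s):
    """Await with a hard upper bound; returns (ok, value_or_None)."""
    try:
        return True, await asyncio.wait_for(awaitable, timeout=budget_s)
    except asyncio.TimeoutError:
        # Fail fast: the caller gets an answer at budget_s, not at provider speed.
        return False, None


async def slow_provider(delay_s):
    # Hypothetical stand-in for charge(); sleeps instead of calling the stub.
    await asyncio.sleep(delay_s)
    return {"charged": True}


if __name__ == "__main__":
    print(asyncio.run(bounded(slow_provider(0.01), API_BUDGET_S)))
    print(asyncio.run(bounded(slow_provider(1.0), 0.05)))
```

With a bound like this, handler latency is capped by the budget instead of tracking whatever the provider happens to do.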
The build, milestone by milestone
Milestone 1: Reproduce the hang and write the budget table (4 guided steps)
A fix you cannot reproduce is a guess. Establishing the cascade (nginx 10s > api 3s > payments 2s) first means every later change has a target to hit, and inverted budgets, the most common bug in this rung, get caught on paper instead of under load.
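The budget table can be checked mechanically as well as on paper. The sketch below assumes only the three hops and values named in the cascade; `BUDGETS_S` and `find_inversions` are hypothetical names, not part of the project.

```python
# Timeout budget per hop in seconds, outermost first, from the cascade
# nginx 10s > api 3s > payments 2s.
BUDGETS_S = [("nginx", 10.0), ("api", 3.0), ("payments", 2.0)]


def find_inversions(budgets):
    """Return (outer, inner) pairs where the outer hop does not outlast the inner."""
    inversions = []
    for (outer, outer_s), (inner, inner_s) in zip(budgets, budgets[1:]):
        # An outer hop that gives up first leaves the inner call running unobserved.
        if outer_s <= inner_s:
            inversions.append((outer, inner))
    return inversions


if __name__ == "__main__":
    print(find_inversions(BUDGETS_S))
```

An inverted table such as api 2s wrapping payments 3s would be reported as the pair ("api", "payments").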
Milestone 2: Add a budgeted httpx client with connect and read timeouts (4 guided steps)
A missing inner read timeout is one of the most common causes of cascading outages. Setting only a connect timeout is a classic trap: the connection succeeds instantly, and then the call waits on a response that never arrives. An explicit read timeout is what actually bounds the call.
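The connect-versus-read trap can be reproduced without the payments stub. This is a standard-library sketch, not the milestone's httpx solution: a local server accepts the connection and then stays silent, so the connect phase passes and only the read bound ends the call. The names `silent_provider`, `call_with_timeouts`, and both timeout values are assumptions for the demo. In httpx, the corresponding settings are the `connect` and `read` fields of `httpx.Timeout`.

```python
import asyncio

CONNECT_TIMEOUT_S = 0.5  # hypothetical demo values
READ_TIMEOUT_S = 0.2


async def silent_provider(reader, writer):
    # Accepts the connection but never replies; returns once the client hangs up.
    await reader.read()
    writer.close()


async def call_with_timeouts(host, port):
    # Connect phase: bounded separately, and it succeeds almost instantly here.
    reader, writer = await asyncio.wait_for(
        asyncio.open_connection(host, port), CONNECT_TIMEOUT_S
    )
    try:
        writer.write(b"charge\n")
        await writer.drain()
        # Read phase: without this bound the await would never return.
        await asyncio.wait_for(reader.readline(), READ_TIMEOUT_S)
        return "response"
    except asyncio.TimeoutError:
        return "read timeout"
    finally:
        writer.close()
        await writer.wait_closed()


async def demo():
    server = await asyncio.start_server(silent_provider, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    try:
        return await call_with_timeouts("127.0.0.1", port)
    finally:
        server.close()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```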
Milestone 3: Add bounded retries with exponential backoff and jitter (4 guided steps)
A single transient failure should not become a customer-visible error if a quick retry would succeed. But retries must be bounded in both count and total time, and jittered so that many clients that failed together do not retry in lockstep and hit the recovering dependency as a thundering herd.
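One way to get all three properties (bounded count, bounded total time, jitter) is sketched below. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. The `TransientError` class, the function names, and every default value are assumptions for illustration, not the project's reference solution.

```python
import asyncio
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_delay(attempt, base_s=0.1, cap_s=1.0):
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


async def call_with_retries(op, max_attempts=3, total_budget_s=2.0,
                            sleep=asyncio.sleep):
    deadline = time.monotonic() + total_budget_s
    for attempt in range(max_attempts):
        try:
            return await op()
        except TransientError:
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            out_of_time = time.monotonic() + delay >= deadline
            if out_of_attempts or out_of_time:
                raise  # bounded: surface the failure instead of retrying on
            await sleep(delay)
```

The `sleep` parameter exists only so a test can swap in a no-op; production code would leave the default.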
Milestone 4: Cap retries with a token-bucket retry budget (4 guided steps)
During a sustained failure, every request becomes N requests if retries are unbounded, multiplying load on an already struggling dependency exactly when it can least handle it. A retry budget caps the global blast radius so a brief blip still self-heals but a real outage degrades gracefully.
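A token bucket is one simple way to express that cap: each retry spends a token, tokens refill at a fixed rate, and an empty bucket means "fail now instead of retrying". The sketch below is a minimal in-process version. The `RetryBudget` name, the capacity, and the refill rate are assumptions, and the injectable `clock` is there only to make the behavior testable.

```python
import time


class RetryBudget:
    """Token bucket shared by every request in the process (hypothetical helper)."""

    def __init__(self, capacity=10.0, refill_per_s=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_spend(self):
        """Take one token if available; False means do not retry."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        refilled = self.tokens + (now - self.last) * self.refill_per_s
        self.tokens = min(self.capacity, refilled)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A retry loop would call `try_spend()` before each retry (not before the first attempt), so a brief blip finds tokens available while a sustained outage drains the bucket and retries stop multiplying load. Because `try_spend` never awaits, it needs no lock inside a single asyncio event loop; a multi-threaded or multi-process deployment would need shared state.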
Milestone 5: Make retries safe with an idempotency key on checkout (4 guided steps)
The dangerous case is a retry that fires after the first attempt actually succeeded but the response was lost (timeout, dropped connection, killed worker). Without idempotency, that retry charges the customer again. The key plus a shared store makes the write safe to retry, which is the precondition that justifies milestones 3 and 4.
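The lost-response case can be sketched with an in-memory stand-in for the shared store. Everything here is an assumption for illustration: `InMemoryIdempotencyStore`, `run_once`, and the fake `charge_card` are hypothetical, and a real deployment would need a store shared across workers rather than a dict in one process.

```python
import asyncio
import uuid


class InMemoryIdempotencyStore:
    """Stand-in for a shared store; remembers the first result per key."""

    def __init__(self):
        self._results = {}
        self._locks = {}

    async def run_once(self, key, operation):
        # One lock per key so concurrent duplicates wait instead of double-charging.
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            if key in self._results:
                return self._results[key]  # replay the stored response
            result = await operation()
            self._results[key] = result
            return result


async def demo():
    store = InMemoryIdempotencyStore()
    charges = []

    async def charge_card():
        charges.append(1999)  # the side effect that must happen exactly once
        return {"charge_id": len(charges)}

    key = str(uuid.uuid4())  # generated once per checkout by the client
    first = await store.run_once(key, charge_card)
    retry = await store.run_once(key, charge_card)  # response lost, client retries
    return first, retry, len(charges)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The retry receives the same response as the first attempt, and the charge runs only once.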
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.