Intermediate · Carta · Project 6 of 14 · ~6h · 5 milestones

Make every dependency call fail fast

Continues from the last build: You left rung 5 with multi-window burn-rate SLO alerts that page only on real symptoms, so when checkout hangs you now get told, but nothing in the code stops the hang.

The payment provider had a slow day. Every checkout request to Carta opened a connection to the payments stub, waited, and waited.

Timeout budget design across a call chain
HTTP client timeouts in Python (httpx)
Exponential backoff with jitter
Retry budgets to prevent retry storms
Idempotency keys for safe write retries
Reading latency percentiles in Grafana
Load testing failure modes with k6

What you'll build

You walk away able to design a timeout budget across a call chain, implement bounded retries with exponential backoff and jitter, protect a service from retry storms with a retry budget, and make a write path safe to retry with idempotency keys. You will have load-test evidence (k6) that checkout fails fast and recovers instead of hanging, and you will understand why a missing inner timeout is one of the most common causes of cascading outages.
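A timeout budget across a call chain comes down to one rule: each layer must give up later than the layer it calls. The sketch below checks that rule for the nginx 10s, api 3s, payments 2s cascade this project works toward. `BUDGET` and `inverted_pairs` are illustrative names, not part of the Carta starter files.

```python
# Hypothetical budget table: (layer, timeout in seconds), outermost first.
BUDGET = [("nginx", 10.0), ("api", 3.0), ("payments", 2.0)]


def inverted_pairs(budget):
    """Return (outer, inner) pairs where the outer layer does not wait
    strictly longer than the layer it calls."""
    problems = []
    for (outer, outer_s), (inner, inner_s) in zip(budget, budget[1:]):
        if outer_s <= inner_s:
            problems.append((outer, inner))
    return problems


if __name__ == "__main__":
    # An empty list means every outer timeout exceeds its inner timeout.
    print(inverted_pairs(BUDGET))
```

Running a check like this against the table before touching code is one way to catch an inverted budget on paper rather than under load.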

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

carta/api/main.py (python)
import os
import httpx
from fastapi import FastAPI, Request

app = FastAPI()
PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")

# NAIVE on purpose: no explicit timeout, no retries, no idempotency.
# Only httpx's 5-second default applies, so a slow provider makes this call
# hang far longer than checkout can afford, which holds the worker and
# eventually takes /healthz down.
async def charge(amount_cents: int) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{PAYMENTS_URL}/charge",
            json={"amount_cents": amount_cents},
        )
        resp.raise_for_status()
        return resp.json()

@app.post("/checkout")
async def checkout(req: Request):
    body = await req.json()
    result = await charge(body["amount_cents"])
    return {"status": "ok", "payment": result}

@app.get("/healthz")
async def healthz():
    return {"ok": True}

Reading this file

  • async with httpx.AsyncClient() as client:
    With no timeout argument, httpx applies its default of 5 seconds to each connect, read, write, and pool wait. That is not a budget you chose, and the read timeout restarts with every chunk received, so a provider that trickles its response can hold the call far longer. Milestone 2 replaces this with a shared, budgeted client.
  • result = await charge(body["amount_cents"])
    The handler awaits the charge directly with no deadline of its own, so api latency equals provider latency. This is the line that hangs checkout.
  • PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")
    The stub address. You inject failures by setting LATENCY_MS and FAIL_RATE on this service in prod-sim.yml, not by editing code.
  • async def healthz():
    The liveness probe shares the worker pool with checkout, which is why a checkout with no budget starves healthz. Fixing the budget keeps this green.

This is the inherited handler with the unbounded call. Every milestone changes how charge() behaves. Leave /healthz untouched so you can watch it survive a slow provider.

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Reproduce the hang and write the budget table

    4 guided steps

    A fix you cannot reproduce is a guess. Establishing the cascade nginx 10s > api 3s > payments 2s first means every later change has a target to hit, and inverted budgets (the most common bug in this rung) get caught on paper instead of under load.

  2. Add a budgeted httpx client with connect and read timeouts

    4 guided steps

    A missing inner read timeout is one of the most common causes of cascading outages. Setting only a connect timeout is a classic trap: the connection succeeds instantly, and then the slow response hangs. An explicit read timeout is what actually bounds the wait for the response.

  3. Add bounded retries with exponential backoff and jitter

    4 guided steps

    A single transient failure should not become a customer-visible error if a quick retry would succeed. But retries must be bounded in count and total time, and jittered so that clients that failed together do not retry in lockstep and hit the recovering dependency as a thundering herd.

  4. Cap retries with a token-bucket retry budget

    4 guided steps

    During a sustained failure, every request becomes N requests if retries are unbounded, multiplying load on an already struggling dependency exactly when it can least handle it. A retry budget caps the global blast radius so a brief blip still self-heals but a real outage degrades gracefully.

  5. Make retries safe with an idempotency key on checkout

    4 guided steps

    The dangerous case is a retry that fires after the first attempt actually succeeded but the response was lost (timeout, dropped connection, killed worker). Without idempotency, that retry charges the customer again. The key plus a shared store makes the write safe to retry, which is the precondition that justified milestones 3 and 4.
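The idempotency check in milestone 5 can be sketched without Redis. The in-memory `IdempotencyStore` below is a hypothetical stand-in for the shared store: the first call under a key runs the operation and records the result, and any later call with the same key gets the recorded result back instead of charging again. All names here are illustrative, and a real deployment would need the store shared across workers.

```python
import asyncio


class IdempotencyStore:
    """In-memory stand-in for a shared store (the project uses Redis)."""

    def __init__(self):
        self._results = {}
        self._locks = {}

    async def run_once(self, key, operation):
        # One lock per key so two concurrent retries cannot both charge.
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            if key in self._results:
                return self._results[key]
            # If operation() raises, nothing is stored and a retry runs it again.
            result = await operation()
            self._results[key] = result
            return result


async def demo():
    store = IdempotencyStore()
    charges = []

    async def charge():
        charges.append(1999)
        return {"charge_id": len(charges)}

    first = await store.run_once("order-42", charge)
    retry = await store.run_once("order-42", charge)  # same Idempotency-Key
    return first, retry, len(charges)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the retry returns the same payload as the first attempt and the charge function runs only once, which is the property that makes retrying the write safe.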

What's inside when you start

2 starter files, ready to clone
5 guided milestones
5 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A documented timeout budget table (nginx 10s, api 3s, payments 2s) in the repo README
api/clients.py with a configured httpx client carrying connect and read timeouts
api/retry.py implementing bounded exponential backoff with jitter plus a token-bucket retry budget
Idempotency-Key handling on POST /checkout that dedupes retried payment attempts via Redis
A k6 script load/checkout_timeouts.js that asserts p99 latency stays bounded under LATENCY_MS=8000
A short RESILIENCE.md note explaining the budget cascade and the retry-storm tradeoff
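The backoff, jitter, and retry-budget pieces listed for api/retry.py fit together roughly as in the sketch below. It is a minimal synchronous illustration under assumed names (`RetryBudget`, `backoff_delay`, `call_with_retries`) and assumed defaults (0.1s base, 1s cap, 3 attempts), with `ConnectionError` standing in for whatever counts as a transient failure. It bounds the attempt count and caps each delay but does not track a total deadline.

```python
import random
import time


class RetryBudget:
    """Token bucket: each retry spends one token; tokens refill over time."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_spend(self):
        now = self._clock()
        refilled = self._tokens + (now - self._last) * self.refill_per_sec
        self._tokens = min(self.capacity, refilled)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=1.0, rng=random.random):
    """Full jitter: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, budget, max_attempts=3, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            is_last = attempt == max_attempts - 1
            # Give up when attempts run out or the shared budget is empty.
            if is_last or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))
```

Because the budget is shared across requests, a brief blip still gets its retries, while a sustained outage drains the bucket and later failures surface immediately instead of multiplying load on the dependency.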

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
