Expert · Carta · Project 14 of 14 · ~9h · 4 milestones

Gate launches with a production readiness review

Continues from the last build. Rung 13 left you with a stateful data layer that survives bad days: automated backups, a timed restore drill, streaming replication with a lag SLI, and a failover game-day runbook.

Carta is reliable now. Thirteen rungs of work bought you SLOs, symptom alerts, timeouts and breakers, capacity headroom, runbooks, backups, and a tested failover.

Production readiness review design
SLI and SLO definition
Symptom-based alerting
Timeout and circuit breaker review
Capacity and load testing with k6
Runbook and on-call readiness
Error-budget policy and review cadence
Launch go/no-go decision making

What you'll build

You walk away with a reusable PRR checklist grounded in the whole ladder, a completed review of a real (if seeded) service, fixes for its worst gaps, a signed go/no-go record, and a recurring error-budget review agenda. You can now gate any new service launch on reliability evidence rather than optimism, and keep it honest month over month.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

services/gift-cards/app.py (python)
import os
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")

# GAP: no SLI/SLO defined for this service anywhere.
# GAP: no request metrics exported, so symptom alerts are impossible.


@app.get("/healthz")
def healthz():
    # GAP: liveness only, does not check the payments dependency.
    return {"status": "ok"}


@app.post("/giftcards/redeem")
async def redeem(code: str, amount: int):
    # GAP: no timeout on the payments call, a slow provider hangs the request.
    # GAP: no circuit breaker, every request retries into a known-bad dependency.
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{PAYMENTS_URL}/charge",
            json={"amount": amount, "ref": code},
        )
    if resp.status_code != 200:
        # GAP: no structured log line, so Loki cannot correlate failures.
        raise HTTPException(status_code=502, detail="payment failed")
    return {"redeemed": code, "amount": amount}

Reading this file

  • PAYMENTS_URL = os.environ.get("PAYMENTS_URL", "http://payments-stub:9000")
    Calls the inherited flaky payment simulator, the same dependency you tamed in earlier rungs.
  • async with httpx.AsyncClient() as client:
    No explicit timeout is passed, so the call falls back on httpx's library default (5 seconds) instead of a deadline chosen for this dependency. A slow provider stalls each request for that long, and nothing stops the next request from doing the same.
  • raise HTTPException(status_code=502, detail="payment failed")
    Returns an error but emits no structured log and no metric, so neither Loki nor Prometheus can see the failure.
  • return {"status": "ok"}
    The health check is liveness only and never checks the payments dependency, so it stays green during a downstream outage.
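As an illustration of the structured-log gap, here is a minimal sketch of a JSON log line for the payment failure path. It uses only the standard library. The field names (`event`, `service`, `ref`, `amount`, `upstream_status`) and the helper name `payment_failure_event` are assumptions made for this example, not the project's reference solution.

```python
import json
import logging

logger = logging.getLogger("gift-cards")


def payment_failure_event(code: str, amount: int, upstream_status: int) -> str:
    """Build one JSON log line describing a failed payments call.

    Field names are hypothetical; pick whatever your log pipeline expects.
    """
    return json.dumps(
        {
            "event": "payment_failed",
            "service": "gift-cards",
            "ref": code,  # a real service may want to mask the gift-card code
            "amount": amount,
            "upstream_status": upstream_status,
        },
        sort_keys=True,  # stable key order makes lines easier to diff and grep
    )


def log_payment_failure(code: str, amount: int, upstream_status: int) -> None:
    # One line per failure, machine-parseable, emitted before raising the 502.
    logger.error(payment_failure_event(code, amount, upstream_status))
```

In `redeem`, a call such as `log_payment_failure(code, amount, resp.status_code)` just before the `raise` would give a log aggregator a field to filter and correlate on.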

The launch candidate. Read the GAP comments as hints, but do not trust them to be complete: part of the exercise is finding gaps the comments do not name (capacity untested, no runbook, no backup story). The missing payments timeout is the most dangerous line.
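To make the timeout-and-breaker gap concrete, here is a minimal circuit breaker sketch using only the standard library. It is one possible shape, not the project's reference solution: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen so the behaviour can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Opens after N consecutive failures, allows a trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value, tune per dependency
        reset_after_s: float = 10.0,     # assumed value, tune per dependency
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls flow normally
        # Open: only allow a trial call once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial

    def call(self, fn: Callable[[], T]) -> T:
        if not self.allow():
            raise CircuitOpenError("payments breaker is open")
        try:
            result = fn()
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

The breaker only helps if the wrapped call can fail fast, so the payments request itself would also carry an explicit deadline, for example `httpx.AsyncClient(timeout=httpx.Timeout(2.0, connect=0.5))`, where the specific numbers are placeholders to be derived from the service's latency SLO. In the async handler, `allow()`, `record_success()`, and `record_failure()` can be called around the `await` directly instead of going through `call()`.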

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Write the PRR checklist from the whole ladder

    4 guided steps

    A PRR is only fair if the standard is written down before the review starts. Grounding each item in a rung you completed keeps the checklist honest and teaches the next engineer why the line exists, not just that it does.

  2. Review gift-cards and find the 8 seeded gaps

    4 guided steps

    Finding gaps systematically against a written checklist is the core SRE review skill. The seeded service is deliberately plausible, which is exactly how reliability debt slips through a tired reviewer.

  3. Fix the most dangerous gap: add a payments timeout and breaker

    4 guided steps

    The missing timeout is the gap most likely to cause a real outage: one slow dependency hangs every request and exhausts the worker pool. Demonstrating the failure with injected latency and errors (LATENCY_MS and FAIL_RATE), then fixing it, is the difference between a checklist item and earned confidence.

  4. Run the go/no-go and stand up the error-budget review

    4 guided steps

    A review that does not end in a documented decision is just discussion. And a one-time gate decays unless an error-budget cadence keeps reliability a standing conversation, which is what turns this from a launch checkbox into a practice.
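One way to keep the go/no-go decision mechanical rather than a matter of mood is to derive it from the gaps table. The sketch below is an assumption about how that could look: the `Gap` fields mirror the file/line, severity, and blocker columns the project describes, while the severity labels and the three verdict strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gap:
    location: str        # file (and line, if known) where the gap was found
    severity: str        # e.g. "high", "medium", "low" (labels are an assumption)
    blocker: bool        # does this gap block the launch?
    fixed: bool = False  # flipped once the fix is merged and verified


def verdict(gaps: List[Gap]) -> str:
    """Derive a launch verdict from the review's gaps table."""
    open_gaps = [g for g in gaps if not g.fixed]
    if any(g.blocker for g in open_gaps):
        return "no-go"                 # at least one open blocker remains
    if open_gaps:
        return "go with conditions"    # open non-blockers become owned, dated conditions
    return "go"


# Illustrative rows only; the real table comes out of the milestone 2 review.
gaps = [
    Gap("services/gift-cards/app.py", "high", blocker=True),
    Gap("services/gift-cards/app.py", "medium", blocker=False),
]
print(verdict(gaps))
```

Each open non-blocker in a "go with conditions" outcome would then be written into the decision record with an owner and a deadline.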

What's inside when you start

3 starter files, ready to clone
4 guided milestones
4 full reference solutions
7 code blocks explained line-by-line
4 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A reusable prr/CHECKLIST.md with five sections, testable yes/no items each traced to a ladder rung, and a verdict field
A completed gaps table listing all 8 seeded gaps with file/line, severity, and blocker decision
A fixed services/gift-cards/app.py with an explicit payments timeout and a circuit breaker, proven against injected LATENCY_MS and FAIL_RATE
A symptom-based gift-cards alert rule replacing the seeded CPU alert
A signed prr/go-no-go.md with verdict, time-boxed owned conditions, and a record of fixes
A monthly error-budget review agenda with an explicit launch-freeze rule
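The error-budget arithmetic behind a launch-freeze rule fits in a few lines. This is a sketch under stated assumptions: an availability-style SLO measured as good requests over total requests, an illustrative target of 99.5%, and a freeze that triggers once the budget is fully spent. The function names and the threshold are hypothetical, not taken from the project.

```python
def budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the window's error budget still unspent.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if total <= 0:
        raise ValueError("no requests in the window")
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return 1.0 - failed / allowed_failures


def launches_frozen(remaining: float, freeze_below: float = 0.0) -> bool:
    """Freeze rule: stop launches once remaining budget is at or below the threshold."""
    return remaining <= freeze_below


# Example month: 1,000,000 requests against an assumed 99.5% SLO allows 5,000 failures.
healthy = budget_remaining(0.995, 1_000_000, 2_000)   # about 60% of budget left
overspent = budget_remaining(0.995, 1_000_000, 6_000)  # negative: budget exceeded
print(launches_frozen(healthy), launches_frozen(overspent))
```

A monthly review can start from these two numbers: how much budget is left, and whether the freeze rule has tripped.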
