Large · Portfolio centerpiece · ~28h · 6 milestones

Stand up a reliability program across services

Today, reliability depends on one person's heroics rather than on a system. You build the program: SLOs across multiple services, capacity you have actually tested, a chaos experiment, and on-call that does not burn people out.

Multi-service SLOs · Burn-rate alerting · Load & capacity testing · Chaos engineering · On-call & runbooks · Postmortem culture · Infra & observability cost modeling

What you'll build

A multi-service reliability program with consistent SLOs and burn-rate alerting, load-tested capacity headroom, a chaos experiment with findings, and on-call runbooks plus a postmortem practice.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

slo-catalog.yml (YAML)
services:
  - name: checkout
    job: checkout
    availability_slo: 0.999
    owner: payments-team
  - name: search
    job: search
    availability_slo: 0.995
    owner: discovery-team

Reading this file

  • - name: checkout
    One catalog entry per service, so adding a service later means one new entry, not a new rules file.
  • job: checkout
    Matches the Prometheus job label so that generated rules select the right service's metrics.
  • availability_slo: 0.999
    The per-service target. Criticality varies, so checkout gets a stricter SLO than search.
  • owner: payments-team
    Names who is accountable. Reliability is a program only when every service has a clear owner.

One source of truth for every service's target: vary the number, standardize the method.
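To make "generated rules" concrete, here is a minimal sketch of how burn-rate alert expressions could be derived from the catalog. It is an illustration under stated assumptions, not the project's reference solution: the 30-day SLO window, the metric name http_requests_total, its code label, and the alert naming scheme are all assumptions introduced here. The catalog is inlined as Python data to keep the example dependency-free.

```python
# Sketch: derive burn-rate alert expressions from the SLO catalog.
# Assumptions (not from the project): a 30-day SLO window, the metric name
# http_requests_total, and a "code" label carrying the HTTP status.

CATALOG = [
    {"name": "checkout", "job": "checkout", "availability_slo": 0.999, "owner": "payments-team"},
    {"name": "search", "job": "search", "availability_slo": 0.995, "owner": "discovery-team"},
]

SLO_WINDOW_HOURS = 30 * 24  # assumed 30-day rolling window


def burn_rate(budget_fraction: float, alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the error budget within the alert window."""
    return budget_fraction * SLO_WINDOW_HOURS / alert_window_hours


def error_ratio_expr(job: str, window: str) -> str:
    """PromQL for the fraction of requests that returned a 5xx over `window`."""
    errors = f'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
    return f"{errors} / {total}"


def alert_rule(service: dict, budget_fraction: float, window: str, window_hours: float) -> dict:
    """Build one alert: fire when the error ratio exceeds burn rate x error budget."""
    error_budget = 1 - service["availability_slo"]
    threshold = burn_rate(budget_fraction, window_hours) * error_budget
    return {
        "alert": f"{service['name'].capitalize()}ErrorBudgetBurn",  # hypothetical naming scheme
        "expr": f"{error_ratio_expr(service['job'], window)} > {threshold:.6f}",
        "labels": {"owner": service["owner"]},
    }


# Same method for every service; only the SLO number changes the threshold.
rules = [alert_rule(s, budget_fraction=0.02, window="1h", window_hours=1) for s in CATALOG]
for rule in rules:
    print(rule["alert"], "->", rule["expr"])
```

With a 30-day window, spending 2% of the error budget in one hour corresponds to a burn rate of 0.02 × 720 / 1 = 14.4, so checkout and search share the same burn rate but get different error-ratio thresholds because their error budgets differ.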

That's 1 of 9 explained code blocks in this single project.

The build, milestone by milestone

  1. Standardize SLOs

    5 guided steps

    When every service measures reliability differently, you cannot compare, prioritize, or build shared tooling. A consistent SLO catalog makes reliability a program, not a collection of one-offs.

  2. Test capacity

    5 guided steps

    Capacity assumptions are guesses until you test them. Knowing where a service breaks tells you how much headroom you have and when autoscaling must kick in before users feel it.

  3. Model the cost of reliability

    5 guided steps

    Reliability is bought, not free: every nine, every replica of headroom, and every retained time series has a price. A program that cannot quantify its cost cannot defend it or trade it off against the error budget.

  4. Break it on purpose

    5 guided steps

    Systems fail in ways you did not design for. Chaos engineering surfaces those failure modes deliberately and safely, so you fix them on your schedule instead of at 3am.

  5. Make on-call humane

    5 guided steps

    On-call that pages constantly for non-actionable alerts burns people out and trains them to ignore the pager, which is how real incidents get missed.

  6. Close the loop

    5 guided steps

    Reliability improves only when incidents change the system. A postmortem that produces tracked, prioritized work is the engine that turns failures into durable fixes.
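The capacity and cost milestones (2 and 3) largely come down to arithmetic that is easy to sketch. The example below shows one way a load-test result might be turned into a replica count, a scale-out trigger, and a monthly cost. Every number, field name, and price is a hypothetical placeholder chosen for illustration, and the linear per-replica capacity model is a simplifying assumption.

```python
# Sketch: turn a load-test result into headroom, a scaling trigger, and a monthly cost.
# All numbers and names are hypothetical placeholders, not measurements from the project.
import math
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    service: str
    peak_rps: float                  # highest production traffic observed
    breaking_rps_per_replica: float  # load at which one replica starts violating its SLO


def replicas_needed(result: LoadTestResult, headroom: float) -> int:
    """Replicas required to serve peak traffic with `headroom` spare (0.3 = 30%)."""
    target_rps = result.peak_rps * (1 + headroom)
    return math.ceil(target_rps / result.breaking_rps_per_replica)


def scale_out_trigger_rps(result: LoadTestResult, replicas: int, utilization: float) -> float:
    """Total RPS at which to add replicas, before any replica reaches its breaking point."""
    return replicas * result.breaking_rps_per_replica * utilization


def monthly_cost(replicas: int, replica_cost: float, series: int, cost_per_1k_series: float) -> float:
    """Infra cost (replicas) plus observability cost (retained time series)."""
    return replicas * replica_cost + series / 1000 * cost_per_1k_series


checkout = LoadTestResult("checkout", peak_rps=900, breaking_rps_per_replica=250)
replicas = replicas_needed(checkout, headroom=0.3)
trigger = scale_out_trigger_rps(checkout, replicas, utilization=0.75)
cost = monthly_cost(replicas, replica_cost=70.0, series=20_000, cost_per_1k_series=8.0)
print(f"{checkout.service}: {replicas} replicas, scale out at {trigger} rps, ${cost}/month")
```

Raising the headroom or tightening the SLO changes the replica count, which feeds directly into the cost figure; that link is what lets a cost model be weighed against the error budget.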

What's inside when you start

3 starter files, ready to clone
6 guided milestones
6 full reference solutions
9 code blocks explained line-by-line
6 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A reliability dashboard and SLO catalog spanning multiple services
A load-test report with documented headroom and scaling triggers
A cost model linking SLO targets to capacity and monthly infra + observability cost per service
A chaos-experiment report with hypothesis, findings, and follow-up fixes
On-call runbooks plus a blameless postmortem with tracked action items
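As an illustration of what a chaos-experiment report with a hypothesis and findings might contain, here is a small, deterministic sketch. FlakyDependency, call_with_retries, and Finding are hypothetical stand-ins invented for this example; a real experiment would inject faults into running services under a controlled blast radius rather than into an in-process object.

```python
# Sketch: a tiny, deterministic chaos experiment with a hypothesis and a finding.
# FlakyDependency, call_with_retries, and Finding are hypothetical, not project code.
from dataclasses import dataclass


class FlakyDependency:
    """Fault injection: every `fail_every`-th call raises ConnectionError."""

    def __init__(self, fail_every: int):
        self.fail_every = fail_every
        self.calls = 0

    def request(self) -> str:
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(dep: FlakyDependency, retries: int) -> bool:
    """Return True if the request succeeds within 1 + `retries` attempts."""
    for _ in range(1 + retries):
        try:
            dep.request()
            return True
        except ConnectionError:
            continue
    return False


@dataclass
class Finding:
    hypothesis: str
    success_rate: float
    hypothesis_held: bool


def run_experiment(retries: int, requests: int = 300, min_success: float = 0.99) -> Finding:
    """Inject faults, measure the steady-state metric, and compare it to the hypothesis."""
    dep = FlakyDependency(fail_every=3)
    succeeded = sum(call_with_retries(dep, retries) for _ in range(requests))
    rate = succeeded / requests
    hypothesis = f"success rate stays >= {min_success} with {retries} retries"
    return Finding(hypothesis, rate, rate >= min_success)


for retries in (0, 1):
    print(run_experiment(retries))
```

In this toy setup the hypothesis fails without retries (one call in three is lost) and holds with a single retry, because injected faults never occur on two consecutive calls. The follow-up fix in a report would be the change that turns a failed hypothesis into a passing one.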

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
