Stand up a reliability program across services
Reliability is a system, not one person's heroics. You build the program: SLOs across multiple services, capacity you have actually tested, a chaos experiment, and on-call that does not burn people out.
What you'll build
A multi-service reliability program with consistent SLOs and burn-rate alerting, load-tested capacity headroom, a chaos experiment with findings, and on-call runbooks plus a postmortem practice.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
services:
  - name: checkout
    job: checkout
    availability_slo: 0.999
    owner: payments-team
  - name: search
    job: search
    availability_slo: 0.995
    owner: discovery-team
Reading this file
- name: checkout
One catalog entry per service. Adding a service later is a single catalog entry, not a new rules file.
job: checkout
Matches the Prometheus job label so generated rules select the right service's metrics.
availability_slo: 0.999
The per-service target. Criticality varies, so checkout gets a stricter SLO than search.
owner: payments-team
Names who is accountable. Reliability is a program only when every service has a clear owner.
One source of truth for every service's target: vary the number, standardize the method.
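To make "generated rules" concrete, here is a minimal sketch of how a catalog like this could be turned into one burn-rate alert condition per service. It is not the project's generator. The metric name http_requests_total, the code label, the 1h window, and the 14.4 fast-burn factor (roughly 2% of a 30-day error budget spent in one hour) are all assumptions chosen for illustration.

```python
# Hypothetical in-memory copy of the catalog shown above.
CATALOG = [
    {"name": "checkout", "job": "checkout", "availability_slo": 0.999, "owner": "payments-team"},
    {"name": "search", "job": "search", "availability_slo": 0.995, "owner": "discovery-team"},
]

# Assumed fast-burn factor: spends about 2% of a 30-day budget in 1 hour.
FAST_BURN = 14.4


def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail under the given SLO."""
    return 1.0 - slo


def burn_rate_expr(service: dict, window: str = "1h", factor: float = FAST_BURN) -> str:
    """Build a PromQL-style condition that is true when the service burns
    its error budget faster than `factor` times the sustainable rate."""
    job = service["job"]
    threshold = factor * error_budget(service["availability_slo"])
    # Metric and label names below are assumptions, not taken from the project.
    errors = f'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
    return f"{errors} / {total} > {threshold:.6g}"


if __name__ == "__main__":
    for svc in CATALOG:
        print(svc["owner"], "->", burn_rate_expr(svc))
```

The method is identical for both services and only the number differs: under these assumptions the checkout condition fires above an error ratio of 0.0144, while search tolerates up to 0.072.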
That's 1 of 9 explained code blocks in this single project.
The build, milestone by milestone
1. Standardize SLOs (5 guided steps)
When every service measures reliability differently, you cannot compare, prioritize, or build shared tooling. A consistent SLO catalog makes reliability a program, not a collection of one-offs.
2. Test capacity (5 guided steps)
Capacity assumptions are guesses until you test them. Knowing where a service breaks tells you how much headroom you have and when autoscaling must kick in before users feel it.
3. Model the cost of reliability (5 guided steps)
Reliability is bought, not free: every nine, every replica of headroom, and every retained time series has a price. A program that cannot quantify its cost cannot defend it or trade it off against the error budget.
4. Break it on purpose (5 guided steps)
Systems fail in ways you did not design for. Chaos engineering surfaces those failure modes deliberately and safely, so you fix them on your schedule instead of at 3am.
5. Make on-call humane (5 guided steps)
On-call that pages constantly for non-actionable alerts burns people out and trains them to ignore the pager, which is how real incidents get missed.
6. Close the loop (5 guided steps)
Reliability improves only when incidents change the system. A postmortem that produces tracked, prioritized work is the engine that turns failures into durable fixes.
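The capacity milestone comes down to simple arithmetic once a load test has found the breaking point. The sketch below is one way to express it, not the project's solution. The function names, the 25% safety margin, and the request rates in the usage note are hypothetical.

```python
def headroom(breaking_rps: float, peak_rps: float) -> float:
    """Extra load, as a fraction of current peak, the service can absorb
    before it reaches the breaking point found by the load test."""
    if peak_rps <= 0:
        raise ValueError("peak_rps must be positive")
    return (breaking_rps - peak_rps) / peak_rps


def scale_out_at(breaking_rps: float, safety_margin: float = 0.25) -> float:
    """Load at which autoscaling should already be adding replicas.
    The safety margin is an assumed buffer below the breaking point."""
    if not 0 <= safety_margin < 1:
        raise ValueError("safety_margin must be in [0, 1)")
    return breaking_rps * (1 - safety_margin)


if __name__ == "__main__":
    # Hypothetical numbers: the test broke at 1200 rps, production peaks at 800 rps.
    print(f"headroom: {headroom(1200, 800):.0%}")
    print(f"scale out at: {scale_out_at(1200):.0f} rps")
```

With those made-up numbers the service has 50% headroom over its peak, and scaling would need to start by 900 rps to stay ahead of the breaking point.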
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.