Define SLOs, alert on them, and run a game day
Alerts are either silent or screaming, and nobody can say how reliable the service should be. You define what reliable means, alert only on real user pain, and prove the team can respond by running a controlled incident.
What you'll build
A service with documented SLOs and error budgets, multi-window burn-rate alerts that page only on user-facing pain, and a blameless postmortem from a simulated incident.
See how we teach before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line by line, in plain English. Here's one real file from this project:
groups:
  - name: slo_recording
    rules:
      - record: job:slo_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status="5xx"}[5m]))
          / sum(rate(http_requests_total[5m]))
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status="5xx"}[1h]))
          / sum(rate(http_requests_total[1h]))
Reading this file
record: job:slo_errors:ratio_rate5m
Precomputes the error ratio under a reusable name so alerts and dashboards share one definition.
status="5xx" / total
The recorded value is the fraction of requests failing, the heart of every SLO calculation.
ratio_rate5m vs ratio_rate1h
Two windows of the same ratio let you build multi-window burn-rate alerts on top.
One recording rule = one source of truth for the error ratio, reused by alerts and dashboards.
That's 1 of 8 explained code blocks in this single project.
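As a rough illustration of what "multi-window burn-rate alerts on top" could look like, here is a minimal sketch of an alerting rule that reuses the two recorded series from the file above. It is not the project's solution. The group name, alert name, `severity` label, 99.9% SLO, and 14.4 burn-rate factor are all assumptions chosen for the example.

```yaml
groups:
  - name: slo_alerts                  # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn    # hypothetical alert name
        # Assumes a 99.9% SLO, so the allowed error ratio is 0.001.
        # 14.4 is an assumed burn-rate factor: at 14.4x, about 2% of a
        # 30-day error budget is spent in one hour.
        expr: |
          job:slo_errors:ratio_rate1h > (14.4 * 0.001)
          and
          job:slo_errors:ratio_rate5m > (14.4 * 0.001)
        labels:
          severity: page              # assumed routing label
        annotations:
          summary: "Error budget is burning faster than 14.4x the sustainable rate"
```

Requiring both windows is the point of the design: the 1h series shows the burn is sustained, while the 5m series shows it is still happening, so the alert stops firing soon after the errors stop.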
The build, milestone by milestone
- 1
Define SLIs
5 guided steps
You can only set a target on something you measure. A good SLI is a ratio of good events to valid events, and it has to map to what users actually feel.
- 2
Set SLOs & budgets
5 guided steps
An SLO turns a vague "be reliable" into a number you can engineer against, and the error budget turns reliability into a shared, spendable resource that balances dev velocity against stability.
- 3
Alert on symptoms
5 guided steps
Cause-based alerts ("CPU high") page constantly and often mean nothing to users. Burn-rate alerts page only when the error budget is being consumed fast enough to matter.
- 4
Budget the telemetry & write the runbook
5 guided steps
Telemetry cost scales with cardinality and retention, and an alert with no runbook is just noise at 3am. Both are what separate a real reliability setup from a demo: you should know the bill and the response before the incident, not during it.
- 5
Run a game day
5 guided steps
Alerts and runbooks are theory until a human runs them under pressure. A game day surfaces the gaps (missing access, unclear ownership, bad runbook) before a real incident does.
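The arithmetic behind milestones 2 and 3 can be worked through with a few illustrative numbers. The 99.9% target, the 30-day window, and the observed error ratio of 0.0144 are assumptions picked for the example, not values taken from the project.

```latex
\begin{align*}
\text{error budget} &= 1 - \text{SLO} = 1 - 0.999 = 0.001 \\
\text{budget as full downtime over 30 days} &= 0.001 \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min} \\
\text{burn rate} &= \frac{\text{observed error ratio}}{1 - \text{SLO}} = \frac{0.0144}{0.001} = 14.4 \\
\text{time to exhaust the budget} &= \frac{30 \times 24\ \text{h}}{14.4} = 50\ \text{h}
\end{align*}
```

A burn rate of 1 spends exactly the whole budget over the window. Anything well above 1 means the budget runs out early, which is the condition a burn-rate alert is meant to catch.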
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.