Medium · Working system · ~15h · 5 milestones

Define SLOs, alert on them, and run a game day

Alerts are either silent or screaming, and nobody can say how reliable the service should be. You define what reliable means, alert only on real user pain, and prove the team can respond by running a controlled incident.

SLI/SLO design · Error budgets · Burn-rate alerting · Incident response · Postmortems · Observability cost control · Runbooks

What you'll build

A service with documented SLOs and error budgets, multi-window burn-rate alerts that page only on user-facing pain, and a blameless postmortem from a simulated incident.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

rules/slo_recording.yml (yaml)
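groups:
  - name: slo_recording
    rules:
      # Fraction of requests that failed over the last 5 minutes.
      # Assumes the status label holds the class string "5xx";
      # raw codes such as "500" would need a regex matcher instead.
      - record: job:slo_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status="5xx"}[5m]))
            / sum(rate(http_requests_total[5m]))
      # Same ratio over 1 hour, for the longer alerting window.
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status="5xx"}[1h]))
            / sum(rate(http_requests_total[1h]))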

Reading this file

  • record: job:slo_errors:ratio_rate5m - Precomputes the error ratio under a reusable name so alerts and dashboards share one definition.
  • status="5xx" / total - The recorded value is the fraction of requests failing, the heart of every SLO calculation.
  • ratio_rate5m vs ratio_rate1h - Two windows of the same ratio let you build multi-window burn-rate alerts on top.

One recording rule = one source of truth for the error ratio, reused by alerts and dashboards.
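
The two recorded windows above could feed a fast-burn page along these lines. This is a minimal sketch, not a file from the project: it assumes a 99.9% SLO over 30 days (a budgeted error ratio of 0.001), and it uses the commonly cited fast-burn factor of 14.4, which corresponds to spending about 2% of a 30-day budget in one hour. The group name, alert name, `severity` label, and `for` duration are illustrative assumptions.

```yaml
groups:
  - name: slo_alerts
    rules:
      # Hypothetical alert name; assumes a 99.9% SLO, so the budget is 0.001.
      - alert: ErrorBudgetFastBurn
        # Both windows must exceed 14.4x the budgeted error ratio:
        # the 1h window shows the burn is sustained, and the 5m window
        # shows it is still happening right now.
        expr: |
          job:slo_errors:ratio_rate1h > (14.4 * 0.001)
            and
          job:slo_errors:ratio_rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget is burning at more than 14.4x the sustainable rate"
```

Requiring both windows is the design choice that keeps this quiet: a short spike trips the 5m ratio but not the 1h ratio, and a burn that has already stopped clears the 5m ratio quickly so the page resolves.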

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Define SLIs

    5 guided steps

    You can only set a target on something you measure. A good SLI is a ratio of good events to valid events, and it has to map to what users actually feel.

  2. Set SLOs & budgets

    5 guided steps

    An SLO turns a vague "be reliable" into a number you can engineer against, and the error budget turns reliability into a shared, spendable resource between dev velocity and stability.

  3. Alert on symptoms

    5 guided steps

    Cause-based alerts ("CPU high") page constantly and often mean nothing to users. Burn-rate alerts page only when the error budget is being consumed fast enough to matter.

  4. Budget the telemetry & write the runbook

    5 guided steps

    Telemetry cost scales with cardinality and retention, and an alert with no runbook is just noise at 3 a.m. Both are what separate a real reliability setup from a demo: you should know the bill and the response before the incident, not during it.

  5. Run a game day

    5 guided steps

    Alerts and runbooks are theory until a human runs them under pressure. A game day surfaces the gaps (missing access, unclear ownership, bad runbook) before a real incident does.
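
The budget arithmetic behind the SLO and alerting milestones can be sketched as recording rules. This is an illustrative sketch under stated assumptions, not one of the project's files: it assumes a 99.9% target over a 30-day window, which gives a budgeted error ratio of 0.001 (about 43.2 minutes of full outage per 30 days). The group name and the three rule names are hypothetical, and the second rule reuses `job:slo_errors:ratio_rate1h` from the recording file shown earlier.

```yaml
groups:
  - name: slo_budget
    rules:
      # Error ratio across the whole 30-day SLO window.
      # Assumes at least 30 days of retention for http_requests_total.
      - record: job:slo_errors:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{status="5xx"}[30d]))
            / sum(rate(http_requests_total[30d]))
      # Burn rate: observed error ratio divided by the budgeted ratio (1 - 0.999).
      # 1 means the budget lasts exactly 30 days; 14.4 means it is gone in about 50 hours.
      - record: job:slo_burn_rate:ratio_rate1h
        expr: job:slo_errors:ratio_rate1h / 0.001
      # Fraction of the 30-day budget still unspent; negative once the SLO is missed.
      - record: job:slo_error_budget:remaining_ratio
        expr: 1 - (job:slo_errors:ratio_rate30d / 0.001)
```

A dashboard panel could plot the remaining-budget series directly, so the number the team argues about in planning is the same one the alerts are derived from.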

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

An SLO document with SLI definitions, targets, and an error-budget policy
Multi-window burn-rate alert rules committed as code
A dashboard showing SLO attainment and remaining error budget
A telemetry cost-and-retention note with active-series count and a budget ceiling
A runbook for the fast-burn page with concrete diagnosis and mitigation steps
A blameless postmortem with tracked action items from the game day
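
The active-series count and budget ceiling from the cost note could be tracked as code too. The sketch below assumes Prometheus scrapes its own metrics, so that `prometheus_tsdb_head_series` is available. The ceiling of 200000 is a placeholder, and the group, rule, and alert names are hypothetical.

```yaml
groups:
  - name: telemetry_budget
    rules:
      # Active series currently held in the Prometheus head block,
      # summed in case more than one Prometheus instance is scraped.
      - record: prometheus:active_series:count
        expr: sum(prometheus_tsdb_head_series)
      # Placeholder ceiling; take the real number from your cost-and-retention note.
      - alert: ActiveSeriesOverBudget
        expr: prometheus:active_series:count > 200000
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Active series count is above the agreed telemetry budget"
```

Routing this as a ticket rather than a page matches the project's premise: a cost overrun is worth fixing, but it is not user-facing pain.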

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.

Start building