Set SLOs and spend an error budget
Continues from the last build. Rung 3 left you with user-centric SLIs wired into Prometheus (checkout availability, checkout journey latency, and inventory freshness), all queryable but with no agreed targets.
You inherited Carta with SLIs that work: you can graph the fraction of POST /checkout requests that succeed and the p95 of the full checkout journey.
What you'll build
You walk away with two committed SLOs, Prometheus recording rules that compute SLI, error budget, and burn rate continuously, a Grafana SLO dashboard that an on-call can read in ten seconds, and a written error budget policy that turns "should we fix this?" into a rule instead of an argument. You will have spent budget on purpose and watched the dashboard react, so the numbers stop being abstract.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
# Carta checkout SLOs

Window: rolling 30 days for both.

## SLO 1: checkout availability

Target: 99.9% of POST /checkout requests succeed (HTTP 2xx).
SLI: successful_checkouts / total_checkouts.
Error budget: 0.1% of requests may fail.
Time budget: 30 days = 43200 minutes, so 0.1% = 43.2 minutes of total failure.

## SLO 2: checkout latency

Target: 95% of checkout journeys finish under 400ms.
SLI: fast_checkouts (le 0.4s) / total_checkouts.
Error budget: up to 5% of requests may exceed 400ms.

## Why these numbers

99.9 is reachable for a stub-backed demo and leaves a real budget to spend.
400ms is the point past which a checkout feels slow to a buyer.
Reading this file
- rolling 30 days: A rolling window means the budget is always measured over the last 30 days, not reset on the 1st, so it cannot be gamed by waiting for month end.
- 43200 minutes: 30 days times 24 hours times 60 minutes; this is the denominator that turns a percentage budget into a minutes budget on-call can feel.
- 43.2 minutes of total failure: 0.1% of 43200 minutes; this is the single most quoted number on this rung, the monthly allowance for SLO 1.
- 95% of checkout journeys finish under 400ms: A latency SLO is a percentile target, not an average; 95% under 400ms still allows a slow tail for 5% of buyers.
The contract. Two SLOs, the window, and the budget each one buys, written so a new on-call understands it in one read.
That's 1 of 7 explained code blocks in this single project.
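The budget arithmetic in the SLO file above can be checked in a few lines. This is a minimal Python sketch, not a file from the project: the function names and the sample request count are hypothetical, and only the targets (99.9%, 95%) and the 30-day window come from the SLO contract.

```python
# Sketch of the error budget arithmetic from the SLO contract.
# Function names and the sample request count are illustrative assumptions.

MINUTES_PER_DAY = 24 * 60


def window_minutes(window_days: int) -> int:
    """Length of the SLO window in minutes (30 days -> 43200)."""
    return window_days * MINUTES_PER_DAY


def budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of total failure the SLO tolerates over the window."""
    return (1.0 - target) * window_minutes(window_days)


def allowed_bad_events(target: float, total_events: int) -> float:
    """How many requests may miss the SLO out of total_events."""
    return (1.0 - target) * total_events


if __name__ == "__main__":
    # SLO 1: 99.9% availability over a rolling 30 days.
    print(f"window: {window_minutes(30)} minutes")
    print(f"availability budget: {budget_minutes(0.999):.1f} minutes")

    # SLO 2: 95% of journeys under 400ms. The request count is made up.
    sample_requests = 10_000
    slow_ok = allowed_bad_events(0.95, sample_requests)
    print(f"latency budget: {slow_ok:.0f} slow requests per {sample_requests}")
```

The minutes figure only makes sense for the availability SLO, where the budget can be pictured as total downtime. For the latency SLO the budget is better read as a count of slow requests, which is why the sketch keeps the two helpers separate.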
The build, milestone by milestone
- 1. Do the budget math and write the SLO contract
  3 guided steps. An SLO no one computed is an SLO no one believes. Deriving the budget yourself is what makes the later dashboard mean something instead of being a decoration.
- 2. Encode SLI, budget, and burn rate as recording rules
  3 guided steps. Computing the budget once, centrally, means the dashboard and next rung's alerts read the same trustworthy number instead of each re-deriving it slightly differently.
- 3. Build the SLO dashboard as code
  3 guided steps. A budget the team cannot see is a budget the team will not respect. Dashboard-as-code means anyone can stand up the identical board and it survives a Grafana wipe.
- 4. Write the error budget policy
  3 guided steps. Numbers without a policy are just decoration. The policy is the pre-agreed decision so that at 2am no one argues; they read the rule. It is also what makes the SLO have teeth instead of being aspirational.
- 5. Spend the budget on purpose and log it
  3 guided steps. Until you have watched the number move you do not trust it. Spending budget on purpose proves the rules, the dashboard, and the policy all react together, and it rehearses the policy decision in a safe setting.
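Milestones 2 and 5 revolve around three numbers: the SLI, the remaining error budget, and the burn rate. In the project these live in Prometheus recording rules; the Python below is only a sketch of the arithmetic such rules commonly encode, using the usual definition of burn rate as the observed error ratio divided by the ratio the SLO allows. Every function name and request count here is a hypothetical illustration, not taken from the project's rule files.

```python
# Sketch of SLI, budget remaining, and burn rate for an event-based SLO.
# Names and sample counts are illustrative assumptions.


def error_ratio(bad: float, total: float) -> float:
    """Fraction of events that missed the SLO; 0.0 when there is no traffic."""
    return bad / total if total > 0 else 0.0


def burn_rate(bad: float, total: float, target: float) -> float:
    """Observed error ratio relative to the allowed ratio (1 - target).

    1.0 means the budget would be spent exactly at the end of the window;
    10.0 means it is being spent ten times too fast.
    """
    return error_ratio(bad, total) / (1.0 - target)


def budget_remaining(bad: float, total: float, target: float) -> float:
    """Fraction of the window's budget still unspent; negative once blown."""
    allowed = (1.0 - target) * total
    return 1.0 - bad / allowed if allowed > 0 else 1.0


def hours_to_exhaustion(rate: float, remaining: float = 1.0,
                        window_days: int = 30) -> float:
    """Hours until the remaining budget is gone at a sustained burn rate."""
    if rate <= 0:
        return float("inf")
    return remaining * window_days * 24 / rate


if __name__ == "__main__":
    target = 0.999  # SLO 1: checkout availability

    # Whole window so far: 400 failures out of 1,000,000 checkouts.
    print(f"budget remaining: {budget_remaining(400, 1_000_000, target):.0%}")

    # Last hour during a deliberate failure injection: 20 of 2,000 failed.
    rate = burn_rate(20, 2_000, target)
    print(f"burn rate: {rate:.1f}x")
    print(f"full budget gone in {hours_to_exhaustion(rate):.0f} hours")
```

Computing burn rate over a short window and budget remaining over the full 30 days is what lets a dashboard show both "how fast are we spending right now" and "how much is left", which are the two questions an error budget policy needs answered.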