Beginner · Carta · Project 3 of 14 · ~4h · 5 milestones

Define healthy: pick the SLIs that matter

Continues from the last build: You can triage a slow box with the USE method, but you still cannot say whether the whole service is healthy for users.

Two incidents in, and the same painful question keeps coming up in the channel: "is Carta healthy right now?" Nobody can answer it without opening six Grafana dashboards and arguing about which line is bad.

  • SLI selection and user-journey thinking
  • Distinguishing SLIs from vanity metrics
  • PromQL rate and histogram_quantile
  • Availability ratios from error counters
  • Latency SLIs with histogram buckets
  • Freshness SLIs from timestamp gauges
  • Grafana panel building
  • Load generation with k6 to validate queries

What you'll build

You walk away able to name what "healthy" means for a service in user terms, defend why CPU is not an SLI, and write production-grade PromQL for availability, latency, and freshness against existing metrics, all on one panel that answers the question in five seconds.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

slo/inject.sh (bash)
#!/usr/bin/env bash
# Validate SLIs by injecting faults via the inherited stub knobs.
# Every failure here is INJECTED; in real prod these would occur on their own.
# `docker compose up` has no -e flag, so the knobs are passed through the shell
# environment. Assumption: prod-sim.yml forwards FAIL_RATE and LATENCY_MS to
# payments-stub (for example via ${FAIL_RATE:-0} interpolation).
set -euo pipefail
COMPOSE="docker compose -f prod-sim.yml"

case "${1:-help}" in
  fail)   FAIL_RATE=0.3 $COMPOSE up -d payments-stub; echo "FAIL_RATE=0.3 set, watch availability SLI drop" ;;
  slow)   LATENCY_MS=1500 $COMPOSE up -d payments-stub; echo "LATENCY_MS=1500 set, watch latency p95 climb" ;;
  stale)  $COMPOSE pause inventory; echo "inventory paused, watch freshness age climb" ;;
  reset)  FAIL_RATE=0 LATENCY_MS=0 $COMPOSE up -d payments-stub; $COMPOSE unpause inventory || true; echo "knobs reset" ;;
  *)      echo "usage: $0 {fail|slow|stale|reset}" ;;
esac

Reading this file

  • set -euo pipefail: Fail fast on any error, unset variable, or failed pipeline stage so a typo does not silently do nothing.
  • FAIL_RATE=0.3: Injects a 30 percent payment failure rate to make the availability SLI drop.
  • $COMPOSE pause inventory: Freezes the inventory service so its last-refresh timestamp stops advancing and the freshness age climbs.
  • usage: $0 {fail|slow|stale|reset}: Reminds you to test each fault separately, and always reset at the end.

A tiny helper so you change one SLI input at a time and can prove each query is honest. It only nudges the stub knobs and compose, never real systems.

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Map the user journey and shortlist candidate SLIs

    4 guided steps

    An SLI is a chosen measurement of the level of service a user experiences. If you start from the metrics you happen to have (CPU, RAM), you measure the system, not the service. Starting from the journey forces user-centric choices and stops you from drowning in dashboards.

  2. Write the availability SLI in PromQL

    4 guided steps

    Availability is the headline SLI for any request-driven service. Done right, it is a ratio of good events to valid events, expressed as a number you can put an objective on later. Counters must be wrapped in rate() because a raw counter only ever climbs (until a restart resets it), so its absolute value says nothing about how the service is doing right now.

  3. Write the latency SLI from histogram buckets

    4 guided steps

    Averages hide pain: a 200ms mean can hide a 4s tail that loses customers. A tail percentile (p95 or p99) is the user-centric latency SLI. Histograms are the right tool because they let you estimate any quantile cheaply at query time, with accuracy limited by where the bucket boundaries sit.

  4. Write the freshness SLI for inventory

    4 guided steps

    Not every SLI is about requests. Freshness, correctness, and coverage are data SLIs. For a storefront, stale stock means overselling, a real user harm that availability and latency cannot detect. This teaches you that health sometimes lives in a gauge, not a counter.

  5. Build one health panel and validate it under injected load

    4 guided steps

    The whole point of this rung is replacing six dashboards with one trustworthy view. A health panel is only useful if you have demonstrated each SLI responds to the failure it is supposed to catch and ignores noise it should not.
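Milestones 2 to 4 each come down to a small calculation. The Python sketch below is not the project's reference solution; it only illustrates the arithmetic behind the three SLIs so you can sanity-check a query by hand. The function names and the sample numbers are hypothetical. `estimate_quantile` follows the linear interpolation that PromQL's histogram_quantile documents for classic cumulative buckets, in simplified form, and `availability_ratio` ignores counter resets, which rate() would handle for you.

```python
import math


def availability_ratio(total_start, total_end, errors_start, errors_end):
    """Good events / valid events over one window, from counter readings.

    Mirrors the idea of rate(good) / rate(total): the window length cancels,
    so only the counter deltas matter. Counter resets are NOT handled here.
    """
    valid = total_end - total_start
    bad = errors_end - errors_start
    if valid <= 0:
        return None  # no traffic in the window: the ratio is undefined
    return (valid - bad) / valid


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with at least two entries and ending in a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


def freshness_age_seconds(now, last_refresh_timestamp):
    """Age of the data: current time minus the last-refresh timestamp gauge."""
    return max(0.0, now - last_refresh_timestamp)
```

With made-up cumulative buckets of 50 requests under 0.1s, 90 under 0.5s, 98 under 1s, and 100 in total, `estimate_quantile(0.95, ...)` lands inside the 0.5s to 1s bucket at 0.8125s. The answer can only be as precise as the bucket that contains it, which is why bucket boundaries near your latency target matter.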

What's inside when you start

2 starter files, ready to clone
5 guided milestones
5 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

slo/sli-candidates.md: the user journey mapped to the SLIs you keep and the vanity metrics you rejected, with reasons
slo/queries/availability.promql: non-5xx ratio for POST /checkout
slo/queries/latency.promql: p95 checkout latency from histogram buckets
slo/queries/freshness.promql: inventory data age from a timestamp gauge
slo/dashboards/health.json: one Grafana panel showing all three SLIs
A short validation note recording which knob moved which SLI (FAIL_RATE, LATENCY_MS, inventory pause)
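One way to make the validation note checkable is to write down which SLI each knob is supposed to move and compare that against what you observed. The sketch below is only an illustration: the fault names come from slo/inject.sh, but the SLI names, the tolerances, and every number in the usage example are hypothetical placeholders, not values from the project.

```python
# Which SLI each injected fault is expected to move, and nothing else.
EXPECTED = {
    "fail": {"availability"},    # FAIL_RATE knob
    "slow": {"latency_p95"},     # LATENCY_MS knob
    "stale": {"freshness_age"},  # inventory pause
}


def moved_slis(baseline, observed, tolerance):
    """Names of SLIs that changed by more than their allowed noise."""
    return {
        name
        for name in baseline
        if abs(observed[name] - baseline[name]) > tolerance[name]
    }


def fault_is_honest(fault, baseline, observed, tolerance):
    """True when exactly the expected SLI moved and the others stayed put."""
    return moved_slis(baseline, observed, tolerance) == EXPECTED[fault]


# Usage with made-up readings taken before and during the "fail" injection.
baseline = {"availability": 0.999, "latency_p95": 0.25, "freshness_age": 20.0}
tolerance = {"availability": 0.01, "latency_p95": 0.10, "freshness_age": 30.0}
during_fail = {"availability": 0.70, "latency_p95": 0.27, "freshness_age": 22.0}
print(fault_is_honest("fail", baseline, during_fail, tolerance))
```

If a fault moves an SLI it should not, the check fails. Treat that as a prompt to look at the query or the tolerance, not as proof of a bug, because real faults can have side effects on other signals.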

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
