Triage a live incident, find it, stop it
You just joined the team that runs Carta, a small storefront checkout system, and the pager goes off in your first hour: checkout is timing out.
What you'll build
You walk away having run a full on-call loop under pressure: symptom to dashboards to logs to a localized root cause (connection pool exhaustion) to a mitigation (raise the pool, scale api) to telemetry-verified recovery, plus a short written incident timeline you could paste into a channel. You learn to USE existing observability rather than build it, and you internalize that "the dashboard is green again" is a claim you must prove, not assume.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
# Runbook: checkout latency triage (the on-call loop)
Follow this loop every time. Do not skip steps under pressure.
1. Orient: confirm the baseline. What does healthy p95 look like right now?
- Grafana http://localhost:3001 -> Carta Overview -> checkout p95
2. Reproduce / observe the symptom.
- Load: k6 run load/checkout.js
- Confirm: timed checkout with curl -w "%{time_total}\n"
3. Localize. Eliminate, do not guess.
- Are downstreams (catalog, inventory, payments) slow, or just api?
- Logs: Loki Explore -> {service="api"} |= "pool"
4. Mitigate. Smallest safe change first.
- Raise DB_POOL_SIZE, optionally add an api replica, recreate api.
5. Verify recovery WITH telemetry, under the same load.
- p95 back near baseline AND pool wait logs stopped.
6. Write the timeline (docs/incident-timeline.md).
Reading this file
1. Orient: confirm the baseline.
Step one is always orientation. You cannot prove recovery without a baseline number.

Loki Explore -> {service="api"} |= "pool"
The exact LogQL query that localizes pool exhaustion, ready to paste into Grafana Explore.

4. Mitigate. Smallest safe change first.
Raise the pool before reaching for bigger hammers. Minimal, reversible mitigations first.

5. Verify recovery WITH telemetry, under the same load.
The habit this whole rung teaches: recovery is a telemetry claim, proven under the original load.

Your on-call loop on one page. Keep it open while you work the incident; it is the spine of every milestone.
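Steps 1 and 5 of the runbook come down to two numbers and one yes/no decision. The following is a minimal sketch of that check in Python, not a file from the project. It assumes you have already collected latency samples in milliseconds (for example from repeated timed curl calls) and a batch of api log lines as plain strings. The function names, the nearest-rank percentile method, and the 1.5x tolerance are all illustrative assumptions.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def count_pool_lines(log_lines):
    """Count lines containing "pool", mirroring the |= "pool" line filter."""
    return sum(1 for line in log_lines if "pool" in line)


def recovered(baseline_p95_ms, current_p95_ms, pool_lines, tolerance=1.5):
    """True only if p95 is near baseline AND no pool lines remain.

    tolerance is a hypothetical threshold: how many times the baseline
    p95 you still accept as "near baseline". Pick your own.
    """
    near_baseline = current_p95_ms <= baseline_p95_ms * tolerance
    return near_baseline and pool_lines == 0
```

Both conditions must hold, which matches step 5: a p95 that looks healthy while pool wait lines are still arriving is not a recovery. Take the "current" samples under the same load you used to reproduce the symptom, otherwise the comparison against the baseline proves nothing.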
That's 1 of 8 explained code blocks in this single project.
The build, milestone by milestone
1. Bring up the inherited stack and confirm a healthy baseline (4 guided steps)
   On-call begins with orientation, not action. A baseline turns a vague feeling ("checkout feels slow") into a measurable claim ("p95 went from 90ms to 4s"). Without it you cannot tell whether a mitigation actually worked.
2. Reproduce the incident by injecting load (4 guided steps)
   A fault you cannot reproduce is a fault you cannot confidently fix. Reproducing on demand lets you watch the symptom appear and, later, watch it disappear, which is how you prove the mitigation worked.
3. Localize the fault with Grafana and Loki (4 guided steps)
   The hardest part of on-call is not fixing; it is localizing. Latency at the edge can come from any layer. Disciplined elimination (which dependency is slow? what do the logs say?) is what separates a 10-minute incident from a 2-hour one.
4. Mitigate by raising the pool and scaling api, then verify recovery (4 guided steps)
   Mitigation without verification is wishful thinking. Many incidents get prolonged because someone declared victory at deploy time. Closing the loop with the same dashboard you opened it with is the core SRE habit this whole rung exists to teach.
5. Write the incident timeline (4 guided steps)
   Incidents that are not written down repeat. A crisp timeline with real numbers makes the next on-call faster, feeds a future postmortem, and is the artifact a hiring manager or teammate trusts more than any verbal war story.
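To make "a crisp timeline with real numbers" concrete, here is a small sketch that renders timestamped notes into a pasteable list. It is an illustration only: the Event type, the render_timeline helper, and the output layout are assumptions, not the format the project's docs/incident-timeline.md prescribes. Sorting relies on zero-padded HH:MM times from a single day.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: str  # zero-padded wall-clock time, e.g. "14:02" (same day assumed)
    note: str  # what you saw or did, with a number wherever you have one


def render_timeline(title, events):
    """Return a plain-text timeline, oldest event first."""
    lines = [f"# {title}", ""]
    # String sort is chronological only because times are zero-padded HH:MM
    for event in sorted(events, key=lambda e: e.time):
        lines.append(f"- {event.time} {event.note}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder entries: substitute the times and numbers you observed.
    print(render_timeline("Checkout latency incident", [
        Event("14:10", "Raised DB_POOL_SIZE, recreated api"),
        Event("14:02", "Page received: checkout timing out"),
    ]))
```

Write each note as you go rather than reconstructing it afterwards, and put the measured value in the note itself (the p95 before, the p95 after) so the timeline can stand as evidence on its own.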
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.