Debug the slow box with the USE method
Continues from the last build: rung 1 left you with checkout latency tamed (DB pool widened from 5 to 20), but one api replica still drags even though its logs are clean.
The pager is quiet again. In rung 1 you traced the checkout latency spike to an exhausted api database connection pool, bumped the pool, and watched the dashboard recover.
What you'll build
You walk away able to triage a slow host without leaning on logs or guesses: a repeatable USE checklist across CPU, memory, disk, network, and file descriptors; fluency in top, vmstat, iostat, ss, /proc, and docker stats; and the ability to read a py-spy flame graph to find a CPU hot path and to confirm an fd leak by counting descriptors over time.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
# Inherited prod-sim stack (excerpt). Telemetry services omitted for brevity.
# This rung adds two injected-fault flags. Set to 1 to inject, 0 to fix.
services:
  api:
    build: ./api
    environment:
      - DB_POOL_SIZE=20 # widened in rung 1
      - API_SLOW_SERIALIZER=1 # INJECTED: CPU-burning JSON hot path
    deploy:
      replicas: 3
    ports:
      - "8080:8080"
  inventory:
    build: ./inventory
    environment:
      - REDIS_URL=redis://redis:6379
      - INVENTORY_LEAK_FDS=1 # INJECTED: fd opened per request, never closed
    ports:
      - "8082:8082"
  payments-stub:
    build: ./payments-stub
    environment:
      - FAIL_RATE=0
      - LATENCY_MS=0
    ports:
      - "9000:9000"
Reading this file
API_SLOW_SERIALIZER=1: Set to 1 to inject the CPU hot path, 0 to fix it in the last milestone.
INVENTORY_LEAK_FDS=1: Set to 1 to inject the fd leak, 0 to fix it once you have confirmed the leak.
replicas: 3: Three api replicas so only the injected one looks slow on the dashboard.
DB_POOL_SIZE=20: Carried from rung 1, the pool fix that is no longer the bottleneck.
The two INJECTED comments mark the only lines you toggle this rung. The stable contract (ports, healthz, k6 path) stays untouched.
That's 1 of 8 explained code blocks in this single project.
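To make the "set to 1 to inject, 0 to fix" convention concrete, here is a minimal sketch of how a service might read such a flag. The helper name flag_enabled and the "anything other than 1 means off" rule are assumptions for illustration; the project's real services may parse these variables differently.

```python
import os


def flag_enabled(name, environ=os.environ):
    """Return True only when the variable is set to the string "1".

    Hypothetical helper: an unset variable or any other value counts as off.
    """
    return environ.get(name, "0").strip() == "1"


if __name__ == "__main__":
    # Report the state of the two injected-fault flags from the compose file.
    for name in ("API_SLOW_SERIALIZER", "INVENTORY_LEAK_FDS"):
        state = "injected" if flag_enabled(name) else "off"
        print(f"{name}: {state}")
```

Treating every value except "1" as off keeps a typo in the compose file from silently injecting a fault.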
The build, milestone by milestone
- 1
Reproduce the slowness and write a USE checklist
4 guided steps. SRE triage starts with a reproducible signal and a method. The USE method (for every resource, check utilization, saturation, and errors) stops you from tunnel-visioning on CPU when the real problem might be file descriptors. A reliable repro means you can prove the fix later.
- 2
Walk CPU and memory with top, vmstat, and docker stats
4 guided steps. docker stats gives a fast per-container read, but it hides per-thread detail and the run queue. vmstat shows saturation that a single utilization percentage cannot: the r column counts runnable tasks, and a value that stays above the CPU count means work is queuing for a CPU. Knowing the PID lets you target /proc and py-spy later.
- 3
Profile the CPU hot path with a py-spy flame graph
4 guided steps. Utilization tells you that the CPU is busy, not why. A flame graph shows where time is actually spent: the wider a frame, the more samples it appeared in. py-spy is a sampling profiler that attaches to a live process with no code changes and no restart, which is exactly what you want during an incident.
- 4
Find the file descriptor leak in inventory
4 guided steps. fd leaks are silent: no CPU spike, no memory blowup, just a slow march toward the open-file limit (ulimit -n), after which every new connection fails with EMFILE and the service falls over with no obvious cause. Counting the entries in /proc/PID/fd over time is the canonical way to catch a leak before it becomes an outage.
- 5
Fix both bugs and prove it with the same tools
4 guided steps. Verification is the difference between 'resolved' and 'seems better'. By re-running k6, docker stats, the flame graph, and the fd count, you prove that utilization dropped, the hot path shrank, and the fd count stabilized. This is the closing half of the on-call loop you learned in rung 1, now applied to host performance.
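The saturation check from milestone 2 can be sketched without vmstat. This is a minimal sketch, assuming a Linux host: it reads the kernel's procs_running counter from /proc/stat, a figure closely related to what vmstat reports in its r column, and compares it to the CPU count. The function names are illustrative, and a single sample is noisy, so treat one reading as a hint and not a verdict.

```python
import os


def parse_procs_running(stat_text):
    """Extract the procs_running counter from the text of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("procs_running "):
            return int(line.split()[1])
    raise ValueError("procs_running not found in input")


def is_saturated(runnable, cpus):
    """More runnable tasks than CPUs means some task is waiting its turn."""
    return runnable > cpus


if __name__ == "__main__":
    with open("/proc/stat") as f:
        runnable = parse_procs_running(f.read())
    cpus = os.cpu_count() or 1
    verdict = "saturated" if is_saturated(runnable, cpus) else "not saturated"
    print(f"runnable={runnable} cpus={cpus}: {verdict}")
```

Run it several times under load: saturation shows up as a runnable count that stays above the CPU count, not as one high reading.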
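To see why wide frames are the expensive ones in milestone 3, it helps to remember that a sampling profiler just records stacks at intervals and counts them. The sketch below is not py-spy's code, and the stack samples and function names are made up for illustration. It counts how many samples each function appears in, which is roughly what frame width encodes.

```python
from collections import Counter


def frame_widths(samples):
    """Count, for each function, the number of samples it appears in."""
    widths = Counter()
    for stack in samples:
        # set() counts a function once per sample, even if it recurses.
        for frame in set(stack):
            widths[frame] += 1
    return widths


# Hypothetical stacks, outermost call first, as a sampler might record them.
samples = [
    ("handle_checkout", "serialize_order", "json_encode"),
    ("handle_checkout", "serialize_order", "json_encode"),
    ("handle_checkout", "serialize_order"),
    ("handle_checkout", "query_db"),
]

widths = frame_widths(samples)
total = len(samples)
for frame, count in widths.most_common():
    print(f"{frame}: {count}/{total} samples ({100 * count / total:.0f}%)")
```

A real flame graph also keeps the parent-child position of each frame; this sketch drops that detail to show only the counting.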
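Milestone 4's check, counting /proc/PID/fd over time, fits in a few lines of Python. This is a minimal sketch, assuming a Linux host and permission to read the target process's /proc entry. The names count_fds, sample_fds, and is_leaking are illustrative, and the "never shrinks and grows overall" rule is a rough heuristic, not a definitive leak test.

```python
import os
import sys
import time


def count_fds(pid):
    """Count open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fds(pid, samples=5, interval=1.0):
    """Take several fd counts, sleeping `interval` seconds between them."""
    counts = []
    for i in range(samples):
        counts.append(count_fds(pid))
        if i < samples - 1:
            time.sleep(interval)
    return counts


def is_leaking(counts, min_growth=1):
    """Heuristic: the count never drops and grows by at least min_growth."""
    never_drops = all(b >= a for a, b in zip(counts, counts[1:]))
    return never_drops and (counts[-1] - counts[0]) >= min_growth


if __name__ == "__main__":
    pid = int(sys.argv[1])
    counts = sample_fds(pid)
    print(f"fd counts for pid {pid}: {counts}")
    print("possible leak" if is_leaking(counts) else "no steady growth seen")
```

Run it against the inventory process's PID while load is applied. A healthy service's count wobbles around a plateau, while a leak climbs with every batch of requests.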