Break it on purpose: chaos experiments
Continues from the last build: A remediator service auto-heals the known fd-leak signature, but it only handles one failure you already knew about.
Your remediator from the last rung heals the one failure you already understood, the fd leak.
What you'll build
You walk away with a repeatable chaos experiment harness: a steady-state definition tied to your SLIs, three written hypotheses with blast-radius and abort conditions, an injection runner script, and an experiment log documenting what held, what broke, and the one fix the surprise forced. You will be able to defend "our checkout survives a dead catalog" with evidence, not hope.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
    # Experiment N: <short name>
    hypothesis: checkout SLO holds when <dependency> is <fault>
    blast radius: <which service, how long, what % of traffic>
    abort: <the exact SLI threshold that stops the run>
    method:
      # 1. start baseline load
      k6 run load/baseline.js
      # 2. inject
      <injection command>
      # 3. observe steady-state dashboard for <duration>
      # 4. rollback
      <rollback command>
    result: <held | broke> + the surprise + the action you took
Reading this file
    hypothesis: checkout SLO holds when <dependency> is <fault>
A falsifiable claim, the heart of a chaos experiment.

    abort: <the exact SLI threshold that stops the run>
Declared before the injection so you stop early, not after damage.

    k6 run load/baseline.js
Steady-state traffic must flow or the SLIs are undefined.
Copy this per experiment. The order (hypothesis, blast radius, abort BEFORE method) is the discipline that keeps chaos controlled.
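To make that ordering checkable rather than a convention, here is a minimal sketch of a linter for a filled-in experiment file. It assumes each field sits on its own line in the form shown in the template; the function name `check_template_order` and the `REQUIRED_ORDER` list are hypothetical, not part of the project's starter files.

```python
# Field names taken from the template, in the order the template declares them.
REQUIRED_ORDER = ["hypothesis", "blast radius", "abort", "method", "result"]


def check_template_order(text: str) -> list[str]:
    """Return a list of problems; an empty list means the order is respected."""
    positions: dict[str, int] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        for key in REQUIRED_ORDER:
            # Record only the first occurrence of each field.
            if stripped.startswith(key + ":") and key not in positions:
                positions[key] = lineno

    problems = [f"missing field: {key}" for key in REQUIRED_ORDER if key not in positions]

    # Compare neighbouring fields that are present: each must appear
    # earlier in the file than the one that follows it in the template.
    present = [key for key in REQUIRED_ORDER if key in positions]
    for earlier, later in zip(present, present[1:]):
        if positions[earlier] > positions[later]:
            problems.append(f"'{earlier}' must come before '{later}'")
    return problems
```

Run it over each experiment file before the injection step; a non-empty result means the abort condition was written late or not at all.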
That's 1 of 7 explained code blocks in this single project.
The build, milestone by milestone
- 1
Define steady state and write the experiment template
4 guided steps
Chaos without a steady-state definition is just breaking things. The steady-state metrics are your scoreboard: they tell you whether the system tolerated the fault. Writing abort conditions first is what separates a controlled experiment from an outage you caused.
- 2
Experiment 1: kill catalog and observe
5 guided steps
This is the cheapest, most informative experiment: a hard dependency failure. It directly tests the resilience claim from earlier rungs. If checkout collapses when catalog dies, you found a single point of failure before the flash sale did.
- 3
Experiment 2: inject payments latency with tc netem
5 guided steps
A slow dependency is often more dangerous than a dead one: connections pile up, threads block, and the slowness propagates upstream. This tests whether your timeout budget actually protects the checkout latency SLI under realistic provider degradation.
- 4
Experiment 3: fill the Postgres disk
5 guided steps
Resource exhaustion is the failure mode teams forget. A full disk rarely fails cleanly: writes can fail or stall partway through a request, and the database may stop accepting them altogether. This experiment checks the most important safety property of a payment system: it must never take money without recording the order.
- 5
Fix the surprise and write the experiment log
5 guided steps
Chaos engineering only pays off if you close the loop: a found weakness becomes a fix, and the fix becomes a passing re-run. The log is the artifact you bring to the flash-sale readiness review: dated, reproducible evidence that the checkout path tolerates each fault.
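The milestones above share one loop: inject, watch the steady-state SLI against the abort threshold, roll back, record the outcome. A minimal sketch of that loop follows. It is not the project's reference runner; `run_experiment`, `append_log`, and their parameters are hypothetical names. The inject and rollback steps are passed in as callables so the same loop can wrap whatever command an experiment uses, and the sketch assumes a higher SLI reading is worse (an error rate, for example) and that rollback is safe to call even if injection failed.

```python
import datetime
import json
from typing import Callable, Iterable


def run_experiment(
    name: str,
    inject: Callable[[], None],
    rollback: Callable[[], None],
    sli_samples: Iterable[float],
    abort_threshold: float,
) -> dict:
    """Inject a fault, watch one SLI, and always roll back.

    sli_samples yields one reading per observation interval. The run stops
    at the first reading above abort_threshold.
    """
    observed: list[float] = []
    aborted = False
    try:
        inject()
        for value in sli_samples:
            observed.append(value)
            if value > abort_threshold:
                aborted = True
                break
    finally:
        # Roll back on abort, on normal completion, and on any exception,
        # so the fault never outlives the run.
        rollback()
    return {
        "experiment": name,
        "date": datetime.date.today().isoformat(),
        "abort_threshold": abort_threshold,
        "worst_sli": max(observed) if observed else None,
        "result": "broke" if aborted else "held",
    }


def append_log(path: str, record: dict) -> None:
    """Append one record as a JSON line, leaving earlier runs intact."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Treating "abort fired" as "broke" is a simplification: a real log entry would also carry the surprise and the action taken, as the template's result field asks.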
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.