Automate incident response and safe rollback
Continues from the last build: Rung 11 gave teams a self-service golden path, and Rung 10 showed you the DORA numbers. You can measure failures with DORA, but recovering from one is still a manual, stressful scramble: someone digs for the last good SHA, copies kubectl commands from a wiki, and prays.
It is 02:14 and the canary you promoted at 22:00 has been quietly failing 18 percent of SMS sends.
What you'll build
You finish with a delivery platform that detects a bad release and recovers from it on its own or with one human keystroke. A bad SHA in production triggers an automated or ChatOps rollback to the last known-good image, the rollback verifies service health before declaring success, every action is logged for the postmortem, and a recurring game-day drill proves your MTTR is measured in single-digit minutes rather than tense guesswork at 02:14.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
{
  "prod": {
    "api": "a1b2c3d",
    "worker": "a1b2c3d",
    "verified_at": "2026-06-08T22:00:00Z",
    "verified_by": "canary-analysis"
  },
  "staging": {
    "api": "e4f5a6b",
    "worker": "e4f5a6b",
    "verified_at": "2026-06-09T09:30:00Z",
    "verified_by": "smoke-suite"
  }
}

Reading this file

"api": "a1b2c3d"
The last known-good api image SHA. The rollback script reads this, so recovery never depends on a human remembering the tag at 02:14.

"verified_by": "canary-analysis"
Records why this SHA is trusted: it survived the Rung 8 canary. This provenance is gold in a postmortem.

"verified_at": "2026-06-08T22:00:00Z"
The timestamp lets you compute how long the bad release was live, a direct input to MTTR.

"staging":
Per-environment keys keep prod and staging rollbacks from ever crossing wires, the fat-finger mistake that hurt in the last incident.
The single source of truth for what to roll back TO. Updated only after a release passes canary analysis or the smoke suite, never by hand during an incident.
That's 1 of 8 explained code blocks in this single project.
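To make the role of the known-good file concrete, here is a minimal sketch of how a rollback script might read it. The function names (`rollback_target`, `minutes_since_verified`) are hypothetical and not taken from the project, and the JSON is inlined rather than read from disk to keep the example self-contained.

```python
import json
from datetime import datetime, timezone

# Inlined copy of the known-good file shown above (a real script would read it from disk).
KNOWN_GOOD = json.loads("""
{
  "prod": {"api": "a1b2c3d", "worker": "a1b2c3d",
           "verified_at": "2026-06-08T22:00:00Z", "verified_by": "canary-analysis"},
  "staging": {"api": "e4f5a6b", "worker": "e4f5a6b",
              "verified_at": "2026-06-09T09:30:00Z", "verified_by": "smoke-suite"}
}
""")


def rollback_target(known_good, env, service):
    """Return the SHA to roll back to, refusing unknown environments or services."""
    if env not in known_good:
        raise KeyError(f"unknown environment: {env}")
    entry = known_good[env]
    # Metadata keys such as verified_at are not deployable services.
    if service not in entry or service.startswith("verified_"):
        raise KeyError(f"unknown service in {env}: {service}")
    return entry[service]


def minutes_since_verified(known_good, env, now):
    """Minutes elapsed between the env's verified_at timestamp and `now`."""
    raw = known_good[env]["verified_at"]
    # Older Python versions do not accept a trailing "Z", so normalise it.
    verified = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return (now - verified).total_seconds() / 60


incident_time = datetime(2026, 6, 9, 2, 14, tzinfo=timezone.utc)
print(rollback_target(KNOWN_GOOD, "prod", "api"))
print(minutes_since_verified(KNOWN_GOOD, "prod", incident_time))
```

Because the environment is an explicit argument and unknown keys raise an error, a prod rollback cannot silently pick up a staging SHA.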
The build, milestone by milestone
- 1
Make rollback one idempotent command
4 guided steps
In the last incident, recovery was a scavenger hunt for the right SHA and namespace. Making rollback a single idempotent command removes the two biggest contributors to MTTR: deciding what to roll back to, and typing it correctly under stress.
- 2
Verify health before declaring recovery
4 guided steps
A rollback that flips images but never confirms the app actually recovered is theater. Verification turns rollback into a closed loop and produces the timestamp that ends your MTTR clock honestly.
- 3
Trigger rollback from Slack with ChatOps
4 guided steps
At 02:14 nobody should context-switch to a terminal, find the right kubeconfig, and remember flags. ChatOps puts recovery where the incident conversation already is, with an audit trail baked in.
- 4
Auto-trigger rollback on SLO and canary breach
4 guided steps
SLO and change-failure breaches are exactly the conditions where humans are slowest and most error-prone. Tying them straight to the rollback you already built collapses MTTR toward the time the alert takes to fire.
- 5
Codify on-call handoff and the blameless postmortem
4 guided steps
A recovery you cannot learn from repeats. A blameless template plus an auto-generated stub means the postmortem starts written, the timeline is accurate, and the handoff is unambiguous, so the next on-call inherits context instead of a mystery.
- 6
Prove low MTTR with a scheduled game-day drill
4 guided steps
Untested recovery is a belief, not a capability. A recurring game-day turns rollback into a rehearsed, measured drill, so the first time you roll back in anger is not actually the first time.
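The first two milestones, an idempotent rollback that only counts as done once health is verified, can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's reference solution: `get_current_sha`, `set_image`, and `is_healthy` are hypothetical callables standing in for whatever talks to your cluster and your health endpoint, and the in-memory `state` dictionary plays the role of the cluster.

```python
import time
from dataclasses import dataclass


@dataclass
class RollbackResult:
    changed: bool    # False when the target SHA was already deployed
    recovered: bool  # True only after a health check passed
    checks: int      # number of health checks performed


def rollback(target_sha, get_current_sha, set_image, is_healthy,
             max_checks=5, wait_seconds=0.0, sleep=time.sleep):
    """Roll back to target_sha, then verify health before reporting success."""
    changed = False
    # Idempotence: running this again when already on the target deploys nothing.
    if get_current_sha() != target_sha:
        set_image(target_sha)
        changed = True
    # Closed loop: recovery is declared only after a passing health check.
    for attempt in range(1, max_checks + 1):
        if is_healthy():
            return RollbackResult(changed, True, attempt)
        if attempt < max_checks:
            sleep(wait_seconds)
    return RollbackResult(changed, False, max_checks)


# Hypothetical in-memory stand-in for a cluster, used only for this example.
state = {"sha": "bad9999", "deploys": 0}


def get_current():
    return state["sha"]


def set_image(sha):
    state["sha"] = sha
    state["deploys"] += 1


health_sequence = iter([False, True])  # unhealthy once, then healthy
first = rollback("a1b2c3d", get_current, set_image, lambda: next(health_sequence))
second = rollback("a1b2c3d", get_current, set_image, lambda: True)
print(first)
print(second)
```

Running the command a second time changes nothing and still re-verifies health, which is the property that makes it safe to trigger from ChatOps or an alert without first checking who else already ran it.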
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.