Learn from failure: blameless postmortems
Continues from the last build: you ran the r11 drill as Incident Commander, with severity, roles, status updates, and a clean handoff, but left no written record of what actually happened.
Last week you ran the incident on Carta cleanly: you declared SEV2, took the IC role, posted status updates every 20 minutes, and handed off without dropping the ball.
What you'll build
You walk away with a reusable postmortem template, one full worked postmortem of the r11 payments drill, a backlog of action items with owners and due dates, and concrete edits to SLOs, alert rules, and a runbook. More importantly, you can run a blameless review meeting that produces change instead of blame, and you understand why "root cause" is usually a trap and "contributing factors" is the honest frame.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
# Postmortem: <title>
ID: PM-<date>-<slug>
Status: Draft
Severity: SEV<n>
Authors: <you>

## Summary
One paragraph: what broke, who was affected, how long.

## Timeline (UTC)
| Time | Event | Source |
|------|-------|--------|
| | | |

Detection time: __ min
Mitigation time: __ min

## Analysis
We do not name a single root cause.
Contributing factors:
1. [technical|process|detection] <factor>
   counterfactual: if <X> had <Y>, the incident would have <Z>.

## What went well
-

## Action items
| ID | Action | Owner | Due | Priority | Factor |
|----|--------|-------|-----|----------|--------|
| | | | | | |

## Review record
Date:
Attendees:
Decisions:
Status: Final
Reading this file
- "Status: Draft": The doc starts as Draft and only becomes Final in the review record, so its lifecycle is visible at a glance.
- "| Time | Event | Source |": The timeline table forces a source on every row, which bakes the milestone-1 discipline into the template.
- "We do not name a single root cause.": The blameless frame is pre-written so the author cannot forget it under deadline.
- "| ID | Action | Owner | Due | Priority | Factor |": The action-item columns match the verify checks downstream: ID, single owner, due date, priority, and linked factor.
Copy this per incident. The headings are the checklist: a section left empty is a question you have not answered yet. Keep the 'we do not name a single root cause' line; it sets the frame.
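Because a section left empty is an unanswered question, that check can be automated. The sketch below is a minimal, hypothetical helper (`sections` and `unanswered` are names invented here, not files from the project): it splits a draft on its `## ` headings and reports every template section that is missing, empty, or still identical to the template text. It assumes the draft keeps the template's heading names exactly and ignores the title block above the first `## ` heading.

```python
from __future__ import annotations


def sections(doc: str) -> dict[str, list[str]]:
    """Map each '## ' heading to its non-blank lines (whitespace stripped).

    Lines before the first '## ' heading (the title block) are ignored.
    """
    out: dict[str, list[str]] = {}
    current = None
    for raw in doc.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            out[current] = []
        elif current is not None and line:
            out[current].append(line)
    return out


def unanswered(template: str, draft: str) -> list[str]:
    """Template headings that are missing, empty, or unchanged in the draft."""
    expected = sections(template)
    written = sections(draft)
    return [
        heading
        for heading, template_lines in expected.items()
        if not written.get(heading) or written[heading] == template_lines
    ]


# A cut-down copy of the template, just for this example.
TEMPLATE = """# Postmortem: <title>
## Summary
One paragraph: what broke, who was affected, how long.
## What went well
-
## Action items
| ID | Action | Owner | Due | Priority | Factor |
|----|--------|-------|-----|----------|--------|
"""

draft = TEMPLATE.replace(
    "One paragraph: what broke, who was affected, how long.",
    "The payments stub failed during the r11 drill.",
)
print(unanswered(TEMPLATE, draft))  # ['What went well', 'Action items']
```

Comparing against the template, rather than checking for blank sections, means the table headers and prompt lines the template ships with do not count as answers.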
That's 1 of 8 explained code blocks in this single project.
The build, milestone by milestone
1. Reconstruct the timeline from evidence (4 guided steps)
A postmortem written from memory drifts toward the author's narrative. A timeline anchored to log lines and panel screenshots is evidence, not opinion, and it is the only way to honestly compute detection and mitigation times. Disagreements at the review collapse fast when every claim has a citation.
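A minimal sketch of that computation, under stated assumptions: every timeline row carries a `source`, rows without one are rejected, and both detection and mitigation times are measured from the row tagged `impact_start`. Teams define these intervals differently (some measure mitigation from detection instead), so the `Row` class, its tags, and the `minutes` helper are hypothetical. The timestamps are made up for illustration; only the six-minute gap between the alert firing and its acknowledgement mirrors the drill as described.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Row:
    time: datetime   # UTC, matching the template's "Timeline (UTC)" heading
    event: str
    source: str      # the evidence: a log line, a panel screenshot, a pager entry
    tag: str = ""    # hypothetical marker: "impact_start", "detected", "mitigated"


def check_sources(timeline: list[Row]) -> None:
    """Refuse any row that cites no evidence."""
    unsourced = [row.event for row in timeline if not row.source.strip()]
    if unsourced:
        raise ValueError(f"timeline rows without a source: {unsourced}")


def minutes(timeline: list[Row], start: str, end: str) -> float:
    """Minutes between the rows tagged `start` and `end`."""
    check_sources(timeline)
    tagged = {row.tag: row.time for row in timeline if row.tag}
    return (tagged[end] - tagged[start]).total_seconds() / 60


def utc(hh: int, mm: int) -> datetime:
    # Made-up date and times for illustration only, not the drill's real data.
    return datetime(2026, 6, 17, hh, mm, tzinfo=timezone.utc)


timeline = [
    Row(utc(14, 0), "payments stub starts returning errors", "checkout error log line", "impact_start"),
    Row(utc(14, 3), "payments alert fires", "alert history panel screenshot"),
    Row(utc(14, 9), "on-call acknowledges the alert", "pager log", "detected"),
    Row(utc(14, 30), "checkout errors back to baseline", "checkout dashboard panel screenshot", "mitigated"),
]

print("Detection time:", minutes(timeline, "impact_start", "detected"), "min")    # Detection time: 9.0 min
print("Mitigation time:", minutes(timeline, "impact_start", "mitigated"), "min")  # Mitigation time: 30.0 min
```

Here detection is taken to mean a human acknowledged the problem, which is why the row where the alert merely fires carries no tag.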
2. Write the analysis: contributing factors, not root cause (4 guided steps)
Real incidents rarely have a single cause. The failing payments stub was necessary but not sufficient: the incident became a SEV2 because retries were missing, the alert went unacked for six minutes, and the inventory cache masked the early signal. Naming one 'root cause' hides the other three factors and lets them survive. Counterfactuals turn the analysis into a to-do list.
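The analysis can be kept as structured data and rendered into the template's Analysis section. In this sketch (`Factor` and `render_analysis` are hypothetical names), every factor must carry one of the template's three tags and a counterfactual that starts with "if", so a factor without a counterfactual fails loudly instead of slipping into the doc. The four factors are the ones named above; their tags and counterfactual wording are illustrative guesses, not the drill's actual write-up.

```python
from dataclasses import dataclass

# The three factor tags used by the template's Analysis section.
CATEGORIES = ("technical", "process", "detection")


@dataclass
class Factor:
    category: str
    factor: str
    counterfactual: str  # "if <X> had <Y>, the incident would have <Z>."


def render_analysis(factors: list[Factor]) -> str:
    """Render the Analysis section; reject factors with no usable counterfactual."""
    lines = ["We do not name a single root cause.", "Contributing factors:"]
    for n, f in enumerate(factors, start=1):
        if f.category not in CATEGORIES:
            raise ValueError(f"factor {n}: unknown category {f.category!r}")
        if not f.counterfactual.lower().startswith("if "):
            raise ValueError(f"factor {n}: counterfactual must start with 'if'")
        lines.append(f"{n}. [{f.category}] {f.factor}")
        lines.append(f"   counterfactual: {f.counterfactual}")
    return "\n".join(lines)


# The four factors named in the text; tags and wording are illustrative.
factors = [
    Factor("technical", "payments stub failed",
           "if the payments stub had stayed healthy, the incident would have been avoided."),
    Factor("technical", "checkout payments call had no retries",
           "if the call had retried with backoff, the incident would have had less customer impact."),
    Factor("process", "payments alert went unacked for six minutes",
           "if the alert had been acked when it fired, the incident would have been shorter."),
    Factor("detection", "inventory cache masked the early signal",
           "if the cache had not masked errors, the incident would have been caught earlier."),
]
print(render_analysis(factors))
```

Each counterfactual reads almost directly as an action item, which is the point: the analysis hands milestone 3 its to-do list.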
3. Derive action items with owners and due dates (4 guided steps)
The single most common reason postmortems change nothing is action items with no owner and no date. 'The team will add retries' is a wish; 'AI-1, owner Priya, due 2026-06-24, add retry-with-backoff to checkout payments call' is a commitment you can audit. One owner per item, never a team, because shared ownership is no ownership.
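The owner-and-date rule is mechanical enough to check in code. This is a sketch of the kind of validation the template's action-item columns are built for, not the project's actual verify script: the `AI-<n>` ID pattern, the P1 to P3 priority scale, and the "looks like a team" heuristic are all assumptions.

```python
import re
from dataclasses import dataclass
from datetime import date

PRIORITIES = ("P1", "P2", "P3")  # assumed priority scale


@dataclass
class ActionItem:
    id: str        # e.g. "AI-1"
    action: str
    owner: str     # exactly one person, never a team
    due: str       # ISO date, e.g. "2026-06-24"
    priority: str
    factor: int    # number of the contributing factor this item addresses


def problems(item: ActionItem, factor_count: int) -> list[str]:
    """Return every reason this item is a wish rather than a commitment."""
    errs = []
    if not re.fullmatch(r"AI-\d+", item.id):
        errs.append("id must look like AI-<n>")
    if not item.action.strip():
        errs.append("action is empty")
    owner = item.owner.strip().lower()
    # Heuristic: blank, contains the word "team", or several names joined together.
    if (not owner or "team" in owner.split() or "," in owner
            or " and " in owner or "&" in owner):
        errs.append("owner must be exactly one named person")
    try:
        date.fromisoformat(item.due)
    except ValueError:
        errs.append("due must be an ISO date (YYYY-MM-DD)")
    if item.priority not in PRIORITIES:
        errs.append(f"priority must be one of {PRIORITIES}")
    if not 1 <= item.factor <= factor_count:
        errs.append("must link to an existing contributing factor")
    return errs


# Priority and factor number are assumed values for this example.
commitment = ActionItem("AI-1", "add retry-with-backoff to checkout payments call",
                        "Priya", "2026-06-24", "P1", 2)
wish = ActionItem("AI-2", "add retries", "The team", "next sprint", "P1", 2)
print(problems(commitment, factor_count=4))  # []
print(problems(wish, factor_count=4))
```

The second item fails on exactly the two points the text calls out: a team as owner and no real date.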
4. Wire lessons back into SLOs, alerts, and runbooks (4 guided steps)
r10 gave you SLOs and alerts, r11 gave you the incident command system, and this rung proves they are living things. If the postmortem does not change an alert, a runbook, or an SLO, the same incident runs the same way next time. The edit is small; the discipline of always making it is the skill.
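One way to make "always make the edit" enforceable is to tag every change a postmortem produces with the artifact it touches, and to treat the postmortem as unfinished until at least one SLO, alert, or runbook edit exists. The sketch assumes hypothetical `Change` records, action IDs, and file paths; only the rule itself comes from the text above.

```python
from dataclasses import dataclass

# Artifact kinds that count as wiring a lesson back in. Anything else, such as
# an application code change, is useful but does not close the loop on its own.
FEEDBACK_TARGETS = ("slo", "alert", "runbook")


@dataclass
class Change:
    action_id: str  # the action item that produced this change
    target: str     # "slo", "alert", "runbook", "code", ...
    path: str       # hypothetical location of the edited file


def loop_report(changes: list[Change]) -> dict[str, list[str]]:
    """Group edited paths by feedback target; an empty list is an untouched target."""
    report: dict[str, list[str]] = {t: [] for t in FEEDBACK_TARGETS}
    for c in changes:
        if c.target in report:
            report[c.target].append(c.path)
    return report


def closes_the_loop(changes: list[Change]) -> bool:
    """True if the postmortem changes at least one SLO, alert, or runbook."""
    return any(paths for paths in loop_report(changes).values())


changes = [
    Change("AI-1", "code", "checkout/payments_client.py"),       # hypothetical path
    Change("AI-2", "alert", "alerts/payments.yaml"),              # hypothetical path
    Change("AI-3", "runbook", "runbooks/payments-stub-down.md"),  # hypothetical path
]
print(closes_the_loop(changes))  # True
print(loop_report(changes))
# {'slo': [], 'alert': ['alerts/payments.yaml'], 'runbook': ['runbooks/payments-stub-down.md']}
```

The report's empty lists are worth reading aloud at the review: an untouched SLO may be correct, but it should be a decision rather than an oversight.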
5. Run the blameless review meeting (4 guided steps)
A postmortem reviewed by its author alone is a diary. The review is where other people stress-test the timeline, where owners accept their items out loud, and where the org decides what 'done' means. Blamelessness is a property you enforce live in the meeting: redirect any sentence that starts blaming a person back to the system.
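The template's review record (Date, Attendees, Decisions, Status) can double as the gate that flips the doc from Draft to Final. In this sketch, which is one assumed way to encode the rules above rather than the project's reference solution, the doc stays Draft until someone other than the authors has reviewed it, decisions (including what "done" means) are recorded, and every action item's owner has accepted it out loud. `ReviewRecord` and `finalize` are hypothetical names, and the date and decision text in the example are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewRecord:
    date: str = ""
    attendees: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    accepted: set[str] = field(default_factory=set)  # action IDs whose owner accepted out loud


def finalize(status: str, authors: list[str], action_ids: list[str],
             review: ReviewRecord) -> tuple[str, list[str]]:
    """Return ("Final", []) once the review is complete, else the old status and blockers."""
    blockers = []
    if not review.date:
        blockers.append("no review date")
    if not set(review.attendees) - set(authors):
        blockers.append("reviewed by its authors alone")
    if not review.decisions:
        blockers.append("no decisions recorded, including what 'done' means")
    unaccepted = [a for a in action_ids if a not in review.accepted]
    if unaccepted:
        blockers.append(f"owners have not accepted: {unaccepted}")
    return ("Final", []) if not blockers else (status, blockers)


review = ReviewRecord(
    date="2026-06-17",                                            # hypothetical
    attendees=["you", "Priya"],
    decisions=["AI-1 is done when retries ship to production"],  # hypothetical
    accepted={"AI-1"},
)
print(finalize("Draft", ["you"], ["AI-1"], review))  # ('Final', [])

diary = ReviewRecord(date="2026-06-17", attendees=["you"])
print(finalize("Draft", ["you"], ["AI-1"], diary))  # stays 'Draft' with three blockers
```

Keeping the gate in data means the "diary" case is rejected by rule, not by someone having to point it out in the meeting.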
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.