The difference is the response, not the outage
Two teams have the same outage. The first descends into chaos: nobody knows who's in charge, three people make conflicting changes, leadership floods the channel asking for updates, the on-call engineer is alone and overwhelmed, and the customer comms are silence followed by a defensive apology. The second runs it like a drill: one clear leader, calm comms, a known process, and it's resolved in a third of the time with a fraction of the stress.
Same failure, two completely different outcomes. The difference isn't the technology; it's whether the team has a practised incident process and a humane on-call culture. This article covers both: how to run an incident without chaos, and how to build a rotation that doesn't burn people out. The engineers you have are the ones who'll be there at 3am, and you want them to last.
Who this is for
Anyone who's on-call, about to be, or building a team that will be. No SRE title required: this is the process and culture, in plain terms.
Severity levels: match the response to the impact
Not every alert is an emergency, and treating them all as one is how you burn out a team. Severity is how you decide how hard to pull the fire alarm.
A severity level classifies an incident by user impact, which drives how many people you pull in and how fast. The exact scale varies, but the shape is universal:
| Severity | Impact | Response |
|---|---|---|
| SEV1 | Critical: major outage, data loss, security breach | All hands, incident commander, wake people up |
| SEV2 | Significant: key feature down, severe degradation | On-call + relevant team, urgent but daytime-ok |
| SEV3 | Minor: limited impact, workaround exists | Handle in normal hours, ticket it |
Agreeing on these *before* an incident is what prevents the two failure modes: panicking over a minor issue, or under-reacting to a real one. Severity is a shared language that lets everyone instantly understand how serious "we have an incident" actually is.
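One way to make the scale unambiguous is to write it down as code that anyone can read before an incident. The sketch below is only an illustration: the `Impact` fields and the `classify` function are hypothetical names chosen for this example, and the rules simply mirror the table above rather than any standard.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # critical: major outage, data loss, security breach
    SEV2 = 2  # significant: key feature down, severe degradation
    SEV3 = 3  # minor: limited impact, workaround exists


@dataclass(frozen=True)
class Impact:
    # Hypothetical impact signals; use whatever your team can judge quickly.
    major_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    key_feature_down: bool = False
    severe_degradation: bool = False


def classify(impact: Impact) -> Severity:
    """Map observed user impact to a severity, checking the worst case first."""
    if impact.major_outage or impact.data_loss or impact.security_breach:
        return Severity.SEV1
    if impact.key_feature_down or impact.severe_degradation:
        return Severity.SEV2
    return Severity.SEV3


# The agreed response for each level, copied from the table.
RESPONSE = {
    Severity.SEV1: "All hands, incident commander, wake people up",
    Severity.SEV2: "On-call + relevant team, urgent but daytime-ok",
    Severity.SEV3: "Handle in normal hours, ticket it",
}
```

Checking the most severe conditions first means an incident that matches several rows always gets the stronger response, which is usually the safer default.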
The incident commander: one brain, not five
The single biggest upgrade to chaotic incident response is the incident commander (IC) role. During a serious incident, one person is in charge, and crucially, the IC's job is to *coordinate*, not to fix. They hold the big picture, decide what to try, delegate the hands-on work, and shield the responders from distraction so they can think.
- Incident Commander: owns the response, makes decisions, delegates. Does *not* type fixes themselves.
- Operations / responders: the people actually investigating and making changes, one set of hands at a time.
- Communications lead: handles updates to stakeholders and customers, so responders aren't interrupted.
- Scribe: keeps a timestamped log of what happened and what was tried (gold for the postmortem).
The IC is a hat, not a title
The incident commander is whoever takes the role for *this* incident, often the on-call engineer, not the most senior person in the room. Seniority doesn't command incidents; the IC does. This separation stops the classic failure where five smart people make five conflicting changes and nobody knows the current state of the system.
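The two rules that matter most here (exactly one IC, and the IC does not double as a responder) are simple enough to express as a small data structure. This is a minimal sketch, not a tool the article prescribes: `IncidentRoles`, its methods, and the example names are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    INCIDENT_COMMANDER = "incident commander"
    RESPONDER = "responder"
    COMMS_LEAD = "communications lead"
    SCRIBE = "scribe"


class IncidentRoles:
    """Tracks who wears which hat during a single incident."""

    def __init__(self) -> None:
        self.commander: Optional[str] = None
        self.others: dict = {}  # person -> Role, for the non-IC hats

    def take_command(self, person: str) -> None:
        # Exactly one coordinator: a second claim must be an explicit handover.
        if self.commander is not None:
            raise ValueError(f"{self.commander} is already IC; use hand_over()")
        if self.others.get(person) is Role.RESPONDER:
            raise ValueError("a responder cannot also be IC; the IC does not type fixes")
        self.commander = person

    def hand_over(self, new_commander: str) -> None:
        # Handover is deliberate; on failure the previous IC stays in charge.
        previous, self.commander = self.commander, None
        try:
            self.take_command(new_commander)
        except ValueError:
            self.commander = previous
            raise

    def assign(self, person: str, role: Role) -> None:
        if role is Role.INCIDENT_COMMANDER:
            raise ValueError("use take_command() or hand_over() for the IC hat")
        if person == self.commander and role is Role.RESPONDER:
            raise ValueError("the IC coordinates and delegates; pick another responder")
        self.others[person] = role


roles = IncidentRoles()
roles.take_command("ana")            # the on-call engineer takes the hat
roles.assign("ben", Role.RESPONDER)  # hands-on work is delegated
roles.assign("chi", Role.SCRIBE)
```

Note that nothing in the structure refers to seniority: whoever calls `take_command` first is the IC until they hand over.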
The incident lifecycle, step by step
A well-run incident follows the same arc every time. Knowing the steps means nobody has to invent the process at 3am:
1. Detect. An alert fires (or a customer reports it). Good observability means you detect it before your users tweet about it.
2. Declare & triage. Call it an incident, assign a severity, and open a dedicated channel. Declaring early is free; under-declaring costs you the response.
3. Assign the IC. Someone takes the incident commander hat and announces it, so everyone knows who's coordinating.
4. Communicate. Post a status update on a regular cadence, even "still investigating, next update in 15 min." Silence is what makes stakeholders panic and pile on.
5. Mitigate, then fix. Stop the bleeding first: roll back, fail over, or disable the feature. Restore service now; find the elegant root-cause fix later.
6. Resolve & verify. Confirm with data (not hope) that users are actually recovered, then formally close the incident.
7. Postmortem. Within a few days, while it's fresh: write up what happened, why, and what changes will stop a repeat. Blamelessly (see below).
Pro tip
Mitigate before you diagnose. The instinct to find the root cause first is natural and wrong: every minute spent understanding *why* is a minute users are still down. Roll back or fail over to restore service, *then* investigate at your leisure. Recovery time is the metric users feel.
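The seven steps above can be sketched as a small state machine that refuses to skip a step and keeps the timestamped log a scribe would otherwise write by hand. Everything here is an assumption made for illustration: the `Stage` names, the `Incident` class, and the example notes. Communication is really ongoing, so the `COMMUNICATING` stage only marks the point where the first update went out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    DETECTED = 1
    DECLARED = 2
    IC_ASSIGNED = 3
    COMMUNICATING = 4  # first status update posted; updates continue after this
    MITIGATED = 5
    RESOLVED = 6
    POSTMORTEM_DONE = 7


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.DETECTED
    timeline: list = field(default_factory=list)  # (timestamp, stage, note)

    def advance(self, to: Stage, note: str, now: Optional[datetime] = None) -> None:
        """Move to the next stage only, and record when and why."""
        if to != self.stage + 1:
            raise ValueError(f"cannot go from {self.stage.name} to {to.name}")
        self.stage = to
        self.timeline.append((now or datetime.now(timezone.utc), to, note))


incident = Incident("checkout errors")
incident.advance(Stage.DECLARED, "SEV2 declared, dedicated channel opened")
incident.advance(Stage.IC_ASSIGNED, "on-call engineer took the IC hat")
incident.advance(Stage.COMMUNICATING, "still investigating, next update in 15 min")
incident.advance(Stage.MITIGATED, "rolled back the last deploy")
```

Because `MITIGATED` comes before `RESOLVED` and `POSTMORTEM_DONE`, the ordering itself encodes "mitigate before you diagnose": root-cause work belongs to the later stages.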
Runbooks: don't improvise the known stuff
A runbook is a written, step-by-step guide for handling a specific known scenario: "the queue is backing up," "the database is failing over," "how to roll back a deploy." The value is brutal and simple: at 3am, with adrenaline high and judgment impaired, you do not want to be reasoning from scratch. You want a checklist a tired human can follow correctly.
Pro tip
Write the runbook the day after the incident, while the steps are fresh. The best runbooks come straight out of postmortems: every incident you survive should leave behind a runbook so the next person handles it in five minutes instead of an hour.
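A runbook does not need special tooling; even a structured checklist helps. The sketch below assumes a hypothetical `Runbook` made of ordered steps, each pairing an instruction with something to verify before moving on. The rollback steps are invented placeholders, not commands for any real deploy system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    instruction: str
    verify: str  # what the responder should observe before moving on
    done: bool = False


@dataclass
class Runbook:
    scenario: str
    steps: list = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the first step not yet completed, or None when finished."""
        return next((step for step in self.steps if not step.done), None)

    def complete_next(self) -> None:
        """Mark the current step done; steps are always taken in order."""
        step = self.next_step()
        if step is None:
            raise ValueError("runbook already complete")
        step.done = True


# Illustrative content only: adapt the steps to your own system.
rollback = Runbook(
    scenario="how to roll back a deploy",
    steps=[
        Step("Announce the rollback in the incident channel", "IC has acknowledged"),
        Step("Redeploy the last known-good release", "deploy reports success"),
        Step("Watch the error rate for 10 minutes", "error rate back to baseline"),
    ],
)
```

Pairing each instruction with a `verify` note is a deliberate choice: a tired responder gets an explicit signal for "this step worked" instead of having to judge it from memory.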
Blameless postmortems: fix systems, not people
After a significant incident, you write a postmortem: a document of what happened, the timeline, the impact, the root cause, and concrete action items to prevent a recurrence. The single most important word is blameless. The goal is to fix the *system* that allowed the failure, not to find a person to blame.
If a single human error could take down your system, the problem is the system, not the human. Blame drives mistakes underground; blamelessness brings them into the light where you can fix them.
This isn't soft; it's strategic. The moment people fear blame, they stop reporting near-misses, they hide mistakes, and you lose the information you need to get safer. A blameless culture treats every incident as a free lesson the system paid for, and asks "what about our process let this happen?" rather than "whose fault was it?" The action items that come out of it must be real, owned, and tracked; a postmortem with no follow-through is theatre.
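"Real, owned, and tracked" can be checked mechanically. As a rough sketch, assuming action items are recorded with a hypothetical `ActionItem` structure, a small function can list the open items that have no owner, no due date, or a missed due date. The field names and example items are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def follow_through_gaps(items: list, today: date) -> list:
    """List what is wrong with each open action item, in input order."""
    problems = []
    for item in items:
        if item.done:
            continue  # finished items need no follow-up
        if not item.owner:
            problems.append(f"unowned: {item.description}")
        if item.due is None:
            problems.append(f"no due date: {item.description}")
        elif item.due < today:
            problems.append(f"overdue: {item.description}")
    return problems


items = [
    ActionItem("Add rollback step to deploy runbook", owner="ana", due=date(2024, 5, 1)),
    ActionItem("Alert on queue depth", owner=None, due=date(2024, 6, 1)),
    ActionItem("Load-test failover", owner="ben", due=None),
    ActionItem("Fix flaky health check", owner="chi", due=date(2024, 4, 1), done=True),
]
gaps = follow_through_gaps(items, today=date(2024, 5, 15))
```

The owner here is whoever will do the follow-up work, which is a different question from who was involved in the incident; the check stays blameless because it only looks at the state of the action items.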
On-call that doesn't grind people down
All the process in the world fails if the humans running it are exhausted and resentful. On-call is sustainable only when it's deliberately designed to be. The two enemies are bad rotations and bad alerts, and both are fixable.
- Humane rotations: enough people that nobody is on-call too often; a secondary for backup; compensation or time-off-in-lieu for disrupted nights; and the ability to swap shifts easily.
- Alert hygiene: every page must be *actionable* and tied to *user impact*. If an alert doesn't require a human to do something right now, it shouldn't page; make it a ticket or a dashboard.
- Kill the noise: relentlessly tune or delete flaky, non-actionable alerts. Alert fatigue is when people start ignoring the pager, and that's how a real SEV1 gets missed.
- Protect recovery: someone paged repeatedly overnight should not be expected at standup. Sleep is a system requirement, not a perk.
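To see whether a rotation is humane, it helps to lay it out and count the load. The sketch below assumes weekly shifts and a plain round-robin, with next week's primary acting as this week's secondary; both are assumptions for the example, not a recommendation from the article, and the names are placeholders.

```python
from collections import Counter


def build_rotation(engineers: list, weeks: int) -> list:
    """Return one (primary, secondary) pair per week, in round-robin order."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumption: the secondary is whoever is primary next week.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((primary, secondary))
    return schedule


def primary_load(schedule: list) -> Counter:
    """Count how many weeks each engineer spends as primary."""
    return Counter(primary for primary, _ in schedule)


schedule = build_rotation(["ana", "ben", "chi", "dev"], weeks=8)
load = primary_load(schedule)
```

With four people over eight weeks, each engineer is primary for two weeks. Shrinking the list makes the cost of a small rotation visible immediately: with two people, everyone is primary every other week and on duty in some role every week.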
Alert fatigue is a reliability risk, not just a morale one
When the pager cries wolf fifty times a week, the human learns to ignore it, and then sleeps through the one page that mattered. Noisy alerting doesn't just burn people out; it actively makes your system less reliable. Treat every non-actionable page as a bug to be fixed.
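Treating non-actionable pages as bugs starts with finding them. Assuming you can export a log of pages and mark each one as actionable or not, a short script can rank the worst offenders. The `Page` record, the thresholds, and the alert names below are hypothetical values picked for the example.

```python
from collections import Counter
from typing import NamedTuple


class Page(NamedTuple):
    alert: str
    actionable: bool  # did a human need to do something right now?


def noisy_alerts(pages: list, min_pages: int = 5, max_actionable_ratio: float = 0.5) -> list:
    """Return (alert, page_count, actionable_ratio) for alerts that page often
    but rarely need action, noisiest first."""
    total = Counter(page.alert for page in pages)
    actionable = Counter(page.alert for page in pages if page.actionable)
    flagged = [
        (name, count, actionable[name] / count)
        for name, count in total.items()
        if count >= min_pages and actionable[name] / count <= max_actionable_ratio
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


pages = (
    [Page("disk-usage-warning", False)] * 6
    + [Page("checkout-error-rate", True)] * 5
    + [Page("cert-expiry", False)] * 2
)
flagged = noisy_alerts(pages)
```

In this sample only `disk-usage-warning` is flagged: it paged six times and never needed action. `cert-expiry` is also non-actionable but falls under the `min_pages` threshold, so lowering that threshold is one way to make the audit stricter.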
Common mistakes that cost hours
- No incident commander. Five people make conflicting changes and nobody knows the system's current state. Name one coordinator, immediately.
- Diagnosing before mitigating. Hunting the root cause while users are down. Stop the bleeding first (roll back or fail over), then investigate.
- Going silent. Stakeholders with no updates assume the worst and flood the channel. Communicate on a cadence, even with "no news yet."
- Blameful postmortems. Naming a culprit makes everyone hide their mistakes, and you lose the lessons. Fix the system, not the person.
- Noisy, non-actionable alerts. They burn out on-call and train people to ignore the pager, so the real incident gets missed. Every page must be actionable.
Where to go next
The whole article in 6 lines
- Outages are inevitable; chaos isn't. A **practised process** is the difference between a 20-minute and a 2-hour incident.
- **Severity levels** match the response to the impact, so you neither panic over minor issues nor under-react to major ones.
- The **incident commander** coordinates and delegates: one brain in charge, not five people typing conflicting fixes.
- **Mitigate before you diagnose**: restore service first, find the elegant root cause later.
- **Blameless postmortems** fix the system, not the person; blame just drives mistakes into hiding.
- Sustainable on-call needs **humane rotations** and ruthless **alert hygiene**: every page must be actionable.
Incident response is the human side of reliability. It works best sitting on a foundation of good observability and resilient design:
- You can't respond to what you can't see, and good signals reduce alert fatigue: Observability: The Three Pillars.
- To have fewer incidents in the first place, design systems that contain failure: Reliability & Resilience: Designing for Failure.
- See incident culture at the scale that pioneered much of it: How Netflix Built Its Streaming Pipeline.
Write one runbook and audit your noisiest alert this week. Your future 3am self, and your whole rotation, will thank you.