Getting paged is the easy part. Running the incident (declaring severity, assigning a commander, and keeping customers informed) is what separates a 20-minute blip from a chaotic afternoon. Here's the playbook.
Checkout starts throwing 500s at 2:14pm. The alert fires, and within minutes five engineers have piled into a Slack channel. One is tailing logs. One is rolling back a deploy. One is staring at the database dashboard. Two are reading the other three's messages and guessing. Everyone is *working*. Nobody is *coordinating*.
Forty minutes pass. The deploy rollback finishes, but the errors don't stop, because the real cause was a database connection pool exhausted by a separate batch job nobody mentioned. Meanwhile, the support queue is on fire, the status page still says "All systems operational," and the VP of Sales just learned about the outage from an angry customer on LinkedIn.
The technical problem here was small. The coordination problem cost the company an hour and its credibility. This article is about the other half of incident response: not how you get paged (that's the on-call rotation), but how you actually *run* the thing once you're in it.
Who this is for
Engineers who are on call but have never been the person in charge of an incident; junior SREs; and team leads who want a repeatable way to handle outages instead of improvising every time. No prior incident-command experience is assumed.
Coordination beats heroics
In a serious incident, the bottleneck is almost never raw debugging skill. It's coordination: who's deciding, who's fixing, who's talking to the outside world, and making sure those are different people.
The instinct under pressure is to *do more*: more people, more dashboards, more parallel theories. But an incident without structure gets worse, not better, as you add people: every new responder needs context, and without someone owning the room, that context never gets shared. The fix is borrowed from emergency services, where chaos is the default and structure is survival.
| Emergency room | Incident response |
| --- | --- |
| Triage nurse sorts arrivals by how critical they are | Severity level (SEV1/2/3) sets the response before anyone touches the keyboard |
| Charge nurse updates the family in the waiting room | Comms Lead writes status updates for customers and stakeholders |
| A chart records every drug, dose, and timestamp | Scribe captures the timeline as it happens |
| Specialists are paged in for the heart, the brain, the bones | SMEs (subject-matter experts) are pulled in for the database, the network, the payment provider |
An incident runs like a hospital emergency room under the Incident Command System used by fire and rescue.
What running an incident looks like
Here's the whole lifecycle in one picture. The key insight is the fork in the middle: once roles are assigned, investigation and communication run *in parallel*. The people fixing the system are not the people talking to the world.
An incident flows from a single alert to a declared incident, forks into fix-and-communicate, then converges on resolution and a postmortem.
1. Acknowledge and assess
   The on-call engineer acks the page and takes a first look. Is this real? What's the blast radius: one customer, one region, or everyone?
2. Declare an incident and set severity
   If it's user-impacting, say the words: "I'm declaring a SEV2." Declaring is a deliberate act: it opens a channel, starts the clock, and pulls in process. Don't wait to be sure; you can downgrade later.
3. Become (or hand off) the Incident Commander
   Whoever declared is IC by default. If you'd rather be heads-down debugging, explicitly hand the IC role to someone else ("Priya, can you take IC?") and get a yes.
4. Assign Comms and Scribe
   The IC names a Comms Lead and a Scribe out loud. For a SEV1 these are separate humans; for a SEV3 the IC may wear all three hats.
5. Run two tracks in parallel
   SMEs investigate and fix. Comms posts updates on the cadence the severity demands. The IC bridges the two, pulling status from the fix track and feeding it to Comms.
6. Resolve, then hand off to a postmortem
   Once the service is healthy, declare the incident resolved, post a final update, and schedule the [blameless postmortem](/blog/blameless-postmortems). The Scribe's timeline becomes its backbone.
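One way to picture the six steps is as a small state machine that only moves forward. The sketch below is illustrative, not a prescribed tool: the `Phase` names and the `Incident` class are hypothetical labels invented here to mirror the steps, and the postmortem is split out as its own final phase to make the hand-off explicit.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names that loosely follow the six steps above.
    ACKNOWLEDGED = 1
    DECLARED = 2
    COMMANDER_SET = 3
    ROLES_ASSIGNED = 4
    TRACKS_RUNNING = 5        # fix and comms tracks run in parallel here
    RESOLVED = 6
    POSTMORTEM_SCHEDULED = 7


@dataclass
class Incident:
    summary: str
    phase: Phase = Phase.ACKNOWLEDGED
    history: list = field(default_factory=list)

    def advance(self) -> Phase:
        """Move to the next phase in order; phases cannot be skipped."""
        if self.phase is Phase.POSTMORTEM_SCHEDULED:
            raise ValueError("incident lifecycle is already complete")
        self.history.append(self.phase)
        self.phase = Phase(self.phase.value + 1)
        return self.phase

    @property
    def is_closed(self) -> bool:
        # Resolution alone does not close the loop; scheduling the postmortem does.
        return self.phase is Phase.POSTMORTEM_SCHEDULED
```

The design choice worth noting is that `RESOLVED` is not the terminal state: an incident in this sketch only counts as closed once the postmortem is scheduled.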
Severity levels: the dial that sets everything
Severity is the first decision and the most important one, because it sets the response: who gets pulled in, how often you communicate, and whether anyone's weekend gets interrupted. Get it wrong and you either over-react to a minor blip or sleepwalk through a real outage. Most teams use three levels.
| Level | Impact | Response | Comms cadence |
| --- | --- | --- | --- |
| SEV1 | Critical: core feature down for most users, data loss, or revenue stopped | All hands, page leadership, dedicated IC + Comms + Scribe | Every 15–30 min, public status page |
| SEV2 | Major: significant degradation or one feature broken for many users | On-call + relevant SMEs, IC assigned | Every 30–60 min, status page + stakeholders |
| SEV3 | Minor: limited impact, a workaround exists, no urgent revenue risk | On-call handles it, IC role often combined | At start and resolution; internal only |
A typical SEV scale. Tune the exact thresholds to your product, but keep the shape.
When in doubt, declare higher
It's cheap to downgrade a SEV1 to a SEV2 twenty minutes in. It's expensive to discover at minute 40 that the "minor glitch" was a SEV1 the whole time. Round up, then walk it back.
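To make the "round up" rule concrete, here is a minimal sketch of a severity classifier. Everything in it is an assumption for illustration: the `Impact` fields, the 50% and 10% thresholds, and the `unsure` flag are invented here and would need tuning to your product. Only the three levels and the round-up behavior come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe, matching SEV1 / SEV2 / SEV3.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Impact:
    # Hypothetical inputs; the thresholds used below are illustrative only.
    users_affected_fraction: float  # 0.0 to 1.0
    core_feature_down: bool = False
    data_loss: bool = False
    revenue_stopped: bool = False
    workaround_exists: bool = False


def classify(impact: Impact, unsure: bool = False) -> Severity:
    """Pick a severity, rounding up one level when the responder is unsure."""
    most_users = impact.users_affected_fraction > 0.5   # assumed meaning of "most"
    many_users = impact.users_affected_fraction >= 0.1  # assumed meaning of "many"

    if impact.data_loss or impact.revenue_stopped or (
        impact.core_feature_down and most_users
    ):
        sev = Severity.SEV1
    elif many_users and not impact.workaround_exists:
        sev = Severity.SEV2
    else:
        sev = Severity.SEV3

    # "When in doubt, declare higher": bump one level, but never past SEV1.
    if unsure and sev > Severity.SEV1:
        sev = Severity(sev - 1)
    return sev
```

With these assumed thresholds, the checkout scenario (about 30% of users failing, no workaround) lands on SEV2, and the same inputs with `unsure=True` are declared as SEV1 and can be walked back later.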
Templates that make comms automatic
Under stress, blank pages are the enemy. Two templates remove almost all the friction: one to spin up the incident channel, and one for every status update. Paste, fill, send.
incident-channel-template.md
🚨 INCIDENT DECLARED, #inc-2026-06-06-checkout
Severity: SEV2
Started: 14:14 UTC
Summary: Checkout returning 500s for ~30% of users
ROLES
Incident Commander: @alex
Comms Lead: @priya
Scribe: @sam
SMEs: @dana (payments), @lee (database)
LINKS
Status page: https://status.example.com/incidents/4821
Dashboard: https://grafana.example.com/d/checkout
Runbook: https://wiki.example.com/runbooks/checkout-5xx
Post updates in this thread. Keep side-chatter out.
The Scribe owns the timeline.
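If you would rather generate the declaration than paste it, a small helper can fill in the header and roles. This is a rough sketch under stated assumptions: the `#inc-YYYY-MM-DD-service` naming pattern is inferred from the example channel name in the template, and the function names and the rule that three roles must be named are hypothetical choices made here.

```python
import re
from datetime import datetime, timezone

# Roles this sketch insists on before a declaration is posted (an assumption).
REQUIRED_ROLES = ("Incident Commander", "Comms Lead", "Scribe")


def channel_name(started: datetime, service: str) -> str:
    """Build a name like #inc-2026-06-06-checkout from a start time and service."""
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    return f"#inc-{started:%Y-%m-%d}-{slug}"


def render_declaration(severity: str, started: datetime, service: str,
                       summary: str, roles: dict) -> str:
    """Render the header and ROLES part of the channel template."""
    missing = [r for r in REQUIRED_ROLES if not roles.get(r)]
    if missing:
        raise ValueError(f"unowned roles: {', '.join(missing)}")
    lines = [
        f"INCIDENT DECLARED, {channel_name(started, service)}",
        f"Severity: {severity}",
        f"Started: {started:%H:%M} UTC",
        f"Summary: {summary}",
        "ROLES",
    ]
    lines += [f"{role}: {handle}" for role, handle in roles.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_declaration(
        "SEV2",
        datetime(2026, 6, 6, 14, 14, tzinfo=timezone.utc),
        "checkout",
        "Checkout returning 500s for ~30% of users",
        {"Incident Commander": "@alex", "Comms Lead": "@priya", "Scribe": "@sam"},
    ))
```

Refusing to render when a role is unowned is deliberate: it forces the "who is IC?" question at declaration time rather than twenty minutes in.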
Then there's the status update: the thing the Comms Lead posts on cadence. Keep it boring and predictable; people in a panic want the same fields in the same order every time.
status-update-template.md
**[SEV2] Checkout errors, UPDATE 14:32 UTC**
Status: Investigating
Impact: ~30% of checkout attempts failing with a 500 error.
Browsing, cart, and account features are unaffected.
What we know:
Errors began at 14:14 after a batch job exhausted the
DB connection pool. We're isolating the job now.
Next step: Throttle the batch job and add pool headroom.
Next update: by 15:00 UTC (or sooner if status changes).
"Next update by" is the line that matters most
Customers and execs don't panic because something is broken; they panic because they don't know if anyone's on it. Always promise a time for the next update, even if the update is just "still working on it." Silence reads as abandonment.
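The "next update by" line is easy to compute rather than guess. The sketch below is one possible approach, not an established tool: it assumes the upper bound of each cadence range from the severity table (30 minutes for SEV1, 60 for SEV2) is the latest acceptable gap, and the function names are hypothetical. SEV3 is left out because the table gives it no periodic cadence.

```python
from datetime import datetime, timedelta, timezone

# Longest allowed gap between updates, using the upper bound of the cadence
# ranges in the severity table. Treating the upper bound as the limit is an
# assumption; a team may prefer to promise the lower bound instead.
MAX_UPDATE_GAP = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(minutes=60),
}


def next_update_by(severity: str, now: datetime) -> datetime:
    """Return the latest time the next update may be posted."""
    gap = MAX_UPDATE_GAP.get(severity)
    if gap is None:
        raise ValueError(f"no periodic cadence defined for {severity}")
    return now + gap


def render_status_update(severity: str, title: str, status: str, impact: str,
                         known: str, next_step: str, now: datetime) -> str:
    """Render the update with the same fields, in the same order, every time."""
    deadline = next_update_by(severity, now)
    return "\n".join([
        f"**[{severity}] {title}, UPDATE {now:%H:%M} UTC**",
        f"Status: {status}",
        f"Impact: {impact}",
        f"What we know: {known}",
        f"Next step: {next_step}",
        f"Next update: by {deadline:%H:%M} UTC (or sooner if status changes).",
    ])
```

Because the deadline is derived inside `render_status_update`, an update cannot be produced without a promised next time, which is the property the callout above asks for.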
The four roles, in plain terms
Incident Commander (IC)
The IC owns the incident, not the fix. They decide severity, assign roles, keep the timeline moving, and make the call when there's disagreement. Crucially, the IC does not debug: the moment they go heads-down in the logs, nobody is steering. The IC asks questions ("What have we ruled out? What's the riskiest thing we could try?") and removes blockers.
Comms Lead
Owns everything that leaves the room: the status page, customer-facing updates, and the running summary for leadership. They translate engineer-speak ("the pool's exhausted") into human-speak ("some checkouts are failing; here's what we're doing"). This frees the IC and SMEs to focus on the problem instead of fielding "any update?" pings.
Scribe
Captures the timeline as it happens: when the alert fired, when each action was taken, what was observed. This is invaluable for two reasons: it stops the team repeating dead-end theories, and it hands the postmortem a factual record instead of fuzzy memory. A good Scribe timestamps everything.
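A Scribe's timeline can be as simple as an append-only list of timestamped entries. The following is a minimal sketch, assuming hypothetical entry kinds ("alert", "action", "observation", "ruled_out") that are not defined anywhere in this article; in practice a shared document or chat thread does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    author: str
    kind: str   # hypothetical categories: "alert", "action", "observation", "ruled_out"
    note: str


@dataclass
class Timeline:
    entries: list = field(default_factory=list)

    def log(self, author: str, kind: str, note: str,
            at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, kind, note)
        self.entries.append(entry)
        return entry

    def ruled_out(self) -> list:
        """Theories already eliminated, so nobody chases them twice."""
        return [e.note for e in self.entries if e.kind == "ruled_out"]

    def as_text(self) -> str:
        """Chronological record suitable for pasting into a postmortem."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} UTC [{e.kind}] {e.author}: {e.note}" for e in ordered
        )
```

The `ruled_out` helper serves the first reason given above (no repeated dead ends), and `as_text` serves the second (a factual record for the postmortem).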
Subject-Matter Expert (SME)
The people who actually fix things: the database owner, the payments specialist, the network engineer. SMEs are pulled in by the IC as the investigation narrows. They report findings up to the IC rather than broadcasting half-theories to the whole channel.
Small incident? One person, many hats.
Roles are responsibilities, not headcount. A SEV3 at 3am might have one engineer being IC, Comms, and Scribe all at once. The point is that each responsibility is consciously owned, not that you need four people for every blip.
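The "responsibilities, not headcount" rule can be expressed as a check on ownership rather than on the number of people. The sketch below is an illustration with hypothetical names (`check_roles`, `REQUIRED`). It encodes only what this article states: every responsibility must have an owner, a SEV1 needs separate people for IC, Comms, and Scribe, and lower severities may combine them.

```python
# Responsibilities that must always have an owner, whatever the severity.
REQUIRED = ("Incident Commander", "Comms Lead", "Scribe")


def check_roles(severity: str, roles: dict) -> list:
    """Return a list of problems; an empty list means the roles are acceptable."""
    problems = [f"{r} is unowned" for r in REQUIRED if not roles.get(r)]

    if severity == "SEV1":
        owners = [roles[r] for r in REQUIRED if roles.get(r)]
        # For a SEV1 the three coordinating roles must be different people.
        if len(set(owners)) < len(owners):
            problems.append("SEV1 needs separate people for IC, Comms Lead, and Scribe")
    return problems


if __name__ == "__main__":
    solo = {r: "@alex" for r in REQUIRED}
    print(check_roles("SEV3", solo))  # one engineer wearing every hat is fine
    print(check_roles("SEV1", solo))  # the same assignment is flagged for a SEV1
```

Note that the check never counts people for a SEV3; it only asks whether each hat has been consciously handed to someone.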
Common mistakes that cost hours
No Incident Commander. Everyone debugs, nobody coordinates. Theories collide, work gets duplicated, and the same dashboard gets checked five times. Name an IC in the first five minutes, always.
Debugging in silence. The fix track makes progress but the outside world hears nothing. Support guesses, execs panic, the status page lies. Comms is not optional; it runs in parallel from minute one.
Unclear or absent severity. Without a declared SEV, nobody knows the right response level, so you either swarm a trivial issue or under-staff a real outage. Say the severity out loud, early.
The IC has their hands in the logs. A commander who's also debugging is no longer commanding. If you want to dig in, hand off IC first.
No "next update by" time. Updates that don't promise a next one leave stakeholders refreshing the page anxiously. Always commit to a next time.
Resolving without handing off. The fire's out, everyone vanishes, and the lesson evaporates. Resolution is not the end; the postmortem is.
Takeaways
The whole playbook in seven lines
Getting paged is on-call; running the incident is incident command. They're different skills.
Coordination beats heroics: structure stops an incident degrading as you add people.
Severity is the first decision: it sets response level and comms cadence. When in doubt, declare higher.
The Incident Commander coordinates and decides; they do not debug.
Fix and comms run in parallel: different people, different tracks, one IC bridging them.
Always promise a "next update by" time; silence reads as abandonment.
Resolution isn't the finish line; hand the Scribe's timeline to a blameless postmortem.
Where to go next
Incident command sits in the middle of the SRE operational loop: a healthy on-call rotation gets the right person paged, incident command runs the response, and a blameless postmortem turns the pain into permanent improvement.
Blameless Postmortems: how to turn the Scribe's timeline into lasting fixes instead of blame.
Follow the full SRE career path to see where incident response fits alongside SLOs, observability, and reliability engineering.