Writing Effective Runbooks

The principle: optimise for the tired stranger

A runbook is not documentation of how the system works. It is a set of instructions for what to do when it does not.
The on-call mindset

The reader of your runbook is not you. It is a tired stranger: a teammate who has never seen this service, woken from deep sleep, with adrenaline blunting their judgement. Every sentence should reduce the thinking that stranger has to do. No background theory, no "it depends", no links to a 40-page design doc. Just: look at this, run that, if you see X do Y.

In the cockpit	On call
Engine fire warning light	An alert firing
The laminated emergency checklist for that exact light	The runbook linked from the alert
Steps in fixed order: confirm, isolate, extinguish	Symptoms, diagnosis, remediation
"If still burning, divert to nearest airport"	Escalation path when remediation fails
Pilots train on it before they ever need it	Runbooks tested in game days, not first used live

A runbook is a pilot's emergency checklist, not a textbook on aerodynamics.

Pilots do not improvise during an engine fire. They reach for the checklist for that specific failure and execute it in order. That is the bar. Your runbook should make a competent-but-unfamiliar engineer as effective as the person who built the system, under pressure, with no time to learn.

The shape of an incident, the shape of a runbook

Every on-call response follows the same arc, so every runbook should too. An alert fires and links straight to its runbook. The engineer checks the symptoms to confirm they are in the right place, runs diagnosis commands to narrow the cause, applies the remediation, and verifies recovery. If the fix does not work, they escalate; they do not sit and stare.

The path from a 3am page to resolution. Every runbook should map onto this flow.

1. Alert fires with a runbook link. The page itself carries a `runbook_url`. One tap takes the engineer to the exact playbook, with no hunting through the wiki.
2. Confirm the symptoms. A short "you should see X" section so the engineer knows this is the right runbook and the alert is real, not a flapping false positive.
3. Run the diagnosis commands. Copy-pasteable commands that narrow the cause: which dependency is slow, which pod is unhealthy, which queue is backed up.
4. Apply the remediation. The known fix for the most common cause: restart, scale, fail over, clear a queue, or roll back a deploy.
5. Verify recovery. How to confirm it worked: the metric drops, the alert clears, the synthetic check goes green.
6. Escalate if stuck. If diagnosis is inconclusive or the fix did not hold, who to page next and what context to hand them.

Weak runbook vs strong runbook

Most teams have runbooks. Most of those runbooks fail the 3am test. The difference is rarely effort; it is whether the author wrote for themselves or for the tired stranger.

Dimension	Weak runbook	Strong runbook
Trigger	Generic, "service is having issues"	Tied to one specific alert by name
Commands	"Check the logs"	Exact, copy-pasteable command with the right namespace
Audience	Written for the author who already knows the system	Written for a teammate who has never seen it
Decisions	"Investigate and fix as appropriate"	If you see X, do Y; if Z, do W
Escalation	Absent, you are on your own	Named owner, secondary, and what to include in the handoff
Freshness	Last edited 18 months ago, commands broken	Reviewed each time it is used; owner and review date stamped
Location	Somewhere in the wiki, found by searching	One click from the alert via runbook_url

The same runbook, written two ways.

Walkthrough: write a runbook for one alert

Do not try to document the whole service. Pick a single alert that has paged someone recently and write the runbook for that. One alert, one runbook. Repeat for your next-noisiest alert next week.

1. Pick a real, recent alert. Choose one that actually fired and woke someone up, say `CheckoutLatencyHigh`. Real alerts have real, known fixes; hypothetical ones produce vague runbooks.
2. Write the symptoms section. What does the engineer see when this is real? "p99 latency on /checkout above 2s for 5 minutes; error rate may also climb." This confirms they are in the right place.
3. List the top one or two causes. Ask whoever has been paged: when this fired, what was actually wrong? Usually one or two causes cover most pages: a slow downstream dependency, a bad deploy, a resource limit.
4. Write exact diagnosis commands. For each cause, the precise command to confirm it. Real namespace, real service name, no placeholders the reader has to guess. Test that each one runs as written.
5. Write the remediation for each cause. The fix, as a command or a clear action. Restart the deployment, scale replicas, roll back, fail over. State the expected effect.
6. Add verification and escalation. How to confirm recovery, then who to escalate to with what context if it does not recover. Add the owner's name and a review date.
7. Link it from the alert. Add the runbook URL to the alert definition so the page carries the link. A runbook nobody can find at 3am is not a runbook.
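Step 4 asks you to test that each command runs as written. One way to make that easier is to pull every command out of the runbook so they can be reviewed or dry-run in one place. The sketch below is only an illustration: it assumes the runbook is Markdown with commands inside `bash` fences that start at column 0, as in the template later in this article, and the function name `extract_commands` is hypothetical.

```bash
#!/usr/bin/env bash
# Print every command line found inside bash fences of a Markdown runbook.
# Comment lines and blank lines inside the fences are skipped.
extract_commands() {
  local fence
  # Three backticks, built from octal 140 so this script can itself be fenced.
  fence=$(printf '\140\140\140')
  awk -v fence="$fence" '
    $0 == (fence "bash") { in_block = 1; next }    # opening fence
    $0 == fence          { in_block = 0; next }    # closing fence
    in_block && $0 !~ /^[ \t]*(#|$)/ { print }     # keep real commands only
  ' "$1"
}
```

Running `extract_commands runbooks/checkout-latency-high.md` would list the `kubectl` lines from the Diagnosis section, which is a convenient input for a review or for running each command against a staging cluster.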

A runbook template you can copy

Keep runbooks in version control next to the service, not in a wiki. That way they are reviewed in pull requests and are far less likely to drift from the code. Markdown is plenty. Here is a complete template, filled in for our example alert.

runbooks/checkout-latency-high.md

markdown

# Runbook: CheckoutLatencyHigh

**Owner:** payments-team · **Secondary:** platform-oncall
**Last reviewed:** 2026-06-01 · **Severity:** SEV-2

## Symptoms
You were paged because p99 latency on `/checkout` exceeded 2s for 5 minutes.
You should see:
- Latency panel on the [Checkout dashboard] climbing above the red line
- Possibly a rising 5xx error rate on the same panel

If latency is already back to normal, the alert may have self-resolved. Run the
checks under "Verify recovery" before you close it; do not skip them.

## Diagnosis
Run these in order. Each one points at a likely cause.

```bash
# 1. Are checkout pods healthy and not restarting?
kubectl -n payments get pods -l app=checkout

# 2. Is the payments DB the bottleneck? (slow query latency)
kubectl -n payments logs deploy/checkout --since=10m | grep -i "slow query"

# 3. Was there a recent deploy that lines up with the alert?
kubectl -n payments rollout history deploy/checkout
```

## Remediation
Match the diagnosis to a fix:

| If you see... | Do this |
|---|---|
| Pods CrashLooping or OOMKilled | `kubectl -n payments rollout restart deploy/checkout` |
| Slow queries + a recent deploy | Roll back: `kubectl -n payments rollout undo deploy/checkout` |
| Healthy pods, slow downstream | Scale out: `kubectl -n payments scale deploy/checkout --replicas=8` |

## Verify recovery
- p99 latency on the Checkout dashboard drops below 2s within ~5 minutes
- The `CheckoutLatencyHigh` alert clears in the alert manager
- The `/checkout` synthetic probe returns green

## Escalate
If none of the above recovers within 15 minutes, or diagnosis is inconclusive:
1. Page **platform-oncall** (secondary on-call) via PagerDuty.
2. Hand off with: the alert link, which diagnosis steps you ran, and their output.
3. If customer impact is widespread, declare a SEV-1 and open an incident channel.

Keep it in the repo

Runbooks in `runbooks/` next to the service code get reviewed in PRs, version-controlled, and updated when the code changes. A wiki page has none of those forcing functions, which is exactly why wiki runbooks rot.
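Because the runbooks live in the repo, a small check in CI can enforce the shared structure during review. The following is a minimal sketch, not a finished linter: it assumes the exact heading names used in the template above, and `missing_sections` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Report any required section heading missing from a runbook.
# Heading names mirror the template above; adjust them to your own convention.
missing_sections() {
  local file=$1 heading rc=0
  for heading in "Symptoms" "Diagnosis" "Remediation" "Verify recovery" "Escalate"; do
    # -x matches the whole line, -F treats the heading as a literal string
    if ! grep -qxF "## $heading" "$file"; then
      echo "missing: ## $heading"
      rc=1
    fi
  done
  return "$rc"
}
```

A CI job could loop over `runbooks/*.md` and fail the build when the function returns non-zero, so a runbook without an escalation section never merges.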

Linking runbooks from alerts

A perfect runbook nobody can find is worthless. The single highest-leverage habit is wiring the runbook URL into the alert itself, so the page that wakes someone up carries the link. In Prometheus this is an annotation on the alerting rule, which Alertmanager passes along with the notification:

alerts/checkout.yml

yaml

groups:
  - name: checkout
    rules:
      - alert: CheckoutLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 checkout latency above 2s"
          runbook_url: "https://github.com/acme/payments/blob/main/runbooks/checkout-latency-high.md"

Now the on-call notification, whether in Slack, PagerDuty, or email, can carry a clickable runbook link, provided the notification template includes the annotation. The tired stranger taps once and lands exactly where they need to be. Adopt a rule: no alert ships without a `runbook_url`. It is a one-line addition that pays back on every single page.
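The "no alert ships without a `runbook_url`" rule is easier to keep when something checks it. The sketch below is a line-based heuristic rather than a YAML parser: it assumes each rule starts with a `- alert: Name` line and that `runbook_url:` sits on its own line before the next rule, as in the example above. The function name `alerts_without_runbook` is hypothetical.

```bash
#!/usr/bin/env bash
# List alerts in a Prometheus rule file that have no runbook_url annotation.
# Heuristic only: relies on the layout described above, not on real YAML parsing.
alerts_without_runbook() {
  awk '
    # Print the previous alert name if no runbook_url was seen for it.
    function report() { if (name != "" && !has_url) print name }
    /^[ \t]*-[ \t]+alert:/ {
      report()
      name = $NF; has_url = 0      # rule name is the last field on the line
      next
    }
    /^[ \t]*runbook_url:/ { has_url = 1 }
    END { report() }
  ' "$1"
}
```

An empty result means every alert in the file carries a link; any names printed are the alerts to fix before merging.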

Keeping runbooks current

Runbooks rot faster than any other documentation because the systems they describe change weekly. A stale runbook is worse than none: it sends a half-asleep engineer to run commands that fail or, worse, deepen the outage. Three habits keep them honest:

Touch it every time you use it. When you run a runbook during an incident and a step is wrong, fix it before you go back to sleep, or first thing after. The best time to update a runbook is right after it failed you.
Review them in incident retros. Every postmortem should ask: did the runbook exist, was it linked, did it work? Action items feed straight back into the runbook.
Stamp an owner and a review date. An unowned runbook is nobody's job to maintain. A visible "last reviewed" date makes staleness obvious at a glance.
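The review-date habit can be backed by a simple staleness check. This is a sketch under stated assumptions: it expects a `Last reviewed:` line with an ISO date as in the template earlier, it relies on GNU `date` for the `-d` option, and both the function name `is_stale` and the idea of a maximum age in days are illustrative choices, not part of any standard tool.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) if the runbook's "Last reviewed" date is older than max_days.
# Assumes GNU date. "today" is a parameter so the check is reproducible in tests.
is_stale() {
  local file=$1 max_days=$2 today=${3:-$(date +%F)}
  local reviewed cutoff
  # Pull the first ISO date that follows "Last reviewed:"
  reviewed=$(sed -n 's/.*Last reviewed:[^0-9]*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$file" | head -n 1)
  # A runbook with no review date at all counts as stale
  [ -n "$reviewed" ] || return 0
  cutoff=$(date -d "$today -$max_days days" +%F)
  # ISO dates sort lexicographically, so a string comparison is enough
  [ "$reviewed" \< "$cutoff" ]
}
```

For example, `is_stale runbooks/checkout-latency-high.md 90` could run on a schedule and open a ticket for the listed owner whenever it succeeds.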

Stale runbooks are a trap

An engineer trusts the runbook precisely because they do not know the system. If it tells them to restart a service that was renamed three months ago, you have weaponised their trust against them. Treat runbook accuracy as a reliability requirement, not a nice-to-have.

Toward executable runbooks

The endgame is a runbook a machine can run. Once a runbook's steps are precise enough to copy-paste, they are precise enough to script. Progress along this ladder as a runbook matures:

1. Prose: "check the logs and restart if needed." The starting point. Better than nothing, barely.
2. Exact commands: copy-pasteable blocks with real names. The standard this article aims for.
3. One-click actions: buttons in your on-call tool that run the diagnosis or remediation for you.
4. Automated remediation: the system runs the safe, well-understood fixes itself and only pages a human if they do not work.

Do not jump straight to automation. Automate a fix only after it has been run by hand enough times that you trust it to run unattended; automating a flaky remediation just lets the system break itself faster. The runbook is how you earn that trust: it is the human-tested spec the automation is built from. For the hands-on commands behind these steps, the Linux, Bash, and kubectl labs are where you build the muscle memory.
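The top rung of the ladder, run the safe fix and page a human only if it does not work, can be sketched as a small wrapper. This is an illustration under assumptions, not a production controller: the three arguments are names of functions or commands that you supply, and `remediate_or_escalate` along with its default of 5 checks spaced 60 seconds apart are hypothetical choices.

```bash
#!/usr/bin/env bash
# Run one remediation, poll a verification check, and escalate if it never passes.
# remediate_cmd, verify_cmd and escalate_cmd are placeholders for your own tooling.
remediate_or_escalate() {
  local remediate_cmd=$1 verify_cmd=$2 escalate_cmd=$3
  local attempts=${4:-5} wait_seconds=${5:-60}
  local i=1

  # If the fix itself fails to run, hand off straight away.
  "$remediate_cmd" || { "$escalate_cmd" "remediation command failed"; return 1; }

  while [ "$i" -le "$attempts" ]; do
    if "$verify_cmd"; then
      echo "recovered after $i check(s)"
      return 0
    fi
    sleep "$wait_seconds"
    i=$((i + 1))
  done

  # The runbook must not go silent: page a human with context.
  "$escalate_cmd" "no recovery after $attempts checks"
  return 1
}
```

The wrapper mirrors the runbook's own shape: remediation, verification, then escalation with a short reason attached, so the human who is paged knows the automated fix was already tried.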

Common mistakes that cost hours

Stale commands. The runbook references a service, namespace, or flag that was renamed. The reader trusts it, runs it, and burns ten minutes on an error. Fix runbooks the moment they fail you.
Vague instructions. "Investigate and resolve" is not a runbook; it is a shrug. Every step should be a concrete command or an unambiguous action.
No escalation path. The runbook covers the happy path and goes silent when the fix does not work, leaving the engineer stranded at the exact moment they most need direction.
No link from the alert. A great runbook in a wiki nobody searches at 3am might as well not exist. Wire `runbook_url` into every alert.
Documenting the system, not the response. A runbook is not an architecture doc. Cut the background theory; keep the symptoms, the commands, and the decisions.
One giant runbook for everything. A 5,000-word mega-doc is unnavigable under stress. One alert, one focused runbook.

Takeaways

The whole article in seven lines

A runbook is instructions for what to do when the system breaks, written for a tired stranger, not for you.
Structure every runbook the same way: symptoms, diagnosis, remediation, verification, escalation.
One alert, one runbook. Do not try to document the whole service at once.
Wire `runbook_url` into every alert; no alert ships without a runbook link.
Use exact, copy-pasteable commands with real names, not "check the logs".
Always include an escalation path; the runbook must not go silent when the fix fails.
Keep them in version control, stamp an owner and review date, and fix them the moment they fail you.

Where to go next

Runbooks are one pillar of a calm on-call practice. The other two are alerts worth paging on and a clear incident process around them. Read those alongside this one.

Sibling reads: Alerting Without Burnout, so the alerts that link to your runbooks are worth waking up for; and Incident Management & On-Call, the wider response process your runbooks plug into.
Practise the commands: the Linux lab, Bash lab, and kubectl lab build the diagnosis-and-remediation muscle memory your runbooks rely on.
Go deep on the role: the SRE career path puts runbooks, alerting, and incident response into a full progression.
