SRE · 11 min read · Jun 2026

Writing Effective Runbooks

A good runbook lets any on-call engineer resolve an incident at 3am without paging the one person who knows the system. Here is how to write runbooks that actually get used.

SRE · Runbooks · On-Call · Operations

Sri Balaji

Founder · TheSimplifiedTech

It is 3am and the pager just went off

You are on call. Your phone screams at 3:04am. The alert says `CheckoutLatencyHigh`. You have never touched the checkout service. The one engineer who understands it is asleep, on holiday, or left the company last quarter. The clock is running, customers are getting timeouts, and you are staring at a dashboard you do not recognise.

This is the moment a runbook earns its keep. A runbook is a short, focused operational playbook for one specific situation: here is what this alert means, here is how to confirm what is wrong, here is how to fix it, and here is who to call if you cannot. Done well, it turns tribal knowledge into something anyone on the rotation can execute under stress, half-awake, with zero prior context.

Done badly, it is a stale wiki page that lies to you at the worst possible moment. This article is about writing the first kind.

Who this is for

On-call engineers, SREs, and platform teams who own services and get paged. If you have ever opened an alert and thought "now what?", this is for you. No prior runbook experience is assumed; we build one from scratch.

The principle: optimise for the tired stranger

A runbook is not documentation of how the system works. It is a set of instructions for what to do when it does not.
The on-call mindset

The reader of your runbook is not you. It is a tired stranger: a teammate who has never seen this service, woken from deep sleep, with adrenaline blunting their judgement. Every sentence should reduce the thinking that stranger has to do. No background theory, no "it depends", no links to a 40-page design doc. Just: look at this, run that, if you see X do Y.

| Pilot's checklist | Runbook |
|---|---|
| Engine fire warning light | An alert firing |
| The laminated emergency checklist for that exact light | The runbook linked from the alert |
| Steps in fixed order: confirm, isolate, extinguish | Symptoms, diagnosis, remediation |
| "If still burning, divert to nearest airport" | Escalation path when remediation fails |
| Pilots train on it before they ever need it | Runbooks tested in game days, not first used live |

A runbook is a pilot's emergency checklist, not a textbook on aerodynamics.

Pilots do not improvise during an engine fire. They reach for the checklist for that specific failure and execute it in order. That is the bar. Your runbook should make a competent-but-unfamiliar engineer as effective as the person who built the system, under pressure, with no time to learn.

The shape of an incident, the shape of a runbook

Every on-call response follows the same arc, so every runbook should too. An alert fires and links straight to its runbook. The engineer checks the symptoms to confirm they are in the right place, runs diagnosis commands to narrow the cause, applies the remediation, and verifies recovery. If the fix does not work, they escalate; they do not sit and stare.

Flow: Alert fires (CheckoutLatencyHigh) → links to → Linked runbook (runbook_url in alert) → Check symptoms (Am I in the right place?) → Run diagnosis (Scoped commands) → Apply fix (Known remediation) → Verify recovered (Alert clears?). If stuck or not recovered → Escalate (Owner / next tier).

The path from a 3am page to resolution; every runbook should map onto this flow.

  1. Alert fires with a runbook link. The page itself carries a `runbook_url`. One tap takes the engineer to the exact playbook, with no hunting through the wiki.
  2. Confirm the symptoms. A short "you should see X" section so the engineer knows this is the right runbook and the alert is real, not a flapping false positive.
  3. Run the diagnosis commands. Copy-pasteable commands that narrow the cause: which dependency is slow, which pod is unhealthy, which queue is backed up.
  4. Apply the remediation. The known fix for the most common cause: restart, scale, fail over, clear a queue, roll back a deploy.
  5. Verify recovery. How to confirm it worked: the metric drops, the alert clears, the synthetic check goes green.
  6. Escalate if stuck. If diagnosis is inconclusive or the fix did not hold, who to page next and what context to hand them.

Weak runbook vs strong runbook

Most teams have runbooks. Most of those runbooks fail the 3am test. The difference is rarely effort; it is whether the author wrote for themselves or for the tired stranger.

| Dimension | Weak runbook | Strong runbook |
|---|---|---|
| Trigger | Generic, "service is having issues" | Tied to one specific alert by name |
| Commands | "Check the logs" | Exact, copy-pasteable command with the right namespace |
| Audience | Written for the author who already knows the system | Written for a teammate who has never seen it |
| Decisions | "Investigate and fix as appropriate" | If you see X, do Y; if Z, do W |
| Escalation | Absent, you are on your own | Named owner, secondary, and what to include in the handoff |
| Freshness | Last edited 18 months ago, commands broken | Reviewed each time it is used; owner and review date stamped |
| Location | Somewhere in the wiki, found by searching | One click from the alert via `runbook_url` |

The same runbook, written two ways.
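
The "Commands" and "Decisions" rows can be spot-checked mechanically. The following is a minimal sketch, not a complete linter: it assumes a runbook is plain text and that a short, hand-maintained phrase list is good enough to flag steps a tired reader cannot execute. The function name and the phrase list are hypothetical; the phrases are taken from the weak examples in this article.

```python
# Hypothetical phrase list: wording that tells the reader to think
# instead of telling them what to run. Extend it with your team's habits.
VAGUE_PHRASES = [
    "check the logs",
    "as appropriate",
    "investigate and",
    "if needed",
]


def find_vague_steps(runbook_text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for every vague phrase found, 1-indexed."""
    hits = []
    for number, line in enumerate(runbook_text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                hits.append((number, phrase))
    return hits
```

A match is a prompt for a human to rewrite the step as an exact command or an "if X, do Y" decision, not proof that the step is wrong.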

Walkthrough: write a runbook for one alert

Do not try to document the whole service. Pick a single alert that has paged someone recently and write the runbook for that. One alert, one runbook. Repeat for your next-noisiest alert next week.

  1. Pick a real, recent alert. Choose one that actually fired and woke someone up, for example `CheckoutLatencyHigh`. Real alerts have real, known fixes; hypothetical ones produce vague runbooks.
  2. Write the symptoms section. What does the engineer see when this is real? "p99 latency on /checkout above 2s for 5 minutes; error rate may also climb." This confirms they are in the right place.
  3. List the top one or two causes. Ask whoever has been paged: when this fired, what was actually wrong? Usually one or two causes cover most pages: a slow downstream dependency, a bad deploy, a resource limit.
  4. Write exact diagnosis commands. For each cause, the precise command to confirm it. Real namespace, real service name, no placeholders the reader has to guess. Test that each one runs as written.
  5. Write the remediation for each cause. The fix, as a command or a clear action. Restart the deployment, scale replicas, roll back, fail over. State the expected effect.
  6. Add verification and escalation. How to confirm recovery, then who to escalate to with what context if it does not recover. Add the owner's name and a review date.
  7. Link it from the alert. Add the runbook URL to the alert definition so the page carries the link. A runbook nobody can find at 3am is not a runbook.
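
To make "one alert, one runbook" cheap to start, the steps above can be scaffolded. This is a sketch under assumptions: the helper names are hypothetical, the headings mirror the symptoms, diagnosis, remediation, verification, and escalation steps, and the file naming follows the `runbooks/` directory convention used later in this article.

```python
import re
from datetime import date

# Headings chosen to mirror the walkthrough steps (an assumption, not a standard).
SECTIONS = ["Symptoms", "Diagnosis", "Remediation", "Verify recovery", "Escalate"]


def runbook_path(alert: str) -> str:
    """Map a CamelCase alert name to a kebab-case file under runbooks/.

    Note: runs of capitals are split letter by letter (e.g. "DBDown" becomes
    "d-b-down"), so rename such files by hand.
    """
    kebab = re.sub(r"(?<!^)(?=[A-Z])", "-", alert).lower()
    return f"runbooks/{kebab}.md"


def runbook_skeleton(alert: str, owner: str, secondary: str, reviewed: date) -> str:
    """Render an empty runbook for one alert, with owner and review date stamped."""
    lines = [
        f"# Runbook: {alert}",
        "",
        f"**Owner:** {owner} · **Secondary:** {secondary}",
        f"**Last reviewed:** {reviewed.isoformat()}",
        "",
    ]
    for section in SECTIONS:
        lines += [f"## {section}", "TODO", ""]
    return "\n".join(lines)
```

The skeleton only removes the blank-page problem. The content of each section still has to come from someone who was actually paged by the alert.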

A runbook template you can copy

Keep runbooks in version control next to the service, not in a wiki. That way they are reviewed in pull requests and are far less likely to drift from the code. Markdown is plenty. Here is a complete template, filled in for our example alert.

runbooks/checkout-latency-high.md

````markdown
# Runbook: CheckoutLatencyHigh

**Owner:** payments-team · **Secondary:** platform-oncall
**Last reviewed:** 2026-06-01 · **Severity:** SEV-2

## Symptoms
You were paged because p99 latency on `/checkout` exceeded 2s for 5 minutes.
You should see:
- Latency panel on the [Checkout dashboard] climbing above the red line
- Possibly a rising 5xx error rate on the same panel

If latency is already back to normal, the alert may have self-resolved. Confirm with
the verify step below, then close. Do not skip it.

## Diagnosis
Run these in order. Each one points at a likely cause.

```bash
# 1. Are checkout pods healthy and not restarting?
kubectl -n payments get pods -l app=checkout

# 2. Is the payments DB the bottleneck? (slow query latency)
# Note: with deploy/checkout, kubectl reads logs from only one of the deployment's pods.
kubectl -n payments logs deploy/checkout --since=10m | grep -i "slow query"

# 3. Was there a recent deploy that lines up with the alert?
kubectl -n payments rollout history deploy/checkout
```

## Remediation
Match the diagnosis to a fix:

| If you see... | Do this |
|---|---|
| Pods CrashLooping or OOMKilled | `kubectl -n payments rollout restart deploy/checkout` |
| Slow queries + a recent deploy | Roll back: `kubectl -n payments rollout undo deploy/checkout` |
| Healthy pods, slow downstream | Scale out: `kubectl -n payments scale deploy/checkout --replicas=8` |

## Verify recovery
- p99 latency on the Checkout dashboard drops below 2s within ~5 minutes
- The `CheckoutLatencyHigh` alert clears in the alert manager
- The `/checkout` synthetic probe returns green

## Escalate
If none of the above recovers within 15 minutes, or diagnosis is inconclusive:
1. Page the owner, **payments-team**, via PagerDuty. If there is no acknowledgement, page the secondary, **platform-oncall**.
2. Hand off with: the alert link, which diagnosis steps you ran, and their output.
3. If customer impact is widespread, declare a SEV-1 and open an incident channel.
````

Keep it in the repo

Runbooks in `runbooks/` next to the service code get reviewed in PRs, version-controlled, and updated when the code changes. A wiki page has none of those forcing functions, which is exactly why wiki runbooks rot.
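
Living in the repo also means a pull-request check can enforce the structure. A minimal sketch, assuming runbooks are Markdown files whose sections are `## ` headings named as in the template above; the function name and the required list are hypothetical, and wiring it into CI is left to your pipeline.

```python
import re

# Section names taken from the template in this article.
REQUIRED_SECTIONS = ["Symptoms", "Diagnosis", "Remediation", "Verify recovery", "Escalate"]

# Matches level-2 Markdown headings only ("## Title"), one per line.
HEADING = re.compile(r"^##[ \t]+(.+)$", flags=re.MULTILINE)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching '## ' heading.

    Caveat: headings inside fenced code blocks are counted too, which is
    acceptable for a coarse check.
    """
    headings = {match.group(1).strip().lower() for match in HEADING.finditer(runbook_text)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

An empty list means the file has the full shape. It says nothing about whether the commands inside still work, which is why the review habits later in this article still matter.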

Linking runbooks from alerts

A perfect runbook nobody can find is worthless. The single highest-leverage habit is wiring the runbook URL into the alert itself, so the page that wakes someone up carries the link. With Prometheus, this is an annotation on the alerting rule, which Alertmanager passes through to the notification:

alerts/checkout.yml

```yaml
groups:
  - name: checkout
    rules:
      - alert: CheckoutLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 checkout latency above 2s"
          runbook_url: "https://github.com/acme/payments/blob/main/runbooks/checkout-latency-high.md"
```

Now the on-call notification, whether in Slack, PagerDuty, or email, can render a clickable runbook link, as long as the notification template includes the annotation. The tired stranger taps once and lands exactly where they need to be. Adopt a rule: no alert ships without a `runbook_url`. It is a one-line addition that pays back on every single page.
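
The "no alert ships without a `runbook_url`" rule is easy to check automatically. The sketch below assumes the rule file has already been parsed into plain Python data (for example with a YAML library such as PyYAML's `yaml.safe_load`) and follows the `groups` / `rules` / `annotations` layout shown above. The function name is hypothetical.

```python
def alerts_missing_runbook(rule_file: dict) -> list[str]:
    """Return the names of alerting rules with no non-empty runbook_url annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no 'alert' key and never page anyone
            annotations = rule.get("annotations") or {}
            url = annotations.get("runbook_url") or ""
            if not str(url).strip():
                missing.append(rule["alert"])
    return missing
```

Run against every rule file in a pull request, a non-empty result is a reasonable reason to block the merge.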

Keeping runbooks current

Runbooks rot faster than most documentation because the systems they describe can change weekly. A stale runbook is worse than none: it sends a half-asleep engineer to run commands that fail or, worse, do damage. Three habits keep them honest:

  1. Touch it every time you use it. When you run a runbook during an incident and a step is wrong, fix it before you go back to sleep, or first thing after. The best time to update a runbook is right after it failed you.
  2. Review them in incident retros. Every postmortem should ask: did the runbook exist, was it linked, did it work? Action items feed straight back into the runbook.
  3. Stamp an owner and a review date. An unowned runbook is nobody's job to maintain. A visible "last reviewed" date makes staleness obvious at a glance.
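
The review-date stamp only helps if something reads it. A minimal sketch, assuming the `**Last reviewed:** YYYY-MM-DD` header format from the template above; the 90-day threshold and the function names are hypothetical choices, not a recommendation from any standard.

```python
import re
from datetime import date
from typing import Optional

# Matches the header line used in the template, e.g. "**Last reviewed:** 2026-06-01".
REVIEWED = re.compile(r"\*\*Last reviewed:\*\*\s*(\d{4}-\d{2}-\d{2})")


def days_since_review(runbook_text: str, today: date) -> Optional[int]:
    """Return the age of the review stamp in days, or None if there is no stamp."""
    match = REVIEWED.search(runbook_text)
    if match is None:
        return None
    return (today - date.fromisoformat(match.group(1))).days


def is_stale(runbook_text: str, today: date, max_age_days: int = 90) -> bool:
    """Treat a missing stamp as stale: an undated runbook cannot be trusted either."""
    age = days_since_review(runbook_text, today)
    return age is None or age > max_age_days
```

Passing `today` in explicitly keeps the check deterministic, so it can be tested and run on a schedule without surprises.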

Stale runbooks are a trap

An engineer trusts the runbook precisely because they do not know the system. If it tells them to restart a service that was renamed three months ago, you have weaponised their trust against them. Treat runbook accuracy as a reliability requirement, not a nice-to-have.

Toward executable runbooks

The endgame is a runbook a machine can run. Once a runbook's steps are precise enough to copy-paste, they are precise enough to script. Progress along this ladder as a runbook matures:

  1. Prose: "check the logs and restart if needed." The starting point. Better than nothing, barely.
  2. Exact commands: copy-pasteable blocks with real names. The standard this article aims for.
  3. One-click actions: buttons in your on-call tool that run the diagnosis or remediation for you.
  4. Automated remediation: the system runs the safe, well-understood fixes itself and only pages a human if they do not work.

Do not jump straight to automation. Automate a fix only after it has been run by hand enough times that you trust it blindly. Automating a flaky remediation just lets the system break itself faster. The runbook is how you earn that trust: it is the human-tested spec the automation is built from. For the hands-on commands behind these steps, the Linux, Bash, and kubectl labs are where you build the muscle memory.
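
The top rung of that ladder can be sketched as a small guard: apply one known fix, verify, and page a human if it does not hold. This is an illustration under assumptions, not a production controller. All three callables and the default check count and wait are hypothetical stand-ins for whatever your tooling provides (a `kubectl` wrapper, a metrics query, a paging client).

```python
import time
from typing import Callable


def remediate_or_escalate(
    apply_fix: Callable[[], None],
    is_recovered: Callable[[], bool],
    escalate: Callable[[str], None],
    checks: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Apply one known fix, poll for recovery, and escalate if it does not hold.

    Returns True if recovery was observed, False if a human was paged.
    """
    try:
        apply_fix()
    except Exception as error:
        # The fix itself failed: hand over immediately rather than retrying.
        escalate(f"automated fix raised {error!r}")
        return False

    for _ in range(checks):
        sleep(wait_seconds)
        if is_recovered():
            return True

    escalate(f"not recovered after {checks} checks")
    return False
```

The shape mirrors the runbook itself: remediation, verification, escalation. The fix is attempted exactly once, so a remediation that does not work cannot loop and make things worse.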

Common mistakes that cost hours

  1. Stale commands. The runbook references a service, namespace, or flag that was renamed. The reader trusts it, runs it, and burns ten minutes on an error. Fix runbooks the moment they fail you.
  2. Vague instructions. "Investigate and resolve" is not a runbook; it is a shrug. Every step should be a concrete command or an unambiguous action.
  3. No escalation path. The runbook covers the happy path and goes silent when the fix does not work, leaving the engineer stranded at the exact moment they most need direction.
  4. No link from the alert. A great runbook in a wiki nobody searches at 3am might as well not exist. Wire `runbook_url` into every alert.
  5. Documenting the system, not the response. A runbook is not an architecture doc. Cut the background theory; keep the symptoms, the commands, and the decisions.
  6. One giant runbook for everything. A 5,000-word mega-doc is unnavigable under stress. One alert, one focused runbook.

Takeaways

The whole article in seven lines

  • A runbook is instructions for what to do when the system breaks, written for a tired stranger, not for you.
  • Structure every runbook the same way: symptoms, diagnosis, remediation, verify, escalation.
  • One alert, one runbook. Do not try to document the whole service at once.
  • Wire `runbook_url` into every alert: no alert ships without a runbook link.
  • Use exact, copy-pasteable commands with real names, not "check the logs".
  • Always include an escalation path; the runbook must not go silent when the fix fails.
  • Keep them in version control, stamp an owner and review date, and fix them the moment they fail you.

Where to go next

Runbooks are one pillar of a calm on-call practice. The other two are alerts worth paging on and a clear incident process around them. Read those alongside this one.
