Alerting Without Burnout

The 3am page that didn't matter

It is 3:07am. Your phone screams. You fumble for the laptop, heart pounding, and pull up the alert: "disk usage on node-7 is at 82%". You stare at it. The disk has been at 82% for three weeks. Nothing is broken. No user noticed anything. By the time you have your VPN connected, a log-rotation cron has already dropped it back to 71%. You close the laptop and lie awake, adrenaline still pumping, knowing your alarm goes off in four hours.

That page did not protect a single user. It just taught you, one more time, that the pager is mostly noise, which is exactly how good engineers learn to ignore it. The next time it fires, you snooze it. The time after that is the real outage, and you sleep through it.

This is alert fatigue, and it is the slow-motion failure mode of every on-call rotation. The fix is not a better paging app or a louder ringtone. It is a discipline: alert on what users feel, route by what a human must do, and delete everything else. This article shows you how.

Who this is for

Engineers who carry a pager, or are about to. You have services in production and some monitoring, but the alerts feel noisy, cause-based, or ignorable. You want a system where a page genuinely means "drop what you're doing." Comfort with metrics and a tool like Prometheus helps, but the principles are tool-agnostic.

The one rule that fixes most of it

Page a human only when a human must act now. Everything else is a ticket, a dashboard, or deleted.
The core principle of sustainable alerting

Read that again, because almost every bad alert violates it. A page is the most expensive signal you own: it interrupts a person, often while they sleep, and spends their trust. If the answer to "what should the on-call do right now?" is "nothing" or "wait and see," it was never a page. It was a metric that someone wired to a siren.

The corollary is just as important: alert on symptoms, not causes. Users do not experience your disk filling up or a single pod restarting. They experience slow checkouts and failed logins. Alert on what the user feels (high error rate, high latency, requests not completing) and you get one meaningful alert instead of forty cause-based ones racing each other to the pager.

You feel sharp chest pain and call for help now → Page on-call: high user-facing error rate, SLO burning fast

A routine checkup flags slightly high cholesterol → Open a ticket: disk trending toward full over weeks

Your smartwatch logs your resting heart rate all day → Dashboard only: per-node CPU, cache hit ratio, queue depth

A single sneeze on a Tuesday → Ignore: one pod restarted and recovered in 10 seconds

Symptom-based alerting, the way your body already works

Your nervous system does not page you for every cell. It escalates by impact. That is the model: one symptom ("users are in pain") on the pager, the causes underneath it on dashboards for you to diagnose once you are awake and looking.

How a signal becomes a page (or doesn't)

Before designing individual alerts, picture the whole pipeline. A raw metric is not an alert, and an alert is not automatically a page. Each stage is a chance to filter noise. The job is to make sure only user-impacting, act-now conditions survive all the way to the right end of this diagram.

Metrics flow into alert evaluation, then a severity router decides the destination. SLO burn rate is its own branch: fast burn pages, slow burn opens a ticket.

1. Metrics get scraped. Your service exposes counters and histograms. The monitoring system collects them every few seconds. At this stage everything is just numbers, no judgment yet.
2. Rules evaluate them. Alert rules run continuously, comparing metrics to thresholds over a time window ("error ratio > 5% for 5 minutes"). The window matters: a `for:` duration suppresses momentary blips that self-heal.
3. The router decides severity. A firing alert carries a severity label. The router (Alertmanager, PagerDuty rules) maps that label to a destination, and to who gets woken, if anyone.
4. It lands where it belongs. Critical → page the human. Warning → open a ticket for business hours. Info → no notification, it just colors a dashboard. Same pipeline, three very different costs.
5. SLO burn rate cuts across all of it. Instead of a static threshold, you measure how fast you are spending your error budget. Burning it in hours pages; burning it slowly over days opens a ticket.
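
The routing stage can be sketched in a few lines of Python. This only illustrates the mapping described above; it is not how Alertmanager or PagerDuty implement it. The `Alert` class, the `route` function, the destination names, and the `CacheHitRatioLow` alert are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical destinations, mirroring the three tiers described above.
DESTINATIONS = {
    "critical": "page",       # wake the on-call
    "warning": "ticket",      # handle in business hours
    "info": "dashboard",      # no notification
}


@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)


def route(alert: Alert) -> str:
    """Map an alert's severity label to a destination.

    Unknown or missing severities fall back to a ticket rather than a page,
    so a mislabeled rule cannot wake anyone by accident. That fallback is an
    assumption of this sketch, not a rule from any particular tool.
    """
    severity = alert.labels.get("severity", "")
    return DESTINATIONS.get(severity, "ticket")


for alert in [
    Alert("HighErrorBudgetBurnFast", {"severity": "critical"}),
    Alert("HighErrorBudgetBurnSlow", {"severity": "warning"}),
    Alert("CacheHitRatioLow", {"severity": "info"}),
]:
    print(f"{alert.name} -> {route(alert)}")
```

Defaulting to the cheaper destination is a deliberate choice here: a routing mistake then costs a ticket, not someone's sleep.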

Bad alert vs good alert

Most noisy alerts share a family resemblance: they fire on a cause, lack context, and leave the responder guessing. Good alerts invert every one of those traits. Use this as a checklist when reviewing your alert catalog: if a rule matches the "Bad alert" column, rewrite it or delete it.

Trait	Bad alert	Good alert
What it watches	A cause (CPU, disk, one pod down)	A symptom users feel (errors, latency)
Actionability	Nothing to do, or "wait and see"	A clear action the responder takes now
Signal quality	Noisy, fires on transient blips	Stable; a `for:` window filters out blips that self-heal
Severity	Everything is "critical"	Tiered: page vs ticket vs dashboard
Context	Just a metric name and a number	Runbook link, dashboard, summary of impact
Outcome over time	Gets muted, then ignored	Stays trusted because it's always real

The difference between a pager you trust and one you mute.
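
One way to apply the table during a review is a small lint over your rule definitions. The sketch below assumes rules are available as plain dictionaries shaped like Prometheus rule entries (`alert`, `expr`, `for`, `labels`, `annotations`). The `review_rule` function, the list of cause hints, the allowed severities, and the `node_disk_used_ratio` metric name are illustrative choices, not a standard.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "info"}
# Hypothetical substrings that suggest a cause-based rather than symptom-based rule.
CAUSE_HINTS = ("cpu", "disk", "memory", "restart")


def review_rule(rule: dict) -> list[str]:
    """Return a list of findings for one alert rule; an empty list means it passes."""
    findings = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    severity = labels.get("severity")

    if severity not in ALLOWED_SEVERITIES:
        findings.append("missing or unknown severity label")
    if not rule.get("for"):
        findings.append("no `for:` window, will fire on transient blips")
    if severity == "critical":
        if not annotations.get("runbook"):
            findings.append("paging alert has no runbook link")
        if any(hint in rule.get("expr", "").lower() for hint in CAUSE_HINTS):
            findings.append("paging alert appears to watch a cause, not a symptom")
    return findings


# A deliberately bad rule: cause-based, critical, no window, no runbook.
bad_rule = {
    "alert": "DiskUsageHigh",
    "expr": "node_disk_used_ratio > 0.8",
    "labels": {"severity": "critical"},
}
for finding in review_rule(bad_rule):
    print("-", finding)
```

A substring check on the expression is crude and will produce false positives; treat the output as prompts for a human review, not as a verdict.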

The litmus test

For any alert, ask: "If this pages at 3am, what does the on-call do in the first 60 seconds?" If you can't name a concrete action, it is not a page. Demote it to a ticket or a dashboard panel.

Designing an actionable alert, step by step

Do not start from a metric and ask "should I alert on this?" Start from the user and work inward. Here is the sequence that produces alerts people trust.

1. Name the user-facing symptom. What would a user notice? "Checkout fails" or "the page takes 8 seconds." Map that to a measurable signal, typically one of the Four Golden Signals: errors, latency, traffic, saturation.
2. Pick the metric that proves it. Express the symptom as a ratio or percentile you already emit: error_rate = 5xx / total, or p99 latency. Ratios beat raw counts: "5% of requests fail" scales across traffic levels; "50 errors" does not.
3. Set a threshold tied to impact, not aesthetics. Anchor it to your SLO. If your target is 99.9% success, a sustained 1% error rate is ten times the 0.1% you are allowed and clearly out of budget. Avoid round-number guesses like "80% CPU" that have no connection to user pain.
4. Add a duration window. Use `for: 5m` (or similar) so a 20-second spike that recovers on its own never pages. The window is your single biggest noise reducer, so tune it per signal.
5. Choose the severity tier. Act-now and user-impacting → page. Important but can wait until morning → ticket. Useful context → dashboard only. Be honest; severity inflation is how everything becomes critical and nothing is.
6. Attach a runbook and context. Every paging alert links to a runbook: what it means, how to confirm, the first three things to try. An alert without a runbook is a puzzle handed to a half-asleep person.
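
Steps 2 to 4 can be sketched together: compute an error ratio per evaluation, compare it with a threshold, and fire only once the condition has held for the whole window. The `error_ratio` and `should_fire` functions and the sample data are hypothetical, and the logic is a simplification of how a `for:` clause behaves, not Prometheus's implementation.

```python
def error_ratio(errors: float, total: float) -> float:
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def should_fire(samples, threshold: float, for_evals: int) -> bool:
    """Return True if the ratio exceeded `threshold` for the last
    `for_evals` consecutive evaluations.

    `samples` is a list of (errors, total) pairs, oldest first,
    one pair per evaluation interval.
    """
    if for_evals <= 0 or len(samples) < for_evals:
        return False
    recent = samples[-for_evals:]
    return all(error_ratio(e, t) > threshold for e, t in recent)


# One evaluation per minute, so `for: 5m` becomes five consecutive evaluations.
blip = [(0, 1000), (0, 1000), (80, 1000), (0, 1000), (0, 1000), (0, 1000)]
outage = [(0, 1000), (30, 1000), (40, 1000), (35, 1000), (50, 1000), (45, 1000)]

print(should_fire(blip, threshold=0.01, for_evals=5))    # False: the spike self-healed
print(should_fire(outage, threshold=0.01, for_evals=5))  # True: sustained for 5 evaluations
```

The blip peaks at 8% errors, far above the threshold, yet never pages because it does not persist. That is the noise reduction the duration window buys you.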

Burn-rate alerting on SLOs

Static thresholds are blunt. "Error rate > 5%" pages just as urgently for a 30-second blip as for an hour-long outage. Burn-rate alerting fixes this by asking a smarter question: how fast are we spending our error budget?

Your SLO gives you an error budget: if your target is 99.9% over 30 days, you are allowed roughly 43 minutes of "bad" per month. The burn rate is how fast you are consuming it. A burn rate of 1x spends the whole budget in exactly the 30-day window; 14x spends it in about two days; 100x spends it in about seven hours. Fast burn means a real, ongoing incident: page now. Slow burn means a gradual degradation: open a ticket and fix it this week.
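
The arithmetic behind those numbers is short enough to check yourself. The helper names below are hypothetical, and the sketch assumes a 30-day SLO window and a constant burn rate.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def hours_to_exhaust(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


print(f"budget: {error_budget_minutes(0.999):.1f} min")  # 43.2 min
for rate in (1, 14.4, 100):
    # 1x -> 720.0 h (30 days), 14.4x -> 50.0 h (about 2 days), 100x -> 7.2 h
    print(f"{rate}x -> {hours_to_exhaust(rate):.1f} h")
```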

Why two windows?

The standard pattern pairs a long window with a short one (e.g. 1h and 5m). Both must be burning fast to fire. The long window confirms the problem is sustained; the short window makes the alert recover quickly once you fix it. One window alone is either too jumpy or too slow.
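
A minimal sketch of the two-window condition, assuming you already have the error ratio measured over each window. The function name and the example ratios are illustrative.

```python
def burning_fast(long_ratio: float, short_ratio: float,
                 slo: float, burn_factor: float) -> bool:
    """True only when BOTH windows exceed burn_factor times the allowed error fraction."""
    threshold = burn_factor * (1 - slo)
    return long_ratio > threshold and short_ratio > threshold


SLO, FACTOR = 0.999, 14.4   # threshold is about 1.44% errors

# Ongoing incident: both windows are hot, so the alert fires.
print(burning_fast(long_ratio=0.03, short_ratio=0.05, slo=SLO, burn_factor=FACTOR))   # True
# Just fixed: the 1h window is still elevated but the 5m window is clean, so it resolves.
print(burning_fast(long_ratio=0.03, short_ratio=0.001, slo=SLO, burn_factor=FACTOR))  # False
# Brief spike: the 5m window is hot but the 1h window has barely moved, so no page.
print(burning_fast(long_ratio=0.002, short_ratio=0.05, slo=SLO, burn_factor=FACTOR))  # False
```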

This single technique collapses a dozen ad-hoc threshold alerts into a small set of tiered ones, and it ties every page directly to user-visible reliability. If you have not read it yet, the Four Golden Signals gives you the metrics that feed these rules.

A real Prometheus alerting rule

Here is a complete, two-tier burn-rate setup for an HTTP service with a 99.9% availability SLO. The fast-burn rule pages; the slow-burn rule only warns. Note the labels (severity is what the router reads to decide page vs ticket) and the annotations, which carry context to the responder.

alerts.yaml

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # FAST BURN, pages on-call. Budget gone in ~2 days at this rate.
      - alert: HighErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical          # router -> page on-call
          slo: availability
        annotations:
          summary: "Fast error-budget burn on checkout API"
          description: "Burning the 99.9% budget 14x too fast. Users are seeing failures now."
          runbook: "https://runbooks.internal/checkout-error-budget"

      # SLOW BURN, opens a ticket. Gradual degradation, fix this week.
      - alert: HighErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: warning           # router -> ticket, no page
          slo: availability
        annotations:
          summary: "Slow error-budget burn on checkout API"
          description: "Steady low-level errors eroding the budget. Investigate during business hours."
          runbook: "https://runbooks.internal/checkout-error-budget"
```

The 0.001 is 1 - 0.999, your allowed error fraction. Multiplying by the burn-rate factor (14.4 for fast, 3 for slow) gives the threshold. The router then reads severity and sends critical to the pager and warning to your ticket queue: the same rule file, two completely different costs to a human.
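
Where do factors like 14.4 come from? A common convention, described in Google's SRE Workbook, is to choose the factor so that a given share of the budget is spent within the long window: spending 2% of a 30-day budget in one hour works out to 14.4. The sketch below reproduces that arithmetic. The function names are hypothetical, and this article does not say whether the 3x factor in the slow-burn rule was chosen this way.

```python
def burn_factor(budget_fraction: float, window_hours: float,
                slo_period_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_days * 24 / window_hours


def ratio_threshold(factor: float, slo: float) -> float:
    """Error-ratio threshold used in the rule: factor * (1 - SLO)."""
    return factor * (1 - slo)


print(round(burn_factor(0.02, 1), 2))          # 14.4
print(round(ratio_threshold(14.4, 0.999), 4))  # 0.0144, i.e. 1.44% errors
print(round(ratio_threshold(3, 0.999), 4))     # 0.003, i.e. 0.3% errors
```

Run in reverse, the same formula says a 3x burn sustained for 6 hours consumes 3 × 6 / 720 = 2.5% of the monthly budget.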

Practice the diagnosis flow

When one of these fires, you confirm impact by inspecting the live system. The [kubectl lab](/labs/kubectl) lets you rehearse the read-only commands (checking pods, logs, and events) that turn a page into a diagnosis.

Common mistakes that cost you sleep

Alerting on causes, not symptoms. Forty cause-based alerts all fire during one outage. One symptom alert ("error rate high") would have told the same story without the storm.
Severity inflation. When every alert is critical, severity carries no information and the on-call treats all of them as noise. Reserve critical for act-now, user-impacting conditions.
No `for:` window. Alerting on instantaneous values pages on every transient blip. A duration window filters the spikes that self-heal before you even log in.
Static thresholds with no link to impact. "80% CPU" is an aesthetic, not an SLO. Anchor thresholds to user-visible reliability; that is what burn-rate alerting does for you.
Pages with no runbook. Handing a half-asleep engineer a metric name and a number, with no guidance, guarantees a slow, stressful response. Every page links to a runbook.
Never deleting alerts. Alert catalogs only grow unless you prune. Any alert that paged and required no action gets demoted or deleted. Hygiene is a recurring chore, not a one-time setup.
Routing everything to the pager. Tickets and dashboards exist for a reason. If it doesn't need a human in the next few minutes, it does not belong on the pager.
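
Pruning is easier when it is driven by data. The sketch below assumes you can export page history as records holding an alert name and whether the responder had to act. The `PageRecord` type, its field names, the sample history, and the 50% cutoff are all hypothetical choices for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PageRecord:
    alert: str
    action_taken: bool   # did the responder have to do anything?


def prune_candidates(history, min_actionable: float = 0.5) -> list[str]:
    """Return alerts whose share of actionable pages is below `min_actionable`."""
    counts = defaultdict(lambda: [0, 0])   # alert -> [actionable, total]
    for record in history:
        counts[record.alert][1] += 1
        if record.action_taken:
            counts[record.alert][0] += 1
    return sorted(
        name for name, (acted, total) in counts.items()
        if acted / total < min_actionable
    )


history = [
    PageRecord("DiskUsageHigh", False),
    PageRecord("DiskUsageHigh", False),
    PageRecord("DiskUsageHigh", True),
    PageRecord("HighErrorBudgetBurnFast", True),
    PageRecord("HighErrorBudgetBurnFast", True),
]
print(prune_candidates(history))   # ['DiskUsageHigh']
```

Running something like this on a regular schedule turns "demote or delete" from a good intention into a short review list.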

Takeaways

The whole article in seven lines

Page a human only when a human must act now; everything else is a ticket, a dashboard, or deleted.
Alert on symptoms users feel (errors, latency), not causes (CPU, disk, single pods).
Tier severity honestly: page (act now) vs ticket (this week) vs dashboard (context only).
Add a `for:` window to every alert; it is your single biggest noise reducer.
Use SLO burn-rate alerting: fast burn pages, slow burn opens a ticket. Pair a long window (confirms the problem is sustained) with a short one (lets the alert recover quickly).
Every paging alert links to a runbook and a dashboard. No puzzles at 3am.
Alert hygiene is recurring: regularly prune alerts that paged but needed no action.

Where to go next

Good alerts are one pillar of a sustainable on-call practice. The next is what happens after the page fires: triage, comms, and learning from it. Build the full picture with these:

Metrics foundation: The Four Golden Signals. The errors, latency, traffic, and saturation that feed every alert rule here.
After the page: Incident Management & On-Call. How to run the incident your alert just opened, and write the postmortem that prevents the next one.
Practice the diagnosis: the kubectl lab. Rehearse the read-only commands that turn a page into a root cause.
The bigger path: the SRE career path. Where alerting, SLOs, and reliability engineering fit into the full role.
