Burn-Rate Alerting: SLO Math That Doesn't Page You for Nothing

The 3 a.m. page that meant nothing

Your alert says error_rate > 1%. At 3 a.m. it fires. You stumble to your laptop, pull up the dashboard, and the error rate is already back to 0.2%. A single bad deploy node flapped for ninety seconds, retried, and healed itself. Nothing was broken by the time you were awake. You acknowledge, you close the laptop, and you do not fall back asleep.

The opposite failure is quieter and worse. Your service sits at 0.9% errors for three straight days, never tripping the 1% line, but steadily torching the reliability budget your users actually feel. No page ever fires. By the time someone notices, you have already blown your quarterly target.

Both failures come from the same mistake: a static threshold has no sense of time or scale. It cannot tell a harmless blip from a sustained bleed, because it only ever looks at one instant. Burn-rate alerting fixes this by measuring how fast you are spending your error budget, and only waking a human when the spend rate actually threatens the target.

Who this is for

Engineers and SREs who already have an SLO (or want one) and are tired of noisy threshold alerts. You should be comfortable with Prometheus/PromQL and the idea of an [error budget](/blog/error-budgets-explained). No advanced math required, just ratios and time.

The principle: alert on spend rate, not on a line

Page a human when the error budget is being consumed fast enough that, left unchecked, it will run out before you can reasonably respond, and not before.
The burn-rate rule of thumb (Google SRE Workbook)

A static threshold is a fixed line painted on the road. A burn rate is a fuel gauge that warns you by how fast the needle is dropping, not by where it sits. If your tank empties in two minutes, that is an emergency whether you are at half a tank or a quarter. If it empties slowly over a week, you have time to pull over at the next station. The number to act on is the *rate of consumption relative to how long the fuel is supposed to last*, not the current level.

Fuel tank size → Error budget (the allowed failures over the SLO window)
How fast the needle drops → Burn rate (multiple of normal budget spend)
"Empties in 2 minutes" warning light → Fast-burn page: budget gone in hours
"Range getting low" trip warning → Slow-burn ticket: budget trending to zero over days
Ignoring a single flickering gauge → Short confirmation window: don't react to one bad reading

Burn rate replaces "how many errors right now" with "how fast am I spending my safety margin."

Concretely: burn rate is the ratio of your current error rate to the error rate you are *allowed* to sustain across the whole SLO window. A burn rate of 1x means you will spend exactly 100% of your budget by the end of the window, perfectly on plan. A burn rate of 14.4x means you will spend the entire month's budget in about two days. The bigger the multiple, the less time you have.
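Written as a formula, assuming the error rate is measured as a fraction of requests over some recent window (the layout below is shorthand for this article, not notation taken from the SRE Workbook):

```latex
\[
\text{burn rate} = \frac{\text{observed error rate}}{1 - \text{SLO}}
\qquad\text{e.g.}\qquad
\frac{0.0144}{1 - 0.999} = \frac{0.0144}{0.001} = 14.4
\]
```

So for a 99.9% SLO, a sustained 1.44% error rate is a 14.4x burn, and a sustained 0.1% error rate is exactly 1x.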

From SLO to a page: the pipeline

Every burn-rate alert is the end of a short chain. You start from a target, derive a budget, measure how fast it is being spent, evaluate that spend over two time windows at once, and only then decide whether a human gets paged or a ticket gets filed.

SLO target → error budget → live burn rate → multi-window evaluation → page or ticket.

1. Pick an SLO. Say 99.9% of requests succeed over a rolling 30-day window. That leaves a 0.1% error budget.
2. Turn the target into a budget. The budget is `1 - SLO` = 0.1% of all requests over the window. Spend it however you like; just don't run out.
3. Measure the live error rate. Over a short window, compute failures ÷ total requests. This is your instantaneous spend.
4. Compute burn rate. Divide the live error rate by the budget (0.001). 1.4% errors ÷ 0.1% = 14x burn, fourteen times faster than sustainable.
5. Evaluate over two windows. Require a long window AND a short window to both exceed the threshold, so you confirm sustained burn and recover fast when it stops.
6. Route by severity. Fast burn (budget gone in hours) pages a human. Slow burn (budget gone in days) opens a ticket for business hours.

Choosing windows and thresholds

The art is mapping burn rates to severities. The trick is to anchor each threshold to a time-to-exhaust: at this burn rate, how long until the whole budget is gone? A useful identity for a 30-day window: the budget lasts (30 days) ÷ burn_rate. The classic Google two-tier setup pages on anything that would exhaust the budget within ~2 days and files a ticket for slower burns that would still exhaust it within ~10 days.

Burn rate	Time to exhaust	Budget spent (window)	Action
14.4x	~2 days	2% in 1h	Page on-call (urgent)
6x	~5 days	5% in 6h	Page on-call (high)
3x	~10 days	10% in 24h	File ticket (business hrs)
1x	~30 days	on plan	No action, expected spend
< 1x	never	under budget	No action, healthy

Burn rate → time to exhaust a 30-day budget → recommended action. Higher burn = less time = louder alert.

Read the top row as: "if we keep burning this fast, 2% of the month's budget vanishes in an hour and the whole thing is gone in two days." That is worth a page. The 3x row bleeds slowly enough that a ticket and a fix during working hours is fine. Anything at or below 1x is within the spend you planned for, so silence is correct.
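The table's columns follow from two small identities. Here $T$ is the SLO period (720 hours for 30 days), $w$ is the alert window, and $b$ is the burn rate; the symbols are introduced here for illustration only:

```latex
\[
\text{time to exhaust} = \frac{T}{b},
\qquad
\text{budget spent in } w = \frac{b \cdot w}{T}
\]
\[
b = 14.4:\quad \frac{720\,\text{h}}{14.4} = 50\,\text{h} \approx 2\ \text{days},
\qquad \frac{14.4 \times 1\,\text{h}}{720\,\text{h}} = 0.02 = 2\%
\]
\[
b = 3:\quad \frac{720\,\text{h}}{3} = 240\,\text{h} = 10\ \text{days},
\qquad \frac{3 \times 24\,\text{h}}{720\,\text{h}} = 0.10 = 10\%
\]
```

Run in reverse, the second identity is one way to pick a threshold: decide what fraction of the budget you are willing to lose before alerting, then solve b = fraction × T ÷ w.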

Pick windows proportional to the burn rate

Faster tiers use shorter windows (1h long / 5m short) so you catch and recover quickly; slower tiers use longer windows (24h long / 2h short) so a brief spike can't open a ticket. Rule of thumb: the short window is ~1/12 of the long window.
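As an illustration of the 1/12 rule, the 6x row of the table would pair a 6h long window with a 30m short window. The sketch below assumes a 99.9% SLO (budget 0.001) and a counter named `http_requests_total` with `job` and `code` labels; the group name and alert name are hypothetical.

```yaml
groups:
  - name: slo_burn_rate_mid_tier   # hypothetical group name
    rules:
      # MID BURN: 6x over 6h, confirmed by 30m (30m = 6h / 12).
      # Assumes a 99.9% SLO, so the budget is 0.001.
      - alert: ErrorBudgetMidBurn   # hypothetical alert name
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="api"}[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[30m]))
            / sum(rate(http_requests_total{job="api"}[30m]))
          ) > (6 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Budget burn at 6x, 30d budget gone in ~5 days"
```

Whether this tier pages or tickets is a judgment call for your team; the table above treats it as a high-priority page.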

The alert in Prometheus

Here is a complete two-tier, multi-window burn-rate rule for a 99.9% availability SLO. Each alert fires only when both the long-window and the short-window error ratio exceed the threshold. The error ratio is written inline at each range so the rule file stands on its own.

burn-rate-rules.yaml

groups:
  - name: slo_burn_rate
    rules:
      # Error ratio = failed requests / total requests.
      # It is computed inline for each window in the rules below.

      # ---- FAST BURN: 14.4x over 1h, confirmed by 5m ----
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Fast budget burn, 30d budget gone in ~2 days"

      # ---- SLOW BURN: 3x over 24h, confirmed by 2h ----
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[24h]))
            / sum(rate(http_requests_total{job="api"}[24h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[2h]))
            / sum(rate(http_requests_total{job="api"}[2h]))
          ) > (3 * 0.001)
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Slow budget burn, 30d budget trending to zero in ~10 days"

The `0.001` is your error budget (`1 - 0.999`). The multiplier (`14.4`, `3`) is the burn rate from the table. Route `severity: page` to PagerDuty/Opsgenie and `severity: ticket` to a queue. For production, extract the ratio into a recording rule so you are not recomputing the same query at four ranges.
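A minimal sketch of that extraction for the fast-burn tier follows. The recording-rule names are illustrative (they follow the common `level:metric:operations` naming pattern), and the group names are hypothetical; the 2h and 24h windows for the slow tier would follow the same shape.

```yaml
groups:
  - name: slo_error_ratio_recording   # hypothetical group name
    rules:
      # One recording rule per window. Rule names are illustrative.
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{job="api",code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total{job="api"}[5m]))
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{job="api",code=~"5.."}[1h]))
          / sum by (job) (rate(http_requests_total{job="api"}[1h]))

  - name: slo_burn_rate_recorded   # hypothetical group name
    rules:
      # Same fast-burn condition, now reading the precomputed ratios.
      - alert: ErrorBudgetFastBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h{job="api"} > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m{job="api"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Fast budget burn, 30d budget gone in ~2 days"
```

Both recorded series carry only the `job` label, so the `and` matches them without extra `on(...)` clauses. The same series can also back a dashboard panel, which keeps the graph and the alert in agreement.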

Why two windows, not one

A single long window is accurate but slow to react and slow to clear: a 1-hour window keeps firing for nearly an hour after the incident is fixed, because the bad minutes stay inside the range. A single short window is fast but jumpy: a 5-minute window fires on any brief spike and creates exactly the noise you were trying to escape.

Requiring both windows to breach gives you the best of each. The long window confirms the burn is real and sustained, not a flicker. The short window resets the alert quickly: the moment the recent error rate drops, the short window clears and the alert auto-resolves, even though the long window is still elevated. Confirmation from the long side, fast recovery from the short side.

The mental model

Long window = "is this actually a problem?" Short window = "is it still happening right now?" You only page when the answer to both is yes.
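A worked timeline makes the difference concrete. As a simplifying assumption (not a scenario from the SRE Workbook), suppose traffic is constant and the service returns 100% errors for 10 minutes, then is fixed at $t = 0$ (with $t$ in minutes). Against the fast-burn threshold of 14.4 × 0.001 = 0.0144, each window keeps breaching while the share of bad minutes inside it stays above 1.44%:

```latex
\[
\text{1h window:}\quad
r_{1h}(t) =
\begin{cases}
\dfrac{10}{60} \approx 0.167 & 0 \le t \le 50 \\[2ex]
\dfrac{60 - t}{60} & 50 < t \le 60
\end{cases}
\qquad
r_{1h}(t) > 0.0144 \iff t < 59.1
\]
\[
\text{5m window:}\quad
r_{5m}(t) = \frac{5 - t}{5} \quad (0 \le t \le 5)
\qquad
r_{5m}(t) > 0.0144 \iff t < 4.9
\]
```

Under these assumptions a long-window-only alert would stay true for roughly 59 minutes after the fix, while the two-window condition stops being true after about 5 minutes, when the short window clears.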

Common mistakes that cost hours

Forgetting the short window. Long-window-only alerts fire long after the incident heals, so on-call keeps getting paged for something already fixed.
Using the same window for every severity. A fast burn that eats 2% of the budget in an hour and a slow 10-day bleed need different time scales. One window can't catch both without being either deaf or hysterical.
Alerting on raw error count instead of a ratio. 50 errors is catastrophic at low traffic and invisible at peak. Always divide by total requests.
Hard-coding the budget in five places. When the SLO changes, you'll miss one. Compute the threshold from a single budget constant or a recording rule.
Setting the SLO to 100%. A zero budget means any single error is an infinite burn rate. Pick a target your system can realistically meet, then alert against the slack.
Paging on slow burn. A 3x burn over 24h is a ticket, not a 3 a.m. wake-up. Reserve the page for budgets that vanish in hours.
Ignoring the denominator going to zero. At very low traffic the ratio gets noisy or undefined. Add a minimum-request guard or widen the window for low-volume services.
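Two of these mistakes, the scattered budget constant and the low-traffic denominator, can be addressed in the rule file itself. The sketch below is one possible approach, not a standard recipe: the recording-rule name, the alert name, and the traffic floor of 1 request per second are all assumptions to adapt per service.

```yaml
groups:
  - name: slo_burn_rate_guarded   # hypothetical group name
    rules:
      # Single source of truth for the budget (1 - 0.999).
      # Rule name is illustrative.
      - record: slo:error_budget:ratio
        expr: vector(0.001)

      - alert: ErrorBudgetFastBurnGuarded   # hypothetical alert name
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * scalar(slo:error_budget:ratio))
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * scalar(slo:error_budget:ratio))
          and
          # Traffic guard: require at least 1 request/second over the
          # short window. The floor is an assumed placeholder.
          sum(rate(http_requests_total{job="api"}[5m])) > 1
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Fast budget burn with enough traffic to trust the ratio"
```

Changing the SLO then means editing one `vector(...)` value. The trade-off of the guard is that a total outage on a very quiet service may not page through this rule, so low-volume services may be better served by a wider window or a separate availability probe.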

Takeaways

The whole article in seven lines

Static thresholds have no sense of time: they page on blips and sleep through slow bleeds.
Burn rate = current error rate ÷ error budget. It measures how fast you're spending your safety margin.
Anchor every threshold to a time-to-exhaust: 14.4x ≈ 2 days (page), 3x ≈ 10 days (ticket), 1x = on plan (silence).
Always require a long window AND a short window to breach together.
Long window confirms the burn is real; short window clears the alert fast when it heals.
Fast burn pages a human; slow burn files a ticket for business hours.
Compute thresholds from one budget constant so an SLO change can't leave a stale alert behind.

Where to go next

Burn-rate alerting only works on top of a well-defined budget and a low-noise alerting culture. These pair directly with this article:

Error Budgets Explained: where the budget you're spending actually comes from, and how to negotiate it.
Alerting Without Burnout: the broader on-call hygiene that burn-rate alerts plug into.
The Four Golden Signals: latency, traffic, errors, and saturation, the signals you build SLOs on.
Practice the PromQL and rule syntax in the YAML lab, then map the full SRE career path to see where SLOs sit in the bigger picture.

Start with one SLO, one budget, and the two-tier rule above. You'll page less and catch more.
