The three acronyms every SRE lives by, what they actually mean, how to pick good SLIs, why 100% uptime is the wrong target, and where your error budget really comes from.
Someone asks: "Is the service up?" You glance at a dashboard, see green, and say "yep." Then a support ticket lands: checkout has been failing for one customer for twenty minutes. So... was it up? "Up" turns out to be a feeling, not a number. And you can't promise, budget, or alert on a feeling.
SRE fixes this by turning reliability into math. Three acronyms do the work, and they nest inside each other: an SLI is what you measure, an SLO is the target you hold yourself to, and an SLA is the promise you make to customers (with money attached). Mix them up and you'll either chase impossible targets or sign contracts you can't keep. Let's untangle them, slowly, from zero.
Who this is for
Engineers, on-call newcomers, and anyone who has heard "what's our SLO?" and quietly nodded. No prior SRE knowledge is assumed: if you can read a percentage, you're ready. We'll define every term, draw the picture, and write a real SLO spec by the end.
Three definitions, one sentence each
An SLI is a number that describes how well the service is doing. An SLO is the line you draw for that number. An SLA is what happens, contractually, if you cross it.
Read that again: the order matters. You can't have a target without a measurement, and you shouldn't make a promise without a target you already trust. SLI → SLO → SLA is a build order, not just a list.
The speedometer reading right now → SLI: the live measured value (e.g. 99.95% of requests succeeded)
Your rule: "I'll stay under 120 km/h" → SLO: the internal target you hold yourself to
The fine printed on the speeding ticket → SLA: the external consequence if you break the agreed limit
The 0–120 gap you're allowed to use → Error budget: the room between perfect and your SLO
Think of a road trip with a strict friend.
The picture: how the three layer
Everything starts with real user requests and flows rightward. Each arrow narrows the idea: raw events become a ratio (SLI), the ratio gets a target (SLO), the leftover of that target becomes your error budget, and only the SLO, softened, gets exposed to customers as an SLA.
User requests become an SLI, the SLI gets a target (SLO), 1 − SLO is your error budget, and a looser version of the SLO is promised externally as the SLA.
1. Count the events
Every request is either a "good" event (served fast and correctly) or a "bad" one (an error, or too slow). That raw count is the foundation.
2. Turn it into an SLI
Divide good events by total valid events. 999,000 good out of 1,000,000 = an SLI of 99.9%. The SLI is always a ratio between 0 and 100%.
3. Pin an SLO on it
Decide the lowest SLI you're willing to accept over a window, say 99.9% over 28 days. That's your SLO: a target, owned internally, with no lawyers involved.
4. Derive the error budget
Whatever's left below 100% is yours to spend on risk: 100% − 99.9% = 0.1%. Over 28 days that's ~40 minutes of allowed badness.
5. Wrap a looser SLA around it
Promise customers something safely below your SLO, e.g. 99.5%. The gap is your safety margin so you can miss the SLO without breaching the contract.
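The first four steps reduce to a few lines of arithmetic. Here is a minimal Python sketch using the example numbers from the text; the function names are illustrative and not taken from any SRE library.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI as a percentage: good events divided by valid events."""
    if valid_events <= 0:
        raise ValueError("need at least one valid event to compute an SLI")
    return 100.0 * good_events / valid_events


def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of total unavailability the SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# Example numbers from the steps above.
sli = sli_percent(good_events=999_000, valid_events=1_000_000)
budget = error_budget_minutes(slo_percent=99.9, window_days=28)

print(f"SLI: {sli:.2f}%")
print(f"Error budget over 28 days: {budget:.1f} minutes")
```

Expressing the budget in minutes assumes the service is either fully up or fully down; with partial failures, the same 0.1% is better read as a share of requests.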
Choosing good SLIs
An SLI is only useful if it tracks something a user would actually complain about. CPU at 80% is not an SLI: no customer cares about your CPU. "Did my page load?" is. The reliable pattern is a ratio: good events ÷ valid events, expressed as a percentage. Three SLIs cover most services on day one.
Availability: the share of requests that return a non-error response. Good = HTTP 2xx/3xx/4xx; bad = 5xx. (Note: a 4xx is usually the *client's* fault, so the service still counts as having handled the request successfully.)
Latency: the share of requests served faster than a threshold, e.g. "requests served in under 300ms," with a target such as 95% set on top of it. You measure *fast enough*, not the average, because averages hide the slow tail that users feel.
Error rate: the inverse view of availability, i.e. the share of requests that fail. Handy when failures, not slowness, are the main pain (think payment APIs).
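All three SLIs follow the same good ÷ valid pattern, which a short Python sketch can make concrete. The `Request` record, the sample data, and the 300ms threshold are assumptions chosen for illustration; in practice these counts would come from load balancer or edge logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def ratio(good: int, valid: int) -> float:
    """good / valid, refusing to divide by an empty sample."""
    if valid == 0:
        raise ValueError("no valid events in the sample")
    return good / valid


def availability_sli(requests: list[Request]) -> float:
    """Share of requests that did not fail with a 5xx (4xx counts as good)."""
    return ratio(sum(1 for r in requests if r.status < 500), len(requests))


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Share of requests served faster than the threshold."""
    return ratio(sum(1 for r in requests if r.latency_ms < threshold_ms), len(requests))


# Hypothetical sample: ten requests, one server error, one slow response.
sample = (
    [Request(200, 120.0)] * 7
    + [Request(404, 80.0)]    # client error: still "good" for availability
    + [Request(500, 95.0)]    # server error: "bad" for availability
    + [Request(200, 2400.0)]  # slow: "bad" for the latency SLI
)

availability = availability_sli(sample)
latency = latency_sli(sample, threshold_ms=300.0)
error_rate = 1.0 - availability  # the inverse view of availability

print(f"availability: {availability:.0%}")
print(f"latency (<300ms): {latency:.0%}")
print(f"error rate: {error_rate:.0%}")
```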
Pick from the user's seat
Good SLIs are measured as close to the user as possible: at the load balancer or CDN edge, not deep inside one microservice. For each candidate, ask: "would a real customer notice this number moving?" If not, it's a system metric, not an SLI.
Setting realistic SLO targets
Here's the counter-intuitive part: 100% is the wrong target. It sounds responsible, but it's a trap. Reaching 100% means never shipping risky changes, never doing maintenance, and paying exponentially more for each extra "nine", all to chase reliability your users can't even perceive (their own wifi drops more often than your 99.99% service does).
A realistic SLO sits *just above* the point where users start to notice and complain. You find it empirically: look at your current SLI over the last few weeks, check whether anyone was unhappy, and set the target a notch tighter than today's reality, ambitious but reachable. Each extra nine roughly multiplies cost and effort, so buy only the nines users actually feel.
| SLO | Allowed downtime / month | Error budget | Typical use |
| --- | --- | --- | --- |
| 99% | ~7h 12m | 1% | Internal tools, batch jobs |
| 99.9% ("three nines") | ~43m | 0.1% | Most web apps & APIs |
| 99.95% | ~22m | 0.05% | Paid SaaS, e-commerce |
| 99.99% ("four nines") | ~4m 19s | 0.01% | Critical infra, payments |

What each "nine" actually buys you (per 30-day month).
Notice the jump from three to four nines costs you ~39 minutes of slack and roughly an order of magnitude more engineering. That budget, the error budget, is the most useful by-product of an SLO: it's a spendable allowance for risk, and it turns "should we ship this?" into a number instead of an argument.
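The downtime figures are plain arithmetic on the window length, so they are easy to recompute for any target. A small Python sketch, assuming a 30-day month and treating the whole budget as full downtime; the helper names are illustrative.

```python
def allowed_downtime_seconds(slo_percent: float, window_days: int = 30) -> float:
    """Seconds of full downtime an SLO permits over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - slo_percent) / 100.0


def format_duration(seconds: float) -> str:
    """Render seconds as a compact 'Xh Ym Zs' string."""
    total = round(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    if secs or not parts:
        parts.append(f"{secs}s")
    return " ".join(parts)


for slo in (99.0, 99.9, 99.95, 99.99):
    downtime = allowed_downtime_seconds(slo)
    print(f"{slo}% -> {format_duration(downtime)}")
```

For a 30-day month this prints 7h 12m, 43m 12s, 21m 36s, and 4m 19s. Using an average calendar month (about 30.44 days) instead gives slightly larger figures, roughly 7h 18m for 99% and 4m 23s for 99.99%.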
How SLAs differ
An SLA is an SLO that left the building and got a lawyer. It's a contract with a customer that says "we will keep availability above X, and if we don't, you get Y", usually service credits, refunds, or the right to walk away. Because money is on the line, two rules always hold.
The SLA target is always looser than the SLO. If you operate to 99.9% internally, you might promise 99.5% externally. That gap is deliberate breathing room: you can miss your own goal and still honor the contract.
SLAs have consequences; SLOs have alerts. Breach an SLO and your team investigates and slows down risky releases. Breach an SLA and the company pays out. Never let the two numbers be equal: that leaves you no margin.
| | SLI | SLO | SLA |
| --- | --- | --- | --- |
| What it is | A measurement | An internal target | An external contract |
| Form | A ratio (e.g. 99.95%) | A threshold on the SLI | A promise + penalty |
| Audience | Engineers / dashboards | Engineering team | Customers / legal |
| If breached | Nothing, it's just data | Investigate, slow releases | Pay credits / refunds |
| Strictness | n/a | Tighter (e.g. 99.9%) | Looser (e.g. 99.5%) |
| Owned by | Telemetry | SRE / product | Sales / legal |

The three side by side.
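The ordering rule between the two targets is simple enough to check mechanically, for example in a review step before a contract number is published. A minimal Python sketch; `check_targets` and `safety_margin` are hypothetical helpers, not part of any standard tool.

```python
def check_targets(slo_percent: float, sla_percent: float) -> list[str]:
    """Return a list of problems with an SLO/SLA pair (empty means OK)."""
    problems = []
    if not 0.0 < slo_percent < 100.0:
        problems.append("SLO must be above 0% and below 100%")
    if sla_percent >= slo_percent:
        problems.append("SLA must be looser (lower) than the SLO")
    return problems


def safety_margin(slo_percent: float, sla_percent: float) -> float:
    """Percentage points between the internal target and the contract."""
    return slo_percent - sla_percent


print(check_targets(slo_percent=99.9, sla_percent=99.5))  # a valid pair
print(check_targets(slo_percent=99.9, sla_percent=99.9))  # equal targets
print(f"margin: {safety_margin(99.9, 99.5):.1f} points")
```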
Define your first SLO
You don't need a platform team to start. Pick one critical user journey and walk these five steps end to end.
1. Pick the journey
Choose the one flow that hurts most if it breaks: login, search, or checkout. Reliability is per-journey, not per-server.
2. Choose one SLI
Start with availability: good = non-5xx responses, valid = all requests to that endpoint. One SLI is enough to begin.
3. Measure today's reality
Compute the SLI over the last 2–4 weeks. Maybe it's 99.7%. Now you know what you're actually delivering.
4. Set the target a notch tighter
If reality is 99.7% and nobody complained, set the SLO at 99.9%: reachable, but slightly ambitious. Write down the window (e.g. rolling 28 days).
5. Derive the budget & wire an alert
Error budget = 100% − 99.9% = 0.1%. Alert when you've burned a large chunk of it fast. That's your early-warning system.
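One common way to quantify "burned a large chunk of it fast" is a burn rate: the observed failure ratio divided by the ratio the budget allows. A burn rate of 1 spends the budget exactly over the window. The Python sketch below assumes roughly steady traffic, and the sample counts and the 2%-per-hour paging threshold are illustrative assumptions, not recommendations from the text.

```python
def burn_rate(bad_events: int, valid_events: int, slo_percent: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    1.0 means the budget would run out exactly at the end of the window.
    """
    if valid_events <= 0:
        raise ValueError("need at least one valid event")
    budget_ratio = (100.0 - slo_percent) / 100.0
    observed_bad_ratio = bad_events / valid_events
    return observed_bad_ratio / budget_ratio


def budget_spent_fraction(rate: float, lookback_hours: float, window_days: int) -> float:
    """Fraction of the whole window's budget consumed during the lookback."""
    return rate * lookback_hours / (window_days * 24)


# Hypothetical last hour: 150 failures out of 10,000 checkout requests.
rate = burn_rate(bad_events=150, valid_events=10_000, slo_percent=99.9)
spent = budget_spent_fraction(rate, lookback_hours=1, window_days=28)

# Assumed alert policy: page if one hour eats more than 2% of the budget.
ALERT_FRACTION = 0.02
should_page = spent > ALERT_FRACTION

print(f"burn rate: {rate:.1f}x, budget spent in 1h: {spent:.1%}, page: {should_page}")
```

Alerting on burn rate rather than on the raw error count keeps the alert tied to the SLO: the same threshold works whether traffic is high or low.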
A worked example: the SLO spec
Many teams keep SLOs in version control as plain YAML so they're reviewable and reproducible. Here's an availability SLO for a checkout API, including the PromQL-style queries that compute the SLI from raw counters.
checkout-availability.slo.yaml
yaml
# SLO: Checkout API availability
service: checkout-api
slo:
  objective: 99.9   # internal target, in %
  window: 28d       # rolling evaluation window
  description: >
    99.9% of valid checkout requests should return a
    non-5xx response over any rolling 28-day window.

# The SLI is a ratio of good events to valid events.
sli:
  events:
    # "good" = requests the service handled successfully
    good: |
      sum(rate(http_requests_total{
        job="checkout-api", code!~"5.."
      }[5m]))
    # "valid" = all requests we hold ourselves accountable for
    valid: |
      sum(rate(http_requests_total{
        job="checkout-api"
      }[5m]))

# Derived, not configured: error budget = 1 - objective.
error_budget:
  fraction: 0.001          # 1 - 0.999
  minutes_per_window: 40   # ~0.1% of 28 days

# The external SLA is intentionally looser than the SLO.
sla:
  guarantee: 99.5   # promised to customers, in %
  penalty: "10% service credit if monthly availability < 99.5%"
The SLI itself is one division, good over valid. As a single PromQL expression over the window it reads like this:
checkout-sli.promql
promql
# Availability SLI over the last 28 days, as a ratio (0-1).
# Multiply by 100 for a percentage to compare against the SLO.
sum(rate(http_requests_total{job="checkout-api", code!~"5.."}[28d]))
/
sum(rate(http_requests_total{job="checkout-api"}[28d]))
If that expression returns 0.9993, your SLI is 99.93%: above the 99.9% SLO, with roughly 30% of the error budget still unspent. If it dips to 0.9988, you've blown the SLO (but not yet the 99.5% SLA). That single number is now the heartbeat of every reliability decision you make.
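To see how much room a given reading leaves, the SLI can be converted into a share of error budget remaining. A minimal Python sketch using the two readings from the text; `budget_remaining` is an illustrative helper, with both inputs given as ratios between 0 and 1.

```python
def budget_remaining(sli_ratio: float, slo_ratio: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = 1.0 - slo_ratio
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    spent = 1.0 - sli_ratio
    return 1.0 - spent / budget


for sli in (0.9993, 0.9988):
    remaining = budget_remaining(sli, slo_ratio=0.999)
    status = "within SLO" if remaining >= 0 else "SLO breached"
    print(f"SLI {sli:.2%}: {remaining:+.0%} of budget left ({status})")
```

At 99.93% about 30% of the budget is left; at 99.88% the budget is overspent by about 20%.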
Common mistakes that cost hours
Targeting 100%. It's unachievable, ruinously expensive, and leaves zero room to ship. Pick the lowest number users won't notice.
Setting the SLA equal to (or tighter than) the SLO. No margin means every internal miss is a contractual breach. The SLA must always be the looser number.
Choosing system metrics as SLIs. CPU, memory, and disk are causes, not symptoms. Measure what the user experiences, success and speed of their requests.
Counting 4xx as failures. A 404 or 400 is usually the client's mistake; lumping it into your error rate punishes you for users' typos. Scope SLIs to what you control.
Averaging latency. A 200ms average can hide a 5-second p99 that's enraging your slowest 1%. Always measure latency as a *threshold ratio*, not a mean.
No window or a vague one. "99.9%" means nothing without "over 28 days." The window decides how much a single bad hour hurts.
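The latency mistake is easy to reproduce with made-up numbers. In this Python sketch, the sample latencies and the 300ms threshold are assumptions for illustration: the mean looks healthy while one request in a hundred takes five seconds.

```python
from statistics import mean

# Hypothetical latencies in milliseconds: 99 fast requests and one very slow one.
latencies_ms = [150.0] * 99 + [5000.0]


def share_under(latencies: list[float], threshold_ms: float) -> float:
    """Share of requests served faster than the threshold, as a ratio 0-1."""
    if not latencies:
        raise ValueError("no latency samples")
    return sum(1 for value in latencies if value < threshold_ms) / len(latencies)


average = mean(latencies_ms)
fast_share = share_under(latencies_ms, threshold_ms=300.0)

print(f"mean latency: {average:.1f} ms")   # looks healthy on a dashboard
print(f"under 300 ms: {fast_share:.0%}")   # the threshold ratio exposes the tail
print(f"slowest request: {max(latencies_ms):.0f} ms")
```

The mean comes out just under 200ms, yet only 99% of requests beat the threshold. The threshold ratio is the number you can set a target on.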
Takeaways
The whole article in seven lines
**SLI** = a measurement: good events ÷ valid events, as a %.
**SLO** = the internal target you set on that SLI (e.g. 99.9%).
**SLA** = the external contract, always looser than the SLO, with penalties.
Build order is **SLI → SLO → SLA**: measure, then target, then promise.
**Error budget = 100% − SLO**, your spendable allowance for risk.
**100% is the wrong target**; aim just above where users start to complain.
Good SLIs cover **availability, latency, and error rate**, measured near the user.
Where to go next
You now have the vocabulary of reliability. The natural next step is learning to *spend* the number you just defined (that's what error budgets are for) and seeing how the four golden signals tell you which SLIs to pick.
Read Error Budgets, Explained: how to turn that 0.1% into release decisions and alert thresholds.
Read The Four Golden Signals: latency, traffic, errors, and saturation, the metrics most SLIs come from.
Follow the SRE career path to see where reliability fits in the bigger picture.
Get hands-on in the kubectl lab to inspect the workloads you'll be measuring.
Brush up on shell fundamentals in the Linux lab: every SLI query starts with knowing your way around a system.