SLIs, SLOs & SLAs: Defining Reliability

Reliability, but measured

Someone asks: "Is the service up?" You glance at a dashboard, see green, and say "yep." Then a support ticket lands: checkout has been failing for one customer for twenty minutes. So... was it up? "Up" turns out to be a feeling, not a number. And you can't promise, budget, or alert on a feeling.

SRE fixes this by turning reliability into math. Three acronyms do the work, and they nest inside each other: an SLI is what you measure, an SLO is the target you hold yourself to, and an SLA is the promise you make to customers (with money attached). Mix them up and you'll either chase impossible targets or sign contracts you can't keep. Let's untangle them, slowly, from zero.

Who this is for

Engineers, on-call newcomers, and anyone who has heard "what's our SLO?" and quietly nodded. No prior SRE knowledge is assumed: if you can read a percentage, you're ready. We'll define every term, draw the picture, and write a real SLO spec by the end.

Three definitions, one sentence each

An SLI is a number that describes how well the service is doing. An SLO is the line you draw for that number. An SLA is what happens, contractually, if you cross it.
The whole article, compressed

Read that again, because the order matters. You can't have a target without a measurement, and you shouldn't make a promise without a target you already trust. SLI → SLO → SLA is a build order, not just a list.

Think of a road trip with a strict friend:

The speedometer reading right now: SLI, the live measured value (e.g. 99.95% of requests succeeded)
Your rule, "I'll stay under 120 km/h": SLO, the internal target you hold yourself to
The fine printed on the speeding ticket: SLA, the external consequence if you break the agreed limit
The 0–120 gap you're allowed to use: Error budget, the room between perfect and your SLO

The picture: how the three layer

Everything starts with real user requests and flows rightward. Each arrow narrows the idea: raw events become a ratio (SLI), the ratio gets a target (SLO), the leftover of that target becomes your error budget, and only the SLO, softened, gets exposed to customers as an SLA.

User requests become an SLI, the SLI gets a target (SLO), 1 − SLO is your error budget, and a looser version of the SLO is promised externally as the SLA.

1. Count the events. Every request is either a "good" event (served fast and correctly) or a "bad" one (an error, or too slow). That raw count is the foundation.
2. Turn it into an SLI. Divide good events by total valid events. 999,000 good out of 1,000,000 = an SLI of 99.9%. The SLI is always a ratio between 0 and 100%.
3. Pin an SLO on it. Decide the lowest SLI you're willing to accept over a window, say 99.9% over 28 days. That's your SLO: a target, owned internally, with no lawyers involved.
4. Derive the error budget. Whatever's left below 100% is yours to spend on risk: 100% − 99.9% = 0.1%. Over 28 days that's ~40 minutes of allowed badness.
5. Wrap a looser SLA around it. Promise customers something safely below your SLO, e.g. 99.5%. The gap is your safety margin so you can miss the SLO without breaching the contract.
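
The arithmetic in steps 2 and 4 fits in a few lines. Here is a minimal Python sketch using the numbers from the steps above (999,000 good events out of 1,000,000, a 99.9% SLO, a 28-day window). The function names `sli_percent` and `error_budget_minutes` are illustrative and not taken from any library. The budget is expressed as minutes of total outage, which is only one way to read "allowed badness".

```python
def sli_percent(good: int, valid: int) -> float:
    """Good events divided by valid events, as a percentage."""
    if valid <= 0:
        raise ValueError("need at least one valid event")
    return 100.0 * good / valid


def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


sli = sli_percent(good=999_000, valid=1_000_000)
budget = error_budget_minutes(slo_percent=99.9, window_days=28)
print(f"SLI: {sli:.2f}%")                                  # SLI: 99.90%
print(f"Error budget: {budget:.1f} minutes per 28 days")   # 40.3 minutes
```

A partial outage spends the budget more slowly: if only 10% of requests fail, the same 40 minutes of budget lasts roughly ten times as long.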

Choosing good SLIs

An SLI is only useful if it tracks something a user would actually complain about. CPU at 80% is not an SLI, no customer cares about your CPU. "Did my page load?" is. The reliable pattern is a ratio: good events ÷ valid events, expressed as a percentage. Three SLIs cover most services on day one.

Availability: the share of requests that return a non-error response. good = HTTP 2xx/3xx/4xx; bad = 5xx. (Note: a 4xx is usually the *client's* fault, so the service still counts it as a request it handled successfully.)
Latency: the share of requests served faster than a threshold, e.g. "95% of requests under 300ms." You measure *fast enough*, not the average, because averages hide the slow tail that users feel.
Error rate: the inverse view, i.e. the share of requests that fail. Handy when failures, not slowness, are the main pain (think payment APIs).
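
To make the three ratios concrete, the sketch below computes them over a tiny, made-up sample of requests. The `Request` type, the sample data, and the 300 ms threshold are assumptions for illustration. Note that this version counts every request toward the latency SLI, including failed ones; some teams restrict it to successful requests instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability(requests: list[Request]) -> float:
    """Share of requests with a non-5xx status (4xx counts as good)."""
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Share of requests served faster than the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def error_rate(requests: list[Request]) -> float:
    """Inverse view of availability: share of requests that failed."""
    return 100.0 - availability(requests)


# Hypothetical sample: one 5xx, one slow response, one client error.
sample = [
    Request(200, 120), Request(200, 180), Request(301, 90),
    Request(404, 40), Request(200, 450), Request(200, 210),
    Request(503, 30), Request(200, 150), Request(200, 95),
    Request(200, 260),
]
print(f"availability: {availability(sample):.1f}%")        # 90.0%
print(f"latency < 300 ms: {latency_sli(sample, 300):.1f}%")  # 90.0%
print(f"error rate: {error_rate(sample):.1f}%")            # 10.0%
```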

Pick from the user's seat

Good SLIs are measured as close to the user as possible: at the load balancer or CDN edge, not deep inside one microservice. When in doubt, ask: "would a real customer notice this number moving?" If not, it's a system metric, not an SLI.

Setting realistic SLO targets

Here's the counter-intuitive part: 100% is the wrong target. It sounds responsible, but it's a trap. Reaching 100% means never shipping risky changes, never doing maintenance, and paying exponentially more for each extra "nine", all to chase reliability your users can't even perceive (their own wifi drops more often than your 99.99% service does).

A realistic SLO sits *just above* the point where users start to notice and complain. You find it empirically: look at your current SLI over the last few weeks, check whether anyone was unhappy, and set the target a notch tighter than today's reality, ambitious but reachable. Each extra nine roughly multiplies cost and effort, so buy only the nines users actually feel.

SLO	Allowed downtime / month	Error budget	Typical use
99%	~7h 12m	1%	Internal tools, batch jobs
99.9% ("three nines")	~43m	0.1%	Most web apps & APIs
99.95%	~22m	0.05%	Paid SaaS, e-commerce
99.99% ("four nines")	~4m 19s	0.01%	Critical infra, payments

What each "nine" actually buys you (per 30-day month).

Notice the jump from three to four nines costs you ~39 minutes of slack and roughly an order of magnitude more engineering. That budget, the error budget, is the most useful by-product of an SLO: it's a spendable allowance for risk, and it turns "should we ship this?" into a number instead of an argument.
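
The downtime figures for each nine can be rederived rather than memorized. The sketch below assumes a 30-day month and treats the whole budget as full outage; `allowed_downtime` is an illustrative helper, not a library function. Using an average calendar month (about 30.44 days) gives slightly larger values.

```python
def allowed_downtime(slo_percent: float, days: int = 30) -> str:
    """Format the downtime an SLO tolerates over `days` as h/m/s."""
    total_seconds = round(days * 24 * 3600 * (100.0 - slo_percent) / 100.0)
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime(slo)}")
# 99.0% -> 7h 12m 00s
# 99.9% -> 0h 43m 12s
# 99.95% -> 0h 21m 36s
# 99.99% -> 0h 04m 19s
```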

How SLAs differ

An SLA is an SLO that left the building and got a lawyer. It's a contract with a customer that says "we will keep availability above X, and if we don't, you get Y" (usually service credits, refunds, or the right to walk away). Because money is on the line, two rules always hold.

The SLA target is always looser than the SLO. If you operate to 99.9% internally, you might promise 99.5% externally. That gap is deliberate breathing room: you can miss your own goal and still honor the contract.
SLAs have consequences; SLOs have alerts. Breach an SLO and your team investigates and slows down risky releases. Breach an SLA and the company pays out. Never let the two numbers be equal: that leaves you no margin.

	SLI	SLO	SLA
What it is	A measurement	An internal target	An external contract
Form	A ratio (e.g. 99.95%)	A threshold on the SLI	A promise + penalty
Audience	Engineers / dashboards	Engineering team	Customers / legal
If breached	Nothing, it's just data	Investigate, slow releases	Pay credits / refunds
Strictness	n/a	Tighter (e.g. 99.9%)	Looser (e.g. 99.5%)
Owned by	Telemetry	SRE / product	Sales / legal

The three side by side.

Define your first SLO

You don't need a platform team to start. Pick one critical user journey and walk these five steps end to end.

1. Pick the journey. Choose the one flow that hurts most if it breaks: login, search, or checkout. Reliability is per-journey, not per-server.
2. Choose one SLI. Start with availability: good = non-5xx responses, valid = all requests to that endpoint. One SLI is enough to begin.
3. Measure today's reality. Compute the SLI over the last 2–4 weeks. Maybe it's 99.7%. Now you know what you're actually delivering.
4. Set the target a notch tighter. If reality is 99.7% and nobody complained, set the SLO at 99.9%: reachable, slightly ambitious. Write down the window (e.g. rolling 28 days).
5. Derive the budget & wire an alert. Error budget = 100% − 99.9% = 0.1%. Alert when you've burned a large chunk of it fast; that's your early-warning system.
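
One common way to express "burning the budget fast" in step 5 is a burn rate: the recent error ratio divided by the ratio the SLO allows. The article doesn't prescribe a formula, so treat the sketch below as one possible approach. The function names and the 10x threshold are assumptions chosen for illustration.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget_ratio = (100.0 - slo_percent) / 100.0
    return error_ratio / budget_ratio


def should_alert(error_ratio: float, slo_percent: float,
                 threshold: float = 10.0) -> bool:
    """Alert when the recent error ratio burns budget `threshold`x too fast."""
    return burn_rate(error_ratio, slo_percent) >= threshold


# Hypothetical reading: 2% of requests failed recently, against a 99.9% SLO.
rate = burn_rate(error_ratio=0.02, slo_percent=99.9)
print(f"burn rate: {rate:.1f}x")  # burn rate: 20.0x
print("page someone" if should_alert(0.02, 99.9) else "keep watching")
```

A burn rate of 1x spends the budget exactly over the window. At 20x, a 28-day budget would be gone in about 1.4 days, which is why a high burn rate makes a useful early warning.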

A worked example: the SLO spec

Many teams keep SLOs in version control as plain YAML so they're reviewable and reproducible. Here's an availability SLO for a checkout API, including the PromQL-style queries that compute the SLI from raw counters.

checkout-availability.slo.yaml

```yaml
# SLO: Checkout API availability
service: checkout-api
slo:
  objective: 99.9          # internal target, in %
  window: 28d              # rolling evaluation window
  description: >
    99.9% of valid checkout requests should return a
    non-5xx response over any rolling 28-day window.

# The SLI is a ratio of good events to valid events.
sli:
  events:
    # "good" = requests the service handled successfully
    good: |
      sum(rate(http_requests_total{
        job="checkout-api", code!~"5.."
      }[5m]))
    # "valid" = all requests we hold ourselves accountable for
    valid: |
      sum(rate(http_requests_total{
        job="checkout-api"
      }[5m]))

# Derived, not configured: error budget = 1 - objective.
error_budget:
  fraction: 0.001         # 1 - 0.999
  minutes_per_window: 40  # ~0.1% of 28 days

# The external SLA is intentionally looser than the SLO.
sla:
  guarantee: 99.5         # promised to customers, in %
  penalty: "10% service credit if monthly availability < 99.5%"
```

The SLI itself is one division: good over valid. As a single PromQL expression over the window it reads like this:

checkout-sli.promql

```promql
# Availability SLI over the last 28 days, as a ratio (0-1).
# Multiply by 100 for a percentage to compare against the SLO.
sum(rate(http_requests_total{job="checkout-api", code!~"5.."}[28d]))
  /
sum(rate(http_requests_total{job="checkout-api"}[28d]))
```

If that expression returns 0.9993, your SLI is 99.93%: above the 99.9% SLO, with roughly 30% of the error budget still unspent (0.07% bad against a 0.1% allowance). If it dips to 0.9988, you've blown the SLO (but not yet the 99.5% SLA). That single number is now the heartbeat of every reliability decision you make.
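
The comparison against both thresholds can be sketched as a small helper. It takes the ratio the query returns (0 to 1) and uses the spec's 99.9% SLO and 99.5% SLA as defaults. `status` and `budget_remaining` are hypothetical names, and the third sample value (0.9940) is added here only to exercise the SLA branch.

```python
def status(sli_ratio: float, slo_percent: float = 99.9,
           sla_percent: float = 99.5) -> str:
    """Compare a measured SLI ratio (0-1) against the SLO and the SLA."""
    sli_percent = sli_ratio * 100.0
    if sli_percent < sla_percent:
        return "SLA breached"
    if sli_percent < slo_percent:
        return "SLO missed, SLA intact"
    return "SLO met"


def budget_remaining(sli_ratio: float, slo_percent: float = 99.9) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = (100.0 - slo_percent) / 100.0
    spent = 1.0 - sli_ratio
    return 1.0 - spent / budget


for ratio in (0.9993, 0.9988, 0.9940):
    print(f"{ratio:.4f}: {status(ratio)}, "
          f"budget remaining {budget_remaining(ratio):+.0%}")
# 0.9993: SLO met, budget remaining +30%
# 0.9988: SLO missed, SLA intact, budget remaining -20%
# 0.9940: SLA breached, budget remaining -500%
```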

Common mistakes that cost hours

Targeting 100%. It's unachievable, ruinously expensive, and leaves zero room to ship. Pick the lowest number users won't notice.
Setting the SLA equal to (or tighter than) the SLO. No margin means every internal miss is a contractual breach. The SLA must always be the looser number.
Choosing system metrics as SLIs. CPU, memory, and disk are causes, not symptoms. Measure what the user experiences: the success and speed of their requests.
Counting 4xx as failures. A 404 or 400 is usually the client's mistake; lumping it into your error rate punishes you for users' typos. Scope SLIs to what you control.
Averaging latency. A 200ms average can hide a 5-second p99 that's enraging your slowest 1%. Always measure latency as a *threshold ratio*, not a mean.
No window or a vague one. "99.9%" means nothing without "over 28 days." The window decides how much a single bad hour hurts.
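
The latency-averaging mistake is easy to reproduce with made-up numbers. In the sketch below, 990 requests take 150 ms and 10 take 5 seconds. The sample and the 300 ms threshold are assumptions for illustration.

```python
from statistics import mean

# Hypothetical sample: 990 quick requests and 10 very slow ones.
latencies_ms = [150.0] * 990 + [5000.0] * 10

average = mean(latencies_ms)
under_300 = 100.0 * sum(1 for ms in latencies_ms if ms < 300) / len(latencies_ms)

print(f"mean latency: {average:.1f} ms")           # mean latency: 198.5 ms
print(f"requests under 300 ms: {under_300:.1f}%")  # requests under 300 ms: 99.0%
```

The mean looks healthy at just under 200 ms, while the threshold ratio shows that 1 request in 100 missed the bar by a wide margin.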

Takeaways

The whole article in seven lines

**SLI** = a measurement: good events ÷ valid events, as a %.
**SLO** = the internal target you set on that SLI (e.g. 99.9%).
**SLA** = the external contract, always looser than the SLO, with penalties.
Build order is **SLI → SLO → SLA**: measure, then target, then promise.
**Error budget = 100% − SLO**, your spendable allowance for risk.
**100% is the wrong target**; aim just above where users start to complain.
Good SLIs cover **availability, latency, and error rate**, measured near the user.

Where to go next

You now have the vocabulary of reliability. The natural next step is learning to *spend* the number you just defined (that's what error budgets are for) and seeing how the four golden signals tell you which SLIs to pick.

Read Error Budgets, Explained: how to turn that 0.1% into release decisions and alert thresholds.
Read The Four Golden Signals (latency, traffic, errors, saturation): the metrics most SLIs come from.
Follow the SRE career path to see where reliability fits in the bigger picture.
Get hands-on in the kubectl lab to inspect the workloads you'll be measuring.
Brush up on shell fundamentals in the Linux lab; every SLI query starts with knowing your way around a system.
