What is Site Reliability Engineering?

A one-sentence definition (and an analogy)

Site Reliability Engineering is what you get when you treat operations as a software problem, defining reliability as a measurable target and using engineering to meet it.
The SRE mental model, in one line

The phrase that started it all comes from Ben Treynor Sloss, who founded SRE at Google: it's "what happens when you ask a software engineer to design an operations function." That framing is the whole idea. An operations person asks "how do I keep this running?" A software engineer asks "how do I make this not need a person to keep it running?"

A promise to seat guests within 10 minutes → SLO, your reliability target

Measuring actual wait times at the door → SLI, the metric you track

The few late seatings you can tolerate before regulars leave → Error budget, allowed unreliability

Re-rolling silverware by hand every night, forever → Toil, repetitive manual work to automate away

A kitchen debrief that asks "what failed," not "who failed" → Blameless postmortem

SRE concepts mapped to running a busy restaurant.

The picture: the SRE feedback loop

SRE isn't a checklist; it's a loop. Users hit your service; you measure how well it serves them; you compare that against the target you promised; and how much of your allowed failure (the error budget) is left drives a decision: keep shipping features, or stop and invest in reliability. Then the loop runs again.

The SRE feedback loop: measure reliability, compare to target, and let the error budget drive the ship-vs-stabilize decision.

1. Users generate real traffic
Every request is a tiny test of your reliability promise: was it fast? Did it succeed?

2. The service emits SLIs
Service Level Indicators are the raw measurements (success rate, latency, error rate): the truth about how the service behaved.

3. Compare against the SLO
The Service Level Objective is your target, e.g. 99.9% of requests succeed. SLIs tell you where you actually landed against it.

4. Read the error budget
The gap between 100% and your SLO is the failure you're allowed. If you promised 99.9%, your budget is 0.1%, and you spend it on every outage and every risky deploy that goes wrong.

5. Make the call
Budget left? Keep shipping features. Budget blown? Freeze risky changes and pour engineering into reliability. The data decides, not the loudest voice.

6. Feed the decision back
The choice changes what you do to the service, and the loop runs again with the next wave of traffic.
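
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `WindowStats` shape, the function names, and the request counts are all hypothetical, and it assumes a simple request-based availability SLI measured over a single window.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(stats: WindowStats) -> float:
    """SLI: the fraction of requests that succeeded."""
    if stats.total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return 1 - stats.failed_requests / stats.total_requests


def budget_remaining(stats: WindowStats, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be strictly between 0 and 1")
    if stats.total_requests == 0:
        return 1.0  # nothing has been spent yet
    allowed_failures = (1 - slo) * stats.total_requests
    return 1 - stats.failed_requests / allowed_failures


def decide(stats: WindowStats, slo: float) -> str:
    """Step 5: let the remaining budget make the call."""
    if budget_remaining(stats, slo) > 0:
        return "ship features"
    return "freeze and stabilize"


# Illustrative numbers only: one million requests against a 99.9% SLO.
window = WindowStats(total_requests=1_000_000, failed_requests=400)
print(f"SLI: {availability_sli(window):.4%}")
print(f"Budget remaining: {budget_remaining(window, 0.999):.0%}")
print(f"Decision: {decide(window, 0.999)}")
```

With a 99.9% SLO, one million requests permit roughly 1,000 failures, so 400 failures leave about 60% of the budget unspent. Real teams usually evaluate this over a rolling window and add policy on top (for example, slowing down before the budget hits zero), which this sketch leaves out.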

SRE vs. DevOps vs. traditional ops

These three get conflated constantly. The cleanest way to think about it: traditional ops is a job, DevOps is a culture, and SRE is a specific, prescriptive implementation of that culture. Google likes to say "class SRE implements interface DevOps": DevOps states the goals, and SRE gives you the concrete practices to hit them.

Dimension	Traditional Ops	DevOps	SRE
Core idea	Keep it running, manually	Dev and ops share ownership	Run ops as a software problem
Success metric	Uptime, ticket volume	Deployment speed + stability	SLOs met within error budget
Reliability target	"As high as possible"	Implicit, team-defined	Explicit SLOs, deliberately < 100%
Failure response	Find who broke it	Shared blame, faster fixes	Blameless postmortems, fix the system
Toil	The job	Reduce friction	Capped and engineered away (≤ 50%)
Who does it	Separate ops team	Whole team, cultural	Engineers who write software for ops

Three ways to run production and how they differ in practice.

The headline difference is the explicit, deliberately-imperfect reliability target. Traditional ops chases 100% uptime (impossible and ruinously expensive). SRE picks a number like 99.9%, admits the remaining 0.1% will fail, and turns that admission into a budget you can spend on shipping faster.
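
To make that budget tangible, it helps to convert an SLO into allowed downtime. The sketch below assumes a time-based view of availability and a 30-day window; both are illustrative choices, and `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime an availability SLO permits over the given window."""
    if not 0 < slo <= 1:
        raise ValueError("SLO must be in the range (0, 1]")
    # The error budget is the share of the window that may be "down".
    return window * (1 - slo)


for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"{slo:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

At 99.9% over 30 days the budget is a little over 43 minutes. Each extra nine cuts it by a factor of ten, which is why chasing 100% gets expensive so quickly.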

The pillars of SRE

Everything in SRE hangs off three load-bearing ideas. You don't need to master them today; just know what each one is and why it exists. Each links to a deeper article.

Reliability you can measure: SLIs, SLOs, and error budgets. An SLI is what you measure (success rate, latency), an SLO is the target you commit to (99.9%), and the error budget is the failure that target permits. Together they turn "is it reliable enough?" from an argument into arithmetic. Start here: SLIs, SLOs & SLAs and Error Budgets.
Toil reduction. Toil is manual, repetitive, automatable work that scales with traffic and produces no lasting value: restarting servers, copying configs, clearing the same alert nightly. SRE caps toil (Google's rule of thumb: ≤ 50% of an SRE's time) so the rest goes to engineering that makes toil disappear. The terminal labs are where you build that automation muscle: Bash scripting lab and the Linux lab.
Blameless culture. When systems fail, the question is "what about our system let this happen?", never "whose fault is it?" Blameless postmortems surface the real causes because people stop hiding mistakes. Reliability is a property of systems and processes, not of how careful individuals are.
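
The toil cap is also just arithmetic once you track where the time goes. Here is a minimal sketch, assuming a hand-kept log of hours per activity; the activity names, the hours, and the `toil_fraction` helper are all made up for illustration.

```python
TOIL_CAP = 0.50  # the rule of thumb quoted above: at most 50% of an SRE's time


def toil_fraction(hours_by_activity: dict[str, float], toil_activities: set[str]) -> float:
    """Share of logged hours spent on activities classified as toil."""
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_activity.items() if name in toil_activities)
    return toil / total


# Hypothetical week for one engineer.
week = {
    "restarting servers": 6,
    "copying configs": 4,
    "clearing nightly alert": 5,
    "writing automation": 15,
    "design review": 10,
}
toil = {"restarting servers", "copying configs", "clearing nightly alert"}

share = toil_fraction(week, toil)
print(f"Toil share: {share:.1%} (cap {TOIL_CAP:.0%})")
print("Over the cap" if share > TOIL_CAP else "Within the cap")
```

Deciding which activities count as toil is a judgment call for the team; the point of the sketch is only that the cap becomes checkable once the hours are written down.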

Pro tip

Notice the through-line: every pillar replaces opinion and heroics with **data and engineering**. That's SRE in one sentence.

Common misconceptions

"SRE means 100% uptime." The opposite. SRE deliberately targets less than 100%, because the last fraction of a percent costs more than it's worth, and a service that never fails is a service that never ships. The error budget exists to *spend*.
"SRE is just a rebrand of ops / a fancy job title." It's a different operating model. Renaming your ops team "SRE" without SLOs, error budgets, and a toil cap changes nothing.
"SRE and DevOps are competitors." They're complementary. DevOps is the philosophy; SRE is one concrete implementation of it. You can absolutely do both.
"You need to be Google to do SRE." You need SLOs you can measure and the discipline to act on them. A three-person startup can set an SLO and track an error budget on day one.
"SREs just do operations work, not real engineering." A core SRE principle is capping toil so that at least half of an SRE's time goes to writing software: automation, tooling, reliability features. If your SREs only firefight, you're doing ops with a new label.

Takeaways

SRE in seven lines

SRE = running operations as a software problem, with engineering rigor.
It came from Google: "what happens when a software engineer designs an ops function."
Reliability becomes measurable: SLIs (measure) → SLOs (target) → error budget (allowed failure).
The error budget drives the core decision: ship features vs. invest in reliability.
Toil is capped and engineered away so SREs build, not just firefight.
Failure is handled blamelessly: fix the system, not the person.
DevOps is the philosophy; SRE is a concrete, prescriptive implementation of it.

Where to go next

You now have the map. The next step is to make reliability concrete: define what "reliable" means for a real service and learn to spend an error budget on purpose.

Go deep on the measurement layer: SLIs, SLOs & SLAs (defining reliability).
Learn the decision engine: Error Budgets explained.
Follow the full track from here: the SRE career path.
Build the automation muscle that kills toil: the Bash scripting lab and the Linux fundamentals lab.
Manage clusters reliably with the kubectl lab.
