Toil & Automation: How SREs Win Back Their Week

On this page

The 2 a.m. ritual nobody questions
What toil actually is (and isn't)
The cap: why ~50% is the line in the sand
The picture: from toil to reclaimed time
The "is it toil?" checklist
Deciding what to automate: frequency × time × risk
Build it: turn a runbook into a script
Common mistakes that cost hours
Takeaways
Where to go next

The 2 a.m. ritual nobody questions

Every Tuesday at 2 a.m. the batch job finishes, and someone has to log in, copy three files to a bucket, restart a service, and tick a box in a spreadsheet. It takes fifteen minutes. It never fails, until the one week it does, and now it's a 3 a.m. incident. Nobody questions the ritual, because "it only takes fifteen minutes." Multiply that by a dozen rituals across a team and you've quietly handed an entire engineer's week to work that produces nothing new.

This is toil, and learning to see it, count it, and kill it is one of the highest-leverage skills in Site Reliability Engineering. The goal isn't to automate everything. It's to be ruthless about what deserves automation and honest about what's just busywork dressed up as "keeping the lights on."

Who this is for

Junior SREs, on-call engineers, and developers who keep getting paged for the same manual tasks. If you've ever thought "there has to be a script for this," this article gives you the framework to prove it, prioritize it, and ship it. No prior SRE theory needed, just a terminal and a service you operate.

What toil actually is (and isn't)

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Google SRE Book, Chapter 5

Read that definition slowly; every word is load-bearing. Toil is not "work I dislike" and it is not "overhead." Answering email, doing your expenses, attending a planning meeting: that's overhead, not toil. Toil is specifically the operational grind: the manual, repeatable mechanics of keeping a service alive that a machine could do instead of you.

The tell-tale sign is the last clause: it scales linearly with the service. Twice the traffic, twice the customers, twice the toil. Real engineering work has the opposite shape: you build a thing once and it keeps paying off. Toil is a tax; engineering is an investment.

Everyday analogy	In SRE terms
Bailing water out of a leaky boat by hand, bucket after bucket	Manually restarting a service every time memory leaks
Fixing the hole in the hull once, so no more bailing	Shipping the memory fix (or an auto-restart) so the task disappears
Hand-washing more dishes as more guests arrive	Work that scales linearly with traffic: classic toil
Installing a dishwasher: load once, walk away	Automation: pay the build cost once, reclaim every future hour

Toil vs. engineering, in everyday terms

The cap: why ~50% is the line in the sand

Google's SRE practice sets a clear guideline: an SRE should spend no more than 50% of their time on toil. The other half is reserved for engineering work: building automation, improving reliability, and reducing future toil. The number isn't magic; it's a forcing function.

Here's the trap the cap protects against: toil is self-perpetuating. The more time a team spends firefighting and hand-cranking operations, the less time they have to build the automation that would end the firefighting. Without a cap, a team slides into a pure-ops role, burns out, and the service stagnates. The 50% line says: operational load must leave room to engineer your way out of it. When toil creeps past the line, that's a signal to push back, hire, or invest in automation, not to quietly absorb it.

The cap is a budget, not a target

50% is a ceiling you measure against, not a quota to fill. If your toil is at 20%, wonderful: spend the rest on engineering. The point is to make toil visible enough that it can't silently eat 90% of the week.

The picture: from toil to reclaimed time

The toil lifecycle: a repetitive task is identified, measured in hours, triaged into a decision, run through an automation pipeline, and the saved hours flow back into engineering.

1. Identify. Notice the task. Anything you do by hand more than once that a runbook could describe is a toil candidate. Name it.
2. Measure. Log how long it takes and how often you do it. Frequency × duration = hours per month. You can't prioritize what you don't count.
3. Decide. Run it through the decision framework: is it cheaper to automate, eliminate the root cause, or tolerate it for now?
4. Automate. Build the smallest thing that removes the manual step: a script, then a CI job, then ideally self-service so humans leave the loop entirely.
5. Reclaim. The hours that used to go to the task now go to engineering. Track the savings so the investment is visible.
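
To make the Measure step concrete, here is a minimal sketch that totals a hand-kept toil log. The log format (task name, minutes per run, runs per month, comma-separated), the function name `toil_hours`, and the sample tasks are all assumptions made up for illustration, not a standard.

```bash
#!/usr/bin/env bash
# Sketch: hours per month for each task in a hand-kept toil log.
# Assumed (hypothetical) log format, one line per task:
#   task_name,minutes_per_run,runs_per_month
set -euo pipefail

toil_hours() {
  # Reads CSV on stdin; prints "hours<TAB>task", most expensive first.
  awk -F, 'NF == 3 { printf "%.1f\t%s\n", ($2 * $3) / 60, $1 }' \
    | LC_ALL=C sort -rn
}

# Made-up sample data: the weekly publish ritual plus two other tasks.
toil_hours <<'EOF'
weekly-publish,15,4
cert-renewal,30,1
cache-flush,2,400
EOF
```

With this sample input, the two-minute cache flush comes out on top at 13.3 hours a month, well ahead of the fifteen-minute weekly publish at 1.0.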

The "is it toil?" checklist

Not every annoying task is toil, and not all toil is worth automating. Run a candidate task through these six questions. The more "yes" answers, the more clearly it's toil, and the stronger the case to do something about it.

Question	Yes = toil	Why it matters
Manual?	You run it by hand	A human in the loop is the raw material of toil
Repetitive?	You've done it before, you'll do it again	One-off work is a project, not toil
Automatable?	A machine could do it	If it needs human judgment, it's not toil, yet
Reactive?	Triggered by a page or ticket	Interrupt-driven work fragments engineering time
No lasting value?	Service is the same after as before	Toil maintains; it doesn't improve
Scales with load?	More traffic = more of this work	Linear scaling is the signature of toil

Score a task against each attribute. Mostly "yes" = it's toil worth attacking.
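
One simple way to score a task is to count the "yes" answers. The sketch below does only that; the function name `toil_score` and the y/n argument convention are hypothetical, and the sample answers for the 2 a.m. ritual are one plausible reading rather than a ruling.

```bash
#!/usr/bin/env bash
# Sketch: count "yes" answers to the six checklist questions.
set -euo pipefail

toil_score() {
  # Usage: toil_score <manual> <repetitive> <automatable> <reactive> \
  #                   <no_lasting_value> <scales_with_load>
  # Each argument is "y" or "n". Prints the number of "y" answers.
  if [ "$#" -ne 6 ]; then
    echo "usage: toil_score takes exactly six y/n answers" >&2
    return 2
  fi
  local yes=0 answer
  for answer in "$@"; do
    if [ "${answer}" = "y" ]; then
      yes=$((yes + 1))
    fi
  done
  echo "${yes}"
}

# The 2 a.m. publish ritual, answered in table order (illustrative answers):
# manual, repetitive, automatable: yes; reactive: no (it is scheduled);
# no lasting value: yes; scales with load: no.
echo "weekly-publish: $(toil_score y y y n y n) of 6"
```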

Watch the "automatable" row

A task that genuinely needs human judgment (a nuanced rollback decision, a customer-facing call) is not toil, and forcing automation onto it is how you cause outages. Toil is the mechanical part. Automate the mechanics; keep the judgment with the human.

Deciding what to automate: frequency × time × risk

Once you've measured your toil, you can't (and shouldn't) automate all of it at once. Prioritize with three multipliers: frequency (how often), time (how long each run), and risk (what happens when a tired human does it at 3 a.m.). Multiply them into a rough score and attack the top of the list first.

Factor	Low	High
Frequency	A few times a year	Daily or on every deploy
Time per run	A couple of minutes	Half an hour of focused work
Risk if done wrong	Cosmetic, easily undone	Data loss or customer-facing outage

A lightweight prioritization model: high frequency + high time + high risk goes first.
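
The multiplication can be sketched in a few lines of shell. The 1-to-3 rating scale (1 = the "Low" column, 3 = the "High" column), the input format, the function name `prioritize`, and the sample ratings are assumptions for illustration; any consistent scale would do.

```bash
#!/usr/bin/env bash
# Sketch: rank tasks by frequency x time x risk.
# Assumed (hypothetical) input, one line per task, each factor rated 1..3:
#   task_name,frequency,time,risk
set -euo pipefail

prioritize() {
  # Reads CSV on stdin; prints "score<TAB>task", highest score first.
  awk -F, 'NF == 4 { printf "%d\t%s\n", $2 * $3 * $4, $1 }' \
    | LC_ALL=C sort -rn
}

# Made-up ratings for three tasks.
prioritize <<'EOF'
quarterly-report,1,3,1
cache-flush,3,1,2
weekly-publish,2,2,3
EOF
```

The score is only a rough ordering device. It is there to stop the most annoying task from jumping the queue, not to produce a precise number.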

The classic mistake is automating the satisfying task instead of the valuable one. A fiddly job you do once a quarter feels great to script, but a two-minute task you do twenty times a day quietly costs far more. Let the numbers, not the annoyance, set the order. There's also a sanity check from the xkcd "Is It Worth The Time?" table: if a five-minute weekly task takes you two days to automate, you won't break even for years. Spend the build budget where the payback is real.
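
The break-even arithmetic behind that sanity check is easy to run yourself. This sketch assumes an 8-hour working day (so "two days" is 16 build hours) and ignores the cost of maintaining the automation; the function name `break_even_weeks` is made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: how many weeks until automation pays back its build cost.
set -euo pipefail

break_even_weeks() {
  # Usage: break_even_weeks <build_hours> <minutes_per_run> <runs_per_week>
  local build_minutes=$(( $1 * 60 ))
  local saved_per_week=$(( $2 * $3 ))
  if [ "${saved_per_week}" -le 0 ]; then
    echo "error: task must take some time each week" >&2
    return 2
  fi
  # Round up: a partial week still has to elapse before payback.
  echo $(( (build_minutes + saved_per_week - 1) / saved_per_week ))
}

# Two 8-hour days to automate a five-minute weekly task.
break_even_weeks 16 5 1     # prints 192 (about 3.7 years)

# Same build cost, but a two-minute task run 20 times a day, 5 days a week.
break_even_weeks 16 2 100   # prints 5
```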

Three valid outcomes: automation is only one

Automate (turn the task into code), Eliminate (fix the root cause so the task disappears entirely, always the best outcome), or Tolerate (consciously accept it for now because the payback isn't there). "Tolerate" is a real, defensible choice, as long as it's a decision, not a default.

Build it: turn a runbook into a script

Automation doesn't have to start with a platform. A large share of toil dies to a humble shell script that captures the runbook exactly. Here's the 2 a.m. ritual from the intro (copy the artifacts, restart the service, record that it ran) turned into something you can schedule and forget. Note the safety rails: it fails loudly, logs what it did, and verifies the restart instead of assuming it worked.

automate-weekly-publish.sh

#!/usr/bin/env bash
# Replaces the manual Tuesday 2 a.m. publish ritual.
# Run via cron; it logs, verifies, and exits non-zero on any failure.
set -euo pipefail

ARTIFACT_DIR="/var/batch/out"
BUCKET="s3://reports-prod/weekly"
SERVICE="report-api"
LOG="/var/log/weekly-publish.log"

log() { echo "[$(date -u +%FT%TZ)] $*" | tee -a "${LOG}"; }

log "Starting weekly publish"

# 1. Copy the three artifacts to the bucket
if ! aws s3 cp "${ARTIFACT_DIR}" "${BUCKET}" --recursive --only-show-errors; then
  log "ERROR: upload failed"
  exit 1
fi
log "Uploaded artifacts to ${BUCKET}"

# 2. Restart the service to pick up the new data
if ! systemctl restart "${SERVICE}"; then
  log "ERROR: restart of ${SERVICE} failed"
  exit 1
fi

# 3. Verify it actually came back before declaring success
sleep 5
if systemctl is-active --quiet "${SERVICE}"; then
  log "${SERVICE} healthy after restart, publish complete"
else
  log "ERROR: ${SERVICE} did not come back up"
  exit 1
fi

That script is the first rung, not the last. The toil-reduction ladder climbs: a manual runbook becomes a script, the script becomes a scheduled or CI-triggered job, and the job eventually becomes self-service so the on-call engineer never touches it at all. Each rung removes a little more human involvement, and a little more 2 a.m.
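
As a sketch of the second rung, a scheduled job usually wants protection against overlapping runs. The wrapper below uses `flock` from util-linux for that. The install paths, the lock file location, the function name `run_locked`, and the crontab line are all assumptions for illustration, not part of the original runbook.

```bash
#!/usr/bin/env bash
# Sketch: run a command under a lock so scheduled runs never overlap.
# Hypothetical crontab entry (02:00 every Tuesday, server local time):
#   0 2 * * 2 /usr/local/bin/run-weekly-publish.sh
set -euo pipefail

run_locked() {
  # Usage: run_locked <lockfile> <command> [args...]
  local lockfile="$1"
  shift
  # flock -n gives up immediately (non-zero exit) if the lock is held,
  # instead of queueing a second copy behind the first.
  flock -n "${lockfile}" "$@"
}

# In the real wrapper the last line would be something like:
#   run_locked /var/lock/weekly-publish.lock /usr/local/bin/automate-weekly-publish.sh
```

Because the wrapper exits non-zero when the lock is already held, whatever monitors the cron job can treat a skipped run as a failure worth looking at.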

Common mistakes that cost hours

Automating the wrong thing. Scripting the satisfying quarterly job while a twenty-times-a-day task burns more hours. Measure first; let frequency × time × risk pick the target.
Gold-plating the automation. Spending two weeks building a configurable, plugin-based framework for a task that needed a ten-line script. The automation becomes its own toil to maintain. Build the smallest thing that works.
Automating a broken process. If the real fix is eliminating the root cause, a slick script just makes the bad workflow faster. Always ask "can I delete this task entirely?" before "how do I script it?"
Forgetting to measure. Without hours-per-month numbers you can't prioritize, can't justify the time to build, and can't prove the win afterward. Counting is half the discipline.
Automation with no guard rails. A script that fails silently or assumes success is worse than the manual task: it breaks at 2 a.m. and nobody notices. Fail loudly, log, and verify.
Treating the 50% cap as optional. Quietly absorbing ever-more toil instead of pushing back is how teams sleepwalk into burnout. The cap only works if you act when you cross it.
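
For the "verify" guard rail in particular, polling with a deadline tends to be more reliable than a fixed sleep followed by a single check. This is a minimal sketch; the function name `wait_until` and the 30-second timeout in the usage comment are hypothetical choices.

```bash
#!/usr/bin/env bash
# Sketch: poll a check until it passes or a deadline expires.
set -euo pipefail

wait_until() {
  # Usage: wait_until <timeout_seconds> <command> [args...]
  # Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
  local deadline=$(( SECONDS + $1 ))
  shift
  until "$@"; do
    if [ "${SECONDS}" -ge "${deadline}" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example use inside a publish script (not executed here):
#   if ! wait_until 30 systemctl is-active --quiet report-api; then
#     echo "ERROR: report-api did not come back up" >&2
#     exit 1
#   fi
```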

Takeaways

The whole article in seven lines

Toil = manual, repetitive, automatable, reactive work with no lasting value that scales with the service.
It's not the same as overhead (email, meetings); toil is operational grind specifically.
Cap toil at ~50% of an SRE's time so there's room to engineer your way out of it.
You can't prioritize what you don't measure: frequency × duration = hours per month.
Prioritize automation by frequency × time × risk; let numbers, not annoyance, set the order.
Three valid outcomes: Automate, Eliminate (best), or consciously Tolerate.
Start with the smallest script that fails loudly and verifies itself, then climb the ladder to self-service.

Where to go next

The fastest way to internalize this is to find one real task you do by hand and run it through the loop: measure it, score it, and ship the smallest script that kills it. The labs below give you a safe terminal to practice the automation skills.

Bash scripting lab: write, test, and harden the kind of automation script shown above.
Linux lab: the systemctl, cron, and process basics every automation script leans on.
CI/CD lab: promote a script into a triggered pipeline so humans leave the loop.
SRE career path: see where toil and automation sit alongside SLOs, error budgets, and on-call.
