The three pillars of observability, SLOs and error budgets, distributed tracing, and alerting that doesn't cry wolf.
Lesson outline
The pager fires at 3 AM. "API error rate is 12%." You open your laptop. Your monitoring dashboard shows the error rate, but it does not show you why.
You check the logs — millions of lines, no structured search. You check the metrics — all look normal except the error rate. You have no traces to follow a request through the system. You are flying blind.
Thirty minutes later, you find it: a downstream authentication service started returning 401s due to a certificate expiry. It took 30 minutes to find something that a distributed trace would have shown in 10 seconds.
Monitoring vs Observability
Monitoring: you know in advance what questions to ask, and you instrument for those specific questions. Observability: the system emits enough data that you can answer questions you did not know you would need to ask. Monitoring tells you THAT something is wrong. Observability tells you WHY.
What each pillar answers and when to use it
The correlation key: trace_id
Generate a unique trace_id at the entry point (API gateway, load balancer) and propagate it through every service call as an HTTP header (X-Trace-ID). Include it in every log line. Now you can: search logs by trace_id to find all events for one request, find the trace in Jaeger to see the waterfall, and correlate with metrics from that time window.
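A minimal sketch of this propagation pattern, assuming a dict-based request handler for illustration (the `handle_request` function and its shape are hypothetical, not from a specific framework):

```python
import logging
import uuid

TRACE_HEADER = "X-Trace-ID"  # header name from the lesson

def get_or_create_trace_id(headers: dict) -> str:
    """Reuse an incoming trace_id, or mint one at the entry point."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex

def handle_request(headers: dict) -> dict:
    trace_id = get_or_create_trace_id(headers)
    # Include the trace_id in every log line for this request.
    logging.info("order received", extra={"trace_id": trace_id})
    # Propagate the same id on every downstream service call.
    downstream_headers = {TRACE_HEADER: trace_id}
    return downstream_headers
```

The key design choice: the entry point generates the id exactly once, and every hop forwards it unchanged, so one search key links logs, traces, and metrics for a single request.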
The reliability vocabulary
How error budgets change engineering decisions
The team has 43 minutes of error budget left for the month, and a risky migration is proposed. The team asks: "If this migration has a 20% chance of causing a 30-minute incident, should we do it?" The error budget quantifies the risk: 20% × 30 min = 6 min of expected burn. Is that worth the feature benefit? This is the conversation error budgets enable.
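The arithmetic above can be made explicit, using the numbers from this scenario:

```python
def expected_budget_burn(p_incident: float, incident_minutes: float) -> float:
    """Expected error-budget burn in minutes for a risky change."""
    return p_incident * incident_minutes

remaining = 43.0                         # minutes of budget left this month
burn = expected_budget_burn(0.20, 30.0)  # 20% chance of a 30-minute incident
fraction = burn / remaining              # share of the remaining budget at risk
```

Here `burn` is 6.0 minutes, roughly 14% of the remaining budget, which frames the migration decision as a quantified trade-off rather than a gut call.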
| SLO | Allowed downtime / month | Allowed downtime / year | Who uses it |
|---|---|---|---|
| 99% (two nines) | 7.3 hours | 3.65 days | Internal tools, batch jobs |
| 99.9% (three nines) | 43.8 minutes | 8.76 hours | Most production services |
| 99.95% | 21.9 minutes | 4.38 hours | Important services, e-commerce |
| 99.99% (four nines) | 4.4 minutes | 52.6 minutes | Payment processing, auth |
| 99.999% (five nines) | 26 seconds | 5.26 minutes | Telecom, life-critical systems |
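The table values fall out of a one-line formula; a quick sketch, assuming an average-length month (365.25 days / 12) as the table does:

```python
def allowed_downtime_minutes(slo: float, period_hours: float) -> float:
    """Downtime allowed by an availability SLO over a given period."""
    return (1.0 - slo) * period_hours * 60.0

HOURS_PER_MONTH = 365.25 * 24 / 12  # average month
HOURS_PER_YEAR = 365.25 * 24

monthly = allowed_downtime_minutes(0.999, HOURS_PER_MONTH)  # ~43.8 min
yearly = allowed_downtime_minutes(0.999, HOURS_PER_YEAR)    # ~525.96 min (~8.76 h)
```

Each extra nine divides the allowed downtime by ten, which is why the jump from three to four nines is so operationally expensive.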
Your API has a 99.9% availability SLO. After a 60-minute incident, your error budget is fully consumed with 15 days left in the month. What should the team do?
The fastest way to make alerts useless is to have too many of them. If engineers get 50 alerts per shift and 45 are noise, they will train themselves to ignore all 50. This is how critical alerts get missed.
Principles for effective alerting
Where this shows up: SRE, platform engineering, and senior backend interviews. Observability is one of the most commonly tested SRE concepts.
Common questions:
Key takeaways
What is the difference between an SLI, SLO, and SLA?
SLI: the specific metric (fraction of successful requests). SLO: your internal target for the SLI (99.9% success). SLA: contractual promise to customers — always looser than the SLO.
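A minimal sketch of how the three relate in code, using the fraction-of-successful-requests SLI from the answer (the counts are illustrative):

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the measured fraction of successful requests."""
    return successful / total if total else 1.0

SLO = 0.999   # internal target
SLA = 0.995   # contractual promise -- deliberately looser than the SLO

sli = availability_sli(999_500, 1_000_000)  # 0.9995 measured
meets_slo = sli >= SLO                      # True: 99.95% vs 99.9% target
```

Note the ordering: the SLA sits below the SLO, so the team breaches its internal target (and starts reacting) well before any contractual penalty is at risk.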