Observability: Metrics, Logs & Traces (The Three Pillars)
Monitoring tells you the system is broken. Observability tells you why, even for failures you never thought to predict. This is a from-scratch mental model of the three pillars, the four golden signals, and SLIs/SLOs, explained so that you can instrument a service tomorrow.
It's 3am. Your dashboard is red: error rate is up, latency is climbing. Monitoring did its job: it told you *something* is wrong. Then comes the question monitoring cannot answer: why? Which service? Which dependency? Which one customer's request set off the cascade? You start SSHing into boxes and grepping logs by hand, and the incident drags on for an hour when it should have taken five minutes.
That gap, between *knowing something broke* and *understanding why*, is exactly the gap observability fills. It's one of the clearest dividing lines between an engineer who can keep a system alive and one who can only watch it die. This article builds the mental model from zero: what observability actually means, the three pillars it rests on, what to measure, and how to set targets you can defend.
Who this is for
Anyone who has deployed something and then wondered what it's doing in production. No prior monitoring experience needed. We use vendor-neutral terms (the same ideas apply whether you run Prometheus + Grafana, Datadog, or the OpenTelemetry stack).
Monitoring vs observability: the one distinction that matters
Monitoring answers questions you already knew to ask. Observability lets you ask new questions of your system without shipping new code to answer them.
Monitoring is dashboards and alerts for known failure modes: CPU over 80%, disk almost full, error rate above threshold. You decided in advance what to watch. That's necessary, but production fails in ways nobody predicted. Observability is the property that, when something *unexpected* happens, you can explore the system's actual behaviour and figure it out from the data you're already emitting.
A check-engine light: monitoring (a known signal fired)
A mechanic's full diagnostic port: observability (ask anything, after the fact)
A fixed pre-flight checklist: alerts on known thresholds
The flight recorder (black box): logs + traces you can replay
Same car, two different relationships with it.
The practical upshot: you build observability so that the *next* incident, the one you can't imagine yet, is debuggable with data you're already collecting. You don't get a second chance to instrument an outage that already happened.
How telemetry actually flows
Before the three pillars, see the shape of the whole system. Your application emits three kinds of telemetry. They travel through a collector, land in backends suited to each data type, and surface as dashboards and alerts. Follow the flow:
Telemetry pipeline. The app emits metrics, logs, and traces. A collector (e.g. the OpenTelemetry Collector) fans them out to purpose-built backends. Dashboards and alerting sit on top. Alerts page a human; dashboards are where that human investigates.
Notice the division of labour: metrics drive alerts because they're cheap to aggregate; logs and traces are where you investigate once an alert fires. A mature setup links them: an alert on a metric jumps you to the relevant traces, which link to the exact log lines. That stitched-together path is what turns a one-hour incident into a five-minute one.
The three pillars, side by side
Metrics, logs, and traces are not competing choices. They answer different questions, and you need all three. The fastest way to internalise them is to ask what each one is *good at*:
| Pillar | What it is | Answers | Example |
| --- | --- | --- | --- |
| Metrics | Numbers aggregated over time | Is something wrong, and how much? | p99 latency = 1.4s; error rate = 3% |
| Logs | Timestamped event records | What exactly happened in this event? | "payment failed: card declined, user 8842" |
| Traces | One request's path across services | Where in the request did time/errors go? | checkout: gateway 12ms → auth 8ms → db 1300ms |
Each pillar answers a different question. Reach for the one that matches what you're trying to learn.
Metrics: cheap, aggregated, alert-friendly
A metric is a number measured over time: request count, latency, CPU, queue depth. Because they're pre-aggregated, metrics are cheap to store and fast to query, which makes them ideal for dashboards and alerts. Their weakness is that aggregation throws away detail: a metric tells you 3% of requests failed, never *which* ones or *why*.
Logs: high detail, the per-event truth
A log line is a record of one event. Logs carry the detail metrics lose. The single biggest upgrade you can make is structured logging: emit JSON, not free text, so you can filter and aggregate. `{"level":"error","user":8842,"reason":"card_declined"}` is queryable; `payment failed for user` is a needle in a haystack.
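As a minimal sketch of structured logging, the following uses only Python's standard library. The `JsonFormatter` class, the `payments` logger name, and the `fields` key passed through `extra` are assumptions made for illustration; real services often use a dedicated JSON logging library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}. The "fields"
        # key is a convention of this sketch, not a logging built-in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "payments") -> logging.Logger:
    """Return a logger whose only handler writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.error(
    "payment failed",
    extra={"fields": {"user": 8842, "reason": "card_declined"}},
)
```

Because every record has the same field names, a log backend can filter on `reason` or group by `user` without parsing free text.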
Traces: one request's journey across services
In a system with many services, one user action triggers a chain of internal calls. A trace records that whole journey as a tree of spans; each span is one operation, with timing. Traces are how you answer "the checkout is slow; *which* of the eight services it touches is the culprit?" A shared `trace_id` field on every log line is what ties the pillars together: from a slow trace you jump straight to its logs.
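To make the "tree of spans" idea concrete, here is a small sketch that models a trace in plain Python and finds the span where the time went. The `Span` class and the numbers are hypothetical (they reuse the checkout example from the table above), and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed operation; children are the calls it made."""
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def self_time_ms(span: Span) -> float:
    """Time spent in the span itself, excluding its children.

    Assumes children executed sequentially (an assumption of this sketch).
    """
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def slowest_span(root: Span) -> Span:
    """Walk the tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time_ms(span) > self_time_ms(best):
            best = span
        stack.extend(span.children)
    return best


trace = Span(
    "checkout",
    1325,
    [Span("gateway", 12), Span("auth", 8), Span("db", 1300)],
)
print(slowest_span(trace).name)  # db
```

Tracing backends do this kind of analysis for you; the point of the sketch is only that a trace is structured data you can query, not a picture.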
Pro tip
The cardinality trap: never put high-cardinality values (user IDs, request IDs, emails) in metric *labels*. Every distinct label combination becomes its own time series, so doing this explodes your metrics store and your bill along with it. Those values belong in logs and traces. Metric labels should be low-cardinality: status code, endpoint, region. This single rule prevents most observability cost blowups.
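A rough way to see why this matters is to multiply the number of distinct values per label. The sketch below computes that worst-case bound for one metric; the label names and counts are made-up numbers, and real systems usually see fewer series because not every combination occurs.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
low = {"status_code": 5, "endpoint": 20, "region": 4}
high = {**low, "user_id": 100_000}

print(series_count(low))   # 400
print(series_count(high))  # 40000000
```

Adding one high-cardinality label turns a few hundred series into tens of millions, which is the cost blowup the rule is guarding against.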
What to actually measure: the four golden signals
You could measure a thousand things. Google's SRE book distilled what matters for any user-facing service down to four. If you only instrument these, you'll catch the overwhelming majority of real problems:
Latency: how long requests take. Track p50, p95, p99, and crucially split *successful* from *failed* requests (fast failures pull the aggregate down and can hide slow successes).
Traffic: how much demand the system is under, such as requests per second or transactions per minute.
Errors: the rate of failed requests. Include both explicit failures (HTTP 500s) and implicit ones (wrong answers, policy violations).
Saturation: how "full" the system is, measured by CPU, memory, queue depth, or connection pool usage. It is the leading indicator of impending failure.
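The four signals above can be sketched as a small calculation over one window of request records. Everything here is illustrative: the `Request` type, the nearest-rank percentile helper, and the choice of connection-pool usage as the saturation measure are assumptions of this sketch, and a metrics library would normally compute these from histograms rather than raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_s: float,
                   in_use: int, capacity: int) -> dict[str, float]:
    """Summarise one time window. Raises ValueError if there are no
    requests, or no successful ones to take latency percentiles from."""
    if not requests:
        raise ValueError("no requests in window")
    # Latency is computed over successful requests only, so that fast
    # failures do not mask slow successes.
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    failed = sum(1 for r in requests if not r.ok)
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": failed / len(requests),
        "latency_ok_p50_ms": percentile(ok_latencies, 50),
        "latency_ok_p95_ms": percentile(ok_latencies, 95),
        "latency_ok_p99_ms": percentile(ok_latencies, 99),
        "saturation": in_use / capacity,
    }


# Hypothetical window: 97 successes (1..97 ms) and 3 fast failures.
sample = [Request(float(ms), True) for ms in range(1, 98)]
sample += [Request(5.0, False)] * 3
signals = golden_signals(sample, window_s=10, in_use=40, capacity=50)
print(signals)
```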
USE and RED: two cousins worth knowing
For resources (CPUs, disks), the USE method tracks Utilisation, Saturation, and Errors. For request-driven services, the RED method tracks Rate, Errors, and Duration. Both map closely onto the golden signals; pick the framing that fits what you're measuring.
SLI, SLO, SLA: measuring 'good enough' on purpose
Metrics tell you what's happening. SLOs tell you whether what's happening is acceptable, and they turn "is the site okay?" from a gut feeling into a number. The three terms get mixed up constantly, so here they are in plain language:
| Term | What it is | Example |
| --- | --- | --- |
| SLI | A measured indicator of service health | % of requests served in under 300ms |
| SLO | Your internal target for that SLI | 99.9% of requests under 300ms, monthly |
| SLA | A contractual promise to customers | 99.9% uptime or you get a credit |
An SLI is the measurement, an SLO is your internal target, an SLA is the external promise (with consequences).
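In code, the difference between the measurement and the target is small but worth seeing. This sketch computes the latency SLI from the table (share of requests under 300ms) and checks it against a 99.9% SLO; the function names and sample data are hypothetical.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 300.0) -> float:
    """SLI: fraction of requests served in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests to measure")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """SLO: the internal target the SLI is compared against."""
    return sli >= target


# Hypothetical month: 1,000 requests, 2 of them slower than 300ms.
month = [120.0] * 998 + [850.0, 1400.0]
sli = latency_sli(month)
print(sli, meets_slo(sli))  # 0.998 False
```

The SLA does not appear in the code at all: it is a contract built on top of numbers like these, usually set looser than the internal SLO.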
The genius idea that falls out of SLOs is the error budget. A 99.9% monthly SLO means you're *allowed* to be bad 0.1% of the time, roughly 43 minutes a month. That budget is a tool: if you have budget left, ship features fast. If you've burned it, freeze risky changes and spend on reliability. It ends the eternal feature-vs-stability argument with math instead of opinions.
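The 43-minute figure comes from simple arithmetic, sketched below for a 30-day window. The `release_policy` function and its "ship"/"freeze" outcomes are a deliberately simplified stand-in for whatever policy a team actually agrees on.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed 'badness' in the window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60


def release_policy(slo: float, bad_minutes: float, window_days: int = 30) -> str:
    """Simplified policy: ship while budget remains, otherwise freeze."""
    remaining = error_budget_minutes(slo, window_days) - bad_minutes
    return "ship" if remaining > 0 else "freeze"


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(release_policy(0.999, bad_minutes=10))   # ship
print(release_policy(0.999, bad_minutes=50))   # freeze
```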
Don't aim for 100%
Each extra '9' of availability costs exponentially more and buys diminishing value. 99.999% ("five nines") is ~5 minutes of downtime a *year*, heroic and rarely worth it. Set the SLO at the level your users actually need, then stop. Chasing 100% is how teams burn out without users noticing the difference.
Common mistakes that cost hours
Alerting on causes instead of symptoms. An alert on "CPU > 80%" pages you for something users may never feel. Alert on what users experience (error rate, latency SLO burn), and keep cause-level signals for investigation, not paging.
Logging unstructured text. `log.info("user " + id + " failed")` can't be filtered or aggregated. Emit structured JSON with consistent field names from day one; retrofitting it across a codebase is miserable.
High-cardinality metric labels. Putting user IDs or request IDs in metric labels explodes storage and cost. Keep labels low-cardinality; push the detail to logs and traces.
No trace IDs linking the pillars. Without a shared trace_id propagated across services and into logs, your three pillars are three islands and every investigation starts from scratch.
Alert fatigue. Hundreds of noisy, non-actionable alerts train people to ignore the pager. Every alert should be actionable and tied to user impact, or it shouldn't page.
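For the trace-ID mistake in particular, here is a minimal sketch of carrying one ID through a request and stamping it on every log line, using only the standard library. The `x-trace-id` header name, the lower-cased header dict, and the helper names are assumptions of this sketch; tracing libraries such as OpenTelemetry handle propagation for you with their own standardised headers.

```python
import contextvars
import logging
import uuid

# Holds the current request's trace ID; "-" means "no request in progress".
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "trace_id", default="-"
)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def start_request(incoming_headers: dict[str, str]) -> str:
    """Reuse the caller's trace ID if present, otherwise mint a new one.

    Assumes header names are already lower-cased; "x-trace-id" is a
    hypothetical header name chosen for this example.
    """
    trace_id = incoming_headers.get("x-trace-id") or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def outgoing_headers() -> dict[str, str]:
    """Headers to send on downstream calls so the ID keeps travelling."""
    return {"x-trace-id": trace_id_var.get()}


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
log = logging.getLogger("checkout")
log.handlers = [handler]
log.setLevel(logging.INFO)
log.propagate = False

start_request({"x-trace-id": "abc123"})
log.info("charging card")  # INFO trace_id=abc123 charging card
```

With the same ID on the trace, on downstream requests, and on each log line, an investigation can move between the three pillars without guessing which records belong together.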
Where to go next
The whole article in 6 lines
**Monitoring** tells you *that* something broke; **observability** lets you find out *why*, including failures you never predicted.
The three pillars: **metrics** (is it wrong, how much), **logs** (what exactly happened), **traces** (where the time/errors went).
Alert on **metrics**; investigate with **logs and traces**, and link them with a shared `trace_id`.
Instrument the **four golden signals**: latency, traffic, errors, saturation.
**SLIs** measure, **SLOs** set internal targets, **SLAs** are external promises; the **error budget** turns reliability into a decision.
Don't chase 100%; keep metric labels low-cardinality and every alert actionable.
Observability is the sense organ of a reliable system. Once you can see clearly, the next step is building systems that fail gracefully, and running the humans who respond when they do.
Get hands-on inspecting a live system: the kubectl Lab lets you query pod health and logs the way you would during an incident.
Instrument one service with the four golden signals this week. The next time it misbehaves, you'll be reading data instead of guessing.