Observability: Metrics, Logs & Traces (The Three Pillars)

The 2 a.m. page nobody can answer

It's 2 a.m. Your phone buzzes: "Checkout latency p99 above 3s." You open the dashboard. Yep, latency is high; the graph agrees with the alert. Now what? *Which* service is slow? Is it the database, a downstream payment API, a single bad host, or one specific customer hammering an endpoint? The dashboard tells you that something is wrong. It can't tell you why. So you start grepping logs across six services, eyeballing timestamps, and guessing.

That gap, between knowing something broke and understanding what broke, is exactly what observability exists to close. And it rests on three kinds of data your systems can emit: metrics, logs, and traces. Get the mental model for these three and most monitoring tools suddenly make sense.

Who this is for

Anyone who runs software in production and has stared at a dashboard with no idea what to click next: backend and cloud engineers, DevOps, and aspiring SREs. No prior tooling knowledge assumed. By the end you'll know what each pillar is, when to reach for which, and how they fit together.

What is observability?

Observability is the ability to understand the internal state of a system purely from the data it emits, so you can answer questions you never thought to ask in advance.
The working definition we'll use throughout this article

The key phrase is "questions you never thought to ask." Traditional monitoring is built around dashboards and alerts you set up ahead of time: you predict what might go wrong and watch for it. Observability is about having rich enough data that when something *new* goes wrong, you can interrogate the system and find the answer without shipping new code first. The three pillars are simply the three shapes that emitted data comes in.

Think of your service as a patient and yourself as the doctor on call.

Vital signs on the monitor (heart rate, blood pressure, temperature, sampled every few seconds) → Metrics: cheap numbers over time, such as requests/sec, error rate, CPU, and p99 latency.

The nurse's notes ("patient reported chest pain at 02:14, gave aspirin") → Logs: timestamped records of discrete events with detail and context.

The full diagnostic journey (ER → bloodwork → cardiology → imaging, each step timed) → Traces: the path of one request across every service, with timing at each hop.

Vitals tell you *that* the patient is unwell and how unwell. Notes tell you *what specifically* happened and when. The diagnostic journey shows you *where in the process* time was lost. You need all three to actually treat the patient. In practice you start with the vitals, follow the journey to find where things went wrong, and then read the notes to learn why.

The picture: how telemetry flows

Here's the shape of a typical observability setup. A single service emits all three signal types; they flow to a backend that stores and indexes them; and you consume them through dashboards, alerts, and ad-hoc queries.

One service emits three signal types → a backend ingests and stores them → humans consume them three ways.

1. Instrument the service. Add a library, an agent, or the OpenTelemetry SDK that records numbers, writes structured events, and tags each request with a trace ID. This is the one-time setup cost.
2. Emit three signal types. On every request the service bumps counters (metrics), writes a JSON log line, and records spans for each operation it performs.
3. Ship to a backend. Telemetry is exported, usually over the network, to a backend that ingests, stores, and indexes it. Each signal type is often stored differently (time-series DB for metrics, log store for logs, trace store for spans).
4. Alert on metrics. Cheap numeric thresholds fire pages: "error rate > 1%". Metrics are what you alert on because they're fast and cheap to evaluate continuously.
5. Investigate with logs and traces. Once an alert fires, you pivot: open the trace for a slow request to find the slow hop, then read the logs from that service at that moment for the why.
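
The alerting step can be sketched in a few lines. This is a minimal illustration, not any particular backend's alerting engine: `CounterSnapshot`, `error_rate`, and `should_page` are hypothetical names, and the counter values are made up. It computes the error rate over a window from two readings of ever-increasing counters and compares it to the 1% threshold.

```python
from dataclasses import dataclass


@dataclass
class CounterSnapshot:
    """Hypothetical point-in-time reading of two request counters."""
    total: int
    errors: int


def error_rate(prev: CounterSnapshot, curr: CounterSnapshot) -> float:
    """Fraction of requests that failed between two snapshots."""
    requests = curr.total - prev.total
    if requests <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return (curr.errors - prev.errors) / requests


def should_page(prev: CounterSnapshot, curr: CounterSnapshot,
                threshold: float = 0.01) -> bool:
    """True when the windowed error rate exceeds the threshold (1% by default)."""
    return error_rate(prev, curr) > threshold


# Made-up readings taken one evaluation interval apart.
prev = CounterSnapshot(total=10_000, errors=40)
curr = CounterSnapshot(total=12_000, errors=90)

print(error_rate(prev, curr))   # 50 errors over 2000 requests -> 0.025
print(should_page(prev, curr))  # True
```

The check needs only two numbers per evaluation, which is why thresholds like this are cheap enough to run continuously.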

The three pillars side by side

The single most useful thing to internalize is what each pillar is *good at* and what it *costs*. The big lever is cardinality: how many distinct label combinations a signal carries. Metrics must stay low-cardinality to stay cheap; logs and traces can be high-cardinality because each event is stored once and you fetch only the ones you need.

	Metrics	Logs	Traces
What it answers	Is something wrong, and how much?	What exactly happened in this event?	Where in the request did time/errors occur?
Shape of data	Numbers over time (aggregated)	Timestamped event records	A tree of timed spans per request
Cardinality	Low (keep labels bounded)	High (any field is fine)	High (one trace per request)
Cost	Cheap to store & query at scale	Moderate to high (volume)	High per-trace; usually sampled
Best for	Dashboards, alerting, SLOs, trends	Debugging a specific event, audit	Latency breakdown across services

Reach for the pillar whose "best for" matches the question in your head.

Cardinality is where bills explode

Adding a label like `user_id` or `request_id` to a **metric** multiplies its time series by the number of distinct values. The result is millions of series, a runaway bill, and a slow backend. Those high-cardinality identifiers belong on **logs and traces**, not metrics. This one rule prevents most observability cost disasters.
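
The multiplication is easy to check with a little arithmetic. The sketch below is illustrative only: `series_count` is a hypothetical helper and the label counts are invented. It treats the worst-case number of time series for one metric as the product of the distinct values of each label.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Invented but plausible bounded labels.
bounded = {"status": 3, "region": 5, "endpoint": 20}
print(series_count(bounded))    # 300

# The same metric after someone adds a per-user label.
with_user = {**bounded, "user_id": 1_000_000}
print(series_count(with_user))  # 300000000
```

One unbounded label turns a few hundred series into hundreds of millions, which is the blow-up described above.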

What they look like in code

Concretely, here's one request handler emitting all three signals. A metric counts and times the request, a structured log records what happened, and a trace span captures the timing of a sub-operation. Note how the same trace_id appears on both the span and the log line; that's the glue that ties them together.

checkout.py

python

import time, logging, json
from opentelemetry import trace, metrics

logging.basicConfig(level=logging.INFO)   # without this, INFO records are dropped by default
logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout")
meter = metrics.get_meter("checkout")

# --- METRIC: a counter and a latency histogram (low cardinality) ---
requests = meter.create_counter("checkout_requests_total")
latency = meter.create_histogram("checkout_latency_ms")

def handle_checkout(order):
    start = time.perf_counter()                           # monotonic clock, safe for durations
    # --- TRACE: a span wraps the work; child spans nest inside ---
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order.id)          # high-cardinality is OK here
        ctx = span.get_span_context()
        trace_id = format(ctx.trace_id, "032x")           # 32 hex chars, the usual trace ID form

        with tracer.start_as_current_span("charge_payment"):
            charge(order)                                  # timed sub-operation; charge() is defined elsewhere

        elapsed = (time.perf_counter() - start) * 1000    # milliseconds
        requests.add(1, {"status": "ok"})                 # bounded label only
        latency.record(elapsed, {"status": "ok"})

        # --- LOG: a structured event, correlated by trace_id ---
        logger.info(json.dumps({
            "event": "checkout_completed",
            "order_id": order.id,
            "amount": order.amount,
            "latency_ms": round(elapsed, 1),
            "trace_id": trace_id,
        }))

Read the three differently. The metric lines (requests.add, latency.record) feed dashboards and alerts; they only carry the bounded status label. The log line is a single JSON object you can grep, filter, and parse on any field. The span records how long charge_payment took *inside* the overall request. When you put trace_id on the log too, you can jump from a slow trace straight to its log lines, or vice versa.

Structured logs over print statements

Emit logs as JSON (key/value), not free-form strings. `logger.info("checkout done for " + id)` is unsearchable at scale; a JSON object lets the backend index every field so you can query `order_id=42 AND latency_ms>2000` in one line.
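
To make the querying concrete, here is a small stand-in for what a log backend does with structured lines. It is a sketch under stated assumptions: the log lines and trace IDs are made up (the fields mirror the checkout log event above), `parse` is a hypothetical helper, and a real backend would use an index rather than a linear scan.

```python
import json

# Made-up log lines: three JSON events and one free-form string.
raw_lines = [
    '{"event": "checkout_completed", "order_id": 42, "latency_ms": 2310.5, '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    '{"event": "payment_charged", "order_id": 42, "charge_ms": 2105.0, '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    '{"event": "checkout_completed", "order_id": 43, "latency_ms": 120.0, '
    '"trace_id": "0af7651916cd43dd8448eb211c80319c"}',
    'checkout done for 44',
]


def parse(lines):
    """Yield each line that parses as JSON; skip free-form lines."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # the free-form line cannot be queried by field


events = list(parse(raw_lines))

# Equivalent of: order_id=42 AND latency_ms>2000
slow_order_42 = [
    e for e in events
    if e.get("order_id") == 42 and e.get("latency_ms", 0) > 2000
]

# Pivot: every log line that shares the slow request's trace_id.
trace_id = slow_order_42[0]["trace_id"]
same_trace = [e for e in events if e.get("trace_id") == trace_id]

print([e["event"] for e in same_trace])  # ['checkout_completed', 'payment_charged']
```

The free-form line is silently lost to the query, which is the practical cost of unstructured logging.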

Monitoring vs observability

These words get used interchangeably, but the distinction is real and worth holding onto. Monitoring answers known questions. You decide in advance what matters (error rate, CPU, queue depth), build a dashboard and an alert, and watch. It's excellent for the failures you can anticipate.

Observability answers new questions. When something breaks in a way you *didn't* predict, can you slice the data to find the cause without first deploying new instrumentation? That's the test. If the only way to debug an incident is to add logging and redeploy, you have monitoring, not observability.
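
"Slicing the data" can be as simple as grouping existing events by a field you pick at query time. The sketch below is only an illustration: the events, field names, and the `slice_slow` helper are all hypothetical. The point is that the same stored events answer "is it one host?" and "is it one customer?" without any new instrumentation.

```python
from collections import Counter

# Made-up request events with several high-cardinality fields.
events = [
    {"endpoint": "/checkout", "host": "web-1", "customer": "acme", "latency_ms": 3200},
    {"endpoint": "/checkout", "host": "web-2", "customer": "acme", "latency_ms": 2900},
    {"endpoint": "/checkout", "host": "web-1", "customer": "acme", "latency_ms": 3100},
    {"endpoint": "/checkout", "host": "web-1", "customer": "globex", "latency_ms": 140},
    {"endpoint": "/checkout", "host": "web-2", "customer": "initech", "latency_ms": 95},
    {"endpoint": "/cart", "host": "web-2", "customer": "globex", "latency_ms": 60},
]


def slice_slow(events, field, threshold_ms=2000):
    """Count slow events grouped by a field chosen at query time."""
    return Counter(
        e.get(field, "unknown") for e in events if e["latency_ms"] > threshold_ms
    )


print(slice_slow(events, "host").most_common())      # [('web-1', 2), ('web-2', 1)]
print(slice_slow(events, "customer").most_common())  # [('acme', 3)]
```

Sliced by host, the slowness is spread out; sliced by customer, it all lands on one name. Nobody had to predict that question when the events were emitted.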

Monitoring: a smoke detector, wired to one known signal, that alarms when it trips → predefined dashboards + alerts on metrics you chose ahead of time.

Observability: a detective at the scene, who asks whatever the evidence suggests → rich, high-cardinality logs + traces you can query in any direction after the fact.

Same data, different mode of use.

They're not rivals; observability is the superset. You still build dashboards and alerts (that's the monitoring layer, almost always on metrics). But you back them with logs and traces detailed enough to investigate the unknown. A practical way to scope your alerts is the Four Golden Signals: latency, traffic, errors, and saturation.

Common mistakes that cost hours

Alerting on logs instead of metrics. Scanning log volume to fire alerts is slow and expensive. Alert on cheap numeric metrics; use logs to investigate *after* the page.
High-cardinality labels on metrics. Putting user_id, order_id, or request_id on a metric explodes your time-series count and your bill. Those fields belong on logs and traces.
Unstructured logs. Free-text log lines can't be queried or aggregated. Emit JSON with consistent field names from day one.
No correlation IDs. Without a shared trace_id across services and on your logs, you can't connect a slow trace to the log line that explains it. Propagate it everywhere.
Tracing nothing or everything. No traces means no cross-service latency breakdown; 100% sampling at scale is ruinously expensive. Sample intelligently (e.g. keep all errors, sample the rest).
Treating dashboards as the whole job. A wall of green graphs feels safe but only covers the questions you already thought of. Real incidents are usually the ones you didn't.
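
The "keep all errors, sample the rest" rule from the tracing item can be sketched as a single decision function. This is one possible approach, not a specific vendor's sampler: `keep_trace` is a hypothetical name, the 10% rate is arbitrary, and the sketch assumes the decision is made after the request has finished, so the error outcome is already known. Deriving the decision from the trace ID rather than a random draw means every service reaches the same verdict for the same trace.

```python
def keep_trace(trace_id: str, is_error: bool, sample_rate: float = 0.1) -> bool:
    """Keep every failed trace; keep a fixed fraction of successful ones."""
    if is_error:
        return True
    # Map the last 8 hex digits of the trace ID to a number in [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < sample_rate


# Made-up 32-hex-character trace IDs.
trace_a = "4bf92f3577b34da6a3ce929d0e0e4736"
trace_b = "0af7651916cd43dd8448eb211c80319c"

print(keep_trace(trace_a, is_error=False))  # True  (bucket is about 0.055)
print(keep_trace(trace_b, is_error=False))  # False (bucket is about 0.111)
print(keep_trace(trace_b, is_error=True))   # True  (errors are always kept)
```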

Takeaways

The whole article in seven lines

Observability = understanding a system's internal state from the data it emits, including for questions you didn't plan for.
**Metrics** = cheap numbers over time. Answer "is something wrong, how much?" → dashboards, alerts, SLOs.
**Logs** = structured timestamped events. Answer "what exactly happened?" → debugging a specific case.
**Traces** = the path of one request across services. Answer "where did the time/error go?" → latency breakdown.
Reach for them in order: metrics tell you *that*, traces tell you *where*, logs tell you *why*.
Keep metrics low-cardinality; put high-cardinality IDs on logs and traces; correlate everything with a `trace_id`.
Monitoring answers known questions; observability lets you ask new ones. You need both.

Where to go next

You now have the data model. The next steps are learning the standard way to emit these signals, going deep on the pillar that closes the most incidents, and scoping what to alert on.

OpenTelemetry in Practice: the vendor-neutral standard for emitting metrics, logs, and traces from any language.
Distributed Tracing: how spans propagate across services so you can follow one request end to end.
The Four Golden Signals: the four metrics worth alerting on for any user-facing service.
Follow the full SRE career path to build observability, alerting, and reliability skills in order.
