Observability
How you see what’s wrong and where, using metrics, logs, and traces.
Metrics, logs, and traces
Observability = answering "what’s wrong and where?" without redeploying. Three building blocks: metrics, logs, and traces.
Metrics = numbers over time. A counter (e.g. total requests). A gauge (e.g. CPU %). A histogram (e.g. how many requests finished under 200 ms). You put them on dashboards and set alerts (e.g. "error_rate > 1% for 5 min" → Slack).
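For example, with the Prometheus Python client you might define the three metric types like this (a minimal sketch; metric names, labels, and buckets are illustrative):

```python
# Minimal sketch using the prometheus_client library (pip install prometheus-client).
# The metric names, labels, and buckets are illustrative, not prescribed by the lesson.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path"])
CPU_PERCENT = Gauge("cpu_percent", "Current CPU usage in percent")
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request duration in seconds",
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

def handle_order_request():
    REQUESTS.labels(path="/api/orders").inc()  # counter: only ever goes up
    with LATENCY.time():                       # histogram: records this request's duration
        ...                                    # do the actual work here

if __name__ == "__main__":
    CPU_PERCENT.set(45.0)    # gauge: a value that can go up or down
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```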
Logs = one line per event. Each line has timestamp, level (info/warn/error), message, and often request_id. You search: "level=error last hour" or "request_id=abc" to see one user’s path.
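A structured-logging sketch using only Python's standard library (the JsonFormatter class here is something you would write yourself or pull from a library such as python-json-logger; field names follow the example above):

```python
# Structured-logging sketch using only the standard library; real services often use
# python-json-logger or structlog instead. Field names follow the example above.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            # fields passed via extra={...} land on the record as attributes
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment processed", extra={"request_id": "req-7f3a"})
logger.error("payment declined", extra={"request_id": "req-7f3a"})
# each call prints one JSON line you can search by level or request_id
```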
Traces = one path per request. A trace is a list of spans. Each span = one step (e.g. "checkout API", "tax service", "DB") with a duration. The UI shows a timeline; the longest span is the bottleneck. You see "tax service took 8 s" instead of "checkout is slow".
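A tracing sketch with the OpenTelemetry Python SDK, printing spans to the console; span names mirror the example above, and a real setup would export to a backend such as Jaeger or Tempo:

```python
# Tracing sketch with the OpenTelemetry Python SDK (pip install opentelemetry-sdk),
# printing finished spans to the console; production setups export to Jaeger, Tempo, etc.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def checkout():
    with tracer.start_as_current_span("checkout API"):     # parent span for the request
        with tracer.start_as_current_span("tax service"):  # child span with its own duration
            ...                                            # call the tax API
        with tracer.start_as_current_span("DB"):
            ...                                            # write the order

checkout()  # each span's start and end time is printed; the longest one is the bottleneck
```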
The three pillars of observability
Metrics
What is happening?
Aggregated numbers over time—trends, rates, percentiles.
- Error rate: 0.1%
- Latency p99: 200 ms
- CPU: 45%
Logs
When did it happen?
Event records with timestamps—who did what, and when.
- User logged in
- Payment processed
- Error: timeout
Traces
Why did it happen?
Request journey across services—find bottlenecks and root cause.
- API → Auth → DB
- Slow span: tax API
- Distributed debugging
Example: Distributed trace (one request)
Traces show where time is spent across services—e.g. DB span is longest → optimize or cache.
The power of observability
Metrics tell you what is happening. Logs tell you when it happened. Traces tell you why—following a request through services to find the root cause.
Examples: what you actually see
Metrics:
http_requests_total{path="/api/orders"} 1523
http_request_duration_seconds_bucket{le="0.2"} 1400
Grafana graphs these.
{"level":"error","msg":"payment declined","request_id":"req-7f3a","gateway":"stripe"}Search by request_id or level=error.
Traces: the Tax API span is the longest, so it is the bottleneck. Without a trace you’d only see "the API is slow".
Dashboards and alerts
Example dashboard panels:
- Requests/s: 1,240
- Error rate: 0.2%
- p99 latency: 180 ms
- Errors (24h): 42
Panels: RPS, error count, latency (p50/p95/p99), breakdowns by service or endpoint.
Alerts fire when a condition is met (e.g. error_rate > 1% for 5 min or p99_latency > 2 s) and send a notification to Slack, PagerDuty, or email.
Then use logs and traces to find the root cause.
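As a minimal sketch of that alert check in Python, assuming a hypothetical get_error_rate() that returns the error rate over the last 5 minutes and a Slack incoming-webhook URL; real setups usually delegate this to Prometheus Alertmanager or a hosted equivalent:

```python
# Sketch of the alert check itself; real setups usually use Prometheus + Alertmanager.
# get_error_rate() and SLACK_WEBHOOK_URL are placeholders you would supply.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def check_error_rate(get_error_rate):
    rate = get_error_rate()  # fraction of failed requests over the last 5 minutes
    if rate > 0.01:          # alert condition: error_rate > 1% for 5 min
        payload = {"text": f"ALERT: error rate {rate:.2%} over the last 5 minutes"}
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # Slack incoming webhooks accept this JSON body
```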
Flow: something’s wrong → debug
Metrics tell you something is wrong. Logs and traces help you find which request and which span failed or timed out.
Real-world scenario: slow checkout with "healthy" servers
Expert scenario. What happened: "Pay now" takes 10 seconds, yet dashboards show CPU and memory are fine; everything looks "healthy".
What you do: Open a trace for a slow checkout. You see: API 20 ms → Auth 30 ms → Tax API 9,200 ms → DB 50 ms. The tax API is the bottleneck. Fix: cache tax results, add a 3 s timeout + fallback, or move tax to a background job. Checkout gets fast again.
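A sketch of the cache + timeout + fallback fix in Python, under the assumption that fetch_tax is your existing call to the external tax API:

```python
# Sketch of the fix: cache + 3-second timeout + fallback around the slow tax call.
# fetch_tax stands in for your existing tax-API client; FALLBACK_RATE is a placeholder
# for whatever your business allows (e.g. recompute the real tax in a background job).
import concurrent.futures

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_TAX_CACHE: dict[str, float] = {}
FALLBACK_RATE = 0.0

def tax_rate(country: str, fetch_tax, timeout_s: float = 3.0) -> float:
    """Return a tax rate without blocking checkout for more than timeout_s."""
    if country in _TAX_CACHE:                    # cache hit: no network call at all
        return _TAX_CACHE[country]
    future = _POOL.submit(fetch_tax, country)    # call the slow tax API off-thread
    try:
        rate = future.result(timeout=timeout_s)  # wait at most 3 s
    except concurrent.futures.TimeoutError:
        return FALLBACK_RATE                     # fallback keeps checkout fast
    _TAX_CACHE[country] = rate
    return rate
```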
Ready to see how this works in the cloud?
Switch to Career Paths on the Academy page for structured paths (e.g. Developer, DevOps) and provider-specific lessons.