Observability
How you see what’s wrong and where, using metrics, logs, and traces.
Metrics, logs, and traces
Observability = answering "what’s wrong and where?" without redeploying. Three building blocks: metrics, logs, and traces.
Metrics = numbers over time. A counter (e.g. total requests). A gauge (e.g. CPU %). A histogram (e.g. how many requests finished under 200 ms). You put them on dashboards and set alerts (e.g. "error_rate > 1% for 5 min" → Slack).
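For example, with the Prometheus Python client you might define the three metric types like this (a minimal sketch; metric names, labels, and buckets are illustrative):

```python
# Minimal sketch using the prometheus_client library (pip install prometheus-client).
# The metric names, labels, and buckets are illustrative, not prescribed by the lesson.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path"])
CPU_PERCENT = Gauge("cpu_percent", "Current CPU usage in percent")
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request duration in seconds",
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0),
)

def handle_order_request():
    REQUESTS.labels(path="/api/orders").inc()  # counter: only ever goes up
    with LATENCY.time():                       # histogram: records this request's duration
        ...                                    # do the actual work here

if __name__ == "__main__":
    CPU_PERCENT.set(45.0)    # gauge: a value that can go up or down
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```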
Logs = one line per event. Each line has timestamp, level (info/warn/error), message, and often request_id. You search: "level=error last hour" or "request_id=abc" to see one user’s path.
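A structured-logging sketch using only Python's standard library (the JsonFormatter class here is something you would write yourself or pull from a library such as python-json-logger; field names follow the example above):

```python
# Structured-logging sketch using only the standard library; real services often use
# python-json-logger or structlog instead. Field names follow the example above.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            # fields passed via extra={...} land on the record as attributes
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment processed", extra={"request_id": "req-7f3a"})
logger.error("payment declined", extra={"request_id": "req-7f3a"})
# each call prints one JSON line you can search by level or request_id
```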
Traces = one path per request. A trace is a list of spans. Each span = one step (e.g. "checkout API", "tax service", "DB") with a duration. The UI shows a timeline; the longest span is the bottleneck. You see "tax service took 8 s" instead of "checkout is slow".
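A tracing sketch with the OpenTelemetry Python SDK, printing spans to the console; span names mirror the example above, and a real setup would export to a backend such as Jaeger or Tempo:

```python
# Tracing sketch with the OpenTelemetry Python SDK (pip install opentelemetry-sdk),
# printing finished spans to the console; production setups export to Jaeger, Tempo, etc.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def checkout():
    with tracer.start_as_current_span("checkout API"):     # parent span for the request
        with tracer.start_as_current_span("tax service"):  # child span with its own duration
            ...                                            # call the tax API
        with tracer.start_as_current_span("DB"):
            ...                                            # write the order

checkout()  # each span's start and end time is printed; the longest one is the bottleneck
```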
The three pillars of observability
Metrics
What is happening?
Aggregated numbers over time—trends, rates, percentiles.
- Error rate: 0.1%
- Latency p99: 200 ms
- CPU: 45%
Logs
When did it happen?
Event records with timestamps—who did what, and when.
- User logged in
- Payment processed
- Error: timeout
Traces
Why did it happen?
Request journey across services—find bottlenecks and root cause.
- API → Auth → DB
- Slow span: tax API
- Distributed debugging
Example: Distributed trace (one request)
Traces show where time is spent across services—e.g. DB span is longest → optimize or cache.
The power of observability
Metrics tell you what is happening. Logs tell you when it happened. Traces tell you why—following a request through services to find the root cause.
Examples: what you actually see
Metrics:
http_requests_total{path="/api/orders"} 1523
http_request_duration_seconds_bucket{le="0.2"} 1400
Grafana graphs these.
{"level":"error","msg":"payment declined","request_id":"req-7f3a","gateway":"stripe"}Search by request_id or level=error.
Traces: the Tax API span is the longest, so it is the bottleneck. Without a trace you’d only see "the API is slow".
Dashboards and alerts
Example dashboard panels:
- Requests/s: 1,240
- Error rate: 0.2%
- p99 latency: 180 ms
- Errors (24h): 42
Panels: RPS, error count, latency (p50/p95/p99), breakdowns by service or endpoint.
Alerts fire when a condition is met (e.g. error_rate > 1% for 5 min or p99_latency > 2 s) and send a notification to Slack, PagerDuty, or email.
Then use logs and traces to find the root cause.
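As a minimal sketch of that alert check in Python, assuming a hypothetical get_error_rate() that returns the error rate over the last 5 minutes and a Slack incoming-webhook URL; real setups usually delegate this to Prometheus Alertmanager or a hosted equivalent:

```python
# Sketch of the alert check itself; real setups usually use Prometheus + Alertmanager.
# get_error_rate() and SLACK_WEBHOOK_URL are placeholders you would supply.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def check_error_rate(get_error_rate):
    rate = get_error_rate()  # fraction of failed requests over the last 5 minutes
    if rate > 0.01:          # alert condition: error_rate > 1% for 5 min
        payload = {"text": f"ALERT: error rate {rate:.2%} over the last 5 minutes"}
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # Slack incoming webhooks accept this JSON body
```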
Flow: something’s wrong → debug
Metrics tell you something is wrong. Logs and traces help you find which request and which span failed or timed out.
Real-world scenario: slow checkout with "healthy" servers
Expert scenario. What happened: "Pay now" takes 10 seconds, yet dashboards show CPU and memory are fine; everything looks "healthy".
What you do: Open a trace for a slow checkout. You see: API 20 ms → Auth 30 ms → Tax API 9,200 ms → DB 50 ms. The tax API is the bottleneck. Fix: cache tax results, add a 3 s timeout + fallback, or move tax to a background job. Checkout gets fast again.
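A sketch of the cache + timeout + fallback fix in Python, under the assumption that fetch_tax is your existing call to the external tax API:

```python
# Sketch of the fix: cache + 3-second timeout + fallback around the slow tax call.
# fetch_tax stands in for your existing tax-API client; FALLBACK_RATE is a placeholder
# for whatever your business allows (e.g. recompute the real tax in a background job).
import concurrent.futures

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_TAX_CACHE: dict[str, float] = {}
FALLBACK_RATE = 0.0

def tax_rate(country: str, fetch_tax, timeout_s: float = 3.0) -> float:
    """Return a tax rate without blocking checkout for more than timeout_s."""
    if country in _TAX_CACHE:                    # cache hit: no network call at all
        return _TAX_CACHE[country]
    future = _POOL.submit(fetch_tax, country)    # call the slow tax API off-thread
    try:
        rate = future.result(timeout=timeout_s)  # wait at most 3 s
    except concurrent.futures.TimeoutError:
        return FALLBACK_RATE                     # fallback keeps checkout fast
    _TAX_CACHE[country] = rate
    return rate
```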
Ready to see how this works in the cloud?
Switch to Career Paths on the Academy page for structured paths (e.g. Developer, DevOps) and provider-specific lessons.