Log Aggregation
Centralized logging collects logs from all services into a single searchable system, replacing the chaos of SSH-ing into individual servers with structured, queryable, correlated log data that powers incident response and root cause analysis.
What you'll learn
- Centralized logging replaces SSH-based debugging with a single queryable system: essential for any architecture with more than one service
- Structured JSON logs make every field a query target; unstructured text requires fragile regex parsing. Always log in JSON.
- Trace IDs (correlation IDs) propagated through all service calls are the single most impactful addition to a microservices logging setup: they enable cross-service request tracing
- Log shippers (Fluent Bit for Kubernetes) collect and forward logs; backends (Loki, Elasticsearch, CloudWatch) store and query them. Choose the shipper and backend separately.
- Log retention and cost management are mandatory: set retention policies from day one, filter health-check logs at the shipper, and use tiered storage for compliance requirements
Why Centralized Logging: The SSH-to-40-Servers Problem
Without centralized logging, investigating a production incident means SSH-ing into each server and grepping through log files. For a single service on one server, this is inconvenient. For a microservices architecture with dozens of services across hundreds of instances, it is impossible.
What "Flying Blind" Looks Like
At 3 AM, your monitoring alerts that 20% of API requests are failing. Without centralized logging: you SSH into Server 1 (nothing obvious), SSH into Server 2 (different error message), try to figure out if the errors are correlated, check if it is the same user, realize you need to grep 8 different servers and correlate timestamps manually. By the time you find the root cause, an hour has passed. With centralized logging: one query in Kibana or Grafana, filter by error code, group by service, correlate the trace ID. Root cause in 5 minutes.
Centralized logging ships log output from every service and instance to a single system where it can be searched, filtered, and correlated. The key components are: log shippers (Fluentd, Fluent Bit, Logstash, Vector) that collect and forward logs, and storage/query backends (Elasticsearch, Loki, CloudWatch, Datadog) that index and serve them.
What you gain from centralized logging
- Cross-service correlation: a single request may touch 8 services. With a shared trace ID in logs, one query shows the full journey and exactly where it failed.
- Historical analysis: when did this error first appear? Was there a deployment at that time? Centralized logs with timestamps answer these questions instantly.
- Dashboards and alerting: error rate, 5xx count, and slow queries are all visible on dashboards without SSH. Alert on log patterns (error rate spike, specific exception) via PagerDuty.
- Security and compliance: an audit trail for who accessed what, when. Compliance requirements (HIPAA, PCI, SOC2) often mandate centralized log retention.
- Scale-independent: whether you have 2 servers or 2,000, the workflow is the same: query the log system. SSH-based debugging does not scale.
Log Shipping Architectures: Fluentd, Fluent Bit, and Logstash
Log shippers are the agents that collect logs from their sources (application stdout, log files, system journals) and forward them to your storage backend. Choosing the right shipper affects resource consumption, pipeline complexity, and reliability.
| Shipper | Resource usage | Best for | Deployment model |
|---|---|---|---|
| Fluent Bit | Very low (~45MB RAM) | Kubernetes pods, edge devices, resource-constrained environments | DaemonSet in Kubernetes; sidecar for specific pods |
| Fluentd | Moderate (~40-300MB RAM) | Complex log routing, transformation, fan-out to multiple destinations | DaemonSet or standalone agent; aggregator layer |
| Logstash | High (~500MB+ RAM) | Complex ETL pipelines with Elasticsearch; filter and transform heavy workloads | Standalone service; often behind a message queue (Kafka) |
| Vector | Low (~30MB RAM) | High-performance log processing; modern alternative to Fluentd/Logstash | DaemonSet or standalone; Rust-based, very fast |
| CloudWatch Agent | Low | AWS workloads; native CloudWatch Logs integration | EC2 agent; ECS sidecar; Lambda auto-instrumentation |
Use Fluent Bit in Kubernetes
In Kubernetes, deploy Fluent Bit as a DaemonSet (one pod per node). It reads container logs from /var/log/containers/, adds Kubernetes metadata (namespace, pod name, labels), and forwards to your backend. It is lightweight enough to run on every node without impacting application performance. Fluentd can serve as an aggregator layer if you need complex transformation before the backend.
Common Log Shipping Architecture in Kubernetes
1. The application writes to stdout/stderr; the container runtime persists it under /var/log/containers/.
2. The Fluent Bit DaemonSet reads /var/log/containers/.
3. Fluent Bit adds Kubernetes metadata (pod, namespace, labels).
4. Fluent Bit forwards to Loki / Elasticsearch / CloudWatch.
5. Grafana / Kibana / CloudWatch Logs Insights queries the backend.
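As a sketch, a Fluent Bit DaemonSet configuration implementing this pipeline might look like the following. The Loki host, tag names, and label values are assumptions to adjust for your cluster:

```ini
# INPUT: tail container logs written by the runtime
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    Parser            cri

# FILTER: enrich records with Kubernetes metadata (pod, namespace, labels)
[FILTER]
    Name              kubernetes
    Match             kube.*
    Merge_Log         On

# FILTER: drop health-check noise before it reaches the backend
[FILTER]
    Name              grep
    Exclude           log /healthz

# OUTPUT: forward to Loki (hypothetical in-cluster address)
[OUTPUT]
    Name              loki
    Match             kube.*
    Host              loki.monitoring.svc.cluster.local
    Port              3100
    Labels            job=fluent-bit
```

Swapping the output stanza is how you change backends (es for Elasticsearch, cloudwatch_logs for CloudWatch) without touching the application.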
Structured vs Unstructured Logs: JSON Wins
Unstructured logs are plain text: "2024-01-15 14:23:01 ERROR PaymentService failed to process order 12345 for user john@example.com". They are human-readable but machine-unfriendly: querying requires regex, filtering is slow, and extending the schema requires changing grep patterns across all your monitoring.
Log in JSON โ Always
Structured logs (JSON) make every field a first-class query target:

```json
{
  "timestamp": "2024-01-15T14:23:01Z",
  "level": "error",
  "service": "payments",
  "message": "payment processing failed",
  "order_id": "12345",
  "user_id": "usr_456",
  "error_code": "INSUFFICIENT_FUNDS",
  "duration_ms": 234,
  "trace_id": "abc-def-789"
}
```

Now you can filter by error_code, aggregate by service, correlate by trace_id, alert on level:error AND service:payments, and build dashboards on duration_ms percentiles. None of this requires parsing text.
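A minimal structured logger can be little more than a JSON serializer over stdout. The sketch below is illustrative; the service name and field set are assumptions, not a prescribed schema:

```python
import json
import sys
from datetime import datetime, timezone

def log(level, message, **fields):
    """Emit one JSON object per line to stdout, where a shipper
    such as Fluent Bit can collect it from the container runtime."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "level": level,
        "service": "payments",  # hypothetical service name
        "message": message,
        **fields,  # order_id, trace_id, duration_ms, ...
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record  # returned for inspection/testing

log("error", "payment processing failed",
    order_id="12345", error_code="INSUFFICIENT_FUNDS",
    duration_ms=234, trace_id="abc-def-789")
```

In practice you would reach for a library (structlog, python-json-logger, zap, logrus) rather than hand-rolling this, but the output contract is the same: one JSON object per line.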
Essential fields for every log line
- timestamp: ISO 8601 UTC (2024-01-15T14:23:01.234Z). Never local time; correlating across timezones is a nightmare.
- level: debug, info, warn, error, fatal. Use consistently across all services.
- service / app: which service emitted this log. Essential for filtering in centralized systems.
- trace_id / request_id: a unique ID that follows a request across all services. Copy from incoming headers (X-Trace-ID) or generate at the edge. This is the key to cross-service correlation.
- user_id: when relevant, include the user ID. Answers "was this error isolated to one user or affecting everyone?" immediately.
- duration_ms: for any operation with a significant duration (database queries, external API calls). Powers latency dashboards without separate instrumentation.
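Trace ID propagation is mechanical: reuse the ID an upstream caller sent, or mint one at the edge. A sketch, assuming an X-Trace-ID header (some stacks use the W3C traceparent header instead):

```python
import uuid

TRACE_HEADER = "X-Trace-ID"  # assumed header name

def ensure_trace_id(headers):
    """Return the caller's trace ID if present, else mint a new one at the edge."""
    return headers.get(TRACE_HEADER) or str(uuid.uuid4())

def outgoing_headers(incoming_headers):
    """Build headers for downstream calls: the same trace ID flows onward,
    so every service in the chain logs the same trace_id for this request."""
    return {**incoming_headers, TRACE_HEADER: ensure_trace_id(incoming_headers)}
```

Attach the same ID to every log line the service emits while handling the request, and one backend query on trace_id reconstructs the whole call chain.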
Never Log PII or Secrets
Logs are often retained for 30-90 days and accessible to many engineers. Logging passwords, credit card numbers, SSNs, or full JWT tokens creates a compliance and security liability. Log user IDs and order IDs, not personal data. Use token scrubbing middleware if your framework does not handle this automatically.
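If your framework offers nothing, a last-resort scrubber can run over messages before they are emitted. The patterns below are illustrative, not exhaustive:

```python
import re

# Illustrative patterns only; production scrubbers cover many more formats.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
]

def scrub(message):
    """Replace PII-looking substrings before the message is logged."""
    for pattern, replacement in PATTERNS:
        message = pattern.sub(replacement, message)
    return message
```

Regex scrubbing is a safety net, not a substitute for the real fix: keep personal data out of log statements in the first place and log opaque IDs instead.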
ELK vs Loki vs CloudWatch vs Datadog
Choosing a log aggregation backend is a decision that will be hard to reverse: log schemas, query patterns, and alerting all depend on the backend. The four dominant options each have different trade-offs.
| System | Index model | Query language | Cost model | Best for |
|---|---|---|---|---|
| ELK Stack (Elasticsearch + Logstash + Kibana) | Full-text indexes all fields | Lucene/KQL (powerful) | High storage cost; self-managed is complex | Teams needing full-text search, complex aggregations, existing Elastic expertise |
| Grafana Loki | Labels only (like Prometheus for logs) | LogQL (similar to PromQL) | Very low storage cost (only indexes labels, not full text) | Teams already using Grafana/Prometheus; cost-sensitive; Kubernetes-native |
| AWS CloudWatch Logs | Managed; full text | CloudWatch Logs Insights (SQL-like) | Per-GB ingestion + storage; expensive at scale | AWS-native teams; simple use cases; no infrastructure to manage |
| Datadog Logs | Fully managed; full text | Datadog query language | Most expensive at scale; $0.10/GB + retention costs | Teams wanting unified metrics + logs + traces + APM in one platform |
Loki is the Cost-Efficient Choice for Kubernetes
Loki stores logs as compressed chunks, indexing only labels (like Prometheus). Query performance is lower than Elasticsearch for ad-hoc full-text search, but for structured JSON logs where you filter by label (service=payments, level=error), it is extremely fast and dramatically cheaper. At 10GB/day ingestion, Loki on S3 costs ~$3/month in storage vs ~$50 for equivalent Elasticsearch.
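For example, querying Loki for the failure above stays label-first and parses JSON fields only at query time. The label and field names here are assumptions about how your logs are shipped:

```logql
# Errors from the payments service, filtering on a JSON field
{app="payments"} | json | level="error" | error_code="INSUFFICIENT_FUNDS"

# Error rate over 5-minute windows, suitable for a Grafana panel or alert
sum(rate({app="payments"} | json | level="error" [5m]))
```

Because only `app` is indexed, the label selector narrows the chunk scan first; the `| json` stage then extracts fields from matching lines.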
Log Retention and Cost Management
Log storage costs scale linearly with log volume and retention period. A busy production system can generate gigabytes of logs per hour. Without cost management, log storage becomes one of the largest line items in your infrastructure bill.
Log retention best practices
- Tiered retention by environment: dev: 3-7 days; staging: 14-30 days; production: 90 days hot, then archive to cold storage (S3 Glacier) for 1-7 years depending on compliance requirements.
- Sample high-volume debug logs: keep 1% of debug/info logs from healthy services and 100% of warn/error logs. Most debug logs are never queried. Sampling can reduce volume by 10-50x.
- Filter at the shipper, not the backend: use Fluent Bit filters to drop health check logs (/healthz, /ping) before they reach the backend. These are typically 30-50% of all HTTP logs but have zero debugging value.
- Compress aggressively: logs compress extremely well (text is repetitive). Loki, Elasticsearch, and CloudWatch all compress; ensure compression is enabled. It can reduce storage by 5-10x.
- Alert on log volume anomalies: if a service suddenly produces 10x its normal log volume, something is wrong (error loop, debug logging left on). Alert on log ingestion rate per service.
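A level-aware sampler can sit wherever logs are emitted or shipped. A minimal sketch, with the 1% rate as an assumed tuning value:

```python
import random

KEEP_ALWAYS = {"warn", "error", "fatal"}  # never sample problems away
SAMPLE_RATE = 0.01  # assumed: keep 1% of debug/info logs

def should_ship(record):
    """Decide whether a structured log record is forwarded to the backend."""
    if record.get("level") in KEEP_ALWAYS:
        return True
    return random.random() < SAMPLE_RATE
```

The same decision can be pushed into the shipper itself (e.g. a Lua or sampling filter in Fluent Bit) so applications stay unaware of the policy.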
CloudWatch Logs: Set Retention or Pay Forever
CloudWatch Logs default retention is "never expire." A single EC2 instance logging at moderate volume with 10 years of logs can cost $500+/month in storage. Set retention on every log group: 7 days for dev, 30 days for staging, 90 days for production. Do this from day one; it is painful to clean up years of accumulated logs later.
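Setting retention is one AWS CLI call per log group. Shown as a fragment since it needs AWS credentials; the log group name is a placeholder:

```shell
# Production: keep 90 days of hot logs, then expire automatically
aws logs put-retention-policy \
  --log-group-name /app/payments \
  --retention-in-days 90
```

Running this in a loop over `aws logs describe-log-groups` output, or enforcing it via Terraform, keeps new log groups from silently reverting to "never expire."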
How this might come up in interviews
Log aggregation appears in observability system design questions, SRE interview scenarios about incident response, and DevOps maturity assessments. "How would you investigate a production issue?" almost always involves logging architecture.
Common questions:
- How would you implement centralized logging for a microservices architecture on Kubernetes?
- What is structured logging and why does it matter?
- Walk me through how trace IDs work for cross-service debugging.
- How do you manage log storage costs at scale?
- What is the difference between Loki and Elasticsearch for log storage?
Strong answers: explain trace IDs for cross-service correlation; advocate structured JSON logs with specific field examples; bring a cost management strategy (retention tiers, sampling, filtering); know at least two log aggregation tools and their trade-offs.
Red flags: describing SSH-ing into servers as a log investigation strategy; not knowing what structured logging is; no mention of trace/correlation IDs; ignoring log retention costs; treating logging as optional.
Quick check · Log Aggregation
Your microservices architecture has 15 services. A checkout request fails and you need to find where in the call chain it broke. What single addition would most reduce your investigation time?
Key takeaways
- Centralized logging replaces SSH-based debugging with a single queryable system: essential for any architecture with more than one service
- Structured JSON logs make every field a query target; unstructured text requires fragile regex parsing. Always log in JSON.
- Trace IDs (correlation IDs) propagated through all service calls are the single most impactful addition to a microservices logging setup: they enable cross-service request tracing
- Log shippers (Fluent Bit for Kubernetes) collect and forward logs; backends (Loki, Elasticsearch, CloudWatch) store and query them. Choose the shipper and backend separately.
- Log retention and cost management are mandatory: set retention policies from day one, filter health-check logs at the shipper, and use tiered storage for compliance requirements
💡 Analogy
Centralized logging is like having a court reporter present in every meeting in your organization simultaneously. Without them, if you need to know what was said in a meeting from last Tuesday at 2 PM, you have to track down every attendee, hope they remember, and try to reconcile conflicting accounts. With the court reporter, you search the transcript system: "show me everything said about the budget in the last 30 days, sorted by speaker." Trace IDs are like case numbers that link all documents across different meetings related to the same matter.
⚡ Core Idea
Ship structured JSON logs from every service to a central system. Filter by trace ID to correlate a request across services. Filter by error level to find problems. All without SSH-ing into a single server.
🎯 Why It Matters
Centralized logging is table stakes for operating any system with more than one service or more than one server. The mean time to resolution (MTTR) in incidents is dominated by time to diagnose. With centralized structured logs and trace IDs, diagnosis drops from hours to minutes.