See it in production: metrics, logs, traces, SLOs
Continues from the last build: TaskFlow auto-syncs to EKS via Argo CD, but you are blind to its health.
TaskFlow now ships itself. In the last rung you wired Argo CD so a merge to main reconciles the desired state onto your EKS cluster with no kubectl apply by hand.
What you'll build
TaskFlow is fully observable on EKS: Prometheus scrapes a real /metrics endpoint whose latency histogram has a 0.3s bucket and a status label, Grafana shows live RED dashboards, structured logs are searchable in Loki, a distributed trace links frontend to backend to RDS, and a multi-window multi-burn-rate alert pages on-call only when the 99.5%/300ms SLO is genuinely at risk.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
import logging
import sys

from pythonjsonlogger import jsonlogger
from prometheus_fastapi_instrumentator import Instrumentator, metrics
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Buckets MUST include 0.3 so the 300ms SLO query has a real boundary.
# The library default is (0.1, 0.5, 1, +Inf), which has no le="0.3".
LATENCY_BUCKETS = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, float("inf"))


def _setup_json_logging() -> None:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(jsonlogger.JsonFormatter(
        "%(asctime)s %(levelname)s %(name)s %(message)s"
    ))
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(logging.INFO)


def _setup_tracing(engine) -> None:
    provider = TracerProvider(resource=Resource.create({"service.name": "taskflow-backend"}))
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    SQLAlchemyInstrumentor().instrument(engine=engine)


def init_observability(app, engine) -> None:
    _setup_json_logging()
    _setup_tracing(engine)
    # Exclude /metrics so the scrape endpoint is not traced or counted in the SLO.
    FastAPIInstrumentor.instrument_app(app, excluded_urls="/metrics")
    instrumentator = Instrumentator(
        excluded_handlers=["/metrics"],
    )
    # Replace the default duration metric with one that has a 0.3s bucket
    # AND a status label, so the SLI can filter fast-and-non-error requests.
    instrumentator.add(
        metrics.latency(
            buckets=LATENCY_BUCKETS,
            should_include_handler=True,
            should_include_method=True,
            should_include_status=True,
        )
    )
    instrumentator.instrument(app).expose(app, endpoint="/metrics")
Reading this file
LATENCY_BUCKETS = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, float("inf"))
The explicit histogram bucket edges; including 0.3 is what gives the SLO query a real le="0.3" series to count fast requests.

should_include_status=True
Adds a status label to the duration histogram so the SLI can count requests that are both fast and non-error in one metric.

excluded_handlers=["/metrics"]
Stops the scrape endpoint from being recorded in the RED metrics, so the monitoring endpoint does not pollute the numbers it reports.

excluded_urls="/metrics"
Tells the tracer not to create a server span for the scrape endpoint, keeping scrape traffic out of traces and the SLO.

OTLPSpanExporter()
Sends finished trace spans over OTLP gRPC to the OpenTelemetry Collector; the endpoint comes from an env var set in the Deployment.
Single module that turns on all three pillars. Imported and called from main.py so the rest of the app stays clean. The latency metric is configured explicitly with a 0.3s bucket and a status label so the SLO query actually works, and /metrics is excluded from instrumentation so the scrape endpoint does not pollute its own numbers.
That's 1 of 9 explained code blocks in this single project.
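To make the point about the 0.3s bucket and the status label concrete, here is a minimal sketch in plain Python of how cumulative histogram buckets are counted and how a "fast and non-error" ratio falls out of them. It does not use the Prometheus client. The helper names (observe, good_ratio), the COARSE_BUCKETS tuple, the sample traffic, and the assumption that the status label is grouped as "2xx"/"5xx" are all illustrative and not taken from the TaskFlow code.

```python
from bisect import bisect_left
from collections import Counter

# Same edges as LATENCY_BUCKETS in the file above.
LATENCY_BUCKETS = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, float("inf"))
# Coarser edges with no 0.3 boundary, for comparison (illustrative).
COARSE_BUCKETS = (0.1, 0.5, 1.0, float("inf"))


def observe(counts, status, seconds, buckets=LATENCY_BUCKETS):
    """Record one request: bump every cumulative bucket with le >= seconds."""
    first = bisect_left(buckets, seconds)
    for le in buckets[first:]:
        counts[(status, le)] += 1


def good_ratio(counts, threshold, buckets=LATENCY_BUCKETS):
    """Share of all requests that were non-5xx and finished within threshold."""
    if threshold not in buckets:
        # A threshold that is not a bucket edge has no series to count.
        raise ValueError(f"no bucket with le={threshold}")
    total = sum(n for (_, le), n in counts.items() if le == float("inf"))
    good = sum(
        n for (status, le), n in counts.items()
        if le == threshold and not status.startswith("5")
    )
    return good / total if total else 1.0


# Hypothetical sample traffic: (status group, duration in seconds).
SAMPLE = [("2xx", 0.04), ("2xx", 0.25), ("2xx", 0.45), ("5xx", 0.02)]

counts = Counter()
for status, seconds in SAMPLE:
    observe(counts, status, seconds)

print(good_ratio(counts, 0.3))  # 2 of 4 requests are both fast and non-error
```

With COARSE_BUCKETS, good_ratio(..., 0.3, COARSE_BUCKETS) raises ValueError: the nearest edges are 0.1 and 0.5, so a 300ms threshold cannot be counted exactly. The fast 5xx request is excluded from the good count because the status label travels with each bucket.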
The build, milestone by milestone
- 1
Instrument the FastAPI backend with a Prometheus /metrics endpoint
3 guided steps
You cannot graph what the app does not emit, and you cannot compute a 300ms SLO against a histogram that has no 0.3 boundary. A histogram (not a gauge) is what lets you compute real percentile latency, and the bucket edges you choose are the only latency thresholds you can ever query. Adding a status label to the duration metric is what lets the SLI count fast-and-non-error requests in one place instead of stitching two metrics with mismatched labels.
- 2
Deploy Prometheus and Grafana and scrape the backend
3 guided steps
Metrics that no one collects are useless. Prometheus pulls /metrics on a schedule and stores the results as time series; Grafana turns those series into the three RED graphs an on-call engineer reads first. Annotation-based discovery means future services get scraped automatically just by adding three lines, but only if the relabel config actually consumes the port annotation. Otherwise Prometheus scrapes the wrong port and the target shows DOWN.
- 3
Ship structured JSON logs to Loki and query them
3 guided steps
kubectl logs on one pod is fine for a single replica on your laptop and useless in production with many pods behind a load balancer. Centralized, structured logs let you search by field (level, path, trace_id) across the whole fleet. JSON logging is what makes those fields queryable instead of forcing fragile regex on free text.
- 4
Trace a request end to end with OpenTelemetry
3 guided steps
Metrics tell you that /tasks is slow; a trace tells you why. A distributed trace links the frontend fetch, the backend handler, and the exact SQL statement into one waterfall, so a 280ms request that spends 240ms in Postgres immediately points at the database, not your code. Without trace context propagation, you would be correlating timestamps across three systems by hand.
- 5
Define the SLO and alert on multi-window burn rate
3 guided steps
An alert on every 5xx or every slow request trains on-call to ignore the pager. A single-window alert, or one compared against its own value an hour ago, either flaps or misses sustained burns. The Google SRE Workbook's multi-window approach requires a short window for responsiveness and a long window for confirmation, so you page only when the budget is genuinely burning fast and not on a one-off spike. Computing good events once (fast and non-error together) also keeps the ratio in [0,1], so the 14.4x threshold comparison is coherent.
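The paging rule from the last milestone can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the alert rule the course ships: the function names are hypothetical, each window is reduced to a (good, total) pair of request counts, and the long and short windows are assumed to be 1 hour and 5 minutes, the pairing the SRE Workbook uses with a 14.4x threshold.

```python
SLO_TARGET = 0.995              # 99.5% of requests must be fast and non-error
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of requests allowed to be bad
PAGE_THRESHOLD = 14.4           # burn rate from the text


def burn_rate(good, total):
    """How many times faster than sustainable the error budget is being spent."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    bad_ratio = 1 - good / total
    return bad_ratio / ERROR_BUDGET


def should_page(long_window, short_window, threshold=PAGE_THRESHOLD):
    """Page only when BOTH windows burn faster than the threshold.

    long_window and short_window are (good, total) counts, assumed here
    to cover the last 1 hour and the last 5 minutes respectively.
    """
    return (
        burn_rate(*long_window) > threshold
        and burn_rate(*short_window) > threshold
    )


# Sustained burn: 10% bad over the hour, 20% bad in the last 5 minutes.
print(should_page((9000, 10000), (800, 1000)))

# One-off spike: the last 5 minutes are bad, but the hour is healthy.
print(should_page((9990, 10000), (500, 1000)))

# Recovered incident: the hour still looks bad, the last 5 minutes are clean.
print(should_page((9000, 10000), (1000, 1000)))
```

Only the first case pages. The short window keeps the alert from firing long after an incident has recovered, and the long window keeps a brief spike from paging anyone. Because good events are counted once, bad_ratio stays in [0,1] and dividing by the error budget gives a burn rate that can be compared directly against 14.4.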
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.