Advanced · TaskFlow · Project 11 of 13 · ~12h · 5 milestones

See it in production: metrics, logs, traces, SLOs

Continues from the last build: TaskFlow auto-syncs to EKS via Argo CD but you are blind to its health.

TaskFlow now ships itself. In the last rung you wired Argo CD so a merge to main reconciles the desired state onto your EKS cluster with no kubectl apply by hand.

Instrumenting a FastAPI service with Prometheus metrics (RED method), including custom latency buckets
Deploying Prometheus and Grafana on Kubernetes via kube-prometheus-stack
Shipping and querying structured JSON logs with Loki and LogQL
Distributed tracing with OpenTelemetry across frontend, backend, and database
Defining SLIs/SLOs and writing multi-window multi-burn-rate alerts

What you'll build

TaskFlow is fully observable on EKS: Prometheus scrapes a real /metrics endpoint whose latency histogram has a 0.3s bucket and a status label, Grafana shows live RED dashboards, structured logs are searchable in Loki, a distributed trace links frontend to backend to RDS, and a multi-window multi-burn-rate alert pages on-call only when the 99.5%/300ms SLO is genuinely at risk.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

taskflow/backend/app/observability.py · python
import logging
import sys
from pythonjsonlogger import jsonlogger
from prometheus_fastapi_instrumentator import Instrumentator, metrics
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Buckets MUST include 0.3 so the 300ms SLO query has a real boundary.
# The library's default http_request_duration_seconds metric uses
# (0.1, 0.5, 1, +Inf), which has no le="0.3".
LATENCY_BUCKETS = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, float("inf"))


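def _setup_json_logging() -> None:
    # Emit one JSON object per log line on stdout so Loki can filter on
    # fields (level, name, message) instead of matching free text.
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(jsonlogger.JsonFormatter(
        "%(asctime)s %(levelname)s %(name)s %(message)s"
    ))
    root = logging.getLogger()
    # Replace any existing handlers so every record uses the JSON formatter.
    root.handlers = [handler]
    root.setLevel(logging.INFO)


def _setup_tracing(engine) -> None:
    # service.name is how this service is identified in the tracing backend.
    provider = TracerProvider(resource=Resource.create({"service.name": "taskflow-backend"}))
    # Batch finished spans and export them over OTLP gRPC; the endpoint is
    # read from the environment rather than hard-coded here.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)
    # Create a span for each SQL statement issued through this engine.
    SQLAlchemyInstrumentor().instrument(engine=engine)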


def init_observability(app, engine) -> None:
    _setup_json_logging()
    _setup_tracing(engine)
    # Exclude /metrics so the scrape endpoint is not traced or counted in the SLO.
    FastAPIInstrumentor.instrument_app(app, excluded_urls="/metrics")
    instrumentator = Instrumentator(
        excluded_handlers=["/metrics"],
    )
    # Replace the default duration metric with one that has a 0.3s bucket
    # AND a status label, so the SLI can filter fast-and-non-error requests.
    instrumentator.add(
        metrics.latency(
            buckets=LATENCY_BUCKETS,
            should_include_handler=True,
            should_include_method=True,
            should_include_status=True,
        )
    )
    instrumentator.instrument(app).expose(app, endpoint="/metrics")

Reading this file

  • LATENCY_BUCKETS = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, float("inf")): The explicit histogram bucket edges; including 0.3 is what gives the SLO query a real le="0.3" series to count fast requests.
  • should_include_status=True: Adds a status label to the duration histogram so the SLI can count requests that are both fast and non-error in one metric.
  • excluded_handlers=["/metrics"]: Stops the scrape endpoint from being recorded in the RED metrics, so the monitoring endpoint does not pollute the numbers it reports.
  • excluded_urls="/metrics": Tells the tracer not to create a server span for the scrape endpoint, keeping scrape traffic out of traces and the SLO.
  • OTLPSpanExporter(): Sends finished trace spans over OTLP gRPC to the OpenTelemetry Collector; the endpoint comes from an env var set in the Deployment.

Single module that turns on all three pillars. Imported and called from main.py so the rest of the app stays clean. The latency metric is configured explicitly with a 0.3s bucket and a status label so the SLO query actually works, and /metrics is excluded from instrumentation so the scrape endpoint does not pollute its own numbers.
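To see why the 0.3 edge matters, here is a minimal, stdlib-only sketch of how a cumulative (Prometheus-style) histogram counts observations. It does not use prometheus_client, the sample durations are made up, and `bucket_counts` and `fraction_within` are hypothetical helpers for illustration only. The status label is left out to keep the sketch short.

```python
# Bucket edges from observability.py, plus the coarser edges the file's
# comment attributes to the library default.
LATENCY_BUCKETS = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, float("inf"))
COARSE_BUCKETS = (0.1, 0.5, 1.0, float("inf"))


def bucket_counts(observations, edges):
    """Return cumulative counts per upper bound, like Prometheus `le` buckets."""
    counts = {edge: 0 for edge in edges}
    for value in observations:
        for edge in edges:
            if value <= edge:
                counts[edge] += 1
    return counts


def fraction_within(counts, threshold):
    """Fraction of observations <= threshold; only answerable at a bucket edge."""
    if threshold not in counts:
        raise KeyError(f"no le={threshold} bucket, so this threshold is not queryable")
    total = counts[float("inf")]
    return counts[threshold] / total if total else 1.0


# Hypothetical request durations in seconds.
durations = [0.04, 0.12, 0.25, 0.29, 0.31, 0.45, 0.8, 1.5]

custom = bucket_counts(durations, LATENCY_BUCKETS)
coarse = bucket_counts(durations, COARSE_BUCKETS)

# With a 0.3 edge, "what fraction finished within 300ms?" has an exact answer.
fast_ratio = fraction_within(custom, 0.3)
print(fast_ratio)

# Without it, the same question cannot be answered from the stored counts.
try:
    fraction_within(coarse, 0.3)
except KeyError as exc:
    print(exc)
```

The coarse histogram still knows that six of the eight requests finished within 0.5s, but once the raw durations are discarded nothing can say how many of those six beat 300ms.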

That's 1 of 9 explained code blocks in this single project.

The build, milestone by milestone

  1. Instrument the FastAPI backend with a Prometheus /metrics endpoint

    3 guided steps

    You cannot graph what the app does not emit, and you cannot compute a 300ms SLO against a histogram that has no 0.3 boundary. A histogram (not a gauge) is what lets you estimate percentile latency, and the bucket edges you choose are the only latency thresholds you can ever query exactly. Adding a status label to the duration metric is what lets the SLI count fast-and-non-error requests in one place instead of stitching together two metrics with mismatched labels.

  2. Deploy Prometheus and Grafana and scrape the backend

    3 guided steps

    Metrics that no one collects are useless. Prometheus pulls /metrics on a schedule and stores the results as time series; Grafana turns those series into the three RED graphs an on-call engineer reads first. Annotation-based discovery means future services get scraped automatically just by adding three lines, but only if the relabel config actually consumes the port annotation. Otherwise Prometheus scrapes the wrong port and the target shows DOWN.

  3. Ship structured JSON logs to Loki and query them

    3 guided steps

    Running kubectl logs against one pod is fine for a single replica on your laptop, but useless in production with many pods behind a load balancer. Centralized, structured logs let you search by field (level, path, trace_id) across the whole fleet. JSON logging is what makes those fields queryable instead of forcing fragile regex on free text.

  4. Trace a request end to end with OpenTelemetry

    3 guided steps

    Metrics tell you that /tasks is slow; a trace tells you why. A distributed trace links the frontend fetch, the backend handler, and the exact SQL statement into one waterfall, so a 280ms request that spends 240ms in Postgres immediately points at the database, not your code. Without trace context propagation, you would be correlating timestamps across three systems by hand.

  5. Define the SLO and alert on multi-window burn rate

    3 guided steps

    An alert on every 5xx or every slow request trains on-call to ignore the pager. A single-window alert, or one compared against its own value an hour ago, either flaps or misses sustained burns. The Google SRE Workbook's multi-window approach requires the burn rate to exceed the threshold over both a long window, which shows the burn is significant, and a short window, which shows it is still happening, so you page only when the budget is genuinely burning fast and not on a one-off spike. Computing good events once (fast and non-error together) also keeps the ratio in [0,1], so the 14.4x threshold comparison is coherent.
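The burn-rate arithmetic behind milestone 5 can be sketched in a few lines of plain Python. This is an illustration rather than the project's alert rule: the 1h/5m window pairing and the 30-day SLO period are assumptions (the pairing the SRE Workbook commonly associates with 14.4x), the request counts are invented, and `burn_rate` and `should_page` are hypothetical names.

```python
SLO_TARGET = 0.995               # 99.5% of requests must be fast and non-error
ERROR_BUDGET = 1 - SLO_TARGET    # so 0.5% of requests may be bad
FAST_BURN_THRESHOLD = 14.4       # threshold quoted in the milestone text

# Where 14.4 comes from, ASSUMING a 30-day SLO period and a 1-hour window:
# spending 2% of the budget in 1h means burning 0.02 * 720h / 1h times too fast.
derived_threshold = 0.02 * (30 * 24) / 1


def burn_rate(good: int, total: int) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    if total == 0:
        return 0.0
    bad_ratio = 1 - good / total   # good/total stays in [0, 1]
    return bad_ratio / ERROR_BUDGET


def should_page(long_window, short_window, threshold=FAST_BURN_THRESHOLD) -> bool:
    """Page only when BOTH the long and the short window exceed the threshold."""
    return burn_rate(*long_window) > threshold and burn_rate(*short_window) > threshold


# Each window is (good_events, total_events); all numbers are hypothetical.
sustained = should_page(long_window=(90_000, 100_000), short_window=(7_400, 8_000))
spike = should_page(long_window=(99_800, 100_000), short_window=(7_000, 8_000))
recovered = should_page(long_window=(90_000, 100_000), short_window=(7_990, 8_000))

print(sustained, spike, recovered)
```

Only the sustained case pages. The spike is loud in the short window but barely moves the long one, and the recovered case still looks bad over the last hour even though the short window shows the burn has stopped.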

What's inside when you start

4 starter files, ready to clone
5 guided milestones
5 full reference solutions
9 code blocks explained line-by-line
5 "is it working?" checks
3 interview questions it prepares you for

You'll walk away with

FastAPI backend exposing a working /metrics endpoint with RED counters and a latency histogram that has a 0.3s bucket and a status label
kube-prometheus-stack deployed, scraping the backend on port 8000 via the port-aware relabel config, with a Grafana RED dashboard for /tasks
Structured JSON logs from all backend pods searchable in Loki via LogQL
A single end-to-end OpenTelemetry trace spanning frontend, FastAPI, and the RDS query
A documented 99.5%/300ms SLO encoded as a good-events recording rule plus a multi-window multi-burn-rate alert routed through Alertmanager
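What "port-aware" means for the relabel config can be shown with a small Python simulation of a Prometheus `replace` relabel step. The regex is the pattern commonly used in pod-annotation scrape configs, not necessarily the one in this project's solution; the pod address 10.0.3.17:8080 is invented, and `relabel_address` is a hypothetical helper.

```python
import re

# Pattern commonly used to rewrite __address__ from a port annotation.
# Prometheus anchors relabel regexes, which is why fullmatch is used below.
ADDRESS_PORT_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Simulate a `replace` action with replacement `$1:$2` targeting __address__."""
    joined = f"{address};{port_annotation}"   # source labels are joined with ";"
    match = ADDRESS_PORT_RE.fullmatch(joined)
    if match is None:
        return address                        # no match: the label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"


# Discovery found the pod on some other container port (8080 is made up);
# the annotation says the metrics live on 8000.
with_annotation = relabel_address("10.0.3.17:8080", "8000")
without_annotation = relabel_address("10.0.3.17:8080", "")

print(with_annotation)
print(without_annotation)
```

When the annotation is consumed, the scrape address is rewritten to port 8000. When it is missing, or the rule never reads it, Prometheus keeps whatever address discovery produced, which is how a target ends up DOWN on the wrong port.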
