Build Challenge: Make It Observable and Resilient
The SRE team just rejected your deploy. Ship a microservice that passes their go-live checklist.
Why this matters at your level
- Understands why each of the four pillars matters. Can implement /health and /ready endpoints. Knows the difference between liveness and readiness. Can add OpenTelemetry auto-instrumentation.
- Implements all four pillars correctly. Tunes circuit breaker thresholds for the actual failure profile. Implements a SIGTERM handler that correctly drains in-flight requests. Adds business-relevant span attributes.
- Designs the production-readiness checklist for the entire org. Reviews PRs for missing observability and resilience patterns. Establishes org-wide defaults for circuit breaker configuration and graceful shutdown timeout. Architects the observability platform.
Your microservice is built and tested. The SRE team reviews the deploy request and rejects it. Their go-live checklist has four failing items: requests emit no distributed traces; there are no /health or /ready endpoints for Kubernetes probes; there is no SIGTERM handler (Kubernetes sends SIGTERM 30 seconds before SIGKILL, so in-flight requests will be cut); and there is no circuit breaker on the database call (one slow database will take down the entire service). You have three hours to fix all four items before the deployment window closes.
The question this raises
What does it mean for a service to be production-ready — and can you build all four requirements from scratch in one session?
Kubernetes is performing a rolling deploy: it starts a new pod and terminates the old one. The old pod has 5 in-flight requests. What must your service do to prevent those 5 requests from receiving errors?
Lesson outline
Before observability: the broken contract
How this concept changes your thinking
Debugging a latency issue in production
“grep through logs on 8 servers looking for the request ID, manually reconstruct which service called which, spend 3 weeks finding a 400ms bug”
“paste the trace ID into Jaeger, see the waterfall, find the slow span in 20 minutes — OpenTelemetry auto-instruments every HTTP call and database query”
Kubernetes rolling deploy
“Pod restarts, in-flight requests get SIGKILL, 503 errors spike for 30 seconds. Users see errors during every deploy.”
“SIGTERM handler stops accepting new connections, drains in-flight requests over 10 seconds, then exits cleanly. Zero 503s during deploy.”
Database goes slow or unreachable
“Requests pile up waiting for DB, connection pool exhausted, entire service becomes unresponsive, cascades to all upstream callers”
“Circuit breaker opens after 50% failure rate, immediately returns 503 with cached fallback data, upstream callers get fast failures instead of hangs”
Kubernetes liveness vs readiness
“Single /health endpoint returns 200 even during startup (DB not connected yet). Kubernetes sends traffic before service is ready. First 30 seconds of requests fail.”
“/health returns 200 if process is alive (always). /ready returns 200 only when DB and Redis connections are confirmed. Kubernetes waits for /ready before routing traffic.”
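The tracing scenario above only works because trace context travels with every request. OpenTelemetry handles this automatically by injecting a W3C `traceparent` header on outgoing calls; as an illustration of what that header carries (a hypothetical parser, not part of OpenTelemetry's public API), the four dash-separated fields can be pulled apart like this:

```typescript
// Sketch: the W3C traceparent header has the layout
//   version-traceId-spanId-flags
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
// This parser is illustrative only; the OpenTelemetry SDK does this for you.
interface TraceParent {
  version: string;
  traceId: string; // 32 hex chars, shared by every service in the request path
  spanId: string;  // 16 hex chars, identifies the caller's span
  sampled: boolean; // low bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const parts = header.split('-');
  if (parts.length !== 4) return null;
  const [version, traceId, spanId, flags] = parts;
  if (traceId.length !== 32 || spanId.length !== 16) return null;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```

Because every hop forwards the same `traceId`, Jaeger can stitch spans from different services into one waterfall.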
The SRE go-live checklist
Deploy Rejected — SRE Go-Live Checklist: 0/4 passing
```
DEPLOY REQUEST: my-service v1.2.3 -> production
STATUS: BLOCKED — SRE go-live checklist failed

[ ] FAIL: Distributed tracing
    No traceparent headers propagated. Cannot correlate requests across services.
    Required: OpenTelemetry SDK initialized before first import.

[ ] FAIL: Health + readiness endpoints
    /health not found (404). Kubernetes liveness probe will restart pods on any 404.
    /ready not found (404). Traffic will be sent to unready pods.
    Required: GET /health -> 200 (process alive) | GET /ready -> 200 (dependencies ok)

[ ] FAIL: Graceful shutdown
    SIGTERM handler not implemented. K8s sends SIGTERM 30s before SIGKILL.
    In-flight requests will be cut on every rolling deploy.
    Required: stop accepting connections on SIGTERM, drain requests, then exit.

[ ] FAIL: Circuit breaker on downstream DB
    No circuit breaker detected on database calls. One slow DB will exhaust
    the connection pool and cascade to all callers.
    Required: circuit breaker with fallback response.

Deployment window: 3 hours. Fix all 4 items and re-submit.
```
Reading what production tells you
Each of the four checklist items produces a distinct signal in production. Knowing the signal tells you whether your implementation is working.
```jsonc
// What a valid OpenTelemetry trace looks like in Jaeger.
// After implementing auto-instrumentation, every request produces this:
{
  "traceID": "abc123def456789",
  "spans": [
    {
      "operationName": "GET /api/users/:id",
      "serviceName": "my-service",
      "duration": 145, // milliseconds (Jaeger stores microseconds; simplified here)
      "tags": {
        "http.method": "GET",
        "http.url": "/api/users/42",
        "http.status_code": 200,
        "service.version": "1.2.3"
      }
    },
    {
      "operationName": "pg.query",
      "serviceName": "my-service",
      "parentSpanID": "root-span-id",
      "duration": 130, // DB query took 130ms, visible as a child span
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM users WHERE id = $1",
        "db.rows_affected": 1
      }
    }
  ]
}
```

```typescript
// Circuit breaker state transitions — opossum emits these events:
circuitBreaker.on('open', () => {
  // Fires when the failure rate exceeds the threshold (50%).
  // All subsequent calls return the fallback immediately (no DB hit).
  metrics.increment('circuit_breaker.opened', { service: 'db' });
});

circuitBreaker.on('halfOpen', () => {
  // After resetTimeout (30s), allows ONE test request through.
  // If it succeeds -> close. If it fails -> open again.
  metrics.increment('circuit_breaker.half_open', { service: 'db' });
});

circuitBreaker.on('close', () => {
  // Test request succeeded: circuit closed, normal operation resumed.
  metrics.increment('circuit_breaker.closed', { service: 'db' });
});
```
The four pillars of production readiness
What you are building
- Pillar 1: Distributed tracing (OpenTelemetry) — Auto-instruments every HTTP request, Express route, and Postgres query. Injects W3C traceparent headers on outgoing calls. Every request in production is traceable from entry to exit across all services.
- Pillar 2: Health + readiness endpoints — /health = liveness probe (is this process alive?). Returns 200 always if the process is running. /ready = readiness probe (can this pod serve traffic?). Returns 200 only when DB and Redis connections are confirmed. Kubernetes stops routing traffic to pods that fail /ready.
- Pillar 3: Graceful shutdown on SIGTERM — Kubernetes sends SIGTERM 30 seconds before SIGKILL on pod termination (rolling deploy, scale-down, eviction). Your SIGTERM handler must: stop accepting new connections, wait for in-flight requests to complete, then call process.exit(0). Without this, every deploy drops active requests.
- Pillar 4: Circuit breaker on DB calls (opossum) — When the DB failure rate exceeds 50% over 10 seconds, the circuit opens. All subsequent calls return the fallback response immediately (cached data, empty array, or explicit 503) without hitting the DB. Prevents connection pool exhaustion and cascade failure. After 30 seconds, the circuit half-opens to test recovery.
| Kubernetes probe | Endpoint | Returns 200 when | Returns 503 when |
|---|---|---|---|
| Liveness | /health | Process is running | Process is deadlocked or unresponsive (K8s restarts it) |
| Readiness | /ready | DB + Redis connections healthy | DB or Redis unreachable (K8s stops routing traffic) |
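Pillar 2 says /ready should only return 200 when both DB and Redis are confirmed. One way to structure that is a dependency-agnostic aggregator; the helper below is hypothetical (not from any library) and the probe bodies are assumptions, but it shows the shape: run every probe, report which ones failed.

```typescript
// Hypothetical readiness aggregator. Each probe resolves if the
// dependency is healthy and throws otherwise, e.g.
//   { name: 'db',    probe: () => db.query('SELECT 1').then(() => {}) }
//   { name: 'redis', probe: () => redis.ping().then(() => {}) }
type Check = { name: string; probe: () => Promise<void> };

async function checkReadiness(
  checks: Check[],
): Promise<{ ready: boolean; failures: string[] }> {
  const failures: string[] = [];
  await Promise.all(
    checks.map(async ({ name, probe }) => {
      try {
        await probe();
      } catch {
        failures.push(name); // record the dependency, not just a boolean
      }
    }),
  );
  return { ready: failures.length === 0, failures };
}
```

Wired into Express, /ready would return 200 when `ready` is true and a 503 carrying the `failures` list otherwise, so a failing probe is self-describing in `kubectl describe pod` output.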
What Stripe requires before every deploy
Stripe requires all four pillars before any service ships to production. The full working implementation is a single Node.js/Express file you can copy-paste as a starting point.
The Limit: observability adds overhead
OpenTelemetry adds 3-5% CPU overhead per service (span creation, attribute serialization, exporter network calls). At very high throughput (10k+ req/s per instance), use head-based sampling (1-5%) to reduce this. The circuit breaker timeout window (10s) means brief DB hiccups will open the circuit — tune the threshold (errorThresholdPercentage and resetTimeout) for your DB reliability profile. A circuit that opens on every deploy is worse than no circuit.
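Head-based sampling works because the keep/drop decision is a deterministic function of the trace ID, so every service in the call path makes the same decision and traces are never half-sampled. OpenTelemetry ships a ratio-based sampler for this; the toy version below is a sketch of the underlying idea, not the SDK's actual implementation.

```typescript
// Toy head-based sampler: keep roughly `ratio` of traces, decided
// deterministically from the trace ID. Illustrative sketch only;
// in production you would configure the SDK's sampler instead.
function shouldSample(traceId: string, ratio: number): boolean {
  // Use the upper 13 hex chars (52 bits) so the value fits exactly
  // in a double, then compare against the same fraction of that range.
  const upper = parseInt(traceId.slice(0, 13), 16);
  return upper < ratio * Math.pow(2, 52);
}
```

Because trace IDs are uniformly random, comparing a fixed prefix against a threshold keeps approximately the requested fraction of traces while guaranteeing consistency across services.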
```typescript
// production-ready-service.ts
// Full implementation: OpenTelemetry + /health + /ready + SIGTERM + circuit breaker
// Run: npx ts-node --require ./tracing.ts production-ready-service.ts

// ── tracing.ts (import FIRST) ─────────────────────────────────────
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

export const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'my-service' }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Every HTTP, Express, pg, redis call now creates spans automatically.

// ── server.ts ─────────────────────────────────────────────────────
import express from 'express';
import CircuitBreaker from 'opossum';
import { Pool } from 'pg';

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Middleware: reject new requests during shutdown.
// Registered BEFORE the routes so it runs for every request.
let isShuttingDown = false;
app.use((_req, res, next) => {
  if (isShuttingDown) {
    res.setHeader('Connection', 'close');
    return res.status(503).json({ error: 'Service shutting down' });
  }
  next();
});

// ── Pillar 4: Circuit breaker on DB calls ─────────────────────────
async function queryDb(sql: string, params: unknown[]) {
  const result = await db.query(sql, params as string[]);
  return result.rows;
}

const dbBreaker = new CircuitBreaker(queryDb, {
  errorThresholdPercentage: 50, // open when 50%+ of calls fail
  resetTimeout: 30_000,         // try again after 30 seconds
  timeout: 3000,                // treat calls taking > 3s as failures
  volumeThreshold: 10,          // need at least 10 calls before tripping
});

dbBreaker.fallback(() => []); // return empty array when circuit is open

// ── Pillar 2: Health + Readiness ──────────────────────────────────
app.get('/health', (_req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() });
});

app.get('/ready', async (_req, res) => {
  try {
    await db.query('SELECT 1'); // verify DB is reachable
    res.json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({
      status: 'not ready',
      error: (err as Error).message,
    });
  }
});

// ── Application routes ────────────────────────────────────────────
app.get('/api/users/:id', async (req, res) => {
  const users = await dbBreaker.fire(
    'SELECT id, name, email FROM users WHERE id = $1',
    [req.params.id],
  );
  res.json(users[0] ?? { error: 'not found' });
});

// ── Pillar 3: Graceful shutdown ───────────────────────────────────
const server = app.listen(8080, () => {
  console.log('Listening on :8080');
});

process.on('SIGTERM', () => {
  console.log('SIGTERM received — starting graceful shutdown');
  isShuttingDown = true;

  // Stop accepting new connections, then drain in-flight requests
  server.close(() => {
    console.log('HTTP server closed — all in-flight requests drained');
    // Shut down the OpenTelemetry exporter (flush remaining spans)
    sdk.shutdown().finally(() => {
      process.exit(0);
    });
  });

  // Force exit after 25 seconds (K8s terminationGracePeriodSeconds is 30)
  setTimeout(() => {
    console.error('Graceful shutdown timeout — forcing exit');
    process.exit(1);
  }, 25_000);
});
```
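The `server.close()` callback fires only once every in-flight request has finished, while the 25-second timer caps how long draining may take. That drain-with-deadline pattern can be modeled in isolation; the helper below is a hypothetical sketch, not Express or Node API.

```typescript
// Toy model of "drain in-flight requests, but give up by a deadline".
// `inFlight` reports the current in-flight count (in the real server
// this is what server.close() tracks internally).
async function drain(
  inFlight: () => number,
  timeoutMs: number,
  pollMs = 10,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (inFlight() === 0) return true; // everything finished: clean exit
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return inFlight() === 0; // deadline hit: report whether we actually drained
}
```

A `true` result corresponds to `process.exit(0)` in the service above; `false` corresponds to the forced `process.exit(1)` path.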
```yaml
# Kubernetes deployment with liveness + readiness probes + graceful termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service          # selector/labels required for a valid Deployment
  template:
    metadata:
      labels:
        app: my-service
    spec:
      terminationGracePeriodSeconds: 30   # K8s waits 30s before SIGKILL
      containers:
        - name: my-service
          image: my-service:1.2.3
          ports:
            - containerPort: 8080

          # Liveness probe: restart pod if process is deadlocked
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5    # wait 5s before first check
            periodSeconds: 10
            failureThreshold: 3       # restart after 3 consecutive failures

          # Readiness probe: stop routing traffic if dependencies are down
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10   # wait for DB connection to establish
            periodSeconds: 5
            failureThreshold: 2       # stop traffic after 2 consecutive failures

          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: OTEL_ENDPOINT
              value: "http://jaeger-collector:4318/v1/traces"
```
Exam Answer vs. Production Reality
Liveness vs readiness probes
📖 What the exam expects
Liveness checks if the process is alive — Kubernetes restarts the pod if it fails. Readiness checks if the pod can serve traffic — Kubernetes stops routing to it if it fails.
How this might come up in interviews
Production readiness questions are the clearest signal of engineering maturity. Every L4+ engineer should be able to implement all four pillars without looking them up.
Common questions:
- What happens when a Kubernetes pod receives SIGTERM?
- How do you prevent a circuit breaker from opening on every deploy?
- What attributes should you add to OpenTelemetry spans?
- What is the difference between /health and /ready?
- How do you trace across service boundaries with OpenTelemetry?
- What does your circuit breaker return when the database is down?
- Why does graceful shutdown matter for Kubernetes deployments?
Strong answers include:
- Immediately distinguishes liveness (process alive) vs readiness (dependencies healthy)
- Knows Kubernetes sends SIGTERM 30s before SIGKILL and implements drain logic
- Mentions opossum or similar for circuit breakers, knows the three states (open/half-open/closed)
- Knows W3C traceparent propagates trace context across HTTP boundaries
- Mentions that circuit breaker fallback should return cached data or explicit 503, not hang
Red flags:
- "Both /health and /ready just check if the server is up" — conflates liveness and readiness
- Does not know what SIGTERM is or why Kubernetes sends it
- Has never implemented a circuit breaker — only knows retry logic
- Thinks tracing is just adding request ID to logs
- Cannot explain what happens to in-flight requests during a rolling deploy
Quick check · Build Challenge: Make It Observable and Resilient
Your Kubernetes pod receives SIGTERM. It has 8 in-flight requests. What is the correct behavior?
From the books
Release It! — Michael T. Nygard (2018)
Chapter 5: Stability Patterns
Circuit breakers, bulkheads, and timeouts are the difference between a service that fails fast and recovers and one that fails slow and cascades. Every production service needs all three.
Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (2022)
Chapter 2: What Is Observability?
Observability is not a set of tools — it is the property of a system that allows you to ask any question about its internal state from its external outputs. You build for observability before incidents, not after.