Build Challenge: Make It Observable and Resilient
The SRE team just rejected your deploy. Ship a microservice that passes their go-live checklist.
Why this matters at your level
- Understands why each of the four pillars matters. Can implement /health and /ready endpoints. Knows the difference between liveness and readiness. Can add OpenTelemetry auto-instrumentation.
- Implements all four pillars correctly. Tunes circuit breaker thresholds for the actual failure profile. Implements a SIGTERM handler that correctly drains in-flight requests. Adds business-relevant span attributes.
- Designs the production-readiness checklist for the entire org. Reviews PRs for missing observability and resilience patterns. Establishes org-wide defaults for circuit breaker configuration and graceful shutdown timeout. Architects the observability platform.
Your microservice is built and tested. The SRE team reviews the deploy request and rejects it. Their go-live checklist has four failing items: requests emit no distributed traces; there are no /health or /ready endpoints for Kubernetes probes; there is no SIGTERM handler (Kubernetes sends SIGTERM 30 seconds before SIGKILL, so in-flight requests will be cut); and there is no circuit breaker on the database call (one slow database will take down the entire service). You have three hours to fix all four items before the deployment window closes.
The question this raises
What does it mean for a service to be production-ready — and can you build all four requirements from scratch in one session?
Kubernetes is performing a rolling deploy: it starts a new pod and terminates the old one. The old pod has 5 in-flight requests. What must your service do to prevent those 5 requests from receiving errors?
Lesson outline
Before observability: the broken contract
How this concept changes your thinking
Debugging a latency issue in production
“grep through logs on 8 servers looking for the request ID, manually reconstruct which service called which, spend 3 weeks finding a 400ms bug”
“paste the trace ID into Jaeger, see the waterfall, find the slow span in 20 minutes — OpenTelemetry auto-instruments every HTTP call and database query”
Kubernetes rolling deploy
“Pod restarts, in-flight requests get SIGKILL, 503 errors spike for 30 seconds. Users see errors during every deploy.”
“SIGTERM handler stops accepting new connections, drains in-flight requests over 10 seconds, then exits cleanly. Zero 503s during deploy.”
Database goes slow or unreachable
“Requests pile up waiting for DB, connection pool exhausted, entire service becomes unresponsive, cascades to all upstream callers”
“Circuit breaker opens after 50% failure rate, immediately returns 503 with cached fallback data, upstream callers get fast failures instead of hangs”
Kubernetes liveness vs readiness
“Single /health endpoint returns 200 even during startup (DB not connected yet). Kubernetes sends traffic before service is ready. First 30 seconds of requests fail.”
“/health returns 200 if process is alive (always). /ready returns 200 only when DB and Redis connections are confirmed. Kubernetes waits for /ready before routing traffic.”
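The tracing scenario above only works because trace context travels with every request. OpenTelemetry handles this automatically by injecting a W3C `traceparent` header on outgoing calls; as an illustration of what that header carries (a hypothetical parser, not part of OpenTelemetry's public API), the four dash-separated fields can be pulled apart like this:

```typescript
// Sketch: the W3C traceparent header has the layout
//   version-traceId-spanId-flags
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
// This parser is illustrative only; the OpenTelemetry SDK does this for you.
interface TraceParent {
  version: string;
  traceId: string; // 32 hex chars, shared by every service in the request path
  spanId: string;  // 16 hex chars, identifies the caller's span
  sampled: boolean; // low bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const parts = header.split('-');
  if (parts.length !== 4) return null;
  const [version, traceId, spanId, flags] = parts;
  if (traceId.length !== 32 || spanId.length !== 16) return null;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```

Because every hop forwards the same `traceId`, Jaeger can stitch spans from different services into one waterfall.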
The SRE go-live checklist
Deploy Rejected — SRE Go-Live Checklist: 0/4 passing
```
DEPLOY REQUEST: my-service v1.2.3 -> production
STATUS: BLOCKED — SRE go-live checklist failed

[ ] FAIL: Distributed tracing
    No traceparent headers propagated. Cannot correlate requests across services.
    Required: OpenTelemetry SDK initialized before first import.

[ ] FAIL: Health + readiness endpoints
    /health not found (404). Kubernetes liveness probe will restart pods on any 404.
    /ready not found (404). Traffic will be sent to unready pods.
    Required: GET /health -> 200 (process alive) | GET /ready -> 200 (dependencies ok)

[ ] FAIL: Graceful shutdown
    SIGTERM handler not implemented. K8s sends SIGTERM 30s before SIGKILL.
    In-flight requests will be cut on every rolling deploy.
    Required: stop accepting connections on SIGTERM, drain requests, then exit.

[ ] FAIL: Circuit breaker on downstream DB
    No circuit breaker detected on database calls. One slow DB will exhaust
    the connection pool and cascade to all callers.
    Required: circuit breaker with fallback response.

Deployment window: 3 hours. Fix all 4 items and re-submit.
```
Reading what production tells you
Each of the four checklist items produces a distinct signal in production. Knowing the signal tells you whether your implementation is working.
```jsonc
// What a valid OpenTelemetry trace looks like in Jaeger.
// After implementing auto-instrumentation, every request produces this:
{
  "traceID": "abc123def456789",
  "spans": [
    {
      "operationName": "GET /api/users/:id",
      "serviceName": "my-service",
      "duration": 145, // milliseconds (Jaeger stores microseconds; simplified here)
      "tags": {
        "http.method": "GET",
        "http.url": "/api/users/42",
        "http.status_code": 200,
        "service.version": "1.2.3"
      }
    },
    {
      "operationName": "pg.query",
      "serviceName": "my-service",
      "parentSpanID": "root-span-id",
      "duration": 130, // DB query took 130ms, visible as a child span
      "tags": {
        "db.type": "postgresql",
        "db.statement": "SELECT * FROM users WHERE id = $1",
        "db.rows_affected": 1
      }
    }
  ]
}
```

```typescript
// Circuit breaker state transitions — opossum emits these events:
circuitBreaker.on('open', () => {
  // Fires when the failure rate exceeds the threshold (50%).
  // All subsequent calls return the fallback immediately (no DB hit).
  metrics.increment('circuit_breaker.opened', { service: 'db' });
});

circuitBreaker.on('halfOpen', () => {
  // After resetTimeout (30s), allows ONE test request through.
  // If it succeeds -> close. If it fails -> open again.
  metrics.increment('circuit_breaker.half_open', { service: 'db' });
});

circuitBreaker.on('close', () => {
  // Test request succeeded: circuit closed, normal operation resumed.
  metrics.increment('circuit_breaker.closed', { service: 'db' });
});
```
The four pillars of production readiness
What you are building
- Pillar 1: Distributed tracing (OpenTelemetry) — Auto-instruments every HTTP request, Express route, and Postgres query. Injects W3C traceparent headers on outgoing calls. Every request in production is traceable from entry to exit across all services.
- Pillar 2: Health + readiness endpoints — /health = liveness probe (is this process alive?). Returns 200 always if the process is running. /ready = readiness probe (can this pod serve traffic?). Returns 200 only when DB and Redis connections are confirmed. Kubernetes stops routing traffic to pods that fail /ready.
- Pillar 3: Graceful shutdown on SIGTERM — Kubernetes sends SIGTERM 30 seconds before SIGKILL on pod termination (rolling deploy, scale-down, eviction). Your SIGTERM handler must: stop accepting new connections, wait for in-flight requests to complete, then call process.exit(0). Without this, every deploy drops active requests.
- Pillar 4: Circuit breaker on DB calls (opossum) — When the DB failure rate exceeds 50% over 10 seconds, the circuit opens. All subsequent calls return the fallback response immediately (cached data, empty array, or explicit 503) without hitting the DB. Prevents connection pool exhaustion and cascade failure. After 30 seconds, the circuit half-opens to test recovery.
| Kubernetes probe | Endpoint | Returns 200 when | Returns 503 when |
|---|---|---|---|
| Liveness | /health | Process is running | Process is deadlocked or unresponsive (K8s restarts it) |
| Readiness | /ready | DB + Redis connections healthy | DB or Redis unreachable (K8s stops routing traffic) |
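Pillar 2 says /ready should only return 200 when both DB and Redis are confirmed. One way to structure that is a dependency-agnostic aggregator; the helper below is hypothetical (not from any library) and the probe bodies are assumptions, but it shows the shape: run every probe, report which ones failed.

```typescript
// Hypothetical readiness aggregator. Each probe resolves if the
// dependency is healthy and throws otherwise, e.g.
//   { name: 'db',    probe: () => db.query('SELECT 1').then(() => {}) }
//   { name: 'redis', probe: () => redis.ping().then(() => {}) }
type Check = { name: string; probe: () => Promise<void> };

async function checkReadiness(
  checks: Check[],
): Promise<{ ready: boolean; failures: string[] }> {
  const failures: string[] = [];
  await Promise.all(
    checks.map(async ({ name, probe }) => {
      try {
        await probe();
      } catch {
        failures.push(name); // record the dependency, not just a boolean
      }
    }),
  );
  return { ready: failures.length === 0, failures };
}
```

Wired into Express, /ready would return 200 when `ready` is true and a 503 carrying the `failures` list otherwise, so a failing probe is self-describing in `kubectl describe pod` output.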
What Stripe requires before every deploy
Stripe requires all four pillars before any service ships to production. The full working implementation is a single Node.js/Express file you can copy-paste as a starting point.
The Limit: observability adds overhead
OpenTelemetry adds 3-5% CPU overhead per service (span creation, attribute serialization, exporter network calls). At very high throughput (10k+ req/s per instance), use head-based sampling (1-5%) to reduce this. The circuit breaker timeout window (10s) means brief DB hiccups will open the circuit — tune the threshold (errorThresholdPercentage and resetTimeout) for your DB reliability profile. A circuit that opens on every deploy is worse than no circuit.
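Head-based sampling works because the keep/drop decision is a deterministic function of the trace ID, so every service in the call path makes the same decision and traces are never half-sampled. OpenTelemetry ships a ratio-based sampler for this; the toy version below is a sketch of the underlying idea, not the SDK's actual implementation.

```typescript
// Toy head-based sampler: keep roughly `ratio` of traces, decided
// deterministically from the trace ID. Illustrative sketch only;
// in production you would configure the SDK's sampler instead.
function shouldSample(traceId: string, ratio: number): boolean {
  // Use the upper 13 hex chars (52 bits) so the value fits exactly
  // in a double, then compare against the same fraction of that range.
  const upper = parseInt(traceId.slice(0, 13), 16);
  return upper < ratio * Math.pow(2, 52);
}
```

Because trace IDs are uniformly random, comparing a fixed prefix against a threshold keeps approximately the requested fraction of traces while guaranteeing consistency across services.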
```typescript
// production-ready-service.ts
// Full implementation: OpenTelemetry + /health + /ready + SIGTERM + circuit breaker
// Run: npx ts-node --require ./tracing.ts production-ready-service.ts

// ── tracing.ts (import FIRST) ─────────────────────────────────────
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

export const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'my-service' }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
// Every HTTP, Express, pg, redis call now creates spans automatically.

// ── server.ts ─────────────────────────────────────────────────────
import express from 'express';
import CircuitBreaker from 'opossum';
import { Pool } from 'pg';

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Middleware: reject new requests during shutdown.
// Registered BEFORE the routes so it runs for every request.
let isShuttingDown = false;
app.use((_req, res, next) => {
  if (isShuttingDown) {
    res.setHeader('Connection', 'close');
    return res.status(503).json({ error: 'Service shutting down' });
  }
  next();
});

// ── Pillar 4: Circuit breaker on DB calls ─────────────────────────
async function queryDb(sql: string, params: unknown[]) {
  const result = await db.query(sql, params as string[]);
  return result.rows;
}

const dbBreaker = new CircuitBreaker(queryDb, {
  errorThresholdPercentage: 50, // open when 50%+ of calls fail
  resetTimeout: 30_000,         // try again after 30 seconds
  timeout: 3000,                // treat calls taking > 3s as failures
  volumeThreshold: 10,          // need at least 10 calls before tripping
});

dbBreaker.fallback(() => []); // return empty array when circuit is open

// ── Pillar 2: Health + Readiness ──────────────────────────────────
app.get('/health', (_req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() });
});

app.get('/ready', async (_req, res) => {
  try {
    await db.query('SELECT 1'); // verify DB is reachable
    res.json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({
      status: 'not ready',
      error: (err as Error).message,
    });
  }
});

// ── Application routes ────────────────────────────────────────────
app.get('/api/users/:id', async (req, res) => {
  const users = await dbBreaker.fire(
    'SELECT id, name, email FROM users WHERE id = $1',
    [req.params.id],
  );
  res.json(users[0] ?? { error: 'not found' });
});

// ── Pillar 3: Graceful shutdown ───────────────────────────────────
const server = app.listen(8080, () => {
  console.log('Listening on :8080');
});

process.on('SIGTERM', () => {
  console.log('SIGTERM received — starting graceful shutdown');
  isShuttingDown = true;

  // Stop accepting new connections, then drain in-flight requests
  server.close(() => {
    console.log('HTTP server closed — all in-flight requests drained');
    // Shut down the OpenTelemetry exporter (flush remaining spans)
    sdk.shutdown().finally(() => {
      process.exit(0);
    });
  });

  // Force exit after 25 seconds (K8s terminationGracePeriodSeconds is 30)
  setTimeout(() => {
    console.error('Graceful shutdown timeout — forcing exit');
    process.exit(1);
  }, 25_000);
});
```
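The `server.close()` callback fires only once every in-flight request has finished, while the 25-second timer caps how long draining may take. That drain-with-deadline pattern can be modeled in isolation; the helper below is a hypothetical sketch, not Express or Node API.

```typescript
// Toy model of "drain in-flight requests, but give up by a deadline".
// `inFlight` reports the current in-flight count (in the real server
// this is what server.close() tracks internally).
async function drain(
  inFlight: () => number,
  timeoutMs: number,
  pollMs = 10,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (inFlight() === 0) return true; // everything finished: clean exit
    await new Promise((r) => setTimeout(r, pollMs));
  }
  return inFlight() === 0; // deadline hit: report whether we actually drained
}
```

A `true` result corresponds to `process.exit(0)` in the service above; `false` corresponds to the forced `process.exit(1)` path.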
```yaml
# Kubernetes deployment with liveness + readiness probes + graceful termination
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service          # selector/labels required for a valid Deployment
  template:
    metadata:
      labels:
        app: my-service
    spec:
      terminationGracePeriodSeconds: 30   # K8s waits 30s before SIGKILL
      containers:
        - name: my-service
          image: my-service:1.2.3
          ports:
            - containerPort: 8080

          # Liveness probe: restart pod if process is deadlocked
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5    # wait 5s before first check
            periodSeconds: 10
            failureThreshold: 3       # restart after 3 consecutive failures

          # Readiness probe: stop routing traffic if dependencies are down
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10   # wait for DB connection to establish
            periodSeconds: 5
            failureThreshold: 2       # stop traffic after 2 consecutive failures

          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: OTEL_ENDPOINT
              value: "http://jaeger-collector:4318/v1/traces"
```
Exam Answer vs. Production Reality
Liveness vs readiness probes
📖 What the exam expects
Liveness checks if the process is alive — Kubernetes restarts the pod if it fails. Readiness checks if the pod can serve traffic — Kubernetes stops routing to it if it fails.
How this might come up in interviews
Production readiness questions are the clearest signal of engineering maturity. Every L4+ engineer should be able to implement all four pillars without looking them up.
Common questions:
- What happens when a Kubernetes pod receives SIGTERM?
- How do you prevent a circuit breaker from opening on every deploy?
- What attributes should you add to OpenTelemetry spans?
- What is the difference between /health and /ready?
- How do you trace across service boundaries with OpenTelemetry?
- What does your circuit breaker return when the database is down?
- Why does graceful shutdown matter for Kubernetes deployments?
Strong answers include:
- Immediately distinguishes liveness (process alive) vs readiness (dependencies healthy)
- Knows Kubernetes sends SIGTERM 30s before SIGKILL and implements drain logic
- Mentions opossum or similar for circuit breakers, knows the three states (open/half-open/closed)
- Knows W3C traceparent propagates trace context across HTTP boundaries
- Mentions that circuit breaker fallback should return cached data or explicit 503, not hang
Red flags:
- "Both /health and /ready just check if the server is up" — conflates liveness and readiness
- Does not know what SIGTERM is or why Kubernetes sends it
- Has never implemented a circuit breaker — only knows retry logic
- Thinks tracing is just adding request ID to logs
- Cannot explain what happens to in-flight requests during a rolling deploy
Quick check · Build Challenge: Make It Observable and Resilient
Your Kubernetes pod receives SIGTERM. It has 8 in-flight requests. What is the correct behavior?
From the books
Release It! — Michael T. Nygard (2018)
Chapter 5: Stability Patterns
Circuit breakers, bulkheads, and timeouts are the difference between a service that fails fast and recovers and one that fails slow and cascades. Every production service needs all three.
Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (2022)
Chapter 2: What Is Observability?
Observability is not a set of tools — it is the property of a system that allows you to ask any question about its internal state from its external outputs. You build for observability before incidents, not after.