Build Challenge: Ship the Job Board to Production
The job board has real users. The SRE team has a go-live checklist. Before they approve production traffic, you need: distributed traces, health + readiness endpoints, rate limiting, and a zero-downtime deploy pipeline. You have until Friday.
Why this matters at your level
- Implement the 4 go-live items from scratch. Deploy to Fly.io. Know what each health check does and why they are separate.
- Design the full observability stack. Set the sampling rate. Define the SLO based on the checklist metrics. Own the incident response runbook.
The final gap between a working app and a production app: observability, resilience, and deployment. This is what every senior engineer adds to a junior engineer's project before it goes live. The four items on the SRE checklist seem bureaucratic until the first 3AM incident — then each one is the difference between a 5-minute fix and a 4-hour outage. This challenge is that checklist, made concrete.
The question this raises
Your app works perfectly in development. What will break at 3AM in production that you cannot debug without these four things?
A Kubernetes liveness probe calls /health which connects to the database and runs SELECT 1. The database is slow but healthy. The SELECT 1 takes 3 seconds. The liveness probe timeout is 2 seconds. What happens?
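The misconfiguration in that scenario can be written down as a probe spec. This is a hypothetical Deployment fragment, not one from the lesson's repo; the path and timings mirror the question:

```yaml
# Hypothetical Kubernetes Deployment fragment illustrating the trap:
# the liveness probe hits an endpoint that queries the database, and the
# probe timeout (2s) is shorter than the slow-but-healthy query (3s).
livenessProbe:
  httpGet:
    path: /health        # this endpoint runs SELECT 1 against the DB -- the mistake
    port: 3000
  timeoutSeconds: 2      # probe fails whenever SELECT 1 takes longer than 2s
  periodSeconds: 15
  failureThreshold: 3    # after 3 consecutive failures, kubelet restarts the pod
```

With these numbers, a healthy process gets killed and restarted purely because its dependency is slow, which is exactly why the DB check belongs in readiness, not liveness.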
Lesson outline
Before the SRE checklist: two desks, one bug
How this concept changes your thinking
A job application fails in production
“Server logs: [Error: Cannot read properties of undefined]. No request ID. No user ID. No trace. No stack trace. 3AM. Engineer guesses. Tries to reproduce locally. Cannot reproduce. Fixes the wrong thing. Bug persists.”
“Structured log with OpenTelemetry trace: { traceId: "abc123", spanId: "def456", userId: "user_789", operation: "createApplication", error: "Foreign key constraint failed on field: jobId", duration: 84 }. Engineer knows exactly what failed, who was affected, and what input caused it.”
Kubernetes restarts the pod under load
“/health endpoint is the same as /ready. Kubernetes kills and restarts a pod because the DB connection pool is temporarily exhausted (readiness issue). But the pod is running fine — the DB is momentarily overloaded. Restarting the pod makes the DB overload worse. Cascade begins.”
“/health (liveness): return 200 immediately — is the process running? /ready (readiness): check DB connection pool (< 90% used) + Redis ping. If DB is overloaded: readiness fails, pod is removed from load balancer, does not receive new traffic, does not restart. DB recovers. Pod is added back.”
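The "< 90% used" pool rule above can be sketched as a pure predicate. The `PoolStats` shape here is hypothetical, since every DB client exposes pool metrics differently; adapt it to whatever yours reports:

```typescript
// Readiness rule from the checklist: fail once more than 90% of DB
// connections are in use. Pure function, so it is trivially testable.
interface PoolStats {
  active: number; // connections currently checked out
  max: number;    // pool size limit
}

export function poolIsReady(stats: PoolStats, threshold = 0.9): boolean {
  if (stats.max <= 0) return false; // misconfigured pool: fail closed
  return stats.active / stats.max < threshold;
}
```

A /readyz handler would call this with live pool metrics and flip to 503 when it returns false, taking the pod out of rotation without restarting it.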
Someone submits 1,000 applications in 1 minute via a script
“No rate limiting. 1,000 DB writes in 1 minute. Email worker fires 1,000 times. Resend API rate limit hit. Email worker enters retry loop. Queue depth: 10,000. DB connections exhausted. Legitimate users cannot apply.”
“Rate limit: 10 applications per user per hour via Upstash Redis. Application endpoint returns 429 Too Many Requests with Retry-After header. Legitimate users unaffected. Abuse contained at the HTTP layer before any DB write.”
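The sliding-window logic behind that 429 can be sketched in-process. In production the state lives in Upstash Redis so every pod shares one window; this single-process version only shows the algorithm and how Retry-After is derived:

```typescript
// Minimal in-memory sliding-window limiter (illustration only -- a real
// deployment keeps this state in Redis so it works across pods).
export class SlidingWindowLimiter {
  private hits = new Map<string, number[]>(); // key -> request timestamps (ms)

  constructor(private limit: number, private windowMs: number) {}

  /** Decide one attempted request; returns Retry-After in seconds on refusal. */
  check(key: string, now = Date.now()): { success: boolean; retryAfterSec: number } {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      // The oldest hit still in the window decides when capacity frees up.
      const retryAfterSec = Math.ceil((recent[0] + this.windowMs - now) / 1000);
      this.hits.set(key, recent);
      return { success: false, retryAfterSec };
    }
    recent.push(now);
    this.hits.set(key, recent);
    return { success: true, retryAfterSec: 0 };
  }
}
```

Note the refusal happens before any DB write, which is the whole point: abuse is absorbed at the HTTP layer, not by the database.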
The SRE go-live checklist
The SRE go-live checklist: 4 items, all required
1. OpenTelemetry traces: every request has a trace from browser click to DB query.
2. /healthz: returns 200 if the process is running (liveness probe).
3. /readyz: returns 200 only if DB + Redis are reachable AND under capacity (readiness probe).
4. Rate limiting: 10 applications per user per hour.

Without these: a 3AM incident takes 4 hours to diagnose instead of 5 minutes, a slow DB cascades into a pod restart loop, and a script can take down the site for all users.
What both sides tell you
Production observability is about closing the gap between what the browser sees (HTTP status codes, response times, error messages) and what the server knows (which function failed, which DB query timed out, which user triggered the error). OpenTelemetry traces connect the two with a shared traceId.
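That shared traceId travels between browser and server in the W3C `traceparent` header. A simplified parser shows what the header actually carries; in practice OpenTelemetry's propagators do this for you:

```typescript
// Parse a W3C trace context header: "version-traceId-spanId-flags",
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
// Simplified sketch; real propagators also handle future versions and tracestate.
export function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, , traceId, spanId, flags] = m;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Because the browser's fetch and the server's span both carry the same 32-hex-digit traceId, a frontend error report and a backend DB timeout can be joined into one timeline.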
```ts
// app/api/healthz/route.ts — liveness probe (always fast)
export async function GET() {
  // ONLY checks if the process is running. No DB. No Redis.
  // If this fails: the process is dead — Kubernetes should restart it.
  return Response.json({ status: 'ok', ts: Date.now() }, { status: 200 });
}
```

```ts
// app/api/readyz/route.ts — readiness probe (checks dependencies)
import { db } from '@/lib/db';
import { redis } from '@/lib/redis';

export async function GET() {
  const checks: Record<string, boolean> = {};

  // DB check: can we execute a query? (with timeout)
  try {
    await Promise.race([
      db.$queryRaw`SELECT 1`,
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), 2000)),
    ]);
    checks.db = true;
  } catch {
    checks.db = false;
  }

  // Redis check: can we ping?
  try {
    await redis.ping();
    checks.redis = true;
  } catch {
    checks.redis = false;
  }

  const ready = Object.values(checks).every(Boolean);
  // Return 503 if NOT ready — Kubernetes removes the pod from the load balancer
  return Response.json({ status: ready ? 'ready' : 'not ready', checks }, {
    status: ready ? 200 : 503,
  });
}
```

```ts
// OpenTelemetry trace around createApplication
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('job-board-api');

export async function createApplication(data: CreateApplicationInput) {
  return tracer.startActiveSpan('createApplication', async (span) => {
    span.setAttributes({
      'app.userId': data.userId,
      'app.jobId': data.jobId,
    });
    try {
      const application = await db.application.create({ data });
      span.setStatus({ code: SpanStatusCode.OK });
      return application;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Structured log output:
// { traceId: "abc123", spanId: "def456", userId: "user_789",
//   operation: "createApplication", error: "Foreign key constraint...", duration: 84 }
```
```yaml
# .github/workflows/deploy.yml
name: Deploy to Fly.io

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
```

```toml
# fly.toml — health checks + rolling deploy
[http_service]
  internal_port = 3000

  [http_service.checks]
    [http_service.checks.health]
      grace_period = "5s"
      interval = "15s"
      method = "GET"
      path = "/api/healthz"
      timeout = "2s"

    [http_service.checks.ready]
      grace_period = "10s"
      interval = "15s"
      method = "GET"
      path = "/api/readyz"
      timeout = "5s"
```

```ts
// middleware.ts — rate limiting via Upstash Redis sliding window
import { NextRequest, NextResponse } from 'next/server';
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 h'), // 10 per user per hour
});

export async function middleware(req: NextRequest) {
  const userId = req.headers.get('x-user-id') ?? req.ip ?? 'anonymous';
  const { success, reset } = await ratelimit.limit(`apply:${userId}`);
  if (!success) {
    return new NextResponse('Too Many Requests', {
      status: 429,
      headers: { 'Retry-After': String(Math.ceil((reset - Date.now()) / 1000)) },
    });
  }
}
```
4 Pillars of the Go-Live Checklist
- Distributed traces (OpenTelemetry) — Trace every request from browser to DB. 3AM diagnosis: 5 minutes with traces, 4 hours without.
- Liveness + readiness probes — /healthz for Kubernetes restarts, /readyz for load balancer routing. Never put DB checks in /healthz.
- Rate limiting — Protect the DB from abuse and spikes at the HTTP layer. Use Redis for distributed rate limiting (works across multiple pods).
- Zero-downtime deploys (rolling update) — New pod starts, health check passes, old pod drains. If health check fails: rollback automatic.
How companies go to production
The interview question: "What is the first thing you look at when an alert fires at 3AM?"
Prepared answer: the trace. Every alert should link to the trace for the failing request. The trace shows: which service, which function, which DB query, which external API call failed. If there is no trace, the answer is "I look at the logs and grep for the error message" — which takes 40 minutes instead of 5. This is why traces are on the SRE checklist.
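The log entry that the alert links to can be sketched as a small formatter. The field names mirror the example earlier in this lesson; they are this lesson's convention, not a mandated schema:

```typescript
// One JSON object per line: greppable AND machine-parseable, and every
// entry carries the traceId the alert deep-links to.
interface LogContext {
  traceId: string;
  spanId: string;
  userId: string;
}

export function errorLog(
  ctx: LogContext,
  operation: string,
  error: Error,
  durationMs: number,
): string {
  return JSON.stringify({
    ...ctx,
    operation,
    error: error.message,
    duration: durationMs,
  });
}
```

The difference at 3AM: instead of grepping raw text for an error message, the on-call engineer pastes the traceId from the alert and gets every log line and span for that one request.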
```ts
// instrumentation.ts — Next.js OTel initialization
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { PrismaInstrumentation } from '@prisma/instrumentation';

export function register() {
  const sdk = new NodeSDK({
    serviceName: 'job-board-api',
    traceExporter: new OTLPTraceExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // Jaeger, Honeycomb, etc.
    }),
    instrumentations: [
      new PrismaInstrumentation(), // Auto-traces all Prisma queries
    ],
  });
  sdk.start();
}

// Complete production additions summary:
// 1. instrumentation.ts → OTel SDK (traces all Next.js routes + Prisma)
// 2. app/api/healthz/route.ts → return 200 immediately (liveness)
// 3. app/api/readyz/route.ts → check DB + Redis (readiness)
// 4. middleware.ts → Upstash rate limiting (10 apps/user/hour)
// 5. fly.toml → http_checks for /api/healthz + /api/readyz
// 6. .github/workflows/deploy.yml → test → deploy on merge to main

// Zero-downtime deploy sequence (Fly.io rolling update):
// 1. flyctl deploy starts → new container image built
// 2. New pod starts → /api/healthz returns 200 → Fly marks it healthy
// 3. /api/readyz returns 200 → Fly adds pod to load balancer
// 4. Old pod drains (finishes in-flight requests, max 30s)
// 5. Old pod terminated — zero dropped requests throughout

// If new pod fails /api/readyz: Fly never adds it to the load balancer.
// If new pod fails /api/healthz: Fly terminates it, rolls back to previous.
```
Exam Answer vs. Production Reality
Liveness vs readiness
📖 What the exam expects
Liveness checks if the process is alive. Readiness checks if it can serve traffic.
How this might come up in interviews
Production readiness questions test whether you have shipped real systems, not just working prototypes.
Common questions:
- What is the difference between a liveness probe and a readiness probe?
- What is the first thing you look at when an alert fires at 3AM?
- How would you implement rate limiting across multiple API servers?
- Walk me through your zero-downtime deployment process.
Strong answers include:
- Immediately explains why liveness and readiness are separate probes
- Answers "traces" to the 3AM question without hesitation
- Knows that IP-based rate limiting fails behind NAT — uses user ID
- Describes rolling deploy with health check gates
Red flags:
- Puts DB query in the liveness probe
- Cannot explain what OpenTelemetry is
- Does not know what a readiness probe is
- Thinks rate limiting is just nginx config
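The NAT point deserves one concrete line of code: behind NAT, many users share one IP, so keying the limiter on IP throttles innocent neighbors. Key on the authenticated user and fall back to IP only for anonymous traffic. A hypothetical helper:

```typescript
// Choose the rate-limit key: per-user when authenticated (NAT-safe),
// per-IP only as a fallback for anonymous requests.
export function rateLimitKey(userId: string | null, ip: string | null): string {
  if (userId) return `apply:user:${userId}`;
  return `apply:ip:${ip ?? 'unknown'}`;
}
```

Two authenticated users behind the same office NAT get independent budgets, while an anonymous scripted attacker still gets throttled by address.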