Build Challenge: Ship the Job Board to Production
The job board has real users. The SRE team has a go-live checklist. Before they approve production traffic, you need: distributed traces, health + readiness endpoints, rate limiting, and a zero-downtime deploy pipeline. You have until Friday.
Why this matters at your level
- Implement the 4 go-live items from scratch. Deploy to Fly.io. Know what each health check does and why they are separate.
- Design the full observability stack. Set the sampling rate. Define the SLO based on the checklist metrics. Own the incident response runbook.
The final gap between a working app and a production app: observability, resilience, and deployment. This is what every senior engineer adds to a junior engineer's project before it goes live. The four items on the SRE checklist seem bureaucratic until the first 3AM incident — then each one is the difference between a 5-minute fix and a 4-hour outage. This challenge is that checklist, made concrete.
The question this raises
Your app works perfectly in development. What will break at 3AM in production that you cannot debug without these four things?
A Kubernetes liveness probe calls /health which connects to the database and runs SELECT 1. The database is slow but healthy. The SELECT 1 takes 3 seconds. The liveness probe timeout is 2 seconds. What happens?
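The misconfiguration in that scenario can be written down as a probe spec. This is a hypothetical Deployment fragment, not one from the lesson's repo; the path and timings mirror the question:

```yaml
# Hypothetical Kubernetes Deployment fragment illustrating the trap:
# the liveness probe hits an endpoint that queries the database, and the
# probe timeout (2s) is shorter than the slow-but-healthy query (3s).
livenessProbe:
  httpGet:
    path: /health        # this endpoint runs SELECT 1 against the DB -- the mistake
    port: 3000
  timeoutSeconds: 2      # probe fails whenever SELECT 1 takes longer than 2s
  periodSeconds: 15
  failureThreshold: 3    # after 3 consecutive failures, kubelet restarts the pod
```

With these numbers, a healthy process gets killed and restarted purely because its dependency is slow, which is exactly why the DB check belongs in readiness, not liveness.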
Lesson outline
Before the SRE checklist: two desks, one bug
How this concept changes your thinking
A job application fails in production
“Server logs: [Error: Cannot read properties of undefined]. No request ID. No user ID. No trace. No stack trace. 3AM. Engineer guesses. Tries to reproduce locally. Cannot reproduce. Fixes the wrong thing. Bug persists.”
“Structured log with OpenTelemetry trace: { traceId: "abc123", spanId: "def456", userId: "user_789", operation: "createApplication", error: "Foreign key constraint failed on field: jobId", duration: 84 }. Engineer knows exactly what failed, who was affected, and what input caused it.”
Kubernetes restarts the pod under load
“/health endpoint is the same as /ready. Kubernetes kills and restarts a pod because the DB connection pool is temporarily exhausted (readiness issue). But the pod is running fine — the DB is momentarily overloaded. Restarting the pod makes the DB overload worse. Cascade begins.”
“/health (liveness): return 200 immediately — is the process running? /ready (readiness): check DB connection pool (< 90% used) + Redis ping. If DB is overloaded: readiness fails, pod is removed from load balancer, does not receive new traffic, does not restart. DB recovers. Pod is added back.”
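The "< 90% used" pool rule above can be sketched as a pure predicate. The `PoolStats` shape here is hypothetical, since every DB client exposes pool metrics differently; adapt it to whatever yours reports:

```typescript
// Readiness rule from the checklist: fail once more than 90% of DB
// connections are in use. Pure function, so it is trivially testable.
interface PoolStats {
  active: number; // connections currently checked out
  max: number;    // pool size limit
}

export function poolIsReady(stats: PoolStats, threshold = 0.9): boolean {
  if (stats.max <= 0) return false; // misconfigured pool: fail closed
  return stats.active / stats.max < threshold;
}
```

A /readyz handler would call this with live pool metrics and flip to 503 when it returns false, taking the pod out of rotation without restarting it.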
Someone submits 1,000 applications in 1 minute via a script
“No rate limiting. 1,000 DB writes in 1 minute. Email worker fires 1,000 times. Resend API rate limit hit. Email worker enters retry loop. Queue depth: 10,000. DB connections exhausted. Legitimate users cannot apply.”
“Rate limit: 10 applications per user per hour via Upstash Redis. Application endpoint returns 429 Too Many Requests with Retry-After header. Legitimate users unaffected. Abuse contained at the HTTP layer before any DB write.”
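The sliding-window logic behind that 429 can be sketched in-process. In production the state lives in Upstash Redis so every pod shares one window; this single-process version only shows the algorithm and how Retry-After is derived:

```typescript
// Minimal in-memory sliding-window limiter (illustration only -- a real
// deployment keeps this state in Redis so it works across pods).
export class SlidingWindowLimiter {
  private hits = new Map<string, number[]>(); // key -> request timestamps (ms)

  constructor(private limit: number, private windowMs: number) {}

  /** Decide one attempted request; returns Retry-After in seconds on refusal. */
  check(key: string, now = Date.now()): { success: boolean; retryAfterSec: number } {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      // The oldest hit still in the window decides when capacity frees up.
      const retryAfterSec = Math.ceil((recent[0] + this.windowMs - now) / 1000);
      this.hits.set(key, recent);
      return { success: false, retryAfterSec };
    }
    recent.push(now);
    this.hits.set(key, recent);
    return { success: true, retryAfterSec: 0 };
  }
}
```

Note the refusal happens before any DB write, which is the whole point: abuse is absorbed at the HTTP layer, not by the database.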
The SRE go-live checklist
The SRE go-live checklist: 4 items, all required
1. OpenTelemetry traces: every request has a trace from browser click to DB query.
2. /healthz: returns 200 if the process is running (liveness probe).
3. /readyz: returns 200 only if DB + Redis are reachable AND under capacity (readiness probe).
4. Rate limiting: 10 applications per user per hour.

Without these: a 3AM incident takes 4 hours to diagnose instead of 5 minutes, a slow DB cascades into a pod restart loop, and a script can take down the site for all users.
What both sides tell you
Production observability is about closing the gap between what the browser sees (HTTP status codes, response times, error messages) and what the server knows (which function failed, which DB query timed out, which user triggered the error). OpenTelemetry traces connect the two with a shared traceId.
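That shared traceId travels between browser and server in the W3C `traceparent` header. A simplified parser shows what the header actually carries; in practice OpenTelemetry's propagators do this for you:

```typescript
// Parse a W3C trace context header: "version-traceId-spanId-flags",
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
// Simplified sketch; real propagators also handle future versions and tracestate.
export function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, , traceId, spanId, flags] = m;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Because the browser's fetch and the server's span both carry the same 32-hex-digit traceId, a frontend error report and a backend DB timeout can be joined into one timeline.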
```ts
// app/api/healthz/route.ts — liveness probe (always fast)
export async function GET() {
  // ONLY checks if the process is running. No DB. No Redis.
  // If this fails: the process is dead — Kubernetes should restart it.
  return Response.json({ status: 'ok', ts: Date.now() }, { status: 200 });
}
```

```ts
// app/api/readyz/route.ts — readiness probe (checks dependencies)
import { db } from '@/lib/db';
import { redis } from '@/lib/redis';

export async function GET() {
  const checks: Record<string, boolean> = {};

  // DB check: can we execute a query? (with timeout)
  try {
    await Promise.race([
      db.$queryRaw`SELECT 1`,
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), 2000)),
    ]);
    checks.db = true;
  } catch {
    checks.db = false;
  }

  // Redis check: can we ping?
  try {
    await redis.ping();
    checks.redis = true;
  } catch {
    checks.redis = false;
  }

  const ready = Object.values(checks).every(Boolean);
  // Return 503 if NOT ready — Kubernetes removes the pod from the load balancer
  return Response.json({ status: ready ? 'ready' : 'not ready', checks }, {
    status: ready ? 200 : 503,
  });
}
```

```ts
// OpenTelemetry trace around createApplication
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('job-board-api');

export async function createApplication(data: CreateApplicationInput) {
  return tracer.startActiveSpan('createApplication', async (span) => {
    span.setAttributes({
      'app.userId': data.userId,
      'app.jobId': data.jobId,
    });
    try {
      const application = await db.application.create({ data });
      span.setStatus({ code: SpanStatusCode.OK });
      return application;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Structured log output:
// { traceId: "abc123", spanId: "def456", userId: "user_789",
//   operation: "createApplication", error: "Foreign key constraint...", duration: 84 }
```
```yaml
# .github/workflows/deploy.yml
name: Deploy to Fly.io

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
```

```toml
# fly.toml — health checks + rolling deploy
[http_service]
  internal_port = 3000

  [http_service.checks]
    [http_service.checks.health]
      grace_period = "5s"
      interval = "15s"
      method = "GET"
      path = "/api/healthz"
      timeout = "2s"

    [http_service.checks.ready]
      grace_period = "10s"
      interval = "15s"
      method = "GET"
      path = "/api/readyz"
      timeout = "5s"
```

```ts
// middleware.ts — rate limiting via Upstash Redis sliding window
import { NextRequest, NextResponse } from 'next/server';
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 h'), // 10 per user per hour
});

export async function middleware(req: NextRequest) {
  const userId = req.headers.get('x-user-id') ?? req.ip ?? 'anonymous';
  const { success, reset } = await ratelimit.limit(`apply:${userId}`);
  if (!success) {
    return new NextResponse('Too Many Requests', {
      status: 429,
      headers: { 'Retry-After': String(Math.ceil((reset - Date.now()) / 1000)) },
    });
  }
}
```
4 Pillars of the Go-Live Checklist
- Distributed traces (OpenTelemetry) — Trace every request from browser to DB. 3AM diagnosis: 5 minutes with traces, 4 hours without.
- Liveness + readiness probes — /healthz for Kubernetes restarts, /readyz for load balancer routing. Never put DB checks in /healthz.
- Rate limiting — Protect the DB from abuse and spikes at the HTTP layer. Use Redis for distributed rate limiting (works across multiple pods).
- Zero-downtime deploys (rolling update) — New pod starts, health check passes, old pod drains. If health check fails: rollback automatic.
How companies go to production
The interview question: "What is the first thing you look at when an alert fires at 3AM?"
Prepared answer: the trace. Every alert should link to the trace for the failing request. The trace shows: which service, which function, which DB query, which external API call failed. If there is no trace, the answer is "I look at the logs and grep for the error message" — which takes 40 minutes instead of 5. This is why traces are on the SRE checklist.
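The log entry that the alert links to can be sketched as a small formatter. The field names mirror the example earlier in this lesson; they are this lesson's convention, not a mandated schema:

```typescript
// One JSON object per line: greppable AND machine-parseable, and every
// entry carries the traceId the alert deep-links to.
interface LogContext {
  traceId: string;
  spanId: string;
  userId: string;
}

export function errorLog(
  ctx: LogContext,
  operation: string,
  error: Error,
  durationMs: number,
): string {
  return JSON.stringify({
    ...ctx,
    operation,
    error: error.message,
    duration: durationMs,
  });
}
```

The difference at 3AM: instead of grepping raw text for an error message, the on-call engineer pastes the traceId from the alert and gets every log line and span for that one request.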
```ts
// instrumentation.ts — Next.js OTel initialization
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { PrismaInstrumentation } from '@prisma/instrumentation';

export function register() {
  const sdk = new NodeSDK({
    serviceName: 'job-board-api',
    traceExporter: new OTLPTraceExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // Jaeger, Honeycomb, etc.
    }),
    instrumentations: [
      new PrismaInstrumentation(), // Auto-traces all Prisma queries
    ],
  });
  sdk.start();
}

// Complete production additions summary:
// 1. instrumentation.ts → OTel SDK (traces all Next.js routes + Prisma)
// 2. app/api/healthz/route.ts → return 200 immediately (liveness)
// 3. app/api/readyz/route.ts → check DB + Redis (readiness)
// 4. middleware.ts → Upstash rate limiting (10 apps/user/hour)
// 5. fly.toml → http_checks for /api/healthz + /api/readyz
// 6. .github/workflows/deploy.yml → test → deploy on merge to main

// Zero-downtime deploy sequence (Fly.io rolling update):
// 1. flyctl deploy starts → new container image built
// 2. New pod starts → /api/healthz returns 200 → Fly marks it healthy
// 3. /api/readyz returns 200 → Fly adds pod to load balancer
// 4. Old pod drains (finishes in-flight requests, max 30s)
// 5. Old pod terminated — zero dropped requests throughout

// If new pod fails /api/readyz: Fly never adds it to the load balancer.
// If new pod fails /api/healthz: Fly terminates it, rolls back to previous.
```
Exam Answer vs. Production Reality
Liveness vs readiness
📖 What the exam expects
Liveness checks if the process is alive. Readiness checks if it can serve traffic.
How this might come up in interviews
Production readiness questions test whether you have shipped real systems, not just working prototypes.
Common questions:
- What is the difference between a liveness probe and a readiness probe?
- What is the first thing you look at when an alert fires at 3AM?
- How would you implement rate limiting across multiple API servers?
- Walk me through your zero-downtime deployment process.
Strong answers include:
- Immediately explains why liveness and readiness are separate probes
- Answers "traces" to the 3AM question without hesitation
- Knows that IP-based rate limiting fails behind NAT — uses user ID
- Describes rolling deploy with health check gates
Red flags:
- Puts DB query in the liveness probe
- Cannot explain what OpenTelemetry is
- Does not know what a readiness probe is
- Thinks rate limiting is just nginx config
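The NAT point deserves one concrete line of code: behind NAT, many users share one IP, so keying the limiter on IP throttles innocent neighbors. Key on the authenticated user and fall back to IP only for anonymous traffic. A hypothetical helper:

```typescript
// Choose the rate-limit key: per-user when authenticated (NAT-safe),
// per-IP only as a fallback for anonymous requests.
export function rateLimitKey(userId: string | null, ip: string | null): string {
  if (userId) return `apply:user:${userId}`;
  return `apply:ip:${ip ?? 'unknown'}`;
}
```

Two authenticated users behind the same office NAT get independent budgets, while an anonymous scripted attacker still gets throttled by address.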