
Build Challenge: Diagnose and Fix a Slow API

Your senior hands you a PR with a working endpoint that has a p99 of 4 seconds. Run EXPLAIN ANALYZE, fix the query, add Redis caching, add rate limiting. Code review in 24 hours.

Relevant for: Mid-level, Senior
Why this matters at your level
Mid-level

This challenge is the exact bar for a mid-level backend engineer: independently diagnose N+1, add caching with TTL, implement rate limiting. If you can do all three in 24 hours with tests, you are operating at mid-level. Focus on using EXPLAIN ANALYZE before adding any fix.

Senior

Senior engineers do all three AND explain why (not just how). In code review, a senior catches N+1 in the PR diff without running the code. They specify TTL based on data freshness requirements, not "5 minutes sounds right." They design rate limit tiers, not a single limit.

Real Code Review Task
PR received

GET /api/feed working but p99=4.2s. k6: fails at 200 concurrent users. Senior says: fix it.

+2 hours

EXPLAIN ANALYZE run. N+1 found: 20 author queries per feed request. Fix: single JOIN query. p99: 180ms.

+4 hours

Redis cache-aside added with 5min TTL + jitter. Cache hit p99: 3ms. Load test: passes 500 concurrent users.

+6 hours

Rate limiting added: 60 requests/minute per user ID (token bucket in Redis). 429 returns Retry-After header.

+24 hours

Code review: 3 comments (naming, test coverage), 0 architecture comments. PR merged.

The outcome: a 3 ms feed endpoint p99 (after the N+1 fix, on a Redis cache hit), and 24 hours to identify and fix 3 production-blocking issues that were invisible in development.

The question this raises

When a working endpoint has 4-second p99, N+1 queries, no caching, and no rate limiting — and you have 24 hours to fix it — where do you start?

Test your assumption first

Your ORM code loads 20 posts for a user's feed, then accesses post.author.name for each post in a loop. How many database queries does this make?

Lesson outline

Before performance-aware API design: the broken contract

How this concept changes your thinking

Situation: ORM query for related data
Before: Fetch 20 posts. For each post, the ORM lazy-loads the author: 20 separate SELECT queries, 21 queries total (N+1). Invisible in development at N=5, catastrophic in production at N=20 under load.
After: Eager load: a single JOIN fetches posts and authors together. 1 query instead of 21. ORM: include("author"). Raw SQL: LEFT JOIN users ON posts.author_id = users.id.

Situation: Repeated identical requests
Before: Every GET /feed/123 hits the database. 1000 users refreshing their feed per second = 1000 DB queries/second for the same data. Database CPU at 90%. Connection pool exhausted.
After: Cache-aside with TTL jitter: the first request fetches from the DB and caches. The next 999 requests are served from Redis in 3ms. The database sees 1 query per cache TTL window, not 1000/second.

Situation: No rate limiting on read endpoints
Before: One user with a script can call GET /feed/123 in a tight loop: 10,000 requests/minute. The connection pool saturates. All other users see slow responses or errors.
After: Token bucket: 60 requests/minute per authenticated user ID. 429 response with Retry-After header. Legitimate users: unaffected. Scripts: blocked at the gateway, not at the database.
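The cache-aside pattern above can be sketched in Python. This is a minimal sketch, not production code: a plain dict stands in for Redis (in production you would store the value with SETEX and a jittered TTL), and the `loader` callback is a hypothetical stand-in for the database query.

```python
import random
import time

class CacheAside:
    """Cache-aside with TTL jitter. `cache` is a dict standing in for Redis;
    `loader` is the function that actually hits the database."""

    def __init__(self, loader, base_ttl=300, jitter=30):
        self.loader = loader
        self.base_ttl = base_ttl   # 5 minutes, as in the challenge
        self.jitter = jitter       # +/- seconds so keys don't all expire at once
        self.cache = {}            # key -> (expires_at, value)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]                      # cache hit: no DB work
        value = self.loader(key)                 # cache miss: one DB query
        ttl = self.base_ttl + random.uniform(-self.jitter, self.jitter)
        self.cache[key] = (time.monotonic() + ttl, value)
        return value

# 1000 identical requests inside the TTL window -> exactly 1 loader (DB) call
calls = []
feed = CacheAside(lambda k: calls.append(k) or {"feed": k})
for _ in range(1000):
    feed.get("feed:123")
assert len(calls) == 1
```

The jitter is what prevents a thundering herd: without it, every key cached in the same burst expires in the same instant and the database absorbs all the misses at once.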

The 3AM page

k6 load test results — 200 concurrent users — FAIL

scenarios: 200 VUs
http_req_duration p99=4.218s (threshold: 500ms) FAIL
http_req_failed rate=34.2%
db.connection_pool.active: 48/50 (96%)
db.query_count: 4,200/sec (21 queries per request x 200 concurrent)
redis.connected: false
rate_limiting: not configured
Verdict: not production-ready. Fix before merge.

Reading the signals

EXPLAIN ANALYZE on the feed query reveals the N+1. 21 queries per request is visible in pg_stat_statements: the same SELECT users WHERE id = ? query running 20 times per feed load, with the same execution plan.

n-plus-one-diagnosis.sql
-- pg_stat_statements: find N+1 pattern
-- (same query template running many times per second)
SELECT
  left(query, 80) AS query_snippet,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(total_exec_time::numeric, 0) AS total_ms
FROM pg_stat_statements
WHERE query LIKE '%SELECT%users%WHERE%id%'
ORDER BY calls DESC
LIMIT 5;

-- Output revealing the N+1: 84,000 calls of the same single-author
-- query, with 176 seconds of total time wasted on redundant queries.
-- query_snippet                                      calls   mean_ms  total_ms
-- SELECT id, name, avatar FROM users WHERE id = $1   84000   2.1      176400
-- 84000 calls = 4200 requests * 20 authors each
-- mean_ms=2.1 looks innocent; total_ms=176,400 is the damage

-- EXPLAIN ANALYZE on the N+1 query (single author fetch):
EXPLAIN ANALYZE SELECT id, name, avatar FROM users WHERE id = 123;
-- Index Scan using users_pkey on users (cost=0.42..2.44 rows=1)
-- Actual time=0.043..0.045 rows=1  <- fast per query
-- But 20x per feed request = 40ms wasted per feed load on author fetches alone

-- Fix: a single JOIN fetches all 20 posts and all 20 authors
-- in one round trip to the database.
EXPLAIN ANALYZE
SELECT p.*, u.id AS author_id, u.name AS author_name, u.avatar
FROM posts p
LEFT JOIN users u ON p.author_id = u.id
WHERE p.user_id = 123
ORDER BY p.created_at DESC
LIMIT 20;
-- Index Scan using idx_posts_user_created on posts
-- Hash Join on users
-- Actual time=0.8..1.2 rows=20  <- 1 query for all 20 posts + all 20 authors
The production checklist

Fix: N+1 query
Tool: EXPLAIN ANALYZE + JOIN
Before: 21 queries, 4.2s p99
After: 1 query, 180ms p99
Verification: pg_stat_statements: author query calls drop from 84K/min to 0

Fix: No caching
Tool: Redis cache-aside + TTL jitter
Before: 180ms every request
After: 3ms cache hit, 180ms miss
Verification: redis INFO: keyspace_hits/misses ratio > 95% hit rate

Fix: No rate limiting
Tool: Token bucket (Redis SETNX)
Before: Unlimited, pool saturates
After: 60 req/min/user, 429 on excess
Verification: Load test: 1000 req/user/min → 429 after 60, pool stays <20%
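The token bucket in the checklist can be sketched in a few lines of Python. This is an illustrative in-process version: in production the bucket state would live in Redis keyed per user (typically updated atomically with a Lua script), and the `allow` check would run at the gateway.

```python
import time

class TokenBucket:
    """Token bucket: `capacity` tokens, refilled at `rate` tokens/second.
    A full bucket permits a burst; the refill rate enforces the steady state."""

    def __init__(self, capacity=60, rate=1.0):   # 60 req/min steady state
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)            # start full: burst allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False      # caller responds 429 with a Retry-After header

# A script firing 100 requests in a tight loop: the burst of 60 passes,
# the 61st is rejected until tokens refill.
bucket = TokenBucket()
results = [bucket.allow() for _ in range(100)]
assert results[:60] == [True] * 60
assert results[60] is False
```

Note the design choice the source calls out: the bucket is keyed by authenticated user ID, not IP, so 30 engineers behind one office NAT each get their own 60 req/min.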

The order matters: fix query first, then cache

Fix N+1 before adding Redis. Caching a 4.2s query means every cache miss is still 4.2s — and under load, cache misses are frequent. After the JOIN fix: cache misses are 180ms, cache hits are 3ms. Now caching is worth it.
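The arithmetic behind this ordering can be made concrete. The 90% hit rate below is an assumed figure for illustration, not a number from the challenge:

```python
def effective_latency_ms(hit_ms, miss_ms, hit_rate):
    """Expected request latency with cache-aside:
    hits served at hit_ms, misses at miss_ms."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# Caching the unfixed 4.2s query (assumed 90% hit rate):
avg_before = effective_latency_ms(3, 4200, 0.90)   # ~423 ms average, every miss still 4.2 s
# Caching after the JOIN fix:
avg_after = effective_latency_ms(3, 180, 0.90)     # ~21 ms average, worst case 180 ms
```

The average improves either way, but only the second case bounds the worst case: with the query fixed first, even a cold cache serves requests in 180 ms.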

How to talk about this in your interview

Strong vs weak interview answers for the "slow API" scenario

  • Weak: "I would add Redis." Caching before diagnosing is cargo-culting. The interviewer asks: "How did you know the problem was latency and not N+1?" You have no answer.
  • Strong: "I would run EXPLAIN ANALYZE first." Diagnosis before fix. EXPLAIN ANALYZE shows whether the query itself is slow (bad plan, missing index) or whether 21 fast queries are the problem (N+1). Then you add Redis for the remaining latency.
  • Weak: "I would rate limit at 1000 req/min per IP." IP rate limiting collapses for users behind NAT: 30 engineers sharing one office IP get 33 req/min each. Wrong scope.
  • Strong: "Rate limit by authenticated user ID with a token bucket in Redis." User ID scoping survives NAT. A token bucket allows bursts (30 requests in the first second) while enforcing a steady state (60 req/min). Return 429 with Retry-After so clients back off gracefully.

Exam Answer vs. Production Reality


N+1 queries

📖 What the exam expects

N+1 occurs when you fetch N parent records and then issue one additional query per record to fetch related data, resulting in N+1 total database queries instead of 1.
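The definition above can be demonstrated end to end with a throwaway SQLite database. The schema and row counts are hypothetical stand-ins for the feed's posts and users tables; the trace callback counts every SQL statement sent to the engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(20)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, i, f"post{i}") for i in range(20)])

queries = []
conn.set_trace_callback(queries.append)   # record every statement executed

# N+1: fetch 20 posts, then fetch each author one by one
# (this is what ORM lazy loading does under the hood)
posts = conn.execute("SELECT id, author_id FROM posts LIMIT 20").fetchall()
for _, author_id in posts:
    conn.execute("SELECT name FROM users WHERE id = ?", (author_id,)).fetchone()
assert len(queries) == 21     # 1 posts query + 20 author queries

# Eager load: one JOIN replaces all 21
queries.clear()
rows = conn.execute("""
    SELECT p.id, u.name FROM posts p
    LEFT JOIN users u ON p.author_id = u.id LIMIT 20
""").fetchall()
assert len(queries) == 1 and len(rows) == 20
```

The production reality the exam definition leaves out: at N=5 in a dev database this is 6 fast queries and nobody notices; the count only becomes visible when you measure queries per request, not latency per query.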


How this might come up in interviews

The "fix this slow endpoint" scenario appears in backend interviews across FAANG and growth-stage companies. The evaluation is: do you diagnose before fixing, do you know the correct order of fixes (query first, cache second, rate limit third), and do you verify with metrics?

Common questions:

  • How would you diagnose a slow endpoint in production?
  • What is N+1 and how do you detect it in a running system?
  • How do you decide the TTL for a Redis cache?
  • Why should you rate limit by user ID instead of IP address?

Strong answers include:

  • Runs EXPLAIN ANALYZE as the first diagnostic step
  • Explains N+1 and how ORM lazy loading causes it
  • Specifies TTL based on data freshness requirements
  • Designs rate limiting by user ID with token bucket and Retry-After header

Red flags:

  • Adds Redis before diagnosing the root cause
  • Does not know what N+1 is or how to detect it
  • Rate limits by IP address without mentioning the NAT problem
  • Cannot explain what EXPLAIN ANALYZE shows
