Build Challenge: Diagnose and Fix a Slow API
Your senior hands you a PR with a working endpoint that has p99 of 4 seconds. Add EXPLAIN ANALYZE, fix the query, add Redis caching, add rate limiting. Code review in 24 hours.
Why this matters at your level
This challenge is the exact bar for a mid-level backend engineer: independently diagnose N+1, add caching with TTL, implement rate limiting. If you can do all three in 24 hours with tests, you are operating at mid-level. Focus on using EXPLAIN ANALYZE before adding any fix.
Senior engineers do all three AND explain why (not just how). In code review, a senior catches N+1 in the PR diff without running the code. They specify TTL based on data freshness requirements, not "5 minutes sounds right." They design rate limit tiers, not a single limit.
GET /api/feed working but p99=4.2s. k6: fails at 200 concurrent users. Senior says: fix it.
EXPLAIN ANALYZE run. N+1 found: 20 author queries per feed request. Fix: single JOIN query. p99: 180ms.
Redis cache-aside added with 5min TTL + jitter. Cache hit p99: 3ms. Load test: passes 500 concurrent users.
Rate limiting added: 60 requests/minute per user ID (token bucket in Redis). 429 returns Retry-After header.
Code review: 3 comments (naming, test coverage), 0 architecture comments. PR merged.
The question this raises
When a working endpoint has 4-second p99, N+1 queries, no caching, and no rate limiting — and you have 24 hours to fix it — where do you start?
Your ORM code loads 20 posts for a user's feed, then accesses post.author.name for each post in a loop. How many database queries does this make?
Lesson outline
Before performance-aware API design: the broken contract
How this concept changes your thinking
ORM query for related data
“Fetch 20 posts. For each post, ORM lazy-loads the author: 20 separate SELECT queries. N+1 = 21 total queries. Invisible in development at N=5, catastrophic in production at N=20 under load.”
“Eager load: single JOIN query fetches posts + authors together. 1 query instead of 21. ORM: include("author"). Raw SQL: LEFT JOIN users ON posts.author_id = users.id.”
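The 21-versus-1 query count above can be made concrete with a minimal sketch. The in-memory "database", the query counter, and names like `fetch_feed_lazy` are illustrative stand-ins, not the ORM from the PR; the point is only to show where the extra 20 queries come from.

```python
# Illustrative in-memory stand-ins for the posts and users tables.
POSTS = [{"id": i, "author_id": i % 5} for i in range(20)]
USERS = {i: {"id": i, "name": f"user{i}"} for i in range(5)}

query_count = 0

def db_query(result):
    # Every call simulates one round-trip to the database.
    global query_count
    query_count += 1
    return result

def fetch_feed_lazy():
    posts = db_query(list(POSTS))                       # 1 query for posts
    for p in posts:
        p["author"] = db_query(USERS[p["author_id"]])   # +1 query per post (lazy load)
    return posts

def fetch_feed_eager():
    # Simulates SELECT ... FROM posts LEFT JOIN users: one round-trip total.
    return db_query([{**p, "author": USERS[p["author_id"]]} for p in POSTS])

query_count = 0
fetch_feed_lazy()
lazy_queries = query_count    # 21 queries: the N+1

query_count = 0
fetch_feed_eager()
eager_queries = query_count   # 1 query: the JOIN
```

Note that at N=5 in development the lazy version issues only 6 queries, which is why the problem stays invisible until production data sizes.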
Repeated identical requests
“Every GET /feed/123 hits the database. 1000 users refreshing their feed per second = 1000 DB queries/second for the same data. Database CPU at 90%. Connection pool exhausted.”
“Cache-aside with TTL jitter: first request fetches from DB and caches. Next 999 requests served from Redis in 3ms. Database sees 1 query per cache TTL window, not 1000/second.”
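A cache-aside read with TTL jitter can be sketched as below. `FakeRedis` is a stand-in for a real Redis client's `get`/`setex`, and `fetch_feed_from_db` is a stub for the JOIN query; both names are assumptions for illustration.

```python
import random
import time

BASE_TTL = 300   # 5-minute TTL
JITTER = 30      # +/- seconds so keys don't all expire in the same instant

class FakeRedis:
    """Stand-in for a Redis client with get/setex semantics."""
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = (value, time.monotonic() + ttl)
    def get(self, key):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, expires = hit
        if time.monotonic() >= expires:
            del self.store[key]   # expired: behave like a miss
            return None
        return value

cache = FakeRedis()
db_queries = 0

def fetch_feed_from_db(user_id):
    # Stub for the single JOIN query (~180ms in the fixed endpoint).
    global db_queries
    db_queries += 1
    return f"feed-for-{user_id}"

def get_feed(user_id):
    key = f"feed:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                                     # cache hit: ~3ms
    value = fetch_feed_from_db(user_id)                   # cache miss: 1 DB query
    ttl = BASE_TTL + random.randint(-JITTER, JITTER)      # jitter spreads expiry
    cache.setex(key, ttl, value)
    return value

first = get_feed(123)    # miss: fetches from DB, populates cache
second = get_feed(123)   # hit: served from cache, DB untouched
```

The jitter is what prevents a thundering herd: without it, 1000 keys cached in the same second all expire in the same second, and the database absorbs 1000 simultaneous misses.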
No rate limiting on read endpoints
“One user with a script can call GET /feed/123 in a tight loop: 10,000 requests/minute. Connection pool saturates. All other users experience slow responses or errors.”
“Token bucket: 60 requests/minute per authenticated user ID. 429 response with Retry-After header. Legitimate users: unaffected. Scripts: blocked at gateway, not at database.”
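The token bucket itself is a few lines of arithmetic. In production the bucket state would live in Redis (typically behind a Lua script so refill-and-decrement is atomic); here a plain dict stands in so the algorithm is visible. The constants match the lesson's 60 req/min limit; the function names are illustrative.

```python
import time

CAPACITY = 60       # burst size: up to 60 requests back-to-back
REFILL_RATE = 1.0   # tokens per second = 60/minute steady state

buckets = {}  # user_id -> (tokens, last_refill_timestamp); Redis in production

def allow(user_id, now=None):
    now = time.monotonic() if now is None else now
    tokens, last = buckets.get(user_id, (CAPACITY, now))
    # Refill proportionally to elapsed time, capped at bucket capacity.
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens >= 1:
        buckets[user_id] = (tokens - 1, now)
        return True    # 200: request allowed
    buckets[user_id] = (tokens, now)
    return False       # 429: respond with a Retry-After header

# A burst of 61 requests at the same instant: 60 pass, the 61st is limited.
results = [allow("user-123", now=0.0) for _ in range(61)]
allowed = sum(results)

# One second later, one token has refilled and a request passes again.
recovered = allow("user-123", now=1.0)
```

This is why token bucket beats a fixed window here: bursts up to capacity are allowed (a page load firing 30 requests at once succeeds), while a tight-loop script is throttled to the steady refill rate.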
The 3AM page
k6 load test results — 200 concurrent users — FAIL
```
scenarios: 200 VUs
http_req_duration p99=4.218s (threshold: 500ms) FAIL
http_req_failed rate=34.2%
db.connection_pool.active: 48/50 (96%)
db.query_count: 4,200/sec (21 queries per request x 200 concurrent)
redis.connected: false
rate_limiting: not configured
Verdict: not production-ready. Fix before merge.
```
Reading the signals
EXPLAIN ANALYZE on the feed query reveals the N+1. The 21 queries per request are visible in pg_stat_statements: the same `SELECT ... FROM users WHERE id = $1` query running 20 times per feed load, always with the same execution plan.
```sql
-- pg_stat_statements: find N+1 pattern
-- (same query template running many times per second)
SELECT
  left(query, 80) AS query_snippet,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(total_exec_time::numeric, 0) AS total_ms
FROM pg_stat_statements
WHERE query LIKE '%SELECT%users%WHERE%id%'
ORDER BY calls DESC
LIMIT 5;

-- Output revealing the N+1:
-- query_snippet                                     calls   mean_ms   total_ms
-- SELECT id, name, avatar FROM users WHERE id = $1  84000   2.1       176400
-- 84000 calls = 4200 requests * 20 authors each
-- mean_ms=2.1 looks innocent; total_ms=176,400ms is the damage
```

84,000 calls of the same single-author query is the N+1 signature. Total time: 176 seconds wasted on redundant queries.

```sql
-- EXPLAIN ANALYZE on the N+1 query (single author fetch):
EXPLAIN ANALYZE SELECT id, name, avatar FROM users WHERE id = 123;
-- Index Scan using users_pkey on users  (cost=0.42..2.44 rows=1)
--   Actual time=0.043..0.045 rows=1   <- fast per query
-- But 20x per feed request = 40ms wasted per feed load on author fetches alone
```

The JOIN fetches all 20 posts and all 20 authors in a single round-trip to the database.

```sql
-- Fix: single JOIN query
EXPLAIN ANALYZE
SELECT p.*, u.id AS author_id, u.name AS author_name, u.avatar
FROM posts p
LEFT JOIN users u ON p.author_id = u.id
WHERE p.user_id = 123
ORDER BY p.created_at DESC
LIMIT 20;
-- Index Scan using idx_posts_user_created on posts
-- Hash Join on users
-- Actual time=0.8..1.2 rows=20   <- 1 query for all 20 posts + all 20 authors
```
The production checklist
| Fix | Tool | Before | After | Verification |
|---|---|---|---|---|
| N+1 query | EXPLAIN ANALYZE + JOIN | 21 queries, 4.2s p99 | 1 query, 180ms p99 | pg_stat_statements: author query calls drop from 84K/min to 0 |
| No caching | Redis cache-aside + TTL jitter | 180ms every request | 3ms cache hit, 180ms miss | redis INFO: keyspace_hits/misses ratio > 95% hit rate |
| No rate limiting | Token bucket (Redis SETNX) | Unlimited, pool saturates | 60 req/min/user, 429 on excess | Load test: 1000 req/user/min → 429 after 60, pool stays <20% |
The order matters: fix query first, then cache
Fix N+1 before adding Redis. Caching a 4.2s query means every cache miss is still 4.2s — and under load, cache misses are frequent. After the JOIN fix: cache misses are 180ms, cache hits are 3ms. Now caching is worth it.
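A back-of-envelope expected-latency calculation makes the ordering argument quantitative. The 20% miss rate below is a hypothetical assumption for illustration, not a measurement from the PR; the 3ms, 180ms, and 4,200ms figures come from the lesson.

```python
# Expected latency per request = P(hit) * hit_latency + P(miss) * miss_latency.
# Assumed miss rate of 20% is illustrative.
hit_rate, miss_rate = 0.80, 0.20
cache_hit_ms = 3

# Cache added BEFORE fixing the N+1: every miss still pays the 4,200ms query.
before_fix_ms = hit_rate * cache_hit_ms + miss_rate * 4200   # ~842ms expected

# Cache added AFTER the JOIN fix: a miss pays only 180ms.
after_fix_ms = hit_rate * cache_hit_ms + miss_rate * 180     # ~38ms expected
```

With the assumed miss rate, caching alone still leaves expected latency near 850ms, far above a 500ms p99 threshold; fixing the query first is what makes the cache numbers work.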
How to talk about this in your interview
Strong vs weak interview answers for the "slow API" scenario
- Weak: "I would add Redis" — Caching before diagnosing is cargo-culting. The interviewer asks: "How did you know the problem was latency and not N+1?" You do not.
- Strong: "I would run EXPLAIN ANALYZE first" — Diagnosis before fix. EXPLAIN ANALYZE shows whether the query is slow (bad plan, missing index) or whether 21 fast queries are the problem (N+1). Then you add Redis for the remaining latency.
- Weak: "I would rate limit at 1000 req/min per IP" — IP rate limiting collapses for users behind NAT. 30 engineers sharing one office IP = 33 req/min each. Wrong scope.
- Strong: "Rate limit by authenticated user ID with token bucket in Redis" — User ID scoping survives NAT. Token bucket allows bursts (30 req in first second) while enforcing steady state (60 req/min). 429 with Retry-After so clients back off gracefully.
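The "clients back off gracefully" half of that answer can also be sketched. `fake_call` below simulates a rate-limited API that returns 429 twice before succeeding; the function names and response shape are assumptions for illustration, not a real HTTP client.

```python
import time

def fake_call(state):
    # Simulated server: first two calls are limited with Retry-After, then 200.
    state["attempts"] += 1
    if state["attempts"] <= 2:
        return {"status": 429, "headers": {"Retry-After": "1"}}
    return {"status": 200, "headers": {}}

def call_with_backoff(state, max_retries=5, sleep=time.sleep):
    for _ in range(max_retries):
        resp = fake_call(state)
        if resp["status"] != 429:
            return resp
        # The server says exactly how long to wait; honor it instead of
        # retrying immediately in a tight loop.
        sleep(int(resp["headers"].get("Retry-After", "1")))
    raise RuntimeError("still rate limited after retries")

state = {"attempts": 0}
waits = []
resp = call_with_backoff(state, sleep=waits.append)  # record waits instead of blocking
```

Without the Retry-After header, well-behaved clients are forced to guess a backoff; with it, the rate limiter and its clients converge instead of amplifying the load.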
Exam Answer vs. Production Reality
N+1 queries
📖 What the exam expects
N+1 occurs when you fetch N parent records and then issue one additional query per record to fetch related data, resulting in N+1 total database queries instead of 1.
How this might come up in interviews
The "fix this slow endpoint" scenario appears in backend interviews across FAANG and growth-stage companies. The evaluation is: do you diagnose before fixing, do you know the correct order of fixes (query first, cache second, rate limit third), and do you verify with metrics?
Common questions:
- How would you diagnose a slow endpoint in production?
- What is N+1 and how do you detect it in a running system?
- How do you decide the TTL for a Redis cache?
- Why should you rate limit by user ID instead of IP address?
Strong answers include:
- Runs EXPLAIN ANALYZE as the first diagnostic step
- Explains N+1 and how ORM lazy loading causes it
- Specifies TTL based on data freshness requirements
- Designs rate limiting by user ID with token bucket and Retry-After header
Red flags:
- Adds Redis before diagnosing the root cause
- Does not know what N+1 is or how to detect it
- Rate limits by IP address without mentioning the NAT problem
- Cannot explain what EXPLAIN ANALYZE shows