Database Optimization: Indexing, Query Performance, and Sharding
A bad query plan at 1k users becomes a 3am PagerDuty alert at 10M users
Database Optimization: Indexing, Query Performance, and Sharding
A bad query plan at 1k users becomes a 3am PagerDuty alert at 10M users
What you'll learn
- EXPLAIN ANALYZE is your most important tool โ run it on every slow query.
- Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
- Composite index column order: most selective first, sort column last (leftmost-prefix rule).
- Partial indexes on high-selectivity subsets can be 1000ร smaller than full indexes.
- Partition large tables first (transparent to app) before considering sharding.
- Sharding eliminates cross-shard JOINs and transactions forever โ it's the nuclear option.
Indexes: How Databases Find Data in Milliseconds
Without indexes, a database scans every row in a table (Sequential Scan). For 100 million rows, that's catastrophic. An index is a separate B-tree data structure that lets the database jump directly to matching rows.
B-Tree Index Internals
PostgreSQL's default index is a B-tree (balanced tree). Values stored in sorted order. WHERE email = 'alice@example.com' navigates the tree: O(log N) comparisons vs O(N) for a full scan. For 100M rows: tree = ~27 comparisons vs 100M row scan.
Index Types and When to Use Each
- ๐ณB-tree (default) โ Equality (=), range (<, >, BETWEEN), ORDER BY. Works for most queries.
- ๐GIN โ Array containment, JSONB queries, full-text search. PostgreSQL-specific.
- ๐GiST โ Geographic/geospatial queries (PostGIS), range types, fuzzy matching.
- โ๏ธPartial index โ Index a subset of rows: CREATE INDEX ON orders(user_id) WHERE status = 'pending'. Tiny and fast for hot subsets.
- ๐ฆCovering index (INCLUDE) โ Store extra columns in the index leaf. Satisfies queries without touching main table (index-only scan).
1-- Diagnose with EXPLAIN ANALYZEEXPLAIN ANALYZE actually runs the query โ use on production with caution for expensive queries2EXPLAIN ANALYZE3SELECT u.name, COUNT(o.id) as order_count4FROM users u5LEFT JOIN orders o ON o.user_id = u.id6WHERE u.created_at > NOW() - INTERVAL '30 days'7GROUP BY u.id, u.name8ORDER BY order_count DESC9LIMIT 100;1011-- Bad plan output:12-- Seq Scan on users (cost=0.00..12847.00 rows=50000)13-- Seq Scan on orders (cost=0.00..48000.00 rows=2000000)1415-- Fix: add indexes16CREATE INDEX idx_users_created_at ON users(created_at);17CREATE INDEX idx_orders_user_id ON orders(user_id);1819-- Good plan after indexes:20-- Index Scan on users using idx_users_created_at21-- Index Scan on orders using idx_orders_user_id2223-- Composite index: column order matters24-- Query: WHERE user_id = ? AND status = ? ORDER BY created_at25CREATE INDEX idx_orders_user_status_date ON orders(user_id, status, created_at);Composite index leftmost-prefix rule: (a, b, c) supports queries on a, (a,b), or (a,b,c) โ not b or c alone26-- Supports: WHERE user_id=? AND status=?27-- Also supports: WHERE user_id=? alone28-- Does NOT support: WHERE status=? alone (leftmost prefix rule)2930-- Partial index: only pending orders (tiny fraction)Partial index: if only 0.1% of orders are pending, this index is 1000x smaller than a full index31CREATE INDEX idx_orders_pending ON orders(user_id, created_at)32WHERE status = 'pending';33-- 1000x smaller than full index on large tables with few pending orders3435-- Covering index: satisfy query from index alone36CREATE INDEX idx_users_email_name ON users(email) INCLUDE (name, id);INCLUDE stores columns in leaf nodes โ query never hits main table, much faster for read-heavy workloads37-- Query: SELECT id, name FROM users WHERE email = ?38-- โ Index-only scan โ never touches the main table3940-- EXPLAIN keywords to know:41-- "Seq Scan" โ full table scan, add an index42-- "Index Scan" โ using index + fetching from table43-- "Index Only" โ index has all needed data (fastest)
Query Optimization: The Top Anti-Patterns
Top Query Performance Anti-Patterns
- โSELECT * in production โ Fetches all columns including blobs you don't need. Prevents index-only scans. Always SELECT specific columns.
- โN+1 queries in ORM โ Fetching a list then calling .relation for each item. Use eager loading: include() / load().
- โFunction on indexed column in WHERE โ WHERE YEAR(created_at) = 2024 can't use an index. Use range: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'.
- โLIKE '%prefix%' โ Leading wildcard prevents index use. Use full-text search (pg_trgm, Elasticsearch) for substring matching.
- โNo LIMIT on unbounded queries โ SELECT * FROM logs WHERE level = 'error' can return millions of rows and OOM your app server.
- โImplicit type coercion โ WHERE user_id = '123' when user_id is INTEGER causes type coercion and can invalidate indexes.
Enable the Slow Query Log
Log all queries > 100ms. PostgreSQL: log_min_duration_statement = 100. MySQL: slow_query_log = ON, long_query_time = 0.1. Review weekly. This is your most valuable diagnostic tool.
Partitioning vs Sharding: Managing Large Tables
| Technique | Where Data Lives | App Transparency | Cross-JOINs | When to Use |
|---|---|---|---|---|
| Partitioning | Same server, different files | Transparent to app | Supported within DB | Large tables (>100M rows), time-series, archiving |
| Sharding | Different database servers | Requires routing layer | Not possible across shards | When one server can't handle the load |
Partition Before You Shard
Table partitioning (PostgreSQL PARTITION BY RANGE/LIST/HASH) is transparent to your application and dramatically improves query performance for time-series and large tables. You can still run cross-partition JOINs. Sharding requires routing logic in your app and eliminates cross-shard JOINs forever.
Sharding Strategies
- Hash sharding โ shard = hash(user_id) % N. Even distribution but resharding is expensive (moves half the data).
- ๐ Range sharding โ Shard by ID range or date. Easy to add shards but risk of hot spots on the latest shard.
- ๐Consistent hashing โ Virtual nodes on a ring; adding/removing nodes only moves ~1/N data. Used by DynamoDB, Cassandra.
How this might come up in interviews
Database optimization is a practical skill. Show you can diagnose slow queries with EXPLAIN ANALYZE and apply the right fix โ not just "add an index" or "shard it".
Common questions:
- How would you diagnose a slow database query?
- Explain how B-tree indexes work internally
- When would you use a partial index?
- What is the difference between database partitioning and sharding?
Strong answers include:
- Mentions EXPLAIN ANALYZE as first diagnostic step
- Knows composite index column ordering (leftmost-prefix rule)
- Understands when planner prefers seq scan over index scan
- Clearly distinguishes partitioning from sharding
Red flags:
- Jumps to "we should shard" without considering indexes first
- Doesn't know EXPLAIN ANALYZE or query plans
- "Always add indexes" โ ignores write overhead
Quick check ยท Database Optimization: Indexing, Query Performance, and Sharding
1 / 2
A query WHERE created_at > '2024-01-01' has an index on created_at but EXPLAIN ANALYZE shows a Seq Scan. Why?
Key takeaways
- EXPLAIN ANALYZE is your most important tool โ run it on every slow query.
- Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
- Composite index column order: most selective first, sort column last (leftmost-prefix rule).
- Partial indexes on high-selectivity subsets can be 1000ร smaller than full indexes.
- Partition large tables first (transparent to app) before considering sharding.
- Sharding eliminates cross-shard JOINs and transactions forever โ it's the nuclear option.
From the books
Use The Index, Luke โ Markus Winand (2011)
Chapter 1: Anatomy of an Index
Indexes speed up reads but slow down writes. The art is knowing when an index is worth the write overhead โ and knowing which index type to use.
Ready to see how this works in the cloud?
Switch to Career Paths for structured paths (e.g. Developer, DevOps) and provider-specific lessons.
View role-based pathsSign in to track your progress and mark lessons complete.
Discussion
Questions? Discuss in the community or start a thread below.
Join DiscordIn-app Q&A
Sign in to start or join a thread.