Interactive Explainer

🎯Key Takeaways

EXPLAIN ANALYZE is your most important tool — run it on every slow query.

Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.

Composite index column order: most selective first, sort column last (leftmost-prefix rule).

Partial indexes on high-selectivity subsets can be 1000× smaller than full indexes.

Partition large tables first (transparent to app) before considering sharding.

Sharding eliminates cross-shard JOINs and transactions forever — it's the nuclear option.

Database Optimization: Indexing, Query Performance, and Sharding

A bad query plan at 1k users becomes a 3am PagerDuty alert at 10M users

~4 min read

Be the first to complete!

What you'll learn

EXPLAIN ANALYZE is your most important tool — run it on every slow query.
Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
Composite index column order: most selective first, sort column last (leftmost-prefix rule).
Partial indexes on high-selectivity subsets can be 1000× smaller than full indexes.
Partition large tables first (transparent to app) before considering sharding.
Sharding eliminates cross-shard JOINs and transactions forever — it's the nuclear option.

Indexes: How Databases Find Data in Milliseconds

Without indexes, a database scans every row in a table (Sequential Scan). For 100 million rows, that's catastrophic. An index is a separate B-tree data structure that lets the database jump directly to matching rows.

B-Tree Index Internals

PostgreSQL's default index is a B-tree (balanced tree). Values stored in sorted order. WHERE email = 'alice@example.com' navigates the tree: O(log N) comparisons vs O(N) for a full scan. For 100M rows: tree = ~27 comparisons vs 100M row scan.

Index Types and When to Use Each

🌳B-tree (default) — Equality (=), range (<, >, BETWEEN), ORDER BY. Works for most queries.
🔍GIN — Array containment, JSONB queries, full-text search. PostgreSQL-specific.
📍GiST — Geographic/geospatial queries (PostGIS), range types, fuzzy matching.
⚙️Partial index — Index a subset of rows: CREATE INDEX ON orders(user_id) WHERE status = 'pending'. Tiny and fast for hot subsets.
📦Covering index (INCLUDE) — Store extra columns in the index leaf. Satisfies queries without touching main table (index-only scan).

index-optimization.sql

1-- Diagnose with EXPLAIN ANALYZE
EXPLAIN ANALYZE actually runs the query — use on production with caution for expensive queries
2EXPLAIN ANALYZE
3SELECT u.name, COUNT(o.id) as order_count
4FROM users u
5LEFT JOIN orders o ON o.user_id = u.id
6WHERE u.created_at > NOW() - INTERVAL '30 days'
7GROUP BY u.id, u.name
8ORDER BY order_count DESC
9LIMIT 100;
10 
11-- Bad plan output:
12-- Seq Scan on users  (cost=0.00..12847.00 rows=50000)
13-- Seq Scan on orders (cost=0.00..48000.00 rows=2000000)
14 
15-- Fix: add indexes
16CREATE INDEX idx_users_created_at ON users(created_at);
17CREATE INDEX idx_orders_user_id ON orders(user_id);
18 
19-- Good plan after indexes:
20-- Index Scan on users using idx_users_created_at
21-- Index Scan on orders using idx_orders_user_id
22 
23-- Composite index: column order matters
24-- Query: WHERE user_id = ? AND status = ? ORDER BY created_at
25CREATE INDEX idx_orders_user_status_date ON orders(user_id, status, created_at);
Composite index leftmost-prefix rule: (a, b, c) supports queries on a, (a,b), or (a,b,c) — not b or c alone
26-- Supports: WHERE user_id=? AND status=?
27-- Also supports: WHERE user_id=? alone
28-- Does NOT support: WHERE status=? alone (leftmost prefix rule)
29 
30-- Partial index: only pending orders (tiny fraction)
Partial index: if only 0.1% of orders are pending, this index is 1000x smaller than a full index
31CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
32WHERE status = 'pending';
33-- 1000x smaller than full index on large tables with few pending orders
34 
35-- Covering index: satisfy query from index alone
36CREATE INDEX idx_users_email_name ON users(email) INCLUDE (name, id);
INCLUDE stores columns in leaf nodes — query never hits main table, much faster for read-heavy workloads
37-- Query: SELECT id, name FROM users WHERE email = ?
38-- → Index-only scan — never touches the main table
39 
40-- EXPLAIN keywords to know:
41-- "Seq Scan"    → full table scan, add an index
42-- "Index Scan"  → using index + fetching from table
43-- "Index Only"  → index has all needed data (fastest)

Query Optimization: The Top Anti-Patterns

Top Query Performance Anti-Patterns

❌SELECT * in production — Fetches all columns including blobs you don't need. Prevents index-only scans. Always SELECT specific columns.
❌N+1 queries in ORM — Fetching a list then calling .relation for each item. Use eager loading: include() / load().
❌Function on indexed column in WHERE — WHERE YEAR(created_at) = 2024 can't use an index. Use range: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'.
❌LIKE '%prefix%' — Leading wildcard prevents index use. Use full-text search (pg_trgm, Elasticsearch) for substring matching.
❌No LIMIT on unbounded queries — SELECT * FROM logs WHERE level = 'error' can return millions of rows and OOM your app server.
❌Implicit type coercion — WHERE user_id = '123' when user_id is INTEGER causes type coercion and can invalidate indexes.

Enable the Slow Query Log

Log all queries > 100ms. PostgreSQL: log_min_duration_statement = 100. MySQL: slow_query_log = ON, long_query_time = 0.1. Review weekly. This is your most valuable diagnostic tool.

Partitioning vs Sharding: Managing Large Tables

Technique	Where Data Lives	App Transparency	Cross-JOINs	When to Use
Partitioning	Same server, different files	Transparent to app	Supported within DB	Large tables (>100M rows), time-series, archiving
Sharding	Different database servers	Requires routing layer	Not possible across shards	When one server can't handle the load

Partition Before You Shard

Table partitioning (PostgreSQL PARTITION BY RANGE/LIST/HASH) is transparent to your application and dramatically improves query performance for time-series and large tables. You can still run cross-partition JOINs. Sharding requires routing logic in your app and eliminates cross-shard JOINs forever.

Sharding Strategies

Hash sharding — shard = hash(user_id) % N. Even distribution but resharding is expensive (moves half the data).
📅Range sharding — Shard by ID range or date. Easy to add shards but risk of hot spots on the latest shard.
🔄Consistent hashing — Virtual nodes on a ring; adding/removing nodes only moves ~1/N data. Used by DynamoDB, Cassandra.

How this might come up in interviews

Database optimization is a practical skill. Show you can diagnose slow queries with EXPLAIN ANALYZE and apply the right fix — not just "add an index" or "shard it".

Common questions:

How would you diagnose a slow database query?
Explain how B-tree indexes work internally
When would you use a partial index?
What is the difference between database partitioning and sharding?

Strong answers include:

Mentions EXPLAIN ANALYZE as first diagnostic step
Knows composite index column ordering (leftmost-prefix rule)
Understands when planner prefers seq scan over index scan
Clearly distinguishes partitioning from sharding

Red flags:

Jumps to "we should shard" without considering indexes first
Doesn't know EXPLAIN ANALYZE or query plans
"Always add indexes" — ignores write overhead

Quick check · Database Optimization: Indexing, Query Performance, and Sharding

1 / 2

A query WHERE created_at > '2024-01-01' has an index on created_at but EXPLAIN ANALYZE shows a Seq Scan. Why?

Key takeaways

EXPLAIN ANALYZE is your most important tool — run it on every slow query.
Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
Composite index column order: most selective first, sort column last (leftmost-prefix rule).
Partial indexes on high-selectivity subsets can be 1000× smaller than full indexes.
Partition large tables first (transparent to app) before considering sharding.
Sharding eliminates cross-shard JOINs and transactions forever — it's the nuclear option.

From the books

Use The Index, Luke — Markus Winand (2011)

Chapter 1: Anatomy of an Index

Indexes speed up reads but slow down writes. The art is knowing when an index is worth the write overhead — and knowing which index type to use.

Ready to see how this works in the cloud?

Switch to Career Paths for structured paths (e.g. Developer, DevOps) and provider-specific lessons.

View role-based paths

Discussion

Questions? Discuss in the community or start a thread below.

Join Discord

Database Optimization: Indexing, Query Performance, and Sharding

Database Optimization: Indexing, Query Performance, and Sharding

Indexes: How Databases Find Data in Milliseconds

Query Optimization: The Top Anti-Patterns

Partitioning vs Sharding: Managing Large Tables

A query WHERE created_at > '2024-01-01' has an index on created_at but EXPLAIN ANALYZE shows a Seq Scan. Why?

Discussion

In-app Q&A