Back
Interactive Explainer

Database Optimization: Indexing, Query Performance, and Sharding

A bad query plan at 1k users becomes a 3am PagerDuty alert at 10M users

๐ŸŽฏKey Takeaways
EXPLAIN ANALYZE is your most important tool โ€” run it on every slow query.
Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
Composite index column order: most selective first, sort column last (leftmost-prefix rule).
Partial indexes on high-selectivity subsets can be 1000ร— smaller than full indexes.
Partition large tables first (transparent to app) before considering sharding.
Sharding eliminates cross-shard JOINs and transactions forever โ€” it's the nuclear option.

Database Optimization: Indexing, Query Performance, and Sharding

A bad query plan at 1k users becomes a 3am PagerDuty alert at 10M users

~4 min read
Be the first to complete!
What you'll learn
  • EXPLAIN ANALYZE is your most important tool โ€” run it on every slow query.
  • Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
  • Composite index column order: most selective first, sort column last (leftmost-prefix rule).
  • Partial indexes on high-selectivity subsets can be 1000ร— smaller than full indexes.
  • Partition large tables first (transparent to app) before considering sharding.
  • Sharding eliminates cross-shard JOINs and transactions forever โ€” it's the nuclear option.

Indexes: How Databases Find Data in Milliseconds

Without indexes, a database scans every row in a table (Sequential Scan). For 100 million rows, that's catastrophic. An index is a separate B-tree data structure that lets the database jump directly to matching rows.

B-Tree Index Internals

PostgreSQL's default index is a B-tree (balanced tree). Values stored in sorted order. WHERE email = 'alice@example.com' navigates the tree: O(log N) comparisons vs O(N) for a full scan. For 100M rows: tree = ~27 comparisons vs 100M row scan.

Index Types and When to Use Each

  • ๐ŸŒณB-tree (default) โ€” Equality (=), range (<, >, BETWEEN), ORDER BY. Works for most queries.
  • ๐Ÿ”GIN โ€” Array containment, JSONB queries, full-text search. PostgreSQL-specific.
  • ๐Ÿ“GiST โ€” Geographic/geospatial queries (PostGIS), range types, fuzzy matching.
  • โš™๏ธPartial index โ€” Index a subset of rows: CREATE INDEX ON orders(user_id) WHERE status = 'pending'. Tiny and fast for hot subsets.
  • ๐Ÿ“ฆCovering index (INCLUDE) โ€” Store extra columns in the index leaf. Satisfies queries without touching main table (index-only scan).
index-optimization.sql
1-- Diagnose with EXPLAIN ANALYZE
EXPLAIN ANALYZE actually runs the query โ€” use on production with caution for expensive queries
2EXPLAIN ANALYZE
3SELECT u.name, COUNT(o.id) as order_count
4FROM users u
5LEFT JOIN orders o ON o.user_id = u.id
6WHERE u.created_at > NOW() - INTERVAL '30 days'
7GROUP BY u.id, u.name
8ORDER BY order_count DESC
9LIMIT 100;
10
11-- Bad plan output:
12-- Seq Scan on users (cost=0.00..12847.00 rows=50000)
13-- Seq Scan on orders (cost=0.00..48000.00 rows=2000000)
14
15-- Fix: add indexes
16CREATE INDEX idx_users_created_at ON users(created_at);
17CREATE INDEX idx_orders_user_id ON orders(user_id);
18
19-- Good plan after indexes:
20-- Index Scan on users using idx_users_created_at
21-- Index Scan on orders using idx_orders_user_id
22
23-- Composite index: column order matters
24-- Query: WHERE user_id = ? AND status = ? ORDER BY created_at
25CREATE INDEX idx_orders_user_status_date ON orders(user_id, status, created_at);
Composite index leftmost-prefix rule: (a, b, c) supports queries on a, (a,b), or (a,b,c) โ€” not b or c alone
26-- Supports: WHERE user_id=? AND status=?
27-- Also supports: WHERE user_id=? alone
28-- Does NOT support: WHERE status=? alone (leftmost prefix rule)
29
30-- Partial index: only pending orders (tiny fraction)
Partial index: if only 0.1% of orders are pending, this index is 1000x smaller than a full index
31CREATE INDEX idx_orders_pending ON orders(user_id, created_at)
32WHERE status = 'pending';
33-- 1000x smaller than full index on large tables with few pending orders
34
35-- Covering index: satisfy query from index alone
36CREATE INDEX idx_users_email_name ON users(email) INCLUDE (name, id);
INCLUDE stores columns in leaf nodes โ€” query never hits main table, much faster for read-heavy workloads
37-- Query: SELECT id, name FROM users WHERE email = ?
38-- โ†’ Index-only scan โ€” never touches the main table
39
40-- EXPLAIN keywords to know:
41-- "Seq Scan" โ†’ full table scan, add an index
42-- "Index Scan" โ†’ using index + fetching from table
43-- "Index Only" โ†’ index has all needed data (fastest)

Query Optimization: The Top Anti-Patterns

Top Query Performance Anti-Patterns

  • โŒSELECT * in production โ€” Fetches all columns including blobs you don't need. Prevents index-only scans. Always SELECT specific columns.
  • โŒN+1 queries in ORM โ€” Fetching a list then calling .relation for each item. Use eager loading: include() / load().
  • โŒFunction on indexed column in WHERE โ€” WHERE YEAR(created_at) = 2024 can't use an index. Use range: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'.
  • โŒLIKE '%prefix%' โ€” Leading wildcard prevents index use. Use full-text search (pg_trgm, Elasticsearch) for substring matching.
  • โŒNo LIMIT on unbounded queries โ€” SELECT * FROM logs WHERE level = 'error' can return millions of rows and OOM your app server.
  • โŒImplicit type coercion โ€” WHERE user_id = '123' when user_id is INTEGER causes type coercion and can invalidate indexes.

Enable the Slow Query Log

Log all queries > 100ms. PostgreSQL: log_min_duration_statement = 100. MySQL: slow_query_log = ON, long_query_time = 0.1. Review weekly. This is your most valuable diagnostic tool.

Partitioning vs Sharding: Managing Large Tables

TechniqueWhere Data LivesApp TransparencyCross-JOINsWhen to Use
PartitioningSame server, different filesTransparent to appSupported within DBLarge tables (>100M rows), time-series, archiving
ShardingDifferent database serversRequires routing layerNot possible across shardsWhen one server can't handle the load

Partition Before You Shard

Table partitioning (PostgreSQL PARTITION BY RANGE/LIST/HASH) is transparent to your application and dramatically improves query performance for time-series and large tables. You can still run cross-partition JOINs. Sharding requires routing logic in your app and eliminates cross-shard JOINs forever.

Sharding Strategies

  • Hash sharding โ€” shard = hash(user_id) % N. Even distribution but resharding is expensive (moves half the data).
  • ๐Ÿ“…Range sharding โ€” Shard by ID range or date. Easy to add shards but risk of hot spots on the latest shard.
  • ๐Ÿ”„Consistent hashing โ€” Virtual nodes on a ring; adding/removing nodes only moves ~1/N data. Used by DynamoDB, Cassandra.
How this might come up in interviews

Database optimization is a practical skill. Show you can diagnose slow queries with EXPLAIN ANALYZE and apply the right fix โ€” not just "add an index" or "shard it".

Common questions:

  • How would you diagnose a slow database query?
  • Explain how B-tree indexes work internally
  • When would you use a partial index?
  • What is the difference between database partitioning and sharding?

Strong answers include:

  • Mentions EXPLAIN ANALYZE as first diagnostic step
  • Knows composite index column ordering (leftmost-prefix rule)
  • Understands when planner prefers seq scan over index scan
  • Clearly distinguishes partitioning from sharding

Red flags:

  • Jumps to "we should shard" without considering indexes first
  • Doesn't know EXPLAIN ANALYZE or query plans
  • "Always add indexes" โ€” ignores write overhead

Quick check ยท Database Optimization: Indexing, Query Performance, and Sharding

1 / 2

A query WHERE created_at > '2024-01-01' has an index on created_at but EXPLAIN ANALYZE shows a Seq Scan. Why?

Key takeaways

  • EXPLAIN ANALYZE is your most important tool โ€” run it on every slow query.
  • Always index foreign keys, WHERE columns, and ORDER BY columns in critical queries.
  • Composite index column order: most selective first, sort column last (leftmost-prefix rule).
  • Partial indexes on high-selectivity subsets can be 1000ร— smaller than full indexes.
  • Partition large tables first (transparent to app) before considering sharding.
  • Sharding eliminates cross-shard JOINs and transactions forever โ€” it's the nuclear option.

From the books

Use The Index, Luke โ€” Markus Winand (2011)

Chapter 1: Anatomy of an Index

Indexes speed up reads but slow down writes. The art is knowing when an index is worth the write overhead โ€” and knowing which index type to use.

Ready to see how this works in the cloud?

Switch to Career Paths for structured paths (e.g. Developer, DevOps) and provider-specific lessons.

View role-based paths

Sign in to track your progress and mark lessons complete.

Discussion

Questions? Discuss in the community or start a thread below.

Join Discord

In-app Q&A

Sign in to start or join a thread.