Zero-Downtime Database Migrations

On this page

The migration that locked the table for 40 seconds
Renovating a shop while customers keep shopping
The expand/contract rollout, end to end
Safe operations vs. operations that lock
The dangerous way vs. the safe way
Backfilling big tables without melting the database
Online schema-change tools
App and schema must agree at every moment
Rollback strategy
Common mistakes that cost hours
Takeaways and where to go next

The migration that locked the table for 40 seconds

Who this is for

Engineers who ship schema changes to a database that is actively serving traffic. If you have ever run a `migrate` step in your deploy and watched p99 latency spike, requests time out, or the on-call channel light up, this is for you. We assume Postgres/MySQL and a relational schema, but the patterns are universal.

It is 2pm on a Tuesday. You merge a PR that adds a last_login_at column to the users table. CI runs the migration, the deploy goes green, and within seconds your error budget is on fire. Every query touching users is queued behind a lock. The migration is doing a full table rewrite on 80 million rows, and until it finishes, 40 seconds later, the app is effectively down.

Nothing in that migration was wrong SQL. It was perfectly valid ALTER TABLE. The problem is that a schema is not a static thing you edit in a maintenance window anymore, it is a shared, live data structure that running code depends on right now. Change it carelessly and you either lock readers out or break the code mid-flight.

The fix is not a faster server or a bigger maintenance window. It is a method: never change a schema and the code that uses it in a single irreversible step. Instead you evolve both in small, backward-compatible stages where the old and new shapes coexist. This article is that method, sequenced end to end.

Renovating a shop while customers keep shopping

A zero-downtime migration is a sequence of individually safe, backward-compatible steps, never one big switch, so the application and the schema are compatible at every single moment in between.

Think about renovating a grocery store that never closes. You do not rope off the whole floor, rip out the old shelves, and tell shoppers to come back tomorrow. You build the new aisle alongside the old one, quietly move stock across overnight, point the signage at the new aisle once it is fully stocked, and only then tear out the old shelves. Customers shop the entire time and never notice the seam.

Build the new aisle next to the old oneAdd a new nullable column alongside the old one

Move stock over after hours, a pallet at a timeBackfill data in small batches off the hot path

Stock both aisles during the transitionDual-write to old and new columns

Point the signage at the new aisleSwitch reads to the new column

Tear out the empty old shelvesDrop the old column once nothing reads it

Expand/contract is just careful renovation: build new before you close old.

That five-move sequence is the expand/contract pattern (also called parallel change). Expand = add the new shape without removing the old. Contract = remove the old shape once nothing depends on it. Everything risky lives safely in between.

The expand/contract rollout, end to end

Here is the full pipeline for a column change that is too dangerous to do in one shot, say, splitting a free-text full_name into a normalized name_id foreign key. Each box is a separate, independently deployable step. You can pause, verify, and roll back between any two of them.

Each stage is shippable on its own; the app and schema stay compatible throughout.

1
Deploy tolerant code first
Ship app code that does not assume the new column exists. It still reads the old shape but will not break when the column appears. This decouples the deploy from the migration.
2
Add the new column (expand)
Add it nullable with no default. On modern Postgres/MySQL this is a fast metadata-only change, no table rewrite, no long lock.
3
Backfill in batches
Copy old values into the new column in small, throttled chunks so you never hold a long transaction or saturate replication.
4
Dual-write
Deploy code that writes both the old and new columns on every insert/update. Now new rows are always correct while the backfill finishes.
5
Switch reads
After backfill completes and parity is verified, deploy code that reads from the new column. Keep writing both for one more release as a safety net.
6
Drop the old column (contract)
Once nothing reads or writes the old column and you are confident, drop it in a final migration. The renovation is done.

Safe operations vs. operations that lock

Not every ALTER is dangerous. The trick is knowing which ones take a metadata-only lock (instant) versus which ones rewrite the table or block writers for the duration. The exact behavior is engine- and version-specific, but the shape below holds across modern Postgres and MySQL/InnoDB.

Operation	Risk	Safe alternative
Add nullable column, no default	Safe, metadata only	Just do it
Add NOT NULL column with default	Risky, may rewrite/lock on older versions	Add nullable, backfill, then add the constraint as NOT VALID + VALIDATE
Add an index	Locks writes while building	CREATE INDEX CONCURRENTLY (PG) / online DDL (MySQL)
Rename a column	Breaks running code instantly	Expand/contract: add new, dual-write, switch, drop
Change a column type	Full table rewrite + lock	Add new typed column, backfill, switch, drop
Add a foreign key	Locks both tables to validate	Add constraint NOT VALID, then VALIDATE separately
Drop a column	Safe lock-wise, but breaks code that selects *	Stop referencing it in code first, drop later

Default behavior on large tables, with the safe alternative.

A rename is never safe in one step

The instant you rename `email` to `email_address`, every running app instance still issuing `SELECT email` starts erroring, and during a rolling deploy old and new code run simultaneously. There is no version of the schema that satisfies both. That is why a rename must always become an expand/contract sequence, never a single `ALTER`.

The dangerous way vs. the safe way

Say we want to rename users.full_name to users.display_name. Here is the tempting one-liner, the one that takes the app down during a rolling deploy:

dangerous.sql

sql

-- DO NOT DO THIS on a live table.
-- Old app instances still run SELECT full_name and break instantly.
ALTER TABLE users RENAME COLUMN full_name TO display_name;

Now the safe version, broken into the migrations and deploys that make up expand/contract. Step 1, expand. Add the new column nullable so it is a cheap metadata change:

01_expand.sql

sql

-- Metadata-only on modern PG/MySQL: fast, no rewrite.
ALTER TABLE users ADD COLUMN display_name text;

Step 2, backfill in batches. Never UPDATE users SET display_name = full_name in one statement on a big table; that is one giant transaction that locks rows and balloons replication lag. Chunk it by primary key:

02_backfill.sql

sql

-- Run in a loop from the app or a job runner, committing each batch.
-- Walk the table by primary key so each pass is bounded and indexed.
UPDATE users
SET    display_name = full_name
WHERE  id > :last_id
AND    id <= :last_id + 5000
AND    display_name IS NULL;

-- Then: :last_id += 5000, sleep ~100ms, repeat until no rows remain.
-- Each batch is a short transaction; readers and writers stay unblocked.

Step 3, dual-write. Deploy app code that writes both columns so any row changed during/after the backfill stays consistent:

userRepo.ts

typescript

// Dual-write: keep old and new in lockstep until reads have moved.
async function updateUser(id: string, name: string) {
  await db.query(
    `UPDATE users
     SET full_name = ${'$1'}, display_name = ${'$1'}
     WHERE id = ${'$2'}`,
    [name, id],
  );
}

Step 4, switch reads to display_name, verify parity in production for a release, then step 5, contract by dropping the old column:

03_contract.sql

sql

-- Only after NO deployed code reads or writes full_name.
ALTER TABLE users DROP COLUMN full_name;

And for indexes, the other classic table-locker, build them without blocking writers. Postgres has CONCURRENTLY; note it cannot run inside a transaction block, so it lives in its own migration:

04_index.sql

sql

-- Builds without an exclusive write lock. Cannot run in a txn block.
CREATE INDEX CONCURRENTLY idx_users_display_name
  ON users (display_name);

-- If it fails midway it leaves an INVALID index; drop and retry:
-- DROP INDEX CONCURRENTLY idx_users_display_name;

Backfilling big tables without melting the database

The backfill is where outages hide

The schema change is fast; the data migration is what hurts. A single UPDATE across 80M rows holds locks, generates a huge WAL/binlog burst, and pushes replication lag into the minutes, which silently breaks every read replica and any reader that requires fresh data. Always backfill in bounded batches.

Batch by primary key, not OFFSET. WHERE id > :last AND id <= :last + N stays indexed and constant-time; OFFSET gets slower every page.
Keep batches small, 1k–10k rows. Each batch should be a short, self-contained transaction that commits before the next starts.
Throttle between batches. A short sleep (50–200ms) lets replicas catch up and leaves I/O for live traffic. Watch replication lag and back off if it climbs.
Run off-peak where you can, and make the job resumable, persist last_id so a crash restarts mid-table, not from zero.
Make it idempotent. The display_name IS NULL guard means re-running a batch is harmless, which matters when jobs retry.

Online schema-change tools

For the operations that genuinely require a table rewrite (type changes, some constraint changes) on a huge MySQL table, hand-rolling expand/contract is painful. Battle-tested tools automate the safe version: they build a shadow copy of the table, sync changes via triggers or the binlog, backfill in throttled chunks, and atomically swap it in.

Tool	How it tracks changes	Notes
pt-online-schema-change	Triggers on the original table	Part of Percona Toolkit; triggers add write overhead
gh-ost (GitHub)	Reads the binlog (triggerless)	Lighter on the primary; pausable and throttleable mid-run

The two standard MySQL online schema-change tools.

gh-ost.sh

bash

# gh-ost builds a ghost table, backfills + tails the binlog, then cuts over.
gh-ost \
  --host=primary.db.internal \
  --database=app \
  --table=users \
  --alter="MODIFY COLUMN bio MEDIUMTEXT" \
  --max-load=Threads_running=25 \
  --chunk-size=1000 \
  --throttle-control-replicas=replica1.db.internal \
  --execute

Postgres usually does not need these

Postgres added fast metadata-only paths for adding columns and `CONCURRENTLY` for indexes, so the gh-ost/pt-osc family is mostly a MySQL story. On Postgres, lean on nullable adds, `CONCURRENTLY`, and NOT VALID + VALIDATE for constraints.

App and schema must agree at every moment

Deploy ordering is the whole game

During a rolling deploy, old and new code run at the same time. So the schema must be compatible with BOTH versions simultaneously. The rule: an additive schema change goes out before the code that needs it; a removal goes out after the code that stopped using it. Expand before deploy; contract after deploy.

This is why we deploy tolerant code first and drop the old column last. A column the new code reads must already exist when that code rolls out, otherwise the first new instance errors before the migration lands. A column the old code reads must still exist until that old code is fully gone. Get the ordering backwards and you create exactly the simultaneous incompatibility you were trying to avoid.

Two failure modes worth naming: backward compatibility means new code must still work against the old schema (true right after you deploy, before the migration). Forward compatibility means old code must still work against the new schema (true right after the migration, before old instances drain). Expand/contract is the discipline of keeping both true at all times.

In CI/CD, this means migrations and code deploys are separate, ordered steps, not one atomic migrate && deploy. Run additive migrations as a pre-deploy step, deploy the app, and schedule contracting migrations as a later, separate change once the previous release is fully rolled out. Pair this with a safe release mechanism, see deployment strategies: blue-green and canary, so you can halt a rollout before a bad schema assumption spreads.

Rollback strategy

The best rollback is the one you never need because every step was independently safe. But plan for it anyway, and notice that rolling back a destructive migration is often impossible, which is itself an argument for expand/contract.

Roll back code, not data, whenever possible. If each step is backward-compatible, reverting the app to the previous release is a clean, instant rollback, the schema still supports it.
Never drop in the same release that adds. Keep expand and contract in separate deploys so a problem after the switch is fixed by redeploying old code, not by un-dropping a column.
Treat destructive migrations as irreversible. A DROP COLUMN cannot be undone by a down-migration, the data is gone. Delay drops until you are certain, and keep a backup/snapshot taken immediately before.
Make migrations reversible where you can, but test the down path. An untested down() is not a rollback plan; it is a hope.
Keep the dual-write window open long enough that switching reads back to the old column is a viable retreat if parity checks fail.

Common mistakes that cost hours

Coupling migrate and deploy into one atomic step. They have different safe orderings (expand-before, contract-after); fusing them guarantees one of them is wrong.
Adding a NOT NULL column with a default on an old engine, it rewrites the whole table under a lock. Add nullable, backfill, then validate the constraint.
Backfilling in a single UPDATE. One giant transaction = long locks, WAL/binlog flood, replication lag, and a stalled app. Batch it.
Renaming or retyping a column in place while the app is live. There is no schema that satisfies old and new code at once, always expand/contract.
Building an index without CONCURRENTLY (or MySQL online DDL), blocking every writer for the build.
Dropping the old column too early, before every instance running old code has drained. Old code reading a gone column errors instantly.
No throttling and no replication-lag check during backfill, so read replicas fall minutes behind and stale-read bugs appear out of nowhere.

Takeaways and where to go next

The whole article in seven lines

A live schema is shared state running code depends on right now, never change it in one irreversible step.
Expand/contract: add the new shape, migrate to it, then remove the old shape, risk lives safely in between.
Add columns nullable; build indexes CONCURRENTLY; turn renames and type changes into multi-step sequences.
Backfill in small, throttled, idempotent batches keyed by primary key, the data migration is where outages hide.
The schema must satisfy old AND new code simultaneously: expand before deploy, contract after deploy.
On big MySQL tables, let gh-ost or pt-online-schema-change run the safe rewrite for you.
Roll back code, not data; treat drops as irreversible and snapshot before them.

Migrations are where application code and database internals meet, so the next steps deepen both sides. Understand what locks and visibility actually mean during a migration with database transactions and consistency, and learn why replication lag bites your backfills (and how replicas and shards change the picture) in scaling databases: replication and sharding.

Practice safe rollouts end to end in the CI/CD lab.
Pair migrations with progressive delivery in deployment strategies: blue-green and canary.
Solidify the database fundamentals behind locks and isolation in database transactions and consistency.

Want to go deeper?

This article covers concepts taught hands-on in the Cloud Engineer and DevOps career paths, with real terminal labs, production scenarios, and structured lessons.

Explore Career Paths Try the Labs

Keep reading

Cloud

Managed Databases: Relational vs NoSQL in the Cloud

Read

Backend

What Is a Backend Engineer?

Read

Backend

How the Web Works: HTTP Requests

Read