Designing for High Availability & Disaster Recovery (RTO/RPO)
HA keeps you alive when a server dies; DR brings you back when a whole region burns down. This article lays out the senior mental model: RTO vs RPO defined without hand-waving, the four DR strategies as a cost/recovery table, and when Multi-AZ is enough versus when you actually need multi-region.
At 3am, an availability zone goes dark. If your system is well-designed, nothing happens: traffic shifts, replicas take over, and you sleep through it. If it isn't, you wake up to a dead service, a CEO asking "how long until we're back," and a second question that's even worse: "how much data did we lose?" Those two questions, *how long* and *how much*, are the entire subject of this article.
Most engineers conflate high availability and disaster recovery. They're not the same thing, they cost wildly different amounts, and treating them as one is how teams either over-spend on resilience they'll never use or discover, during an actual outage, that their "backups" can't be restored. This is a senior topic because the hard part isn't the technology. It's deciding how much resilience is worth paying for.
Who this is for
Engineers who can already deploy a service and now own its uptime. You should know what an availability zone and a database replica are. We use AWS terms, but Multi-AZ, cross-region replication, and the RTO/RPO framework map directly onto Azure and GCP.
HA vs DR: survive now, recover later
High availability is designing so a single failure causes no downtime. Disaster recovery is the plan for getting back when something larger than a single failure takes you out anyway.
HA is about redundancy within a system that's still running: two app servers behind a load balancer, a database with a standby replica, spread across availability zones so one AZ failing is a non-event. DR is about rebuilding after the system itself is gone: a region-wide outage, a bad deploy that corrupts every replica, a ransomware event, an rm -rf on production. HA has no answer for those. DR does.
A spare tyre in the boot: high availability. One part fails, you keep driving.
Home insurance: disaster recovery. The whole thing is gone, you rebuild.
Keep driving, no stop: Multi-AZ failover (seconds, automatic).
File a claim, wait for the payout: restore from backup in another region (hours).
Two different problems that need two different plans.
The trap is assuming HA gives you DR for free. It doesn't. Three replicas across three AZs in one region are all destroyed by one region outage, one corrupting bug, one fat-fingered delete. HA protects against random hardware failure. DR protects against correlated, catastrophic failure. You need both, and they're budgeted separately.
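To make the blast-radius argument concrete, here is a minimal sketch that models where each copy of the data lives and asks which copies survive each kind of failure. The `Copy` type, the `survivors` function, and the layout are illustrative assumptions, not a real API; in particular, placing a backup in a named AZ is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of the data and where it lives (illustrative model)."""
    region: str
    az: str
    is_backup: bool = False  # True = point-in-time backup, False = live replica


def survivors(copies, failure_kind, target=None):
    """Return the copies still usable after a failure.

    failure_kind is "az" (target is an AZ name), "region" (target is a
    region name), or "corruption": a bad write or delete that replicates
    to every live replica, so only point-in-time backups stay clean.
    """
    if failure_kind == "az":
        return [c for c in copies if c.az != target]
    if failure_kind == "region":
        return [c for c in copies if c.region != target]
    if failure_kind == "corruption":
        return [c for c in copies if c.is_backup]
    raise ValueError(f"unknown failure kind: {failure_kind}")


# HA only: three live replicas across three AZs in one region.
ha_only = [
    Copy("us-east-1", "us-east-1a"),
    Copy("us-east-1", "us-east-1b"),
    Copy("us-east-1", "us-east-1c"),
]
# HA plus DR: the same replicas and a backup kept in a second region.
ha_plus_dr = ha_only + [Copy("us-west-2", "us-west-2a", is_backup=True)]

for name, layout in [("HA only", ha_only), ("HA + DR", ha_plus_dr)]:
    print(name)
    print("  AZ failure:    ", len(survivors(layout, "az", "us-east-1a")), "copies left")
    print("  region failure:", len(survivors(layout, "region", "us-east-1")), "copies left")
    print("  corruption:    ", len(survivors(layout, "corruption")), "copies left")
```

In this model the HA-only layout shrugs off an AZ failure but is left with zero usable copies after a region outage or a replicated corruption, which is the gap DR exists to close.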
RTO and RPO: the only two numbers that matter
Every DR conversation should start with two numbers, agreed with the business, written down. Get these wrong and you'll either build a Ferrari to deliver pizza or a bicycle to win a Grand Prix.
| | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|---|---|---|
| Question it answers | How long can we be down? | How much data can we lose? |
| Measures | Time to restore service | Time between last backup and the failure |
| Direction | Forward from the disaster | Backward from the disaster |
| Driven by | Restore + failover speed | Backup / replication frequency |
| RTO=0 / RPO=0 means | Zero downtime | Zero data loss |
RTO and RPO are both measured from the moment disaster strikes: one looks forward, one looks back.
Picture a timeline. Disaster hits at noon. Your last backup was at 11:30. RPO is the gap *before* the disaster: you lose 30 minutes of data, so your RPO is 30 minutes. RTO is the gap *after*: if you're back online by 2pm, your RTO is 2 hours. (Strictly, the objectives are the targets you commit to in advance; 30 minutes and 2 hours are what this incident actually achieved, and the design goal is to keep the achieved numbers inside the targets.) Lowering RPO means backing up more often (or replicating continuously). Lowering RTO means restoring faster (or having standby infrastructure already warm). They're independent dials, and each one costs money to turn down.
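The noon timeline can be worked as plain arithmetic. The sketch below computes the data-loss window and the downtime for one incident and checks them against a pair of targets; the function names, the calendar date, and the target values are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def achieved_recovery(last_backup, disaster, service_restored):
    """Return (data_loss_window, downtime) for a single incident."""
    if not last_backup <= disaster <= service_restored:
        raise ValueError("expected last_backup <= disaster <= service_restored")
    # Data loss looks backward from the disaster; downtime looks forward.
    return disaster - last_backup, service_restored - disaster


def meets_objectives(data_loss, downtime, rpo, rto):
    """True when the incident stayed inside both agreed targets."""
    return data_loss <= rpo and downtime <= rto


day = datetime(2024, 1, 1)  # arbitrary date; only the times of day matter
loss, down = achieved_recovery(
    last_backup=day.replace(hour=11, minute=30),
    disaster=day.replace(hour=12),
    service_restored=day.replace(hour=14),
)
print(loss, down)  # 0:30:00 2:00:00

# Hypothetical targets: lose at most 15 minutes, be down at most 4 hours.
print(meets_objectives(loss, down, rpo=timedelta(minutes=15), rto=timedelta(hours=4)))  # False
```

The incident meets the downtime target but misses the data-loss target, which shows the two dials moving independently: the fix here is more frequent backups, not a faster restore.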
RTO/RPO are business decisions, not engineering ones
Don't pick these numbers yourself. Ask the business: "if we lose the last 5 minutes of orders, what does that cost?" and "if we're down for 4 hours on a Tuesday, what does that cost?" The answers, in real money, tell you how much resilience to buy. An internal tool and a payments system deserve completely different numbers.
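One rough way to turn the business's answers into a budget is an expected-loss estimate. This is a back-of-the-envelope sketch, not a pricing model: the formula assumes cost scales linearly with downtime and lost data, and every figure below is hypothetical.

```python
def expected_annual_loss(outages_per_year, downtime_hours, cost_per_down_hour,
                         data_loss_minutes, cost_per_lost_minute):
    """Expected yearly cost of disasters under a given recovery profile."""
    per_outage = (downtime_hours * cost_per_down_hour
                  + data_loss_minutes * cost_per_lost_minute)
    return outages_per_year * per_outage


# Hypothetical business answers: one disaster every two years,
# 20,000 per hour of downtime, 1,000 per minute of lost orders.
business = dict(outages_per_year=0.5, cost_per_down_hour=20_000,
                cost_per_lost_minute=1_000)

# Two hypothetical recovery profiles to compare.
slow = expected_annual_loss(downtime_hours=4, data_loss_minutes=60, **business)
fast = expected_annual_loss(downtime_hours=0.25, data_loss_minutes=1, **business)

print(f"slow recovery: {slow:,.0f} per year")
print(f"fast recovery: {fast:,.0f} per year")
print(f"faster recovery is worth up to {slow - fast:,.0f} per year in extra spend")
```

If the faster setup costs more per year than the difference, the cheaper one is the right answer for this system, however unimpressive it looks.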
The four DR strategies, by cost and speed
There's a well-worn spectrum of DR strategies, from cheap-and-slow to expensive-and-instant. Pick the cheapest one that still meets your agreed RTO/RPO. Paying for active/active when the business is fine with a 4-hour RTO is just burning money.
| Strategy | How it works | RTO / RPO | Cost |
|---|---|---|---|
| Backup & restore | Backups copied to another region; spin everything up on disaster | Hours / hours | $ (cheapest) |
| Pilot light | Core data replicated live; minimal infra idle, scaled up on failover | 10s of mins / mins | $$ |
| Warm standby | A scaled-down but running copy in another region, ready to scale up | Minutes / seconds | $$$ |
| Multi-site active/active | Full capacity running in both regions, both serving traffic | Near-zero / near-zero | $$$$ (most expensive) |
The DR spectrum: every step down in RTO/RPO is a step up in cost. Pick the one your numbers require, not the most impressive one.
1. Backup & restore
Your data and infra-as-code live in another region as backups. On disaster you provision everything fresh and restore. This is the cheapest and slowest option: recovery is measured in hours. Fine for internal tools and anything where a few hours down is survivable.
2. Pilot light
The critical core (your database) is continuously replicated to the DR region and always on. Everything else sits dormant as templates. On failover you light up the rest around the already-warm data. Like a furnace's pilot flame: small, always lit, ready to ignite the whole system.
3. Warm standby
A complete but under-provisioned copy of your stack runs in the DR region right now, with smaller instance counts and minimal capacity. On failover you scale it up and redirect traffic. Faster than pilot light because the app tier is already running, not just the data.
4. Multi-site active/active
Both regions run full production capacity and both serve live traffic. A region dying just means its share of traffic shifts to the survivor. Near-zero RTO and RPO, and you pay for double infrastructure plus the brutal complexity of multi-region data consistency.
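The "cheapest strategy that meets the numbers" rule can be sketched as a small lookup. The article only gives rough ranges ("hours", "10s of mins", "near-zero"), so the specific seconds assigned to each strategy below are placeholder assumptions; substitute figures measured on your own system.

```python
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rto_seconds: int    # assumed typical recovery time
    rpo_seconds: int    # assumed typical data-loss window
    relative_cost: int  # 1 = "$" through 4 = "$$$$"


# Placeholder figures chosen to match the rough ranges in the table.
STRATEGIES = [
    Strategy("backup & restore", 4 * 3600, 4 * 3600, 1),
    Strategy("pilot light", 30 * 60, 5 * 60, 2),
    Strategy("warm standby", 5 * 60, 30, 3),
    Strategy("multi-site active/active", 30, 1, 4),
]


def cheapest_strategy(rto_seconds: int, rpo_seconds: int) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [s for s in STRATEGIES
                  if s.rto_seconds <= rto_seconds and s.rpo_seconds <= rpo_seconds]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


# A relaxed internal tool: 4 hours down and 4 hours of data loss are fine.
print(cheapest_strategy(4 * 3600, 4 * 3600).name)  # backup & restore

# A long RTO does not help if the RPO is tight: 8 hours down, 1 minute of loss.
print(cheapest_strategy(8 * 3600, 60).name)  # warm standby
```

The second call shows why both numbers have to be agreed before choosing: a generous RTO on its own would suggest backup & restore, but the one-minute RPO rules out everything cheaper than warm standby under these assumed figures.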
Multi-AZ vs multi-region: where the line is
Here's the distinction that separates HA from DR in practice. Multi-AZ is your HA story. Multi-region is your DR story. They protect against different blast radii, and conflating them is the most expensive mistake in this whole topic.
| | Multi-AZ | Multi-region |
|---|---|---|
| Protects against | One AZ (datacenter) failing | An entire region failing |
| Latency between sites | ~1 ms; synchronous replication works | 10s to 100s of ms; async usually required |
| Data consistency | Strong (synchronous) | Eventual or hard trade-offs |
| Cost overhead | Small: same region, no egress | Large: duplicate infra + cross-region data |
| Use it for | Almost everything (it's the default) | When RTO/RPO truly demand region survival |
Two different failure domains. Multi-AZ is cheap and should be your default; multi-region is expensive and needs justification.
Multi-AZ is close to free in engineering terms: AZs in a region are linked by single-digit-millisecond fibre, so a database can replicate synchronously to a standby in another AZ with no meaningful latency cost. You get automatic failover with zero data loss. There is almost no reason a production database shouldn't be Multi-AZ. To understand why regions and AZs are structured this way, see how the cloud actually works.
Multi-region is a different beast. Regions are hundreds of miles apart, so synchronous replication would add crippling latency to every write. You're forced into asynchronous replication, which means a non-zero RPO (some in-flight data is lost on failover), or into genuinely hard distributed-systems trade-offs. That cost and complexity is exactly why multi-region architecture deserves its own decision, not a reflex.
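The latency argument is easy to check with arithmetic. In this sketch a synchronous commit waits for one round trip to the replica before acknowledging the write, while an asynchronous commit does not. The 2 ms local write time and the 1 ms and 70 ms round-trip times are assumed figures in the ranges quoted above, not measurements.

```python
def commit_latency_ms(local_write_ms, replica_rtt_ms, synchronous):
    """Latency of one commit as seen by the client.

    A synchronous commit waits for the replica's acknowledgement, so it
    pays one network round trip on top of the local write.
    """
    return local_write_ms + (replica_rtt_ms if synchronous else 0.0)


def serial_commits_per_second(commit_ms):
    """Upper bound on back-to-back commits issued over one connection."""
    return 1000.0 / commit_ms


LOCAL_WRITE_MS = 2.0  # assumed local write time
scenarios = {
    "sync, cross-AZ (1 ms RTT)": commit_latency_ms(LOCAL_WRITE_MS, 1.0, True),
    "sync, cross-region (70 ms RTT)": commit_latency_ms(LOCAL_WRITE_MS, 70.0, True),
    "async, cross-region (70 ms RTT)": commit_latency_ms(LOCAL_WRITE_MS, 70.0, False),
}
for name, latency in scenarios.items():
    rate = serial_commits_per_second(latency)
    print(f"{name}: {latency:.0f} ms per commit, about {rate:.0f} serial commits/s")
```

Concurrency can hide some of the throughput loss, but every individual write still waits out the round trip, which is why cross-region setups usually go asynchronous and accept a non-zero RPO.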
A warm-standby topology across two regions
Warm standby is the sweet spot for a lot of serious systems: minutes of RTO without the cost and complexity of full active/active. Here's what it looks like: a full-size primary in one region, a scaled-down running copy in another, the database replicating across, and a health-checked router ready to flip.
Warm standby across two regions. The primary (us-east-1) serves all live traffic. A scaled-down copy runs in us-west-2 with the database replicating asynchronously (dashed). Route 53 health-checks the primary; on failure it flips DNS to the standby, which then scales up to full capacity.
1. Normal operation
Route 53 points all traffic at the primary region. The standby runs at minimal capacity, and the database streams changes to the replica asynchronously, so the replica is seconds behind, which sets your RPO.
2. Disaster strikes
The primary region becomes unreachable. Route 53's health check fails after its configured threshold. This detection delay is part of your RTO, so tune it carefully.
3. Failover
Route 53 flips DNS to the standby's load balancer. The replica is promoted to primary. New traffic now lands in us-west-2.
4. Scale up
The standby app tier scales out from minimal to full capacity (auto-scaling helps here). Once warmed, you're serving at full strength. The total elapsed time is your real-world RTO.
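The detection delay in step 2 can be sketched as a consecutive-failure counter. This is a simplified single-checker model, not Route 53's actual evaluation logic; the function name, the 30-second interval, and the threshold of 3 are assumptions for illustration.

```python
def seconds_until_unhealthy(probe_results, interval_seconds, failure_threshold):
    """Return when the endpoint is declared unhealthy, or None if it never is.

    probe_results holds one boolean per probe (True = healthy), with the
    first probe at t = 0 and one probe every interval_seconds. The endpoint
    is declared unhealthy once failure_threshold probes in a row have failed.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return index * interval_seconds
    return None


# The region dies after the second probe: probes at 60s, 90s and 120s fail.
outage = [True, True, False, False, False]
print(seconds_until_unhealthy(outage, interval_seconds=30, failure_threshold=3))  # 120

# A brief blip never reaches three failures in a row, so no failover fires.
blip = [True, False, True, False, False]
print(seconds_until_unhealthy(blip, interval_seconds=30, failure_threshold=3))  # None
```

In the outage case the last healthy probe was at 30 seconds and the declaration comes at 120 seconds, so detection alone can take up to roughly interval times threshold. Lowering either value speeds up detection at the price of more false failovers on transient blips.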
Pro tip
DNS TTL is a sneaky RTO killer. If your records have a 300-second TTL, some clients keep hitting the dead region for up to 5 minutes after you fail over. For DR-critical records, drop the TTL to 60s or lower so failover actually propagates fast.
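To see how much the TTL contributes, it helps to write the failover out as a budget. The sketch below adds up the phases as a conservative upper bound; the phase names and the promotion and scale-up durations are hypothetical, and in practice some phases overlap.

```python
def failover_budget_seconds(check_interval, failure_threshold, dns_ttl,
                            promotion, scale_up):
    """Return (per-phase breakdown, conservative total), all in seconds.

    The total simply adds the phases. Some of them can overlap in practice
    (for example, promotion can run while DNS caches expire), so treat the
    sum as an upper bound, not a prediction.
    """
    phases = {
        "detection": check_interval * failure_threshold,
        "dns_cache_expiry": dns_ttl,
        "replica_promotion": promotion,
        "scale_up": scale_up,
    }
    return phases, sum(phases.values())


for ttl in (300, 60):
    phases, total = failover_budget_seconds(
        check_interval=30, failure_threshold=3, dns_ttl=ttl,
        promotion=120, scale_up=300,  # hypothetical durations
    )
    print(f"TTL {ttl:>3}s: worst case about {total / 60:.1f} min", phases)
```

Under these assumed numbers, dropping the TTL from 300 to 60 seconds removes four minutes from the worst case without touching any infrastructure, which is why it is usually the first knob to check.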
Common mistakes that cost hours (or the company)
Never testing the restore. A backup you've never restored is a hope, not a plan. The outage is the worst possible time to discover the backup is corrupt, incomplete, or takes 9 hours to restore. Run game days. Actually fail over.
Confusing Multi-AZ with DR. Three AZs in one region do nothing for a region-wide outage or a corrupting bug that replicates to every standby. Multi-AZ is HA; it is not a DR strategy.
Picking RTO/RPO without the business. Engineers default to "as low as possible," which means "as expensive as possible." Without dollar figures from the business you can't right-size, and you'll either overspend or under-protect.
Ignoring the dependencies. Your app fails over cleanly, but DNS, secrets, the container registry, and your CI/CD all live only in the dead region. DR scope is the *whole* critical path, not just compute and database.
Forgetting RPO is set by replication lag. "We replicate to another region" sounds like RPO zero. If it's asynchronous, your RPO is however many seconds (or minutes) the replica lags. Measure it; don't assume it.
No runbook, no owner. At 3am, under pressure, nobody remembers the steps. A failover that depends on one person's memory is a single point of failure. Write the runbook; rehearse it.
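Measuring replication lag against the agreed RPO can be as simple as summarising samples from whatever lag metric your database exposes. The sample values, the 30-second target, and the function name below are hypothetical; the percentile uses the nearest-rank method.

```python
import math


def lag_report(lag_samples_seconds, rpo_seconds):
    """Summarise replication-lag samples against an RPO target."""
    if not lag_samples_seconds:
        raise ValueError("need at least one lag sample")
    ordered = sorted(lag_samples_seconds)
    # Nearest-rank 99th percentile: the smallest sample that is greater
    # than or equal to 99% of all samples.
    rank = math.ceil(0.99 * len(ordered))
    worst = ordered[-1]
    return {
        "p99": ordered[rank - 1],
        "max": worst,
        # A disaster can land on the worst moment, so judge against the max.
        "meets_rpo": worst <= rpo_seconds,
    }


# Hypothetical samples in seconds: mostly a few seconds, with one 45s spike.
samples = [2, 3, 2, 4, 45, 3, 2, 5, 3, 2]
print(lag_report(samples, rpo_seconds=30))
```

The typical lag here looks comfortable, but the single spike breaks a 30-second RPO. Averages hide exactly the moments that matter, so alert on the tail and the maximum.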
Takeaways
The whole article in nine lines
HA = survive a single failure with no downtime. DR = recover after catastrophe. You need both, budgeted separately.
RTO = how long you can be down. RPO = how much data you can lose. Two independent dials.
RTO and RPO are business decisions, set in real money, not engineering defaults.
Four DR strategies, from cheap to expensive: backup & restore, pilot light, warm standby, active/active.
Pick the cheapest strategy that still meets your agreed RTO/RPO.
Multi-AZ is your HA story (cheap, synchronous, the default). Multi-region is your DR story (expensive, async).
Async replication means a non-zero RPO. Measure the lag; don't assume zero.
An untested backup is not a backup. Run game days and rehearse failover.
DR scope is the entire critical path: DNS, secrets, registry, CI/CD, not just app and DB.
Where to go next
Resilience is a layered discipline. Build the mental model of failure domains first, then decide how far across regions you need to go.