Designing for High Availability & Disaster Recovery (RTO/RPO)

The 3am page that defines your career

At 3am, an availability zone goes dark. If your system is well-designed, nothing happens: traffic shifts, replicas take over, and you sleep through it. If it isn't, you wake up to a dead service, a CEO asking "how long until we're back," and a second question that's even worse: "how much data did we lose?" Those two questions, *how long* and *how much*, are the entire subject of this article.

Most engineers conflate high availability and disaster recovery. They're not the same thing, they cost wildly different amounts, and treating them as one is how teams either over-spend on resilience they'll never use or discover, during an actual outage, that their "backups" can't be restored. This is a senior topic because the hard part isn't the technology. It's deciding how much resilience is worth paying for.

Who this is for

Engineers who can already deploy a service and now own its uptime. You should know what an availability zone and a database replica are. We use AWS terms, but Multi-AZ, cross-region replication, and the RTO/RPO framework map directly onto Azure and GCP.

HA vs DR: survive now, recover later

High availability is designing so a single failure causes no downtime. Disaster recovery is the plan for getting back when something larger than a single failure takes you out anyway.

HA is about redundancy within a system that's still running: two app servers behind a load balancer, a database with a standby replica, all spread across availability zones so one AZ failing is a non-event. DR is about rebuilding after the system itself is gone: a region-wide outage, a bad deploy that corrupts every replica, a ransomware event, an rm -rf on production. HA has no answer for those. DR does.

🚗 A spare tyre in the boot: high availability. One part fails, you keep driving.

🏠 Home insurance: disaster recovery. The whole thing is gone, you rebuild.

🔁 Keep driving, no stop: Multi-AZ failover (seconds, automatic).

📞 File a claim, wait for the payout: restore from backup in another region (hours).

Two different problems that need two different plans.

The trap is assuming HA gives you DR for free. It doesn't. Three replicas across three AZs in one region are all destroyed by one region outage, one corrupting bug, one fat-fingered delete. HA protects against random hardware failure. DR protects against correlated, catastrophic failure. You need both, and they're budgeted separately.

RTO and RPO: the only two numbers that matter

Every DR conversation should start with two numbers, agreed with the business, written down. Get these wrong and you'll either build a Ferrari to deliver pizza or a bicycle to win a Grand Prix.

	RTO (Recovery Time Objective)	RPO (Recovery Point Objective)
Question it answers	How long can we be down?	How much data can we lose?
Measures	Time to restore service	Time between last backup and the failure
Direction	Forward from the disaster	Backward from the disaster
Driven by	Restore + failover speed	Backup / replication frequency
RTO=0 / RPO=0 means	Zero downtime	Zero data loss

RTO and RPO are both measured from the moment disaster strikes: one looks forward, one looks back.

Picture a timeline. Disaster hits at noon. Your last backup was at 11:30. RPO covers the gap *before* the disaster: you lose 30 minutes of data, which meets your objective only if the agreed RPO is 30 minutes or more. RTO covers the gap *after*: if you're back online by 2pm, recovery took 2 hours, which meets your objective only if the agreed RTO is 2 hours or more. Lowering RPO means backing up more often (or replicating continuously). Lowering RTO means restoring faster (or having standby infrastructure already warm). They're independent dials, and each one costs money to turn down.
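To make the timeline concrete, here is a minimal sketch that computes both gaps for the noon example and checks them against targets. The function names and the calendar date are illustrative assumptions, not part of any library.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup, disaster, service_restored):
    """Return (data_loss_window, downtime) for a single incident."""
    if not (last_backup <= disaster <= service_restored):
        raise ValueError("expected last_backup <= disaster <= service_restored")
    data_loss_window = disaster - last_backup   # looks backward: compare to RPO
    downtime = service_restored - disaster      # looks forward: compare to RTO
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True only if both the agreed RPO and the agreed RTO were met."""
    return data_loss_window <= rpo and downtime <= rto


day = datetime(2024, 1, 1)  # arbitrary date, only the times matter
loss, downtime = recovery_metrics(
    last_backup=day.replace(hour=11, minute=30),
    disaster=day.replace(hour=12),
    service_restored=day.replace(hour=14),
)
print(loss, downtime)  # 0:30:00 2:00:00
```

The two values come from different subtractions, which is the point: more frequent backups shrink the first without touching the second.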

RTO/RPO are business decisions, not engineering ones

Don't pick these numbers yourself. Ask the business: "if we lose the last 5 minutes of orders, what does that cost?" and "if we're down for 4 hours on a Tuesday, what does that cost?" The answers, in real money, tell you how much resilience to buy. An internal tool and a payments system deserve completely different numbers.
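Once the business gives you those answers, the arithmetic is simple. The sketch below prices one incident and compares the avoided loss against what a DR strategy costs per year. Every figure is hypothetical and only there to show the shape of the calculation.

```python
def incident_cost(downtime_min, cost_per_down_min, data_loss_min, cost_per_lost_min):
    """Cost of one incident: time spent down plus data that cannot be recovered."""
    return downtime_min * cost_per_down_min + data_loss_min * cost_per_lost_min


def worth_buying(annual_strategy_cost, incidents_per_year, cost_without, cost_with):
    """True if the expected yearly loss avoided exceeds the strategy's yearly cost."""
    avoided = incidents_per_year * (cost_without - cost_with)
    return avoided > annual_strategy_cost


# Hypothetical inputs: 200 per minute of downtime, 500 per minute of lost orders.
without_dr = incident_cost(240, 200, 30, 500)  # 4 hours down, 30 minutes of data lost
with_dr = incident_cost(15, 200, 1, 500)       # 15 minutes down, 1 minute of data lost

print(without_dr, with_dr)  # 63000 3500
print(worth_buying(20_000, 0.5, without_dr, with_dr))  # True
print(worth_buying(50_000, 0.5, without_dr, with_dr))  # False
```

The same incident justifies a 20,000-a-year strategy and not a 50,000-a-year one, which is why the numbers have to come from the business and not from engineering instinct.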

The four DR strategies, by cost and speed

There's a well-worn spectrum of DR strategies, from cheap-and-slow to expensive-and-instant. Pick the cheapest one that still meets your agreed RTO/RPO. Paying for active/active when the business is fine with a 4-hour RTO is just burning money.

Strategy	How it works	RTO / RPO	Cost
Backup & restore	Backups copied to another region; spin everything up on disaster	Hours / hours	$ (cheapest)
Pilot light	Core data replicated live; minimal infra idle, scaled up on failover	10s of mins / mins	$$
Warm standby	A scaled-down but running copy in another region, ready to scale up	Minutes / seconds	$$$
Multi-site active/active	Full capacity running in both regions, both serving traffic	Near-zero / near-zero	$$$$ (most expensive)

The DR spectrum: every step down in RTO/RPO is a step up in cost. Pick the one your numbers require, not the most impressive one.

1. Backup & restore
Your data and infra-as-code live in another region as backups. On disaster you provision everything fresh and restore. It is the cheapest and slowest option: recovery is measured in hours. Fine for internal tools and anything where a few hours down is survivable.
2. Pilot light
The critical core, your database, is continuously replicated to the DR region and always on. Everything else sits dormant as templates. On failover you light up the rest around the already-warm data. Like a furnace's pilot flame: small, always lit, ready to ignite the whole system.
3. Warm standby
A complete but under-provisioned copy of your stack runs in the DR region right now, with smaller instance counts and minimal capacity. On failover you scale it up and redirect traffic. Faster than pilot light because the app tier is already running, not just the data.
4. Multi-site active/active
Both regions run full production capacity and both serve live traffic. A region dying just means its share of traffic shifts to the survivor. You get near-zero RTO and RPO, and you pay for double infrastructure plus the brutal complexity of multi-region data consistency.
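The rule "pick the cheapest strategy that still meets your numbers" can be sketched as a small lookup. The RTO/RPO bounds below are illustrative assumptions that put rough numbers on the table's "hours", "minutes" and "near-zero"; measure your own before relying on anything like this.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta   # assumed worst-case time to restore service
    rpo: timedelta   # assumed worst-case data-loss window
    cost_rank: int   # 1 = cheapest, 4 = most expensive


# Illustrative bounds only, not vendor guarantees.
STRATEGIES = [
    Strategy("Backup & restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot light", timedelta(hours=1), timedelta(minutes=10), 2),
    Strategy("Warm standby", timedelta(minutes=10), timedelta(minutes=1), 3),
    Strategy("Multi-site active/active", timedelta(minutes=1), timedelta(seconds=5), 4),
]


def cheapest_strategy(target_rto, target_rpo):
    """Return the lowest-cost strategy meeting both targets, or None if none does."""
    candidates = [s for s in STRATEGIES if s.rto <= target_rto and s.rpo <= target_rpo]
    return min(candidates, key=lambda s: s.cost_rank) if candidates else None


choice = cheapest_strategy(timedelta(hours=4), timedelta(hours=1))
print(choice.name)  # Pilot light
```

With a 4-hour RTO and a 1-hour RPO, the sketch stops at pilot light. Warm standby and active/active would also qualify, but they cost more for resilience the targets don't ask for.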

Multi-AZ vs multi-region: where the line is

Here's the distinction that separates HA from DR in practice. Multi-AZ is your HA story. Multi-region is your DR story. They protect against different blast radii, and conflating them is the most expensive mistake in this whole topic.

	Multi-AZ	Multi-region
Protects against	One AZ (datacenter) failing	An entire region failing
Latency between sites	~1 ms, so synchronous replication works	10s–100s of ms, so async is usually required
Data consistency	Strong (synchronous)	Eventual or hard trade-offs
Cost overhead	Small (same region, no cross-region transfer)	Large (duplicate infra plus cross-region data transfer)
Use it for	Almost everything (it's the default)	When RTO/RPO truly demand region survival

Two different failure domains. Multi-AZ is cheap and should be your default; multi-region is expensive and needs justification.

Multi-AZ is close to free in engineering terms: AZs in a region are linked by single-digit-millisecond fibre, so a database can replicate synchronously to a standby in another AZ with no meaningful latency cost. Because every committed write already exists on the standby, you get automatic failover with zero data loss. There is almost no reason a production database shouldn't be Multi-AZ. If you want to understand why regions and AZs are structured this way, see how the cloud actually works.

Multi-region is a different beast. Regions are hundreds of miles apart, so synchronous replication would add crippling latency to every write. You're forced into asynchronous replication, which means a non-zero RPO (some in-flight data is lost on failover), or into genuinely hard distributed-systems trade-offs. That cost and complexity is exactly why multi-region architecture deserves its own decision, not a reflex.
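A little arithmetic shows why. In this simplified model a synchronous write completes only after the replica acknowledges it, so the round trip is added to every commit. The 2 ms local commit time and the 70 ms cross-region round trip are assumptions chosen to sit inside the ranges in the table above.

```python
def sync_write_latency_ms(local_commit_ms, replica_rtt_ms):
    """Latency of one write that waits for the replica's acknowledgement."""
    return local_commit_ms + replica_rtt_ms


def max_serial_writes_per_second(latency_ms):
    """Upper bound on back-to-back writes from a single connection."""
    return 1000.0 / latency_ms


multi_az = sync_write_latency_ms(local_commit_ms=2, replica_rtt_ms=1)       # 3 ms
multi_region = sync_write_latency_ms(local_commit_ms=2, replica_rtt_ms=70)  # 72 ms

print(f"{max_serial_writes_per_second(multi_az):.0f}")      # 333
print(f"{max_serial_writes_per_second(multi_region):.0f}")  # 14
```

Under these assumptions a single connection drops from hundreds of sequential writes per second to about a dozen, which is the "crippling latency" that pushes multi-region designs toward asynchronous replication.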

A warm-standby topology across two regions

Warm standby is the sweet spot for a lot of serious systems: minutes of RTO without the cost and complexity of full active/active. Here's what it looks like: a full-size primary in one region, a scaled-down running copy in another, the database replicating across, and a health-checked router ready to flip.

Warm standby across two regions. The primary (us-east-1) serves all live traffic. A scaled-down copy runs in us-west-2 with the database replicating asynchronously (dashed). Route 53 health-checks the primary; on failure it flips DNS to the standby, which then scales up to full capacity.

1. Normal operation
Route 53 points all traffic at the primary region. The standby runs at minimal capacity, and the database streams changes to the replica asynchronously, so the replica is seconds behind, which sets your RPO.
2. Disaster strikes
The primary region becomes unreachable. Route 53's health check fails after its configured threshold. This detection delay is part of your RTO, so tune it carefully.
3. Failover
Route 53 flips DNS to the standby's load balancer. The replica is promoted to primary. New traffic now lands in us-west-2.
4. Scale up
The standby app tier scales out from minimal to full capacity (auto-scaling helps here). Once warmed, you're serving at full strength. The total elapsed time is your real-world RTO.
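The DNS half of steps 1 to 3 can be sketched as a pair of Route 53 failover records. The block below only builds the request payload as plain Python data. The hostnames, set identifiers and health check ID are hypothetical placeholders, and using CNAME records with a 60-second TTL is an assumption, not the only way to wire this up.

```python
def failover_record(name, target, role, set_id, ttl=60, health_check_id=None):
    """Build one UPSERT change for a Route 53 failover routing record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,   # distinguishes records that share a name
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Warm standby failover pair (illustrative)",
    "Changes": [
        failover_record(
            "app.example.com.", "primary-lb.us-east-1.example.com",
            "PRIMARY", "primary-us-east-1",
            health_check_id="HYPOTHETICAL-HEALTH-CHECK-ID",
        ),
        failover_record(
            "app.example.com.", "standby-lb.us-west-2.example.com",
            "SECONDARY", "standby-us-west-2",
        ),
    ],
}
```

With boto3, a payload shaped like this would be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, alongside your HostedZoneId. Note that this covers DNS only: promoting the database replica and scaling the app tier are separate steps that your runbook or automation has to perform.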

Pro tip

DNS TTL is a sneaky RTO killer. If your records have a 300-second TTL, some clients keep hitting the dead region for up to 5 minutes after you fail over. For DR-critical records, drop the TTL to 60s or lower so failover actually propagates fast.
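It helps to add the pieces up. This sketch sums a worst-case RTO budget, assuming the stages happen one after another (in practice some overlap). The 30-second interval with 3 consecutive failures matches Route 53's standard health check settings; the promotion and scale-up times are hypothetical and should be replaced with numbers from your own game days.

```python
def worst_case_rto_seconds(check_interval_s, failure_threshold, dns_ttl_s,
                           promote_s, scale_up_s):
    """Sum the sequential stages between the outage and full-capacity service."""
    detection_s = check_interval_s * failure_threshold  # health check must fail N times
    return detection_s + dns_ttl_s + promote_s + scale_up_s


slow_dns = worst_case_rto_seconds(
    check_interval_s=30, failure_threshold=3,
    dns_ttl_s=300, promote_s=120, scale_up_s=300,
)
fast_dns = worst_case_rto_seconds(
    check_interval_s=30, failure_threshold=3,
    dns_ttl_s=60, promote_s=120, scale_up_s=300,
)

print(slow_dns, fast_dns)   # 810 570
print(slow_dns - fast_dns)  # 240
```

Under these assumptions, changing nothing but the TTL takes four minutes off the worst case, and detection alone accounts for 90 seconds before any failover has started.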

Common mistakes that cost hours (or the company)

Never testing the restore. A backup you've never restored is a hope, not a plan. The outage is the worst possible time to discover the backup is corrupt, incomplete, or takes 9 hours to restore. Run game days. Actually fail over.
Confusing Multi-AZ with DR. Three AZs in one region do nothing for a region-wide outage or a corrupting bug that replicates to every standby. Multi-AZ is HA; it is not a DR strategy.
Picking RTO/RPO without the business. Engineers default to "as low as possible," which means "as expensive as possible." Without dollar figures from the business you can't right-size, and you'll either overspend or under-protect.
Ignoring the dependencies. Your app fails over cleanly, but DNS, secrets, the container registry, and your CI/CD all live only in the dead region. DR scope is the *whole* critical path, not just compute and database.
Forgetting RPO is set by replication lag. "We replicate to another region" sounds like RPO zero. If it's asynchronous, your RPO is however many seconds (or minutes) the replica lags. Measure it, don't assume it.
No runbook, no owner. At 3am, under pressure, nobody remembers the steps. A failover that depends on one person's memory is a single point of failure. Write the runbook; rehearse it.
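For the replication-lag mistake in particular, "measure it" can be as simple as comparing the worst observed lag against the RPO you agreed. The samples below are made up; in practice they would come from whatever replica-lag metric your database exposes.

```python
from statistics import mean


def effective_rpo_seconds(lag_samples_s):
    """Worst observed lag: the data you would lose if the primary died at that moment."""
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    return max(lag_samples_s)


def rpo_at_risk(lag_samples_s, rpo_target_s):
    """True if any observed lag exceeded the agreed RPO."""
    return effective_rpo_seconds(lag_samples_s) > rpo_target_s


# Hypothetical samples in seconds, with one spike during a heavy write burst.
samples = [0.8, 1.2, 0.9, 45.0, 1.1]

print(round(mean(samples), 1))          # 9.8
print(effective_rpo_seconds(samples))   # 45.0
print(rpo_at_risk(samples, rpo_target_s=30))  # True
```

The average looks comfortable against a 30-second target, but disasters don't wait for an average moment. The sketch judges by the maximum; a high percentile over a long window is a reasonable alternative if single outliers are too noisy.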

Takeaways

The whole article in nine lines

HA = survive a single failure with no downtime. DR = recover after catastrophe. You need both, budgeted separately.
RTO = how long you can be down. RPO = how much data you can lose. Two independent dials.
RTO and RPO are business decisions, set in real money, not engineering defaults.
Four DR strategies, cheap→expensive: backup & restore, pilot light, warm standby, active/active.
Pick the cheapest strategy that still meets your agreed RTO/RPO.
Multi-AZ is your HA story (cheap, synchronous, the default). Multi-region is your DR story (expensive, async).
Async replication means a non-zero RPO. Measure the lag, don't assume zero.
An untested backup is not a backup. Run game days and rehearse failover.
DR scope is the entire critical path: DNS, secrets, registry, CI/CD, not just app and DB.

Where to go next

Resilience is a layered discipline. Build the mental model of failure domains first, then decide how far across regions you need to go.

How the cloud actually works: regions, AZs & edge. The failure domains everything in this article is built on.
Multi-region architecture: when you actually need it. The deep dive on the most expensive DR strategy.
Disaster recovery (concept lab): hands-on RTO/RPO and failover practice.
Reliability & resilience: design for failure. The broader engineering mindset behind all of this.
