RTO, RPO, and the four DR strategies — backup/restore, pilot light, warm standby, and active-active — with the cost and operational complexity trade-offs that determine which strategy fits each workload.
Recovery Time Objective (RTO) is the maximum acceptable duration of a service outage — how long until you must be back online. An RTO of 4 hours means you have 4 hours to recover before the business impact becomes unacceptable.
Recovery Point Objective (RPO) is the maximum acceptable data loss, measured in time: how old the most recent recoverable data is allowed to be when you recover. An RPO of 1 hour means you can lose at most 1 hour of transactions; anything beyond that is unacceptable to the business.
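To make the two definitions concrete, here is a minimal sketch that derives the achieved recovery time and data-loss window from an incident timeline and compares them to the targets. The function name `measure_recovery` and all timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def measure_recovery(last_good_data: datetime,
                     outage_start: datetime,
                     service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (recovery_time, data_loss_window) for one incident.

    recovery_time is what you compare against the RTO;
    data_loss_window is what you compare against the RPO.
    """
    recovery_time = service_restored - outage_start
    data_loss_window = outage_start - last_good_data
    return recovery_time, data_loss_window


# Hypothetical incident: last replicated write at 09:20, outage at 10:00,
# service restored at 13:30.
recovery_time, data_loss = measure_recovery(
    last_good_data=datetime(2024, 1, 15, 9, 20),
    outage_start=datetime(2024, 1, 15, 10, 0),
    service_restored=datetime(2024, 1, 15, 13, 30),
)

rto = timedelta(hours=4)  # target from the RTO example above
rpo = timedelta(hours=1)  # target from the RPO example above

rto_met = recovery_time <= rto  # 3h30m of downtime against a 4h RTO
rpo_met = data_loss <= rpo      # 40m of lost writes against a 1h RPO
print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

Note that the two numbers are measured from the same moment (the start of the outage) in opposite directions: RPO looks backwards to the last good data, RTO looks forwards to restoration.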
RTO and RPO drive every architectural decision in DR
If your RPO is 24 hours, daily backups are sufficient. If your RPO is 15 minutes, you need continuous replication. If your RTO is 4 hours, backup/restore might work. If your RTO is 15 minutes, you need pre-provisioned infrastructure in the DR region. The cost of DR scales roughly inversely with RTO and RPO — tighter requirements cost dramatically more.
| RTO / RPO requirement | Strategy | Annual cost premium over single-region | Recovery mechanism |
|---|---|---|---|
| Hours to days / Hours to days | Backup & Restore | ~5–10% | Restore from S3/Glacier, recreate infrastructure |
| Hours / Minutes to hours | Pilot Light | ~15–25% | Scale up minimal DR infra; restore recent replica |
| Minutes to 1 hour / Minutes | Warm Standby | ~50–70% | Redirect DNS to scaled-down but running DR environment |
| Seconds / Near zero | Active-Active / Multi-site | ~100%+ | Automatic DNS/load balancer failover; dual active |
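The table can be read as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that can meet both targets. The sketch below illustrates that idea. The `best_rto` and `best_rpo` floors are illustrative assumptions, not figures from the table (which only gives qualitative bands), and `cheapest_strategy` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    best_rto: timedelta  # ASSUMED fastest recovery this strategy can deliver
    best_rpo: timedelta  # ASSUMED smallest data-loss window it can deliver
    premium: str         # cost premium column, copied from the table


# Ordered cheapest first. The timedelta floors are assumptions for
# illustration; tune them to your own measured drill results.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=4), timedelta(hours=1), "~5–10%"),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=15), "~15–25%"),
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(minutes=1), "~50–70%"),
    Strategy("Active-Active / Multi-site", timedelta(0), timedelta(0), "~100%+"),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Strategy:
    """Return the cheapest strategy whose assumed floors satisfy both targets."""
    for strategy in STRATEGIES:
        if strategy.best_rto <= rto and strategy.best_rpo <= rpo:
            return strategy
    # Nothing cheaper qualifies: fall back to the most capable option.
    return STRATEGIES[-1]


choice = cheapest_strategy(rto=timedelta(minutes=15), rpo=timedelta(minutes=15))
print(f"{choice.name} (cost premium {choice.premium})")
```

Both targets have to be satisfied at once, so a loose RTO paired with a tight RPO (or the reverse) still pushes the workload up a tier.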
AWS and most cloud architects describe four DR strategies on a spectrum from cheapest/slowest to most expensive/fastest. Each is appropriate for different criticality tiers.
DR strategy breakdown
Never confuse multi-AZ HA with multi-region DR
Multi-AZ (e.g. RDS Multi-AZ, ALB across 3 AZs) protects against data centre failures within a region. Multi-region DR protects against an entire region becoming unavailable. These are different threat models and different cost profiles. Most teams nail multi-AZ but skip multi-region DR entirely — and discover they needed it during a region-level event.
An untested DR plan is not a DR plan. The most common DR failure mode is not a missing backup — it is that no one has ever run the restore procedure and it does not work under pressure.
Minimum viable DR testing programme
1. Define RTO and RPO targets per service tier. Document them in a service runbook. Get business sign-off on the numbers.
2. Test restore from backup quarterly: restore an RDS snapshot to a non-production environment and verify data integrity. Document the exact time taken.
3. Test failover for pilot light and warm standby: simulate a region failure, execute the runbook, and measure actual RTO and RPO. Compare them to the targets.
4. Publish the DR runbook somewhere that stays reachable when the primary region is down (for example, in the DR region). If the runbook lives only in the failing region, you cannot read it during an incident.
5. After each test, update the runbook with lessons learned. Automate whatever caused delays.
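The "measure and compare to targets" step of the testing programme can be captured in a small record per drill, so that misses are reported the same way every time. This is a minimal sketch; the `DrillResult` class, the service name, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    service: str
    target_rto: timedelta
    target_rpo: timedelta
    measured_rto: timedelta
    measured_rpo: timedelta

    def failures(self) -> list[str]:
        """List every target the drill missed, with the size of the miss."""
        problems = []
        if self.measured_rto > self.target_rto:
            problems.append(f"RTO missed by {self.measured_rto - self.target_rto}")
        if self.measured_rpo > self.target_rpo:
            problems.append(f"RPO missed by {self.measured_rpo - self.target_rpo}")
        return problems


# Hypothetical warm-standby drill: data loss was within target,
# but the failover took 35 minutes longer than the agreed RTO.
drill = DrillResult(
    service="checkout-api",
    target_rto=timedelta(hours=1),
    target_rpo=timedelta(minutes=15),
    measured_rto=timedelta(hours=1, minutes=35),
    measured_rpo=timedelta(minutes=10),
)

for problem in drill.failures():
    print(f"{drill.service}: {problem}")
```

Keeping the results of each drill in this form makes it easy to see whether the delays you automated after the last test actually shortened the measured RTO.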
Use AWS Backup or equivalent for centralised backup management
AWS Backup provides a single pane of glass for backup policies across RDS, DynamoDB, EBS, EFS, FSx, S3, and EC2. Set backup plans with retention rules, cross-region copy rules, and vault lock (WORM — write once, read many — for ransomware protection). Without a centralised backup service, teams rely on per-service backup settings and inevitably miss something.
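As a rough illustration of what a backup plan with a retention rule and a cross-region copy rule looks like, the sketch below builds the plan document as a plain dictionary. The field names follow the `BackupPlan` structure accepted by the AWS Backup `CreateBackupPlan` API (for example via boto3's `create_backup_plan(BackupPlan=...)`), but the plan name, vault names, account ID, schedule, and retention values are all assumptions. Nothing here calls AWS, and vault lock is configured on the vault itself rather than in the plan.

```python
import json


def build_backup_plan(plan_name: str,
                      vault_name: str,
                      dr_vault_arn: str,
                      retention_days: int) -> dict:
    """Build a backup plan document with one daily rule and a cross-region copy."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault_name,
                # Assumed schedule: every day at 05:00 UTC.
                "ScheduleExpression": "cron(0 5 ? * * *)",
                # Retention rule for recovery points in the primary region.
                "Lifecycle": {"DeleteAfterDays": retention_days},
                # Cross-region copy rule: each recovery point is also copied
                # to a vault in the DR region with the same retention.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }


# Hypothetical names and account ID, for illustration only.
plan = build_backup_plan(
    plan_name="prod-daily",
    vault_name="prod-vault",
    dr_vault_arn="arn:aws:backup:eu-west-1:123456789012:backup-vault:dr-vault",
    retention_days=35,
)
print(json.dumps(plan, indent=2))
```

Defining the plan once and assigning resources to it (typically by tag) is what removes the per-service settings that teams tend to forget.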
```bash
# Create a cross-region RDS read replica (Pilot Light pattern).
# A cross-region read replica keeps data near-current at low cost.
# Run the command against the DR region (--region); --source-region names
# the region that hosts the primary.
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --db-instance-class db.r6g.large \
  --source-region us-east-1 \
  --region eu-west-1 \
  --no-publicly-accessible

# Promote the read replica to a standalone instance during DR failover.
# Promotion is irreversible (it disconnects the replica from the primary):
# only run this during an actual DR event or a DR drill.
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region eu-west-1

# Check RDS backup retention and automated backup status.
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,BackupRetention:BackupRetentionPeriod,BackupWindow:PreferredBackupWindow}' \
  --output table
```
Cloud engineer and solutions architect interviews at senior level always include a DR scenario. Expect to be given RTO/RPO targets and asked to design the appropriate strategy and justify the cost.
Common questions:
Strong answer: Maps RTO/RPO requirements to specific strategies unprompted. Mentions Route 53 health checks for automatic DNS failover. Brings up the cost implications and the business case for each tier. Notes that the analytics read replica is an accidental pilot light.
Red flags: Treats multi-AZ as equivalent to DR. Cannot define RTO and RPO clearly. Recommends active-active for everything without discussing cost. Never mentions testing the DR plan.
Key takeaways
💡 Analogy
Think of DR strategies like insurance policies. Backup & Restore is basic liability insurance — cheap, but if something bad happens, you spend hours filing claims (restoring). Pilot Light is comprehensive insurance with a hotline that answers in an hour. Warm Standby is a pre-booked rental car that's waiting at the airport — you just have to drive it. Active-Active is owning two identical cars and driving both simultaneously — if one breaks down, you're already in the other one.
⚡ Core Idea
RTO (how fast you must recover) and RPO (how much data loss you can tolerate) determine which DR strategy you need. Tighter requirements cost sharply more: from roughly 5–10% over single-region for backup/restore to 100% or more for active-active. The four strategies (backup/restore → pilot light → warm standby → active-active) are a spectrum of cost vs. recovery speed.
🎯 Why It Matters
DR is where most companies discover their architecture has hidden assumptions ("we assumed us-east-1 would never go down"). Designing DR into the architecture from the beginning is 5–10x cheaper than retrofitting it. Engineers who understand RTO/RPO and can map them to concrete AWS services and costs are invaluable in architecture discussions.