Keep systems up when things fail: redundancy, failover, and simple targets.
High availability (HA) = your system stays up even when parts fail. One server dies? Traffic goes to another. One datacenter has a blackout? Another region takes over.
Resilience = the system bends instead of breaking. It recovers from failures without you having to fix it by hand every time.
Redundancy
More than one of something. One fails, the other keeps serving.
Failover
Primary fails → traffic switches to backup automatically. No manual flip.
RTO & RPO
RTO = how long you can be down. RPO = how much data loss you accept.
Multi-AZ / multi-region
Run or replicate in more than one place. One AZ down? Another takes over.
HA = system stays up when parts fail. Set RTO and RPO, then design redundancy and failover to meet them.
Redundancy = you have more than one of something. Two servers, two regions, two copies of data. If one fails, the other keeps serving.
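To put a rough number on why a second copy helps, here is a minimal sketch. It assumes the copies fail independently, which real systems only approximate (shared power, shared bugs, and shared deploys all break that assumption), and the 99% figure is purely illustrative. The function name `combined_availability` is made up for this example.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of a group of independent copies where any one copy can serve.

    `single` is the availability of one copy as a fraction (0.99 means 99%).
    Assumption: failures are independent, which is optimistic in practice.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Illustrative input: each copy is up 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} copies: {combined_availability(0.99, n):.6f}")
```

Under these assumptions, going from one copy to two cuts expected downtime by a large factor, and each additional copy helps less than the one before it.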
Failover = when the primary fails, traffic or work automatically switches to the backup. No manual flip. Health checks detect failure; the system reroutes.
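The "health checks detect failure; the system reroutes" idea can be sketched in a few lines. This is a simplified illustration, not how any particular load balancer works: `Backend`, `is_healthy`, and `pick_backend` are hypothetical names, and the health check is a plain callable standing in for a real probe such as an HTTP request.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Backend:
    name: str
    # Hypothetical health probe: returns True when the backend can serve traffic.
    is_healthy: Callable[[], bool]


def pick_backend(backends: Sequence[Backend]) -> Backend:
    """Return the first healthy backend in priority order (primary first)."""
    for backend in backends:
        if backend.is_healthy():
            return backend
    raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    primary_up = {"value": True}  # mutable flag so we can simulate a failure
    primary = Backend("primary", lambda: primary_up["value"])
    backup = Backend("backup", lambda: True)

    print(pick_backend([primary, backup]).name)  # primary is healthy, so it serves
    primary_up["value"] = False                  # simulate the primary failing
    print(pick_backend([primary, backup]).name)  # traffic now goes to the backup
```

A production setup would usually require several consecutive failed checks before switching, so that a single slow response does not trigger a failover.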
Cloud makes this easier: you get multiple availability zones (AZs), managed load balancers, and auto-scaling groups that replace unhealthy instances automatically.
RTO (Recovery Time Objective) = how long you can afford to be down. "We need to be back within 1 hour." That drives how fast failover and restore must be.
RPO (Recovery Point Objective) = how much data loss you can accept. "We can lose at most 5 minutes of data." That drives how often you replicate or back up.
Set RTO and RPO based on business impact. Then design redundancy, backups, and failover to meet them.
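As a rough way to check a design against these targets, the sketch below compares a replication interval to the RPO and the sum of detection and recovery time to the RTO. The targets come from the examples above (1 hour, 5 minutes); every other number and both function names are assumptions chosen for illustration.

```python
from datetime import timedelta


def meets_rpo(replication_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure hits just before the next copy is taken,
    so everything written since the last copy is lost."""
    return replication_interval <= rpo


def meets_rto(detection: timedelta, recovery: timedelta, rto: timedelta) -> bool:
    """Downtime is roughly the time to notice the failure plus the time to recover."""
    return detection + recovery <= rto


if __name__ == "__main__":
    rto = timedelta(hours=1)    # "We need to be back within 1 hour."
    rpo = timedelta(minutes=5)  # "We can lose at most 5 minutes of data."

    # Hypothetical design A: hourly backups, restore takes 3 hours.
    print(meets_rpo(timedelta(hours=1), rpo))                       # False
    print(meets_rto(timedelta(minutes=5), timedelta(hours=3), rto))  # False

    # Hypothetical design B: replication every minute, automatic failover in 2 minutes.
    print(meets_rpo(timedelta(minutes=1), rpo))                            # True
    print(meets_rto(timedelta(seconds=90), timedelta(minutes=2), rto))     # True
```

The point of the comparison is that a tighter RPO pushes you toward more frequent replication, and a tighter RTO pushes you toward automatic failover rather than restore from backup.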
Scenario: Your app runs in two AZs, but the database lives in only one. That AZ goes down.
Decision: The app tier fails over to the other AZ, but the database is gone. You are down until you restore from backup. To get HA, you need a multi-AZ database (replica in another AZ with automatic failover) or accept a longer RTO and restore from backups.
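The scenario above is a single-point-of-failure check, which can be sketched as a small function. The AZ names (`az-a`, `az-b`), the tier names, and `tiers_lost` itself are hypothetical; the placement map simply records which AZs hold a copy of each tier.

```python
from typing import Dict, List, Set


def tiers_lost(placement: Dict[str, Set[str]], failed_az: str) -> List[str]:
    """Return the tiers that have no copy left once `failed_az` goes down."""
    return [
        tier
        for tier, azs in placement.items()
        if not (azs - {failed_az})  # nothing remains outside the failed AZ
    ]


if __name__ == "__main__":
    # The scenario: app in two AZs, database in only one.
    single_az_db = {"app": {"az-a", "az-b"}, "database": {"az-a"}}
    print(tiers_lost(single_az_db, "az-a"))  # ['database']

    # With a database replica in the second AZ, no tier is lost.
    multi_az_db = {"app": {"az-a", "az-b"}, "database": {"az-a", "az-b"}}
    print(tiers_lost(multi_az_db, "az-a"))   # []
```

Any tier this check reports is one you either replicate to another AZ or cover with a longer RTO and a tested restore from backup. Having a copy elsewhere is necessary but not sufficient: the replica still needs automatic failover to take over.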
System design and SRE interviews: expect to discuss redundancy, failover, and how you would meet given RTO/RPO targets.
Common questions:
What is the difference between RTO and RPO?
RTO is recovery time (how long until you are back up); RPO is recovery point (how much data loss you can accept).