AWS Well-Architected Framework
Six pillars for building secure, reliable, efficient, and cost-optimized systems — and the trade-offs between them.
What you'll learn
- Six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability.
- Pillars trade off — maximize reliability for payment processing; optimize cost for internal batch jobs. Match investment to business criticality.
- Reliability pillar key patterns: Multi-AZ databases, Auto Scaling Groups, automated recovery, test recovery procedures with game days.
- Security pillar: least privilege IAM, all data encrypted at rest and in transit, no secrets in code or userdata, CloudTrail enabled.
- Use the free AWS Well-Architected Tool in the console to run a structured review against all six pillars.
Lesson outline
The checklist that saved a company from a $30M outage
A Series C startup was acquired and their architecture needed to scale from 50,000 to 5 million users. Before the migration, the acquiring company ran an AWS Well-Architected Review.
Three critical findings: single point of failure in the database (no Multi-AZ), secrets stored in EC2 userdata (security), and no documented runbook for any incident (reliability). All three were fixed before the migration.
Seven months later, the primary database failed. Instead of a 6-hour outage, RDS Multi-AZ automatic failover restored service in 73 seconds. The Well-Architected review paid for itself 1,000 times over.
What is the AWS Well-Architected Framework?
A set of architectural best practices organized into six pillars, developed by AWS from reviewing thousands of customer workloads. Each pillar has a set of design principles, questions, and best practices. The free AWS Well-Architected Tool in the console lets you review any workload against these pillars.
The six pillars — and what they each protect
The six pillars of the Well-Architected Framework
- I. Operational Excellence — Running and monitoring systems to deliver business value and continually improving operations. Key practices: infrastructure as code, annotated documentation, frequent small reversible changes, regular game days, post-mortems.
- II. Security — Protecting information, systems, and assets. Key practices: identity foundation (least privilege), traceability (CloudTrail), security at all layers, automated security best practices, protect data in transit and at rest, keep people away from data, prepare for security events.
- III. Reliability — Ensuring a workload performs its intended function correctly and consistently. Key practices: test recovery procedures, automatically recover from failure, scale horizontally, stop guessing capacity, manage change through automation.
- IV. Performance Efficiency — Using computing resources efficiently. Key practices: democratize advanced technologies (use managed services), go global in minutes, use serverless architectures, experiment more often, consider mechanical sympathy (use the right tool for the job).
- V. Cost Optimization — Avoiding unnecessary costs. Key practices: implement cloud financial management, adopt a consumption model (pay for what you use), measure overall efficiency, stop spending on undifferentiated heavy lifting, analyze and attribute expenditure.
- VI. Sustainability — (Added 2021) Minimizing the environmental impacts of running cloud workloads. Key practices: understand your impact, establish sustainability goals, maximize utilization, anticipate and adopt more efficient hardware and software offerings, use managed services, reduce the downstream impact of your workloads.
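As a rough illustration of two Security pillar practices from the list above (traceability with CloudTrail and a least-privilege identity foundation), a Terraform sketch might look like the following. The trail name, policy name, bucket names, and the `audit_bucket_name` variable are hypothetical placeholders. The sketch also assumes the log bucket already exists with a bucket policy that allows CloudTrail to write to it.

```hcl
# Hypothetical input: a pre-existing S3 bucket whose policy lets CloudTrail write logs.
variable "audit_bucket_name" {
  type        = string
  description = "Name of an existing S3 bucket for CloudTrail logs (assumption)"
}

# Traceability: record API activity in every region and detect log tampering.
resource "aws_cloudtrail" "audit" {
  name                          = "org-audit-trail" # hypothetical name
  s3_bucket_name                = var.audit_bucket_name
  is_multi_region_trail         = true
  include_global_service_events = true
  enable_log_file_validation    = true
}

# Least privilege: grant read access to one prefix instead of s3:* on all resources.
data "aws_iam_policy_document" "reports_read_only" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-reports-bucket/reports/*"] # hypothetical bucket
  }
}

resource "aws_iam_policy" "reports_read_only" {
  name   = "reports-read-only" # hypothetical name
  policy = data.aws_iam_policy_document.reports_read_only.json
}
```

The policy document names one action on one prefix. Widening it later is an explicit, reviewable change rather than a default.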
The trade-offs between pillars
The hardest part of the Well-Architected Framework is that the pillars trade off against each other. There is no architecture that is simultaneously maximally reliable, maximally performant, and maximally cost-optimized.
| Trade-off scenario | Decision | What you gain | What you sacrifice |
|---|---|---|---|
| Multi-AZ vs single-AZ database | Multi-AZ | Reliability (73s failover vs 6h recovery) | Cost (2× database cost) |
| DynamoDB on-demand vs provisioned | On-demand | Cost efficiency at low/unpredictable load; Reliability (no capacity planning, lower throttling risk on spikes) | Cost at high predictable load (provisioned capacity is typically much cheaper at steady utilization) |
| Synchronous vs async processing | Async (SQS + Lambda) | Reliability (retries, dead letter queues); Cost (pay per invocation) | Complexity; Operational excellence (harder to debug) |
| Caching (ElastiCache vs no cache) | Add cache | Performance; Cost (fewer DB reads) | Reliability (cache invalidation bugs); Operational excellence (more components) |
| Spot instances vs On-Demand | Spot for batch jobs | Cost (70-90% cheaper) | Reliability (spot interruptions); must design for graceful shutdown |
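To make the asynchronous-processing row more concrete, here is a minimal Terraform sketch of an SQS queue backed by a dead letter queue. The queue names, the visibility timeout, and the retry count of 5 are illustrative assumptions, not values from the lesson.

```hcl
# Dead letter queue: holds messages that repeatedly fail processing.
resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-dlq" # hypothetical name
  message_retention_seconds = 1209600      # 14 days, the SQS maximum
}

# Main queue: failed messages are retried, then moved to the DLQ.
resource "aws_sqs_queue" "orders" {
  name                       = "orders" # hypothetical name
  visibility_timeout_seconds = 60       # assumed; should exceed the consumer's processing time

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5 # assumed retry budget before a message moves to the DLQ
  })
}
```

The reliability gain comes from retries plus a place to park poison messages. The cost is the extra component to monitor, which is the operational trade-off the table describes.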
How to make pillar trade-off decisions
Match your reliability and cost investments to the business criticality of the workload. A payment processing service needs maximum reliability (Multi-AZ, read replicas, chaos testing). An internal analytics dashboard can trade reliability for cost (single-AZ, no HA, spot instances for batch processing).
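One way to make this decision explicit in infrastructure code is to drive the reliability settings from a single criticality input. The sketch below is an assumption-laden example, not a prescribed pattern: the `criticality` variable, the instance sizes, the retention periods, and all resource names are hypothetical.

```hcl
variable "criticality" {
  type        = string
  description = "Business criticality of the workload (hypothetical input)"
  default     = "low"

  validation {
    condition     = contains(["low", "high"], var.criticality)
    error_message = "The criticality value must be \"low\" or \"high\"."
  }
}

locals {
  is_critical = var.criticality == "high"
}

resource "aws_db_instance" "app" {
  identifier        = "app-db" # hypothetical name
  engine            = "postgres"
  instance_class    = local.is_critical ? "db.m6g.large" : "db.t3.micro" # assumed sizes
  allocated_storage = 20

  username                    = "app_admin" # hypothetical
  manage_master_user_password = true        # password is kept in Secrets Manager, not in code

  # Reliability settings scale with business criticality.
  multi_az                  = local.is_critical
  backup_retention_period   = local.is_critical ? 14 : 1
  deletion_protection       = local.is_critical
  skip_final_snapshot       = !local.is_critical
  final_snapshot_identifier = local.is_critical ? "app-db-final" : null
}
```

With `criticality = "high"` the database gets Multi-AZ, longer backup retention, and deletion protection. With the default it stays single-AZ and cheap, which keeps the trade-off visible in code review.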
A company uses a single-AZ RDS instance to reduce database costs. Which Well-Architected pillar are they trading off?
Pillar deep-dive: Reliability
Reliability is often the pillar where gaps lead to the most visible failures. Here are the key design patterns:
Key reliability patterns with AWS services
- Automatic recovery (self-healing) — Amazon EC2 auto recovery, ECS health checks, and ALB target group health checks detect unhealthy instances or tasks and recover, replace, or route around them without human intervention.
- Test recovery procedures — Run game days: simulate AZ failures, instance terminations, database failovers. The framework's principle is to test recovery procedures rather than assume they work, so measure your RTO instead of guessing it. Teams often discover during game days that their actual recovery time is far longer than the RTO they assumed.
- Horizontal scaling — Auto Scaling Groups with EC2, ECS Service autoscaling, DynamoDB on-demand: add capacity as load increases and remove it when load drops. Spreading load across many small resources also avoids a single point of failure.
- Manage change with automation — CloudFormation, CDK, or Terraform for all infrastructure changes. No manual console changes in production. Every change is code-reviewed, tested in staging, and deployed with rollback capability.
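The automatic recovery and horizontal scaling patterns above can be sketched together in Terraform. This is a minimal example under stated assumptions: the three input variables, the resource names, the size limits, and the 60% CPU target are all hypothetical, and the launch template and target group are assumed to be defined elsewhere.

```hcl
# Hypothetical inputs, assumed to be defined elsewhere in the configuration.
variable "private_subnet_ids" {
  type        = list(string)
  description = "Subnets in at least two Availability Zones"
}

variable "launch_template_id" {
  type = string
}

variable "target_group_arn" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web-asg" # hypothetical name
  min_size            = 2         # at least two instances, so no single point of failure
  max_size            = 10
  vpc_zone_identifier = var.private_subnet_ids

  # Automatic recovery: replace instances that fail the load balancer health check.
  health_check_type         = "ELB"
  health_check_grace_period = 120
  target_group_arns         = [var.target_group_arn]

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }
}

# Horizontal scaling: add or remove instances to hold average CPU near the target.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking" # hypothetical name
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # assumed target
  }
}
```

Because the group is defined in code, changing its size limits goes through the same review and rollback path as any other change, which matches the "manage change with automation" pattern.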
```hcl
# Well-Architected: Reliability pillar: RDS with Multi-AZ and automated backups
resource "aws_db_instance" "main" {
  identifier     = "production-db"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.large"

  # Assumed values, not in the original lesson: initial storage and master credentials
  # are required arguments. The password is managed in Secrets Manager, not in code.
  allocated_storage           = 100
  username                    = "db_admin"
  manage_master_user_password = true

  # Reliability: Multi-AZ for automatic failover (73s vs hours)
  # Multi-AZ is the most important reliability setting; it enables 73s automatic failover.
  multi_az = true

  # Reliability: Automated backups with 7-day retention
  # Automated backups enable point-in-time recovery, so your RPO is minutes, not days.
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:05:00-sun:06:00"

  # Security: encryption at rest (Security pillar)
  # Encryption at rest is a Security pillar requirement; always enable for production data.
  # Assumes an aws_kms_key.db resource is defined elsewhere.
  storage_encrypted = true
  kms_key_id        = aws_kms_key.db.arn

  # Reliability: automated minor version upgrades
  auto_minor_version_upgrade = true

  # Reliability: deletion protection prevents accidental deletion
  # Deletion protection prevents someone accidentally `terraform destroy`-ing your production database.
  deletion_protection = true

  # Cost: storage autoscaling prevents manual intervention
  max_allocated_storage = 1000
}
```
How this might come up in interviews
This topic comes up in cloud architecture interviews, solutions architect roles, and technical leadership discussions. AWS certifications (SAA, SAP) test it extensively.
Common questions:
- What are the six pillars of the AWS Well-Architected Framework?
- How do you trade off reliability vs cost in a real architecture decision?
- What does the Well-Architected Framework say about security best practices?
- Have you done a Well-Architected Review? What did you find?
Key takeaways
- Six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability.
- Pillars trade off — maximize reliability for payment processing; optimize cost for internal batch jobs. Match investment to business criticality.
- Reliability pillar key patterns: Multi-AZ databases, Auto Scaling Groups, automated recovery, test recovery procedures with game days.
- Security pillar: least privilege IAM, all data encrypted at rest and in transit, no secrets in code or userdata, CloudTrail enabled.
- Use the free AWS Well-Architected Tool in the console to run a structured review against all six pillars.
Before you move on: can you answer these?
A startup wants to minimize AWS costs on their MVP. Should they use Multi-AZ RDS?
It depends on business criticality. For an MVP with no paying customers, single-AZ is an acceptable cost trade-off. Once customers are paying or the product is business-critical, Multi-AZ is required (Reliability pillar). This is the explicit trade-off the framework asks you to make consciously.
What is the purpose of a "game day" in the Reliability pillar?
A game day is a scheduled exercise where you simulate failures (AZ outage, database failover, instance termination) to measure your actual RTO and RPO, and to discover recovery procedures that do not work before a real incident does.