Add multi-region disaster recovery
Continues from the last build: TaskFlow runs securely and observably in one region on EKS via GitOps.
TaskFlow already runs well in eu-west-1: it is on EKS, deployed by Argo CD from Git, with Prometheus and Grafana watching it and security baked into the pipeline.
What you'll build
TaskFlow survives a full primary-region outage by failing over to a warm standby region, with a target RTO under 15 minutes and a measured RPO that you record during a real game-day (typically single-digit seconds in steady state, not a guarantee). Route 53 health-checked DNS failover routes traffic to the standby, whose database is kept current by asynchronous cross-region replication, and the whole stack is defined as reusable Terraform modules and proven by a rehearsed failover.
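As a rough illustration of the asynchronous cross-region replication described above, the standby database could be declared as a cross-region read replica. This is a minimal sketch that assumes the database is Amazon RDS, which the project description does not confirm. The resource names `aws_db_instance.primary` and `aws_kms_key.standby_rds`, the identifier, and the instance class are all hypothetical. The `aws.standby` alias refers to the standby provider configuration in this project's provider file.

```hcl
# Sketch only. Assumes an RDS primary declared elsewhere as
# aws_db_instance.primary (hypothetical name) in eu-west-1.
resource "aws_db_instance" "standby_replica" {
  provider = aws.standby # create the replica in the standby region

  identifier = "taskflow-standby-replica" # hypothetical identifier

  # A cross-region replica references the source instance by ARN,
  # not by its identifier.
  replicate_source_db = aws_db_instance.primary.arn

  instance_class = "db.t4g.medium" # assumed size, pick your own

  # An encrypted cross-region replica needs a KMS key that lives in the
  # destination region (hypothetical key resource).
  storage_encrypted = true
  kms_key_id        = aws_kms_key.standby_rds.arn

  skip_final_snapshot = true
}
```

Because replication of this kind is asynchronous, the replica can trail the primary by some amount of lag, which is why the RPO is something you measure during the game-day rather than assume.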
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
terraform {
  required_version = ">= 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Primary region: eu-west-1
provider "aws" {
  alias  = "primary"
  region = "eu-west-1"
}

# Standby region: eu-central-1
provider "aws" {
  alias  = "standby"
  region = "eu-central-1"
}

Reading this file
alias = "primary": A named copy of the AWS provider so you can target eu-west-1 explicitly instead of relying on a single default region.
region = "eu-central-1": This is the standby region where the warm copy of TaskFlow lives, ready to take traffic.
version = "~> 5.0": Pins the AWS provider to the 5.x line so a future major release cannot silently change resource behavior under you.
One provider per region using alias. Every module call must say which provider it uses, so resources land in the right region.
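To make the "every module call must say which provider it uses" point concrete, here is a hedged sketch of one module called twice, once per aliased provider. The module path `./modules/region` and the input variables `region_name` and `role` are hypothetical. Only the `primary` and `standby` aliases come from the file above.

```hcl
# Sketch only: module path and input names are assumptions.
module "region_primary" {
  source = "./modules/region" # hypothetical local module

  # Map the module's "aws" provider to the eu-west-1 configuration.
  providers = {
    aws = aws.primary
  }

  region_name = "eu-west-1" # hypothetical input variable
  role        = "primary"   # hypothetical input variable
}

module "region_standby" {
  source = "./modules/region" # same module, second call

  # Same code, but every resource inside lands in eu-central-1.
  providers = {
    aws = aws.standby
  }

  region_name = "eu-central-1"
  role        = "standby"
}
```

For this to work, the module itself would declare `aws` in its own `required_providers` block. A resource defined directly in the root configuration picks its region the same way, with `provider = aws.primary` or `provider = aws.standby`.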
That's 1 of 7 explained code blocks in this single project.
The build, milestone by milestone
- 1
Decide the DR strategy and write down RTO/RPO targets
3 guided steps
RTO and RPO are the contract. Without them you cannot tell whether your architecture is over-engineered (paying for active-active when warm standby suffices) or under-engineered (promising a 5-minute RPO when your replica lag is 10 minutes). They also make the build testable: the game-day passes only if you hit the numbers you wrote here.
- 2
Build a region as a reusable Terraform module and stamp it twice
3 guided steps
A region module is what makes the second region trustworthy. If the standby were hand-built, it would drift from the primary and your failover would fail on some forgotten config. One module stamped twice means the standby is the primary, minus the differences you chose on purpose.
- 3
Replicate images, config, and secrets so the standby can actually run
3 guided steps
DR failures are almost never about the database; they are about the boring supporting cast. The classic game-day surprise is 'the pods are stuck in ImagePullBackOff in the standby because the image is not there' or 'the app cannot start because the secret only exists in the primary region.' Replicating images and secrets ahead of time removes the two most common failover blockers.
- 4
Wire Route 53 health-checked failover and prove it switches
3 guided steps
DNS failover is the automatic part of your RTO. If the health check is wrong (too slow, wrong path, too sensitive) your failover is either too slow to meet RTO or so trigger-happy it flaps during a minor blip. Proving the switch in a controlled way, before a real outage, is the only way to trust the number you wrote in milestone 1.
- 5
Write and rehearse the failover runbook with a game-day
3 guided steps
A DR plan you have never executed is fiction. The first time you promote a replica should not be during a real outage. The game-day surfaces the things no diagram shows: the replica took 4 minutes to promote, the app needed a restart to pick up the new DB endpoint, a secret was missing. Measuring the real RTO/RPO is what turns 'we have DR' into 'we have a tested 12-minute RTO,' which is the sentence that wins the audit and the interview.
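The health-checked failover from milestone 4, which the game-day then exercises, might look roughly like the sketch below. The hostnames, the `/healthz` path, the `zone_id` variable, the TTL, and the interval and threshold values are all assumptions for illustration, not values taken from the project. Route 53 is a global service, so which aliased provider manages these resources is an arbitrary choice here.

```hcl
# Sketch only: all names and numbers below are illustrative assumptions.
variable "zone_id" {
  type        = string
  description = "Hosted zone ID for the TaskFlow domain (hypothetical input)"
}

# Probes the primary region's public endpoint.
resource "aws_route53_health_check" "primary" {
  provider = aws.primary

  fqdn              = "primary.taskflow.example.com" # hypothetical endpoint
  type              = "HTTPS"
  port              = 443
  resource_path     = "/healthz" # hypothetical health path
  request_interval  = 10         # seconds between checks (10 or 30)
  failure_threshold = 3          # consecutive failures before "unhealthy"
}

# Answered while the primary health check passes.
resource "aws_route53_record" "primary" {
  provider = aws.primary

  zone_id         = var.zone_id
  name            = "app.taskflow.example.com" # hypothetical public name
  type            = "CNAME"
  ttl             = 60
  records         = ["primary.taskflow.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Answered only when the primary record is considered unhealthy.
resource "aws_route53_record" "standby" {
  provider = aws.primary

  zone_id        = var.zone_id
  name           = "app.taskflow.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["standby.taskflow.example.com"] # hypothetical endpoint
  set_identifier = "standby"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

With these example values, detection takes on the order of three failed checks at 10-second intervals, and clients may keep a cached answer for up to the 60-second TTL. Both count against the RTO, which is the trade-off between "too slow" and "too sensitive" that milestone 4 describes.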