Make the data layer survive bad days
Continues from the last build: from rung 12 you have a blameless postmortem habit and Loki/Grafana evidence skills.
Everything you have hardened so far (retries, timeouts, circuit breakers) protects the request path, never the data underneath it.
What you'll build
You walk away able to say, with evidence, exactly how long it takes to restore Carta's database and how much data you would lose, because you measured both with a stopwatch. You will have a tested backup-and-restore pipeline, streaming replication with a replication-lag SLI and alert, a rehearsed failover runbook, a deliberate Redis persistence decision, and automated backup verification that fails loudly when a dump is bad. RTO and RPO stop being slideware and become numbers you defend in a review.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
# postgres/primary.conf appended to the primary's postgresql.conf
# Enables WAL archiving and streaming replication for Carta's data layer.

# allow replication and archiving (minimal is not enough)
wal_level = replica

# keep archiving on and ship each completed segment to the mounted store
archive_mode = on
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'

# allow standbys to connect and stream
max_wal_senders = 5
max_replication_slots = 5

# retain enough WAL on the primary for a briefly-behind replica
wal_keep_size = 256MB
Reading this file
- wal_level = replica
This must be raised above minimal: at minimal, Postgres will not start with archive_mode on or with max_wal_senders above zero, so neither archiving nor replication can run.
- archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
The test guard refuses to overwrite an existing segment, the no-overwrite pattern the Postgres docs recommend.
- max_wal_senders = 5
Each streaming standby holds one WAL sender connection; five leaves headroom for the replica plus drills.
- wal_keep_size = 256MB
Retains WAL so a replica that briefly falls behind can still catch up without a full rebuild.
Drop-in primary config that turns on the two things this rung depends on: WAL archiving for point-in-time restore and streaming for the replica. You still create the replication slot and replica service yourself.
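To make the guard in archive_command concrete, here is a minimal Python sketch of the same no-overwrite behavior. It is an illustration only, not part of the project's files: the function name archive_segment and the demo file names are hypothetical, and the real config keeps using the shell command shown above. Postgres treats a non-zero exit status from archive_command as a failure and retries the segment later, which is what the return value of 1 stands in for.

```python
import shutil
import tempfile
from pathlib import Path


def archive_segment(segment_path: Path, archive_dir: Path) -> int:
    """Copy one WAL segment into the archive, refusing to overwrite.

    Returns 0 on success and 1 if the target already exists, mirroring the
    exit status of: test ! -f /wal_archive/%f && cp %p /wal_archive/%f
    Like the shell version, the check and the copy are not atomic.
    """
    target = archive_dir / segment_path.name
    if target.is_file():
        # Same role as `test ! -f`: never clobber an archived segment.
        return 1
    shutil.copy(segment_path, target)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wal_dir = Path(tmp, "pg_wal")
        archive_dir = Path(tmp, "wal_archive")
        wal_dir.mkdir()
        archive_dir.mkdir()

        # Hypothetical segment name and contents, for the demo only.
        segment = wal_dir / "000000010000000000000001"
        segment.write_bytes(b"fake wal bytes")

        print(archive_segment(segment, archive_dir))  # first attempt copies
        print(archive_segment(segment, archive_dir))  # second attempt refuses
```

The design point is that a refused overwrite is reported as a failure rather than ignored, so an unexpected duplicate segment name shows up as an archiving error instead of silently replacing good data.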
That's 1 of 7 explained code blocks in this single project.
The build, milestone by milestone
- 1
Automate logical backups and WAL archiving
4 guided steps
A backup you cannot restore to a point in time is a backup that loses everything since midnight. WAL archiving is what cuts your worst-case data loss (RPO) from 24 hours to minutes. You cannot measure RPO honestly without it.
- 2
Run a restore drill and measure RTO and RPO with a stopwatch
4 guided steps
Untested backups are rumors. The only way to know your recovery objectives is to perform the recovery and watch the clock. This is the milestone that answers the teammate's question that started this rung.
- 3
Stand up streaming replication and a replication-lag SLI
4 guided steps
Backups recover you from a disaster; a hot replica lets you fail over in seconds and serves as a second copy that is normally close to current. But a replica that has silently fallen far behind is a trap, so you must measure lag, not assume it is zero.
- 4
Run a failover game-day: kill the primary, promote the replica, repoint api
4 guided steps
A replica you never promote is untested theater. Failover under stress is full of sharp edges (connection storms, split-brain risk, stale DNS), and the only way to find them is to rehearse before a real bad day forces you to.
- 5
Decide Redis persistence and automate backup verification
4 guided steps
Redis backs the inventory cache; whether losing it on restart is fine or catastrophic is a decision, not a default. And a backup you never verify is Schrödinger's backup: you only learn it is bad during the restore you cannot afford to have fail.
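The recovery numbers these milestones ask for come down to simple timestamp arithmetic. The sketch below is one hedged way to write it down in Python; every function name and every timestamp in it is hypothetical and chosen for illustration, not taken from a real Carta drill. It treats RTO as the time from outage to restored service, RPO as the gap between the outage and the last write you could recover, and the replication-lag SLI as the fraction of lag samples at or under a threshold you pick.

```python
from datetime import datetime, timedelta
from typing import Sequence


def measured_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Recovery time: how long the service was down during the drill."""
    return service_restored - outage_start


def measured_rpo(outage_start: datetime, last_recovered_write: datetime) -> timedelta:
    """Recovery point: how much recent data the restore could not bring back."""
    return outage_start - last_recovered_write


def lag_sli(lag_samples_seconds: Sequence[float], threshold_seconds: float) -> float:
    """Fraction of replication-lag samples at or under the threshold."""
    if not lag_samples_seconds:
        raise ValueError("need at least one lag sample")
    good = sum(1 for lag in lag_samples_seconds if lag <= threshold_seconds)
    return good / len(lag_samples_seconds)


if __name__ == "__main__":
    # Hypothetical drill timeline, for illustration only.
    nightly_dump = datetime(2024, 1, 10, 0, 0)
    last_archived_wal = datetime(2024, 1, 10, 14, 27)
    outage_start = datetime(2024, 1, 10, 14, 30)
    service_restored = datetime(2024, 1, 10, 14, 52)

    print("RPO, dump only:", measured_rpo(outage_start, nightly_dump))
    print("RPO, dump + WAL:", measured_rpo(outage_start, last_archived_wal))
    print("RTO:", measured_rto(outage_start, service_restored))

    # Hypothetical lag samples in seconds, against an assumed 10 s threshold.
    print("lag SLI:", lag_sli([0.4, 1.2, 0.8, 45.0], threshold_seconds=10.0))
```

With these made-up inputs the dump-only RPO is 14 hours 30 minutes while the WAL-archived RPO is 3 minutes, which is the gap the first milestone describes. In a real drill the timestamps come from your stopwatch and logs, and the lag samples from whatever metric your replica exposes.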
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.