Expert · Carta · Project 13 of 14 · ~9h · 5 milestones

Make the data layer survive bad days

Continues from the last build: From rung 12 you have a blameless postmortem habit and Loki/Grafana evidence skills, but every fix so far protected the request path, never the data underneath it.

Everything you have hardened so far protects the request path: retries, timeouts, circuit breakers, blameless postmortems.

PostgreSQL backup with pg_dump and WAL archiving
Measuring RTO and RPO with a real restore drill
Configuring streaming replication
Building a replication-lag SLI and alert in Prometheus
Running a failover game-day and repointing a service
Deciding Redis persistence trade-offs (RDB vs AOF)
Automating backup verification to catch silent corruption

What you'll build

You walk away able to say, with evidence, exactly how long it takes to restore Carta's database and how much data you would lose, because you measured both with a stopwatch. You will have a tested backup-and-restore pipeline, streaming replication with a replication-lag SLI and alert, a rehearsed failover runbook, a deliberate Redis persistence decision, and automated backup verification that fails loudly when a dump is bad. RTO and RPO stop being slideware and become numbers you defend in a review.
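As a rough illustration of what "measured with a stopwatch" can look like in a script, the sketch below treats RTO as the wall-clock span of a restore and RPO as the gap between the failure and the last write you could recover. The function names and the epoch-second inputs are assumptions made for this example, not the project's actual restore-drill.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; every input is a Unix timestamp in seconds.

# RTO: seconds between starting the restore and the service answering again.
rto_seconds() {
  local restore_started=$1 service_healthy=$2
  echo $(( service_healthy - restore_started ))
}

# RPO: seconds of writes lost, i.e. failure time minus last recoverable commit.
rpo_seconds() {
  local last_recovered_commit=$1 failure_time=$2
  echo $(( failure_time - last_recovered_commit ))
}

# Render a number of seconds as minutes and seconds for a drill log.
fmt_duration() {
  local s=$1
  printf '%dm %02ds\n' $(( s / 60 )) $(( s % 60 ))
}
```

In a drill you might capture start=$(date +%s) before the restore and end=$(date +%s) once a health check passes, then log fmt_duration "$(rto_seconds "$start" "$end")".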

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

prod-sim/postgres/primary.conf · ini
# postgres/primary.conf: appended to the primary's postgresql.conf
# Enables WAL archiving and streaming replication for Carta's data layer.

# allow replication and archiving (minimal is not enough)
wal_level = replica

# keep archiving on and ship each completed segment to the mounted store
archive_mode = on
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'

# allow standbys to connect and stream
max_wal_senders = 5
max_replication_slots = 5

# retain enough WAL on the primary for a briefly-behind replica
wal_keep_size = 256MB

Reading this file

  • wal_level = replica
    At minimal, the WAL does not carry enough information for archiving or replication, and Postgres refuses to start if archive_mode or WAL senders are enabled.
  • archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
    The test guard refuses to overwrite an existing segment, the no-overwrite pattern the Postgres docs recommend.
  • max_wal_senders = 5
    Each streaming standby (and each pg_basebackup run) occupies a WAL sender connection; five leaves headroom for the replica plus drills.
  • wal_keep_size = 256MB
    Retains WAL so a replica that briefly falls behind can still catch up without a full rebuild.

Drop-in primary config that turns on the two things this rung depends on: WAL archiving for point-in-time restore and streaming for the replica. You still create the replication slot and replica service yourself.
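To see the archive_command guard behave in isolation, the same logic can be wrapped in a small function and exercised against a scratch directory. This is a minimal sketch: archive_segment and its three arguments are hypothetical names standing in for %p, the archive directory, and %f.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the archive_command one-liner, for local testing.
# $1 = path to the WAL segment (%p), $2 = archive directory, $3 = file name (%f).
archive_segment() {
  local src=$1 archive_dir=$2 name=$3
  # Refuse to overwrite: a non-zero exit tells Postgres the attempt failed,
  # so it keeps the segment and retries later instead of clobbering the copy.
  test ! -f "$archive_dir/$name" && cp "$src" "$archive_dir/$name"
}
```

Calling it twice with the same file name should succeed the first time and fail the second, leaving the first copy untouched.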

That's 1 of 7 explained code blocks in this single project.

The build, milestone by milestone

  1. Automate logical backups and WAL archiving

    4 guided steps

    A backup you cannot roll forward to a point in time is a backup that loses everything since midnight. WAL archiving is what turns your worst-case data loss (RPO) from 24 hours into minutes. You cannot measure RPO honestly without it.

  2. Run a restore drill and measure RTO and RPO with a stopwatch

    4 guided steps

    Untested backups are rumors. The only way to know your recovery objectives is to perform the recovery and watch the clock. This is the milestone that answers the teammate's question that started this rung.

  3. Stand up streaming replication and a replication-lag SLI

    4 guided steps

    Backups recover you from a disaster; a hot replica lets you fail over in seconds and gives you a second copy that is normally only moments behind. But a replica that has silently fallen far behind is a trap, so you must measure lag, not assume it is zero.

  4. Run a failover game-day: kill the primary, promote the replica, repoint api

    4 guided steps

    A replica you never promote is untested theater. Failover under stress is full of sharp edges (connection storms, split-brain risk, stale DNS), and the only way to find them is to rehearse before a real bad day forces you to.

  5. Decide Redis persistence and automate backup verification

    4 guided steps

    Redis backs the inventory cache; whether losing it on restart is fine or catastrophic is a decision, not a default. And a backup you never verify is Schrödinger's backup: you only learn it is bad during the restore you cannot afford to have fail.
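The project builds its replication-lag SLI in Prometheus, but the raw measurement behind milestone 3 can be sketched directly against the standby. The snippet below is an illustration only: REPLICA_DSN, replica_lag_seconds, and lag_status are hypothetical names, and the threshold is whatever your SLO says it should be.

```shell
#!/usr/bin/env bash
# Assumption: REPLICA_DSN holds a connection string for the standby.

# Ask the standby how many seconds ago its last replayed transaction committed.
# Prints -1 when nothing has been replayed yet.
# Caveat: on an idle primary this number grows even though nothing is missing.
replica_lag_seconds() {
  psql -X -At -d "$REPLICA_DSN" -c \
    "SELECT COALESCE(ROUND(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())), -1)::int"
}

# Classify a lag reading (seconds) against a threshold (seconds).
lag_status() {
  local lag=$1 threshold=$2
  if (( lag < 0 )); then
    echo unknown
  elif (( lag > threshold )); then
    echo breach
  else
    echo ok
  fi
}
```

A cron job or exporter could then call lag_status "$(replica_lag_seconds)" 30 and treat anything other than ok as worth a look.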

What's inside when you start

2 starter files, ready to clone
5 guided milestones
5 full reference solutions
7 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A backup pipeline (run-backup.sh plus WAL archiving config) producing timestamped, checksummed pg_dump files
A restore-drill.sh and a drill log recording measured RTO and RPO from a real restore
A streaming replica plus prometheus/rules/replication.yml with lag, replica-down, and slot-pileup alerts
A failover-runbook.md and run-failover.sh that promote the replica and repoint api to a working checkout
A documented Redis persistence decision (RDB vs AOF vs none) with the trade-off written down
A verify-backup.sh that exits non-zero on a deliberately corrupted dump
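One way the "checksummed" and "exits non-zero on a corrupted dump" deliverables could fit together is sketched below. It assumes GNU coreutils sha256sum and a hypothetical layout in which every dump has a sibling .sha256 file; the project's own run-backup.sh and verify-backup.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical layout: each dump FILE has a sibling FILE.sha256 written at backup time.

# Record a checksum next to a freshly written dump.
write_checksum() {
  local dump=$1
  ( cd "$(dirname "$dump")" &&
    sha256sum "$(basename "$dump")" > "$(basename "$dump").sha256" )
}

# Return non-zero if the dump no longer matches its recorded checksum.
verify_checksum() {
  local dump=$1
  ( cd "$(dirname "$dump")" &&
    sha256sum --check --quiet "$(basename "$dump").sha256" ) >/dev/null 2>&1
}
```

A checksum only proves the bytes have not changed since the backup ran. For custom-format dumps, running pg_restore --list against the file is a cheap extra check that the archive's table of contents is still readable, and a full restore into a scratch database remains the strongest test.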

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
