Advanced · Carta · Project 9 of 14 · ~8h · 5 milestones

Automate the toil away: build a remediator

Continues from the last build: You left rung 8 with a capacity model and headroom policy, but inventory still leaks file descriptors and gets restarted by hand every week.

The inventory fd leak from rung 2 will not die. Roughly once a week the service slowly exhausts its file descriptors, /checkout starts throwing 5xx, an alert pages a human at 3am, and that human SSHes in and runs the same three commands: drain the sick replica, restart it, confirm recovery.

toil measurement and accounting · Prometheus HTTP query API · Docker SDK for Python · automation safety (rate limiting, circuit breaking) · audit logging · pytest unit testing · writing operational software

What you'll build

You walk away with a tested, rate-limited remediation service that turns a recurring 3am page into a logged, automatic, bounded action, plus a one-page toil ledger that justifies the work and a runbook entry that now reads "the remediator handles this; here is how to read its audit log."
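To make "logged, automatic, bounded" concrete, here is a minimal sketch of a guard that refuses restarts past a limit and records every decision. The class name `RestartGuard`, its parameters, and the JSON-lines record fields are assumptions made for illustration, not the project's reference solution.

```python
import json
import time
from collections import deque


class RestartGuard:
    """Sliding-window limit on restarts, with a JSON-lines audit trail.

    Hypothetical sketch: names and record fields are illustrative only.
    """

    def __init__(self, max_restarts, window_seconds, audit_path, clock=time.time):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.audit_path = audit_path
        self.clock = clock  # injectable so tests can control time
        self._allowed_at = deque()  # timestamps of restarts we allowed

    def _audit(self, now, replica, decision, reason):
        # One JSON object per line keeps the log easy to grep and parse.
        record = {"ts": now, "replica": replica, "decision": decision, "reason": reason}
        with open(self.audit_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    def allow(self, replica, reason):
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._allowed_at and now - self._allowed_at[0] >= self.window_seconds:
            self._allowed_at.popleft()
        if len(self._allowed_at) >= self.max_restarts:
            self._audit(now, replica, "refused", "rate limit reached")
            return False
        self._allowed_at.append(now)
        self._audit(now, replica, "allowed", reason)
        return True
```

Refusals are written to the audit log as well as approvals, so a human reading it later can see both what the remediator did and what it declined to do.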

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

remediator/Dockerfile (dockerfile)
FROM python:3.12-slim

WORKDIR /app

# Copy the dependency manifest first so the pip layer is cached
COPY remediator/requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Then copy the source
COPY remediator/ ./remediator/

# Run as a non-root user with only the access it needs
RUN useradd --create-home runner
USER runner

CMD ["python", "-m", "remediator.controller"]

Reading this file

  • COPY remediator/requirements.txt ./requirements.txt: Copies the dependency manifest before the source, so the pip layer stays cached when only code changes.
  • RUN pip install --no-cache-dir -r requirements.txt: Installs dependencies in a layer built before the source copy, so code edits do not invalidate it.
  • RUN useradd --create-home runner: Creates a non-root user so the remediator does not run as root.
  • CMD ["python", "-m", "remediator.controller"]: Starts the control loop as the container's default command.

Manifest-before-source ordering keeps the pip layer cached across code edits; the non-root user limits blast radius even though it still reaches the Docker socket.

That's 1 of 8 explained code blocks in this single project.

The build, milestone by milestone

  1. Measure the toil before you automate it

    3 guided steps

    Automating low-frequency toil can cost more than it saves. A ledger forces the payback question (does building this beat the manual cost?) and gives you a metric to prove the remediator worked after you ship it.

  2. Inject the leak and detect its Prometheus signature

    3 guided steps

    A remediator is only as safe as its trigger. A precise, sustained-over-time signature (not a single spiky sample) is what stops the tool from restarting healthy replicas. This is the detection half of the control loop.

  3. Drain and restart the sick replica through the Docker API

    3 guided steps

    Doing the restart through the Docker API (not a shelled-out command) makes the action testable with a mock and gives you structured errors. Draining before restart avoids dropping in-flight checkouts, which is the part humans often skip at 3am.

  4. Add the rate-limit guard and audit log

    3 guided steps

    An unbounded remediator that hits a restart-crash loop becomes an outage amplifier. The rate limit is a circuit breaker, and the audit log is how a human reconstructs what the robot did. Both are non-negotiable for any automation that touches production.

  5. Test it like the software it is

    3 guided steps

    Operational automation that can restart production must be the most-tested code you own, because its failure mode is taking the system down faster than a human would. Mocking the Docker client lets you test the restart path without restarting anything real.
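The drain-then-restart action and its mocked test from the milestones above might look roughly like this. It is a sketch under assumptions: `drain_and_restart`, the `drain` callable, and the replica name `inventory-1` are hypothetical, and `client` is only expected to behave like the object `docker.from_env()` returns in the Docker SDK for Python.

```python
from unittest.mock import MagicMock


def drain_and_restart(client, name, drain, stop_timeout=10):
    """Drain a replica, restart it, and report the status Docker gives back.

    `client` should behave like docker.from_env() from the Docker SDK for
    Python. `drain` is a hypothetical callable that takes the replica out of
    rotation; how it does that depends on your load balancer.
    """
    container = client.containers.get(name)
    drain(name)                              # stop sending new requests first
    container.restart(timeout=stop_timeout)  # seconds to wait before the kill
    container.reload()                       # refresh cached state from the daemon
    return container.status


def test_drains_before_restart():
    calls = []
    client = MagicMock()  # stands in for the Docker client; nothing real restarts
    container = client.containers.get.return_value
    container.restart.side_effect = lambda **kwargs: calls.append("restart")
    container.status = "running"

    status = drain_and_restart(client, "inventory-1", lambda n: calls.append("drain"))

    assert calls == ["drain", "restart"]  # the order is the point of the test
    client.containers.get.assert_called_once_with("inventory-1")
    container.restart.assert_called_once_with(timeout=10)
    assert status == "running"
```

Passing the client in as an argument, rather than creating it inside the function, is what lets the test swap in a mock and assert on call order.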

What's inside when you start

3 starter files, ready to clone
5 guided milestones
5 full reference solutions
8 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

remediator/ Python package: detector, actuator, rate limiter, audit log
remediator/Dockerfile and a prod-sim.yml service entry mounting the Docker socket read-only where possible
tests/ with pytest covering signature detection, rate-limit refusal, and a mocked Docker restart
toil-ledger.md: a week of manual restart events with time-cost totals and a payback calculation
A /leak injection path, documented so the leak and the remediation are reproducible
A runbook update describing how to read the audit log and how to disable the remediator
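As a rough idea of what the signature-detection piece of the package and its tests cover, here is a sketch of a sustained-over-time check. The function name, the threshold, and the use of a file-descriptor usage ratio are assumptions. The input shape, a list of `[timestamp, "value"]` pairs, matches the `values` field of a Prometheus range-query result.

```python
def is_sustained_leak(samples, threshold, min_duration_seconds):
    """Return True only if every sample in the trailing window exceeds threshold.

    `samples` is a list of [unix_timestamp, "value"] pairs, oldest first.
    Hypothetical sketch: a single spiky sample must not trigger a restart.
    """
    if not samples:
        return False
    newest = float(samples[-1][0])
    oldest = float(samples[0][0])
    # Not enough history yet to call the condition sustained.
    if newest - oldest < min_duration_seconds:
        return False
    window = [
        float(value)
        for ts, value in samples
        if newest - float(ts) <= min_duration_seconds
    ]
    return all(value > threshold for value in window)


# Assumed example data: fd usage ratio sampled every 60 seconds.
spiky = [[t, "0.50"] for t in range(0, 300, 60)] + [[300, "0.95"]]
sustained = [[t, "0.93"] for t in range(0, 360, 60)]

print(is_sustained_leak(spiky, 0.9, 300))      # False
print(is_sustained_leak(sustained, 0.9, 300))  # True
```

Requiring a full window of history before returning True means the check stays quiet right after a restart, when there are too few samples to judge.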

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
