The speed of your feedback loop determines the speed of your learning — and your systems. Every engineering practice that works is a feedback loop optimisation.
Know that a failing CI is a broken feedback loop for the entire team. Keep CI green. Run linters and type checks locally before pushing. A flaky test is not "probably fine" — it degrades the signal for everyone.
Actively measure your team's feedback loop latencies: how long does CI take, how long from merge to production, how quickly do you detect incidents? If CI takes 45 minutes, it is your job to compress it. Know your team's DORA metrics.
Design systems where monitoring is architecturally independent from what it monitors. Architect for observability as a first-class property. Champion DORA metric tracking and use the data to justify engineering investment. Set explicit MTTD and lead time targets.
Set org-level feedback loop standards: CI must be under X minutes, MTTD under Y minutes, deployment frequency at least Z. Use DORA as a lagging indicator of feedback loop health across teams. Identify where monitoring is architecturally coupled to the systems it monitors and drive independence.
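These latencies are easy to measure once pipeline events carry timestamps. A minimal sketch, where the event names and timestamps are illustrative (in practice they would come from your CI provider's and deploy tooling's APIs):

```python
from datetime import datetime, timedelta

# Hypothetical pipeline events for one change; names are illustrative.
events = {
    "commit":        datetime(2024, 5, 1, 9, 0),
    "ci_finished":   datetime(2024, 5, 1, 9, 45),
    "merged":        datetime(2024, 5, 1, 10, 30),
    "in_production": datetime(2024, 5, 1, 16, 0),
}

def loop_latencies(ev: dict) -> dict:
    """Return each feedback-loop latency as a timedelta."""
    return {
        "ci":            ev["ci_finished"] - ev["commit"],
        "merge_to_prod": ev["in_production"] - ev["merged"],
        # DORA "Lead Time for Changes": commit → running in production
        "lead_time":     ev["in_production"] - ev["commit"],
    }

for name, latency in loop_latencies(events).items():
    print(f"{name}: {latency}")
```

Tracking these numbers over time, rather than guessing, is what turns "CI feels slow" into "CI is the 45-minute leg of a 7-hour lead time."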
BGP config update withdraws all Facebook routes — internal and external — from the internet
CRITICAL: Facebook.com, Instagram, WhatsApp offline for 3B users. Monitoring, Workplace, and badge-access systems simultaneously unreachable
CRITICAL: Engineers identify the issue remotely but cannot access systems. Physical data-centre access required — door badges also offline
WARNING: Services restore after 6 hours as engineers manually apply BGP fix on-site
The question this raises
If your monitoring, alerting, and incident-response tools all live inside the systems they are supposed to monitor, what happens to your ability to see and fix problems when those systems fail?
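One way out of that trap is a probe that runs entirely outside the monitored system and depends on nothing inside it. A minimal sketch in Python, assuming only that the target exposes a public HTTP endpoint; the URL and alert path are hypothetical:

```python
# Independent external probe: runs OUTSIDE the monitored system's
# network, so it keeps working when that network goes down.
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx/3xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, 4xx/5xx: all
        # count as "unreachable from this external vantage point".
        return False

def check_and_alert(url: str) -> None:
    if not probe(url):
        # The alerting path must be independent too — e.g. a
        # third-party pager service reached over the public internet.
        print(f"ALERT: {url} unreachable from external vantage point")
```

The probe itself is trivial; the design constraint is where it runs and what its alert path depends on.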
Lesson outline
A feedback loop is any cycle where the output of a system is used as an input to adjust the system's next action. Engineering is entirely made of them: your compiler tells you if the code is valid. Your tests tell you if the logic is correct. Your monitoring tells you if the system is healthy. Your users tell you if the feature is useful.
The engineering principle: the speed of your feedback loop determines the speed of your learning. A feedback loop that takes 45 minutes to tell you your code is broken means 45 minutes of compounding wrong assumptions. A feedback loop that takes 78 days to tell you a CVE is in your dependencies (Equifax) means 78 days of exposure.
The missing leg: monitoring your monitoring
Facebook's October 2021 BGP outage lasted 6 hours not because the bug was hard to fix, but because the engineers' monitoring, communication tools, and physical door-badge access all ran on the same internal network that went down. The feedback loop that would have enabled the fix was part of what broke. Your system's health is only as visible as the architectural independence of your monitoring from the system being monitored.
Every software delivery pipeline contains four distinct feedback loops, each with a different latency and cost of a missed signal. Engineers who understand all four can compress them independently.
The DORA research program identified four metrics that predict software delivery performance. Every one of them is a feedback loop latency measurement, not an arbitrary KPI.
| DORA Metric | What feedback loop it measures | Elite benchmark |
|---|---|---|
| Deployment Frequency | How often you close the real-user feedback loop — how often users get your changes | Multiple times per day |
| Lead Time for Changes | Total pipeline latency: code committed → running in production | < 1 hour |
| Change Failure Rate | Signal quality of your upstream loops — if high, earlier loops aren't catching enough | < 5% |
| MTTR | Speed of your incident response feedback loop: detection → resolution | < 1 hour |
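All four metrics can be derived directly from deployment records. A hedged sketch, assuming a hypothetical record shape (this is not a standard schema — substitute whatever your deploy tooling actually emits):

```python
from datetime import datetime, timedelta
from statistics import median

# Illustrative deployment records over a 2-day window.
deploys = [
    {"committed": datetime(2024, 5, 1, 9),  "deployed": datetime(2024, 5, 1, 10),
     "failed": False},
    {"committed": datetime(2024, 5, 1, 11), "deployed": datetime(2024, 5, 1, 12),
     "failed": True, "restored": datetime(2024, 5, 1, 12, 30)},
    {"committed": datetime(2024, 5, 2, 9),  "deployed": datetime(2024, 5, 2, 9, 40),
     "failed": False},
]

def dora(records: list, window_days: int = 2) -> dict:
    """Compute the four DORA metrics — each one a loop latency or signal-quality measure."""
    failures = [r for r in records if r["failed"]]
    return {
        "deploys_per_day":     len(records) / window_days,
        "median_lead_time":    median(r["deployed"] - r["committed"] for r in records),
        "change_failure_rate": len(failures) / len(records),
        "mttr":                median(r["restored"] - r["deployed"] for r in failures),
    }
```

Note that every field in the output is either a latency (`timedelta`) or a ratio measuring signal quality, which is the point of the table above.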
Elite performers deploy at 10× the frequency of low performers
This is not because they are 10× better programmers. It is because they have 10× more opportunities to learn from real-user feedback. Each deployment is a completed feedback loop. Compressing pipeline latency is the mechanism behind DORA improvement — not deploying more for its own sake.
The feedback loop death spiral
Slow or flaky feedback → engineers ignore it → feedback quality degrades further → trust collapses → the loop is functionally broken even though it is technically running. The most common form: CI that takes 45 minutes and fails on flaky tests 20% of the time. Engineers stop reading failures carefully and start merging without green CI. The signal exists but carries no information.
4 symptoms your feedback loops are broken
What feedback loops are
📖 What the exam expects
Automated tests and CI/CD pipelines provide feedback on whether code is working correctly before it reaches production.
Feedback loops appear in disguise across system design and reliability interviews. "How would you improve your CI/CD pipeline?" = "How would you compress the CI feedback loop?" "How do you approach monitoring?" = "How do you design production feedback loops?" Naming the pattern explicitly shows architectural thinking. DORA metric questions are almost always feedback loop questions.
Strong answer: Unprompted mentions of specific DORA metric targets. Distinguishes feedback loop existence from quality and trust. Brings up monitoring independence or references the Facebook BGP incident as a feedback loop design failure. Mentions the death spiral: slow/flaky feedback → ignored → functionally broken.
Red flags: "We have unit tests" without mentioning latency, flakiness, or trust. "Our users tell us about bugs" — the production monitoring loop is absent. "We deploy monthly" — the user feedback loop runs at 1/30th the speed of a daily-deploy team.
Quick check · Feedback loops
Key takeaways
Your CI pipeline takes 52 minutes. How would you identify which stages are slowest, and what patterns (parallelism, test ordering, caching) would you apply to bring it under 10 minutes?
Your team discovers a production bug through a user complaint on social media. Walk through every feedback loop in your delivery pipeline that failed to catch this bug. Which would you fix first?
Your monitoring system is deployed inside the same AWS account and VPC as the application it monitors. What failure modes does this create, and how would you re-architect for independence?
From the books
Atomic Habits
Chapter 15: The Cardinal Rule of Behavior Change
James Clear's habit loop — cue → craving → response → reward — is a feedback loop. The reward IS the feedback signal that tells the brain whether to repeat the behavior. Clear's key insight: "The greater the distance between the behavior and the reward, the harder it is to build the habit." This is identical to the engineering insight: the greater the distance between writing code and getting feedback on it, the harder it is to learn and improve. Compressing your engineering feedback loops — from 45-minute CI to 10-minute CI, from weekly deploys to daily deploys — is compressing the habit loop. You learn faster when the reward is tightly coupled to the behavior.
💡 Analogy
The thermostat vs the "deploy and pray" pipeline
⚡ Core Idea
A thermostat has a 30-second feedback loop: measure temperature → compare to target → adjust heat → repeat. Most deployment pipelines have a 45-minute feedback loop, and no feedback loop at all for "is this feature actually useful to users?" The thermostat is smarter about learning from its environment than most engineering processes.
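The thermostat's loop fits in a few lines of code. An illustrative sketch — the temperatures, band width, and step sizes are made up, but the measure → compare → adjust cycle is the real structure:

```python
def thermostat_step(current_temp: float, target: float, heat_on: bool) -> bool:
    """One iteration of the control loop: measure, compare, adjust."""
    if current_temp < target - 0.5:   # too cold → turn heat on
        return True
    if current_temp > target + 0.5:   # too warm → turn heat off
        return False
    return heat_on                     # within the band → hold state

def simulate(steps: int = 10, temp: float = 17.0, target: float = 20.0) -> float:
    """Run the loop repeatedly; the environment responds each cycle."""
    heat_on = False
    for _ in range(steps):
        heat_on = thermostat_step(temp, target, heat_on)
        temp += 0.6 if heat_on else -0.2   # heating vs. ambient drift
    return temp
```

Ten fast iterations and the temperature converges on the target. A "deploy and pray" pipeline runs the equivalent of one iteration per release, with no measurement step at all.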
🎯 Why It Matters
Compressing feedback loops is the single highest-leverage activity in software engineering. A team that deploys multiple times per day and gets feedback in minutes is not more talented — they have more opportunities to learn and correct. Feedback loop speed IS learning speed. Every DORA metric is a feedback loop latency, and improving DORA means compressing loops — not gaming metrics.