
Interactive Explainer

Distributed System Patterns

The eight fallacies of distributed computing, CAP theorem in practice, replication, partitioning, and circuit breaker patterns.

🎯 Key Takeaways
The eight fallacies: the network is NOT reliable, NOT zero-latency, NOT infinite-bandwidth, and NOT secure. Design for failure from the start.
CAP theorem: during a network partition, choose CP (consistent, may return error) or AP (available, may return stale). CA is impossible in distributed systems.
Three replication strategies: single-leader (simple, good for reads), multi-leader (complex, good for multi-region writes), leaderless (high availability, tunable consistency).
Circuit breakers prevent cascade failures by fast-failing calls to struggling services and allowing time to recover.
Always set timeouts on every network call. No timeout = threads blocked forever = cascade failure.


~6 min read


The eight lies everyone believes about distributed systems

In 1994, Peter Deutsch at Sun Microsystems wrote down the fallacies of distributed computing — incorrect assumptions developers make when moving from single-machine to distributed systems. (Deutsch's original list had seven; James Gosling added the eighth a few years later.) Thirty years on, they still cause production outages every day.

The eight fallacies — and what actually happens

  • 1. The network is reliable — Reality: packets get dropped, connections time out, routers fail. Every network call can fail. Design for it with retries, timeouts, and circuit breakers.
  • 2. Latency is zero — Reality: cross-AZ ≈ 1ms, cross-region = 30–150ms. An N+1 query pattern that costs 1ms per call locally can cost 150ms per call cross-region. Batch calls and cache aggressively.
  • 3. Bandwidth is infinite — Reality: data transfer costs money and has throughput limits. Sending 1MB per API response at 10k RPS = 10GB/s bandwidth. Paginate, compress, and cache.
  • 4. The network is secure — Reality: traffic can be intercepted, DNS can be spoofed. Encrypt everything in transit with TLS. Authenticate every service-to-service call.
  • 5. Topology does not change — Reality: servers are added/removed constantly (autoscaling), IPs change, services move. Use service discovery (DNS, Consul, Kubernetes services), not hardcoded IPs.
  • 6. There is one administrator — Reality: distributed systems span teams, vendors, and services. Any component can be changed independently. Plan for interface contracts and backward compatibility.
  • 7. Transport cost is zero — Reality: serialization/deserialization, network round-trips, and protocol overhead all add latency. gRPC + Protobuf is 5–10× faster than REST + JSON for internal services.
  • 8. The network is homogeneous — Reality: different services use different protocols, encodings, and retry policies. Use an API gateway or service mesh to normalize communication patterns.
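Fallacy #1's advice — retries plus explicit timeouts — can be sketched in a few lines of Python. This is a toy wrapper around a simulated flaky call (the function names are invented for illustration); in real code the timeout would be passed to your HTTP or RPC client:

```python
import time

def call_with_retries(fn, attempts=3, timeout_s=2.0, base_delay_s=0.1):
    """Retry a network call a bounded number of times, always passing an
    explicit timeout down to the transport and backing off between tries."""
    last_err = None
    for attempt in range(attempts):
        try:
            return fn(timeout=timeout_s)
        except Exception as err:            # treat any error as transient here
            last_err = err
            time.sleep(base_delay_s * (2 ** attempt))
    raise last_err                          # exhausted retries: surface the error

# Simulated flaky dependency: drops the first two calls, then succeeds.
state = {"calls": 0}
def flaky(timeout):
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("dropped packet")
    return "ok"

result = call_with_retries(flaky)   # succeeds on the third attempt
```

Note the retries are bounded and the delay grows between attempts — unbounded or zero-delay retries are exactly what turns a blip into a retry storm.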

CAP theorem: the fundamental trade-off

In any distributed system with a network partition (two parts of the system cannot communicate), you must choose between Consistency and Availability. You cannot have both.

  • CP (Consistent + Partition-tolerant) — On a partition: returns an error or times out rather than serve stale data. Examples: HBase, ZooKeeper, etcd, bank account balances. Choose for financial transactions, leader election, inventory counts — anywhere stale = wrong.
  • AP (Available + Partition-tolerant) — On a partition: returns possibly stale data rather than an error. Examples: DynamoDB, Cassandra, CouchDB, DNS, social feeds. Choose when user experience matters more than perfect consistency — shopping carts, social feeds, product catalogs.
  • CA (Consistent + Available) — Only possible when partitions cannot occur, i.e. single-node systems such as PostgreSQL on one server (not distributed). Impossible in a distributed system, because in practice partitions always happen.

The practical implication: "eventual consistency"

Most distributed databases choose AP and offer eventual consistency — all replicas will converge to the same value eventually, but reads may return stale data briefly. This is acceptable for most use cases. The window of inconsistency is typically milliseconds to seconds, not hours.

Quick check

Your e-commerce site shows product inventory counts. During a network partition, which is worse: showing a customer "5 left" when there are actually 0, or showing an error page?

Replication patterns

The three replication strategies and when to use each

  • Single-leader (master-replica) — One node accepts all writes and replicates to read replicas. Simple, well-understood. Read scale: excellent (route reads to replicas). Write scale: limited by leader throughput. Used by: PostgreSQL streaming replication, MySQL binary log replication, MongoDB replica sets. Good for: read-heavy workloads, strong consistency for writes.
  • Multi-leader (active-active) — Multiple nodes accept writes; conflict resolution required. Complex. Used for: multi-region active-active deployments (each region writes locally and syncs globally — e.g. DynamoDB global tables). Good for: low write latency globally. Hard problem: what happens when two users update the same row in different regions simultaneously? Requires conflict resolution (last-write-wins, CRDTs, or application-level merge).
  • Leaderless (Dynamo-style) — Any node accepts writes. Quorum-based reads and writes (write to W of N nodes, read from R of N nodes, where W + R > N guarantees overlap). Used by: Cassandra, Amazon DynamoDB, Riak. Good for: high availability, no single point of failure, geographic distribution. Tunable consistency: lower W and R = higher availability but more staleness.
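The quorum-overlap rule for leaderless replication can be demonstrated with a toy in-memory model (not any real client library): with N = 3 replicas and W = R = 2, every read set must intersect the most recent write set, so a versioned read always sees the newest write.

```python
import random

N, W, R = 3, 2, 2          # W + R > N guarantees read and write sets overlap
assert W + R > N

# Each replica stores (version, value); writes land on any W of the N nodes,
# reads query any R nodes and keep the highest version seen.
replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version):
    for node in random.sample(replicas, W):
        node["version"], node["value"] = version, value

def read():
    sampled = random.sample(replicas, R)
    return max(sampled, key=lambda n: n["version"])["value"]

write("v1", version=1)
write("v2", version=2)
value = read()   # always "v2": any R-set of 2 intersects the last W-set of 2
```

Lower W and R (e.g. W = R = 1 with N = 3) breaks the overlap guarantee — that is the "tunable consistency" knob: more availability, more staleness.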

Circuit breaker and retry patterns

When a downstream service is slow or failing, naive retry logic makes it worse. If 1,000 clients are each retrying every second, the struggling service receives 1,000× normal load — a retry storm that prevents recovery.

Patterns for resilient distributed communication

  • Circuit breaker — Three states: Closed (requests pass through, track error rate), Open (fast-fail all requests without calling downstream — gives it time to recover), Half-Open (let one probe request through; if it succeeds, close the circuit). Libraries: Resilience4j (Java), Polly (.NET), opossum (Node.js), Hystrix (Java, from Netflix — now in maintenance mode).
  • Exponential backoff with jitter — First retry after 1s, then 2s, 4s, 8s... with random jitter (±50%) to prevent thundering herds. Never retry with fixed intervals — 1,000 clients all retrying at t=1s creates a synchronized spike.
  • Bulkhead isolation — Separate thread pools (or connection pools) per downstream service. If Service A becomes slow and exhausts its connection pool, it does not affect traffic to Service B. Named after watertight compartments in ships — one flooded compartment does not sink the ship.
  • Timeout (always set one) — Every network call must have an explicit timeout. "No timeout" means a slow downstream can hold your thread/connection forever. Rule of thumb: timeout ≈ downstream SLA plus a small buffer. If the payment service's SLA is 2s, set your timeout to about 2.5s.
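The three-state machine described above can be sketched in Python. This is a minimal single-threaded illustration with invented class and parameter names; production libraries such as Resilience4j add sliding error-rate windows, metrics, and thread safety:

```python
import time

class CircuitBreaker:
    """Closed -> Open after `failure_threshold` consecutive failures;
    Open fast-fails until `recovery_timeout` elapses, then Half-Open
    lets one probe through to decide whether to close again."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"        # allow a single probe request
            else:
                raise RuntimeError("circuit open: fast-fail")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"                   # probe (or normal call) succeeded
        return result
```

While the circuit is open, callers get an immediate error instead of tying up a thread waiting on the struggling downstream — that fast failure is the whole point.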

The cascade failure pattern: how one slow service kills everything

Service A calls Service B (no timeout). Service B is slow. Service A threads block waiting. Service A runs out of threads. Service A starts failing. Service C calls Service A and starts failing. In 60 seconds, one slow service has cascaded to a full outage. Circuit breakers + timeouts break this chain.
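The backoff-with-jitter schedule from the patterns above can be computed in a few lines (function name and defaults are illustrative): delays double per attempt up to a cap, and each is perturbed by ±50% so retries from many clients do not line up into a synchronized spike.

```python
import random

def backoff_schedule(attempts, base=1.0, cap=30.0, jitter=0.5):
    """Exponential backoff with +/-50% jitter: nominally 1s, 2s, 4s, 8s...
    capped at `cap`, each delay randomly spread across [50%, 150%]."""
    delays = []
    for attempt in range(attempts):
        nominal = min(base * (2 ** attempt), cap)
        delays.append(nominal * random.uniform(1 - jitter, 1 + jitter))
    return delays

delays = backoff_schedule(4)   # roughly [1, 2, 4, 8] seconds, jittered
```

Compare this with 1,000 clients all sleeping exactly 1s: the jittered version smears the retry wave over time, giving the recovering service breathing room.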

How this might come up in interviews

Senior/staff engineering interviews and distributed systems design questions. The CAP theorem question is almost universal in backend system design interviews.

Common questions:

  • What is the CAP theorem?
  • What is a circuit breaker pattern and why does it matter?
  • What is the difference between eventual consistency and strong consistency?
  • How would you design a system to be resilient to network partitions?
  • What are the eight fallacies of distributed computing?

Before you move on: can you answer these?

What are the three states of a circuit breaker?

Closed (requests pass through, errors tracked), Open (all requests fast-fail to allow downstream recovery), Half-Open (one probe request allowed through; if successful, transitions back to Closed).

Why does retry with a fixed interval cause a "thundering herd"?

All clients fail at the same time and all retry at t+1s simultaneously, creating a synchronized spike that can overwhelm the recovering service. Exponential backoff with random jitter staggers retries to prevent this.
