How do 5 nodes agree on anything when networks are unreliable?
Lesson outline
In 2012, a network partition at GitHub caused two MySQL nodes to both believe they were the primary. Both accepted writes. When the partition healed, their data had diverged, and recovering from the split-brain took six hours.
The Consensus Problem
Getting multiple nodes to agree on a single value, even when some nodes are slow, some fail, and the network drops messages. Solving it enables leader election, distributed locks, cluster membership, and replicated state machines (the foundation of all distributed databases).
| Algorithm | Used By | Key Property | Limitation |
|---|---|---|---|
| Paxos | Google Chubby, Cassandra Lightweight Transactions | Proven correct, minimal assumptions | Notoriously hard to implement correctly |
| Raft | etcd (Kubernetes), CockroachDB, TiKV, Consul | Designed for understandability | Same safety as Paxos, more opinionated |
| ZAB | Apache Zookeeper, Kafka controller | Strong ordering guarantees | Zookeeper-specific, not general consensus |
CAP theorem: A distributed system can guarantee at most two of three: Consistency, Availability, Partition Tolerance. Network partitions WILL happen — they're not optional. The real choice is CP vs AP.
| System | Choice | During Partition | Use Case |
|---|---|---|---|
| PostgreSQL + sync replication | CP | Rejects writes if replica unreachable | Banking — correctness > availability |
| DynamoDB / Cassandra | AP | Accepts writes, resolves conflicts later | Shopping carts — availability > strict consistency |
| etcd / ZooKeeper | CP | Rejects reads/writes without quorum | Configuration — must be correct |
PACELC: Beyond CAP
During a Partition, choose Availability or Consistency; Else, choose Latency or Consistency. DynamoDB picks low Latency plus Eventual Consistency. Spanner picks Consistency everywhere, via TrueTime hardware clocks. Most systems are tunable along this spectrum.
Raft Leader Election
1. All nodes start as Followers. If no heartbeat arrives from the Leader within the election timeout (150–300 ms), a Follower becomes a Candidate.
2. The Candidate increments its term number, votes for itself, and requests votes from all other nodes.
3. A node grants its vote if: (1) the candidate's term >= its own term, (2) it hasn't voted this term, (3) the candidate's log is at least as up-to-date.
4. A Candidate with a majority of votes becomes Leader and immediately sends heartbeats.
5. If the Leader fails, Followers time out and a new election begins. This takes 150–500 ms in production.
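The vote-granting rule in step 3 can be sketched in code. This is a minimal illustration, not the actual Raft RPC schema; the interface and field names are assumptions:

```typescript
// Sketch of Raft's vote-granting rule (illustrative; names are assumptions)
interface VoteRequest {
  term: number;
  candidateId: string;
  lastLogIndex: number; // 1-based index of candidate's last log entry
  lastLogTerm: number;
}

interface NodeState {
  currentTerm: number;
  votedFor: string | null;
  log: { term: number }[];
}

function grantVote(node: NodeState, req: VoteRequest): boolean {
  // (1) Reject candidates from stale terms
  if (req.term < node.currentTerm) return false;
  // A higher term resets our vote for the new term
  if (req.term > node.currentTerm) {
    node.currentTerm = req.term;
    node.votedFor = null;
  }
  // (2) At most one vote per term
  if (node.votedFor !== null && node.votedFor !== req.candidateId) return false;
  // (3) Candidate's log must be at least as up-to-date:
  // compare last log term first, then last log index
  const lastIndex = node.log.length;
  const lastTerm = lastIndex > 0 ? node.log[lastIndex - 1].term : 0;
  const upToDate =
    req.lastLogTerm > lastTerm ||
    (req.lastLogTerm === lastTerm && req.lastLogIndex >= lastIndex);
  if (!upToDate) return false;
  node.votedFor = req.candidateId;
  return true;
}
```

The "at least as up-to-date" check is what prevents a node with a stale log from winning an election and overwriting committed entries.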
Raft Log Replication
1. All writes go to the Leader.
2. The Leader appends the entry to its local log (uncommitted) and sends AppendEntries RPCs to all Followers in parallel.
3. Followers append the entry to their logs and send an ACK.
4. When a majority (quorum: N/2 + 1) ACK, the Leader commits the entry and applies it to its state machine.
5. The Leader informs Followers of the commit in the next heartbeat.
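The leader's side of these steps can be sketched as follows. This is a simplified illustration with assumed names; real Raft also tracks per-follower nextIndex/matchIndex and retries divergent followers:

```typescript
// Minimal sketch of Raft log replication from the leader's side (illustrative)
type Entry = { term: number; command: string };

class LeaderLog {
  log: Entry[] = [];
  commitIndex = 0; // highest log index known to be committed

  constructor(private clusterSize: number) {}

  // Step 2: append locally (uncommitted), then send AppendEntries to followers
  append(entry: Entry): number {
    this.log.push(entry);
    return this.log.length; // 1-based index of the new entry
  }

  // Step 4: called as follower ACKs arrive; ackCount includes the leader itself
  onAcks(index: number, ackCount: number): boolean {
    const quorum = Math.floor(this.clusterSize / 2) + 1;
    if (ackCount >= quorum && index > this.commitIndex) {
      this.commitIndex = index; // now safe to apply to the state machine
      return true;
    }
    return false;
  }
}
```

Note that the leader's own append counts toward the quorum, so in a 5-node cluster only 2 follower ACKs are needed to commit.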
The Quorum Rule
A 5-node cluster tolerates 2 failures (quorum = 3). A 3-node cluster tolerates 1 failure (quorum = 2). ALWAYS run consensus clusters with odd numbers. A 4-node cluster tolerates only 1 failure — same as 3-node but twice the cost. 5 nodes is the production sweet spot.
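The quorum arithmetic above is small enough to state directly:

```typescript
// Quorum size and failure tolerance for an N-node consensus cluster (sketch)
function quorum(n: number): number {
  return Math.floor(n / 2) + 1;
}

function failuresTolerated(n: number): number {
  return n - quorum(n); // nodes that can fail while a quorum remains
}
```

Evaluating it shows why even node counts waste money: failuresTolerated(4) and failuresTolerated(3) are both 1, but the 4-node cluster costs a third more.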
```typescript
// Distributed lock using etcd (Raft-backed).
// A lock granted by etcd is guaranteed held by at most one client cluster-wide.

import { Etcd3 } from 'etcd3';

const etcd = new Etcd3({ hosts: process.env.ETCD_HOSTS?.split(',') });

async function withDistributedLock<T>(
  lockName: string,
  ttlSeconds: number,
  operation: () => Promise<T>
): Promise<T> {
  // TTL-based lease: etcd deletes the lock key if the client crashes or disconnects
  const lock = etcd.lock(lockName).ttl(ttlSeconds);

  try {
    // lock.acquire() blocks, queuing behind other lock holders automatically
    await lock.acquire();
    console.log(`Acquired lock: ${lockName}`);

    return await operation();
  } finally {
    // Always release in finally, even if the operation throws
    await lock.release();
  }
}

// Usage: ensure only one instance processes a batch at a time
await withDistributedLock(
  'batch-processor:nightly-report',
  30, // 30s TTL: auto-released if the process crashes (lease mechanism)
  async () => {
    await runNightlyReport();
  }
);

// Fencing tokens: handle the "slow process" problem.
// Problem: a process holds the lock, a GC pause causes the lock to expire,
// another process acquires the lock, and BOTH now think they hold it.
//
// Solution: etcd returns a monotonically increasing revision on each acquire.
// Include this "fencing token" in writes to the shared resource.
// The resource rejects writes with older tokens, so the zombie process is fenced off.
```
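The resource-side half of the fencing-token scheme can be sketched as well. The class and method names here are assumptions for illustration, not part of the etcd API:

```typescript
// Sketch of fencing-token enforcement on the shared resource (illustrative)
class FencedResource {
  private highestToken = 0;

  // Each write carries the fencing token (e.g. etcd's revision) from lock acquire.
  // Writes with a token older than one already seen are rejected.
  write(token: number, data: string): boolean {
    if (token < this.highestToken) {
      return false; // zombie holder with an expired lock is fenced off
    }
    this.highestToken = token;
    // ...apply the write to storage here...
    return true;
  }
}
```

Because tokens increase monotonically with each lock acquisition, a process that stalls past its lease can never overwrite data written by the lock's next holder.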
Consistency Models from Strongest to Weakest
Match Consistency to Use Case
Financial balances → Linearizability. Shopping cart → Eventual consistency. User profile after update → Read-your-writes. Feature flags → Linearizability (all servers must see same config).
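Read-your-writes is often approximated with sticky routing: for a short window after a user's write, serve that user's reads from the leader instead of a possibly-stale replica. A minimal sketch, with assumed names and a hypothetical 5-second window:

```typescript
// Sketch: read-your-writes via sticky routing after a write (illustrative)
class SessionRouter {
  private lastWriteAt = new Map<string, number>();

  constructor(private stickyMs = 5000) {} // assumed replication-lag bound

  recordWrite(userId: string, nowMs: number): void {
    this.lastWriteAt.set(userId, nowMs);
  }

  // Recently-written users read from the leader; everyone else may hit replicas
  readTarget(userId: string, nowMs: number): 'leader' | 'replica' {
    const t = this.lastWriteAt.get(userId);
    return t !== undefined && nowMs - t < this.stickyMs ? 'leader' : 'replica';
  }
}
```

This trades a little leader load for the guarantee that users always see their own updates, without paying for linearizable reads everywhere.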
Consensus and CAP questions are standard at senior/staff levels. Show you can reason about tradeoffs, not just recite definitions.
From the books
Designing Data-Intensive Applications — Martin Kleppmann (2017)
Chapter 9: Consistency and Consensus
The best textbook treatment of distributed consensus. Covers linearizability, CAP, Paxos, Raft, and why distributed transactions are hard — all in one chapter.