How do 5 nodes agree on anything when networks are unreliable?
Lesson outline
In 2012, a network partition at GitHub caused two MySQL nodes to both believe they were the primary. Both accepted writes. When the partition healed, their data had diverged, and recovering from the split-brain took six hours.
The Consensus Problem
Getting multiple nodes to agree on a single value, even when some nodes are slow, some fail, and the network drops messages. Solving it enables leader election, distributed locks, cluster membership, and replicated state machines (the foundation of all distributed databases).
| Algorithm | Used By | Key Property | Limitation |
|---|---|---|---|
| Paxos | Google Chubby, Cassandra Lightweight Transactions | Proven correct, minimal assumptions | Notoriously hard to implement correctly |
| Raft | etcd (Kubernetes), CockroachDB, TiKV, Consul | Designed for understandability | Same safety as Paxos, more opinionated |
| ZAB | Apache Zookeeper, Kafka controller | Strong ordering guarantees | Zookeeper-specific, not general consensus |
CAP theorem: A distributed system can guarantee at most two of three: Consistency, Availability, Partition Tolerance. Network partitions WILL happen — they're not optional. The real choice is CP vs AP.
| System | Choice | During Partition | Use Case |
|---|---|---|---|
| PostgreSQL + sync replication | CP | Rejects writes if replica unreachable | Banking — correctness > availability |
| DynamoDB / Cassandra | AP | Accepts writes, resolves conflicts later | Shopping carts — availability > strict consistency |
| etcd / ZooKeeper | CP | Rejects reads/writes without quorum | Configuration — must be correct |
PACELC: Beyond CAP
During a Partition, choose Availability or Consistency; Else, choose Latency or Consistency. DynamoDB picks low Latency plus Eventual Consistency. Spanner picks Consistency everywhere, via TrueTime hardware clocks. Most systems are tunable along this spectrum.
Raft Leader Election
1. All nodes start as Followers. If no heartbeat arrives from the Leader within the election timeout (150–300 ms), a Follower becomes a Candidate.
2. The Candidate increments its term number, votes for itself, and requests votes from all other nodes.
3. A node grants its vote if: (1) the candidate's term >= its own term, (2) it hasn't voted this term, (3) the candidate's log is at least as up-to-date.
4. A Candidate with a majority of votes becomes Leader and immediately sends heartbeats.
5. If the Leader fails, Followers time out and a new election begins. This takes 150–500 ms in production.
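The vote-granting rule in step 3 can be sketched in code. This is a minimal illustration, not the actual Raft RPC schema; the interface and field names are assumptions:

```typescript
// Sketch of Raft's vote-granting rule (illustrative; names are assumptions)
interface VoteRequest {
  term: number;
  candidateId: string;
  lastLogIndex: number; // 1-based index of candidate's last log entry
  lastLogTerm: number;
}

interface NodeState {
  currentTerm: number;
  votedFor: string | null;
  log: { term: number }[];
}

function grantVote(node: NodeState, req: VoteRequest): boolean {
  // (1) Reject candidates from stale terms
  if (req.term < node.currentTerm) return false;
  // A higher term resets our vote for the new term
  if (req.term > node.currentTerm) {
    node.currentTerm = req.term;
    node.votedFor = null;
  }
  // (2) At most one vote per term
  if (node.votedFor !== null && node.votedFor !== req.candidateId) return false;
  // (3) Candidate's log must be at least as up-to-date:
  // compare last log term first, then last log index
  const lastIndex = node.log.length;
  const lastTerm = lastIndex > 0 ? node.log[lastIndex - 1].term : 0;
  const upToDate =
    req.lastLogTerm > lastTerm ||
    (req.lastLogTerm === lastTerm && req.lastLogIndex >= lastIndex);
  if (!upToDate) return false;
  node.votedFor = req.candidateId;
  return true;
}
```

The "at least as up-to-date" check is what prevents a node with a stale log from winning an election and overwriting committed entries.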
Raft Log Replication
1. All writes go to the Leader.
2. The Leader appends the entry to its local log (uncommitted) and sends AppendEntries RPCs to all Followers in parallel.
3. Followers append the entry to their logs and send an ACK.
4. When a majority (quorum: N/2 + 1) ACK, the Leader commits the entry and applies it to its state machine.
5. The Leader informs Followers of the commit in the next heartbeat.
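The leader's side of these steps can be sketched as follows. This is a simplified illustration with assumed names; real Raft also tracks per-follower nextIndex/matchIndex and retries divergent followers:

```typescript
// Minimal sketch of Raft log replication from the leader's side (illustrative)
type Entry = { term: number; command: string };

class LeaderLog {
  log: Entry[] = [];
  commitIndex = 0; // highest log index known to be committed

  constructor(private clusterSize: number) {}

  // Step 2: append locally (uncommitted), then send AppendEntries to followers
  append(entry: Entry): number {
    this.log.push(entry);
    return this.log.length; // 1-based index of the new entry
  }

  // Step 4: called as follower ACKs arrive; ackCount includes the leader itself
  onAcks(index: number, ackCount: number): boolean {
    const quorum = Math.floor(this.clusterSize / 2) + 1;
    if (ackCount >= quorum && index > this.commitIndex) {
      this.commitIndex = index; // now safe to apply to the state machine
      return true;
    }
    return false;
  }
}
```

Note that the leader's own append counts toward the quorum, so in a 5-node cluster only 2 follower ACKs are needed to commit.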
The Quorum Rule
A 5-node cluster tolerates 2 failures (quorum = 3). A 3-node cluster tolerates 1 failure (quorum = 2). ALWAYS run consensus clusters with odd numbers. A 4-node cluster tolerates only 1 failure — same as 3-node but twice the cost. 5 nodes is the production sweet spot.
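The quorum arithmetic above is small enough to state directly:

```typescript
// Quorum size and failure tolerance for an N-node consensus cluster (sketch)
function quorum(n: number): number {
  return Math.floor(n / 2) + 1;
}

function failuresTolerated(n: number): number {
  return n - quorum(n); // nodes that can fail while a quorum remains
}
```

Evaluating it shows why even node counts waste money: failuresTolerated(4) and failuresTolerated(3) are both 1, but the 4-node cluster costs a third more.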
```typescript
// Distributed lock using etcd (Raft-backed).
// A lock granted by etcd is guaranteed held by at most one client cluster-wide.

import { Etcd3 } from 'etcd3';

const etcd = new Etcd3({ hosts: process.env.ETCD_HOSTS?.split(',') });

async function withDistributedLock<T>(
  lockName: string,
  ttlSeconds: number,
  operation: () => Promise<T>
): Promise<T> {
  // TTL-based lease: etcd deletes the lock key if the client crashes or disconnects
  const lock = etcd.lock(lockName).ttl(ttlSeconds);

  try {
    // lock.acquire() blocks, queuing behind other lock holders automatically
    await lock.acquire();
    console.log(`Acquired lock: ${lockName}`);

    return await operation();
  } finally {
    // Always release in finally, even if the operation throws
    await lock.release();
  }
}

// Usage: ensure only one instance processes a batch at a time
await withDistributedLock(
  'batch-processor:nightly-report',
  30, // 30s TTL: auto-released if the process crashes (lease mechanism)
  async () => {
    await runNightlyReport();
  }
);

// Fencing tokens: handle the "slow process" problem.
// Problem: a process holds the lock, a GC pause causes the lock to expire,
// another process acquires the lock, and BOTH now think they hold it.
//
// Solution: etcd returns a monotonically increasing revision on each acquire.
// Include this "fencing token" in writes to the shared resource.
// The resource rejects writes with older tokens, so the zombie process is fenced off.
```
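The resource-side half of the fencing-token scheme can be sketched as well. The class and method names here are assumptions for illustration, not part of the etcd API:

```typescript
// Sketch of fencing-token enforcement on the shared resource (illustrative)
class FencedResource {
  private highestToken = 0;

  // Each write carries the fencing token (e.g. etcd's revision) from lock acquire.
  // Writes with a token older than one already seen are rejected.
  write(token: number, data: string): boolean {
    if (token < this.highestToken) {
      return false; // zombie holder with an expired lock is fenced off
    }
    this.highestToken = token;
    // ...apply the write to storage here...
    return true;
  }
}
```

Because tokens increase monotonically with each lock acquisition, a process that stalls past its lease can never overwrite data written by the lock's next holder.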
Consistency Models from Strongest to Weakest
Match Consistency to Use Case
Financial balances → Linearizability. Shopping cart → Eventual consistency. User profile after update → Read-your-writes. Feature flags → Linearizability (all servers must see same config).
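Read-your-writes is often approximated with sticky routing: for a short window after a user's write, serve that user's reads from the leader instead of a possibly-stale replica. A minimal sketch, with assumed names and a hypothetical 5-second window:

```typescript
// Sketch: read-your-writes via sticky routing after a write (illustrative)
class SessionRouter {
  private lastWriteAt = new Map<string, number>();

  constructor(private stickyMs = 5000) {} // assumed replication-lag bound

  recordWrite(userId: string, nowMs: number): void {
    this.lastWriteAt.set(userId, nowMs);
  }

  // Recently-written users read from the leader; everyone else may hit replicas
  readTarget(userId: string, nowMs: number): 'leader' | 'replica' {
    const t = this.lastWriteAt.get(userId);
    return t !== undefined && nowMs - t < this.stickyMs ? 'leader' : 'replica';
  }
}
```

This trades a little leader load for the guarantee that users always see their own updates, without paying for linearizable reads everywhere.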
Consensus and CAP questions are standard at senior/staff levels. Show you can reason about tradeoffs, not just recite definitions.
From the books
Designing Data-Intensive Applications — Martin Kleppmann (2017)
Chapter 9: Consistency and Consensus
The best textbook treatment of distributed consensus. Covers linearizability, CAP, Paxos, Raft, and why distributed transactions are hard — all in one chapter.