Consensus Algorithms in Distributed Systems: Paxos, Raft, and Beyond
Consensus algorithms are the formal mechanisms by which nodes in a distributed system agree on a single value or sequence of values despite failures, message delays, and network partitions. Paxos, Raft, Zab, and Byzantine fault-tolerant variants each impose different guarantees, performance envelopes, and operational complexity profiles. This page covers the definitional scope, mechanical structure, classification boundaries, and known tradeoffs of the major consensus algorithm families — serving as a reference for engineers, architects, and researchers working with distributed systems infrastructure in production environments.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps (non-advisory)
- Reference table or matrix
- References
Definition and scope
Consensus in distributed computing is the problem of getting a fixed set of processes to agree on a single value when any subset of those processes may crash or become unreachable. The formal definition traces to the Fischer, Lynch, and Paterson result (FLP impossibility, 1985), published in the Journal of the ACM, which proved that no deterministic protocol can guarantee both agreement and termination in a fully asynchronous system if even one process may crash. That theoretical boundary establishes the design space within which all practical consensus protocols operate — they relax either the asynchrony assumption (by adding timing bounds) or the liveness guarantee (by allowing the protocol to stall under adversarial conditions) rather than violating FLP.
The scope of consensus algorithms extends across distributed data storage, leader election, distributed transactions, and replication strategies. Any system that must maintain a consistent, replicated log — a database write-ahead log, a configuration store, a blockchain ledger — requires a consensus layer. Systems such as Apache ZooKeeper (which uses Zab), etcd (which uses Raft), and Google Chubby (which uses Paxos) are production implementations of this class of algorithm.
The scope excludes eventual consistency mechanisms, which do not require agreement on a total order of operations. Eventual consistency and CRDTs represent a separate design space where conflicts are resolved after the fact rather than prevented by protocol.
Core mechanics or structure
Paxos (Leslie Lamport, 1998, ACM TOCS) operates in two phases across three logical roles: Proposers, Acceptors, and Learners. In Phase 1 (Prepare/Promise), a Proposer sends a Prepare(n) message with a proposal number n to a majority quorum of Acceptors; each Acceptor that has not already promised a higher number replies with a Promise and any previously accepted value. In Phase 2 (Accept/Accepted), the Proposer sends an Accept(n, v) message where v is the highest-numbered previously accepted value returned in any Promise, or the Proposer's own value if none existed. A value is chosen when a majority of Acceptors have accepted it. Multi-Paxos extends single-decree Paxos to a replicated log by electing a stable leader who skips Phase 1 for subsequent slots.
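The acceptor side of these two phases can be sketched in a few lines. This is a minimal single-decree model for illustration only; the `Acceptor` class and the message tuples it returns are assumptions of this sketch, not part of any production implementation.

```python
class Acceptor:
    """Single-decree Paxos acceptor state (illustrative sketch)."""

    def __init__(self):
        self.promised_n = -1    # highest proposal number promised
        self.accepted_n = -1    # proposal number of the accepted value, if any
        self.accepted_v = None  # accepted value, if any

    def on_prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Return any previously accepted (n, v) so the proposer
            # must re-propose the highest-numbered accepted value.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher number was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n)
        return ("reject", self.promised_n)
```

A proposer that collects promises from a majority must adopt the highest-numbered previously accepted value it sees; that rule is what keeps a chosen value stable across competing proposers.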
Raft (Ongaro & Ousterhout, 2014, USENIX ATC) decomposes consensus into three sub-problems: leader election, log replication, and safety. A Raft cluster maintains a single leader at any term. The leader appends entries to its log and replicates them to followers; an entry is committed once stored on a majority of nodes. Leader elections use randomized timeouts (default range: 150–300 ms in the reference implementation) to reduce split-vote probability. Raft's explicit state machine and restricted log matching invariant make correctness verification more tractable than Paxos.
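The "at least as up-to-date" rule that guards vote granting can be written directly from the paper's description; the function name and argument order here are illustrative assumptions.

```python
def candidate_log_up_to_date(cand_last_term, cand_last_index,
                             voter_last_term, voter_last_index):
    """Raft's log completeness check: a voter grants its vote only if the
    candidate's log is at least as up-to-date as its own. Logs are compared
    first by the term of their last entry, then by log length."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index
```

This comparison is what prevents a candidate that is missing committed entries from winning an election.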
Zab (ZooKeeper Atomic Broadcast, Apache Foundation) is specifically designed for primary-backup systems. It guarantees that all updates from a primary are delivered to backups in the order they were issued, and that a new primary replays all unacknowledged updates before accepting new requests. Zab distinguishes between crash-recovery and broadcast phases, making leader failover explicit in the protocol.
Byzantine Fault Tolerant (BFT) consensus — exemplified by PBFT (Castro & Liskov, 1999, OSDI) — requires at least 3f + 1 nodes to tolerate f Byzantine (arbitrarily malicious) failures, compared to 2f + 1 for crash-fault-tolerant (CFT) protocols. This 50% overhead in replica count reflects the cost of handling equivocation and message forgery, as documented in the original PBFT paper. Tendermint and HotStuff are later BFT variants optimized for blockchain environments.
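The replica-count requirements above reduce to a one-line formula; the helper below is a hypothetical illustration of the 2f + 1 and 3f + 1 bounds.

```python
def min_replicas(f, byzantine=False):
    """Minimum cluster size to tolerate f simultaneous faulty nodes:
    2f + 1 under a crash-fault (CFT) model, 3f + 1 under a Byzantine (BFT) model."""
    return 3 * f + 1 if byzantine else 2 * f + 1
```

For f = 1 this yields 3 (CFT) versus 4 (BFT) nodes; for f = 2 it yields 5 versus 7, which is the source of the table entries in the reference matrix below.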
Causal relationships or drivers
The need for formal consensus protocols emerges from three structural properties of distributed systems. First, network partitions can isolate subsets of nodes, creating split-brain scenarios where two independent groups each believe they are the authoritative cluster. Without a quorum requirement — typically a strict majority of ⌊N/2⌋ + 1 nodes — both partitions could accept conflicting writes, destroying linearizability as defined by Herlihy and Wing (1990, ACM TOPLAS).
Second, fault tolerance and resilience requirements drive protocol selection. A cluster of 5 nodes tolerates 2 crash failures under CFT consensus; adding a 6th node provides no additional fault tolerance (still tolerates 2 failures), which is why deployments commonly use odd cluster sizes (3, 5, or 7 nodes). The relationship between cluster size, failure tolerance, and quorum latency directly shapes deployment topology decisions.
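The step-function relationship between cluster size and fault tolerance follows directly from the majority-quorum rule; the helper names below are illustrative.

```python
def quorum_size(n):
    """Strict majority quorum for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def crash_faults_tolerated(n):
    """A CFT cluster makes progress while any quorum survives, so it
    tolerates n - quorum_size(n) = floor((n - 1) / 2) crash failures."""
    return n - quorum_size(n)
```

Evaluating the helper shows why even sizes add nothing: 3-node and 4-node clusters both tolerate one failure, and only the fifth node raises the tolerance to two.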
Third, the CAP theorem formally constrains consensus algorithm behavior during partitions. CP systems (Consistency + Partition Tolerance) — which includes most consensus-based stores — sacrifice availability during partitions by refusing to service requests without a quorum. This is not a bug in the protocol but an intentional consequence of the safety guarantee.
Message complexity is a fourth driver. Paxos requires O(n) messages per consensus round in the single-leader case. Flexible Paxos (Howard et al., 2016) generalizes the quorum rule so that Phase 1 (leader election) and Phase 2 (replication) quorums need only intersect, allowing a smaller replication quorum (and thus lower steady-state write latency) at the cost of a larger, more expensive leader-election quorum. Fully leaderless variants such as EPaxos instead let any replica commit non-conflicting commands on a fast path.
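Flexible Paxos replaces the fixed majority with a single intersection condition on the two quorum sizes. The sketch below encodes the published condition for simple counting quorums; the function name is an assumption.

```python
def fpaxos_quorums_safe(n, q1, q2):
    """Flexible Paxos safety condition: every Phase 1 (leader election)
    quorum must intersect every Phase 2 (replication) quorum, which holds
    for counting quorums whenever q1 + q2 > n."""
    return q1 + q2 > n
```

For n = 5, classic majorities (q1 = q2 = 3) satisfy the condition, but so does q1 = 4 with q2 = 2: elections become rarer but more expensive, while steady-state replication gets cheaper.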
Classification boundaries
Consensus algorithms divide along four primary axes:
Fault model: Crash fault tolerant (CFT) vs. Byzantine fault tolerant (BFT). CFT protocols (Paxos, Raft, Zab) assume nodes fail by stopping. BFT protocols (PBFT, Tendermint, HotStuff) assume nodes may send arbitrary incorrect messages.
Leader structure: Leader-based (Raft, Multi-Paxos, Zab) vs. leaderless (Classic Paxos, EPaxos). Leader-based protocols serialize decisions through one node, reducing coordination messages but creating a single point of throughput constraint. Leader election failures in leader-based systems cause unavailability windows.
Synchrony assumption: Partially synchronous (Raft, PBFT) vs. asynchronous with randomization (Ben-Or, Bracha). Partially synchronous protocols assume that after some unknown global stabilization time (GST), message delays are bounded — a realistic model for LAN-scale clusters. Fully asynchronous BFT protocols sacrifice liveness guarantees except in probabilistic terms.
Log vs. single-value: Single-decree protocols decide one value; multi-decree or replicated log protocols (Multi-Paxos, Raft) maintain an ordered sequence of decisions. The latter maps directly to state machine replication as described in Lamport's 1978 Communications of the ACM paper on logical clocks.
The boundary between consensus and distributed system clocks is meaningful: consensus algorithms use logical terms or epochs to order leader transitions, not wall-clock time, because physical clocks are unreliable across nodes.
Tradeoffs and tensions
Safety vs. liveness under partition: CFT consensus protocols guarantee safety (no two nodes commit conflicting values) unconditionally, but liveness (progress is eventually made) only under partial synchrony. During a sustained partition where no majority is reachable, a correct Raft or Paxos cluster stalls indefinitely rather than risk a split decision. This is the defining tension for any system that must choose between stalling and serving potentially stale data — a choice that intersects consistency models directly.
Read latency vs. consistency: Serving reads from the leader guarantees linearizability but concentrates load. Read-from-follower optimization (used in some Raft implementations) introduces the risk of stale reads unless the follower's log position is verified against the leader's commit index, requiring an additional round-trip. The tradeoff between latency and throughput is quantified differently for read-heavy vs. write-heavy workloads.
Operational complexity: Paxos is notoriously difficult to implement correctly. The original paper describes single-decree Paxos; the jump to Multi-Paxos requires solving leader election, log compaction, and cluster reconfiguration — none of which are specified in the original paper. Raft was explicitly designed to address this gap. Ongaro's 2014 dissertation documents a user study in which Raft was rated significantly easier to understand by graduate students with distributed systems backgrounds, though formal complexity equivalence between Raft and Multi-Paxos has been established in subsequent literature.
Reconfiguration: Adding or removing nodes from a live consensus cluster is a consensus problem itself. Raft uses joint consensus (a two-phase membership change) to safely transition between configurations without risk of having two independent majorities. Incorrectly implemented reconfiguration is one of the top sources of data loss in production consensus deployments, as noted in Ongaro's dissertation.
BFT overhead: The 3f + 1 replica requirement and O(n²) message complexity of PBFT make it impractical for clusters larger than approximately 20 nodes under normal network conditions. This limits BFT consensus to peer-to-peer systems, permissioned blockchains, and high-security coordination services where Byzantine threat models are justified.
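The practical size ceiling can be seen by comparing message growth directly. The helper below models rough orders of growth only; exact per-phase counts vary by protocol and are not captured here.

```python
def approx_messages_per_decision(n, protocol):
    """Illustrative steady-state message scaling per committed decision.
    Leader-based CFT protocols and HotStuff fan out linearly; PBFT's
    all-to-all prepare and commit phases scale quadratically."""
    if protocol in ("raft", "multi-paxos", "hotstuff"):
        return n
    if protocol == "pbft":
        return n * n
    raise ValueError(f"unknown protocol: {protocol}")
```

At n = 20 the quadratic term is already around 400 messages per decision versus 20 for a linear protocol, which is why PBFT deployments rarely exceed a few tens of nodes.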
Common misconceptions
Misconception: Paxos and Raft are equivalent in practice. Correction: While theoretically equivalent in the values they can commit, the two protocols differ significantly in reconfiguration mechanics, log compaction approaches, and leader lease handling. Production systems built on each exhibit different failure mode profiles. ZooKeeper and coordination services built on Zab further illustrate that "consensus" is not a single implementation category.
Misconception: A consensus cluster is always available to reads. Correction: A majority-quorum system with 2f + 1 nodes requires at least f + 1 nodes to respond. A 3-node cluster tolerates exactly 1 failure; losing 2 nodes makes the cluster unavailable for both reads and writes under strong consistency. Systems that advertise high availability without disclosing the quorum requirement are obscuring this constraint.
Misconception: Byzantine fault tolerance is required for any adversarial environment. Correction: BFT is required specifically when nodes themselves may be compromised and send malicious messages. Most cloud-native deployments operate in environments where nodes fail by crashing or becoming unreachable, not by sending forged protocol messages. CFT consensus is appropriate for the overwhelming majority of enterprise distributed systems; BFT carries substantial cost penalties documented in the PBFT paper.
Misconception: Consensus algorithms solve the two-phase commit problem. Correction: 2PC and consensus address different scopes. 2PC coordinates a transaction across heterogeneous participants (e.g., two independent databases) and is not fault-tolerant — a coordinator failure blocks the protocol. Consensus algorithms provide fault-tolerant agreement within a homogeneous replicated system. Distributed transactions may layer 2PC on top of consensus (as in Spanner's architecture) but the two are not interchangeable.
Misconception: Adding more nodes always improves fault tolerance. Correction: Fault tolerance improves only when moving between quorum thresholds. A 4-node Raft cluster tolerates 1 failure (majority = 3), identical to a 3-node cluster. Adding a 5th node enables tolerating 2 failures. This step-function relationship means even-numbered cluster sizes typically provide no fault-tolerance benefit over the next lower odd number.
Checklist or steps (non-advisory)
The following sequence describes the operational phases of a Raft consensus round as specified in Ongaro & Ousterhout (2014):
- Leader election initiation — A follower's election timeout (randomized between 150–300 ms in the reference implementation) expires without receiving a heartbeat; the follower increments its current term and transitions to candidate state.
- Vote solicitation — The candidate sends `RequestVote` RPCs to all other nodes, including its current term and the index and term of its last log entry.
- Vote granting — Each node grants a vote if it has not voted in the current term and the candidate's log is at least as up-to-date as its own (log completeness check).
- Leader transition — Upon receiving votes from a majority (⌊N/2⌋ + 1) of nodes, the candidate transitions to leader and begins sending heartbeat `AppendEntries` RPCs to suppress new elections.
- Log entry submission — The leader appends the client's command to its local log and broadcasts `AppendEntries` RPCs to all followers.
- Replication acknowledgment — Followers append the entry to their logs and respond with success; the leader tracks acknowledgment counts.
- Commit — Once a majority of nodes have stored the entry, the leader advances its `commitIndex` and applies the entry to the state machine.
- Commit notification — The leader's next `AppendEntries` (or heartbeat) carries the updated `commitIndex`, causing followers to apply the committed entry to their own state machines.
- Client response — The leader returns the result of the state machine application to the client.
- Log compaction — Periodically, each node creates a snapshot of its state machine at a known log index and discards log entries preceding that index, as described in §7 of the Raft paper.
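The commit step in the sequence above can be sketched as the leader's rule for advancing its commit index. Data-structure names such as `match_index` are illustrative; the current-term restriction reflects §5.4.2 of the Raft paper, which forbids committing entries from earlier terms by counting replicas alone.

```python
def advance_commit_index(match_index, current_term, log_terms, commit_index):
    """Leader's commit rule (sketch): advance commitIndex to the highest
    index N > commitIndex stored on a majority of nodes, provided the
    entry at N was created in the leader's current term.

    match_index: highest replicated index per node (leader included)
    log_terms:   mapping of log index -> term of the entry at that index
    """
    n = len(match_index)
    for candidate in sorted(set(match_index), reverse=True):
        if candidate <= commit_index:
            break  # nothing higher than the current commit point remains
        replicated = sum(1 for m in match_index if m >= candidate)
        if replicated > n // 2 and log_terms.get(candidate) == current_term:
            return candidate
    return commit_index
```

In a 5-node cluster where indices [5, 5, 3, 2, 1] are replicated and entry 3 belongs to the current term, the rule commits through index 3 but not 5, since index 5 is only on two nodes.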
Reference table or matrix
| Algorithm | Fault Model | Minimum Nodes for f=1 | Message Complexity | Leader Structure | Primary Production Use |
|---|---|---|---|---|---|
| Classic Paxos | Crash (CFT) | 3 | O(n) per round | Leaderless | Foundational; rarely used directly |
| Multi-Paxos | Crash (CFT) | 3 | O(n) with stable leader | Leader-based | Google Chubby, some Spanner layers |
| Raft | Crash (CFT) | 3 | O(n) per round | Leader-based | etcd, CockroachDB, TiKV |
| Zab | Crash (CFT) | 3 | O(n) per round | Primary-backup | Apache ZooKeeper |
| PBFT | Byzantine (BFT) | 4 (3f+1) | O(n²) per round | Leader-based | Permissioned blockchains, research |
| Tendermint | Byzantine (BFT) | 4 (3f+1) | O(n²) per round | Rotating leader | Cosmos blockchain ecosystem |
| HotStuff | Byzantine (BFT) | 4 (3f+1) | O(n) per round (linear) | Leader-based | LibraBFT / Diem, Aptos |
| EPaxos | Crash (CFT) | 3 | O(n) fast path | Leaderless | Research; low-latency geo-distributed |
| Flexible Paxos | Crash (CFT) | Configurable | O(n) | Configurable | Research; asymmetric quorum systems |
Notes on table interpretation: Message complexity figures represent the best-case single-leader steady state. PBFT's O(n²) reflects the all-to-all prepare and commit phases required to tolerate Byzantine equivocation. HotStuff achieves O(n) BFT message complexity through a chained three-phase protocol, as documented in the 2019 ACM PODC paper by Yin et al. Cluster sizes shown are minimums for f = 1; tolerating f = 2 requires 5 nodes (CFT) or 7 nodes (BFT).