The Problem That Paxos Addresses
The Paxos algorithm often appears cryptic and difficult to grasp. While numerous papers and articles exist, many feel abstract and unclear about its core principles and practical applications.
This complexity stems from two factors: most Paxos materials focus on formal proofs of algorithm correctness, making them inherently dense, and there are limited mature engineering implementations based on Paxos. This discussion aims to approach the topic progressively, starting from problems and gradually exploring Paxos concepts.
A Fundamental Concurrency Challenge
Consider a basic concurrency issue illustrated with a key-value storage cluster. Three clients concurrently send write requests (X=1, X=3, and X=5) to a three-node cluster. When retrieving value X afterwards, what should its final state be?
The answer is: X could equal 1, 3, or 5 - all are valid! However, X=4 would be incorrect. Since clients operate concurrently, request arrival order at servers remains uncertain, allowing multiple valid outcomes.
A crucial insight: if the final cluster state shows X=1, then Client1's request (X=1) receives confirmation, while Client2's (X=3) and Client3's (X=5) requests effectively get rejected. Similar reasoning applies when X equals 3 or 5. This behavior represents a fundamental characteristic of the Paxos protocol.
Understanding Sequential Ordering
Expanding the scenario: consider a three-machine KV cluster where nodes communicate by propagating values. Each client sends requests to different machines. Each machine logs received requests (both client and inter-node communications). After processing completes, what sequence should each machine maintain?
The conclusion: regardless of ordering, if all three machines maintain identical log sequences, the result is correct. With three items, total permutations (6 cases) all remain valid. For instance, if all machines store X=1, X=3, X=5 in that order, the final cluster value becomes 5. Other scenarios follow similarly.
Incorrect scenarios occur when: Machine1 maintains sequence 1,3,5 (resulting in X=5), Machine2 has 3,5,1 (resulting in X=1), and Machine3 holds 1,5,3 (resulting in X=3). The machines disagree on X's value.
This example illustrates sequential ordering: while clients operate concurrently without precedence, server clusters must ensure identical log ordering across all machines - achieving distributed consistency.
What Paxos Resolves
In the example, Node1 receives X=1 then replicates to Node2 and Node3; Node2 receives X=3 then replicates to Node1 and Node3; Node3 receives X=5 then replicates to Node1 and Node2.
With concurrent clients and concurrent replication between nodes, how can we guarantee identical log sequences across all nodes? This challenge represents the core problem that Paxos addresses.
Replicated State Machines
Previous discussions addressed log replication issues - each node stores log sequences with guaranteed consistency between nodes. Some might ask: why store logs instead of final data directly?
Consider a variable or object as a state machine. Each write request is an event that triggers a state transition, and the sequence of these events constitutes the log.
Using the simple variable X example: with one node applying three client commands to X in sequence (e.g., X=1, then X+=2, then X*=2), the final state becomes 6.
Extending variable X to MySQL databases, clients send various DML operations that become persistent Binlog entries. These logs then generate database tables (state machines).
An important concept emerges: persist the "event stream" (log stream) rather than the "data itself" (state machine). Several reasons support this approach:
- Logs involve only append operations, while data supports add, delete, update operations. Converting three operations into one simplifies persistence.
- Multi-machine synchronization becomes easier with logs versus complex data structures (relational database joins, trees, graphs). Linear log sequences enable straightforward comparison.
Whether considering persistence or data synchronization, storing state machine input event streams (logs) proves simpler than storing state machines themselves.
Based on this approach, state machines extend to replicated state machines. The principle states: identical initial states + identical input events = identical final states. Therefore, ensuring identical log streams across multiple nodes guarantees consistent states. Even if a node fails, restarting and replaying the log stream restores previous states.
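As an aside, here is a minimal Python sketch of this replicated-state-machine idea (the command set and class names are illustrative, not part of the original example): identical initial state plus an identical event log yields an identical final state, and a crashed node recovers by replaying the log.

    # A minimal replicated-state-machine sketch: state is a single variable X,
    # events are functions applied in log order. Names are illustrative.
    class StateMachine:
        def __init__(self):
            self.x = 0          # identical initial state on every node
            self.log = []       # persisted event stream (append-only)

        def append_and_apply(self, event):
            self.log.append(event)   # persist the event (append-only)
            self.x = event(self.x)   # apply it to the state machine

        def recover(self):
            # After a crash, replay the persisted log from the initial state.
            self.x = 0
            for event in self.log:
                self.x = event(self.x)

    node_a, node_b = StateMachine(), StateMachine()
    events = [lambda x: 1, lambda x: x + 2, lambda x: x * 2]   # same log stream
    for e in events:
        node_a.append_and_apply(e)
        node_b.append_and_apply(e)
    assert node_a.x == node_b.x == 6    # identical logs -> identical final state
    node_a.recover()
    assert node_a.x == 6                # replay restores the previous state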
An Intuitive Conceptual Framework
Paxos initially emerged through Basic Paxos formal proof, followed by Multi Paxos, then application scenarios. Since applications weren't initially presented, direct examination of Basic Paxos proofs proved challenging. This discussion reverses the approach, starting with applications before deriving Paxos and Multi Paxos.
When three clients concurrently send requests, six potential results prove valid. An algorithm must ensure that despite concurrent client requests, all nodes record identical log ordering.
Here's an intuitive yet profound concept: global understanding of number sequences 1,2,3,4,5... remains consistent! All people and machines share this understanding.
When referencing 2, everyone knows it follows 1 and precedes 3! Two represents a position definitely between (1,3).
Applying this idea to computer systems for multi-node log replication: when Node1 receives X=1 and stores it in position 1, first verify if other machines already occupy position 1 with X=3 or X=5. If occupied, check position 2, and so forth. If unoccupied, store X=1 in position 1 and inform other nodes to store X=1 in their respective position 1. Similarly, Node2 and Node3 execute accordingly.
The key concept: while each node receives requests in different orders, they maintain consistent understanding of log positions 1, 2, 3, etc., collectively ensuring identical data storage at each position!
Each node queries other nodes before storing logs, then decides placement. This creates two phases: inquiry followed by decision, representing the prototype of Paxos two-phase commit.
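To make the inquire-then-decide intuition concrete (this is only the naive prototype, not Paxos itself), here is a rough Python sketch with made-up helpers: each write first asks all nodes for the first free position, then tells every node to store the value there.

    # Naive "inquire, then decide" sketch (NOT Paxos): find the first log
    # position no node has filled, then tell every node to store the value
    # there. Everything here is illustrative.
    class Node:
        def __init__(self):
            self.log = {}                  # position -> value

    def first_free_position(nodes):
        pos = 1
        while any(pos in node.log for node in nodes):
            pos += 1                       # occupied somewhere, try the next slot
        return pos

    def naive_store(nodes, value):
        pos = first_free_position(nodes)   # phase 1: inquiry
        for node in nodes:                 # phase 2: decision
            node.log[pos] = value

    nodes = [Node(), Node(), Node()]
    naive_store(nodes, "X=1")              # lands in position 1
    naive_store(nodes, "X=3")              # lands in position 2
    # Flaw: if two nodes run naive_store concurrently, both can see the same
    # position as free and write different values there - exactly the conflict
    # that Basic Paxos is designed to resolve.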
Decomposing the problem further to the replication of a single log entry rather than three: even determining the first log entry across the three nodes reveals challenges.
Node1 discovers position 1 is unoccupied and plans to propagate X=1 to Node2 and Node3. Simultaneously, Node2 finds position 1 unoccupied and plans to propagate X=3 to Node1 and Node3. Similarly, Node3 plans to propagate X=5 to Node1 and Node2.
This creates conflicts - even determining the first position proves problematic!
Basic Paxos specifically addresses this challenge.
First, position 1 either gets claimed by Node1 (everyone stores X=1), Node2 (everyone stores X=3), or Node3 (everyone stores X=5), following majority rule. To achieve this, Basic Paxos proposes a method with two points:
- When Node1 tries to fill position 1 and discovers the value there is already determined by a majority (e.g., X=5, where Node3 claimed position 1 and Node2 followed), Node1 accepts this reality: position 1 is no longer available, so Node1 sets its own position 1 to X=5. It then checks whether position 2 can hold X=1. If position 2 is also occupied, it adopts that value for position 2 and checks position 3, and so on.
- When Node1 discovers position 1 is unoccupied, it locks it, preventing other nodes from claiming it unless they have higher priority. If Node1 finds position 1 empty but discovers during submission that it has been occupied, it fails and retries with subsequent positions.
Therefore, achieving identical logs at position 1 may require multiple retries, with each node continuously attempting the two-phase commit. This process of retrying until consensus is reached is the execution of the Paxos protocol: one Paxos instance ultimately determines one value. Multi Paxos repeats this process to determine a series of values, i.e., each log entry!
Basic Paxos Algorithm
In the scenario, three clients concurrently send write instructions to three nodes. Corresponding to the Paxos protocol, each node serves dual roles: Proposer and Acceptor. Implementation typically places both roles within the same process.
When Node1 receives Client1's X=1 instruction, Node1 acts as a Proposer proposing X=1 log storage across all Acceptors (itself and other two nodes).
Similarly, when Node2 receives Client2's X=3 instruction, Node2 proposes to all Acceptors; Node3 follows the same pattern.
Examining Paxos algorithm details: each Acceptor persists three variables (minProposalId, acceptProposalId, acceptValue). Initially: minProposalId=acceptProposalId=0, acceptValue=null. The algorithm operates in two phases: P1 (Prepare phase) and P2 (Accept phase).
P1 (Prepare Phase)
Prepare phase P1a: Proposer broadcasts prepare(n), where n represents a locally generated incrementing ID (not globally ordered, e.g., timestamp+IP). P1b: Acceptor receives prepare(n) and makes decisions:

    if n > minProposalId, reply Yes, and:
        minProposalId = n (persisted)
        return (acceptProposalId, acceptValue)
    else
        reply No
P1c: If the Proposer receives yes responses from a majority, it selects as v the acceptValue returned by the Acceptor with the highest acceptProposalId and proceeds to the second phase by broadcasting accept(n,v). If every Acceptor returned a null acceptValue, the Proposer uses its own value as v for the second phase. Otherwise, it increments n and repeats P1a.
P2 (Accept Phase)
P2a: Proposer broadcasts accept(n,v). Here n matches the n from phase P1, while v is either the Proposer's own value or the acceptValue obtained in phase 1. P2b: Acceptor receives accept(n,v) and makes decisions:

    if n >= minProposalId, reply Yes, and:
        minProposalId = acceptProposalId = n (persisted)
        acceptValue = v
        return minProposalId
    else
        reply No
P2c: If the Proposer receives yes responses from a majority (each returned minProposalId equals n), the value is chosen and the algorithm terminates. Otherwise, it increments n and repeats P1a.
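Putting the two phases together, here is a minimal single-process Python sketch of the rules above; persistence and RPC are simulated with plain objects and direct method calls, and the names are illustrative rather than a reference implementation.

    # A minimal Basic Paxos sketch following the P1/P2 rules above.
    class Acceptor:
        def __init__(self):
            self.min_proposal_id = 0        # minProposalId (would be persisted)
            self.accept_proposal_id = 0     # acceptProposalId (would be persisted)
            self.accept_value = None        # acceptValue (would be persisted)

        def prepare(self, n):               # P1b
            if n > self.min_proposal_id:
                self.min_proposal_id = n
                return True, self.accept_proposal_id, self.accept_value
            return False, None, None

        def accept(self, n, v):             # P2b
            if n >= self.min_proposal_id:
                self.min_proposal_id = self.accept_proposal_id = n
                self.accept_value = v
            return self.min_proposal_id     # a value > n signals rejection

    def propose(acceptors, n, my_value):
        majority = len(acceptors) // 2 + 1
        # P1a/P1c: broadcast prepare(n) and collect promises.
        promises = [a.prepare(n) for a in acceptors]
        yes = [(pid, val) for ok, pid, val in promises if ok]
        if len(yes) < majority:
            return None                     # must retry with a larger n
        # Use the acceptValue with the highest acceptProposalId, if any.
        accepted = [(pid, val) for pid, val in yes if val is not None]
        v = max(accepted)[1] if accepted else my_value
        # P2a/P2c: broadcast accept(n, v) and count agreements.
        results = [a.accept(n, v) for a in acceptors]
        if sum(1 for r in results if r == n) >= majority:
            return v                        # this value is now chosen
        return None                         # rejected, retry with a larger n

    acceptors = [Acceptor(), Acceptor(), Acceptor()]
    print(propose(acceptors, n=1, my_value="X=1"))   # -> "X=1"
    print(propose(acceptors, n=2, my_value="X=3"))   # -> "X=1": already chosen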
Analyzing the algorithm reveals two issues with Basic Paxos:
- Paxos is a "continuously cycling" two-phase commit. In phase P1c or P2c, the algorithm may fail and restart from P1a, creating a typical "livelock" problem with potentially endless cycling.
- Determining each value requires at least two RTTs (two phases, two network round trips) plus two disk writes, presenting performance challenges.
Multi Paxos addresses these issues.
Multi Paxos Algorithm
Issue 1: Livelock Problem
As established above, Basic Paxos is a continuously cycling two-phase commit. With multiple clients writing to multiple machines and every machine acting as a Proposer, concurrency conflicts are severe - each node may need to run many rounds to determine a single log entry. In the extreme case, the two-phase commit cycles indefinitely, which is the livelock problem.
To reduce concurrency conflicts, convert multi-point writes into single-point writes: elect a Leader and allow only the Leader to act as Proposer. Other machines forward write requests to the Leader, or clients send their requests directly to the Leader.
Several Leader election methods exist:
Approach 1: Lease-free Leader Election
Lamport's paper presents a simple Leader election algorithm:
- Each node has a unique identifier, selecting the highest numbered node as Leader
- Each node periodically sends heartbeats to all others, assuming a period of T ms
- If a node receives no heartbeat from a higher-numbered node within 2T ms, it becomes Leader
- Non-Leader nodes forward requests to Leader
This algorithm is simple, but network timeouts may result in multiple Leaders existing simultaneously. That does not affect the correctness of Multi Paxos; it only increases the probability of concurrent write conflicts. The algorithm does not require that exactly one Leader exist at all times.
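A rough Python sketch of this lease-free rule (timings, class names, and the transport are assumptions made for illustration):

    # Lease-free Leader election sketch: a node considers itself Leader if it
    # has not heard from any higher-numbered node within 2T.
    import time

    class Elector:
        def __init__(self, node_id, all_ids, t_seconds):
            self.node_id = node_id
            self.higher_ids = [i for i in all_ids if i > node_id]
            self.t = t_seconds
            self.last_heartbeat = {i: 0.0 for i in self.higher_ids}

        def on_heartbeat(self, sender_id):
            # Only heartbeats from higher-numbered nodes matter for this rule.
            if sender_id in self.last_heartbeat:
                self.last_heartbeat[sender_id] = time.monotonic()

        def is_leader(self):
            now = time.monotonic()
            # Leader if no higher-numbered node was heard within 2T.
            return all(now - ts > 2 * self.t for ts in self.last_heartbeat.values())

    # The highest-numbered node always elects itself; a lower-numbered node
    # steps back as soon as it hears from a higher-numbered one.
    n1 = Elector(node_id=1, all_ids=[1, 2, 3], t_seconds=0.1)
    print(n1.is_leader())   # True: nothing heard from node 2 or 3 yet
    n1.on_heartbeat(3)
    print(n1.is_leader())   # False: node 3 was heard recently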
Approach 2: Leased Leader Election
Another approach strictly ensures that only one Leader exists at a time - the "lease" concept. A lease means that during a specified period, exactly one machine is the Leader. Even if that machine fails, no Leader switch occurs until the lease expires and a new Leader can be elected. This creates a brief window of unavailability but guarantees a single Leader. For implementation details, see Paxos Lease.
Issue 2: Performance Problems
Basic Paxos is a continuously cycling two-phase commit that requires at least two RTTs plus two disk writes to confirm each log entry (one round trip for the Prepare broadcast and replies, one for the Accept broadcast and replies). Paying two RTTs plus two disk writes per log entry yields poor performance. Multi Paxos optimizes the two-phase commit into a one-phase commit once a Leader is elected, needing only one RTT plus one disk write.
Basic approach: once a node is confirmed as Leader, it broadcasts Prepare once. After a majority agrees, it executes only Accept operations for all subsequent logs. Here, Prepare covers the entire log sequence rather than an individual log entry: once the Leader has control of the log, it can skip Prepare and execute Accept directly.
What happens when a new Leader emerges? The new Leader initiates its own Prepare, raising minProposalId on a majority of Acceptors. The old Leader's Accept broadcasts will then fail, and the old Leader reverts to an ordinary Acceptor as the new Leader takes over.
Specific implementation details: the Basic Paxos two-phase commit messages are:

    prepare(n)
    accept(n,v)

Multi Paxos adds a log index parameter, becoming:

    prepare(n,index)
    accept(n,v,index)
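A sketch of how a Leader could use these per-index messages once its single Prepare has succeeded; as before, acceptors are plain in-process objects and the structure is an illustrative assumption, not a complete implementation:

    # Multi Paxos fast path sketch: after one successful prepare(n) covering the
    # whole log, the Leader appends each new entry with a single Accept round.
    class MultiAcceptor:
        def __init__(self):
            self.min_proposal_id = 0
            self.accepted = {}                  # index -> (proposalId, value)

        def prepare(self, n):
            if n > self.min_proposal_id:
                self.min_proposal_id = n
                return True
            return False

        def accept(self, n, v, index):
            if n >= self.min_proposal_id:
                self.min_proposal_id = n
                self.accepted[index] = (n, v)
            return self.min_proposal_id         # > n means a newer Leader exists

    class Leader:
        def __init__(self, acceptors, n):
            self.acceptors = acceptors
            self.n = n
            self.next_index = 1
            self.majority = len(acceptors) // 2 + 1
            # One Prepare for the whole log sequence, not one per entry.
            assert sum(a.prepare(n) for a in acceptors) >= self.majority

        def append(self, value):
            index = self.next_index
            results = [a.accept(self.n, value, index) for a in self.acceptors]
            if sum(1 for r in results if r == self.n) < self.majority:
                raise RuntimeError("rejected: a newer Leader has prepared")
            self.next_index += 1
            return index                        # the entry at this index is chosen

    acceptors = [MultiAcceptor(), MultiAcceptor(), MultiAcceptor()]
    leader = Leader(acceptors, n=1)
    leader.append("X=1")                        # one RTT + one disk write per entry
    leader.append("X=3")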
Issue 3: Synchronizing Chosen Log States to Other Machines
For a log, when Proposer (Leader) receives majority Accept request agreements, the log becomes "chosen" - confirmed and unchangeable!
Only Proposer knows the log is confirmed; other Acceptors remain unaware. How does this information reach other Acceptors?
Approach 1: Proposer Active Notification
Add another parameter to accept:
    accept(n,v,index,firstUnchosenIndex)

When the Proposer broadcasts accept, it includes the additional parameter firstUnchosenIndex; for example, firstUnchosenIndex=7 means that all logs before position 7 are "chosen". An Acceptor receiving such a request checks its earlier logs, and for every position i before 7 that satisfies acceptedProposal[i] == request.proposal (the parameter n), it sets that log's status to chosen.
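A small sketch of how an Acceptor might apply firstUnchosenIndex when handling such a request; the convention of freezing a chosen entry with an infinite proposal id is an assumption made for illustration:

    # Acceptor-side handling of firstUnchosenIndex (illustrative sketch): any
    # earlier entry accepted under the same proposal n can be marked chosen.
    CHOSEN = float("inf")   # a chosen entry is frozen with an infinite proposal id

    class SyncAcceptor:
        def __init__(self):
            self.accepted_proposal = {}   # index -> proposal id (CHOSEN once fixed)
            self.accepted_value = {}      # index -> value

        def accept(self, n, v, index, first_unchosen_index):
            # Normal accept handling for this index (checks omitted for brevity).
            self.accepted_proposal[index] = n
            self.accepted_value[index] = v
            # Mark earlier entries as chosen when they were accepted under the
            # same proposal n that the Leader reports as already decided.
            for i in list(self.accepted_proposal):
                if i < first_unchosen_index and self.accepted_proposal[i] == n:
                    self.accepted_proposal[i] = CHOSEN

    a = SyncAcceptor()
    a.accept(n=5, v="X=1", index=6, first_unchosen_index=6)   # entry 6 accepted
    a.accept(n=5, v="X=3", index=7, first_unchosen_index=7)   # entry 6 now chosen
    assert a.accepted_proposal[6] == CHOSEN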
Approach 2: Acceptor Passive Query
When Acceptor becomes Leader, execute Paxos again for all unconfirmed logs to determine majority-agreed values.
Basic Paxos has a core property: once a value has been determined, executing Paxos again will not change it! Therefore, executing Paxos again is effectively a way of querying the cluster!
Multi Paxos Core Concepts
Reviewing this algorithm reveals two essential elements:
Essence 1: Strongly Consistent "P2P Network"
Every log entry has two states (chosen, unchosen). An applied state also exists (a chosen log entry that has been applied to the state machine), but it has little to do with the Paxos protocol itself.
Chosen state means log acceptance by majority, becoming unchangeable;
Unchosen means uncertainty. As Alibaba OceanBase team engineers described, it's like "Schrödinger's cat" or "maximum commit principle". An unchosen log might already be chosen but unknown to the node, or not yet chosen. Confirmation requires executing Paxos again, representing the "maximum commit principle".
Entire Multi Paxos resembles a P2P network with bidirectional synchronization between all nodes, continuously confirming all unchosen logs! Multiple Leaders may appear with switching scenarios, none affecting algorithm correctness!
Essence 2: "Sequential Ordering"
Multi Paxos ensures identical log ordering across all nodes, but from each node's perspective, logs lack inherent "order". What does this mean?
- If a client sends logs a and b in order but sends b before receiving the reply for a, the servers may store a,b or b,a, and other clients' logs may be inserted between a and b!
- If a client sends logs a and b in order and sends b only after receiving the reply for a, the servers may store a,b, or a,xxx,b (with other clients' logs inserted between a and b), but never b before a.
Therefore, "sequential ordering" exists only when single clients serially send logs. Multiple clients writing concurrently with servers executing Paxos concurrently for each log creates no overall "order".