The Problem That Paxos Addresses
The Paxos algorithm often appears cryptic and difficult to grasp. While numerous papers and articles exist, many feel abstract and unclear about its core principles and practical applications.
This complexity stems from two factors: most Paxos materials focus on formal proofs of algorithm correctness, making them inherently dense, and there are limited mature engineering implementations based on Paxos. This discussion aims to approach the topic progressively, starting from problems and gradually exploring Paxos concepts.
A Fundamental Concurrency Challenge
Consider a basic concurrency issue illustrated with a key-value storage cluster. Three clients concurrently send write requests (X=1, X=3, and X=5) to a three-node cluster. When retrieving value X afterwards, what should its final state be?
The answer is: X could equal 1, 3, or 5 - all are valid! However, X=4 would be incorrect. Since clients operate concurrently, request arrival order at servers remains uncertain, allowing multiple valid outcomes.
A crucial insight: if the final cluster state shows X=1, then Client1's request (X=1) receives confirmation, while Client2's (X=3) and Client3's (X=5) requests effectively get rejected. Similar reasoning applies when X equals 3 or 5. This behavior represents a fundamental characteristic of the Paxos protocol.
Understanding Sequential Ordering
Expanding the scenario: consider a three-machine KV cluster where nodes communicate by propagating values. Each client sends requests to different machines. Each machine logs received requests (both client and inter-node communications). After processing completes, what sequence should each machine maintain?
The conclusion: regardless of ordering, if all three machines maintain identical log sequences, the result is correct. With three items, total permutations (6 cases) all remain valid. For instance, if all machines store X=1, X=3, X=5 in that order, the final cluster value becomes 5. Other scenarios follow similarly.
Incorrect scenarios occur when: Machine1 maintains sequence 1,3,5 (resulting in X=5), Machine2 has 3,5,1 (resulting in X=1), and Machine3 holds 1,5,3 (resulting in X=3). The machines disagree on X's value.
This example illustrates sequential ordering: while clients operate concurrently without precedence, server clusters must ensure identical log ordering across all machines - achieving distributed consistency.
What Paxos Resolves
In the example, Node1 receives X=1 then replicates to Node2 and Node3; Node2 receives X=3 then replicates to Node1 and Node3; Node3 receives X=5 then replicates to Node1 and Node2.
With concurrent clients and concurrent replication between nodes, how can we guarantee identical log sequences across all nodes? This challenge represents the core problem that Paxos addresses.
Replicated State Machines
Previous discussions addressed log replication issues - each node stores log sequences with guaranteed consistency between nodes. Some might ask: why store logs instead of final data directly?
Consider a variable or object as a state machine. Each write request is an event that triggers a state transition, and the sequence of these events constitutes the log.
Using the simple variable X example: with one node applying three client commands to X in sequence (e.g., X=1, then X+=2, then X*=2), the final state becomes 6.
Extending variable X to MySQL databases, clients send various DML operations that become persistent Binlog entries. These logs then generate database tables (state machines).
An important concept emerges: persist the "event stream" (log stream) rather than the "data itself" (state machine). Several reasons support this approach:
- Logs involve only append operations, while data supports add, delete, update operations. Converting three operations into one simplifies persistence.
- Multi-machine synchronization becomes easier with logs versus complex data structures (relational database joins, trees, graphs). Linear log sequences enable straightforward comparison.
Whether considering persistence or data synchronization, storing state machine input event streams (logs) proves simpler than storing state machines themselves.
Based on this approach, state machines extend to replicated state machines. The principle states: identical initial states + identical input events = identical final states. Therefore, ensuring identical log streams across multiple nodes guarantees consistent states. Even if a node fails, restarting and replaying the log stream restores previous states.
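As an aside, here is a minimal Python sketch of this replicated-state-machine idea (the command set and class names are illustrative, not part of the original example): identical initial state plus an identical event log yields an identical final state, and a crashed node recovers by replaying the log.

    # A minimal replicated-state-machine sketch: state is a single variable X,
    # events are functions applied in log order. Names are illustrative.
    class StateMachine:
        def __init__(self):
            self.x = 0          # identical initial state on every node
            self.log = []       # persisted event stream (append-only)

        def append_and_apply(self, event):
            self.log.append(event)   # persist the event (append-only)
            self.x = event(self.x)   # apply it to the state machine

        def recover(self):
            # After a crash, replay the persisted log from the initial state.
            self.x = 0
            for event in self.log:
                self.x = event(self.x)

    node_a, node_b = StateMachine(), StateMachine()
    events = [lambda x: 1, lambda x: x + 2, lambda x: x * 2]   # same log stream
    for e in events:
        node_a.append_and_apply(e)
        node_b.append_and_apply(e)
    assert node_a.x == node_b.x == 6    # identical logs -> identical final state
    node_a.recover()
    assert node_a.x == 6                # replay restores the previous state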
An Intuitive Conceptual Framework
Paxos initially emerged through Basic Paxos formal proof, followed by Multi Paxos, then application scenarios. Since applications weren't initially presented, direct examination of Basic Paxos proofs proved challenging. This discussion reverses the approach, starting with applications before deriving Paxos and Multi Paxos.
When three clients concurrently send requests, six potential results prove valid. An algorithm must ensure that despite concurrent client requests, all nodes record identical log ordering.
Here's an intuitive yet profound concept: global understanding of number sequences 1,2,3,4,5... remains consistent! All people and machines share this understanding.
When referencing 2, everyone knows it follows 1 and precedes 3! Two represents a position definitely between (1,3).
Applying this idea to computer systems for multi-node log replication: when Node1 receives X=1 and stores it in position 1, first verify if other machines already occupy position 1 with X=3 or X=5. If occupied, check position 2, and so forth. If unoccupied, store X=1 in position 1 and inform other nodes to store X=1 in their respective position 1. Similarly, Node2 and Node3 execute accordingly.
The key concept: while each node receives requests in different orders, they maintain consistent understanding of log positions 1, 2, 3, etc., collectively ensuring identical data storage at each position!
Each node queries other nodes before storing logs, then decides placement. This creates two phases: inquiry followed by decision, representing the prototype of Paxos two-phase commit.
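To make the inquire-then-decide intuition concrete (this is only the naive prototype, not Paxos itself), here is a rough Python sketch with made-up helpers: each write first asks all nodes for the first free position, then tells every node to store the value there.

    # Naive "inquire, then decide" sketch (NOT Paxos): find the first log
    # position no node has filled, then tell every node to store the value
    # there. Everything here is illustrative.
    class Node:
        def __init__(self):
            self.log = {}                  # position -> value

    def first_free_position(nodes):
        pos = 1
        while any(pos in node.log for node in nodes):
            pos += 1                       # occupied somewhere, try the next slot
        return pos

    def naive_store(nodes, value):
        pos = first_free_position(nodes)   # phase 1: inquiry
        for node in nodes:                 # phase 2: decision
            node.log[pos] = value

    nodes = [Node(), Node(), Node()]
    naive_store(nodes, "X=1")              # lands in position 1
    naive_store(nodes, "X=3")              # lands in position 2
    # Flaw: if two nodes run naive_store concurrently, both can see the same
    # position as free and write different values there - exactly the conflict
    # that Basic Paxos is designed to resolve.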
Decomposing the problem further to the replication of a single log entry rather than three: even determining the first log entry across the three nodes reveals challenges.
Node1 discovers position 1 is unoccupied and plans to propagate X=1 to Node2 and Node3. Simultaneously, Node2 finds position 1 unoccupied and plans to propagate X=3 to Node1 and Node3. Similarly, Node3 plans to propagate X=5 to Node1 and Node2.
This creates conflicts - even determining the first position proves problematic!
Basic Paxos specifically addresses this challenge.
First, position 1 either gets claimed by Node1 (everyone stores X=1), Node2 (everyone stores X=3), or Node3 (everyone stores X=5), following majority rule. To achieve this, Basic Paxos proposes a method with two points:
- When Node1 tries to fill position 1 and discovers the value there is already determined by a majority (e.g., X=5, where Node3 claimed position 1 and Node2 followed), Node1 accepts this reality: position 1 is no longer available, so Node1 sets its own position 1 to X=5. It then checks whether position 2 can hold X=1. If position 2 is also occupied, it adopts that value for position 2 and checks position 3, and so on.
- When Node1 discovers position 1 is unoccupied, it locks it, preventing other nodes from claiming it unless they have higher priority. If Node1 finds position 1 empty but discovers during submission that it has been occupied, it fails and retries with subsequent positions.
Therefore, achieving identical logs at position 1 may require multiple retries, with each node continuously attempting the two-phase commit. This process of retrying until consensus is reached is the execution of the Paxos protocol: one Paxos instance ultimately determines one value. Multi Paxos repeats this process to determine a series of values, i.e., each log entry!
Basic Paxos Algorithm
In the scenario, three clients concurrently send write instructions to three nodes. Corresponding to the Paxos protocol, each node serves dual roles: Proposer and Acceptor. Implementation typically places both roles within the same process.
When Node1 receives Client1's X=1 instruction, Node1 acts as a Proposer proposing X=1 log storage across all Acceptors (itself and other two nodes).
Similarly, when Node2 receives Client2's X=3 instruction, Node2 proposes to all Acceptors; Node3 follows the same pattern.
Examining Paxos algorithm details: each Acceptor persists three variables (minProposalId, acceptProposalId, acceptValue). Initially: minProposalId=acceptProposalId=0, acceptValue=null. The algorithm operates in two phases: P1 (Prepare phase) and P2 (Accept phase).
P1 (Prepare Phase)
Prepare phase P1a: Proposer broadcasts prepare(n), where n represents a locally generated incrementing ID (not globally ordered, e.g., timestamp+IP). P1b: Acceptor receives prepare(n) and makes decisions:

    if n > minProposalId, reply Yes, and:
        minProposalId = n (persisted)
        return (acceptProposalId, acceptValue)
    else
        reply No
P1c: If the Proposer receives yes responses from a majority, it selects as v the acceptValue returned by the Acceptor with the highest acceptProposalId and proceeds to the second phase by broadcasting accept(n,v). If every Acceptor returned a null acceptValue, the Proposer uses its own value as v for the second phase. Otherwise, it increments n and repeats P1a.
P2 (Accept Phase)
P2a: Proposer broadcasts accept(n,v). Here n matches the n from phase P1, while v is either the Proposer's own value or the acceptValue obtained in phase 1. P2b: Acceptor receives accept(n,v) and makes decisions:

    if n >= minProposalId, reply Yes, and:
        minProposalId = acceptProposalId = n (persisted)
        acceptValue = v
        return minProposalId
    else
        reply No
P2c: If the Proposer receives yes responses from a majority (each returned minProposalId equals n), the value is chosen and the algorithm terminates. Otherwise, it increments n and repeats P1a.
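Putting the two phases together, here is a minimal single-process Python sketch of the rules above; persistence and RPC are simulated with plain objects and direct method calls, and the names are illustrative rather than a reference implementation.

    # A minimal Basic Paxos sketch following the P1/P2 rules above.
    class Acceptor:
        def __init__(self):
            self.min_proposal_id = 0        # minProposalId (would be persisted)
            self.accept_proposal_id = 0     # acceptProposalId (would be persisted)
            self.accept_value = None        # acceptValue (would be persisted)

        def prepare(self, n):               # P1b
            if n > self.min_proposal_id:
                self.min_proposal_id = n
                return True, self.accept_proposal_id, self.accept_value
            return False, None, None

        def accept(self, n, v):             # P2b
            if n >= self.min_proposal_id:
                self.min_proposal_id = self.accept_proposal_id = n
                self.accept_value = v
            return self.min_proposal_id     # a value > n signals rejection

    def propose(acceptors, n, my_value):
        majority = len(acceptors) // 2 + 1
        # P1a/P1c: broadcast prepare(n) and collect promises.
        promises = [a.prepare(n) for a in acceptors]
        yes = [(pid, val) for ok, pid, val in promises if ok]
        if len(yes) < majority:
            return None                     # must retry with a larger n
        # Use the acceptValue with the highest acceptProposalId, if any.
        accepted = [(pid, val) for pid, val in yes if val is not None]
        v = max(accepted)[1] if accepted else my_value
        # P2a/P2c: broadcast accept(n, v) and count agreements.
        results = [a.accept(n, v) for a in acceptors]
        if sum(1 for r in results if r == n) >= majority:
            return v                        # this value is now chosen
        return None                         # rejected, retry with a larger n

    acceptors = [Acceptor(), Acceptor(), Acceptor()]
    print(propose(acceptors, n=1, my_value="X=1"))   # -> "X=1"
    print(propose(acceptors, n=2, my_value="X=3"))   # -> "X=1": already chosen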
Analyzing the algorithm reveals two issues with Basic Paxos:
- Paxos is a "continuously cycling" two-phase commit. In phase P1c or P2c, the algorithm may fail and restart from P1a, creating a typical "livelock" problem with potentially endless cycling.
- Determining each value requires at least two RTTs (two phases, two network round trips) plus two disk writes, presenting performance challenges.
Multi Paxos addresses these issues.
Multi Paxos Algorithm
Issue 1: Livelock Problem
As established above, Basic Paxos is a continuously cycling two-phase commit. With multiple clients writing to multiple machines and every machine acting as a Proposer, concurrency conflicts are severe - each node may need to run many rounds to determine a single log entry. In the extreme case, the two-phase commit cycles indefinitely, which is the livelock problem.
To reduce concurrency conflicts, convert multi-point writes into single-point writes: elect a Leader and allow only the Leader to act as Proposer. Other machines forward write requests to the Leader, or clients send their requests directly to the Leader.
Several Leader election methods exist:
Approach 1: Lease-free Leader Election
Lamport's paper presents a simple Leader election algorithm:
- Each node has a unique identifier, selecting the highest numbered node as Leader
- Each node periodically sends heartbeats to all others, assuming a period of T ms
- If a node receives no heartbeat from a higher-numbered node within 2T ms, it becomes Leader
- Non-Leader nodes forward requests to Leader
This algorithm is simple, but network timeouts may result in multiple Leaders existing simultaneously. That does not affect the correctness of Multi Paxos; it only increases the probability of concurrent write conflicts. The algorithm does not require that exactly one Leader exist at all times.
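A rough Python sketch of this lease-free rule (timings, class names, and the transport are assumptions made for illustration):

    # Lease-free Leader election sketch: a node considers itself Leader if it
    # has not heard from any higher-numbered node within 2T.
    import time

    class Elector:
        def __init__(self, node_id, all_ids, t_seconds):
            self.node_id = node_id
            self.higher_ids = [i for i in all_ids if i > node_id]
            self.t = t_seconds
            self.last_heartbeat = {i: 0.0 for i in self.higher_ids}

        def on_heartbeat(self, sender_id):
            # Only heartbeats from higher-numbered nodes matter for this rule.
            if sender_id in self.last_heartbeat:
                self.last_heartbeat[sender_id] = time.monotonic()

        def is_leader(self):
            now = time.monotonic()
            # Leader if no higher-numbered node was heard within 2T.
            return all(now - ts > 2 * self.t for ts in self.last_heartbeat.values())

    # The highest-numbered node always elects itself; a lower-numbered node
    # steps back as soon as it hears from a higher-numbered one.
    n1 = Elector(node_id=1, all_ids=[1, 2, 3], t_seconds=0.1)
    print(n1.is_leader())   # True: nothing heard from node 2 or 3 yet
    n1.on_heartbeat(3)
    print(n1.is_leader())   # False: node 3 was heard recently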
Approach 2: Leased Leader Election
Another approach strictly ensures that only one Leader exists at a time - the "lease" concept. A lease means that during a specified period, exactly one machine is the Leader. Even if that machine fails, no Leader switch occurs until the lease expires and a new Leader can be elected. This creates a brief window of unavailability but guarantees a single Leader. For implementation details, see Paxos Lease.
Issue 2: Performance Problems
Basic Paxos is a continuously cycling two-phase commit that requires at least two RTTs plus two disk writes to confirm each log entry (one round trip for the Prepare broadcast and replies, one for the Accept broadcast and replies). Paying two RTTs plus two disk writes per log entry yields poor performance. Multi Paxos optimizes the two-phase commit into a one-phase commit once a Leader is elected, needing only one RTT plus one disk write.
Basic approach: once a node is confirmed as Leader, it broadcasts Prepare once. After a majority agrees, it executes only Accept operations for all subsequent logs. Here, Prepare covers the entire log sequence rather than an individual log entry: once the Leader has control of the log, it can skip Prepare and execute Accept directly.
What happens when a new Leader emerges? The new Leader initiates its own Prepare, raising minProposalId on a majority of Acceptors. The old Leader's Accept broadcasts will then fail, and the old Leader reverts to an ordinary Acceptor as the new Leader takes over.
Specific implementation details: the Basic Paxos two-phase commit messages are:

    prepare(n)
    accept(n,v)

Multi Paxos adds a log index parameter, becoming:

    prepare(n,index)
    accept(n,v,index)
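A sketch of how a Leader could use these per-index messages once its single Prepare has succeeded; as before, acceptors are plain in-process objects and the structure is an illustrative assumption, not a complete implementation:

    # Multi Paxos fast path sketch: after one successful prepare(n) covering the
    # whole log, the Leader appends each new entry with a single Accept round.
    class MultiAcceptor:
        def __init__(self):
            self.min_proposal_id = 0
            self.accepted = {}                  # index -> (proposalId, value)

        def prepare(self, n):
            if n > self.min_proposal_id:
                self.min_proposal_id = n
                return True
            return False

        def accept(self, n, v, index):
            if n >= self.min_proposal_id:
                self.min_proposal_id = n
                self.accepted[index] = (n, v)
            return self.min_proposal_id         # > n means a newer Leader exists

    class Leader:
        def __init__(self, acceptors, n):
            self.acceptors = acceptors
            self.n = n
            self.next_index = 1
            self.majority = len(acceptors) // 2 + 1
            # One Prepare for the whole log sequence, not one per entry.
            assert sum(a.prepare(n) for a in acceptors) >= self.majority

        def append(self, value):
            index = self.next_index
            results = [a.accept(self.n, value, index) for a in self.acceptors]
            if sum(1 for r in results if r == self.n) < self.majority:
                raise RuntimeError("rejected: a newer Leader has prepared")
            self.next_index += 1
            return index                        # the entry at this index is chosen

    acceptors = [MultiAcceptor(), MultiAcceptor(), MultiAcceptor()]
    leader = Leader(acceptors, n=1)
    leader.append("X=1")                        # one RTT + one disk write per entry
    leader.append("X=3")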
Issue 3: Synchronizing Chosen Log States to Other Machines
For a log, when Proposer (Leader) receives majority Accept request agreements, the log becomes "chosen" - confirmed and unchangeable!
Only Proposer knows the log is confirmed; other Acceptors remain unaware. How does this information reach other Acceptors?
Approach 1: Proposer Active Notification
Add another parameter to accept:
    accept(n,v,index,firstUnchosenIndex)

When the Proposer broadcasts accept, it includes the additional parameter firstUnchosenIndex; for example, firstUnchosenIndex=7 means that all logs before position 7 are "chosen". An Acceptor receiving such a request checks its earlier logs, and for every position i before 7 that satisfies acceptedProposal[i] == request.proposal (the parameter n), it sets that log's status to chosen.
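A small sketch of how an Acceptor might apply firstUnchosenIndex when handling such a request; the convention of freezing a chosen entry with an infinite proposal id is an assumption made for illustration:

    # Acceptor-side handling of firstUnchosenIndex (illustrative sketch): any
    # earlier entry accepted under the same proposal n can be marked chosen.
    CHOSEN = float("inf")   # a chosen entry is frozen with an infinite proposal id

    class SyncAcceptor:
        def __init__(self):
            self.accepted_proposal = {}   # index -> proposal id (CHOSEN once fixed)
            self.accepted_value = {}      # index -> value

        def accept(self, n, v, index, first_unchosen_index):
            # Normal accept handling for this index (checks omitted for brevity).
            self.accepted_proposal[index] = n
            self.accepted_value[index] = v
            # Mark earlier entries as chosen when they were accepted under the
            # same proposal n that the Leader reports as already decided.
            for i in list(self.accepted_proposal):
                if i < first_unchosen_index and self.accepted_proposal[i] == n:
                    self.accepted_proposal[i] = CHOSEN

    a = SyncAcceptor()
    a.accept(n=5, v="X=1", index=6, first_unchosen_index=6)   # entry 6 accepted
    a.accept(n=5, v="X=3", index=7, first_unchosen_index=7)   # entry 6 now chosen
    assert a.accepted_proposal[6] == CHOSEN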
Approach 2: Acceptor Passive Query
When Acceptor becomes Leader, execute Paxos again for all unconfirmed logs to determine majority-agreed values.
Basic Paxos has a core property: once a value has been determined, executing Paxos again will not change it! Therefore, executing Paxos again is effectively a way of querying the cluster!
Multi Paxos Core Concepts
Reviewing this algorithm reveals two essential elements:
Essence 1: Strongly Consistent "P2P Network"
Every log entry has two states (chosen, unchosen). An applied state also exists (a chosen log entry that has been applied to the state machine), but it has little to do with the Paxos protocol itself.
Chosen state means log acceptance by majority, becoming unchangeable;
Unchosen means uncertainty. As Alibaba OceanBase team engineers described, it's like "Schrödinger's cat" or "maximum commit principle". An unchosen log might already be chosen but unknown to the node, or not yet chosen. Confirmation requires executing Paxos again, representing the "maximum commit principle".
Entire Multi Paxos resembles a P2P network with bidirectional synchronization between all nodes, continuously confirming all unchosen logs! Multiple Leaders may appear with switching scenarios, none affecting algorithm correctness!
Essence 2: "Sequential Ordering"
Multi Paxos ensures identical log ordering across all nodes, but from each node's perspective, logs lack inherent "order". What does this mean?
- If a client sends logs a and b in order but sends b before receiving the reply for a, the servers may store a,b or b,a, and other clients' logs may be inserted between a and b!
- If a client sends logs a and b in order and sends b only after receiving the reply for a, the servers may store a,b, or a,xxx,b (with other clients' logs inserted between a and b), but never b before a.
Therefore, "sequential ordering" exists only when single clients serially send logs. Multiple clients writing concurrently with servers executing Paxos concurrently for each log creates no overall "order".