TL;DR Paxos is a protocol for distributed consensus. In its simplest form, a group of participants collectively decide and agree on a single value of the system.
The problem in general was discussed in the Raft post. We'll be talking about single-decree Paxos where a group of participants want to decide on a single value. A value can be considered as a log entry—an event that leads to state changes. Single-decree Paxos can be repeated to decide on multiple values, forming a log.
This blog post abstracts the problem of consensus as a WOR (Write-Once register). It is quite intuitive; in Paxos, once a value is chosen, it must not be changed.
In Paxos, there are three kinds of agents: proposers, acceptors, and learners.
The algorithm involves one or more rounds that are driven by the proposers. Each round is identified by a sequence number and driven by one proposer. Multiple rounds can happen concurrently without violating safety.
In each round, the proposer has two phases
If everything goes well, a single round will get the problem solved. But Paxos is designed to survive different failure scenarios. Nodes can die and restart at any time. The system should continue to function as long as a majority of participants are up.
A proposer can go down at any step of the process. For example, it may fail to send out enough Accept requests to all acceptors or it could be super slow as if it has crashed. Multiple rounds are needed in these failure scenarios.
The sequence number of a round establishes precedence between rounds. The Prepare phase of a higher-sequence-number proposer is like a fencing token that asks the acceptors to forget about previous proposers.
For any two rounds, we can consider the possible sequence of events that can happen on a single acceptor
(Px: prepare of round x; Ax: accept of round x)
Note that the Paxos protocol prohibits P2P1 from happening.
Value determination of the Prepare phase
If a value is chosen at a certain round, Paxos ensures that it is impossible for a different value to be chosen again. It does this by ensuring that all future proposals will propose the same value. This is the key purpose of the Prepare phase—picking up values that have been accepted before.
Let's say round 1 gets a value chosen. The prepare phase of round 1 must have reached a majority of acceptors. Now, for future rounds, there must be an overlap between the set of acceptors that accept the chosen value in round 1 and the acceptors that respond to the prepare request of round 2. In other words, there must be at least one node in the P1A1P2 situation, which ensures that the proposed value is the same as round 1 in round 2.
The prepare phase ensures that any value accepted by the majority is picked up, inherited, and propagated to all future proposals. In Raft, this induction-like inheritance is done at the leader election step. The new term leader must contain any entry replicated to the majority.
You may ask if a value has been chosen, what's the need to have more rounds? Why not just stop everything? It could be that there are multiple concurrent rounds and one of the rounds chooses a value while others are in-flight. We need to handle conflicts between rounds.
What about learners? Whenever an acceptor accepts a value, it can send a message to a learner for it to count. Once a majority is reached, it will learn that a value has been chosen. It can then inform the proposers to stop.
The appendix in the original Paxos paper is a good reference. That paper poses a relatively strict constraint that requires the majority sets to be the same in the two phases.
In Paxos Made Simple, the version of Paxos allows the majority sets to be different. To support that, it needs to allows an acceptor to accept a ballet higher than what's prepared.
In the TLA+ specification, an acceptor is allowed to accept a ballet higher than what's prepared, but when that happens, it also updates its highest ballet number so that it denies future prepare/accept requests with a smaller sequence number. To me, it's like automatically inserting a prepare request right before the accept request.
Some claimed that PMS has a bug but I disagreed. In the failure scenario mentioned in the blog post, the accepted ballet with a lower number will not overwrite the higher ballet already accepted.
Some said PMS should have said, “if the acceptor accepts a ballet higher than what's prepared, the acceptor modifies the prepared value to the new ballot”. This is also what the TLA+ spec does. But I think this is an optimization, not a requirement for safety.
Without that optimization, consider again the sequence of events on an acceptor.
Note that P2P1 and P2A1 are still prevented.
Now some might say that situation 5 is problematic. But I argue that it is harmless. It is only harmful if two rounds have different values and they are both accepted. But that will not happen.
A Raft-style Paxos description is presented in https://arxiv.org/pdf/2004.05074.pdf.
An extensive list of Paxos variations: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf