1. Introduction
The digital age has created a need for easy and fast access to information. The World Wide Web has facilitated the existence of many useful multiagent systems, from messaging apps to cryptocurrency [1] and distributed data storage (or cloud services) [2,3]. However, the design of multiagent systems inherently requires agents to communicate and coordinate according to a prescribed shared protocol to achieve a common goal. For example, messaging apps must always show messages in the same order to all participants in a conversation, which is challenging when user clocks are not necessarily synchronized [4,5]. Cryptocurrencies employ decentralized data structures to register currency transactions, which require a vast majority of users to agree upon their current state [6]. Distributed data storage services must show consistent views of stored files in the presence of multiple concurrent reading and writing operations [7,8].
In the pursuit of developing such distributed protocols, much of the literature routinely makes two powerful assumptions. The first is that communication links are reliable [9,10,11], i.e., all messages between agents are eventually delivered. The second is that there exists an upper bound on the transmission delay of messages from one agent to another (usually the maximum propagation time of links) [12]. Nonetheless, communication networks are notoriously unreliable [13,14,15]. In fact, actual communication links may suffer from sudden crashes, causing messages in transit to be lost forever. In an effort to ensure reliability, distributed applications are generally built upon a reliable broadcast layer implemented by the Transmission Control Protocol (TCP) [16], one of the main protocols in the internet protocol suite. However, while TCP guarantees eventual delivery of all sent messages, it does not provide any upper bound on delivery time [17] (p. 9). In practice, therefore, the two assumptions do not hold simultaneously.
In this work, we assume no such underlying structure exists and analyze the performance of a simple majority-rule protocol solving a fundamental coordination problem in distributed systems, namely binary majority consensus, in the presence of probabilistic message loss. Using probabilistic analysis for a large-scale, fully-connected network of agents, we prove that the Simple Majority Protocol (SMP) converges rapidly to a consensus on the majority opinion of the network with probability approaching 1 as the number of agents tends to infinity, given that the difference between the numbers of agents that hold different opinions grows as fast as . Otherwise, if the difference between the numbers of agents that hold different opinions is relatively close to zero, then the SMP still converges extremely fast to a consensus, but not necessarily on the initial majority opinion of the network.
1.1. Importance of Reliable Communication
Reliability of communication is essential to guarantee coordination in almost all cases. The pitfalls and design challenges of coordination when communication is unreliable are best illustrated by the two generals’ problem, which was popularized by Jim Gray [18].
Consider two generals who must coordinate a joint attack on an enemy. Both generals must attack simultaneously for the attack to succeed. While the two generals agreed that they will attack, they have not agreed upon a time for the attack. To coordinate, they can send messages to one another by running messengers. However, the messengers can be captured by the enemy and their messages will therefore not reach their destination.
Due to the uncertainty of message delivery, there exists no deterministic joint communication protocol that guarantees a coordinated attack. To see this, assume for the sake of contradiction that such a protocol exists. Since a deterministic protocol must solve the problem in a finite number of steps, the protocol prescribes a fixed number of message exchanges between the two generals, after which both must attack together. Some of these messages are successfully delivered and some are lost. Consider the last successfully delivered message in a run of the protocol, after which the recipient is confident enough to attack without the need for any further correspondence. Suppose this message had been lost instead; then the recipient would hold off and not attack. However, the sender does not know about this last communication failure. By the protocol definition, he must attack anyway, despite his counterpart’s reluctance, contradicting the assumption that the protocol was a solution to the problem.
1.2. Majority Consensus
The impossibility result of the two generals’ problem had far-reaching implications in the field of distributed protocols and databases, including the study of binary consensus [19]. In the binary consensus problem, every agent is initially assigned some binary value, referred to as the agent’s initial opinion. The goal of a protocol that solves consensus is to have every agent eventually decide on the same opinion, thus reaching agreement throughout the system. More formally, given any initial assignment of agent opinions, a run of a protocol which solves consensus must exhibit the following three properties:
Decision: every agent eventually decides on some opinion v;
Agreement: if some agent decided on v, no opinion other than v can be decided on by any other agent;
Nontriviality: if some agent has decided on v, then v was an opinion initially assigned to some agent.
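The three properties above can be checked mechanically on the outcome of a single finished run. The following is a minimal sketch, not part of the original analysis; the function name `check_consensus_run` and the convention that `None` marks an agent that never decided are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def check_consensus_run(
    initial: Sequence[int], decided: Sequence[Optional[int]]
) -> Dict[str, bool]:
    """Check Decision, Agreement, and Nontriviality on one finished run.

    initial[i] is agent i's initial opinion (0 or 1).
    decided[i] is the opinion agent i decided on, or None if it never
    decided (a hypothetical convention used only in this sketch).
    """
    # Decision: every agent has decided on some opinion.
    decision = all(d is not None for d in decided)
    # Agreement: at most one distinct opinion was decided on.
    decided_values = {d for d in decided if d is not None}
    agreement = len(decided_values) <= 1
    # Nontriviality: every decided opinion was initially held by some agent.
    nontriviality = decided_values <= set(initial)
    return {
        "decision": decision,
        "agreement": agreement,
        "nontriviality": nontriviality,
    }
```

For instance, a run in which all agents start with opinion 0 but decide on 1 satisfies Decision and Agreement while violating Nontriviality.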
Consensus is a fundamental problem in distributed systems, as many other coordination problems were shown to be directly reducible to and from consensus. The list includes agreeing on what transactions to commit to a database [20], state machine replication [21], atomic snapshots [22], total ordering of concurrent events [23], and the two generals’ problem, implying that no protocol can guarantee all three properties when communication is unreliable [24].
In light of this, it is interesting to consider a variation of the two generals’ problem where the probability of a messenger getting captured is p (independently of other messengers) [25,26]. While coordinated attack is still deterministically impossible, it is straightforward to design a protocol that guarantees success with probability at least q, which can be as close as desired to 1. The first general simply sends sufficiently many messengers (any number k with 1 − p^k ≥ q suffices), then attacks at the specified time without waiting for a reply, and the second general attacks if any messenger from the first general arrives.
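As a worked illustration of this probabilistic protocol: the attack is coordinated exactly when at least one messenger gets through, which happens with probability 1 − p^k for k independent messengers. The sketch below computes the smallest such k for a target success probability q; the function name `messengers_needed` is a hypothetical helper, not notation from the text.

```python
import math


def messengers_needed(p: float, q: float) -> int:
    """Smallest k such that 1 - p**k >= q.

    p: probability that a single messenger is captured (0 < p < 1).
    q: required probability of a coordinated attack (0 < q < 1).
    """
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("p and q must lie strictly between 0 and 1")
    # 1 - p**k >= q  <=>  p**k <= 1 - q  <=>  k >= log(1 - q) / log(p),
    # where the inequality flips because log(p) is negative.
    return math.ceil(math.log(1.0 - q) / math.log(p))
```

The required number of messengers grows only logarithmically in 1/(1 − q), so even a high capture probability can be compensated for with a modest number of messengers.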
In this work, we investigate whether leveraging such an assumption helps to solve binary majority consensus, in which the nontriviality clause stipulates that if a majority of agents initially hold the same opinion, then all agents must decide on this opinion. This variant of consensus is utilized when the agreed-upon opinion holds importance beyond facilitating agreement. For example, a distributed system of sensors capable of detecting natural gas could use majority consensus to answer the question “Is the amount of gas in the air greater than 10,000 ppm?” to help detect a gas leak in a gas processing center.
We analyze the performance of the SMP in a complete graph of communication, i.e., where each agent has an active communication channel to every other agent in the system. In SMP, agents communicate in equal-length time intervals called rounds. All messages are sent at the beginning of a communication round, and they either arrive by the end of the round or are considered lost. We assume that all message loss events are statistically independent and identically distributed with some constant probability.
The SMP can be briefly described as follows: in each round, every agent sends its current opinion to all other agents. Then, it waits to receive all messages from other agents proposing their own opinions. If a majority of received messages propose the same opinion, then the agent adopts this opinion for the next round. All ties are reconciled by readopting the agent’s own opinion. After a fixed number of rounds r, each agent decides on its currently adopted opinion.
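The informal description above translates directly into a small simulation. The following is a minimal sketch under that description only: each message is lost independently with probability `loss_prob`, an agent counts only the opinions it receives from other agents, and ties are resolved by keeping its own opinion. The names `smp_round` and `run_smp` are hypothetical, and the quadratic per-round cost is chosen for clarity rather than efficiency.

```python
import random
from typing import List, Optional, Sequence


def smp_round(state: Sequence[int], loss_prob: float, rng: random.Random) -> List[int]:
    """One SMP round on a complete graph with i.i.d. message loss."""
    n = len(state)
    new_state = []
    for i in range(n):
        counts = [0, 0]  # received opinions '0' and '1'
        for j in range(n):
            # The message from j to i is delivered with probability 1 - loss_prob.
            if j != i and rng.random() >= loss_prob:
                counts[state[j]] += 1
        if counts[0] > counts[1]:
            new_state.append(0)
        elif counts[1] > counts[0]:
            new_state.append(1)
        else:
            new_state.append(state[i])  # tie: readopt own opinion
    return new_state


def run_smp(
    initial: Sequence[int], loss_prob: float, rounds: int, seed: Optional[int] = None
) -> List[int]:
    """Run the SMP for a fixed number of rounds and return the decided opinions."""
    rng = random.Random(seed)
    state = list(initial)
    for _ in range(rounds):
        state = smp_round(state, loss_prob, rng)
    return state
```

For example, `run_smp([0] * 60 + [1] * 40, loss_prob=0.3, rounds=3, seed=0)` returns the opinions on which the 100 agents decide after three rounds; the outcome depends on the random losses.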
Similarly to the probabilistic protocol for the two generals’ problem discussed above, the SMP does not solve consensus deterministically, but rather provides probabilistic guarantees instead. The Decision and Nontriviality properties of classical consensus are assured, since all agents decide by the end of round r and any opinion that was decided on was proposed by some agent. However, Agreement is not assured, since there always exists a nonzero probability of a run of the protocol in which message losses cause one agent to see only one opinion and another agent to see only the other, thus making them disagree. Likewise, Nontriviality of majority consensus is not guaranteed, since the majority opinion could be hidden from some agent. We will show in this article that the probability of these runs is negligible as the number of agents, n, tends to infinity, thus demonstrating that unreliable communication is not an insurmountable obstacle for coordination.
Specifically, we prove that the SMP with reaches classical consensus with probability converging to 1 as n tends to infinity. In a system of agents, let be the number of agents that are initially assigned the majority opinion minus n. For simplicity, assume the majority opinion is always the same for all n. We show that if grows at a rate of , then the SMP with reaches majority consensus with probability approaching 1 as . We also show that if grows at a rate faster than , then the SMP with reaches majority consensus with probability that converges to 1 exponentially fast.
We also show that these achievability results are, in fact, tight. We will prove that if , then communication rounds is a necessary condition, since the probability of reaching consensus with only rounds converges to 0 as . Similarly, if grows as slowly as , then rounds are a necessary condition to reach majority consensus.
1.3. Related Work
The problem of binary majority consensus was extensively researched in many different fields and contexts including autonomous systems [27,28,29,30], distributed systems [31,32,33], and information theory [34,35,36]. Almost always the problem is studied in the context of possible failure of some aspect of the network. In distributed systems, failure most often arises from agents behaving maliciously, failing to follow the protocol, or outright crashing. Consequently, protocols that solve consensus (and majority consensus by extension) are designed to tolerate a certain fraction of the set of agents failing [37,38]. Transmission faults (i.e., message loss, erasure, or addition) can be considered an extension of agent failure, but doing so may lead to false conclusions. For example, in a system of n agents, the entire system may be considered faulty even if only one message from each agent is lost. However, as shown by Santoro and Widmayer [39], the system may tolerate up to
message losses in a round and still reach consensus. Additionally, assuming a probability distribution on message loss is consistent with how network protocols are analyzed. The most notable example is that TCP throughput was shown to be inversely proportional to the square root of the link’s average packet (i.e., message) loss probability [40].
In [27,29,30,34], the authors studied the effects of message loss, random topology, Gaussian noise, and faulty agents, on the SMP’s convergence rate, i.e., the fraction of initial assignments of agent opinions (out of
) resulting in successful agreement. Specifically, in [30], computer simulations showed an improvement in the convergence rate of the SMP as the message loss probability increased up to
, after which the rate begins to decrease to zero. In contrast, we are interested in the maximal probability of failure over any initial assignment of agent opinions, since we cannot assume any distribution or frequency on the input to the consensus problem.
Mustafa and Pekeč [28] studied the requirements on the connectivity of the network such that, under the assumption of reliable communication, the SMP achieves consensus on any initial assignment of agent opinions. Their main result is that the SMP computes the majority consensus successfully only in highly-connected networks. This conclusion led us to analyze the SMP under the assumption of a fully-connected network. However, message loss may actually improve the chances of consensus in graphs with lesser degrees of connectivity, as suggested by the simulations in [30]. We leave the proof of this hypothesis to future work. Additionally, the complete graph assumption is a valid approximation for unstructured overlays in peer-to-peer networks, e.g., Freenet, Gnutella, and FastTrack [41].
Our work closely resembles the work performed in [35,36]. These articles have shown that in a lossless fully-connected network where agents poll a portion of their neighbors uniformly at random, the SMP converges quickly to majority consensus with a probability of error (in the sense that agreement was reached, but not on the majority opinion) that decays exponentially with n. While assuming the existence of infinitely many agents in a system may initially seem ludicrous and impractical, our own computer simulations of the SMP showed that these kinds of results hold true even if the number of agents is on the order of
, which is already the case in cryptocurrency protocols. We add another assumption of unreliable communication and show that this, essentially, does not change the outcome.
Yet another line of relatively recent work deserves special attention. In [42], a local polling protocol is proposed, and it is proved that it reaches consensus on the initial global majority in general graphs with certain degree properties. An estimate of the number of steps required to reach consensus is provided. In [43], similar results were given for random regular graphs. In both of these papers, it is assumed that a clear bias exists between the two initial opinions, in contrast to our main assumption in the current work, that the initial condition may be completely unbiased. In [44], the binary consensus problem was tackled from a different angle. For a random graph
with a connectivity parameter
and any given
, this work reveals what the initial difference between the two camps should be, such that the larger camp will eventually win with probability at least as high as
. In [45], the binary consensus problem was solved for relatively sparse random graphs but with random initial states, which is slightly different from the assumptions in the current work. A remarkable result was proved in [45], stating that a consensus can be reached in at most four communication rounds.
The remaining part of the paper is organized as follows. In Section 2, we establish notation conventions. In Section 3, we formalize the model, the protocol, and the objectives of this work. In Section 4, we provide and discuss the main results of this work, and in Section 5, we prove them.
3. Model, Protocol, and Objectives
Assume a set of agents, and denote their assignment of initial opinions by . The vector is called the initial state. Denote the numbers of zeros and ones in by and , respectively. At each round, each agent transmits its current state to all other agents. If a message sent between any pair of agents arrives, then it is assumed to be delivered correctly. Otherwise, if is transmitted between any pair of agents, but is lost, then the designated receiver receives the default symbol e. This assumption is only made for the purpose of making the definitions that follow clearer. For a sent message and a received message , we assume that all message losses are statistically independent and identically distributed according to and , where is the loss parameter of the network. The binary erasure channel is characterized by a similar conditional distribution, but note that the actual faults in our model are message losses, not to be confused with erasures, which are different kinds of faults. The two extreme cases of a reliable network (i.e., with ) and a completely unreliable network (i.e., with ) are of less interest, for obvious reasons; hence, we assume throughout that .
At round
, the agent
receives the (random) vector:
and for
, it calculates the enumerators:
In the SMP, each agent updates (note that we use this terminology even if the value of an agent does not change between two consecutive rounds) its value according to the more common value at hand, i.e., agent
i chooses:
The vector
is called the state at the end of round
ℓ.
A specific SMP defines a priori the number of rounds until termination. Let us denote by SMP
the SMP with
r rounds of communication until termination. We say that the SMP
attains
consensus if:
and denote this event by
. Similarly, we say that the SMP
attains
majority consensus if the following holds:
and denote this event by
.
For a specific initial state
, the probability of error in achieving consensus is defined as
. The maximal error probability with respect to the initial state is defined by:
The error probability in achieving majority consensus is defined similarly and denoted
.
Now, the first objective of this work is to prove that the SMP requires only very few rounds of communication to attain consensus, with a maximal error probability that converges to 0 when . The second objective is to determine for which initial states it is possible to also achieve majority consensus with a small probability of error.
4. Main Results
Our first main result is the following, which is proved in
Section 5.1.
Theorem 1. Let , be a sequence of initial states over agents. Assume that the agents communicate over a network with a loss parameter . Then:
If is asymmetric of order larger than , the SMP attains .
If is asymmetric of exact order of , the SMP attains .
For any , the SMP attains .
We now provide a short discussion on the results of Theorem 1.
Theorem 1 shows that the SMP requires at most three rounds of communication to attain consensus, in the limit of an infinite number of agents. Consensus on the majority cannot be ensured for all possible initial states, but only for those initial states that have a significant majority to one of the sides. To understand this fact better, consider the following special case. Assume a network with agents, such that and . Since this majority in favor of the zeros is so weak, it is most likely that the random losses in the network will completely hide it; we expect that about half of the agents will have , thus updating their current opinion to ‘0’, while the other half will update their current opinion to ‘1’. We conclude that the state at the end of round 1 is probabilistically equivalent to a sequence of fair coin tosses, and hence, with a probability of about one half, the majority at the end of round 1 will be different from the initial majority.
More quantitatively, let and , where is a non-negative, nondecreasing sequence. Moreover, for an agent with an initial opinion ‘0’, let denote the sequence of probabilities of the events that such an agent updates its opinion to ‘0’. Then, the following trichotomy is seen inside the proof of Theorem 1.
Lemma 1. The following trichotomy holds:
If , then .
If then .
If , then .
One of the most surprising facts, at least to the authors of this work, is the following. For highly symmetric initial states, although
(which is proved in
Appendix C), it turns out (see Proposition 3 in
Section 5.1) that after a single round of communication, the initial symmetry breaks equiprobably into one of the sides. Moreover, for the symmetric case of
, we prove in Propositions 3 and 4 that with a probability converging to 1, the state at the end of round 1 will be asymmetric of exact order of
. Then, according to the second point in Lemma 1, the state at the end of round two is going to have a significant majority to one of the sides, and thus, according to the third point in Lemma 1, only one more round of communication is required to achieve consensus. If the initial state is already asymmetric of exact order of
, then only two rounds of communication are needed for attaining consensus, and in this case, it is guaranteed (with high probability) that all agents agree on the initial majority opinion.
The phenomenon that the initial symmetry breaks into a sufficient majority after the first round is of key importance, since it is what makes the convergence of the SMP so rapid. In fact, we also conclude that the faulty communication between the agents even helps in attaining consensus, by breaking the symmetry in some extreme cases. For example, consider the case of and a reliable network (i.e., the case of ). Then, ad infinitum, the state at the end of any round will be symmetric. Otherwise, when losses exist according to some , this will not be the case, even if the percentage of losses is extremely small (but fixed for all n).
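The lossless case can be checked deterministically. The sketch below assumes the informal update rule from Section 1.2, in which an agent counts only the opinions received from the other agents and keeps its own opinion on a tie; the names `lossless_round` and `stays_symmetric` are hypothetical. Under this reading, every agent in a perfectly balanced state sees a one-vote majority for the opposite opinion and flips, so the state remains balanced in every round. If the agent's own opinion were counted as well, every agent would instead see a tie and keep its opinion, which also preserves the symmetry.

```python
from typing import List, Sequence


def lossless_round(state: Sequence[int]) -> List[int]:
    """One SMP round when no message is ever lost."""
    zeros = sum(1 for s in state if s == 0)
    ones = len(state) - zeros
    new_state = []
    for own in state:
        # Opinions received from all *other* agents.
        seen0 = zeros - (1 if own == 0 else 0)
        seen1 = ones - (1 if own == 1 else 0)
        if seen0 > seen1:
            new_state.append(0)
        elif seen1 > seen0:
            new_state.append(1)
        else:
            new_state.append(own)  # tie: readopt own opinion
    return new_state


def stays_symmetric(state: Sequence[int], rounds: int) -> bool:
    """Return True if the state is balanced after each of the given rounds."""
    current = list(state)
    for _ in range(rounds):
        current = lossless_round(current)
        if current.count(0) != current.count(1):
            return False
    return True
```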
A significant difference exists between the first point of Theorem 1 and its last two points, which is the following. The first point of Theorem 1 is based on Proposition 1 in
Section 5.1, which is mainly proved by using the Chernoff bound. Since the Chernoff bound is a nonasymptotic tool, we acquire a large-deviations result, i.e., for a given sequence
(with the condition
), we propose a tight upper bound on
, which holds for any finite
n (this tightness follows from the fact that a lower bound with a matching exponent can be derived as well). This result is obviously stronger than just
. On the other hand, the second and the third points of Theorem 1 are based on Propositions 2 and 3 in
Section 5.1, respectively. Since the proofs of these propositions involve central limit theorems, we merely arrive at asymptotic results. As a consequence, we do not know at what rates the probabilities in the second and the third points of Theorem 1 converge to one.
Since the results of the second and the third points of Theorem 1 are merely asymptotic, a few words on finite n effects are in order. We base the following facts on computer simulations of the SMP. On the one hand, convergence to consensus in more than three rounds is definitely possible, but only when the initial state is symmetric or almost symmetric. The reason for that is the fact mentioned above, according to which the state at round 1 is probabilistically equivalent to a sequence of fair coin tosses, and hence, the probability that the state at round 1 is again symmetric behaves asymptotically as (upper and lower bounds can be derived using Stirling’s bounds to ), which is not negligible at all, even for a relatively large number of agents. For relatively small values of n, we observed several realizations with more than a single return to a fully symmetric state. Although quite rare, these events should be taken into consideration in practical implementations.
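To give a feeling for the size of this recurrence probability, the following sketch evaluates the probability that an even number of independent fair coin tosses splits exactly evenly, together with the estimate obtained from Stirling's approximation. It relies only on the coin-toss equivalence stated above; the parameter `num_agents` (the total number of agents, assumed even) and both function names are assumptions of this illustration rather than the notation of the text.

```python
import math


def prob_even_split(num_agents: int) -> float:
    """Exact probability that num_agents fair coin tosses split evenly."""
    if num_agents <= 0 or num_agents % 2 != 0:
        raise ValueError("num_agents must be a positive even integer")
    # Central binomial coefficient divided by the number of outcomes.
    return math.comb(num_agents, num_agents // 2) / 2 ** num_agents


def stirling_estimate(num_agents: int) -> float:
    """Stirling-based estimate sqrt(2 / (pi * num_agents)) of the same probability."""
    return math.sqrt(2.0 / (math.pi * num_agents))
```

The estimate decays only as the inverse square root of the number of agents, which is consistent with the observation that returns to a symmetric state are rare but not negligible at moderate network sizes.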
All the results provided in Theorem 1 are, in fact, achievability results, i.e., they only tell under what conditions consensus can be attained. Hence, it is worth investigating whether consensus may be attained by the SMP with even fewer communication rounds than required in Theorem 1. In the following result, which is the second main result of this work and is proved in Section 5.2, we show that for highly symmetric initial states, three rounds of communication are not only sufficient, but also necessary.
Theorem 2. Let be a sequence of symmetric initial states over agents, i.e., for all n. Assume that the agents communicate over a network with a loss parameter . Then, the SMP attains .
While Theorem 2 provides a converse result with regard to the third point of Theorem 1, a similar converse result can also be established with regard to the second point of Theorem 1. If the initial state is asymmetric of exact order of , then the SMP will likely not attain consensus after only a single round of communication, and furthermore, the probability of reaching consensus will tend to 0 as . We omit the proof of this negative result.