What do leaders know?

The ability of a society to make the right decisions on relevant matters relies on its capability to properly aggregate the noisy information spread across the individuals it is made of. In this paper we study the information aggregation performance of a stylized model of a society whose most influential individuals - the leaders - are highly connected among themselves and uninformed. Agents update their state of knowledge in a Bayesian manner by listening to their neighbors. We find analytical and numerical evidence of a transition, as a function of the noise level in the information initially available to agents, from a regime where information is correctly aggregated to one where the population reaches consensus on the wrong outcome with finite probability. Furthermore, information aggregation depends in a non-trivial manner on the relative size of the clique of leaders, with the limit of a vanishingly small clique being singular.


I. INTRODUCTION
Amartya Sen [1] argues that famine and other catastrophes are easily avoided in a democracy. This argument relies on the fact that where information can freely diffuse, decision makers can form an unbiased picture of the state of a society, and take proper measures. Biases due to individual opinions are expected to be washed out in the information aggregation process, a phenomenon often referred to as the "wisdom of crowds" [2]. Still, cases of information aggregation failure abound even in democratic societies [25]. For example, in the aftermath of the 2008 Lehman Brothers bankruptcy, former Federal Reserve chairman Alan Greenspan expressed his state of "shocked disbelief" during his hearing before the US Committee of Government Oversight and Reform, leaving the public to wonder where he had obtained the information on which the Fed's policy was based. Al Gore [4] argues that shortcuts between decision makers and the media are often such that the former are not in the best position to be informed about what is going on.
A number of models of social dynamics have addressed the issue of information aggregation (see e.g. [5] for an excellent review). The simplest is probably the voter model, in which agents adopt the opinion of a randomly chosen node amongst their neighbors. This allows for sharp predictions [6,7]: generally, the information aggregation process converges to the incorrect outcome with finite probability. Other contributions have instead proposed different opinion dynamics mechanisms, such as majority rules [8] or social impact models [9], which support different conclusions. These models, however, fall short in their micro-economic foundations, as the interaction mechanism is somewhat arbitrary.
More detailed micro-economic models of social learning have been proposed in the Economics literature. It is well known that information aggregation may fail when agents free ride on the information gathered by others, without seeking independent sources. This phenomenon, called rational herding [10], is also supported by experimental evidence [11].
A series of papers has focused on Bayesian learning schemes [26][13][14][15][16], coming to the generic conclusion that when agents update their beliefs following Bayes' rule society correctly aggregates information (but see [17][18][19]). Some authors have focused on the impact of dominant groups of individuals on the aggregation of information. For example, Bala and Goyal [20] introduced the notion of "Royal Family" as a group of agents whose behavior is observed by everyone else. Alternatively, Golub and Jackson [21] defined t-step "prominent groups" as those groups whose behavior eventually influences all other agents within time t. Regardless of the specific definitions, these and other studies unanimously highlighted the negative role that exceedingly influential groups have on the information aggregation process.
In this paper we focus on an extremely stylized model of a society and we address the issue of whether information distributed across the population is able to diffuse to an uninformed, well-connected clique of decision makers. Our model assumes Bayesian learning, but differently from [13,14], who study a continuum of agents, we study a finite but large population of agents connected by a social network. On a finite network, when agents talk repeatedly with their peers, they may not be able to disentangle what in their peers' opinions is new information and what reflects information exchanged in previous interactions, including the information they themselves provided. This phenomenon, called "persuasion bias" in [22], introduces a non-trivial positive feedback and leads to information aggregation failure, at odds with the conclusions of [13,14].
The main conclusions of our paper can be summarized in two points: i) information aggregation crucially depends on the synchronicity of the information updates of different agents: in the extreme case of a parallel update dynamics, where we can derive analytic results, information diffusion leads to the correct outcome in the limit of a very large society or for very informative initial signals. When the fraction of agents who update their beliefs at each time step is lower than a critical threshold, the society converges with finite probability to the wrong outcome, no matter how large the society is. ii) In the case of parallel dynamics, information aggregation degrades as the size of the clique of uninformed agents gets smaller. In particular, the limit of a vanishingly small clique of uninformed agents behaves markedly differently from the case of a homogeneous society (with no clique). Both results suggest that it might not be wise to rely on crowds in situations which are reminiscent of those prevailing in our societies, where update is sequential and the social network is characterized by highly connected cliques (news corporations, political parties).
The paper is organized as follows. In Section II we detail the network structure and the information update rules, as well as our quantitative measure of a social network's ability to correctly aggregate information. In Section III we provide analytic results for the case of parallel information update. In Section IV we numerically compare these results with those obtained in sequential update schemes. We conclude the paper with a few concluding remarks in Section V.
A. Core-periphery network structure
As already stated, our interest is mainly focused on the information aggregation process as performed by societies where a small fraction of individuals matters much more than the vast majority of the population. In the language of networks, the most obvious measure of the importance of a node is its degree, i.e. the number of its neighbors. For this reason, throughout the rest of the paper we shall focus on a highly stylized society structure where only a few nodes have a large degree. We build it starting from a connected regular graph where all $N$ nodes have degree $c \sim O(1)$. Then, a randomly chosen set $H$ of $N_H$ non-neighboring sites are connected among themselves, thus forming a clique of nodes with degree $N_H + c - 1$ (this construction is such that each hub has exactly $c$ links connecting it to nodes outside $H$). In the following, we shall be mostly interested in the case $N_H \gg c$, i.e. when $H$ becomes a group of mutually connected hubs.
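The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a ring lattice (each node linked to its $c/2$ nearest neighbors on either side) stands in for the generic connected regular graph, and the function names `ring_lattice` and `add_clique` are hypothetical, not taken from the paper.

```python
import random

def ring_lattice(n, c):
    """Connected c-regular graph (assumption: a ring lattice replaces the generic regular graph)."""
    if c % 2 or c >= n:
        raise ValueError("c must be even and smaller than n")
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for k in range(1, c // 2 + 1):
            j = (i + k) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def add_clique(adj, n_hubs, rng):
    """Pick n_hubs mutually non-adjacent nodes at random and link them all to each other."""
    candidates = list(adj)
    rng.shuffle(candidates)
    hubs = []
    for v in candidates:
        if len(hubs) == n_hubs:
            break
        # keep v only if it is not a neighbor of an already chosen hub
        if all(v not in adj[h] for h in hubs):
            hubs.append(v)
    if len(hubs) < n_hubs:
        raise ValueError("could not find enough non-neighboring nodes")
    for a in hubs:
        for b in hubs:
            if a != b:
                adj[a].add(b)
    return hubs

rng = random.Random(0)
N, c, N_H = 200, 4, 10
adj = ring_lattice(N, c)
hubs = add_clique(adj, N_H, rng)
```

Because the hubs are chosen among non-neighboring sites, each of them keeps its $c$ original links and gains $N_H - 1$ links inside the clique, while all other nodes keep degree $c$.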

B. Initial beliefs
At time $t = 0$, each agent $i \notin H$ receives signals about event $X$ which are independently drawn from a probability distribution $P_{\bar H}\{s|X\}$. We assume these signals to be informative [13,17]:
$$P_{\bar H}\{s = X|X\} > 1/2. \tag{1}$$
On the other hand, the agents $i \in H$ (the leaders) are assumed to be initially uninformed. This means that their signals are independently drawn from a probability $P_H\{s_i|X\} = 1/2$ for $s_i, X = \pm 1$.

C. Belief update dynamics
In our model, agents repeatedly exchange information with their neighbors. In this exchange, the generic agent $i$ collects a certain number $n$ of signals that we denote by $\mathbf{s}_i = (s_i^{(0)}, s_i^{(1)}, \ldots, s_i^{(n)})$, where $s_i^{(0)}$ are the initial signals discussed above. Given this information set $\mathbf{s}_i$, by Bayes' theorem [23], the agent's state of knowledge about $X$ is quantified by the conditional probability
$$P\{X|\mathbf{s}_i\} = \frac{P\{\mathbf{s}_i|X\}\,P\{X\}}{P\{\mathbf{s}_i\}}, \tag{2}$$
where $P\{X\}$ is the prior on $X$ and $P\{\mathbf{s}_i\}$ is the probability of the signals $\mathbf{s}_i$. Notice that the likelihood ratio of $P\{X = +1|\mathbf{s}_i\}$ and $P\{X = -1|\mathbf{s}_i\}$ does not depend on $P\{\mathbf{s}_i\}$. If the agent believes that the different signals are independent, then
$$P\{\mathbf{s}_i|X\} = \prod_{k=0}^{n} P\{s_i^{(k)}|X\}, \tag{3}$$
and the logarithm of the likelihood ratio, which embodies the state of information of agent $i$, can be described by a single variable $\theta_i$ (the two values of $X$ being equally likely a priori):
$$\theta_i = \log\frac{P\{X = +1|\mathbf{s}_i\}}{P\{X = -1|\mathbf{s}_i\}} = \sum_{k=0}^{n}\log\frac{P\{s_i^{(k)}|X = +1\}}{P\{s_i^{(k)}|X = -1\}}. \tag{4}$$
At $t = 0$, agents have just one signal. Then we have $n = 0$ and, denoting by $p$ the probability that an informative signal equals $X$, the above expression reduces to the very compact form
$$\theta_i(0) = s_i \log\frac{p}{1-p}. \tag{5}$$
When two agents, say $i$ and $j$ with signals $\mathbf{s}_i$ and $\mathbf{s}_j$ respectively, meet, they communicate by exchanging signals and, as a result, their state of knowledge changes. Indeed, if agent $i$ treats the signals received from $j$ as independent of its own, the sum in equation (4) simply acquires the terms of $j$, i.e. $\theta_i \to \theta_i + \theta_j$.
Starting from an initial state of knowledge $\theta_i(t = 0)$, for $i = 1, \ldots, N$, one can think of different types of information update. Our assumption will be that at each time step $t = 1, 2, \ldots$, a certain fraction $\Phi = N_\Phi/N$ (where $N_\Phi \le N$) of randomly selected agents update their state of knowledge by listening to their neighbors. So, assuming that agents in the set $I_t = \{i_1, i_2, \ldots, i_{N_\Phi}\}$ are the ones to update their information at time $t$, one has:
$$\theta_i(t+1) = \theta_i(t) + \sum_{j=1}^{N} a_{ij}\,\theta_j(t) \quad \text{for } i \in I_t, \qquad \theta_i(t+1) = \theta_i(t) \quad \text{for } i \notin I_t, \tag{6}$$
where $a_{ij}$ is the $(i, j)$ element of the adjacency matrix $A = \{a_{ij}\}_{i,j=1,\ldots,N}$, i.e. $a_{ij} = a_{ji} = 1$ if agents $i$ and $j$ are connected and $a_{ij} = a_{ji} = 0$ if they are not. Clearly, the above dynamics has two limiting cases: $\Phi = 1/N$ and $\Phi = 1$. The former describes cases where agents update their information one at a time, and we shall refer to this particular situation as random node sequential (RNS) dynamics.
The latter case, instead, describes a parallel dynamics where all agents simultaneously update their state of knowledge. This information update rule was initially proposed in [16], and, due to its analytical tractability, represents the most frequent choice in social learning models. In the following, we shall first investigate this type of dynamics, and then explore other cases in Section IV.
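A minimal sketch of the initial beliefs and of one update step may help fix ideas. It assumes the additive rule in which each selected agent adds the current log-likelihood ratios of its neighbors to its own, all selected agents reading the values held at the beginning of the step; the function names `initial_beliefs` and `update_step` are hypothetical.

```python
import math
import random

def initial_beliefs(n, hubs, p, x_true, rng):
    """theta_i(0) = s_i * log(p / (1 - p)) for informed agents, 0 for the uninformed hubs."""
    weight = math.log(p / (1.0 - p))
    theta = []
    for i in range(n):
        if i in hubs:
            theta.append(0.0)  # a signal with P{s|X} = 1/2 carries no information
        else:
            s = x_true if rng.random() < p else -x_true
            theta.append(s * weight)
    return theta

def update_step(theta, adj, n_phi, rng):
    """One time step: n_phi randomly chosen agents listen to all of their neighbors."""
    selected = rng.sample(range(len(theta)), n_phi)
    new_theta = list(theta)
    for i in selected:
        # every selected agent reads the beliefs held at the start of the step
        new_theta[i] = theta[i] + sum(theta[j] for j in adj[i])
    return new_theta

# Tiny example: a path 0 - 1 - 2 under parallel dynamics (all three agents update).
path = {0: [1], 1: [0, 2], 2: [1]}
theta = update_step([1.0, 0.0, -2.0], path, 3, random.Random(1))
```

Setting `n_phi = 1` gives the random node sequential case, while `n_phi = len(theta)` gives the parallel one.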
The dynamics in Eq. (6) is unbounded, i.e. each $\theta_i$ will diverge either to $+\infty$ or to $-\infty$. Information aggregation properties can therefore be assessed simply by looking at the signs of the $\theta_i$s in the long run, and a good measure of information aggregation is given by the "magnetization" of the system:
$$\Theta(t) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{sign}\,\theta_i(t). \tag{7}$$
The quantity $X\Theta(t)$ tells how much of the population holds the right information on event $X$ at time $t$: it ranges from $-1$ (all agents wrong) to $+1$ (all agents right).
A quantitative measure of information aggregation is given by the probability $P\{X\Theta(t) > 0\}$ that the majority will converge to the true outcome, in an ensemble of repeated trials.

III. PARALLEL DYNAMICS
According to the parallel dynamics prescription, all agents in a social network listen to their neighbors at any time $t = 1, 2, \ldots$, and update their state of knowledge accordingly:
$$\theta_i(t+1) = \theta_i(t) + \sum_{j=1}^{N} a_{ij}\,\theta_j(t). \tag{8}$$
By collecting all $\theta_i(t)$s into a column vector $|\theta(t)\rangle$ [27], the dynamics described in equation (8) can be rewritten as
$$|\theta(t)\rangle = (\mathbb{1} + A)^t\,|\theta(0)\rangle, \tag{9}$$
where $\mathbb{1}$ is the $N \times N$ identity matrix. The above equation clearly suggests that the spectral properties of the adjacency matrix $A$ play a crucial role in the time evolution of the state of knowledge vector $|\theta(t)\rangle$. Being symmetric, the adjacency matrix $A$ has $N$ real eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_N$, whose corresponding eigenvectors $|\lambda_i\rangle$ ($i = 1, \ldots, N$) can be chosen to form an orthonormal set in $\mathbb{R}^N$. By decomposing the adjacency matrix as $A = \sum_{i=1}^{N} \lambda_i |\lambda_i\rangle\langle\lambda_i|$, one can see that, for large enough times, equation (9) becomes
$$|\theta(t)\rangle \simeq (1 + \lambda_1)^t\,\langle\lambda_1|\theta(0)\rangle\,|\lambda_1\rangle. \tag{10}$$
As is well known from the Frobenius-Perron theorem [24], all components of the eigenvector $|\lambda_1\rangle$, corresponding to the largest eigenvalue of the adjacency matrix $A$, share the same sign, which we shall assume to be positive from now on. Thus, in the light of the relation in (10), two main points become apparent:
• For large enough times $|\theta(t)\rangle$ is proportional to $|\lambda_1\rangle$, meaning that all agents on the network either learn the correct value of $X$ or they all get it wrong.
• The common sign of the components of $|\theta(t)\rangle$ is completely determined by the sign of the overlap $\langle\lambda_1|\theta(0)\rangle$, so that the probability of the whole network learning the right information reads
$$P\{X\Theta(t) > 0\} = P\{X\langle\lambda_1|\theta(0)\rangle > 0\}. \tag{11}$$
In the following we shall compute the probability (11) for the simple network topology discussed above. For the sake of simplicity, let us assume $X = +1$, so that the probability in equation (11) is equivalent to the probability of the scalar product $\langle\lambda_1|\theta(0)\rangle$ being positive, and that each agent is initially given one signal $s = \pm 1$ at time $t = 0$. Assuming that hubs, i.e. nodes in the clique $H$, have no initial information ($\theta_i(0) = 0$ for $i \in H$), such a scalar product can be written as a sum over the $N - N_H$ sites not belonging to $H$:
$$Y = \langle\lambda_1|\theta(0)\rangle = \sum_{i \notin H} \lambda_1^{(i)}\,\theta_i(0), \tag{12}$$
where $\lambda_1^{(i)}$ denotes the $i$-th component of the first eigenvector, and $\theta_i(0) = s_i \log\frac{p}{1-p}$ (see equation (5)). A good approximation scheme to estimate the probability of the quantity in equation (12) being positive is via the central limit theorem: as a matter of fact the scalar product in (12) is the sum of $N - N_H$ random variables, each given by the product of two random variables: $y_i = \lambda_1^{(i)}\theta_i(0)$. Thus, the probability of $Y$ in equation (12) being positive is approximately given by
$$P\{Y > 0\} \simeq \frac{1}{2}\,\mathrm{erfc}\!\left(-\frac{\mu_Y}{\sqrt{2}\,\sigma_Y}\right), \tag{13}$$
where $\mu_Y$ and $\sigma_Y$ denote the mean and standard deviation, respectively, of the random variable $Y$. Given the independence of the $\theta_i$s and the eigenvector components $\lambda_1^{(i)}$, these two quantities are given by
$$\mu_Y = (N - N_H)\,\mu_\theta\,\mu_\lambda, \qquad \sigma_Y^2 = (N - N_H)\left[\sigma_\theta^2\sigma_\lambda^2 + \sigma_\theta^2\mu_\lambda^2 + \mu_\theta^2\sigma_\lambda^2\right], \tag{14}$$
where $\mu_\theta$ and $\sigma_\theta$ denote the mean and standard deviation of the random variables $\theta_i$, whereas $\mu_\lambda$ and $\sigma_\lambda$ denote the mean and standard deviation of the eigenvector components $\lambda_1^{(i)}$ for $i \notin H$. Computing $\mu_\theta$ and $\sigma_\theta$ is easy. Recalling that signals must be informative (see equation (1)), one has $p = P\{s = +1|X = +1\} > 1/2$. Let us rewrite such probability as $p = (1 + x)/2$ with $x \in (0, 1)$.
Then, one can immediately verify that
$$\mu_\theta = x \log\frac{1+x}{1-x}, \qquad \sigma_\theta = \sqrt{1 - x^2}\,\log\frac{1+x}{1-x}. \tag{15}$$
As regards $\mu_\lambda$ and $\sigma_\lambda$, good approximate expressions for them can be computed by employing standard perturbation theory up to second order (see Appendix A for the details). To leading order in $N$ one gets [Eq. (16)], where $f = N_H/N$ denotes the fraction of hubs in the network. As can be seen from the inset in Fig. 1, the above approximations are in excellent agreement with results obtained from numerical diagonalization of adjacency matrices, especially for large network sizes. Plugging equations (15) and (16) into equation (14), one can eventually compute the probability of converging to the right value of event $X$ as in equation (13) [Eq. (17)].
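The central limit estimate described above can be evaluated numerically. The sketch below assumes that the $\theta_i(0)$ and the eigenvector components are independent, so that the variance of each product follows from the standard formula for independent variables; the function names are hypothetical, and the moments of the eigenvector components must be supplied by the user, since their approximate expressions are not reproduced here.

```python
import math

def theta_moments(x):
    """Mean and standard deviation of theta_i(0) = s_i * log(p/(1-p)), with p = (1+x)/2."""
    weight = math.log((1.0 + x) / (1.0 - x))
    return x * weight, math.sqrt(1.0 - x * x) * weight

def prob_positive(n_terms, mu_theta, sigma_theta, mu_lambda, sigma_lambda):
    """Gaussian estimate of P{Y > 0}, Y being a sum of n_terms independent products lambda_i * theta_i."""
    mu_y = n_terms * mu_theta * mu_lambda
    # Var(lambda * theta) = E[lambda^2] E[theta^2] - (E[lambda] E[theta])^2 for independent factors
    var_y = n_terms * ((sigma_theta**2 + mu_theta**2) * (sigma_lambda**2 + mu_lambda**2)
                       - (mu_theta * mu_lambda) ** 2)
    return 0.5 * math.erfc(-mu_y / math.sqrt(2.0 * var_y))

# Example: a regular graph, whose first eigenvector is uniform (all components 1/sqrt(N)).
n, x = 100, 0.16
mu, sd = theta_moments(x)
p_regular = prob_positive(n, mu, sd, 1.0 / math.sqrt(n), 0.0)
```

In the regular-graph example the eigenvector components do not fluctuate, and the estimate reduces to $\frac{1}{2}\mathrm{erfc}\big(-x\sqrt{N}/\sqrt{2(1-x^2)}\big)$.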
As already stated, we are mostly interested in cases where only a few nodes in the network play the role of hubs, i.e. $f \ll 1$: in this case the probability in equation (17) further simplifies to the following remarkably simple expression:
$$P\{Y > 0\} \simeq \frac{1}{2}\,\mathrm{erfc}\!\left(-x\sqrt{\frac{N f c}{2}}\right). \tag{18}$$
In Fig. 1 [...]
A few comments are in order on the approximate result of equation (18). Since $\mathrm{erfc}(-2)/2 \simeq 1$, according to equation (18) for each system size $N$ correct information aggregation happens with probability that for all practical purposes can be considered equal to 1 when the informativeness of the initial signals is $p \ge p_0 = (1 + x_0)/2$, where $x_0 = 2\sqrt{2/(N f c)}$. This point essentially means that for any population size $N$ correct information aggregation is possible, for informative enough initial signals, despite the presence of a fraction $f$ of dominant nodes. Such a result shows that the presence of a group of individuals with large influence does not necessarily jeopardize correct information aggregation. Moreover, the threshold value $x_0$ is inversely proportional to $\sqrt{N}$, meaning that large populations will be able to aggregate information correctly as soon as signals are informative, i.e. as soon as $p$ is slightly larger than 1/2. This is essentially a stronger statement of previous results obtained for infinite networks (see for example [17]), where the presence of signals with arbitrarily large informativeness, combined with the lack of individuals with unbounded influence, is identified as a sufficient condition for correct information aggregation. On the other hand, for $p < p_0$ the population reaches consensus on the wrong value of $X$ with non-zero probability.
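The threshold argument can be checked with a short numerical sketch. It assumes that the small-clique estimate takes the form $\frac{1}{2}\mathrm{erfc}\big(-x\sqrt{Nfc/2}\big)$, which is the form consistent with the threshold $x_0 = 2\sqrt{2/(Nfc)}$ and with the $\mathrm{erfc}(-2)/2$ criterion quoted above; the function names and the parameter values are illustrative only.

```python
import math

def p_small_clique(x, n, f, c):
    """Assumed small-f estimate of the probability of correct aggregation."""
    return 0.5 * math.erfc(-x * math.sqrt(n * f * c / 2.0))

def x_threshold(n, f, c):
    """Informativeness x_0 above which aggregation is correct for all practical purposes."""
    return 2.0 * math.sqrt(2.0 / (n * f * c))

n, f, c = 1000, 0.01, 4
x0 = x_threshold(n, f, c)
p_at_threshold = p_small_clique(x0, n, f, c)   # equals erfc(-2)/2 by construction
p_tiny_clique = p_small_clique(x0, n, 1e-9, c)  # approaches 1/2 as f shrinks
```

Quadrupling `n` at fixed `f` and `c` halves `x_threshold`, which is the $1/\sqrt{N}$ scaling discussed in the text.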
A very interesting role in the information aggregation process is played by the fraction of hubs $f$. In Fig. 2, one can see how, for a fixed system size $N$, the probability of correct information aggregation behaves when increasing the fraction $f$ of hubs in the network. It is also instructive to compare such results with the information aggregation capabilities of a regular graph where all nodes have the same degree $c$. In such a case, one can immediately verify that the first eigenvector of the adjacency matrix is uniform with all components equal to $1/\sqrt{N}$, and the probability of the scalar product in equation (12) being positive simply reduces to the probability of the sum $\sum_{i=1}^{N}\theta_i(0)$ being positive. Therefore, one can compute the probability of correct information aggregation of a regular network with easy central limit theorem considerations, analogous to those already presented in this Section. Such a probability does not depend on $c$ and reads [Eq. (19)], where the last approximation holds for large values of $N$. As one can see, equation (18) reduces to the above expression for $f = c^{-1}$ (though numerically one does not find perfect agreement between the two, since equations (17) and (18) represent good approximations only for very low values of $f$). So, the lesson to be learned from the plots in Fig. 2 is twofold. First, one can see that as soon as a very small clique of uninformed hubs is introduced in a regular graph the overall population's ability to correctly aggregate information decreases sharply.
This can also be understood by observing that the probability in equation (17) does not recover the regular network (RN) result (19) when considering vanishingly small fractions of hubs [Eq. (20)]. On the other hand, whenever a clique of hubs is present in the network, information aggregation can actually be improved by increasing the size of the clique itself, up to the point (for $f \sim c^{-1}$) where the aggregation ability of the original regular graph can almost be reproduced.
Intuitively, the above findings can be understood in the following terms. According to our setting, all hubs in the clique $H$ are mutually connected and have a degree equal to $N_H + c - 1$. This means that each hub has exactly $c$ neighbors outside $H$, so that one can expect roughly $cN_H = cfN$ nodes to fall within the clique's neighborhood $\partial H$. So, for very low values of $f$, $\partial H$ contains a negligibly small number of nodes, which, however, will largely influence the initially uninformed hubs whenever they communicate for the first time. Given the small size of $\partial H$, its initial state of knowledge will be much more sensitive to fluctuations in the distribution of initial signals among agents. On the other hand, when $f \sim c^{-1}$, the number of nodes in the neighborhood of $H$ becomes of order $N$, hence much more robust with respect to fluctuations.
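The singular nature of the limit can be illustrated with a short worked comparison. This is only a sketch: it assumes the small-$f$ estimate $\frac{1}{2}\mathrm{erfc}\big(-x\sqrt{Nfc/2}\big)$ (the form consistent with the threshold $x_0 = 2\sqrt{2/(Nfc)}$ quoted earlier) and, for the regular network, the Gaussian estimate that follows from the mean and variance of $\sum_i \theta_i(0)$; the symbol $P_{\mathrm{RN}}$ is introduced here for convenience.

```latex
\lim_{f \to 0}\ \frac{1}{2}\,\mathrm{erfc}\!\left(-x\sqrt{\frac{N f c}{2}}\right)
  = \frac{1}{2}\,\mathrm{erfc}(0) = \frac{1}{2},
\qquad
P_{\mathrm{RN}} \simeq \frac{1}{2}\,\mathrm{erfc}\!\left(-\frac{x\sqrt{N}}{\sqrt{2\,(1-x^{2})}}\right) > \frac{1}{2}
\quad \text{for } x > 0 .
```

Under these assumptions a vanishing clique leaves the population no better than a coin toss, whereas the clique-free regular graph aggregates correctly with a probability that approaches 1 as $N$ grows.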
In summary, the role of hubs in our model is subtle, as a handful of them is enough to heavily damage the good information aggregation properties of a population of equals (as modeled by a regular graph), whereas increasing their number also has "healing" effects which can restore such good properties.

IV. GENERAL DYNAMICS
So far, we have only considered the most popular and widely used evolution rule for information propagation on a network, i.e. the parallel dynamics introduced in equation (8). However, as already discussed in Section II, parallel dynamics ($\Phi = 1$) represents one of the two extreme cases of the general dynamics (6), according to which a fraction $\Phi$ of agents listens to their neighbors at each time step $t$. The other extreme case is the already mentioned RNS dynamics ($\Phi = 1/N$), according to which agents update their state of knowledge one at a time.
Numerical simulations highlight significant differences in a social network's ability to aggregate information correctly under parallel or RNS dynamics, the latter performing much worse than the former: as shown in the left panel of Fig. 3, the probability of correct information aggregation under parallel dynamics exceeds the one obtained under RNS dynamics over a wide range of signal informativeness levels [28]. Moreover, results obtained via RNS dynamics show no relevant dependence on the system size $N$.
The above findings suggest looking for a transition in information aggregation as a function of the number of agents that update their state of knowledge at a given time step, by letting the parameter $\Phi$ take values over the whole interval $[1/N, 1]$. In the right panel of Fig. 3 we plot the probability $P$ of correct information aggregation as a function of $\Phi$ for different system sizes and a fixed informativeness level of the signals initially distributed to agents (the qualitative appearance of the results does not change when considering different levels of informativeness). As can be seen, for increasing values of $\Phi$ a transition is observed towards better information aggregation capabilities for all system sizes. This can essentially be interpreted in terms of the speed of information update. As one could expect, RNS dynamics is extremely slow compared to parallel dynamics (depending on the system size, we find on average that RNS dynamics reaches consensus in times that are 3-4 orders of magnitude larger than the ones required by parallel update), hence more prone to allowing the spreading of misleading signals in the agents' initial distribution. On the other hand, parallel dynamics is fast, in such a way that in a few time steps each agent receives through his or her neighbors aggregated information coming from the whole network.
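A numerical experiment of this kind can be sketched as follows. This is not the simulation code behind Fig. 3 but a minimal illustration under several assumptions: a ring lattice stands in for the regular graph, selected agents add their neighbors' current log-likelihood ratios to their own, beliefs are rescaled when they grow very large (which does not affect their signs), and the true state is $X = +1$. All function names and parameter values are hypothetical.

```python
import math
import random

def build_network(n, c, n_hubs, rng):
    """Ring lattice of even degree c plus a clique of n_hubs non-adjacent nodes (illustrative)."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for k in range(1, c // 2 + 1):
            adj[i].add((i + k) % n)
            adj[(i + k) % n].add(i)
    hubs = []
    for v in rng.sample(range(n), n):
        if len(hubs) == n_hubs:
            break
        if all(v not in adj[h] for h in hubs):
            hubs.append(v)
    if len(hubs) < n_hubs:
        raise ValueError("could not find enough non-neighboring nodes")
    for a in hubs:
        adj[a].update(h for h in hubs if h != a)
    return adj, set(hubs)

def run_once(n, c, n_hubs, x, phi, rng, max_steps=100000):
    """Return True if the population ends up agreeing on the true value X = +1."""
    adj, hubs = build_network(n, c, n_hubs, rng)
    p = (1.0 + x) / 2.0
    w = math.log(p / (1.0 - p))
    theta = [0.0 if i in hubs else (w if rng.random() < p else -w) for i in range(n)]
    n_phi = max(1, round(phi * n))
    for _ in range(max_steps):
        if all(t > 0 for t in theta):
            return True
        if all(t < 0 for t in theta):
            return False
        old = list(theta)
        for i in rng.sample(range(n), n_phi):
            theta[i] = old[i] + sum(old[j] for j in adj[i])
        scale = max(abs(t) for t in theta)
        if scale > 1e100:  # rescale to avoid overflow; signs are unchanged
            theta = [t / scale for t in theta]
    # no consensus within max_steps: report the current majority instead
    return sum(1 for t in theta if t > 0) > n / 2

def estimate(n, c, n_hubs, x, phi, trials, seed=0):
    """Fraction of independent trials ending in consensus on the true value."""
    rng = random.Random(seed)
    return sum(run_once(n, c, n_hubs, x, phi, rng) for _ in range(trials)) / trials
```

Calling `estimate` for a range of `phi` between `1/n` and `1` gives a rough counterpart of the curve described above; small values of `phi` require many more steps per trial, so system sizes and trial counts should be chosen accordingly.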

FIG. 3. [...] As can be seen, RNS dynamics does not show any significant dependence on the system size, and performs much worse than parallel dynamics at correctly aggregating information. RIGHT: Probability of correct information aggregation as a function of the fraction $\Phi$ of agents that listen to their neighbors at each time step. The extreme cases $\Phi = 1/N$ and $\Phi = 1$ correspond, respectively, to RNS and parallel dynamics. All data were obtained for signal informativeness fixed as $x = 0.16$. In both plots all data points are obtained by averaging over $10^4$ independent network configurations.

V. CONCLUSIONS
In summary, we have presented a stylized dynamic network model of information diffusion throughout a large society featuring a small fraction of uninformed leaders. The model's simplicity allows for an analytical treatment in some cases. Namely, when assuming all agents to simultaneously update their state of knowledge on a given issue, we are able to provide a closed-form expression for the probability of correct information aggregation as a function of the system size, i.e. the number of agents in the society, and the fraction of individuals playing the role of hubs. Our results partially overlap with previous works from the social learning literature in Economics, as we show that larger populations are better, on average, at aggregating information. On the other hand, we provide novel results on the role played by the size of an uninformed élite, portrayed in our model by a clique of nodes that do not own any prior information on the issue being discussed by the population. First, we show a rather counterintuitive result, i.e. that increasing the relative size (compared to the overall population) of such an uninformed élite actually helps the information aggregation process. Moreover, we show that letting the fraction of hubs go to zero does not recover the results obtained for the corresponding hub-free regular network.
Rather interestingly, we also show our model to be sensitive to the information update speed, as defined by the fraction of agents who simultaneously revise their information at each time step, by showing the existence of a transition towards better information aggregation capabilities when moving from the low speed towards the high speed regime.

APPENDIX A
When assuming hubs to be identified by nodes $1, \ldots, N_H$, the network adjacency matrix $A$ takes the following block form:
$$A = \begin{pmatrix} I & G \\ G^{T} & C \end{pmatrix}. \tag{A1}$$
In the above equation, $I$ is an $N_H \times N_H$ block such that $I_{ij} = 1$ for $i \neq j$ and $I_{ii} = 0\ \forall i$. The off-diagonal block $G$ is of size $N_H \times (N - N_H)$, and it accounts for the links between the clique $H$ and its neighbors, i.e. $G_{ij} = 1$ if $i \in H$ and $j \in \partial H$ are connected, and zero otherwise. Lastly, the block $C$ is of size $(N - N_H) \times (N - N_H)$, and it accounts for links between nodes that do not belong to $H$. Spectral properties of the adjacency matrix $A$, expressed in block form as in equation (A1), can be deduced from standard perturbation theory. As a matter of fact, such a matrix can be decomposed as $A = A_H + \tilde{A}$, where
$$A_H = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}, \qquad \tilde{A} = \begin{pmatrix} 0 & G \\ G^{T} & C \end{pmatrix}. \tag{A2}$$
For small values of $c$ (i.e. the degree of nodes outside of $H$), the matrix $\tilde{A}$ above is sparse and can be interpreted as a perturbation to the matrix $A_H$ describing the fully connected clique $H$ plus a sea of $N - N_H$ disconnected nodes. Let us denote the eigenvalues and eigenvectors of the "unperturbed" adjacency matrix $A_H$ as $\lambda_{i,H}$ and $|\lambda_{i,H}\rangle$. They fall within three categories:
• The largest eigenvalue reads $\lambda_{1,H} = N_H - 1$, and its normalized eigenvector $|\lambda_{1,H}\rangle$ has the first $N_H$ components equal to $1/\sqrt{N_H}$ and the remaining $N - N_H$ ones equal to zero.
• $\lambda_{i,H} = -1$ for $i = 2, \ldots, N_H$, with eigenvectors having non-zero components only in the first $N_H$ sites.
• $\lambda_{i,H} = 0$ for $i = N_H + 1, \ldots, N$, with eigenvectors that can simply be chosen as having all components equal to zero except for the $i$-th component being equal to one.
Let us then approximate the first eigenvector of the full adjacency matrix $A$ as
$$|\lambda_1\rangle \simeq |\lambda_{1,H}\rangle + |\lambda_1'\rangle + |\lambda_1''\rangle, \tag{A3}$$
where $|\lambda_1'\rangle$ and $|\lambda_1''\rangle$ denote the first and second order corrections, respectively, to the unperturbed eigenvector $|\lambda_{1,H}\rangle$. The first order correction only involves neighbors of the clique $H$, and it reads [Eq. (A4)], where $n_i = \sum_{j \in H} a_{ij}$ represents the number of neighbors that node $i$ has within the clique $H$. The second order correction [29] involves neighbors of the nearest neighbors of the clique $H$ [Eq. (A5)], where $\partial(\partial H)$ denotes the set of next to nearest neighbors of the clique $H$, whereas $n_i' = \sum_{j \in \partial H} a_{ij}$ is the number of neighbors that node $i$ has amongst neighbors of the clique $H$.
In order to perform exact calculations up to second order, one should in principle compute the expected number of nodes belonging to $\partial H$ and $\partial(\partial H)$, and the expected values of the quantities $n_j$ in (A4) and $n_j'$ in (A5) by averaging over all possible network configurations built as explained in Section II for given $N$, $N_H$ and $c$. However, in order to keep things simple, let us just assume that each node in $\partial H$ has just one neighbor in the clique $H$, and, in a similar fashion, that each node in $\partial(\partial H)$ has just one neighbor in $\partial H$, which amounts to setting $n_i = 1$, $\forall i \in \partial H$, and $n_j' = 1$, $\forall j \in \partial(\partial H)$. Clearly, both such approximations work well as long as the number of nodes in $H$ is small compared to $N$, i.e. for $f \ll 1$ where $f = N_H/N$. According to the above approximations, the $N - N_H$ nodes not belonging to $H$ yield the following components in $|\lambda_1\rangle$, as computed with equation (A3) [Eq. (A6)]. Therefore, the mean $\mu_\lambda$ and standard deviation $\sigma_\lambda$ (see equation (14)) can be computed as follows [Eq. (A7)], and the approximations in equation (16) can be immediately derived as leading order results in $N$ of the above expressions.
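For orientation, the first order correction described above can be sketched with the textbook non-degenerate perturbation formula. This is a sketch rather than a reproduction of the paper's own expressions: the unit vectors $|e_i\rangle$ (one on site $i$, zero elsewhere) are notation introduced here, and the result relies on the fact that the perturbation has no entries inside the clique block, so that the eigenvectors with eigenvalue $-1$, which live on the clique sites only, do not contribute.

```latex
|\lambda_1'\rangle
  = \sum_{k \neq 1}
    \frac{\langle \lambda_{k,H} | \tilde{A} | \lambda_{1,H} \rangle}{\lambda_{1,H} - \lambda_{k,H}}\,
    |\lambda_{k,H}\rangle
  = \frac{1}{\sqrt{N_H}\,(N_H - 1)} \sum_{i \in \partial H} n_i\, |e_i\rangle ,
\qquad
\langle e_i | \tilde{A} | \lambda_{1,H} \rangle
  = \frac{1}{\sqrt{N_H}} \sum_{j \in H} a_{ij} = \frac{n_i}{\sqrt{N_H}} .
```

With the simplification $n_i = 1$, every neighbor of the clique then carries a first order weight of $1/\big(\sqrt{N_H}\,(N_H - 1)\big)$, which is small compared with the weight $1/\sqrt{N_H}$ carried by each hub when $N_H \gg 1$.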