Locating Multiple Sources of Contagion in Complex Networks under the SIR Model

Abstract: Simultaneous outbreaks of contagion are a great threat to human life and cause great panic in society. An efficient method for locating multiple sources is urgently needed in order to study the pathogenic mechanism and minimize the harm. However, our ability to locate multiple sources is strictly limited by incomplete information about nodes and the inescapable randomness of the propagation process. In this paper, we present the Potential Concentration Label method, which helps locate multiple sources of contagion faster and more accurately in complex networks under the SIR (Susceptible-Infected-Recovered) model. After assigning a label to each node, we find the nodes with the maximal label value after several iterations. The experiments demonstrate that our method locates multiple sources with high accuracy. As the number of sources increases, the accuracy of our method declines gradually, whereas it fluctuates only slightly when the average degree and network scale change. Moreover, our method keeps a high localization accuracy under noise of various intensities, which shows its strong anti-noise ability. We believe that our method provides a new perspective for accurate and fast multiple sources localization in complex networks.


Introduction
Contagion propagation [1][2][3] is an important topic in social network research [4]. It causes huge damage to society [5][6][7] and has drawn much attention from researchers. When severe contagion outbreaks occur in several important places simultaneously, decision-makers, including government leaders, are hardly able to handle the disaster well. Here is a real case. In 2009, the H1N1 pandemic spread almost simultaneously in Beijing, Shanghai, Fujian and Guangdong, which acted as the sources propagating the virus in China, and then across the country. After several months of medical treatment, the number of infected people declined gradually and the infection disappeared. If we had acquired the infection state of the whole country at an early stage, we could have identified the cities that initially spread the virus and taken practical action to prevent it from spreading further. Furthermore, for each originally infected city, appropriate measures could have been adopted to eliminate the disease shortly after it began to spread. Therefore, it makes sense to design a pre-warning system whose core is a fast and accurate method for locating multiple sources of contagion in complex networks under the SIR [8,9] model.
Despite the many studies in this field, multiple sources localization remains a challenging problem. On the one hand, the exact number of sources and their initial times remain unknown; on the other hand, the randomness of the propagation process inevitably decreases the accuracy of multiple sources localization methods. In the past ten years, the multiple sources localization problem has attracted many researchers' attention. Shah et al. [10] first proposed rumor centrality as a maximum likelihood estimator of a single source in a tree network under the SI model. Then Lou et al. [2] extended it to a multiple sources localization method. They proposed a Multiple Sources Estimation and Partitioning algorithm, the key to which lies in dividing the network into several disjoint infection regions by infection partitioning, where one region corresponds to one source. Zhu et al. [11] defined Jordan centrality as the farthest distance from a node to all the infected nodes; the sources are the nodes with the smallest Jordan centrality. In addition, there is a similar concept, namely distance centrality, which is the sum of the distances from a node to all the infected nodes; analogously, the sources have the smallest distance centrality. Fioriti et al. [12] proposed calculating the dynamical age of each node based on its dynamical importance. They took the drop in the eigenvalue of the adjacency matrix when a node is removed from the network as its dynamical age, and the sources are the nodes with the highest dynamical age. Based on the SIR model and incomplete node information, Zang et al. [13] proposed an advanced unbiased betweenness algorithm. They used a reverse propagation algorithm to build an extended infection graph and marked off several infection subgraphs, in each of which the source is identified as the node with the highest unbiased betweenness. Moreover, Wang et al. [14] presented a label propagation based source identification algorithm, in which they set up initial labels, propagated the labels and chose the nodes with the highest labels as the sources. Hu et al. [15] combined a backward diffusion-based method with IP to locate both the sources and the initial diffusion time with a limited number of observers.
In addition, some scholars study the source localization problem from different angles [16][17][18][19][20]. Nino Antulov-Fantulin et al. [21] proposed a new source localization method based on Monte Carlo simulation under the SIR model. Fu et al. [22] studied a backward diffusion-based source localization method. Based on the times at which the diffusion reaches a subset of observers, the maximum time for the diffusion to travel in reverse from these observers to each node is calculated. Then the node with the minimum value is selected and recognized as the source. Huang et al. [23] let the diffusion run in reverse from the observers and identified the node that yields the minimum variance as the source, resolving the single source localization problem in temporal networks.
In this paper, we propose the Potential Concentration Label (PCL) method to locate multiple sources of contagion in complex networks under the SIR model. The main idea is that the sources tend to lie in infection regions with more infected neighbor nodes, where they have exactly the maximal value of the potential concentration label. In the following sections, we first define the potential concentration label and propose the PCL method. After that, we test the effect of network parameters on sources localization accuracy in synthetic networks and real networks, in comparison with four benchmark methods. Finally, some experiments are carried out to measure the anti-noise ability of our method.

SIR Model for Contagion Propagation
In this work, we focus on an undirected graph G = {V, E}, where V is the set of nodes and E is the set of edges. Each node v ∈ V is in one of three possible states: Susceptible (S), Infected (I) or Recovered (R). The susceptible nodes represent people who can be infected but have not been infected yet, while the infected nodes denote people who have already been infected and are capable of infecting other nodes. The recovered nodes are individuals who have become immune or have died. Suppose that time is slotted. At first, only several nodes are infected, which are the contagion sources in the network, and all other nodes are susceptible. At each time step, each infected node infects each of its susceptible neighbors independently with probability p; that is, a susceptible node with n infected neighbors is infected with probability $1 - (1 - p)^n$. Meanwhile, each infected node recovers with probability q. Recovered nodes, which may have died or been removed, cannot be infected again.
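The propagation rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulator used in the experiments: the function names `sir_step` and `simulate_sir`, the dictionary-based graph representation and the synchronous update (every node is updated from the states of the previous time step) are assumptions made here.

```python
import random


def sir_step(adj, state, p, q, rng):
    """Advance the SIR process by one time step (synchronous update, assumed).

    adj   : dict mapping each node to an iterable of its neighbors
    state : dict mapping each node to "S", "I" or "R"
    p, q  : infection and recovery probabilities
    """
    new_state = dict(state)
    for v, s in state.items():
        if s == "S":
            n = sum(1 for u in adj[v] if state[u] == "I")
            # Each infected neighbor succeeds independently with probability p,
            # so v becomes infected with probability 1 - (1 - p)^n.
            if rng.random() < 1 - (1 - p) ** n:
                new_state[v] = "I"
        elif s == "I":
            # An infected node recovers with probability q.
            if rng.random() < q:
                new_state[v] = "R"
        # Recovered nodes never change state again.
    return new_state


def simulate_sir(adj, sources, p, q, steps, seed=0):
    """Start with only the sources infected and run `steps` time steps."""
    rng = random.Random(seed)
    state = {v: "S" for v in adj}
    for s in sources:
        state[s] = "I"
    for _ in range(steps):
        state = sir_step(adj, state, p, q, rng)
    return state
```

For instance, `simulate_sir(adj, sources=[0], p=0.8, q=0.1, steps=3)` returns the full S/I/R state of every node after three time steps, from which an early snapshot can be taken.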

Problem Formulation
As a contagion propagates through a complex network under the SIR model, the nodes change their infection states over time. Susceptible nodes may be infected by infected neighbor nodes, and infected nodes recover with a certain probability. Because the response to a contagion is urgent, we mainly consider the infection situation of the whole network at an early stage and collect only two states for all nodes, infected and uninfected (susceptible or recovered). Accordingly, the multiple sources localization problem can be described as follows: given a single snapshot of the network at a certain early moment, how can we accurately locate multiple sources?
It is common that we know the state of almost all nodes but cannot distinguish the susceptible nodes from the recovered nodes. Therefore, all nodes are divided into two states, infected and uninfected, which reduces the accuracy of multiple sources localization.
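The loss of information in the snapshot can be made concrete with a small sketch. The helper name `observe` and the string encoding of the states are hypothetical choices for illustration only.

```python
def observe(state):
    """Collapse full SIR states into the binary snapshot X:
    1 for infected nodes, 0 for susceptible or recovered nodes."""
    return {v: 1 if s == "I" else 0 for v, s in state.items()}


# Two different true states that are indistinguishable in the snapshot:
# node "b" is still susceptible in one and already recovered in the other.
early = {"a": "I", "b": "S", "c": "I"}
late = {"a": "I", "b": "R", "c": "I"}
same_snapshot = observe(early) == observe(late)
```

Here `same_snapshot` is true, which is exactly why a recovered source looks like a node that was never infected.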

Potential Concentration Label Definition
In the early period of severe contagion propagation, the disease breaks out and spreads through a crowd quickly. As a result, the nodes around the sources are more likely to be infected; that is, the sources are surrounded by many infected nodes. We have to locate the sources in a complex network accurately by relying only on the infection states.
Figure 1a shows the concentration of a pollutant in a network. It is clear that the sources are most likely the node set {d, k}, whose concentration is the highest (10). In practice, obtaining the state of each node is not easy. For example, some sensors cannot measure concentrations and can only judge whether the concentration surpasses a threshold value, and even then the concentration information may be lost. Therefore, the information we can obtain is incomplete, as in Figure 1b. There are only two concentration states in the network, 0 or 1 (1 denotes a concentration over 8, 0 denotes a concentration under 8), and an error occurred at node c. It seems that we cannot identify the sources from these concentrations alone, which is similar to the infection situation of a contagion. Therefore, a new index is needed to distinguish the sources from other nodes for incomplete pollutant diffusion and contagion propagation. We argue that a node with more infected neighbors, including first order neighbors, second order neighbors and so forth, is closer to the sources. Based on the above analysis, we propose a new concept, the potential concentration label, denoted by L. The potential concentration label of a node is determined by its initial label and the labels of its neighbor nodes. The experiments demonstrate that it is a good index for locating multiple sources of contagion in complex networks under the SIR model.

The PCL Method
In this section, we present the PCL method in detail. The purpose of PCL is to locate multiple contagion sources, which is realized by the following four steps, summarized in Algorithm 1.

Step 1: Label Assignment in the Snapshot of Network
Due to the incomplete information, only two states can be seen in the network: infected and uninfected (susceptible or recovered). The infection state X of the nodes is encoded as follows: infected nodes carry the virus, denoted by 1; uninfected nodes carry no virus, denoted by 0. That is,
$$L_i^0 = \begin{cases} 1, & \text{if node } i \text{ is infected} \\ 0, & \text{otherwise} \end{cases}$$
where $L_i^0$ is the initial potential concentration label of node i.

Step 2: Adding One Hub Node to the Network
In real networks, it often happens that the network we acquire is disconnected although the actual network is connected. To avoid this situation, we add a hub node that has a link to every node, which guarantees that the network is connected and increases its connectivity. Besides, the probability of this node being infected is so high that we assign label 1 to it directly.

Step 3: Potential Concentration Label Calculation by Iteration
The potential concentration label of a node depends on the potential concentration labels of its neighbor nodes and on its own initial potential concentration label, so the potential concentration label of node i after t iterations becomes
$$L_i^t = \alpha \sum_{j \in \Gamma_i} T_{ij} L_j^{t-1} + \beta L_i^0$$
where α and β are proportionality coefficients, $T_{ij}$ is the entry of the transmission probability matrix T, and $\Gamma_i$ represents the first order neighbors of node i.
Before starting the iteration, we build an adjacency matrix A and a degree matrix D. Matrix A is determined by the edge set E, where $A_{ij} = 1$ means that node i and node j share an edge. Matrix D is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of A. The transmission probability matrix T from neighbors is determined by A and D as $T = D^{-1/2} A D^{-1/2}$. The state of a node at moment t mainly depends on the states of its neighbor nodes at moment t − 1. The potential concentration label of each node at moment t is also proportional to its initial potential concentration label. Therefore, we choose α > 0, β > 0.
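Entry by entry, the normalization $T = D^{-1/2} A D^{-1/2}$ gives $T_{ij} = A_{ij} / \sqrt{d_i d_j}$, where $d_i$ is the degree of node i. The sketch below computes these weights for a graph stored as an adjacency dictionary; the function name `transmission_weights` and the sparse dict-of-dicts layout are assumptions for illustration, and the graph is assumed to be simple (no self-loops or repeated edges).

```python
import math


def transmission_weights(adj):
    """Return T as a dict of dicts with T[i][j] = 1 / sqrt(d_i * d_j)
    for every edge (i, j); pairs without an edge are simply absent (weight 0)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    return {
        i: {j: 1.0 / math.sqrt(deg[i] * deg[j]) for j in adj[i]}
        for i in adj
    }


# A star with center 0 and four leaves: d_0 = 4 and d_leaf = 1,
# so every edge weight is 1 / sqrt(4 * 1) = 0.5.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
T = transmission_weights(star)
```

The matrix is symmetric by construction, so `T[0][1]` and `T[1][0]` are equal.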
Thanks to the hub node, the diameter of the network decreases to two. That is to say, every other node is either a first order or a second order neighbor of any given node. Therefore, a node acquires the label information from all other nodes in the network within only two iterations, so computing the potential concentration label takes little time.
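The effect of the hub node on the diameter can be checked directly with breadth-first search. The following sketch uses hypothetical helper names (`add_hub`, `diameter`) and is meant only to illustrate the claim on small graphs.

```python
from collections import deque


def add_hub(adj, hub="hub"):
    """Return a copy of the graph with one extra node linked to every node."""
    if hub in adj:
        raise ValueError("hub name is already used by a node")
    new_adj = {v: set(nbrs) | {hub} for v, nbrs in adj.items()}
    new_adj[hub] = set(adj)
    return new_adj


def diameter(adj):
    """Longest shortest-path length; infinity if the graph is disconnected."""
    best = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        if len(dist) < len(adj):
            return float("inf")
        best = max(best, max(dist.values()))
    return best


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
before, after = diameter(path), diameter(add_hub(path))
```

On this six-node path the diameter drops from 5 to 2, because any two nodes are joined through the hub in two hops. A disconnected graph, whose diameter is infinite, also ends up with diameter 2.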
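In Algorithm 1, the loop over t = 1, ..., t1 updates every label by adding $\beta L_i^0$ to the α-weighted sum of the neighbor labels, with α > 0, β > 0 and $\Gamma_i$ the first order neighbors of node i. After the loop ends, the nodes with the maximal value are chosen as the sources S, and S is returned.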

Step 4: The Multiple Sources Localization
The central idea of this paper is that the sources tend to lie in infection regions with more infected nodes, and the potential concentration label of a source is higher than that of its neighbor nodes. After several iterations, several maximal values of the potential concentration label exist in the network. Finally, we choose the nodes with the maximal values as the multiple sources.
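The four steps can be put together in a short sketch. It is not the authors' implementation: the function name `pcl_sources`, the hub identifier and the rule of returning the s highest-labeled original nodes (which presumes that the number of sources s is known) are assumptions made here. The update uses the weights $1 / \sqrt{d_i d_j}$ of $T = D^{-1/2} A D^{-1/2}$ on the hub-augmented graph.

```python
import math


def pcl_sources(adj, infected, s, alpha, beta, t1=2, hub="__hub__"):
    """Sketch of PCL: returns (estimated sources, final labels).

    adj      : dict node -> iterable of neighbors (undirected, simple graph)
    infected : set of nodes observed as infected in the snapshot
    s        : number of sources to return (assumed known)
    """
    if hub in adj:
        raise ValueError("hub name is already used by a node")
    # Step 1: initial labels, 1 for infected nodes and 0 otherwise.
    L0 = {v: 1.0 if v in infected else 0.0 for v in adj}
    # Step 2: add a hub node linked to every node, with label 1.
    g = {v: set(nbrs) | {hub} for v, nbrs in adj.items()}
    g[hub] = set(adj)
    L0[hub] = 1.0
    # Step 3: iterate L_i^t = alpha * sum_j T_ij * L_j^(t-1) + beta * L_i^0.
    deg = {v: len(nbrs) for v, nbrs in g.items()}
    L = dict(L0)
    for _ in range(t1):
        # The comprehension reads the previous labels before L is rebound.
        L = {
            i: alpha * sum(L[j] / math.sqrt(deg[i] * deg[j]) for j in g[i])
            + beta * L0[i]
            for i in g
        }
    # Step 4: pick the s original nodes (hub excluded) with the largest labels.
    ranked = sorted(adj, key=lambda v: L[v], reverse=True)
    return ranked[:s], L
```

A call such as `pcl_sources(adj, infected, s=3, alpha=1.0, beta=-1.0)` returns the candidate sources together with all labels. Selecting local maxima of the label instead of the top s nodes would be an alternative reading of the selection step.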

A Simple Example of Multiple Sources Localization
To better describe the PCL method, we introduce a simple example of multiple sources localization. Given a snapshot of the network at some point, we know the infection state of all the nodes. The true sources are {f, h}. From Figure 2, it is easy to see that nodes f and h always have the maximal value. According to PCL, we take nodes {f, h} as the estimated sources, which are also the true sources.

Data Descriptions and Measurements
To evaluate the performance of the PCL method, we first introduce the experimental data: several synthetic networks, namely ER, WS and BA, and several real networks, namely Karate, Lesmis, Adjnoun, Football, Jazz and USAir. Synthetic networks are controllable in that their parameters can be adjusted, so many tests can be done to verify the efficiency of the method. The data of the Karate, Lesmis, Adjnoun and Football networks can be downloaded from the network data of Newman [24]. The other data sets come from the corresponding references. Basic characteristics are shown in Table 1.
The F-measure is commonly used to check the accuracy of estimated or identified sources in a complex network [25]. It is defined as
$$F\text{-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where precision is the ratio of the number of correctly identified sources to the number of all retrieved sources, defined in Equation (3),
$$\text{precision} = \frac{|\text{retrieved sources} \cap \text{true sources}|}{|\text{retrieved sources}|} \qquad (3)$$
and recall is the ratio of the number of correctly identified sources to the number of ground truth sources, defined in Equation (4),
$$\text{recall} = \frac{|\text{retrieved sources} \cap \text{true sources}|}{|\text{true sources}|} \qquad (4)$$
In this paper, we suppose that the number of sources is already known, so that the number of retrieved sources equals the number of true sources; that is, precision equals the F-measure. Therefore, we choose precision as the evaluation index of sources localization accuracy. Because the situation we face is a serious contagion, we set the infection probability p = 0.8 and the recovery probability q = 0.1. The results are obtained by averaging over 100 independent realizations.
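The three measures are easy to compute from two sets of nodes. The sketch below, with the hypothetical function name `precision_recall_f`, also illustrates why precision, recall and the F-measure coincide when the number of retrieved sources equals the number of true sources.

```python
def precision_recall_f(retrieved, true_sources):
    """Return (precision, recall, F-measure) for two collections of nodes."""
    retrieved, true_sources = set(retrieved), set(true_sources)
    hits = len(retrieved & true_sources)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(true_sources) if true_sources else 0.0
    # With no correct source both terms are zero; define the F-measure as 0.
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Three retrieved sources against three true sources, two of them correct.
p_, r_, f_ = precision_recall_f({"a", "b", "c"}, {"a", "b", "d"})
```

In this example all three values equal 2/3, since the two denominators are the same.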

Optimal Iteration Frequency Choice
We choose the nodes with the maximal value of the potential concentration label as the sources, and the potential concentration label depends on the number of iterations. Therefore, we next test the performance of our multiple sources localization method under six iteration frequencies in synthetic networks and real networks to find the appropriate iteration frequency.
Figure 3 shows that the sources localization accuracy changes sharply as t1 varies. Interestingly, the accuracy is highest when t1 = 2. The hub node plays a decisive role in this change. In the first iteration, the hub node is an extraneous node that brings error to the potential concentration label of each node. In the following iterations, however, the hub node acts as a bridge that transmits all node labels to each node, which increases the accuracy of sources localization. Moreover, there is a turning point at β = 0: the accuracy is higher for β < 0 than for β > 0. The main reason lies in the incomplete infection information: a recovered node, especially a source, has actually been infected, but it is treated as uninfected in the calculation. To get a better multiple sources localization performance, we choose t1 = 2, β = −1 in the following experiments.

Comparison Methods
To compare with the performance of PCL, we select several sources localization methods as benchmarks.
Distance Centrality (DC) [10]: the sum of the distances from one node to all the infected nodes. The sources usually have the smallest Distance Centrality.
Jordan Centrality (JC) [11]: the maximum of the distances from one node to all the infected nodes. The sources tend to have the smallest Jordan Centrality.
Unbiased Betweenness Centrality (UBC) [13]: the betweenness of one node with the effect of degree eliminated, namely unbiased betweenness. The sources are the nodes with the largest unbiased betweenness.
Modified Label Propagation based Source Identification (LPSI) [14]: this method lets the infection status iteratively propagate in the network as labels, and finally uses the local peaks of the label propagation result as source nodes.
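The two distance-based benchmarks follow directly from their definitions and can be sketched with breadth-first search. This is an illustration of the definitions only, not the implementations used in the cited works; the function names are hypothetical, the graph is assumed unweighted, and the infected set is assumed non-empty.

```python
from collections import deque


def bfs_distances(adj, start):
    """Hop distances from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def distance_and_jordan(adj, infected):
    """Return (DC, JC): for each node, the sum and the maximum of its
    distances to all infected nodes (infinite if one is unreachable)."""
    dc, jc = {}, {}
    for v in adj:
        dist = bfs_distances(adj, v)
        d = [dist.get(u, float("inf")) for u in infected]
        dc[v] = sum(d)
        jc[v] = max(d)
    return dc, jc


# A path 0-1-2-3-4 with every node infected: the middle node minimizes both.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
dc, jc = distance_and_jordan(path, infected={0, 1, 2, 3, 4})
```

The estimated single source under either benchmark is then the node with the smallest value, for example `min(dc, key=dc.get)`.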

Sources Localization in Synthetic Networks
To test the efficiency of the PCL method, we first carry out some experiments in synthetic networks, that is, the random (ER) network [26], the Watts-Strogatz small world (WS) network [27], and the scale-free (BA) network [28]. The ER network and WS network are homogeneous networks, and the BA network is a heterogeneous network. We focus our attention on the influence of the network parameters on sources localization accuracy. The main parameters are the network scale N, the average degree $\langle k \rangle$ and the number of sources s.
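As an illustration of how such a controllable test network can be produced, the sketch below generates an ER graph with a target average degree. The paper does not state which ER variant was used, so the $G(n, p)$ construction with $p = \langle k \rangle / (N - 1)$ and the function name `er_graph` are assumptions.

```python
import random


def er_graph(n, avg_degree, seed=0):
    """G(n, p) random graph whose expected average degree is `avg_degree`.

    Every pair of distinct nodes is linked independently with probability
    p = avg_degree / (n - 1). Returns a dict node -> set of neighbors.
    """
    rng = random.Random(seed)
    p = avg_degree / (n - 1)
    adj = {v: set() for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


# A network on the scale used in the experiments: N = 100, <k> = 4.
g = er_graph(100, 4, seed=1)
```

The realized average degree fluctuates around the target from one random seed to another, which is one reason to average results over many realizations.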
Since this paper mainly considers the sources localization problem, the number of sources is undeniably the most important parameter. We first examine the effect of the number of sources on the performance of sources localization. Figure 4 shows that when the number of sources increases, the sources localization accuracy tends to decrease for all the methods, that is, PCL, LPSI, DC, JC and UBC. When the number of sources becomes large, the sources may be too close to each other to be identified easily. Finding the number of sources accurately is the first problem that needs to be solved urgently. In short, in the above three synthetic networks, PCL outperforms the other four methods in sources localization accuracy. As the number of sources increases, the sources localization accuracy of PCL declines only slightly, reflecting its strong robustness.
From Figure 6, we find that the average degree has little influence on the sources localization accuracy for almost all methods. In the ER network, the methods rank as PCL > DC > JC > UBC > LPSI, while in the WS and BA networks they rank as PCL > JC > DC > UBC > LPSI. All in all, PCL can always solve the sources localization problem, whether the network is sparse or not. In other words, the accuracy of PCL remains robust when the number of edges in the network increases or decreases.
For different network scales, Figure 7 indicates that the sources localization accuracy fluctuates mildly as the network scale increases for all the methods except DC. In the ER network, the DC method achieves a higher accuracy as the network scale increases. The accuracy of PCL remains robust when the network size changes. Thanks to this result, we can generalize the method to large networks in the context of big data.

Sources Localization in Real Networks
In addition to the synthetic networks, we test the performance of the above five methods in real networks (Karate, Lesmis, Adjnoun, Football, Jazz and USAir). These networks are social networks, where propagation usually occurs.
From Figure 5, we find that PCL has the highest sources localization accuracy in all real networks. Meanwhile, the sources localization accuracy of PCL remains robust across different network structures. Moreover, the results confirm that the average degree and the network scale have little effect on sources localization accuracy. All in all, PCL achieves the best sources localization accuracy among the five methods, that is, PCL, LPSI, DC, JC and UBC.

Anti-Noise
An efficient method needs to keep high accuracy under noise of various intensities. In this section, noise handling strategies and infection state noise are taken into account to test the anti-noise ability of the sources localization methods.
It is very common that the infection information of some nodes is lost. Suppose that the infection states of 20% of the nodes in a network are unknown to us. There are three strategies to deal with this: (i) All-inf, a strategy where the nodes without infection information are considered to be infected; (ii) None-inf, a strategy where the nodes without infection information are regarded as uninfected; (iii) Rand-inf, a strategy where the nodes without infection information are randomly assigned to be infected or uninfected. Next, we test the sources localization accuracy of the five methods in synthetic networks and real networks. The results are shown in Tables 2 and 3.
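The three strategies amount to different ways of filling in the missing entries of the snapshot before localization. A minimal sketch follows; the function name `fill_unknown`, the lowercase strategy strings and the choice of probability 0.5 for Rand-inf are assumptions, since the text does not specify the random rule.

```python
import random


def fill_unknown(observed, unknown, strategy, seed=0):
    """Return a complete 0/1 snapshot.

    observed : dict node -> 0 or 1 for nodes whose state is known
    unknown  : iterable of nodes whose infection state is missing
    strategy : "all-inf", "none-inf" or "rand-inf"
    """
    rng = random.Random(seed)
    filled = dict(observed)
    for v in unknown:
        if strategy == "all-inf":
            filled[v] = 1  # treat every unknown node as infected
        elif strategy == "none-inf":
            filled[v] = 0  # treat every unknown node as uninfected
        elif strategy == "rand-inf":
            filled[v] = rng.randint(0, 1)  # infected with probability 0.5 (assumed)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return filled
```

The completed snapshot can then be passed to any of the localization methods in place of the fully observed one.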
Table 2. The average precision (PCL vs. LPSI vs. DC vs. JC vs. UBC), α = 0.1 : 0.1 : 1, β = −1, for different strategies of dealing with unknown infection information in synthetic networks and real networks with 20% uncertainty. The number of sources is s = 3 in all networks, $\langle k \rangle = 4$ and N = 100 in the ER, WS and BA networks, and the connection probability is 0.1 in the WS network. A bold number denotes the highest accuracy of sources localization in each network. Columns: Network, Strategy, PCL, LPSI, DC, JC, UBC.
Table 2 suggests that PCL reaches the highest sources localization accuracy among all the methods with each strategy in each network. For all methods except PCL, the accuracy changes only slightly across the different strategies of dealing with noise. In most cases, when noise exists, the PCL method achieves the highest sources localization accuracy, mostly when the None-inf strategy is chosen to deal with noise.
Besides the different strategies for dealing with unknown infection information, we further study the sources localization performance under three noise intensities (ni), where ni denotes the proportion of nodes whose states are unknown to us. The noise intensities are ni = 0.05, ni = 0.1 and ni = 0.2. From Table 3, with increasing noise intensity, the sources localization accuracy of all methods decreases in all networks. Our method shows a clear advantage in sources localization over all the other methods. PCL achieves the highest sources localization accuracy not only in an ideal situation (without noise), but also in a real situation (with noise).

Conclusions and Discussion
In this paper, we study the problem of locating multiple sources of contagion under the SIR model. Given a snapshot of a network, we propose a fast and accurate multiple sources localization method, namely the Potential Concentration Label method. The key of this method is to find the nodes with the maximal value of the potential concentration label as the sources. First, we assign an initial concentration label to each node according to its infection state; next, the label propagation process begins, in which the label of each node is determined by the labels of its neighbors and its own initial label, through two iterations; finally, we choose the nodes with the maximal value of the potential concentration label as the contagion sources. The experiments demonstrate that when the number of sources increases, the sources localization accuracy of our method decreases gradually. However, it remains robust as the average degree and network scale change. Compared to the benchmark methods, our method has a low time complexity and a higher sources localization accuracy in synthetic networks and real networks. Moreover, the anti-noise ability of our method is strong, which shows its effectiveness.
Although our method provides a new reference for the problem of multiple sources localization in complex networks, much work still needs to be done. The issue of sources localization we proposed is

Figure 1. A snapshot of the pollutant diffusion process. Each letter denotes a node in the network and each number represents the concentration label of that node. The propagation sources are S = {d, k}. (a) Pollutant concentration of the network. (b) Incomplete pollutant concentration of the network, which is similar to the contagion situation of the network.

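Algorithm 1: Potential Concentration Label
Input: The network topology G and infection state X.
Output: The multiple sources S.
1: Set up the initial label $L_i^0$, i ∈ V;
2: Add a hub node to the network and set $L_{N+1}^0 = 1$;
3: Construct the transmission matrix $T = D^{-1/2} A D^{-1/2}$;
4: for t = 1 : t1 do
5:   $L_i^t = \alpha \sum_{j \in \Gamma_i} T_{ij} L_j^{t-1} + \beta L_i^0$
6: end for
7: α > 0, β > 0, $\Gamma_i$ represents the first order neighbors of node i.
8: We choose the nodes with maximal value as the sources S;
9: return S.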

Figure 2. An example of multiple sources localization. (a-c) The potential concentration labels of the network at different iteration frequencies.

Figure 3. Sources localization accuracy with various frequencies of iteration in nine networks. (a-i) The relationship between precision and β in synthetic networks and real networks. Without loss of generality, we suppose α = 1. t1 denotes the number of iterations.

Figure 4. Sources localization accuracy for different numbers of sources with five methods in synthetic networks. (a-c) The relationship between precision and α (β = −1) for different numbers of sources in the ER network. The number of sources is s = 2, 3, 5 respectively, which is similar to Figure 5. The network scale is N = 100 and the average degree is $\langle k \rangle = 4$. (d-f) The same experiments carried out in the WS network. (g-i) The same experiments carried out in the BA network. All comparison experiments (including subsequent experiments) involve the above five methods, that is, PCL, LPSI, DC, JC and UBC, which are represented by five different colors.

Figure 6. Sources localization accuracy for different average degrees with five methods in synthetic networks. (a-c) Results in the ER network. The average degree is $\langle k \rangle = 4, 6, 8$ respectively, and the network scale is N = 100. The number of sources is s = 3, the same as in Figure 7. (d-f) Results in the WS network. (g-i) Results in the BA network.

Figure 7. Sources localization accuracy for different network scales with five methods in synthetic networks. (a-c) Results in the ER network. The network scale is N = 100, 200, 300 respectively, and the average degree is $\langle k \rangle = 4$. (d-f) Results in the WS network. (g-i) Results in the BA network.

Table 1. The parameters of the real networks. The quantities N, |E| and $\langle k \rangle$ represent the number of nodes, the number of edges and the average degree, respectively.

Table 3. The sources localization accuracy of PCL vs. LPSI vs. DC vs. JC vs. UBC in synthetic networks and real networks under noise of various intensities. The precision is the mean value over α = 0.1 : 0.1 : 1. The number of sources is s = 3 in all networks, $\langle k \rangle = 4$ and N = 100 in the ER, WS and BA networks, and the connection probability is 0.1 in the WS network. The Rand-inf strategy is used to deal with unknown infection information.