Locating the Source of Diffusion in Complex Networks via Gaussian-Based Localization and Deduction

Abstract: Locating the source of a diffusion-like process is a fundamental and challenging problem in complex networks. Solving it can help inhibit the outbreak of epidemics among humans, suppress the spread of rumors on the Internet, prevent cascading failures of power grids, etc. However, our ability to accurately locate the diffusion source is strictly limited by incomplete information about the nodes and by the inevitable randomness of the diffusion process. In this paper, we propose an efficient optimization approach via maximum likelihood estimation to locate the diffusion source in complex networks with limited observations. By modeling the informed times of the observers, we derive an optimal source localization solution for arbitrary trees and then extend it to general graphs via proper approximations. Numerical analyses on synthetic and real networks indicate that our method is superior to several benchmark methods in terms of average localization accuracy, high-precision localization and approximate area localization. In addition, its low computational cost enables our method to be applied to the source localization problem in large-scale networks. We believe that our work can provide valuable insights into the interplay between information diffusion and source localization in complex networks.


Introduction
Diffusion dynamics taking place on complex networks has long been an active topic, since it helps us better understand many ubiquitous natural phenomena and social behaviors. Many diffusion processes that occur in daily life are harmful and may result in huge losses to society. Prototypical examples include outbreaks of epidemics among humans [1], the spreading of rumors over social networks [2], country-wide cascading failures of power grids [3], etc. If decision-makers, such as managers and politicians, can identify the diffusion sources as early as possible, they are more likely to make the right decision in time and avoid the economic losses and social panic caused by the associated tragedies. To better resist the potentially severe consequences of those diffusion processes, there is a great need to develop efficient strategies to locate the source of diffusion and to devise control methods as early as possible.
Despite the great effort that has been made in this field, identifying patient zero remains challenging. The main difficulty lies in two aspects, i.e., incomplete information and the stochastic nature of diffusion. On the one hand, the exact number of sources and the initial time at which diffusion first occurred are usually unknown to us. On the other hand, even if we had full access to such information, the inevitable randomness of the diffusion process would still weaken our ability to accurately locate the source. For example, in an epidemic spreading process, such as the susceptible-infected process, different initial conditions may lead to the same observations, making it extremely hard to identify the real source.
Models and methods for source localization have been studied in a number of works. Shah and Zaman [4] were pioneers who first provided a systematic study of how to detect the origin of a computer virus in a network. They modeled the process of virus spreading within a network via a susceptible-infected (SI) model and presented Rumor centrality as the maximum likelihood estimator for a class of graphs. Inspired by their work, Zhu and Ying [5] developed a sample-path-based approach named Jordan centrality to detect the information source in a network under susceptible-infected-recovered (SIR) dynamics. Later, two Bayesian inference solutions, namely dynamic message-passing [6] and belief propagation [7], were successively designed to measure the probability distribution for the observations and identify patient zero in the network. Brockmann and Helbing [8] studied global disease dynamics and proposed a new index named effective distance to locate the origin of complex spatiotemporal patterns. Zhu and Ying [9] developed the Short-Fat Tree algorithm to locate the diffusion source under an independent cascade model. Hu et al. [10] developed a framework for optimal source localization in arbitrarily weighted networks with arbitrary distributions of sources based on controllability theory and compressive sensing. Several other methods for source localization [11-16] and observation selection [17,18] have also been proposed based on different considerations.
Those excellent works are mainly based on the knowledge of the diffusion status for a proportion of nodes at specific snapshots. However, the time information or the timestamp when diffusion first arrives at some nodes has not been fully investigated by many scholars. In [19], Pinto et al. presented the Gaussian heuristic as a maximum likelihood estimator of source localization problem on arbitrary trees, but its performance could not be guaranteed in general graphs. Zhu et al. [20] formulated the source localization problem as a ranking problem on graphs and proposed two ranking algorithms, namely cost-based ranking and tree-based ranking, to locate the diffusion source in networks. Recently, Shen et al. [21] developed a time-reversal backward spreading algorithm to efficiently locate the source of a diffusion-like process and proposed a general locatability condition. For multiple sources detection, Fu et al. [22] investigated a maximum-minimum strategy based on backward diffusion, which has been further extended by Hu et al. [23] via integer programming.
In this paper, we present Gaussian-based localization and deduction (GLAD) as a simple and efficient framework for locating the diffusion source in networks based on partial timestamps. We consider a simple diffusion process associated with time delays and provide a probabilistic method to locate the source of diffusion via parameter estimation and maximum likelihood estimation. More precisely, we derive an optimal solution of the source localization problem on arbitrary trees and extend it to general graphs via approximations and simplifications. Experiments are conducted on synthetic networks (including arbitrary trees and general graphs) and real networks, and the results all verify the good performance of our algorithm.

Methods
In this section, we provide a brief introduction about diffusion models and source localization problems at first. Then, on arbitrary trees, we derive the GLAD framework as a maximum likelihood estimator that can simultaneously estimate the probability that a node is the diffusion source and provide the corresponding diffusion parameters. Finally, we discuss the difficulties for source localization problem on general graphs, and present an approximate method with low computational cost for those cases.

Problem Definition
The underlying network in which the diffusion occurs is modeled as an undirected simple graph G = (V, E), where V denotes the set of nodes and E denotes the set of edges. We focus on static networks whose topology does not change during the diffusion process. The diffusion source s* ∈ V is the only node at which the information originates, and it triggers the diffusion at some unknown initial time t_0. In contrast to many previous works based on stochastic epidemic processes such as SI, SIS and SIR, we employ a simple diffusion model in which each edge carries a time delay.
The diffusion process is modeled as follows. At time t, each node can be in one of two states: informed or uninformed. An informed node represents an individual who is aware of the information and will propagate it to its neighbors, whereas an uninformed node represents an individual who has not been informed yet. Let Γ(v) denote the neighbors of node v and suppose that v is in the uninformed state. When v receives the information from one of its neighbors for the first time, at time t_v, it changes to the informed state and propagates the information. Then, each of its neighbors u ∈ Γ(v) receives the information from v at t_v + θ_uv, where θ_uv denotes the diffusion delay associated with edge uv. The diffusion delays {θ_uv} along the edges are modeled as independent and identically distributed random variables that follow a Gaussian distribution N(µ, σ²). As is often the case in real situations, such as the spread of a computer virus on the Internet via a cable, the mean diffusion delay µ is easy to evaluate, whereas the standard deviation σ is hard to obtain. We therefore assume that only µ is known and that σ needs to be estimated.
Let O = {o_k}, k = 1, ..., K, denote a group of observers whose informed times T_O = [t_{o_1}, t_{o_2}, ..., t_{o_K}]^T are accessible to us. The source localization problem can then be described as follows: given the network topology and the informed times of the observers O, our goal is to find the diffusion source s* in the network. To simplify the derivation and avoid a non-invertible matrix in our method, we assume that s* ∉ O. There is no loss of generality, since one can generalize our method by adding an extra step that calculates the probability of being the source for each o_k ∈ O with a known initial time t_{o_k}. In reality, a rumor source is unlikely to report its informed time, since this information would expose it as the first to report the rumor.
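The diffusion model above can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: the function name `simulate_diffusion`, the adjacency-dictionary representation and the resample-until-positive truncation are assumptions made for illustration. Each edge receives one Gaussian delay, and the informed time of a node is its earliest arrival time, computed with a priority queue.

```python
import heapq
import random


def simulate_diffusion(adj, source, mu, sigma, t0=0.0, rng=None):
    """Return a dict node -> informed time for all nodes reachable from source.

    adj maps each node to an iterable of its neighbours (undirected graph).
    Every edge gets one delay drawn from N(mu, sigma^2); negative or zero
    draws are resampled (a simple truncation, assumed for this sketch).
    """
    rng = rng or random.Random()
    delays = {}

    def delay(u, v):
        key = frozenset((u, v))          # same delay in both directions
        if key not in delays:
            d = rng.gauss(mu, sigma)
            while d <= 0:
                d = rng.gauss(mu, sigma)
            delays[key] = d
        return delays[key]

    informed = {}
    heap = [(t0, source)]
    while heap:
        t, v = heapq.heappop(heap)
        if v in informed:                # already informed at an earlier time
            continue
        informed[v] = t
        for u in adj[v]:
            if u not in informed:
                heapq.heappush(heap, (t + delay(v, u), u))
    return informed
```

Restricting the returned dictionary to the observer set gives the partial timestamps that a localization method would receive as input.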

Source Localization on Arbitrary Trees
Consider first the case of an arbitrary tree. In graph theory, a tree is defined as a connected, undirected graph that contains no closed loops. Obviously, describing the diffusion process occurring in trees presents a natural advantage because there is only one path between the source and each observer.
Since the diffusion delays {θ uv } along edges are random variables, the arrival times {τ o k } for observations can also be viewed as random variables.
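Suppose that s is the source node and $t_0$ is the initial time at which s starts the diffusion, and let $P(s, o_k) \subset E$ be the path from node s to observer $o_k$, with length $d_{sk} = |P(s, o_k)|$. As all diffusion delays are independent and identically distributed Gaussian variables, the arrival time at observer $o_k$ is the sum of the delays along this path,

$$\tau_{o_k} = t_0 + \sum_{uv \in P(s, o_k)} \theta_{uv} \sim N(t_0 + \mu d_{sk},\ \sigma^2 d_{sk}),$$

and the arrival times at all observers are jointly Gaussian. Taking the above, the likelihood of observing the timestamps $T_O$ with respect to the source node s, the initial time $t_0$ and the delay standard deviation $\sigma$ is expressed as follows:

$$P(T_O \mid s, t_0, \sigma) = \frac{1}{\sqrt{(2\pi)^K |\Lambda_s|}} \exp\left(-\frac{1}{2} (T_O - \boldsymbol{\mu}_s)^T \Lambda_s^{-1} (T_O - \boldsymbol{\mu}_s)\right). \tag{1}$$

The mean vector $\boldsymbol{\mu}_s$ and the covariance matrix $\Lambda_s$ of the joint Gaussian distribution are as follows:

$$[\boldsymbol{\mu}_s]_k = t_0 + \mu\, d_{sk}, \tag{2}$$

$$\Lambda_s = \sigma^2 \Lambda_{ps}, \tag{3}$$

where $[\Lambda_{ps}]_{i,j} = |P(s, o_i) \cap P(s, o_j)|$ denotes the path intersection matrix, whose elements give the number of edges shared by the two paths from node s to observers $o_i$ and $o_j$.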
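The path intersection matrix defined above can be computed directly from a tree rooted at the candidate source. The following is a minimal sketch under assumed names (`bfs_parents`, `path_edges` and `path_intersection_matrix` are hypothetical helpers, and the graph is an adjacency dictionary): it collects the edges on each source-to-observer path and counts the shared ones.

```python
from collections import deque


def bfs_parents(adj, root):
    """Breadth-first search from root; returns a dict node -> parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                queue.append(u)
    return parent


def path_edges(parent, node):
    """Set of (child, parent) edges on the tree path from node to the root."""
    edges = set()
    while parent[node] is not None:
        edges.add((node, parent[node]))
        node = parent[node]
    return edges


def path_intersection_matrix(adj, s, observers):
    """K x K matrix whose (i, j) entry is |P(s, o_i) ∩ P(s, o_j)|."""
    parent = bfs_parents(adj, s)
    paths = [path_edges(parent, o) for o in observers]
    return [[len(pi & pj) for pj in paths] for pi in paths]
```

The diagonal entries of the result are the path lengths d_sk, and an off-diagonal entry is zero whenever the two paths leave the source along different edges.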
Since no prior information is available on the location of the source node, the optimal estimator is the maximum likelihood estimator (MLE):

$$\hat{s} = \arg\max_{s \in V \setminus O}\ \max_{t_0, \sigma} P(T_O \mid s, t_0, \sigma). \tag{4}$$

Let $w_s = [w_{s1}, w_{s2}, ..., w_{sK}]^T$ represent a K-dimensional column vector of the differences between the observation times and the mean travel times to the observers:

$$w_{sk} = t_{o_k} - \mu\, d_{sk}. \tag{5}$$

For any candidate source s, by maximizing Equation (1) with respect to $t_0$, the MLE of $t_0$ can be represented as

$$\hat{t}_0 = \frac{I^T \Lambda_{ps}^{-1} w_s}{I^T \Lambda_{ps}^{-1} I}, \tag{6}$$

where $I = [1, 1, ..., 1]^T$ denotes a K-dimensional column vector full of 1s.
To better describe the optimization procedure, we create an auxiliary variable

$$\hat{w}_s = w_s - \hat{t}_0 I. \tag{7}$$

Substituting Equation (6) into Equation (1) and following a similar procedure by maximizing $P(T_O \mid s, \hat{t}_0, \sigma)$ with respect to $\sigma$, the MLE of $\sigma$ has the following form:

$$\hat{\sigma}^2 = \frac{1}{K}\, \hat{w}_s^T \Lambda_{ps}^{-1} \hat{w}_s. \tag{8}$$

Finally, substituting both estimates into Equation (1) and dropping constant terms, the optimal estimator for the source node is

$$\hat{s} = \arg\max_{s \in V \setminus O} P(T_O \mid s, \hat{t}_0, \hat{\sigma}) = \arg\min_{s \in V \setminus O} \left( K \ln \hat{\sigma} + \frac{1}{2} \ln |\Lambda_{ps}| \right). \tag{9}$$

The equations above constitute the core of our source localization algorithm on arbitrary trees. Since our optimization is based on the assumption of a Gaussian distribution on edge delays, and the diffusion parameters are deduced from the observers, we name our method Gaussian-based localization and deduction (GLAD).

Figure 1 demonstrates a diffusion process on a toy tree with 11 nodes and 10 edges. In this model, the diffusion delays along the edges are sampled from a Gaussian distribution N(1, 0.25). We assume that one node s is the diffusion source and three nodes O = {o_1, o_2, o_3} are observers. Only the network topology, the mean diffusion delay and the informed times of the observers are accessible to us, where the informed times of the observers are T_O = [2.319, 0.662, 2.488]. We now walk through the calculation process of the GLAD algorithm, taking node B as an example. From Figure 1, the paths from node B to the three observers A, J and F can be read off directly, which gives the distances and the path intersection matrix for B. Combining Equations (6) and (8) then yields the MLEs of t_0 and σ, and the objective function of node B follows from Equation (9).

Following the optimization process in GLAD, the diffusion parameters (t_0, σ) estimated by Equations (6) and (8) and the corresponding objective function value in Equation (9) for each candidate source s ∈ V\O are listed in Table 1. From the table, we can see that the real source I is successfully identified, and the estimation errors for t_0 and σ are acceptable.
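The per-candidate computation can be sketched in plain Python. This is an illustrative sketch rather than the authors' code. It assumes the closed forms that follow from maximizing a multivariate Gaussian likelihood with covariance σ²Λ_ps: t̂_0 = (1ᵀΛ_ps⁻¹w)/(1ᵀΛ_ps⁻¹1), σ̂² = (ŵᵀΛ_ps⁻¹ŵ)/K with ŵ = w − t̂_0·1, and the objective K·ln σ̂ + ½·ln|Λ_ps| (smaller is better). The helper names and the small floor `eps` on σ̂² are assumptions.

```python
import math


def solve_and_logdet(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Returns (x, log|det A|); the inputs are not modified.
    """
    n = len(A)
    M = [[float(a) for a in row] + [float(bk)] for row, bk in zip(A, b)]
    logdet = 0.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular")
        M[c], M[p] = M[p], M[c]
        logdet += math.log(abs(M[c][c]))
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x, logdet


def glad_tree_score(lam_ps, dists, times, mu, eps=1e-12):
    """Estimate (t0, sigma) and the objective for one candidate source.

    lam_ps: K x K path intersection matrix; dists: distances d_sk;
    times: observed informed times t_ok; mu: known mean edge delay.
    """
    K = len(times)
    w = [t - mu * d for t, d in zip(times, dists)]
    a, logdet = solve_and_logdet(lam_ps, [1.0] * K)   # a = Lambda_ps^{-1} 1
    # Lambda_ps is symmetric, so 1^T Lambda_ps^{-1} w equals a . w
    t0_hat = sum(ai * wi for ai, wi in zip(a, w)) / sum(a)
    w_hat = [wi - t0_hat for wi in w]
    y, _ = solve_and_logdet(lam_ps, w_hat)            # y = Lambda_ps^{-1} w_hat
    sigma2 = sum(vi * yi for vi, yi in zip(w_hat, y)) / K
    sigma_hat = math.sqrt(max(sigma2, eps))           # floor avoids log(0)
    objective = K * math.log(sigma_hat) + 0.5 * logdet
    return t0_hat, sigma_hat, objective
```

Evaluating `glad_tree_score` for every candidate in V\O and taking the smallest objective gives the estimated source under these assumptions.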

Source Localization on General Graphs
When information is diffused on general graphs, the source localization problem will be more difficult. In general graphs, a spanning tree grows naturally with the diffusion process based on the sequence at which information first reaches each node. Unfortunately, although the real diffusion process in the network is a deterministic process, we cannot ascertain which spanning tree is the actual diffusion tree because the exact diffusion delays along edges are unknown. For that reason, to find the MLE for the diffusion parameters (t 0 , σ) and the source s, we need to optimize the likelihood function in Equation (1) over all possible spanning trees rooted at each candidate source. However, the naive strategy is intractable in practice because, even in medium-sized networks, the number of spanning trees is too large and the computational cost is prohibitive for modern computers.
One possible solution is to assume that the actual diffusion tree is a breadth-first search (BFS) tree, instead of an arbitrary spanning tree. This assumption corresponds to the case that information travels from the source to each observer along the shortest mean diffusion delay path, which is reasonable and intuitive. Nevertheless, even for the same root node, different search strategies may lead to different BFS trees. The naivest approach is to randomly select a BFS tree as the diffusion tree, although this leads to a poor performance according to our numerical experiments in the next section. In the following, we introduce an approximation method to efficiently locate the source in general graphs.
In graph theory, an important feature is that, for any BFS tree with the same root s, the distance $d_{sk}$ between s and each observer $o_k$ never changes. Recall the diffusion parameters $\boldsymbol{\mu}_s$ and $\Lambda_s$. As can be seen in Equation (2), the mean vector $\boldsymbol{\mu}_s$ does not change across different BFS trees. On the other hand, from Equation (3), although the covariance matrix $\Lambda_s$ may vary across different BFS trees, the diagonal elements of each $\Lambda_s$ remain unchanged. In addition, $\Lambda_s$ is diagonally dominant and sparse in large networks, so its diagonal part can replace it with little effect on parameter estimation and likelihood maximization. Inspired by these observations, we propose to perform the optimization with a modified function:

$$\tilde{P}(T_O \mid s, t_0, \sigma) = \frac{1}{\sqrt{(2\pi)^K |D_s|}} \exp\left(-\frac{1}{2} (T_O - \boldsymbol{\mu}_s)^T D_s^{-1} (T_O - \boldsymbol{\mu}_s)\right), \tag{14}$$

where $D_s = \mathrm{diag}(\Lambda_s) = \sigma^2\, \mathrm{diag}(d_{s1}, ..., d_{sK})$ denotes the diagonal part of the original covariance matrix $\Lambda_s$. As $D_s$ does not change across different BFS trees, one can build any BFS tree to obtain it.
In the new optimization problem, the corresponding MLE for $t_0$ is given by

$$\tilde{t}_0 = \frac{\sum_{k=1}^{K} w_{sk} / d_{sk}}{\sum_{k=1}^{K} 1 / d_{sk}}, \tag{15}$$

where $w_{sk} = t_{o_k} - \mu \cdot d_{sk}$ is the same as in Equation (5).
Then, the auxiliary variable becomes

$$\tilde{w}_s = w_s - \tilde{t}_0 I. \tag{16}$$

The MLE of $\sigma$ keeps the same form as in the previous subsection, with the diagonal matrix in place of $\Lambda_{ps}$:

$$\tilde{\sigma}^2 = \frac{1}{K} \sum_{k=1}^{K} \frac{\tilde{w}_{sk}^2}{d_{sk}}. \tag{17}$$

Finally, the optimal estimator for the source node becomes

$$\tilde{s} = \arg\max_{s \in V \setminus O} \tilde{P}(T_O \mid s, \tilde{t}_0, \tilde{\sigma}) = \arg\min_{s \in V \setminus O} \left( K \ln \tilde{\sigma} + \frac{1}{2} \sum_{k=1}^{K} \ln d_{sk} \right). \tag{18}$$

In summary, on general graphs we locate the diffusion source by optimizing either the naive likelihood function (Equation (1)) or the modified function (Equation (14)). To distinguish between these two cases, we name them GLAD-naive and GLAD-modified, respectively. Note that GLAD-naive first has to build a random BFS tree for each candidate node before the naive likelihood function (Equation (1)) can be optimized.
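The diagonal approximation needs only the hop distances between candidates and observers, so the whole ranking fits in a few lines. The sketch below is illustrative and makes several assumptions: the function names are hypothetical, the graph is an adjacency dictionary, the estimators are the weighted-mean forms obtained by replacing the path intersection matrix with diag(d_s1, ..., d_sK), and the objective K·ln σ̃ + ½·Σ ln d_sk is minimized.

```python
import math
from collections import deque


def bfs_distances(adj, root):
    """Hop distance from root to every reachable node."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def glad_modified(adj, observer_times, mu, eps=1e-12):
    """Rank candidate sources with the diagonal-covariance approximation.

    observer_times maps each observer to its informed time.
    Returns a list of (objective, node, t0, sigma), best candidate first.
    """
    observers = list(observer_times)
    K = len(observers)
    # One BFS per observer yields d_sk for every candidate s at once.
    dist_from = {o: bfs_distances(adj, o) for o in observers}
    results = []
    for s in adj:
        if s in observer_times:
            continue                      # the source is assumed not to be an observer
        if any(s not in dist_from[o] for o in observers):
            continue                      # unreachable from some observer
        d = [dist_from[o][s] for o in observers]
        w = [observer_times[o] - mu * dk for o, dk in zip(observers, d)]
        inv = [1.0 / dk for dk in d]
        t0 = sum(wk * ik for wk, ik in zip(w, inv)) / sum(inv)
        sigma2 = sum((wk - t0) ** 2 * ik for wk, ik in zip(w, inv)) / K
        sigma = math.sqrt(max(sigma2, eps))   # floor avoids log(0)
        obj = K * math.log(sigma) + 0.5 * sum(math.log(dk) for dk in d)
        results.append((obj, s, t0, sigma))
    results.sort(key=lambda r: r[0])
    return results
```

Running one BFS per observer rather than one per candidate is what keeps the cost proportional to the number of observers times the number of edges.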

Computational Complexity Analysis
The computational cost of GLAD-naive consists of three parts: building the BFS tree rooted at one candidate source v ∈ V\O, calculating the path intersection matrix Λ_ps, and estimating the diffusion parameters t_0 and σ together with the objective function. Suppose that the numbers of nodes and edges in the network G(V, E) are N and M, and that we have K observers. When the topology of the network is known, building a BFS tree rooted at one node v costs O(M). The paths from v to every node on the tree are obtained at the same time as a by-product of the tree-building scheme. As each element [Λ_ps]_{i,j} of the path intersection matrix denotes the length of the common path from v to observers o_i and o_j, it costs O(N) to check how many edges of the tree lie on the common path P(v, o_i) ∩ P(v, o_j). Thus, the computational cost of this step is O(NK²). The main costs in the third step of parameter estimation and objective function calculation are the computation of Λ_ps⁻¹ and |Λ_ps|, which both cost O(K³) with typical algorithms in linear algebra. In total, the computational cost of GLAD-naive is O(M + NK² + K³) per candidate source, i.e., O(N(M + NK² + K³)) over all candidates.

The calculation time can be further reduced in GLAD-modified because the path intersection matrix Λ_ps does not need to be explicitly constructed, since we only care about its diagonal elements d_sk. These can be computed in a batch mode with only O(KM) operations, in which we start a BFS process from each observer o_k and record the distance d_sk from it to all candidate sources s ∈ V\O. According to Equations (15)-(18), the parameter estimation and objective function calculation can be finished in linear time O(K) for each candidate source. Consequently, the whole computational complexity of GLAD-modified is linear in the network size, i.e., O(KM + KN) = O(KM), which enables its application to large-scale networks.

Experiments and Analysis
To quantify the validity of the proposed algorithm, we present numerical results on the success rate of source localization on arbitrary trees and general graphs. In the implementation, the diffusion delay θ_uv along each edge is sampled independently from an identical truncated Gaussian distribution to ensure that θ_uv > 0. Since no prior knowledge is available on the diffusion source s and the observers O, they are chosen randomly from the network. The results are obtained by averaging over 100 independent realizations.

Metrics and Benchmark Methods
The algorithm performance is evaluated using three metrics, namely average ranking, γ%-accuracy, and average error distance. The average ranking is the average location of the actual source in the list of nodes sorted in increasing order by the objective function value. We focus on the average value after many simulations. The γ%-accuracy represents the proportion of simulations for which the real source is ranked within the top γ% among all nodes. In particular, ties are broken randomly for nodes with the same ranking. As for the average error distance, it is defined as the average distance between the estimated source and the real source. Among those metrics, the average ranking reflects the average accuracy of source localization, the γ%-accuracy focuses more on high-precision localization, and the average error distance mainly considers the approximate area of the source.
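The three metrics can be written down compactly. The following sketch uses hypothetical function names and assumes that `scores` maps every ranked node to its objective value (smaller is better), that the γ% cutoff is taken over `n_nodes` ranked nodes, and that error distances are hop counts collected over the simulation runs.

```python
import random


def source_rank(scores, true_source, rng=random):
    """1-based rank of the true source in increasing order of score.

    Ties are broken uniformly at random among nodes with the same score.
    """
    s = scores[true_source]
    better = sum(1 for v in scores.values() if v < s)
    tied = sum(1 for v in scores.values() if v == s)   # includes the source
    return better + rng.randint(1, tied)


def average_ranking(ranks):
    """Mean rank of the true source over all simulation runs."""
    return sum(ranks) / len(ranks)


def gamma_accuracy(ranks, n_nodes, gamma):
    """Fraction of runs in which the source is within the top gamma percent."""
    cutoff = gamma / 100.0 * n_nodes
    return sum(1 for r in ranks if r <= cutoff) / len(ranks)


def average_error_distance(hop_distances):
    """Mean hop distance between the estimated and the real source."""
    return sum(hop_distances) / len(hop_distances)
```

For instance, with ranks [1, 2, 5, 10] on a 100-node network, `gamma_accuracy(ranks, 100, 5)` counts the three runs ranked within the top five nodes.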
The performance of GLAD-naive and GLAD-modified is compared with two well-known benchmark methods, namely the Gaussian heuristic and time-reversal backward spreading:

1. Gaussian heuristic (GAU). In [19], Pinto et al. first showed the possibility of estimating the location of the source from measurements collected by sparsely placed observers. They modeled the diffusion delays along edges with a Gaussian distribution N(µ, σ²) and built an MLE as the optimal solution for the source localization problem on arbitrary trees. Compared with our method, Pinto et al. assumed that both µ and σ were known parameters, whereas we allow σ to be unknown and determine it via an estimation process.

2. Time-reversal backward spreading (TRBS). The time-reversal backward spreading algorithm proposed by Shen et al. [21] is an efficient method to infer the diffusion source based on a weighted network structure and partial timestamps. In their method, the variance of the differences between the true arrival times and the expected arrival times from a node to all observers is calculated and used as a measure of the extent to which that node is the diffusion source.
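As a rough illustration of the TRBS idea described above, the sketch below scores each candidate by the variance of its back-projected start times t_ok − µ·d_sk. It is a simplified reading of the description in this section, not the reference implementation of [21]: the function name, the use of hop distances times µ as the expected arrival time, and the convention that a smaller variance indicates a more likely source are assumptions.

```python
def trbs_scores(dist_to_observers, observer_times, mu):
    """Variance of the reversed arrival times for each candidate source.

    dist_to_observers maps candidate -> list of hop distances d_sk, given in
    the same observer order as observer_times (a list of informed times).
    """
    scores = {}
    for s, dists in dist_to_observers.items():
        # Back-project each observation to an implied start time at s.
        back = [t - mu * d for t, d in zip(observer_times, dists)]
        mean = sum(back) / len(back)
        scores[s] = sum((b - mean) ** 2 for b in back) / len(back)
    return scores
```

If a candidate is the true source and the delays are noise-free, all back-projected times coincide and the variance is zero.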
Note that several topological-based methods, such as Rumor centrality and Jordan centrality, are not employed as benchmark methods since they cannot well exploit the timestamp information. (We have implemented these methods in experiments, and the numerical results suggest that they are far less accurate than the timestamp-based methods.)

Results on Arbitrary Trees
We first perform simulations on two types of arbitrary trees, i.e., ER trees and BA trees. These two kinds of tree are derived from the random network (ER) model and the scale-free network (BA) model, respectively, and both are connected networks.

To generate a BA tree, we start from an initial network that contains only one isolated node. At each time step t, we add one new node to the network together with an edge linking the new node to one existing node, chosen according to the preferential attachment rule of the BA model. By comparison, it is very difficult to directly use the ER model to generate a connected tree with N nodes. Although N − 1 edges can be randomly generated among the N nodes, there is no guarantee that the generated network is connected. Thus, a compromise is adopted here. First, we generate an ER network with 1.5N nodes and an average degree of 1. If the giant component of this network has N nodes (over many simulations, we do find such a giant component), we take its maximum spanning tree as the approximation of an ER tree; if not, we repeat the first step, generating another ER network, until we find the ER tree.

Figure 2 shows the average ranking under different methods with different fractions of observers. In the experiments, the number of nodes in the tree is fixed at 100, and the signal-to-noise ratio of the diffusion delay along edges, µ/σ, is chosen from {4, 3, 2}. A lower signal-to-noise ratio implies larger uncertainty or noise on the edges. The average ranking of the source for GLAD-naive and GAU is slightly lower than that of the other two methods, which is consistent with the previous discussion that both are optimal solutions on arbitrary trees. Since GLAD-modified does not fully exploit the path intersections between the source and each observer, it performs worse than GLAD-naive. The same reason partly explains the poor results of TRBS. In addition, TRBS uses only the mean delay µ to identify the source and does not model the distribution of the diffusion delay along edges, so its performance is not robust to large noise. As can be seen from left to right in the figure, as the noise increases, TRBS performs progressively worse compared with the other methods.
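The BA tree construction described earlier in this subsection can be sketched as follows. This is an illustrative sketch with an assumed function name; it implements degree-proportional attachment by sampling uniformly from a list that contains each node once per incident edge, which is one common way to realize preferential attachment.

```python
import random


def ba_tree(n, rng=None):
    """Grow a tree with n nodes, adding one preferentially attached edge per node.

    Returns an adjacency dict node -> list of neighbours.
    """
    rng = rng or random.Random()
    adj = {0: []}
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from it selects a node with probability proportional to its degree.
    endpoints = []
    for new in range(1, n):
        target = rng.choice(endpoints) if endpoints else 0
        adj[new] = [target]
        adj[target].append(new)
        endpoints.extend((new, target))
    return adj
```

Because every new node attaches to exactly one existing node, the result always has n − 1 edges and is connected.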
Figure 3 demonstrates the γ%-accuracy of different source localization algorithms on arbitrary trees. Compared with the previous simulations, the results of the four methods under this metric are more clearly separated, with GLAD-naive > GLAD-modified > GAU > TRBS. Although the average rankings of GLAD-naive and GAU are nearly equivalent, the γ%-accuracy of the former is clearly better than that of the latter. This finding reveals one disadvantage of GAU: it cannot precisely locate the diffusion source in trees. In contrast, the proposed GLAD-naive and GLAD-modified estimate the diffusion parameters t_0 and σ for each node, which improves the ability to distinguish between the real source and its neighbors. As the signal-to-noise ratio decreases, the elements of the covariance matrix Λ_s in Equation (3) become increasingly significant. As GLAD-modified ignores the non-diagonal elements of Λ_s, the performance gap between GLAD-naive and GLAD-modified also increases.

The average distance between the estimated source and the real source is shown in Figure 4. Under this metric, GLAD-naive and GAU show the best performance among all methods, which is similar to the findings shown in Figure 2. More precisely, the performance of GLAD-naive is slightly better than that of GAU, especially in trees with larger noise. A surprising phenomenon is that the probability of identifying the real source is far larger for GLAD-modified than for TRBS, whereas the average error distances of the two methods are similar. In BA trees, GLAD-modified performs even worse than TRBS. Compared with TRBS, the performance of GLAD-modified is more sensitive to the relative locations of the source and the observers. In the discussion section, we will elaborate on the relationship between these two methods.

Figure 4. Average distance between the estimated source and the real source in different source localization algorithms on synthetic trees. If more than one node is ranked first, we randomly choose one as the estimated source.

The mean square errors (MSE) of the estimated parameters t_0 and σ are illustrated in Figure 5. In the experiments, the initial time is t_0 = 0 and the parameters are µ = 4 and σ = 1. We assume that only µ is known and estimate the other two parameters by GLAD-naive and GLAD-modified. Since GLAD-naive is the optimal solution for the source localization problem on arbitrary trees, the MSEs of t_0 and σ obtained by GLAD-naive are clearly lower than those of GLAD-modified.

Figure 5. Mean square errors of the estimated t_0 and σ, with diffusion delays following N(4, 1). Note that the MSE results of t_0 and σ are not calculated based on the estimated source, but rather on the real source.

Results on General Graphs
Simulations on general graphs are performed on synthetic networks and real networks. For synthetic networks, we consider three common types, i.e., the Erdős-Rényi random network ER(N, k), the Barabási-Albert scale-free network BA(N, k) and the Watts-Strogatz small-world network WS(N, k, p), where N denotes the number of nodes, k denotes the average degree, and p denotes the rewiring probability in WS networks. Some basic topological features are summarized in Table 2.

Table 2. Basic topological features of three synthetic networks. N and M represent the number of nodes and edges. k and k_max are the average degree and the maximum degree. H is the degree heterogeneity. d denotes the average path length between node pairs. C is the average local clustering coefficient of the network. All features are averaged over 500 simulations on the corresponding synthetic networks.

Figure 6 shows the performance on three synthetic networks with an average degree k = 4 and signal-to-noise ratio µ/σ = 3. Unlike the case of arbitrary trees in the previous section, the algorithm performances vary greatly on general graphs. The results demonstrate a uniform ranking of the four methods under all three metrics: GLAD-modified > TRBS > GLAD-naive > GAU. Although GLAD-naive presents satisfactory results on arbitrary trees, its performance is only moderate on general graphs. However, one advantage of GLAD-naive is maintained in that it performs consistently better than GAU, illustrating the necessity of estimating diffusion parameters for different nodes. On arbitrary trees, GLAD-naive always performs better than GLAD-modified, whereas the situation is completely reversed on general graphs, which again shows the need for an algorithm designed specifically for general graphs. Theoretically, if the BFS tree built at a candidate node is indeed the diffusion tree, GLAD-naive will outperform GLAD-modified, which is consistent with the results on arbitrary trees.
Nevertheless, that condition is difficult to satisfy in dense networks because there are so many BFS trees in such networks. In addition, the uncertainty of the diffusion delay along edges further weakens our ability to determine which tree is the true diffusion tree. Taken together, although the likelihood function for GLAD-modified is just an approximation of GLAD-naive, it can present a much better performance on general graphs.
Another important finding is that it is often easier to locate the diffusion source in a homogeneous network like ER and WS than in a heterogeneous one like BA. The greatest obstacle of source localization problem in a heterogeneous network is the information redundancy of observers. Consider a simple heterogeneous case of a star-like network with one center node connected with many peripheral nodes. In such a network, regardless of which node is the real source, it cannot be uniquely identified since we do not know the exact initial time of diffusion. Actually, the information provided by peripheral observers is equivalent to the information provided by the center node, thus increasing the difficulty of locating the source in a heterogeneous network. However, such situations do not occur in homogeneous networks because there are only a few peripheral nodes in such networks.
Next, we consider the results of source localization in real networks. Nine networks from different fields are used to compare the performances. States [24] is a network of the 48 contiguous states and the District of Columbia of the USA, where an edge exists if two states share a border. Dolphins [25] contains the frequent associations between 62 dolphins in a community. Polbooks [26] represents frequently co-purchased books about US politics sold by the online bookseller Amazon.com. Football [27] is a network of American football games between Division IA colleges during the regular season of Fall 2000. The Enron [28] email communication network covers the communications among the employees of Enron corporation between 1999 and 2003. Jazz [29] is a social collaboration network where nodes are jazz musicians and an edge implies that two musicians have played together in a band. USAir [24] is a network of flights between US airports in 1997. Netscience [30] represents a co-authorship network of scientists working on network theory and experiment. Celegans [31] is a biological network where nodes are substrates and edges are metabolic reactions. All networks can be downloaded from the Internet. (Dolphins, Polbooks, Football and Netscience are available on Newman's website [32]. States, Enron, Jazz, USAir and Celegans are available on Network Repository [24].) In the following experiments, we only consider the largest connected component and remove self-loops and multiple edges. Some basic topological features are summarized in Table 3.

Figure 7 demonstrates the average performance of source localization in real networks with signal-to-noise ratio µ/σ = 4. The results show that GLAD-modified outperforms all other methods in all real networks, which is consistent with its performance in synthetic networks. One remarkable feature is that, with the same number of observers, the source in Football is easier to determine than that in other networks.
The homogeneous structure of Football may be the reason. Considering a highly heterogeneous network, the nodes in network can be divided into center nodes and edge nodes. If the center node is the diffusion source, it is difficult to identify the source among this center node and its neighbor nodes; if the edge node is the diffusion source, it is also hard for us to locate the source among this edge node, the first-order center nodes and the second-order center nodes. Linking with Table 3, it is easy to locate the diffusion source when the degree heterogeneity remains low. Actually, the degree heterogeneity of Football is even lower than that of an ER random network with the same number of nodes and edges. As previously discussed, the homogenous nature of Football improves the accuracy of locating the diffusion source under the same number of observers. In comparison, it is not easy to locate the source in heterogeneous networks such as Celegans and USAir. The γ%-accuracy and average error distance of source localization algorithms on real networks are shown in Figures 8 and 9. Under both metrics, GLAD-modified presents stable and satisfactory performance, while GAU performs the worst. For GLAD-naive and TRBS, their performance varies considerably in different networks. Generally, GLAD-naive performs slightly better than TRBS in small-scale networks with low average degree, whereas TRBS gradually performs better when the number of nodes increases. This situation is similar to that of GLAD-naive and GLAD-modified in arbitrary trees and general graphs. In addition, the performance of TRBS and GLAD-modified is hard to separate in several networks like Football, USAir and Celegans. Thus, we guess that there may be some potential connection between the two methods. In the discussion section, we will introduce the relationship between TRBS and GLAD-modified in detail. . 
Figure 9. Average distance between the estimated source and the real source by source localization algorithms on nine real networks with signal-to-noise ratio µ/σ = 4. If more than one node is ranked first, we randomly choose one as the estimated source.
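Two steps of the real-network evaluation can be illustrated with a short sketch: the preprocessing (keeping the largest connected component and removing self-loops and multiple edges) and the error distance with random tie-breaking among top-ranked nodes. This is only a minimal illustration under stated assumptions, not the code used in the experiments. The function names, the adjacency-dictionary representation, and the convention that a higher score means a more likely source are all hypothetical.

```python
import random
from collections import deque


def largest_simple_component(edges):
    """Build an undirected simple graph and keep its largest connected component."""
    adj = {}
    for u, v in edges:
        if u == v:  # drop self-loops
            continue
        adj.setdefault(u, set()).add(v)  # sets collapse multiple edges
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}


def hop_distance(adj, src, dst):
    """Hop count between two nodes of an unweighted graph (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    raise ValueError("nodes are not connected")


def error_distance(adj, scores, true_source, rng=random):
    """Pick a top-scoring node (ties broken at random) and return its
    hop distance to the true source. Assumes higher score = more likely."""
    best = max(scores.values())
    tied = [node for node, sc in scores.items() if sc == best]
    estimate = rng.choice(tied)
    return hop_distance(adj, estimate, true_source)
```

Averaging `error_distance` over many simulated diffusions would give a quantity comparable in spirit to the average error distance reported above.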

Discussion
In this section, we will discuss some interesting features of GLAD-modified, including the effects of different observer placement strategies, and the internal relationship between GLAD-modified and TRBS.

Different Strategies for Observer Placements
Although GLAD-modified is an effective source localization algorithm that can achieve nearly 90% accuracy in most real networks when 50% of the nodes are observers, it is often not feasible to monitor such a large number of individuals in the real world due to limited resources and privacy issues. Consequently, how to choose the most informative nodes as observers is an important companion problem in source localization research. In the following, we discuss the performance differences of GLAD-modified under several centrality-based observer placement strategies, including small degree, large degree, large betweenness [33] and large closeness [34].
The γ%-accuracy of GLAD-modified on synthetic networks with the basic random strategy and the other four observer placement strategies is displayed in Figure 10. For simplicity, we focus on high-precision localization and do not distinguish among cases where the γ%-accuracy is lower than 0.9. The results show that the effects of different observer placement strategies are very similar in the homogeneous networks ER and WS, whereas they vary dramatically in the heterogeneous network BA. Compared with center nodes with larger degree, betweenness and closeness, peripheral nodes with lower degree provide less useful information for identifying the real source. This result is consistent with the intuition that experts are usually more important than ordinary individuals in decision-making.
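The placement strategies compared above can be sketched as follows. This is a minimal illustration rather than the implementation behind Figure 10: the function names and strategy labels are hypothetical, the graph is assumed to be connected, unweighted and stored as an adjacency dictionary, and ties are broken by node order, which the text does not specify.

```python
import random
from collections import deque


def bfs_distances(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj):
    """Closeness centrality (n - 1) / (sum of distances); connected graph assumed."""
    n = len(adj)
    return {u: (n - 1) / sum(bfs_distances(adj, u).values()) for u in adj}


def betweenness(adj):
    """Unnormalized betweenness for unweighted graphs (Brandes' algorithm)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies back toward s
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # undirected: each pair counted twice


def place_observers(adj, k, strategy, seed=0):
    """Return k observers chosen by the named strategy."""
    nodes = sorted(adj)
    if strategy == "random":
        return random.Random(seed).sample(nodes, k)
    if strategy == "small_degree":
        score = {u: -len(adj[u]) for u in nodes}
    elif strategy == "large_degree":
        score = {u: len(adj[u]) for u in nodes}
    elif strategy == "large_betweenness":
        score = betweenness(adj)
    elif strategy == "large_closeness":
        score = closeness(adj)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(nodes, key=score.get, reverse=True)[:k]
```

On a star graph, for instance, the three "large" strategies all select the hub first, whereas the small-degree strategy selects only leaves.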

Relationship with the TRBS Algorithm
The previous experiments demonstrate that the performances of GLAD-modified and TRBS are closely correlated. Generally, GLAD-modified performs slightly better than TRBS under all metrics in most synthetic and real networks. In particular, the performance of the two methods on Football and Celegans is indistinguishable, which motivates us to explore whether an underlying relationship exists between GLAD-modified and TRBS.
In [21], Shen et al. developed the TRBS algorithm to locate the source of a diffusion-like process. TRBS starts a reversed diffusion process from each observer $o_k$ to all nodes in the network along the reversed direction of the links. At each node $s$, the reversed arrival time $t_{o_k} - t(s, o_k)$ from each observer $o_k$ is recorded, which yields the vector $T_s = [t_{o_1} - t(s, o_1), t_{o_2} - t(s, o_2), \ldots, t_{o_K} - t(s, o_K)]^T$. The node $s^*$ with the minimum variance $\mathrm{Var}(T_{s^*})$ is then identified as the source. In the notation of our paper, the reversed arrival time vector $T_s$ in TRBS is identical to $w_s$ in Equation (5), so the optimization problem of TRBS can be written as

$$s^* = \arg\min_{s} \mathrm{Var}(w_s).$$

Recall the parameter estimation process of GLAD-modified in Equations (15)-(17). For each node $s$, if we are given the extra prior knowledge that all observers $\{o_k\}$ are uniformly distributed around $s$, the distance $d_{sk}$ between $s$ and each observer $o_k$ can be treated as approximately equal to a common value, $d_{sk} \approx d_s$.
Under this assumption, the MLE for $t_0$ in Equation (15), the auxiliary variable $z_s$ in Equation (16), and the optimal estimator for the source node in Equation (18) each simplify, because every distance $d_{sk}$ is replaced by the common value $d_s$. With this prior knowledge, the optimization problem in GLAD-modified becomes similar to that of TRBS, which partly explains why GLAD-modified often performs better than TRBS. Compared with TRBS, the performance of GLAD-modified is more sensitive to the relative locations of the source and the observers. This also sheds light on the anomaly in Figure 4, where the average error distance of GLAD-modified is larger than that of TRBS in BA trees: in BA trees, the distances between nodes vary considerably, which severely limits the effectiveness of GLAD-modified.
However, it is worth noting that this prior knowledge is not rigorous. Even if all observers $\{o_k\}$ are uniformly distributed around one node $s$, it is almost impossible for them to be uniformly distributed around another node at the same time. For example, the observers can never be uniformly distributed around the leftmost or rightmost node of a line graph. The prior knowledge introduced here only acts as one possible bridge between GLAD-modified and TRBS, and more work on this issue remains to be done.
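The TRBS selection rule described in this subsection, choosing the node whose reversed arrival times have minimum variance, can be sketched as follows. This is a simplified illustration and not the implementation of [21]: it assumes that the reversed travel time $t(s, o_k)$ is approximated by the mean delay `mu` times the hop distance between $s$ and $o_k$. The function names and the adjacency-dictionary representation are hypothetical.

```python
from collections import deque
from statistics import pvariance


def bfs_distances(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def trbs_source(adj, observed_times, mu):
    """Return the node minimizing the variance of reversed arrival times.

    observed_times maps each observer o_k to its informed time t_{o_k}.
    Assumption of this sketch: t(s, o_k) = mu * hop_distance(s, o_k).
    """
    dist_from = {o: bfs_distances(adj, o) for o in observed_times}
    best_node, best_var = None, float("inf")
    for s in adj:
        # Reversed arrival time vector T_s = [t_{o_k} - t(s, o_k)]_k
        t_s = [t - mu * dist_from[o][s] for o, t in observed_times.items()]
        var = pvariance(t_s)
        if var < best_var:
            best_node, best_var = s, var
    return best_node
```

In the noiseless case every entry of $T_s$ at the true source equals the start time $t_0$, so the variance there is zero.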

Conclusions
A fundamental question for modern systems that undergo a diffusion-like process, such as information propagation and epidemic spreading, is where the origin is located. In this paper, our main purpose was to devise a method that can locate the diffusion source in complex networks with limited observations. To do so, we modeled the diffusion delay along each edge as a random variable following a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ and derived the corresponding likelihood function for the informed timestamps of observers. The source localization problem could thus be transformed into a parameter estimation and likelihood maximization problem. Since our optimization was based on the Gaussian assumption and the diffusion parameters were deduced from observers, we named our method Gaussian-based localization and deduction (GLAD). We obtained GLAD-naive as the optimal solution on arbitrary trees and further derived GLAD-modified as an approximate solution on general graphs.
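The delay model summarized above can be illustrated with a small simulation that draws one Gaussian delay per edge and computes each node's informed time as the earliest arrival from the source. This is only a sketch under stated assumptions, not the simulation code used in the paper: negative delay samples are clipped to zero so that Dijkstra's algorithm remains valid (an assumption of this sketch that rarely matters when µ/σ = 4), and the function name and graph representation are hypothetical.

```python
import heapq
import random


def simulate_informed_times(adj, source, mu, sigma, t0=0.0, seed=0):
    """Informed time of every node when each edge delay is drawn from N(mu, sigma^2)."""
    rng = random.Random(seed)
    delay = {}

    def edge_delay(u, v):
        key = frozenset((u, v))  # one shared delay per undirected edge
        if key not in delay:
            # Clipping at zero is an assumption of this sketch.
            delay[key] = max(0.0, rng.gauss(mu, sigma))
        return delay[key]

    times = {source: t0}
    heap = [(t0, source)]
    while heap:  # Dijkstra over the sampled delays
        t, u = heapq.heappop(heap)
        if t > times[u]:
            continue  # stale heap entry
        for v in adj[u]:
            candidate = t + edge_delay(u, v)
            if candidate < times.get(v, float("inf")):
                times[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return times
```

Restricting the returned dictionary to a chosen observer set gives the kind of informed timestamps on which a likelihood-based estimator operates.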
We compared the algorithm performances with two benchmark methods in terms of three types of abilities: average localization accuracy, high-precision localization, and approximate area localization. Extensive experiments on synthetic trees showed that GLAD-naive was superior to other methods, which was consistent with our conclusions because GLAD-naive was indeed an optimal solution in this case. The results on general graphs demonstrated that GLAD-modified performed the best among all methods. Furthermore, we studied the effects of different observer placement strategies for GLAD-modified, and the underlying relationship between GLAD-modified and TRBS.
The main contribution of this work is that we incorporate the parameter estimation process into the optimization problem of source localization, which enables us to build a richer model with more parameters and achieve better results. Compared with the well-known source localization algorithm GAU, our framework applies the same Gaussian assumption on diffusion delays and uses the same known information, yet it outperforms GAU significantly on both arbitrary trees and general graphs. In addition, the computational complexity of GLAD-modified is linear in the size of the network. This low cost makes our method applicable to source localization in large-scale networks within a reasonable time. Although we only present the optimization process for undirected networks, the corresponding formulas for directed networks can be obtained with slight modifications. Upon finishing our work, we noticed a recent paper by Tang et al. [35] that also combines parameter estimation with source localization. Their approach is more complicated than ours, since they treat more diffusion parameters as unknown and perform the optimization over a parameterized family of Gromov matrices. However, its high computational cost restricts its use on large-scale networks.
Although our proposed algorithm provides a new perspective on the problem of source localization in complex networks, considerable work remains to be done. Our method relies strongly on the assumption that all diffusion delays $\{\theta_{uv}\}$ along edges are i.i.d. Gaussian variables, $\theta_{uv} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$, which motivates us to consider other delay distributions. Another natural task is to identify multiple diffusion sources in networks. Compared with the single source localization problem, multiple source localization is far more complicated because it is a combinatorial optimization problem. Several studies [36][37][38] have addressed the problem via a two-step strategy that first obtains a set of source candidate clusters and then applies single source localization algorithms to identify the source in each cluster. However, a general framework for simultaneously determining the number of sources and their locations in a large complex network is still lacking. In addition, to the best of our knowledge, few theoretical and practical studies have focused on source localization in multi-layer networks [39] and temporal networks [40].