A Fusion Link Prediction Method Based on Limit Theorem

The theoretical limit of link prediction is a fundamental problem in this field, and the mainstream approach studies it through network structure. This paper proposes a new viewpoint: link prediction methods can be divided into single methods and combination methods according to how they derive the similarity matrix. We investigate whether a theoretical limit exists for combination methods, and we propose and prove necessary and sufficient conditions for a combination method to reach that limit. The limit theorem reveals that the essence of a combination method is to estimate the probability density functions of existing links and nonexistent links. Based on the limit theorem, a new combination method, the theoretical limit fusion (TLF) method, is proposed. Simulations and experiments on real networks demonstrate that the TLF method can achieve higher prediction accuracy.


Introduction
Limit theory is a basic theoretical issue and has attracted wide interest across many fields. On the 125th anniversary of its foundation, Science raised 125 unresolved scientific questions, many of which relate to limit theory [1]. Link prediction predicts missing links in current networks and new or dissolving links in future networks [2]. With the continuous improvement of link prediction methods, the theoretical limit of link prediction has attracted considerable research interest [3].
Considering structural or attribute features, the computer science community proposed link prediction methods based on classification [4,5]. Subsequently, methods that give more insight into network structure, such as similarity-based methods [6], became a focus; these methods pay more attention to physical meaning. At the same time, similarity index fusion methods have been emerging [7,8]. In recent years, with the development of deep learning, deep feature extraction methods have been proposed [9,10], and the fusion of structure and attribute information has regained importance [11][12][13][14]. These methods are strongly consistent with one another. We divide link prediction methods into single and combination methods according to whether they use multidimensional information and whether they define the relation among the dimensions directly. For example, the RA index [15] is a single method, because it directly defines the relation between common neighbors and node degree; classification-based methods, index fusion methods, and methods that fuse structure and attribute information are combination methods.
Most combination methods perform better than the single methods they fuse and are robust across many network types. However, what is the reason for this improved accuracy and robustness, and is there a theoretical limit for combination methods? This paper proposes a mathematical description of combination methods and obtains necessary and sufficient conditions for the theoretical limit. The limit theorem also has important practical value. It reveals that the ultimate goal of combination methods is to estimate the probability density functions of existing links and nonexistent links. Thus, an appropriate form of the transformation function can be selected from the complete set. Based on the limit theorem, a new combination method, the theoretical limit fusion (TLF) method, is proposed. We use the Parzen kernel method [16] of density estimation in the TLF method. Simulations and empirical studies show that the TLF method can achieve higher prediction accuracy.
Appl. Sci. 2018, 8, 32
Section 2 introduces a mathematical description for the theoretical limit of combination methods and evaluation metrics for link prediction. Section 3 proposes and proves necessary and sufficient conditions for the theoretical limit of combination methods. Section 4 proposes a fusion link prediction method based on the limit theorem (the TLF method). Section 5 provides simulation examples for the limit theorem, compares the proposed TLF method with other combination methods, and gives comparison experiments on real networks. Sections 6 and 7 discuss the results and conclude the paper.

Problem Description
Consider a network $G(V, E)$ at time $t$, where $V$ is the set of nodes and $E$ is the set of links. The observed links, $E$, are randomly divided into a training set, $E^T$, and a probe set, $E^P$, where $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. Link prediction aims to predict missing links in the current network or new links at a future time $t'$ ($t' > t$) [2]. Link prediction combination methods fuse several similarity indices into a synthetic index, and can be described mathematically as follows. Let $X = (X_1, X_2, \cdots, X_n)^T$ be the scores of existing links given by $n$ structural similarity indices, following the probability density function (pdf) $f(x) = f(x_1, x_2, \cdots, x_n)$, and let $Y = (Y_1, Y_2, \cdots, Y_n)^T$ be the scores of nonexistent links given by the same $n$ indices, following $g(x) = g(x_1, x_2, \cdots, x_n)$. We need to find the transformation function $l(x)$ whose synthetic scores $\tilde{X} = l(X)$ and $\tilde{Y} = l(Y)$ maximize the evaluation metrics. Figure 1 is the diagram of combination methods.
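The random division of the observed links into a training set and a probe set can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_links`, the fixed seed, and the toy edge set are assumptions, and the default 10% probe fraction mirrors the 90%/10% split described for the real-network experiments.

```python
import random

def split_links(links, probe_fraction=0.1, seed=0):
    """Randomly split observed links E into a training set E^T and a probe set E^P."""
    rng = random.Random(seed)            # fixed seed so the split is reproducible
    shuffled = sorted(links)             # deterministic order before shuffling
    rng.shuffle(shuffled)
    n_probe = int(round(probe_fraction * len(shuffled)))
    probe = set(shuffled[:n_probe])      # E^P: links hidden from training
    train = set(shuffled[n_probe:])      # E^T: links available for training
    return train, probe

# Hypothetical toy network with 10 undirected links.
edges = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (2, 4), (3, 4), (3, 5), (4, 5), (0, 5)}
train, probe = split_links(edges)
```

By construction the two sets are disjoint and their union is the original link set, which is exactly the requirement on the training and probe sets stated above.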

Evaluation Metrics
Let the synthetic score $\tilde{X} = l(X)$ follow the pdf $f_{\tilde{X}}(x)$ and $\tilde{Y} = l(Y)$ follow $g_{\tilde{Y}}(x)$, where $\tilde{X}$ and $\tilde{Y}$ are independent. We have the following metrics.

Area under the Receiver Operating Characteristics Curve (AUC)
A receiver operating characteristics (ROC) curve is a two-dimensional depiction of classifier performance [17]. In link prediction, the abscissa of the ROC curve is the false positive rate (FPR), i.e., the probability that a nonexistent link scores above some threshold $\mu$: $\mathrm{FPR} = \int_{\mu}^{\infty} g_{\tilde{Y}}(x)\,dx$. The ordinate is the true positive rate (TPR), i.e., the probability that a missing link scores above $\mu$: $\mathrm{TPR} = \int_{\mu}^{\infty} f_{\tilde{X}}(x)\,dx$. TPR is equivalent to Recall. According to [18], AUC can be derived as

$$\mathrm{AUC} = P(\tilde{X} > \tilde{Y}) + \tfrac{1}{2} P(\tilde{X} = \tilde{Y}), \quad (1)$$

where the probabilities are taken over independent draws of $\tilde{X}$ and $\tilde{Y}$. In a real network, the original data is randomly divided into a training set and a probe set. Equation (1) means that, for $n$ independent comparisons, if there are $n'$ comparisons where the missing link has the higher score and $n''$ comparisons where the missing and nonexistent links have the same score, we obtain the algorithmic expression of AUC:

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$
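The comparison-based expression of AUC can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `auc_by_comparison` is an assumption, and it compares every probe link with every nonexistent link, whereas the text describes n independent (possibly sampled) comparisons.

```python
def auc_by_comparison(probe_scores, nonexistent_scores):
    """AUC = (n' + 0.5 n'') / n over all (missing link, nonexistent link) pairs."""
    n_higher = 0   # n': the missing link has the higher score
    n_equal = 0    # n'': both links have the same score
    for s_missing in probe_scores:
        for s_non in nonexistent_scores:
            if s_missing > s_non:
                n_higher += 1
            elif s_missing == s_non:
                n_equal += 1
    n = len(probe_scores) * len(nonexistent_scores)
    return (n_higher + 0.5 * n_equal) / n

# Hypothetical scores: 2 missing links and 3 nonexistent links (6 comparisons).
example_auc = auc_by_comparison([0.9, 0.4], [0.4, 0.1, 0.7])
```

In this toy input, four comparisons favor the missing link and one is a tie, so the value is (4 + 0.5) / 6 = 0.75.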

Precision
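Precision can be defined as the ratio of correct predictions to all (correct and incorrect) predictions when the score exceeds $\mu$, i.e.,

$$\mathrm{Precision} = \frac{P(\omega_1)\,\mathrm{TPR}}{P(\omega_1)\,\mathrm{TPR} + P(\omega_2)\,\mathrm{FPR}},$$

where $P(\omega_1)$ and $P(\omega_2)$ are the prior probabilities of existing and nonexistent links, respectively. In a real network, if the top $L$ links are taken as the predicted ones and $m$ of them are right (i.e., $m$ of them are in the probe set $E^P$), then

$$\mathrm{Precision} = \frac{m}{L}. \quad (5)$$

Owing to the imbalance between positive and negative samples, link prediction usually uses the AUC metric. In applications, high Precision means that the target links are accurate and can be used directly. AUC and Precision are two important metrics in link prediction, and we study the theoretical limit using both.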
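The top-L form of Precision can be sketched as below. The function name `precision_at_L`, the dictionary layout (node pair mapped to score), and the toy data are assumptions made for illustration only.

```python
def precision_at_L(scores, probe_links, L):
    """Precision = m / L, where m of the top-L ranked links are in the probe set."""
    # Rank candidate links (those absent from the training set) by descending score.
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    m = sum(1 for link in ranked if link in probe_links)
    return m / L

# Hypothetical scores for four candidate links; two of them are true missing links.
scores = {(0, 3): 0.9, (1, 4): 0.8, (2, 5): 0.3, (0, 4): 0.2}
probe_links = {(0, 3), (2, 5)}
example_precision = precision_at_L(scores, probe_links, L=2)
```

Here the two top-ranked links are (0, 3) and (1, 4), of which only the first is in the probe set, so m = 1 and the Precision is 0.5.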

Theoretical Limit
Theorem 1. Let $X = (X_1, X_2, \cdots, X_n)^T$ and $Y = (Y_1, Y_2, \cdots, Y_n)^T$ be random vectors following the joint distributions $f(x)$ and $g(x)$, respectively, where $m\{x : f(x)/g(x) = C,\ g(x) \neq 0,\ \forall C \in R\} = 0$ ($m$ denotes the measure of a set). Then the following conditions are equivalent.
(a) A monotonically increasing function $r(x)$ exists such that $l(x) = r[f(x)/g(x)]$, $g(x) \neq 0$, a.e. $x \in R^n$.
(b) The transformation function $l(x)$ produces the maximum AUC.
If we add to Theorem 1 the condition that the prior probabilities of existing and nonexistent links are $P(\omega_1)$ and $P(\omega_2)$, respectively, then the following condition is equivalent to (a) and (b):
(c) For any $\alpha$, there exists a corresponding threshold $\mu_l$ for the transformation $l(x)$ satisfying $\alpha = \int_{E(l(x) > \mu_l)} [P(\omega_1) f(x) + P(\omega_2) g(x)]\,dx$, where $E(l(x) > \mu_l) = \{x : l(x) > \mu_l\}$, such that the transformation function $l(x)$ produces the maximum Precision.

Proof. (a) ⇒ (b):
By the equivalent definition, the maximum AUC is the maximum area under the ROC curve. If, for every FPR, the corresponding TPR on the ROC curve reaches its maximum, then the AUC reaches its maximum, i.e.,

$$\max\ \mathrm{TPR} = \int_{E(l(x) > \mu)} f(x)\,dx \quad \text{s.t.} \quad \mathrm{FPR} = \int_{E(l(x) > \mu)} g(x)\,dx = \mathrm{FPR}_0,$$

where $E(l(x) > \mu) = \{x : l(x) > \mu\}$. We use Lagrange's undetermined multipliers to solve this problem. For any specified FPR (denoted $\mathrm{FPR}_0$), maximizing the corresponding TPR on the ROC curve is equivalent to maximizing

$$\phi = \int_{E(l(x) > \mu)} f(x)\,dx - \lambda \left( \int_{E(l(x) > \mu)} g(x)\,dx - \mathrm{FPR}_0 \right) = \int_{E(l(x) > \mu)} [f(x) - \lambda g(x)]\,dx + \lambda\,\mathrm{FPR}_0.$$

The function $\phi$ is maximized if we choose the set $E(l(x) > \mu)$ such that the integrand is positive, i.e., if

$$f(x) - \lambda g(x) > 0, \quad (8)$$

then $x \in E(l(x) > \mu)$. That is, whatever the value of $\lambda$, if we select the set of $x$ on which the integrand $f(x) - \lambda g(x)$ is always positive, $\phi$ reaches its maximum; if the set contains $x$ for which the integrand is negative, $\phi$ decreases. Let $l(x) = f(x)/g(x)$ and $\mu = \lambda$; then the set $E(l(x) > \mu)$ equals $E(f(x)/g(x) > \lambda)$, which satisfies (8). Thus, for any FPR, the corresponding TPR on the ROC curve reaches its maximum, so the AUC reaches its maximum when $X$ and $Y$ are transformed by $l(x) = f(x)/g(x)$.
Let $r(x)$ be a monotonically increasing function and $h(x)$ its inverse. Since $h'(y) = 1/r'(x)$ with $y = r(x)$, $h(x)$ and $r(x)$ have the same monotonicity, and both are increasing functions. Hence $E(r[f(x)/g(x)] > r(\lambda)) = E(f(x)/g(x) > \lambda)$, so $l(x) = r[f(x)/g(x)]$ yields the same ROC curve and also produces the maximum AUC.
(b) ⇒ (a): Let $l_1(x) = r[f(x)/g(x)]$, where $r(x)$ is an increasing function, and suppose there exists $l_2(x)$ such that $\tilde{X}$, $\tilde{Y}$ transformed by $l_2(x)$ also produce the maximum AUC. Then the corresponding ROC curves are the same. Otherwise, if the ROC curves differed, then apart from their common part there would be some FPR at which at least one ROC curve does not reach the maximum TPR, which contradicts the maximum AUC. Since $m\{x : f(x)/g(x) = C,\ g(x) \neq 0,\ \forall C \in R\} = 0$ and the two ROC curves coincide at every point (FPR, TPR), $l_2(x)$ must itself take the form in (a), a.e.
(c) ⇔ (b): Let $k = \mathrm{TPR}/\mathrm{FPR}$ be the slope of the secant from the origin to any point on the ROC curve; then $\mathrm{Precision} = k/(k + \lambda)$, where $\lambda = P(\omega_2)/P(\omega_1)$. For any $\alpha$, $l(x)$ producing the maximum Precision is equivalent to $k$ reaching its maximum, which in turn is equivalent to $\int_{E(l(x) > \mu_l)} f(x)\,dx$ being maximal. Since this holds for any $\alpha$, it is equivalent to the corresponding TPR reaching its maximum for any $\mathrm{FPR} \in [0, 1]$, and hence to $l(x)$ producing the maximum AUC.
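The equivalence of (a) and (b) can be checked exactly on a small discrete analogue. The sketch below is an illustration under stated assumptions, not part of the proof: it replaces the continuous pdfs by hypothetical probability mass functions `f` and `g` on four points (with distinct ratios, mirroring the measure-zero condition), and the helper name `exact_auc` is invented for this example. It enumerates every strict ranking of the four points and compares the best attainable AUC with the AUC of the ratio f/g and of its logarithm, a monotonically increasing transformation.

```python
from itertools import permutations
import math

def exact_auc(f, g, score):
    """Exact AUC = P(X~ > Y~) + 0.5 P(X~ = Y~) for discrete distributions f and g."""
    auc = 0.0
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            if score[i] > score[j]:
                auc += fi * gj
            elif score[i] == score[j]:
                auc += 0.5 * fi * gj
    return auc

# Hypothetical distributions of existing (f) and nonexistent (g) link scores.
f = [0.1, 0.2, 0.3, 0.4]
g = [0.4, 0.3, 0.2, 0.1]

ratio = [fi / gi for fi, gi in zip(f, g)]                    # l(x) = f(x)/g(x)
auc_ratio = exact_auc(f, g, ratio)
auc_log = exact_auc(f, g, [math.log(r) for r in ratio])      # r = log is increasing

# Brute force: every strict ranking of the four points used as a score.
best_auc = max(exact_auc(f, g, perm) for perm in permutations(range(4)))
worst_auc = exact_auc(f, g, [3, 2, 1, 0])                    # ranking reversed
```

In this example no ranking beats the ratio score, the logarithm gives the same AUC as the ratio itself, and reversing the ranking gives a strictly lower AUC.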
Note 1: The condition $m\{x : f(x)/g(x) = C,\ g(x) \neq 0,\ \forall C \in R\} = 0$ excludes the case in which $f(x)/g(x) = C$ ($C$ is a constant) on a set of positive measure, because the transformation function can then be defined arbitrarily on the set $\sigma = \{x : f(x)/g(x) = C,\ g(x) \neq 0,\ \forall C \in R\} \cap R^n$. For example, construct the pdf $f(x)$ of some random vector $X$, with $\int_{R^n} f(x)\,dx = 1$, whose ratio to $g(x)$ is constant on $\sigma$, and let the transformation function equal $f(x)/g(x)$ outside $\sigma$ and some function $l^*(x)$ on $\sigma$. Then, however $l^*(x)$ is defined, $l(x)$ can produce the maximum AUC of $(\tilde{X}, \tilde{Y})$ only when $l^*(x) < \min[f(x)/g(x)]$. In particular, if $f(x) = g(x)$, $x \in R^n$, then AUC = 0.5 regardless of how $l(x)$ is defined. Thus, the maximum AUC is 0.5.

Note 2:
The arbitrariness of the ratio $\alpha$ in condition (c) must be emphasized. If "any $\alpha$" were omitted, (b) ⇒ (c) would still hold but (c) ⇒ (b) would not. In applications, $\alpha$ is a fraction of the whole data: for any $l(x)$, each ratio $\alpha$ corresponds to a threshold $\mu$.
Theorem 1 shows that, whichever evaluation criterion is chosen, the transformation functions that provide the maximum link prediction accuracy constitute a family of functions, $\Phi = \{l(x) : l(x) = r[f(x)/g(x)],\ g(x) \neq 0\}$, where $r(x)$ is a monotonically increasing function. Therefore, the limiting accuracy of the combination method must be greater than or equal to the accuracy of each single dimension.

The Algorithm
The limit theorem of combination methods shows that when the transformation function is $l(x) = f(x)/g(x)$ or a monotonically increasing transformation of it, the AUC and Precision of the synthetic score reach their maximum. In a real network, $f(x)$ and $g(x)$ are unknown, so the pdfs need to be estimated from multidimensional data. Let the estimated pdfs be $\hat{f}(x)$ and $\hat{g}(x)$. On the basis of the limit theorem, we define the transformation function as the ratio of the estimated pdfs, i.e.,

$$l(x) = \hat{f}(x)/\hat{g}(x). \quad (14)$$

The synthetic score, $s = l(x)$, is then used for link prediction. This method is called the theoretical limit fusion (TLF) method.
Before estimating $f(x)$ and $g(x)$, the input link prediction scores need to be normalized (Equation (15)), where $s_k(i, j)$ represents the $k$-th similarity score for node pair $(i, j)$, $N$ is the dimension of the adjacency matrix, and $d$ is the number of similarity indices. The limit theorem transforms the fusion of link prediction indices into the estimation of pdfs, so statistical methods for estimating density functions can be applied to this problem directly. The Parzen kernel method [16] of density estimation is used in this paper. The multivariate kernel density estimate is defined as

$$\hat{f}(x) = \frac{1}{n_s h^d} \sum_{i=1}^{n_s} K\!\left(\frac{x - x_i}{h}\right), \quad (16)$$

where $h$ is the window width, $n_s$ is the sample size, $x_i$ is the $i$-th sample, and $K(x)$ is a multivariate kernel defined for $d$-dimensional $x$ such that

$$\int_{R^d} K(x)\,dx = 1.$$

A commonly used kernel is the Gaussian kernel,

$$K(x) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2} x^T x\right).$$

In summary, the steps of TLF are listed in Table 1.
Table 1. The steps of the theoretical limit fusion (TLF) method.
Step 1: Divide the network into a training set, $E^T$, and a probe set, $E^P$.
Step 2: Normalize the similarity indices according to Equation (15), then distinguish existing links and nonexistent links in the training set.
Step 3: Based on Equation (16), estimate the pdfs of existing links and nonexistent links to obtain the estimated pdfs $\hat{f}(x)$ and $\hat{g}(x)$.
Step 4: Obtain the synthetic score of the $n$ structural similarity indices according to Equation (14).
Step 5: Calculate the accuracy, such as the AUC metric or the Precision metric, on the probe set.
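Steps 3 and 4 of Table 1 can be sketched with a standard Parzen estimate using a Gaussian kernel. This is a minimal sketch under assumptions, not the authors' Matlab implementation: the names `gaussian_kde` and `tlf_score`, the small `eps` guard against division by zero, and the toy samples are all hypothetical, and the inputs are assumed to be already normalized to comparable ranges (the paper's own normalization formula is not reproduced here).

```python
import math

def gaussian_kde(samples, h):
    """Parzen estimate: pdf(x) = 1/(n_s h^d) * sum_i K((x - x_i)/h), Gaussian K."""
    n_s, d = len(samples), len(samples[0])
    norm = n_s * (h ** d) * (2.0 * math.pi) ** (d / 2.0)
    def pdf(x):
        total = 0.0
        for xi in samples:
            sq = sum(((a - b) / h) ** 2 for a, b in zip(x, xi))
            total += math.exp(-0.5 * sq)
        return total / norm
    return pdf

def tlf_score(f_hat, g_hat, x, eps=1e-300):
    """Synthetic score l(x) = f_hat(x) / g_hat(x); eps is an assumed safeguard."""
    return f_hat(x) / (g_hat(x) + eps)

# Hypothetical 2-dimensional training scores, assumed already normalized.
existing = [[0.8, 0.7], [0.9, 0.6], [0.7, 0.9], [0.85, 0.8]]
nonexistent = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.25]]

f_hat = gaussian_kde(existing, h=0.1)       # estimated pdf of existing links
g_hat = gaussian_kde(nonexistent, h=0.1)    # estimated pdf of nonexistent links

high = tlf_score(f_hat, g_hat, [0.8, 0.8])  # pair resembling existing links
low = tlf_score(f_hat, g_hat, [0.2, 0.2])   # pair resembling nonexistent links
```

A candidate pair whose scores resemble those of existing links receives a ratio above 1, and one resembling nonexistent links receives a ratio below 1. In practice the window width h has to be chosen per network.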

Complexity Analysis
For a given undirected, unweighted graph $G(V, E)$, let $N = |V|$ be the number of nodes, $m = |E|$ the number of edges, and $n_s$ the sample size. During the estimation of pdfs in (16), all samples are scanned once. A scan of the samples requires time $O(d \cdot n_s)$, which is less than $O(N^2)$. This is the model training, or pdf estimation, step. All combination methods incur an unavoidable cost for obtaining the similarity matrix, i.e., the final link prediction scores according to Equation (14). This step requires time $O(d \cdot n_s \cdot N^2)$. So the TLF method takes more than $O(N^3)$ time. The main storage is needed for the estimator and the adjacency matrix or final similarity matrix, so the space complexity is $O(N^2)$.

Simulation and Experiment
We programmed the algorithm in Matlab (MathWorks, Beijing, China, 2014) and ran it on a single machine running RedHat 6.4 with 16 GB of memory and a 3.4 GHz CPU; the Matlab version is R2014b. In the simulations of Section 5.1, 4-dimensional pdfs are assumed in order to verify the limit theorem and the effectiveness of the TLF method. We also test the method on real networks. We use the TLF method to fuse 4 local similarity indices: CN [19], AA [20], RA and PA [21,22]. These are 4 simple indices with low computational complexity, about $O(N \langle k \rangle^2)$, where $\langle k \rangle$ is the average degree of nodes in a network. The CN index considers only the common neighbors of a node pair; the PA index considers only the degrees of the two nodes; AA and RA consider both common neighbors and node degrees, with different weights. We compare the method with fusion methods such as naïve Bayes and logistic regression, and with global indices whose computational complexity is more than $O(N^3)$.
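For orientation, the four local indices can be computed from neighbor sets as sketched below, using their standard definitions: CN counts common neighbors, AA and RA sum 1/log(k) and 1/k over the common neighbors' degrees k, and PA multiplies the two node degrees. The function name `local_indices` and the toy graph are assumptions for illustration.

```python
import math

def local_indices(adj, x, y):
    """Return (CN, AA, RA, PA) for node pair (x, y), given neighbor sets in adj."""
    common = adj[x] & adj[y]
    cn = len(common)                                         # common neighbors
    # A common neighbor has degree >= 2, so log(degree) is positive.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common)    # Adamic-Adar
    ra = sum(1.0 / len(adj[z]) for z in common)              # resource allocation
    pa = len(adj[x]) * len(adj[y])                           # preferential attachment
    return cn, aa, ra, pa

# Hypothetical undirected toy graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cn, aa, ra, pa = local_indices(adj, 0, 3)
```

For the unlinked pair (0, 3), the common neighbors are nodes 1 and 2, each of degree 3, so CN = 2, RA = 2/3, AA = 2/ln 3, and PA = 2 × 2 = 4. The four values form one 4-dimensional score vector of the kind a combination method takes as input.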

Simulation Examples
Four types of structural similarity indices were simulated to score node pairs with and without links, with the pdfs of the indices specified in advance. We construct 3 groups of known distributions for the pdfs of the similarity indices. In each group, 10,000 samples of existing links and 100,000 samples of nonexistent links were generated from the appropriate pdfs. One thousand samples drawn from the 10,000 existing links serve as the probe set; the 100,000 nonexistent samples together with the 1000 probe links serve as unknown links for training; and the remaining 9000 samples serve as the training set of existing links. Each sample has 4 dimensions to simulate 4 similarity scores. We first compute AUC and Precision for each dimension, then use the proposed TLF method to obtain the synthetic score and calculate its AUC and Precision, and compare it with other combination methods such as naïve Bayes and logistic regression. Finally, we calculate AUC and Precision using the limit theorem and compare them with the above methods.
Let random vectors $X = (X_1, X_2, X_3, X_4)^T$ and $Y = (Y_1, Y_2, Y_3, Y_4)^T$ be the scores of existing and nonexistent links, which follow $f(x)$ and $g(x)$, respectively. In groups 1 and 2, $f(x)$ and $g(x)$ are 4-dimensional normal distributions with group-specific mean vectors and covariance matrices $\Sigma$. The window width of the TLF method in groups 1 and 2 is $h = 0.1$. In group 3, $f(x)$ and $g(x)$ are defined up to the constant that makes their integrals equal to 1, which we ignore. The window width of the TLF method in group 3 is also $h = 0.1$. The simulation results of groups 1 and 2 are shown in Table 2, and those of group 3 in Table 3.
The simulation results in Tables 2 and 3 show that the theoretical limit of a combination method can be calculated from Theorem 1, and that the limit AUC and Precision are the highest among all listed methods, although we cannot list all possible conditions. The results also show that the TLF method fuses the information effectively and obtains the optimum accuracy. We also verify that a monotonically increasing transformation does not change the theoretical limit. Theorem 1 provides a platform on which combination methods can be compared by constructing distributions, and it guides the design of an effective combination method, TLF.
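A simulation of this kind can be sketched as follows. The parameters are hypothetical and are not those of the paper's groups: both classes are taken to be 4-dimensional normal distributions with independent, unit-variance components and assumed means, with only 300 samples per class to keep the run short. Because the pdfs are known here, the limit score log f(x)/g(x) can be evaluated directly and compared with each single dimension.

```python
import random

def auc(pos, neg):
    """Pairwise AUC: (higher + 0.5 * equal) / number of comparisons."""
    higher = sum(1 for p in pos for q in neg if p > q)
    equal = sum(1 for p in pos for q in neg if p == q)
    return (higher + 0.5 * equal) / (len(pos) * len(neg))

def log_ratio(x, mu_f, mu_g):
    """log f(x)/g(x) for independent unit-variance normals (constants cancel)."""
    return sum(-0.5 * (xi - a) ** 2 + 0.5 * (xi - b) ** 2
               for xi, a, b in zip(x, mu_f, mu_g))

rng = random.Random(1)
mu_f = [1.0, 1.0, 1.0, 1.0]   # assumed means for existing links
mu_g = [0.0, 0.0, 0.0, 0.0]   # assumed means for nonexistent links
X = [[rng.gauss(m, 1.0) for m in mu_f] for _ in range(300)]
Y = [[rng.gauss(m, 1.0) for m in mu_g] for _ in range(300)]

# AUC of each single dimension, then AUC of the limit score.
auc_single = [auc([x[k] for x in X], [y[k] for y in Y]) for k in range(4)]
auc_limit = auc([log_ratio(x, mu_f, mu_g) for x in X],
                [log_ratio(y, mu_f, mu_g) for y in Y])
```

With these assumed parameters the limit score separates the two classes better than any single dimension. The logarithm is used only because it is a monotonically increasing transformation of the ratio and is numerically convenient.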

Experiments in Real Networks
The significance of simulation is that the theoretical limit can be derived by theoretical or numerical calculation, and every combination method can be compared with it to find shortcomings and gaps and to design a more rational method. However, simulation data differs from real network data, so we use the TLF method to fuse several similarity indices and test it on real networks. The basic similarity indices we use are the Common Neighbor index (CN) [19], the Adamic-Adar index (AA) [15], the Resource Allocation index (RA) and the Preferential Attachment index (PA) [21,22]. These are local indices. Several global indices, such as the Katz index [23], the Average Commute Time index (ACT) and the Cosine Similarity Time index (Cos+), serve as comparisons [24,25]. The definitions of the above indices and their meanings are listed in Table 4. We use the TLF method to fuse the 4 local similarity indices and compare it with fusion methods such as naïve Bayes and logistic regression and with the global indices. Our experiments are performed on 11 different real networks: (1) Food Web Everglades Web (FWEW) [26]; (2) Food Web Florida Bay (FWFB) [27]; (3) Protein-protein Interactions Cell (PPI Cell) [28]; (4) CKM-3 [29]; (5) Netscience (NS) [30]; (6) Yeast [31]; (7) Political Blogosphere (PB) [32]; (8) Email [33]; (9) CA-GrQc (CG) [34]; (10) Com-dblp (CD) [35]; (11) Email Enron (EE) [36,37]. The basic topological features of the 11 real networks are listed in Table 5. Each original dataset is randomly divided into a training set containing 90% of the links and a probe set containing 10% of the links.
Tables 6 and 7 compare the TLF method with other combination methods and global indices using the AUC and Precision metrics. Each result is the average of 10 realizations. When calculating the Precision metric in Equation (5), we take L = 100 in datasets 1 to 8 and L = 1000 in datasets 9 to 11. In large networks, the TLF method samples the data to save computing resources; in datasets 10 and 11, the under-sampling rate is set to 1000. The results show that the TLF method performs better than other fusion methods such as naïve Bayes and logistic regression, whichever evaluation metric is used. Almost all combination methods are better than the 4 basic indices. According to the limit theorem, combination methods depend on each dimension, so the improvement from fusion is limited by the individual similarity indices. The experimental results also expose this problem: if the single similarity indices perform poorly, the fusion index cannot significantly improve the accuracy. For example, in the CKM-3 network, although fusing the 4 basic similarity indices with the TLF method clearly improves the AUC, the result cannot exceed that of the Katz index (0.928).

Discussion
Many combination methods try to find the nonlinear relation among the dimensions in order to obtain a more reasonable fusion function and improve the prediction accuracy. For example, the link prediction method based on the Choquet fuzzy integral [7] uses fuzzy measures to quantify the importance of each similarity index in the fusion process and the interaction between them. The logistic-regression-based index adopts the logistic function to learn the relation among multiple structural features and obtain an adaptive link prediction method [38]. In fact, according to the limit theorem, the nonlinear relation is the ratio of two joint probability density functions or a monotonically increasing transformation of it. The best fusion function measures the difference between existing and nonexistent links, and it reflects the relativity of existing and nonexistent links. The essence of combination methods is to approximate the pdfs from many aspects. The limit theorem provides a unified interpretation for all combination methods. On the basis of the limit theorem, the proposed TLF method estimates the two pdfs directly, and the results of simulations and experiments on real networks show that it has a better fusion effect.

Conclusions
This paper proposes a mathematical description of link prediction combination methods and derives the limit theorem. Before this description was proposed, many combination methods had been put forward and widely used. However, these methods were developed separately, without a unified explanation. The limit theorem solves this problem and provides guidance for the design of link prediction methods. The TLF method, based on the limit theorem, can achieve higher prediction accuracy.

Table 2. Simulation results of group 1 and group 2. The bold figure indicates the best accuracy in each dimension and combination method.

Table 3. Simulation results of group 3. The bold figure indicates the best accuracy in each dimension and combination method.

Table 4. Definitions and descriptions of similarity indices. Cos+ calculates the cosine similarity of two vectors in the matrix $L^+$.

Table 5. Basic topological features of the 11 real networks. $|V|$ and $|E|$ are the total numbers of nodes and links, respectively. $\langle k \rangle$ represents the average degree of nodes in a network. $C$ and $r$ are the clustering coefficient and the assortativity coefficient, respectively. $H$ is the degree heterogeneity, defined as $H = \langle k^2 \rangle / \langle k \rangle^2$.

Table 6. Comparisons of the AUC value between TLF and other combination methods or global indices. In each network, the selected window width h is listed along with the AUC value. The bold figure indicates the best AUC.

Table 7. Comparisons of the Precision value between TLF and other combination methods or global indices. In each network, the corresponding window width h is the same as in Table 6. The bold figure indicates the best Precision.