Integrating Node Importance and Network Topological Properties for Link Prediction in Complex Networks

Link prediction is one of the most important and challenging tasks in complex network analysis. It aims to estimate the likelihood that a missing link exists based on the known information in the network. As critical topological properties of a network, node degree and clustering coefficient are well suited to describing how tightly nodes are connected. Node importance also affects the likelihood that a link exists to a certain extent. An analysis of the impact of different centrality measures on links concludes that degree centrality and proximity centrality have the greatest influence on link prediction. Therefore, this paper proposes a link prediction algorithm, called DCCLP, that combines node importance and node attributes. In the training phase of the DCCLP algorithm, maximizing the AUC indicator on the training set is taken as the objective, and the optimal parameters are estimated with the White Shark Optimization algorithm. The prediction accuracy of the DCCLP algorithm is then evaluated on the test set. Experiments on twenty-one networks of different scales, with comparisons against existing algorithms, demonstrate the effectiveness and feasibility of the DCCLP algorithm and further illustrate the importance of the degree centrality of node pairs and the proximity centrality of nodes for improving prediction accuracy.


INTRODUCTION
Complex systems can be described as complex networks consisting of many nodes and edges, so processing information in a complex system can be transformed into mining information in a complex network. Studying complex networks scientifically therefore helps us better recognize and understand the internal structure of complex systems. Research on complex networks mainly includes community detection [1], link prediction [2], important node ranking [3], and network evolution [4]. Link prediction refers to predicting, from the known network structure and node attributes, the possibility that an edge exists between two nodes that are not currently connected. It covers both the prediction of unknown edges and the prediction of future edges [5]. Predicting existing but missing edges is a data mining process, whereas predicting possible future edges is related to the evolution of the complex network.
* Corresponding author: Dai Fang, E-mail: daifang@xaut.edu.cn
According to how network features are extracted, link prediction methods are roughly divided into three categories: methods based on network topology, on maximum likelihood, and on machine learning [6][7][8]. Although methods based on maximum likelihood and machine learning can obtain better prediction accuracy, their range of application is limited by high computational complexity. Methods based on network topology are easier to implement because they predict link existence directly from the network topology. Existing topology-based methods can be divided into three categories: local similarity, quasi-local similarity, and global similarity.
Link prediction based on local similarity mainly considers the information of the common neighbors of nodes in the network. References [9][10][11] focus on the degree and the number of common neighbors to predict network edges. Gao et al. [12] proposed a link prediction algorithm that considers node attributes; it uses the degree and clustering coefficient of common neighbors to estimate the likelihood that a link exists between nodes and achieves good prediction accuracy. Yu et al. [13] proposed a link prediction algorithm that combines degree, clustering coefficient, and node centrality for link prediction in complex networks. The advantage of methods based on local similarity is low computational complexity. However, because they use limited information, their prediction accuracy is not ideal.
Link prediction based on global similarity takes advantage of more information in the network. For example, the Katz method [14] considers the transfer of similarity along all paths in the network and improves prediction accuracy. Based on random walks in the network, Brin et al. [15] proposed a random walk method with restart, which also improves prediction accuracy. Methods based on global similarity have good prediction accuracy but high computational complexity. To improve the prediction accuracy of local similarity methods and reduce the computational complexity of global similarity methods, some scholars proposed link prediction methods based on quasi-local similarity, such as the local path method proposed by Zhou et al. [16]. Building on the common neighbor method, this method also considers the contribution of third-order paths to node similarity and has good prediction accuracy. Liu et al. [17] proposed the LRW method based on local random walks in the network. Because only a limited number of steps are used, the computational complexity of this method is low, and it is particularly suitable for link prediction in large-scale networks. Based on the local topology of the network and the strong connections between nodes, and considering the contribution of high-order paths to node similarity, Qian et al. [18] proposed three algorithms: TPSR2, TPSR3 and TPSR4. The experimental results show that these three algorithms have good prediction accuracy and that the TPSR3 algorithm performs best among them.
Inspired by the idea of employing local link information for link prediction, Wu et al. [19] proposed a new similarity method named CCLP, which uses more local structure information in the network. Comparative experiments on networks from various fields show that it is more effective in predicting missing links and has better prediction accuracy. Shi et al. [20] proposed the CN2D algorithm, which uses the degree distribution and local information of nodes to estimate the likelihood that a link exists between nodes. The CN2D algorithm has low computational complexity and high prediction accuracy. Based on the initial information contribution of nodes, Liu et al. [21] proposed the link prediction algorithm IICN, which aims to solve the problem that the initial amount of information held by nodes is ignored in the information transmission process between nodes. The experimental results demonstrate that, compared with mainstream benchmark methods, the IICN algorithm has clear advantages in effectiveness and robustness.
In this paper, we propose a new link prediction algorithm, called DCCLP, that considers the degree centrality of node pairs and the closeness (proximity) centrality of nodes, and predicts the possibility of link existence from two aspects: node importance and node topological features. The DCCLP algorithm compensates for the fact that the TPSR3 algorithm does not fully consider node importance, and it achieves better prediction accuracy and performance.

THE TPSR3 ALGORITHM
In 2017, Qian et al. [18] used the strong connections of the ego network and the close relationships between nodes to describe similarity, and proposed a class of TPSR algorithms that combine topological properties and strong ties. The main idea of these algorithms is to use paths of up to third order in the network together with node attribute information (such as degree and clustering coefficient) to describe the similarity between nodes. Comparison and analysis of the experimental results show that the TPSR3 algorithm performs best.
For two nodes $x, y \in V$ in the undirected unweighted network $G = (V, E)$, let $P_{x,y}^{3}$ denote the set of third-order paths connecting node $x$ and node $y$. The similarity between nodes $x$ and $y$ is then given by formula (1) [18],
where $\Gamma(x)$ and $\Gamma(y)$ represent the neighbor sets of nodes $x$ and $y$ respectively, and the parameter $\varepsilon$ is a small positive number that controls the contribution of third-order path information to node similarity. $|P_{x,y}^{3}|$ represents the number of elements contained in $P_{x,y}^{3}$, and $k_z$ is the degree of node $z$.
$C_z$ denotes the clustering coefficient of node $z$, which is calculated by formula (2):
$$C_z = \frac{2E_z}{k_z(k_z - 1)} \qquad (2)$$
where $E_z$ represents the number of edges between the neighbor nodes of node $z$.
According to formula (1), when measuring the similarity between nodes in the network, the TPSR3 algorithm considers the attribute information of the common neighbors of the predicted nodes as well as the path information between the nodes, and it adjusts the contribution of third-order paths to node similarity through the parameter $\varepsilon$.
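The quantities that the text names as inputs to formula (1), namely common neighbors, third-order paths, node degree and the clustering coefficient, can all be computed from an adjacency map. The Python sketch below is illustrative only: the function names are hypothetical, the clustering coefficient follows the standard local definition, and the code does not reproduce the TPSR3 similarity itself, whose exact form is given in [18].

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected, unweighted adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, z):
    """Local clustering coefficient 2*E_z / (k_z*(k_z-1)); 0.0 when k_z < 2."""
    k = len(adj[z])
    if k < 2:
        return 0.0
    # E_z: number of edges among the neighbors of z
    links = sum(1 for a, b in combinations(adj[z], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def third_order_paths(adj, x, y):
    """Return all simple paths x - a - b - y of length three."""
    paths = []
    for a in adj[x]:
        if a == y:
            continue
        for b in adj[a]:
            if b != x and b != y and y in adj[b]:
                paths.append((x, a, b, y))
    return paths


# Toy network (assumed for illustration only)
adj = build_adjacency([(1, 2), (2, 3), (3, 4), (1, 3), (2, 4)])
common = adj[1] & adj[4]              # common neighbors of nodes 1 and 4
paths = third_order_paths(adj, 1, 4)  # third-order paths between 1 and 4
```

For the toy network, nodes 1 and 4 share the two common neighbors 2 and 3 and are joined by two third-order paths, 1-2-3-4 and 1-3-2-4.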

THE DCCLP ALGORITHM
Similar to existing link prediction algorithms based on local network structure, the TPSR3 algorithm does not fully consider the importance of nodes in the network, although node importance has a significant effect on the possibility that a link exists between nodes. By analyzing the impact of different centrality measures on link existence, [22] points out that degree centrality has the most significant effect and that the proximity centrality of nodes can also reflect their importance in the network. In this paper, the degree centrality of node pairs and the proximity centrality of nodes are integrated into the TPSR3 algorithm to improve it; the resulting algorithm is named DCCLP.
For two nodes $x, y \in V$ in the undirected unweighted network $G = (V, E)$, we define the similarity between them as shown in formula (3),
where $\alpha > 0$ and $\beta > 0$ are adjustable parameters with $\alpha + \beta = 1$. The parameter $\varepsilon$ controls the contribution of the third-order paths to the node similarity, $k_x$ represents the degree of node $x$, and $DC_{xy}$ represents the degree centrality of the node pair $(x, y)$, which is calculated by formula (4),
where $|V|$ represents the number of nodes in the network $G$.
In the network $G$, the average shortest path distance $d_x$ from node $x$ to the other nodes is expressed by formula (5),
where $d_{xy}$ represents the shortest path distance between node $x$ and node $y$; the smaller $d_x$ is, the more important node $x$ is in the network $G$.
The proximity centrality $CC_x$ of node $x$ in the network $G$ can be calculated by formula (6).
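The average shortest path distance and the proximity centrality described above can be sketched with a breadth-first search. This is a minimal sketch under stated assumptions: the network is connected, the average is taken over the other nodes (number of nodes minus one), and proximity centrality is the reciprocal of that average, which is the usual closeness definition; the paper's formulas (5) and (6) may normalize differently. All function names are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted network."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance(adj, x):
    """Average shortest path distance from x to every other node (assumed form)."""
    n = len(adj)
    if n < 2:
        raise ValueError("need at least two nodes")
    dist = shortest_path_lengths(adj, x)
    if len(dist) < n:
        raise ValueError("network must be connected")
    return sum(dist.values()) / (n - 1)


def proximity_centrality(adj, x):
    """Reciprocal of the average distance: a smaller distance means a more central node."""
    return 1.0 / average_distance(adj, x)


# Toy path network 1 - 2 - 3 (assumed for illustration only)
path = {1: {2}, 2: {1, 3}, 3: {2}}
center_score = proximity_centrality(path, 2)
end_score = proximity_centrality(path, 1)
```

On the three-node path, the middle node has average distance 1 and therefore a higher proximity centrality than either end node, which matches the statement that a smaller average distance indicates a more important node.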
In the following, we describe the detailed steps of the DCCLP algorithm in terms of a training phase and a testing phase.

(1) Training Phase
Step1 For a given network $G$, 90% of the edges and all the nodes of $G$ are randomly selected as the training set, subject to the training network remaining connected.
Step2 Taking the maximization of the AUC indicator on the training set as the objective, the adjustable parameters $\varepsilon$ and $\alpha$ are estimated by the White Shark Optimization algorithm.
Step3 Step1 and Step2 are repeated for one hundred independent experiments to calculate the average AUC, and the optimal parameters $\varepsilon^*$ and $\alpha^*$ are those corresponding to the maximum AUC indicator.
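The training steps above can be sketched as follows. Two simplifications are assumed here and are not the paper's method: the connectivity requirement is met by always keeping a randomly chosen spanning tree in the training set, and a plain random search stands in for the White Shark Optimization algorithm. The names `split_edges`, `random_search`, `eps_max` and the search ranges are hypothetical.

```python
import random


def split_edges(edges, test_fraction=0.1, seed=0):
    """Split edges into (train, test) so that the training network stays connected.

    Assumes the input network is connected and has no duplicate edges.
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree, spare = [], []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))   # needed to keep the network connected
        else:
            spare.append((u, v))  # removable without disconnecting anything
    n_test = round(test_fraction * len(edges))
    if n_test > len(spare):
        raise ValueError("not enough removable edges to keep the network connected")
    return tree + spare[n_test:], spare[:n_test]


def random_search(objective, trials=200, eps_max=0.1, seed=0):
    """Stand-in optimizer: sample (eps, alpha) uniformly and keep the best objective value."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(trials):
        params = (rng.uniform(0.0, eps_max), rng.uniform(0.0, 1.0))
        value = objective(*params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value
```

In use, `objective` would be a function that computes the training-set AUC for a given pair of parameters; any other optimizer with the same interface could replace `random_search`.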
(2) Testing Phase
The optimal parameters $\varepsilon^*$ and $\alpha^*$ are substituted into formula (3), 10% of the edges in the network $G$ are randomly selected as the test set, and the evaluation indicators AUC and Precision are used to measure the prediction accuracy of the DCCLP algorithm.
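The two evaluation indicators can be sketched as follows, assuming the sampling-based AUC and the top-L Precision that are common in the link prediction literature; the exact definitions used in this paper are those of [23] and [24]. The function names and the `score` callable, which stands for any similarity index, are hypothetical.

```python
import random


def auc_score(score, test_edges, non_edges, comparisons=10000, seed=0):
    """Sampling-based AUC: compare a random test edge with a random nonexistent edge.

    test_edges and non_edges must be non-empty sequences of (u, v) pairs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(comparisons):
        s_missing = score(*rng.choice(test_edges))
        s_absent = score(*rng.choice(non_edges))
        if s_missing > s_absent:
            total += 1.0
        elif s_missing == s_absent:
            total += 0.5  # ties count as half
    return total / comparisons


def precision_at_l(score, test_edges, non_edges, top_l):
    """Fraction of the top_l highest-scored candidate pairs that are test edges."""
    test_set = {frozenset(e) for e in test_edges}
    candidates = list(test_edges) + list(non_edges)
    ranked = sorted(candidates, key=lambda e: score(*e), reverse=True)
    hits = sum(1 for e in ranked[:top_l] if frozenset(e) in test_set)
    return hits / top_l
```

A similarity index that scores pairs at random yields an AUC near 0.5, so values above 0.5 indicate how much better than chance the index ranks missing links.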

EXPERIMENTAL RESULT AND ANALYSIS
Next, we conduct two groups of comparative link prediction experiments to verify the effectiveness and feasibility of the DCCLP algorithm.
(1) Comparison of the DCCLP algorithm with the algorithms in [18]
To verify the effectiveness and prediction accuracy of the DCCLP algorithm, we compare it with the nine algorithms mentioned in [18] on networks of different scales. Before the experiment, the datasets are preprocessed: directed networks are converted to undirected networks, and edge weights in weighted networks are ignored. Tab. 1 and Fig. 1 (left) show the AUC indicator of the ten algorithms. Tab. 2 and Fig. 1 (right) show the Precision indicator of the ten algorithms. The bold values in Tab. 1 and Tab. 2 represent the best prediction accuracy among the ten algorithms. The first nine columns of data are taken from [18]. AUC stands for Area Under the Receiver Operating Characteristic Curve and is defined as in [23]; Precision is defined as in [24].
According to the above experimental results, the DCCLP algorithm has better prediction accuracy and predictive performance than the other nine algorithms. As the AUC results in Tab. 1 show, the DCCLP algorithm obtains the best AUC indicator on nine networks; on the Netscience network in particular, the AUC value is improved by about 6% compared with the other nine algorithms. As the Precision results in Tab. 2 show, the DCCLP algorithm obtains the best Precision indicator on nine networks; on the FWEW network in particular, the Precision is also improved by about 6% compared with the best of the other nine algorithms. The advantages of the DCCLP algorithm lie mainly in three aspects. Firstly, local similarity methods such as CN, RA, AA and TPSR2 mainly use node attribute information to predict edge existence, whereas the DCCLP algorithm also considers the importance of nodes in the network. Secondly, the global similarity method Katz predicts the existence of edges by considering all paths in the network, whereas the DCCLP algorithm uses fewer paths, which reduces its computational complexity. Thirdly, the DCCLP algorithm considers the degree of the predicted nodes themselves when measuring the similarity between nodes. Overall, the DCCLP algorithm performs best among the compared algorithms.
In Tab. 1 and Tab. 2, the test sets contain a fraction q = 0.1 of the edges in the complete networks, and the presented results are the average of 100 independent runs. For the DCCLP algorithm, the values in parentheses are the optimal values of the adjustable parameters $\varepsilon^*$ and $\alpha^*$; the parameters of the LP method and the Katz method are set to 0.01.
(2) Comparison of the DCCLP algorithm with existing algorithms
To further verify the effectiveness and predictive performance of the DCCLP algorithm, we compare it with the CCNC [13], CCLP [19] and CN2D [20] algorithms, which are based on network topology and ignore the influence of node degree centrality on link existence in the network. Twenty-one networks of different scales are collected for the comparative experiments. These networks include social, biological, citation, cooperation, aviation and other networks. The topological properties of the networks are shown in Tab. 3.
The DCCLP algorithm proposed in this paper is used for link prediction in the twenty-one networks shown in Tab. 3. In Tab. 3, $|V|$ is the number of nodes in the network; $|E|$ is the number of edges in the network; $\langle k \rangle$ is the average degree of the network; $\langle C \rangle$ is the average clustering coefficient of the network; $D$ is the network density; $H$ is the network degree heterogeneity; $\langle d \rangle$ is the average shortest distance of the network.
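Several of the topological properties listed for Tab. 3 follow directly from an edge list. The sketch below is illustrative and rests on assumptions: density is taken as the fraction of possible node pairs that are connected, degree heterogeneity as the mean squared degree divided by the squared mean degree (a common definition that the paper does not spell out), and nodes without edges are not counted. The function name `network_summary` and the dictionary keys are hypothetical.

```python
def network_summary(edges):
    """Basic statistics of an undirected, unweighted network given as an edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    degrees = [len(neighbors) for neighbors in adj.values()]
    m = sum(degrees) // 2                      # each edge is counted twice
    mean_k = sum(degrees) / n                  # average degree
    mean_k2 = sum(k * k for k in degrees) / n  # mean squared degree
    return {
        "nodes": n,
        "edges": m,
        "average_degree": mean_k,
        "density": 2.0 * m / (n * (n - 1)),
        "heterogeneity": mean_k2 / mean_k ** 2,  # assumed definition
    }


# Toy star network with center 0 (assumed for illustration only)
summary = network_summary([(0, 1), (0, 2), (0, 3)])
```

For the toy star network the average degree is 1.5 and the density is 0.5; the heterogeneity exceeds 1 because the center has a much higher degree than the leaves.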
Tab. 4 and Tab. 5 show the AUC and Precision results of the four algorithms on the twenty-one networks, respectively. The values in parentheses in Tab. 4 and Tab. 5 represent the optimal parameters $\varepsilon^*$ and $\alpha^*$. The experimental data of the first three algorithms are cited from the corresponding literature, and the rest are the results of this paper. Among the twenty-one networks shown in Tab. 3, the DCCLP algorithm achieves the best AUC indicator on eleven networks and the best Precision indicator on nine networks, which further shows the superiority of the DCCLP algorithm. Compared with the other three algorithms, the DCCLP algorithm performs better in biological, cooperation and aviation networks by introducing node importance. In social networks, however, its prediction accuracy is not as good as that of the CCNC algorithm.
The experimental results show that, compared with the CCNC, CCLP and CN2D algorithms, the DCCLP algorithm has better prediction accuracy and a better prediction effect. The results also show that the clustering coefficient, degree and proximity centrality of common neighbor nodes, together with the degree centrality of the predicted nodes, increase the possibility of an edge between nodes and reflect well the differences in the contribution of each neighbor node.

CONCLUSIONS
In this paper, the degree centrality of the predicted nodes and the proximity centrality of nodes are integrated into the TPSR3 algorithm, and the link prediction algorithm DCCLP is proposed. The DCCLP algorithm predicts the existence of unknown edges in the network from the perspectives of network structure similarity and node importance. In the two groups of comparison experiments, the DCCLP algorithm achieves better overall AUC and Precision than the comparison algorithms, which verifies its effectiveness and feasibility and further illustrates the importance of degree centrality and proximity centrality for improving the accuracy of link prediction.
In this paper, we mainly consider the influence of node importance on link prediction in undirected unweighted networks; node importance in directed networks and weighted networks still needs further exploration. In future research, we will generalize our algorithm to directed networks and weighted networks.

Fig. 1 The average AUC and Precision of the DCCLP algorithm and the corresponding similarity-based link prediction algorithms. For almost all networks (except C.elegans and NetScience), the DCCLP algorithm surpasses all comparison algorithms.
The DCCLP algorithm is compared with the CCNC, CCLP and CN2D algorithms on the twenty-one networks in Tab. 3, and the prediction results are shown in Tab. 4, Tab. 5 and Fig. 2. Tab. 4 and Fig. 2 (left) show the AUC indicator of the four algorithms, and Tab. 5 and Fig. 2 (right) show the Precision indicator of the four algorithms.

Fig. 2 The average AUC and Precision of the DCCLP algorithm and the comparative link prediction algorithms based on the network topology.
Tab. 1 Prediction performance of link prediction methods measured by AUC in a set of real-world networks
Tab. 2 Prediction performance of link prediction methods measured by Precision in a set of real-world networks
Tab. 3 Basic topological properties of twenty-one networks used in the experiments

Tab. 4 The AUC indicator for link prediction of four algorithms in twenty-one networks
Tab. 5 The Precision indicator for link prediction of four algorithms in twenty-one networks