Link Prediction in Complex Networks Using Recursive Feature Elimination and Stacking Ensemble Learning

Link prediction is an important task in the field of network analysis and modeling, and predicts missing links in current networks and new links in future networks. In order to improve the performance of link prediction, we integrate global, local, and quasi-local topological information of networks. Here, a novel stacking ensemble framework is proposed for link prediction in this paper. Our approach employs random forest-based recursive feature elimination to select relevant structural features associated with networks and constructs a two-level stacking ensemble model involving various machine learning methods for link prediction. The lower level is composed of three base classifiers, i.e., logistic regression, gradient boosting decision tree, and XGBoost, and their outputs are then integrated with an XGBoost model in the upper level. Extensive experiments were conducted on six networks. Comparison results show that the proposed method can obtain better prediction results and applicability robustness.


Introduction
Complex networks can be used to model various real-world complex systems such as social, biological and information systems, where nodes denote different individuals or entities in the system, and links or edges indicate the relations or interaction between nodes [1]. Great efforts have been devoted to analyze the network topologies, and predict links among nodes in order to better understand the evolution of networks. A challenging issue in complex network analysis is link prediction, which aims to reveal potential links or forecast future relations based on the known information of nodes and network structure [2,3].
A large number of methods have been proposed to solve the link prediction problem [4][5][6][7], which are mainly divided into similarity-based methods and learning-based methods. Similarity-based methods are the most common, and especially structural similarity, due to simple and efficient computation, and universal applicability to networks with similar structures. These methos hold the view that the probability of link existence between two nodes is proportional to their topological similarity. In terms of the topological information range used in similarity calculation, these methods can be grouped into three categories: local similarity methods, global similarity methods, and quasi-local similarity methods. Local similarity methods explore local information of the disconnected nodes such as common neighbors and node degree, and include Common Neighbors (CNs) [8], Jaccard coefficient (JC) [9], Preferential Attachment (PA) [10], the Adamic-Adar (AA) index [11], Resource Allocation (RA) [12], Cosine similarity, and the Salton index (SI) [13]. These methods achieve good results in many cases due to lower computational complexity and simple implementation [14]. Global similarity methods utilize the topological information of the whole network for link prediction. Klein et al. [15] introduced average commuting time (ACT) as a novel distance function, and considered that

Problem Description
Consider an undirected and unweighted network G(V, E), where V and E denote the sets of nodes and links, respectively. As is usual, self-loops and multiple links are not allowed. For a network with N nodes, A = a ij N×N represents the adjacency matrix. Let S be the universal set of all N(N−1) 2 possible links, and S − E be the set of nonexistent links. For each nonexistent link e v i , v j ∈ S − E, v i , v j ∈ V, the similarity score is evaluated to quantify the link existence likelihood according to a defined similarity index. Generally, the link set E can be randomly divided into two parts: training set E T and test set E P , where E = E T ∪ E P , E T ∩ E P = ∅. The training set E T is utilized to compute the similarity index to obtain the similarity score of the test set E P . Link prediction methods exploit the known topological information to predict the missing or unknown links in the network. It can be deemed to be a binary classification problem, where the class label is determined by the existence of links. If there is a direct connection between two nodes, the label is 1; otherwise, the label is 0. In this way, each link is assigned a label, and then the topological features of each link are extracted from the network information by computing the similarity indices. The goal of link prediction is to utilize these extracted topological features to forecast the links in the network G.

Similarity Indices
Let Γ v i denote the neighbor set of node v i , and k v i denote its degree. Several similarity indices are introduced as follows.
(i) Common Neighbor (CN) index [8]: the CN index emphasizes that a pair of nodes with more common neighbors are more likely to be connected: (ii) Preferential Attachment (PA) index [10]: the PA index assumes that the connection probability of a pair of nodes is proportional to the product of their degree: (iii) Adamic-Adar (AA) index [11]: the AA index assigns more weights to the common neighbor nodes with a low degree: (iv) Leicht-Holme-Newman (LHN) index [28]: a pair of nodes with more common neighbors has high similarity, and the product of two node degrees is proportional to the mean of a common neighbor's number: (v) Resource Allocation (RA) index [12]: the proposal of RA index was inspired by the process of network resource allocation. Some resources are passed by node v i to v j , and their common neighbor is the transmission medium. The number of resources received by v j is defined as the similarity: (vi) Average Commute Time (ACT) index [15]: the ACT index assumes that the smaller the average commute time of two nodes, the closer the two nodes. Denote m(x, y) as the average number of steps that a random walker starting from node v i to reach node v j . The average commute time is then described as: By solving the pseudoinverse of the Laplacian matrix L + , we can obtain: and then the similarity is defined as: where l + ij represents the corresponding element in the matrix L + . (vii) Matrix Forest Index (MFI) index [29]: the MFI index is based on matrix-forest theory and expressed as: where L is the Laplacian matrix.
(viii) Random Walk with Restart (RWR) index [16]: the RWR index assumes that the random walker returns to the starting point with a certain probability 1 − c, and P is the transition probability matrix. The probability vector of arriving at each node v i of the network at time t + 1 is: where e v i represents the starting point, and the similarity index is defined as: where π v i v j represents the probability from node v i to node v j .
(ix) SimRank index [17]: the SimRank index states that two nodes are similar if they are connected to similar nodes: where the attenuation parameter s v i v i = 1, and C ∈ [0, 1]. (x) Local Path (LP) index [18]: the LP index considers local paths on the basis of the CN index, and attains a balance between the precision and computational complexity: where A 2 v i , v j and A 3 v i , v j are the number of different paths with the length 2 and 3, respectively, and α is the free parameter α = 0.001. (xi) Local Random Walk (LRW) index [19]: the random walker starts from node v i at time t, and π v i v j (t) is the probability that the walker just goes to node v j at time t + 1: The initial resource of node v i is q v i , and the LRW index at time step t is defined as: (xii) Superposed Random Walk (SRW) index [19]: on the basis of the LRW index, the value of the SRW index can be acquired by summing the results of step t and the previous step:

Dataset Description
To evaluate the effectiveness of the proposed RF-RFE-SELLP model, we performed massive experiments on six real-world networks from different fields. The basic statistics of these networks are summarized in Table 1 and a brief description is provided. C. elegans [30]: The Caenorhabditis elegans network consists of 297 neurons and 2148 connections.
Vicker [31]: the Vicker dataset is a social network between seventh grade students in a school in Victoria, Australia., which consists of 29 students and 376 links.
Email [32]: the E-mail network has 1133 users and 5451 communications between the members of University at Rovira i Virgili.
NS [33]: the NetScience dataset is a co-authorship network consisting of 1589 scientists and 2742 co-authorships.
SciMet [34]: the SciMet dataset is a network of articles from or cited by Scientometrics that includes 3084 articles and 10,399 links.
Router [35]: the Router dataset is an internet router hierarchical network containing 5022 routers and 6258 links.

Proposed Framework: RF-RFE-SELLP
In this section, the proposed RF-RFE-SELLP model is presented in detail, including model structure and algorithm procedure.

Motivation
Twelve similarity indices are introduced in Section 2.2, among which each index exploits one or two structural features of networks [36]. The fusion of these indices can incorporate multiple structural features to improve prediction performance. However, different networks have different inner structural features.
To validate this fact, a series of experiments was constructed on the six networks described above. For each network, the matching score [22] of one similarity index is computed: where E denotes the set of top |E| ranked links predicted by the predefined index. The larger the value of σ, the better the accuracy of the index. In this work, the PA index and the LP index are employed to illustrate that one similarity index plays different roles in different networks. The matching scores of the two indices, σ PA and σ LP , respectively, are calculated according to Equation (17), and the difference between them is defined as: The values of ∆σ for the six networks are shown in Figure 1. It can be observed that for NS and Vicker networks, the role of the PA index is superior to that of the LP index; that is, the structural features induced by the PA index can better represent this network.
For the Email and SciMet networks, the role of the LP index is greater than that of the PA index, which indicates that LP feature is more representative on these networks. For the remaining two networks, there is little difference between the two indices. Therefore, it is unwise to use structural features of one similarity index for link prediction in all networks. Due to different contributions of each index to different networks, it is essential to select suitable features of networks to improve the prediction accuracy.

Random Forest-Based Recursive Feature Elimination for Feature Selection
In random forest (RF), feature selection can be performed in a very simple way by the removal of features with low importance. However, RF can only provide the ranking of importance for each feature, but cannot determine the effective features and the number of features in the optimal feature set. In order to choose the relevant features for building the classification model, we recursively remove the features with lower importance to obtain smaller feature subsets, estimate the discriminative abilities of features of the subsets, and select those features with greatest discriminative power to enhance the prediction performance.
For m features of x 1 , x 2 , · · · , x m , the variable importance measure (VIM) is computed to evaluate the importance of each feature, and can be derived from the Gini index. The Gini index can be defined as follows: where C represents the number of categories, and p tc indicates the proportion of samples that belongs to category c for node t. The VIM of feature x j in node t is: where GI l and GI r represent the Gini index of left and right child branches, respectively. The VIM of feature x j in the i-th decision tree is: where T is the node set of feature x j in the i-th decision tree. The VIM of feature x j in RF is: where n is the number of trees in RF. The importance score of feature x j is defined by normalizing the V I M j : The higher the value of V I M j , the greater the importance of feature x j on RF for classification. Furthermore, recursive feature elimination (RFE) [37] is utilized to repeatedly remove the feature with the lowest VIM to obtain a new feature subset until only one feature remains in the feature set. We estimate the discriminative abilities of features in the subsets and then select the optimal features. To find the optimal number of features, k-fold crossvalidation is employed to score different feature subsets and select the best scoring set of features. The cross-validation score (CVS) is used to evaluate the discrimination of each feature subset, defined as: where p k is the classification accuracy on the k-th cross validation subset. The feature subset with the highest CVS is chosen as the optimal one. The proposed RF-RFE method aims to eliminate irrelevant features and redundant features to represent the original features with as few features as possible in order to improve the generalization ability of the classification model. In this work, k is set as 5, and the simulation is conducted 20 times.

Stacking Ensemble Learning for Link Prediction
Ensemble learning combines multiple weak supervised models to form a better supervised model. Ensemble learning methods are mainly divided into three categories: bagging, boosting, and stacking. Stacking fuses information from base models to generate a new model to obtain better classification performance than that of a single model. The base model with strong learning ability is conductive to improving the classification results. Moreover, the classification models with great differences should be selected as the base model, which can reflect the advantages of different algorithms to the greatest extent. In this paper, a stacking ensemble learning model for link prediction is proposed, named SELLP, in which three different models, i.e., XGBoost, LR, and GBDT, are selected as the base models, and XGBoost, which has better generalization performance, is employed as the top-level model. The model is shown in Figure 2. For a network with N nodes, a dataset S is constructed containing all possible links. The dataset S is divided into a training set S T and test set S P , where S T = (x i , y i ), i = 1, . . . , N T , N T is the number of samples, x i ∈ R D , and y i is the feature vector and label of the i-th sample, respectively. In the training phase, the dataset S T is randomly divided into five subsets S T k , k = 1, . . . , 5 with the same size. Let S T −k = S T − S T k , S T −k , and S T k be the k-th training set and test set in the 5-fold cross validation, respectively. For the base model C j , j = 1, 2, 3 of the first layer, the training set S T −k is used to train each base model C j to obtain the base model C jk , k = 1, . . . , 5. After cross validation, 3 × 5 = 15 base models are obtained. For each sample x i in the k-th test set S T k , the prediction result of the base model C jk iŝ y ji . The probability predicted by each base model C j on the subset S T k forms a new dataset S T new = (ŷ 1i ,ŷ 2i ,ŷ 3i , y i ), i = 1, . . . , N T . The newly generated dataset is then used to train the fuse model of the second layer in the SELLP model. By learning from the new dataset S T new , the XGBoost model can fuse the learning results of multiple base models in the first layer.
For each base model, the test data are different from the training data. All the data are utilized only once in the training phase, which can effectively prevent overfitting. Moreover, in the SELLP model, the training results of base models in the first layer are fully used in the induction process of the fusion model in the second layer. The fusion model can optimize and correct the prediction results of base models in the first layer to improve the accuracy of the SELLP model.
In the test phase, each sample in the test set S P = x l , l = 1, . . . , N P is fed into the base models in the first layer. For the sample x l , the base models C jk , k = 1, . . . , 5, predict the classification results y jkl , respectively, and then average classification results y jl = 1 5 ∑ 5 k=1 y jkl can be obtained for each base model C j . The classification results of three base models form a data ( y 1l , y 2l , y 3l ), which is sent to the fuse model to obtain the final classification results given by the SELLP model.

RF-RFE-SELLP Algorithm
The training procedure of the proposed RF-RFE-SELLP model is summarized as follows: Step 1: Obtain the adjacency matrix of a network and calculate the similarity scores of 12 feature indices to construct the feature vectors of links in the network.
Step 2: Estimate the VIM of features by the Gini index in random forest, then remove the feature with the lowest VIM, and calculate the corresponding CVS to obtain the optimal feature subset.
Step 3: Construct the dataset including the features and labels of all links, and divide this dataset into the training set and test set. LR, XGBoost, and GBDT are used as the base models in the first layer. Train the base models separately on the training set by 5-fold cross validation, and then obtain the classification results of each base model.
Step 4: Construct new training data from the prediction results by the base models and utilize these data to train the fuse model in the second layer. Then, the final RF-RFE-SELLP model can be obtained. The specific steps of RF-RFE-SELLP algorithm are presented in Algorithm 1.

Algorithm 1: RF-RFE-SELLP algorithm
Input: Network G = (V, E) Output: Parameters of SELLP model 1: Calculate the adjacency matrix A of network G. 2: for each v i , v j ∈ V do 3: y n ← A(:) 4: x n ← x S 1 v i , v j , . . . , x S 12 v i , v j 5:/*y n is the label, x n is the feature vector, and x S l v i , v j is the score computed by the similarity index S l .*/ 6: end for 7: for r ← 1 to 12 do 8: Compute the VIM of features, and remove the feature with the lowest one. 9: Update feature ranked list, and calculate the corresponding CVS. 10: end for 11: Obtain the optimal feature subset F n by comparing CVS. 12: Construct the set S including the feature subset F n and label y n , and then divide S into the training set S T and test set S P .

Experiments
In this section, we compare the proposed RF-RFE-SELLP with similarity-based methods described in Section 2 and supervised learning-based methods, i.e., multi-layer perceptron neural networks (MLP), RF, and AdaBoost for the purpose of validating the performance of RF-RFE-SELLP.

Experimental Setting
Due to the sparsity of real networks, the training set and test set should be generated with balanced class distribution. The number of existent links is calculated, and the same number of data is randomly selected from the nonexistent links to form a new dataset D. We randomly select a certain fraction (90%) of links from D as the training set D T , and the remaining data as the test set D P .
Three supervised-learning methods, namely AdaBoost, RF and MLP, were employed to identify missing links in the following experiments. The grid search and cross validation are utilized to determine the parameters of the models. All of the link prediction methods were coded in Python 3.8 and all experiments were implemented on a computer with an Intel(R) Xeon(R) Silver 4214 CPU and 32 GB memory. All experimental results are the average results of 10 independent runs.

Evaluation Metric
In this paper, link prediction is regarded as a binary classification problem. The label of a node pair is positive if a link exists between them, and negative otherwise. Four common terms are introduced to formulate the evaluation metrics [38]. True Positive (TP) is the number of node pairs with links that are correctly recognized as positive. False Positive (FP) corresponds to the number of node pairs without links that are incorrectly labeled as positive. False Negative (FN) refers to the number of node pairs with links that are incorrectly as identified negative. True Negative (TN) is the number of node pairs without links that are correctly classified as negative.
Four evaluation metrics are employed in this work to quantify the prediction performance of algorithms, i.e., Accuracy, Precision, F1-score, and AUC. Accuracy is the proportion of all correctly classified node pairs in the set of node pairs. Precision represents the proportion of all node pairs with links in the set of node pairs classified as positive.
Recall is the proportion of node pairs correctly classified as positive in the set of node pairs with links. F1-score is the harmonic average of Precision and Recall.
Area under the curve (AUC) can describe the overall performance of the prediction model, given as: where n 0 and n 1 are the numbers of positive and negative node pairs, respectively, and S 0 = ∑ r i , r i is the rank of i-th positive node pairs in the ranked list. Figure 3 shows the CVS of selected features obtained by RF-RFE on six networks, where the abscissa indicates the number of selected features and the vertical ordinate represents the corresponding cross validation score. It can be observed that the CVS increases and gradually stabilizes as the number of selected features increases for most networks. The higher CVS indicates the higher discriminative ability of features. In this way, the optimal feature subset can be determined with fewer features and a higher CVS. Taking the Vicker network, for instance, the value of CVS increases from 73.73% to a maximum of 80.36% when the number of features increases from one to five. When the number of features is greater than five, there is a slight decrease in the CVS. This indicates that too many features do not necessarily produce higher classification accuracy. The five features are selected for the Vicker network that make significant contributions to classification results.

Performance Comparison
To investigate module-wise effectiveness of the three base models and feature selection on the overall performance, the ensemble model, i.e., SELLP (RF-RFE-SELLP without RF-RFE feature selection), and three base models, i.e., GBDT, XGBoost, and LR, are evaluated, respectively. Tables 2-5 summarize the performance in terms of AUC, Accuracy, Precision, and F1-score of the six networks. With the same features, the ensemble model exhibits better results than the three base models in terms of the four metrics, which highlights the advantage of stacking ensemble learning. Each classification model with the selected features by RF-RFE can achieve better performance results than that with the original features, which proves the effectiveness of RF-RFE feature selection. RF-RFE can eliminate irrelevant and redundant features to represent the raw data in order to improve the generalization ability of the classification model. Moreover, the proposed RF-RFE-SELLP outperforms the comparison methods on all of the six networks. In detail, our model exhibits the highest AUC of 0.9118, Accuracy of 0.8985, Precision of 0.8743, and F1-score of 0.9156 on the Vicker network. Similar results can be observed on the other networks.    To further evaluate the performance of the proposed method, the RF-RFE-SELLP method is compared with supervised learning-based methods, i.e., AdaBoost, RF, and MLP, and similarity-based methods, i.e., CN, MFI, and SRW. Three chosen similarity-based methods belong to global similarity, local similarity, and quasi-local similarity, respectively. Figures 4-7 illustrate the comparison results regarding AUC, Accuracy, Precision, and F1-score on the six networks, where one bar with green color indicates the best result on this network.
It can be seen that RF-RFE-SELLP outperforms all of the comparison methods in terms of the four metrics on four networks, i.e., NS, Email, SciMet, and Router. For the Vicker network, the MFI index performs best in the four metrics, whereas the SRW index achieves the best results in terms of AUC and Precision for the C. elegans network. Although some similarity indices can obtain good performance on some networks, the performance varies greatly on different networks. The Precision of the CN index is 0.9710 on the NS network and 0.5606 on the Vicker network; the F1-score of the MFI index is 0.9189 on the Vicker network, but 0.3477 on C. elegans network; and the Precision of the SRW index is 0.9973 on the C. elegans network and 0.6491 on the Vicker network. Among the three similarity indices, the CN index achieves the worst performance on Vicker, NS, SciMet, and Router networks, and the worst results are obtained by the MFI index on C. elegans and Email networks. Compared with similarity-based methods, supervised learning-based methods can obtain more stable results in most respects on most networks. MLP yields the worst results in terms of Precision and Accuracy on C. elegans, NS, Email, SciMet, and Router networks, and the worst classification model is AdaBoost for the Vicker network.    Furthermore, we performed the statistical test for the proposed RF-RFE-SELLP and six comparative methods, i.e., AdaBoost, RF, MLP, CN, MFI, and SRW, in terms of AUC, Accuracy, Precision, and F1-score on six networks. Table 6 shows the results for Accuracy on the Email network. The F-test and t-test were performed to compare the prediction results. In the case of RF, as a result of the F-test, the obtained p-value (0.45) was greater than the significance level of 0.05, indicating there is no difference in variance between RF-RFE-SELLP and RF. Then, a two-sample assuming equal variance t-test was conducted and the obtained p-value (3.41 × 10 −3 ) was smaller than the significance level of 0.05; that is, there is a statistically significant difference in mean Accuracy between RF-RFE-SELLP and RF. In addition, the 95% confidence interval of mean difference does not include zero. From the results, we can conclude that RF-RFE-SELLP outperforms RF in Accuracy. A similar conclusion can be obtained for AdaBoost, MLP, CN, MFI, and SRW. Moreover, the experimental results in terms of AUC, Precision, and F1-score for these methods also lead to the same conclusion.

Robustness
In order to verify the robustness of the proposed RF-RFE-SELLP to the size of the training set, we compared CN, SRW, RF, MLP, and RF-RFE-SELLP. We performed the experiments with 80% and 90% training sets. Figures 8 and 9 show the variation in performance in terms of Precision and F1-score with training sets of different sizes, where one bar with blue color indicates the result with 80% training set and one with red color denotes the result with 90% training set. When we reduce the training set size from 90% to 80%, the performance degrades for all methods. However, the degradation is much smaller in the three supervised learning-based methods, i.e., RF, MLP, and RF-RFE-SELLP, compared to similarity-based methods, i.e., CN and SRW. Two indices work solely on local structural information of networks so that 10% removal of links affects their similarity scores, and the decline in F1-score is greater than that in Precision. Three supervised learning-based methods, especially RF-RFE-SELLP, select strong structural features from networks, which results in slight performance degradation.

Conclusions
The contributions of this paper include proposing the use of recursive feature elimination in random forest to choose representative structural features for networks, in addition to using the stacking ensemble learning method to predict the missing links. The RF-RFE-SELLP algorithm proposed in this paper is an ensemble learning framework that integrates LR, GBDT, and XGBoost models. Extensive experiments show that the performance of RF-RFE-SELLP in terms of AUC, Accuracy, Precision, and F1-score consistently exceeds that of the individual base model, and the comparison with the no feature selection method demonstrates encouraging performance in terms of these metrics. Moreover, the RF-RFE-SELLP method can achieve better performance than several supervised learning-based and similarity-based methods for most networks.
Author Contributions: Methodology, T.W. and M.J.; investigation, T.W., M.J. and X.W.; writingoriginal draft preparation, M.J.; writing-review and editing, T.W. and X.W.; supervision, X.W. All authors have read and agreed to the published version of the manuscript.