Link Prediction in Complex Networks Using Average Centrality-Based Similarity Score

Link prediction plays a crucial role in identifying future connections within complex networks, facilitating the analysis of network evolution across domains such as biological networks, social networks, and recommender systems. Researchers have proposed various centrality measures, such as degree, clustering coefficient, betweenness, and closeness centralities, to compute similarity scores for predicting links in these networks. These centrality measures leverage both the local and global information of nodes within the network. In this study, we present a novel approach to link prediction that computes similarity scores from average centrality measures based on local and global centralities, namely Similarity based on Average Degree ($SAC_D$), Similarity based on Average Betweenness ($SAC_B$), Similarity based on Average Closeness ($SAC_C$), and Similarity based on Average Clustering Coefficient ($SAC_{CC}$). Our approach determines a centrality score for each node, calculates the average centrality for the entire graph, and derives similarity scores through common neighbors: we apply the centrality scores to the common neighbors of a node pair and count those with above-average centrality. To evaluate our approach, we compared the proposed measures with existing local similarity-based link prediction measures, including common neighbors, the Jaccard coefficient, Adamic–Adar, resource allocation, and preferential attachment, as well as recent measures such as the common neighbor and Centrality-based Parameterized Algorithm (CCPA) and keyword network link prediction (KNLP). We conducted experiments on four real-world datasets. The proposed similarity scores based on average centralities demonstrate significant improvements. We observed an average enhancement of 24% in terms of Area Under the Receiver Operating Characteristic (AUROC) compared to existing local similarity measures, and a 31% improvement over recent measures. Furthermore, we observed average improvements of 49% and 51% in the Area Under Precision-Recall (AUPR) compared to existing and recent measures, respectively. Our comprehensive experiments highlight the superior performance of the proposed method.


Introduction
A graph is used to represent a complex network, where nodes or vertices represent entities, and edges or links represent the interactions or relations between these entities. Complex networks play a major role in natural phenomena, including biological networks, information networks, social networks, and technological networks [1][2][3]. In such networks, nodes are neurons, scientists, individuals, or locations, whereas edges are associations or interactions between the nodes. In recent times, complex networks have gained significant attention in various fields including link prediction [4], centrality measures [5], community detection [6], and influence maximization [7]. New nodes and links are constantly added to complex networks, which makes these networks dynamic. The challenge of predicting links in a network is therefore critical to comprehending the network's evolution. The link prediction (LP) problem was introduced by Liben-Nowell et al. [8]. The LP problem aims to determine the probability of an interaction happening in the future between two nodes when such an interaction does not exist at the present moment.

The link prediction problem is potentially significant across multiple domains. Techniques for link prediction can be utilized to determine the interactions in biological networks that are the most likely to occur, thereby considerably reducing the costs associated with conducting experiments. Link prediction can be used to suggest friend requests on social networks such as Facebook and LinkedIn. On e-commerce platforms like Amazon, users can receive product recommendations by predicting connections between users and items. This is done using a user-item graph that represents user preferences or purchase history. Link prediction in co-authorship networks such as DBLP might point to possible partnerships between researchers [9].

Numerous link prediction algorithms have been proposed recently. These algorithms are classified into three groups: similarity-based measures [10,11], probabilistic measures [12], and dimensionality-based measures [13]. In particular, the similarity-based measure is among the most efficient and fundamental techniques for resolving the link prediction problem. This approach computes a score, $S_{v,u}$, for each pair of nodes (v, u) that indicates how similar the two nodes are to one another. Two nodes are considered similar if they share a large number of features, according to the general definition. The similarity indices are divided into three groups: local, global, and quasi-local [14]. To compute a node's similarity, local similarity indices employ structural information from its neighbors rather than the entire network. A few popular local similarity measures are common neighbors, the Jaccard coefficient, preferential attachment, resource allocation, and Adamic-Adar. These measures are discussed in Section 4.1. In this study, we defined a new similarity measure that belongs to the class of common neighbor measures [15] and we used this as a basis for link prediction. These measures evaluate the probability of a link forming between non-adjacent pairs of nodes in a network based on the quantity of common neighbors they share. The primary drawback of local similarity indices is that they are limited to local data; they use only one-hop and two-hop neighborhoods [16]. However, links can emerge between nodes that lie beyond the two-hop neighborhood. Global similarity indices utilize the entire network's structural information to evaluate link scores. However, they are not parallelizable and their computational complexity limits efficiency in large networks. Quasi-local similarity indices, in contrast, combine the best features of both methods. To retain accuracy, quasi-local indices use more information than local indices and omit unnecessary information [17].

The use of centrality-based link prediction has several advantages over traditional methods. Firstly, centrality measures help analysts evaluate the relative importance of nodes and edges in the network, which is crucial for predicting new connections. This leads to a more detailed understanding of the network's structure and dynamics, enabling more precise and informed predictions. Moreover, different centrality measures capture different aspects of a node: the clustering coefficient measures the extent to which nodes in a network tend to form clusters, closeness centrality is found to better describe endpoint influence, and betweenness centrality best quantifies path connectivity. This comprehensive assessment of a node's influence and importance within the network leads to more accurate predictions of future links.

This paper is structured as follows: Sections 2 and 3 describe problem definitions and recent works on link prediction and centrality measures. Section 4 discusses the related existing measures. Section 5 presents the methodology, including the centrality measures utilized, the calculation of average centrality, and the definition of similarity scores. Section 6 describes the experimental setup and presents the evaluation results. Section 7 provides an in-depth analysis and comparison with existing measures and recent measures. Section 8 concludes the paper and outlines potential directions for future research. Finally, Abbreviations defines the abbreviations used in this paper.

Problem Definition
Definition 1. Link Prediction: The link prediction task involves a complex network denoted as G = (V, E), where V represents the set of vertices and E represents the set of edges. The objective is to generate a list of edges that are not currently present in the network $G[t_0, t_i]$, but are predicted to appear in the future network $G_{t_j}$, where $t_j > t_i > t_0$ [4].
The graph G may include directed edges indicating one-way interactions between nodes, along with weights indicating the strength of these interactions. However, this study focuses solely on undirected and unweighted edges. Extending this research to directed and weighted networks is left for future work.
Definition 2. Centrality Measure: Given a graph G = (V, E), where V and E denote the vertex and edge sets, respectively, a centrality C, defined as $C : V \rightarrow \mathbb{R}$, assigns a real-valued score to each node u, quantifying the significance of u based on its structural position and connections to other nodes in G.
Various centrality measures exist, each capturing different aspects of a node's importance. Common centrality metrics include degree centrality, which measures the number of connections a node has, and betweenness centrality, which quantifies how often a node lies on the shortest paths between other nodes in the graph. Other measures include closeness centrality and the clustering coefficient, each providing unique insights into a node's centrality within the network.
The formation of a future link between two non-adjacent nodes u and v depends largely on the structural similarity of u and v. A key factor influencing this similarity is the presence of shared neighbors between u and v. However, many existing methods for predicting links fail to differentiate between these common neighbors. We believe that not all common neighbors contribute equally to future link formation. In this work, we evaluate the role that the significance of common neighbors plays in link prediction. As the centralities of nodes capture different kinds of significance in the network, the centrality values of common neighbors affect link formation. Therefore, in this work, we examine the effect of various centrality values of nodes (especially common neighbors) on the task of link prediction.

Recent Work
This section addresses the latest research on link prediction using centrality measures. Lu et al. [15] summarized recent works on link prediction algorithms, introduced some real-time applications, and outlined the upcoming challenges of link prediction algorithms. Das et al. [18] presented research works on centrality measures based on social networks. The authors presented real-time applications of centrality measures in traffic, biology, transportation, research, drugs, and security. Bloch et al. [19] discussed centrality measures in networks based on nodal statistics and also discussed some properties which identify path-based centrality measures. Nasiri et al. [20] proposed new link prediction measures, namely weighted common neighbors (WCNs), which depend on common neighbors and different types of centrality measures such as degree, closeness, betweenness, k-core, eigenvector, and PageRank, and which are used to predict the formation of new links in networks. To measure the performance of centrality measures for link prediction, Singh et al. [21] investigated centrality measures and network structures, then identified influential users and predicted future connections. Ahmad et al. [22] proposed a novel measure, called the common neighbor and Centrality-based Parameterized Algorithm (CCPA), which is parameterized and identifies future edges between non-adjacent node pairs using common neighbors and centralities. Another novel measure, the keyword network link prediction algorithm (KNLP), was proposed by Behrouzi et al. [23]; it exploits the nodes' clustering coefficient, eigenvector centrality, and community information, which can be used as another parameter to predict links based on centrality values. S. Kumar et al. [24] proposed link prediction based on the centralities of nodes, which improves the set of features that are utilized to make the predictions. The basic node centralities and various binary machine learning classifiers are used to predict links. T. Gao et al. [25] focused on the degrees of endpoints and neighbors, and proposed a powerful combination of endpoints and neighbors (PCEN) model, which obtains better prediction results than existing models. Kumar et al. [26] proposed a new approach to link prediction based on the level-2 node clustering coefficient. To compute similarity scores between node pairs, the authors defined level-2 common nodes and their clustering coefficient, which extracts the clustering information of level-2 common neighbors from the seed node pairs. Based on the rich-get-richer scenario, Zhang et al. [27] proposed a novel index relying on betweenness centrality to predict the links that will exist in the future. Later, Wu et al. [28] proposed a measure based on local triangle structure information, which can be obtained directly from the clustering coefficient of common neighbors. Yang et al. [29] proposed an algorithm, named common neighbors and distance, which excels in predicting missing links between nodes without common neighbors, outperforming many existing methods for real-world networks without adding any complexity. In this paper, we generalize similarity scores based on average centrality measures, calculated using local and global centrality measures, which give the best prediction accuracy compared to existing link prediction measures.

Related Work
In this section, we discuss basic link prediction and centrality measures for simple, unweighted, and undirected graphs. A network or graph is represented as G = (V, E), where V is the set of nodes and E is the set of edges.

Existing Similarity Measures
A straightforward method, known as the "similarity-based method", computes a similarity score for non-adjacent node pairs, v and u. The similarity scores are sorted; the node pairs with the highest scores indicate the expected linkages between them. Similarity scores are grouped into local, global, and quasi-local groups [4].

• Local Similarity Measures: Local similarity measures focus on examining the immediate neighbors of a node in the network. Some well-known measures include the common neighbor (CN) [15], Jaccard coefficient (JC) [3], preferential attachment (PA) [30], Adamic-Adar (AA) [31], and resource allocation (RA) [32].

Common Neighbor: The likelihood of a link being formed between two nodes, v and u, is higher when they share a significant number of common neighbors.

$$S^{CN}_{v,u} = |\Gamma(v) \cap \Gamma(u)| \qquad (1)$$

In Equation (1), $S^{CN}_{v,u}$ denotes the size of the intersection of the two nodes' neighborhoods; $\Gamma(v)$ is the set of neighbors of node v.

Jaccard Coefficient: This metric is comparable to the common neighbor measure; it normalizes the common neighbor score, as given below.

$$S^{JC}_{v,u} = \frac{|\Gamma(v) \cap \Gamma(u)|}{|\Gamma(v) \cup \Gamma(u)|} \qquad (2)$$

In Equation (2), $S^{JC}_{v,u}$ is the size of the intersection of the two nodes' neighborhoods relative to the total number of neighbors of nodes v and u, where $\Gamma(v)$ is the set of neighbors of node v.

Preferential Attachment: This measure considers the richness of the two nodes rather than the neighbors shared by a non-adjacent node pair. The degrees of nodes v and u are multiplied together.

$$S^{PA}_{v,u} = d(v) \cdot d(u) \qquad (3)$$

PA requires only the degrees of the nodes and does not consider common neighbors. In Equation (3), d(v) is the degree of node v.

Resource Allocation: Consider a non-adjacent node pair, v and u. The amount of resources that node v sends to node u through their shared neighbors determines how similar the two nodes are.

$$S^{RA}_{v,u} = \sum_{r \in \Gamma(v) \cap \Gamma(u)} \frac{1}{d_r} \qquad (4)$$

In Equation (4), $d_r$ is the degree of node r.

Adamic-Adar: Adamic-Adar is a variant of resource allocation. In real-world scenarios, for example, individuals with a larger number of friends tend to allocate less time and fewer resources to a particular friend compared to those with fewer friends. This is defined as follows:

$$S^{AA}_{v,u} = \sum_{r \in \Gamma(v) \cap \Gamma(u)} \frac{1}{\log d_r} \qquad (5)$$

In Equation (5), $d_r$ is the degree of node r.
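The five local measures above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the adjacency map `GRAPH` is a hypothetical toy graph, the helper name `local_scores` is ours, and the formulas are the standard definitions of CN, JC, PA, RA, and AA.

```python
import math

# Hypothetical toy graph as an adjacency map (undirected, unweighted).
GRAPH = {
    1: {3, 4},
    2: {3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}


def local_scores(adj, v, u):
    """Return the five local similarity scores for the node pair (v, u)."""
    common = adj[v] & adj[u]   # shared neighbors of v and u
    union = adj[v] | adj[u]    # all neighbors of v or u
    return {
        "CN": len(common),
        "JC": len(common) / len(union) if union else 0.0,
        "PA": len(adj[v]) * len(adj[u]),
        "RA": sum(1.0 / len(adj[r]) for r in common),
        # A common neighbor of two distinct nodes has degree >= 2, so log(d_r) > 0.
        "AA": sum(1.0 / math.log(len(adj[r])) for r in common),
    }


scores = local_scores(GRAPH, 1, 2)  # nodes 1 and 2 are non-adjacent
```

For the pair (1, 2), the common neighbors are nodes 3 and 4, so CN counts both equally, whereas RA and AA give the lower-degree node 3 a larger weight.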

Recent Measures
In this section, two recent centrality-based similarity scores, CCPA [22] and KNLP [23], are elaborated.

Common Neighbor and Centrality-based Parameterized Algorithm (CCPA):
To recommend the creation of new linkages in complex networks, CCPA uses two essential node characteristics: the number of shared neighbors between node pairs, and their centrality measures. In this case, closeness centrality is taken into account as a parameter for missing link prediction. The term "common neighbor" describes the nodes that are adjacent to both nodes of a pair. The term "centrality" refers to the significance of a node inside the network.

$$S^{CCPA}_{v,u} = \alpha \cdot |\Gamma(v) \cap \Gamma(u)| + (1 - \alpha) \cdot \frac{N}{D_{v,u}} \qquad (6)$$

In Equation (6), the user-defined parameter $\alpha \in [0, 1]$ regulates the relative weight of the common neighbor and centrality terms. The set of neighbors of node v is represented by $\Gamma(v)$, N is the total number of nodes, and $D_{v,u}$ is the shortest path length between v and u.
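A small sketch of the CCPA score follows, assuming Equation (6) has the commonly stated form α·|Γ(v) ∩ Γ(u)| + (1 − α)·N / D_{v,u}. The toy graph, the function names, and the default α = 0.8 are illustrative assumptions, not values taken from this study.

```python
from collections import deque

# Hypothetical toy graph as an adjacency map (undirected, unweighted).
GRAPH = {
    1: {3, 4},
    2: {3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}


def shortest_path_length(adj, source, target):
    """Breadth-first search distance in an unweighted graph (None if unreachable)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return None


def ccpa(adj, v, u, alpha=0.8):
    """Assumed CCPA form: alpha * |common neighbors| + (1 - alpha) * N / D(v, u)."""
    common = len(adj[v] & adj[u])
    d = shortest_path_length(adj, v, u)
    # Unreachable pairs contribute nothing to the distance term (an assumption).
    distance_term = len(adj) / d if d else 0.0
    return alpha * common + (1 - alpha) * distance_term


score_12 = ccpa(GRAPH, 1, 2)  # 0.8 * 2 + 0.2 * (5 / 2)
score_15 = ccpa(GRAPH, 1, 5)  # 0.8 * 1 + 0.2 * (5 / 2)
```

With α close to 1 the score reduces to the common neighbor count, while a smaller α lets the distance term separate pairs that share the same number of neighbors.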
Keyword Network Link Prediction (KNLP): The strong correlation between eigenvector centrality and node degree shows that nodes with the highest eigenvector centrality have more connections. For nodes u and v, KNLP is defined by Equation (7), where $CS_v$ and $CS_u$ are the centrality scores for nodes v and u, and $CC_v$ and $CC_u$ are the clustering coefficient values for nodes v and u, which always range between 0 and 1. Here, $\epsilon$ is used to avoid division by zero.

Centrality Measures
Centrality measures identify the nodes that are most crucial or central in the graph G. These measures help us to understand which nodes are the most influential, well-connected, or central in the graph. Centralities are divided into local measures, global measures, and so on [34].

• Local Centrality: Local centrality involves only the immediate neighborhood of a node. Degree centrality (D) [5] and clustering coefficient (CC) [35] are two popular local centralities used in this paper.

Degree Centrality: The degree centrality of node v is calculated as the fraction of other nodes adjacent to node v out of the possible total. Nodes characterized by a high degree centrality are referred to as hub nodes. In Equation (8), N is the total number of nodes in the graph, and $d_v$ is the degree of node v.

Clustering Coefficient: The clustering coefficient of a node is the ratio of the number of closed triangles within the node's neighborhood to the total number of triangles that could exist in that neighborhood. It is also known as transitivity.

$$CC_v = \frac{2K_v}{d_v(d_v - 1)} \qquad (9)$$

In Equation (9), $d_v$ is the degree of node v, and $K_v$ is the number of triangles connected to node v.
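The two local centralities can be sketched as follows. This is an illustrative sketch on a hypothetical toy graph: the degree centrality is normalized by N − 1, which follows the verbal definition above but is an assumption about Equation (8), and the clustering coefficient uses the standard form 2K_v / (d_v(d_v − 1)). The normalization constant does not change which nodes lie above the graph average.

```python
from itertools import combinations

# Hypothetical toy graph as an adjacency map (undirected, unweighted).
GRAPH = {
    1: {3, 4},
    2: {3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}


def degree_centrality(adj):
    """Degree divided by N - 1 (assumed normalization)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Closed triangles at v divided by the d_v * (d_v - 1) / 2 possible ones."""
    result = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            result[v] = 0.0  # no pair of neighbors, so no triangle is possible
            continue
        # Count neighbor pairs that are themselves adjacent (triangles through v).
        triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        result[v] = 2 * triangles / (d * (d - 1))
    return result


degree = degree_centrality(GRAPH)
clustering = clustering_coefficient(GRAPH)
```

In this toy graph node 4 has the highest degree centrality but a lower clustering coefficient than node 1, which illustrates that the two measures rank nodes differently.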

• Global Centrality: Global centrality involves the whole graph. Closeness centrality (C) [34] and betweenness centrality (B) [36] are two popular global centralities used in this paper.

Closeness Centrality: Closeness centrality is one method of identifying nodes that can efficiently distribute information throughout a network. The closeness centrality of a node v within a graph is the reciprocal of the average shortest path distance from node v to all N − 1 reachable nodes in the graph.

$$C_v = \frac{N - 1}{\sum_{u \neq v} D_{v,u}} \qquad (10)$$

In Equation (10), the shortest path length from v to u is denoted by $D_{v,u}$. In the network, the node that is nearest to every other node is the one with the highest closeness centrality.

Betweenness Centrality: A node's betweenness centrality measures how many shortest paths between other pairs of nodes pass through that node.

$$B_r = \sum_{v \neq r \neq u} \frac{\sigma_{v,u}(r)}{\sigma_{v,u}} \qquad (11)$$

In Equation (11), $\sigma_{v,u}$ represents the total number of shortest paths between nodes v and u, and $\sigma_{v,u}(r)$ denotes the total number of shortest paths between nodes v and u that pass through node r.
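The two global centralities can be sketched with breadth-first search. The code below is an illustrative sketch on a hypothetical toy graph, assuming a connected, unweighted graph: closeness is computed as (N − 1) divided by the sum of distances, and betweenness is the unnormalized sum of shortest-path fractions, accumulated with Brandes' algorithm (a standard technique; the source does not state which algorithm was used).

```python
from collections import deque

# Hypothetical toy graph as an adjacency map (undirected, unweighted, connected).
GRAPH = {
    1: {3, 4},
    2: {3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
}


def bfs_distances(adj, source):
    """Shortest path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness_centrality(adj):
    """(N - 1) / sum of distances; assumes the graph is connected."""
    n = len(adj)
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        result[v] = (n - 1) / total if total else 0.0
    return result


def betweenness_centrality(adj):
    """Unnormalized betweenness via Brandes' dependency accumulation."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                    # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}  # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


closeness = closeness_centrality(GRAPH)
betweenness = betweenness_centrality(GRAPH)
```

Here node 4 lies on every shortest path to node 5 and on half of the shortest paths between nodes 1 and 2, which gives it the largest betweenness in the toy graph.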

Proposed Work
In this section, we outline our proposed approach for predicting links, which relies on the average centrality of the common neighbors. The proposed method computes a prediction score based on the similarity between two nodes, which in turn is based on the centrality scores of the common neighbors between them. We name this method Similarity based on Average Centrality (SAC). SAC initially computes various centrality scores for the common neighbors and considers only the nodes with scores exceeding the network's overall average centrality score. We employ both local and global centrality measures.

Similarity Based on Average Centrality Measures (SAC)
The SAC algorithm can be generalized to use any centrality measure of nodes. Let C(v) denote the centrality score of a node v and AC(G) denote the graph's average centrality value, computed using Equation (12).

$$AC(G) = \frac{1}{N} \sum_{v \in V} C(v) \qquad (12)$$

In Equation (12), C(v) represents the centrality value of the node v, and N denotes the total number of nodes in the whole graph G. The similarity of two vertices using the average centrality of a graph is defined in Equation (13):

$$SAC_C(v, u) = \left|\{x \in \Gamma(v) \cap \Gamma(u) : C(x) > AC(G)\}\right| \qquad (13)$$

In Equation (13), $SAC_C(v, u)$ is the similarity score of the node pair (v, u), obtained by collecting all common neighbors, applying centrality scores to those common neighbors, and then counting the nodes which exceed the average centrality of the graph. x denotes a common neighbor of nodes v and u, and $\Gamma(v)$ and $\Gamma(u)$ are the neighbors of the nodes v and u, respectively. AC(G) is the average centrality of the graph, which is defined in Equation (12). The centrality C can be any of the local or global centralities defined in Table 1.
For instance, if we take the centrality C to be the degree centrality, we can utilize the average degree centrality (AD) as defined in Equation (12). This enables us to calculate the similarity between two vertices based on the average degree centrality of the graph, as specified in row 1 of Table 1. C can be tailored to the betweenness centrality, closeness centrality, or clustering coefficient by using the second, third, and fourth rows of Table 1, respectively, leading to the computation of $SAC_B(v, u)$, $SAC_C(v, u)$, and $SAC_{CC}(v, u)$.
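The definition can be sketched in Python as follows, using degree centrality as the chosen C. This is a minimal sketch under stated assumptions rather than the authors' implementation: the graph is a hypothetical toy graph (not the one in Figure 1), the function names are ours, and the degree centrality is normalized by N − 1. Any other centrality can be passed in as a dictionary mapping nodes to scores.

```python
# Hypothetical toy graph (not the graph of Figure 1), undirected and unweighted.
GRAPH = {
    1: {3, 4, 6},
    2: {3, 4, 6},
    3: {1, 2, 4},
    4: {1, 2, 3, 5},
    5: {4},
    6: {1, 2},
}


def degree_centrality(adj):
    """Degree divided by N - 1 (assumed normalization)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def average_centrality(centrality):
    """Average centrality of the graph: the mean of C(v) over all nodes."""
    return sum(centrality.values()) / len(centrality)


def sac_score(adj, centrality, avg, v, u):
    """Count the common neighbors of v and u whose centrality exceeds the average."""
    return sum(1 for x in adj[v] & adj[u] if centrality[x] > avg)


def sac_all_pairs(adj, centrality):
    """SAC score for every non-adjacent node pair (v, u) with v < u."""
    avg = average_centrality(centrality)
    nodes = sorted(adj)
    return {
        (v, u): sac_score(adj, centrality, avg, v, u)
        for i, v in enumerate(nodes)
        for u in nodes[i + 1:]
        if u not in adj[v]
    }


centrality = degree_centrality(GRAPH)
scores = sac_all_pairs(GRAPH, centrality)
```

For the pair (1, 2) the common neighbors are nodes 3, 4, and 6, but node 6 has below-average degree centrality, so the SAC score is 2 while the plain common neighbor count would be 3. The average is computed once per graph, so scoring a pair only costs a set intersection.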

Table 1
S.No. | Centrality C | Avg C | $SAC_C(v, u)$
1 | Degree (D) | AD(G) | $SAC_D(v, u)$
2 | Betweenness (B) | AB(G) | $SAC_B(v, u)$
3 | Closeness (C) | AC(G) | $SAC_C(v, u)$
4 | Clustering coefficient (CC) | ACC(G) | $SAC_{CC}(v, u)$

Algorithm 1 outlines the process for calculating $SAC_C(v, u)$ for non-adjacent node pairs within the graph. A sample illustration of Algorithm 1 is given using a toy example, depicted in Figure 1, featuring eight nodes and twelve edges. For the graph given in Figure 1, we find similarity scores for $SAC_D(v, u)$, $SAC_B(v, u)$, $SAC_C(v, u)$, $SAC_{CC}(v, u)$, CN, JC, AA, RA, PA, CCPA, and KNLP. In this example, we find similarity scores for a few non-adjacent node pairs; the scores for other non-adjacent node pairs can be found in the same way. We present the computation of similarity scores using the average centrality measure, with the degree centrality as the chosen centrality C.

Algorithm 1: An algorithm for common neighbor-based average centrality
Initially, we calculate the degree centrality for each node in the graph. Node 1 and Node 2, for instance, both exhibit a degree centrality of 0.375, and so forth. Subsequently, we determine the average degree centrality for the graph, denoted as AD(G), as specified in row 1 of Table 1. For our toy graph, AD(G) equals 0.375. Next, we identify the common neighbors of the non-adjacent node pair (1, 2), which are Nodes 4 and 7. Applying the degree centrality scores to these common neighbors, we find that Node 4 has a centrality of 0.625, and Node 7 has a centrality of 0.25. Finally, we count the nodes with centrality scores exceeding the average degree centrality. In this scenario, only the common neighbor Node 4 surpasses the average degree centrality. Consequently, the similarity of the node pair (1, 2), based on average degree centrality, is 1. This process is repeated for several node pairs in the toy graph, and the results are summarized in Table 2.
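The numbers quoted for the toy graph can be checked by hand. The short derivation below assumes that the degree centrality in this example is normalized by N = 8; the text does not state this, but the quoted values (0.375 = 3/8, 0.625 = 5/8, 0.25 = 2/8) are consistent with it. Under that assumption the average degree centrality depends only on the node and edge counts given for Figure 1.

```latex
\begin{align*}
AD(G) &= \frac{1}{N}\sum_{v \in V}\frac{d_v}{N} = \frac{2|E|}{N^2} = \frac{2 \cdot 12}{8^2} = 0.375,\\
D(4) &= \tfrac{5}{8} = 0.625 > 0.375, \qquad D(7) = \tfrac{2}{8} = 0.25 < 0.375,\\
SAC_D(1, 2) &= \left|\{x \in \{4, 7\} : D(x) > AD(G)\}\right| = |\{4\}| = 1.
\end{align*}
```

Because both sides of the comparison are scaled by the same constant, the resulting count would be the same under any other normalization of the degree.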

Time Complexity of Similarity Based on Average Centrality Measures
Given the network G = (V, E), where |V| = n is the number of nodes and |E| = m is the number of edges, the computational cost of evaluating the centrality C for every vertex in the graph G can be expressed as $O(f(n))$. The complexity of finding the similarity score $SAC_C$ in Algorithm 1 is $O(f(n) + n^2)$. When C is the degree, the time complexity of finding $SAC_D$ is $O(n^2)$ [5,37]. When C is the betweenness centrality, closeness centrality, or clustering coefficient, the time complexity of finding $SAC_B$, $SAC_C$, and $SAC_{CC}$, respectively, is $O(nm)$ [34][35][36].

Implementation
The proposed approach's effectiveness was compared to a few popular cutting-edge link prediction measures. The datasets utilized for performance analysis and the measures used for evaluation are described in depth in this section.

Datasets
To evaluate the effectiveness of our proposed method, we conducted simulations on four different datasets. These datasets were taken from different domains and were downloaded from [38]. In bio-celegans, nodes represent genes or proteins, and edges are interactions between the proteins. The dataset comprises a total of 453 nodes and 2025 edges. The web-polblogs dataset represents a network of political blogs, where webpages are the nodes and hyperlinks between webpages are the edges. It consists of 643 nodes and 2280 edges. The CA-Grqc dataset represents a collaboration network, where nodes are authors or research papers and edges represent relationships between authors or citations between research papers. It consists of 5242 nodes and 14,496 edges. The last dataset used was the Facebook-large dataset, which represents a social network, where nodes are users and edges represent friendships between users. It consists of 22,470 nodes and 171,002 edges. Table 3 displays the characteristics of these datasets. Among all of these datasets, bio-celegans is a dense network with a relatively high average clustering coefficient. CA-Grqc has a low average degree, which indicates a lower average number of connections per node. Facebook-large has a high average degree, which indicates a well-connected network, and it has a relatively low diameter, suggesting shorter paths between nodes compared to CA-Grqc.

Our study was carried out using a PC with an 11th generation Intel(R) Core(TM) i7-8700 CPU, which has six cores, twelve logical processors, and a base clock speed of 3.20 GHz. The computer was running Windows 10 Education and had 16 GB of RAM. Python was used to perform our investigation, and Scikit-Learn, Matplotlib, Pandas, Networkx, and Numpy were among the packages used to build the methods.
For each of these datasets, 20% of the links were set aside for testing purposes. Prediction scores were calculated for the remaining 80% of the links. Subsequently, the effectiveness of the predictions was assessed using both the Area Under the ROC curve and the Area Under the Precision-Recall curve. These evaluation metrics will be discussed further in the following section.
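The 80/20 split described above can be sketched as below. The text does not say how the held-out links were drawn, so this sketch assumes a uniform random split with a fixed seed; the function name `split_edges`, the seed, and the example edge list are illustrative.

```python
import random


def split_edges(edges, test_fraction=0.2, seed=0):
    """Shuffle the edge list and hold out a fraction of it for testing."""
    shuffled = sorted(edges)               # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)  # assumed: uniform random selection
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical edge list: a path on eleven nodes, i.e. ten edges.
edges = [(i, i + 1) for i in range(10)]
train_edges, test_edges = split_edges(edges)
```

Similarity scores would then be computed on the graph built from `train_edges` only, and the held-out `test_edges` serve as the positive examples during evaluation. A purely random split can disconnect a sparse graph, which is one reason other splitting strategies are sometimes preferred.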

Evaluation Metrics
In the assessment of similarity-based centralities, standard metrics such as the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (AUPR) are commonly employed. In our study, we employed these metrics to assess the performance of our proposed measures.
AUROC: AUROC, short for Area Under the Receiver Operating Characteristic (ROC) curve, is a widely used metric for assessing the effectiveness of a prediction model. The ROC curve is a visual representation that illustrates the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR (y-axis) vs. FPR (x-axis) is plotted for various threshold values [39]. AUROC gives the area under the ROC curve. It can be interpreted as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance. The AUROC score ranges from 0 to 1, where a higher value signifies superior performance. An AUROC of 1 represents a perfect model, while an AUROC of 0.5 indicates a random model.
AUPR: AUPR, which stands for Area Under the Precision-Recall (PR) curve, is another metric used to evaluate the performance of a prediction model. AUPR is more informative in scenarios where the ROC curve may provide an overly optimistic assessment of a predictor's performance, especially with imbalanced data [40,41]. The PR curve displays the precision on the y-axis and the recall on the x-axis. Precision quantifies the ratio of correct positive predictions to all positive predictions, while recall calculates the ratio of correct positive predictions to all actual positive instances. AUPR is a single quantity that represents the area under the PR curve.
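Both metrics can be illustrated with a small, dependency-free sketch. It is not the evaluation code used in the experiments: AUROC is computed through its pairwise-ranking interpretation, and AUPR is approximated by average precision, which is one common estimator of the area under the PR curve (the text does not state which estimator was used). The labels and scores are made-up values, and the functions assume at least one positive and one negative instance.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / hits


# Hypothetical node pairs: 1 marks a held-out link, 0 a non-link.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.1]

roc_value = auroc(labels, scores)
pr_value = average_precision(labels, scores)
```

In this example five of the six positive-negative pairs are ordered correctly, and the two positives are found at ranks 1 and 3, so both quantities equal 5/6. On heavily imbalanced link prediction data the two metrics usually diverge, which is why both are reported.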

Results
In this section, we present the experiments conducted to evaluate the efficacy of the proposed approach. First, we compared our generalized SAC methods, proposed in Section 5, with existing local similarity measures (CN, JC, AA, RA, and PA) and the latest link prediction measures, CCPA and KNLP, on four datasets. Our measures were evaluated using AUROC and AUPR, as discussed in Section 6.2. We show that the generalized SAC performs well compared to existing link prediction measures. As explained earlier, the link prediction score SAC(v, u) is obtained by collecting all common neighbors of nodes v and u, applying centrality scores to those common neighbors, and then counting the nodes which exceed the average centrality of the graph. In the discussion below, we compare the proposed algorithms with popular existing link prediction measures, but we do not include the latest existing method, KNLP, in the table, as KNLP obtained comparatively small values. Instead, we present the results of KNLP separately in Tables 4 and 5 for comparison.

Average degree (AD), average betweenness (AB), average closeness (AC), and average clustering coefficient (ACC) are considered for the centrality C proposed in Section 5. These proposed measures are compared against the basic link prediction measures CN, JC, AA, RA, and PA, and against CCPA. Figure 2 displays the AUROC findings for the four datasets. While prediction scores are calculated for all non-adjacent node pairs, the evaluation is conducted solely on the top k pairs of nodes. This approach stems from the notion that node pairs with the highest scores are most likely to form connections in the future. We explored different values of k ranging from 1750 to 35,000. The AUROC and AUPR scores for k ranging from 1750 to 35,000 are given in Figures 2 and 3, respectively.
Let us choose a specific Facebook-large from the CA-Grqc dataset with k = 17,500 and the SAC CC measure where the AUROC is 0.918.This score suggests that, for this measure and dataset combination at this particular value of k, the model performed well in differentiating between positive and negative predictions in link prediction tasks.Essentially, the AUROC value of 0.918 indicates that there was a notable proportion of true positives compared to false positives across various threshold settings, resulting in this score.
In the CA-Grqc dataset, the proposed measure SAC CC on average demonstrated superior performance compared to all the baselines, followed by SAC D, whereas the worst performing measure on average was PA. The clustering patterns captured by SAC CC may provide more accurate predictions than the simplistic degree-based approach of preferential attachment, resulting in superior performance for SAC CC. In the Facebook-large dataset, our measure SAC CC exhibited strong performance on average. Here, SAC CC probably accounts for the network's local clustering structure, meaning that it examines not only direct connections between nodes but also relationships among their mutual friends. In contrast, traditional measures primarily concentrate on pairwise node relationships alone. In the web-polblogs dataset, our proposed SAC D and RA were comparable, as both focus on the number of neighbors a node has. In the bio-celegans dataset, SAC D obtained the highest scores for some values of k, while SAC CC performed better for others. Overall, however, SAC CC achieved the highest scores among all measures. In the Facebook-large, web-polblogs, and bio-celegans datasets, JC was the worst performing measure. This is because of the normalization of common neighbors, which tends to decrease the scores on large datasets with increasing numbers of nodes. Specifically, for CA-Grqc, our proposed measure SAC CC consistently outperforms AA by 5%, and CCPA, the latest measure, by 7%. For the Facebook-large dataset, the proposed SAC CC demonstrates a 0.9% enhancement compared to CN, and a significant 5% improvement over CCPA. In web-polblogs, SAC CC exhibits competitive performance, outpacing RA by 0.3% and surpassing CCPA by 12%. Finally, for bio-celegans, SAC CC excels with an 8% improvement over PA and a notable 9% improvement over CCPA.
In Figure 3, we present the AUPR results across four datasets. In the CA-Grqc dataset, our proposed measure SAC CC outperforms all the baselines. In the Facebook-large dataset, SAC CC shows strong performance, while PA emerges as the worst performing measure for both the CA-Grqc and Facebook-large datasets. In the web-polblogs dataset, SAC D performs the best among all measures. In the bio-celegans dataset, SAC CC performs better, whereas JC does not perform well on either web-polblogs or bio-celegans. Specifically, for CA-Grqc, our proposed measure SAC CC consistently outperforms RA by 19%, and CCPA, the latest measure, by 28%. For the Facebook-large dataset, the proposed SAC CC demonstrates a 29% enhancement compared to CN, and a significant 46% improvement over CCPA. In web-polblogs, SAC D outpaces CN by 21% and surpasses CCPA by 13%. Finally, for bio-celegans, SAC CC excels with a 31% improvement over RA and a notable 37% improvement over CCPA.
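As a rough illustration of what an AUPR score summarizes, the sketch below computes average precision, a common estimator of the area under the precision-recall curve. The function name `average_precision` and the scored pairs are hypothetical, and it is an assumption that this estimator matches the AUPR computation used for Figure 3 (a trapezoidal estimate, for instance, would give slightly different values).

```python
def average_precision(scored_pairs):
    """Average precision over a ranked list of (score, is_link) tuples.

    Precision is sampled at the rank of every true link and averaged.
    Ties are broken by input order (Python's sort is stable).
    """
    ranked = sorted(scored_pairs, key=lambda t: t[0], reverse=True)
    total_links = sum(1 for _, is_link in ranked if is_link)
    if total_links == 0:
        raise ValueError("need at least one true link")
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_link) in enumerate(ranked, start=1):
        if is_link:
            hits += 1
            precision_sum += hits / rank  # precision at this recall level
    return precision_sum / total_links


# Hypothetical predictions: (similarity score, whether the link exists).
example = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.2, True)]
print(average_precision(example))  # (1/1 + 2/3 + 3/5) / 3
```

Unlike AUROC, this quantity depends on how many negatives are ranked above each positive relative to the number of positives, which makes it sensitive to the class imbalance that is typical of link prediction.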

Comparing Proposed Measures
In this section, we present a comprehensive comparison of the proposed similarity measures on a variety of real-world datasets, namely web-polblogs, bio-celegans, Facebook-large, and CA-Grqc. The similarity measures we considered were SAC D, SAC B, SAC C, and SAC CC. Our results in Figure 4 show that SAC CC consistently performs better in terms of AUROC on the CA-Grqc, Facebook-large, and bio-celegans networks. SAC D, however, exhibits the best performance on the web-polblogs dataset. On the other hand, SAC B performs the worst on the CA-Grqc, bio-celegans, and Facebook-large datasets, while SAC CC performs poorly on the web-polblogs dataset. The web-polblogs dataset pertains to political blogs, where individuals often share their personal experiences rather than consistently citing external sources. The diversity of content within political blogs may contribute to a lower clustering coefficient, leading to the weak performance of SAC CC compared to SAC D, which emphasizes node connectivity over clustering tendencies. When considering the AUPR in Figure 5, SAC CC consistently demonstrates superior performance across all datasets. Conversely, SAC B consistently performs the worst among all measures across all datasets. These results emphasize the influence of a network's structure and properties on the effectiveness of local similarities based on local and global centralities. Furthermore, it is worth noting that, in various network scenarios, local centralities perform better than global centralities.
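The difference between SAC D and SAC CC can be illustrated with a minimal sketch of the general idea: compute a centrality for every node, take the graph-wide average, and score a node pair by its common neighbors whose centrality is above that average. The exact scoring formula is an assumption here (a plain count with a strict comparison), the function names `local_clustering` and `sac_score` are hypothetical, and the toy graph is not the network of Figure 1.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def sac_score(adj, centrality, u, v):
    """Count common neighbors of u and v with above-average centrality.

    Assumption: a plain count with a strict '>' comparison; the exact
    formula of the proposed measures may differ.
    """
    avg = sum(centrality.values()) / len(centrality)
    return sum(1 for z in adj[u] & adj[v] if centrality[z] > avg)


# Hypothetical undirected toy graph, stored as an adjacency dict of sets.
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("c", "e"), ("e", "f")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

degree = {v: len(adj[v]) for v in adj}
clustering = {v: local_clustering(adj, v) for v in adj}

print(sac_score(adj, degree, "a", "b"))      # degree-based variant: 2
print(sac_score(adj, clustering, "a", "b"))  # clustering-based variant: 1
```

In this toy graph, both common neighbors of the pair (a, b) have above-average degree, but only one has an above-average clustering coefficient, so the two variants rank the same pair differently. This is the kind of structural sensitivity that the comparison above attributes to the choice of centrality.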

Comparing Proposed Measures with Recent Methods like CCPA and KNLP
Tables 4 and 5 report results for a few randomly chosen top-k settings rather than all of them. These tables summarize the AUROC and AUPR results obtained for the proposed algorithms, comparing them with the recent methods CCPA and KNLP on four datasets. It should be noted that we considered the top k node pairs for 20 values of k ranging from 1750 to 35,000, i.e., k ∈ {1750, 3500, ..., 35,000}. As an example from Table 4, consider the Facebook-large dataset with k = 26,250. The AUROC score for the KNLP measure is 0.257 (below the 0.5 expected from random ranking). This implies that the KNLP approach encountered difficulties in accurately discerning between positive and negative predictions of link formation in this dataset and under these parameter conditions. The result suggests a higher prevalence of false positives compared to true positives across the different threshold settings, leading to the AUROC value of 0.257.
In Table 4, for the CA-Grqc dataset and the top 8750 node pairs, our approach SAC CC outperforms the latest measures, CCPA and KNLP, by 6% and 57%, respectively. For the top 26,250 node pairs, SAC CC demonstrated significant improvement over CCPA by 7% and over KNLP by 44%. For the Facebook-large dataset and the top 8750 node pairs, SAC CC excels with an 11% improvement over CCPA and a 37% improvement over KNLP. Furthermore, for the top 26,250 node pairs, SAC CC outperforms CCPA and KNLP by 10% and 44%, respectively. In web-polblogs, SAC D outperforms CCPA by 11% for both the top 8750 and the top 26,250 node pairs, and outperforms KNLP by 14% and 12% for the top 8750 and 26,250 node pairs, respectively. For bio-celegans and the top 8750 node pairs, SAC D demonstrates a 10% enhancement compared to CCPA, and a significant 14% improvement over KNLP. Furthermore, for the top 26,250 node pairs, the SAC CC measure outpaces CCPA by 11% and surpasses KNLP by 12%.
In the context of Table 5, our SAC D approach exhibits superior performance on the CA-Grqc dataset. Specifically, for the top 8750 node pairs, it outperforms the latest measures, CCPA and KNLP, by 37% and 91%, respectively. Additionally, for the top 26,250 node pairs, SAC CC demonstrates a significant improvement over CCPA, showing a 17% advantage, and over KNLP, showcasing a remarkable 70% improvement. Turning to the Facebook-large dataset, SAC CC excels for both the top 8750 and top 26,250 node pairs, surpassing CCPA by 48% and 32%, and outperforming KNLP by 69% and 56%, respectively.
In the case of the web-polblogs dataset, SAC CC outperforms CCPA by 14% and KNLP by 21% for the top 8750 node pairs. Moreover, for the top 26,250 node pairs, SAC D demonstrates a significant improvement over CCPA by 10% and KNLP by 17%. For the bio-celegans dataset, SAC CC showcases a notable 29% enhancement over CCPA and a substantial 36% improvement over KNLP for the top 8750 node pairs. Similarly, for the top 26,250 node pairs, SAC CC outpaces CCPA by 28% and surpasses KNLP by 34%.
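The top-k evaluation grid used in this section can be reproduced with a short sketch. The names `K_VALUES` and `top_k_pairs` and the example scores are hypothetical, and selecting the k highest-scoring node pairs with `heapq.nlargest` is one reasonable way to form the top-k set rather than a description of the actual experimental code.

```python
import heapq

# 20 evenly spaced values of k: 1750, 3500, ..., 35,000.
K_VALUES = list(range(1750, 35_001, 1750))


def top_k_pairs(scores, k):
    """Return the k node pairs with the highest similarity score."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical similarity scores for non-adjacent node pairs.
scores = {("a", "b"): 2.0, ("a", "e"): 1.0, ("b", "f"): 0.0, ("d", "e"): 1.5}

print(len(K_VALUES))           # 20
print(top_k_pairs(scores, 2))  # [('a', 'b'), ('d', 'e')]
```

The values of k quoted in the text (8750, 17,500, and 26,250) are the 5th, 10th, and 15th points of this grid.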

Discussion
The experimental results show that the proposed similarity measures based on average centrality (SAC) outperformed both the existing local similarity-based link prediction measures and the latest measures. In particular, SAC CC outperformed existing link prediction measures such as JC and KNLP in terms of AUROC on all datasets. Moreover, SAC CC consistently achieved higher scores in terms of AUPR, indicating its superior predictive power over the PA, JC, and KNLP measures across the datasets. For example, when the JC measure is applied to the web-polblogs dataset, which represents a network of political blogs, the presence of distinct communities or tightly connected groups within the network may result in fewer shared connections between nodes from different communities. This phenomenon can lead to less accurate predictions. Moreover, in political blog networks, the formation of links may depend more on the relevance of topics than solely on the connectivity of highly linked political blogs, which is what the preferential attachment (PA) model assumes. Consequently, PA may achieve lower predictive accuracy than measures like SAC CC and SAC D, which take into account the presence of closely connected communities in the network.
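The behavior of the baselines discussed above can be seen on a small example using the standard definitions of common neighbors (CN), the Jaccard coefficient (JC), and preferential attachment (PA). The toy graph below, with two small communities joined by a single bridge edge, is hypothetical and is not one of the evaluated datasets.

```python
def common_neighbors(adj, u, v):
    """CN: number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """JC: shared neighbors normalized by the size of the neighborhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def preferential_attachment(adj, u, v):
    """PA: product of the two degrees, ignoring shared neighbors entirely."""
    return len(adj[u]) * len(adj[v])


# Hypothetical graph: communities {a, b, c, d} and {x, y, z, w}, bridge a-x.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"),
         ("x", "y"), ("x", "z"), ("y", "w"), ("z", "w"),
         ("a", "x")]
adj = {}
for p, q in edges:
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)

for pair in [("a", "d"), ("b", "x"), ("d", "w")]:
    print(pair,
          common_neighbors(adj, *pair),
          round(jaccard(adj, *pair), 3),
          preferential_attachment(adj, *pair))
```

In this example, PA assigns the same score (6) to the within-community pair (a, d) and the cross-community pair (b, x), even though the former shares two neighbors and the latter only one. JC assigns zero to the cross-community pair (d, w) because the two nodes share no neighbors, and its normalization shrinks the score of any pair whose combined neighborhood is large.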
When comparing the proposed measures with one another, SAC CC performed exceptionally well on datasets like CA-Grqc, Facebook-large, and bio-celegans, as it effectively captured the patterns and structures specific to these networks. SAC D performed better on the web-polblogs dataset, where the number of neighbors is crucial for link prediction. However, both SAC B and SAC C exhibited lower levels of information flow between proteins and were less closely connected. Consequently, they achieved lower accuracy compared to SAC CC and SAC D. In terms of AUPR, SAC CC consistently outperformed the other measures, while SAC B performed the worst on all datasets. This indicates the effectiveness of SAC CC, as measured by AUPR, in identifying true positive links while minimizing false positives.
These findings emphasize the importance of considering network structure and properties when selecting the most suitable similarity measures for link prediction.

Conclusions
In conclusion, our research addresses the challenging task of predicting missing links based on centralities in complex networks. We propose novel similarity measures that incorporate generalized centrality measures, including degree, betweenness, closeness, and clustering coefficient. Our approach identifies top similarity scores by considering the top k node pairs at 20 values of k. The results, as measured by AUROC and AUPR, demonstrate the superior effectiveness of our approach. Our findings highlight the effectiveness of the proposed measures, particularly in the realm of local similarity based on local centrality measures rather than global centralities.
Future research could extend this work to predicting links using global similarity measures based on global centralities within complex networks. Additionally, we aim to explore similarity-based centralities in hypergraphs as an extension beyond traditional graphs. Furthermore, considering the significance of weighted networks, where edges are assigned different weights to denote the strength or importance of connections between nodes, it would be valuable to explore how the SAC approach performs in such networks, as the weights may influence the centrality measures and, consequently, the similarity scores. Directed networks, where edges have a specific direction, introduce additional complexities in measuring centrality. Our current focus remains on unweighted, undirected graphs, and we intend to explore weighted and directed graphs in future extensions of our work.

Figure 1. An illustration of an undirected toy network with eight nodes and twelve edges.

Table 2. SAC D (Similarity based on Average Degree), SAC B (Similarity based on Average Betweenness), SAC C (Similarity based on Average Closeness), SAC CC (Similarity based on Average Clustering Coefficient), CN (common neighbor), JC (Jaccard coefficient), PA (preferential attachment), RA (resource allocation), AA (Adamic-Adar), CCPA (Common Neighbor and Centrality-based Parameterized Algorithm), and KNLP (keyword network link prediction algorithm) similarity scores for non-adjacent node pairs of the graph shown in Figure 1.

Figure 2. AUROC scores for link prediction using common neighbors based on average centrality for top k node pairs, k ranging from 1750 to 35,000, for four datasets.

Figure 3. AUPR scores for link prediction using common neighbors based on average centrality for the top 35,000 node pairs across four datasets.

Figure 4. AUROC scores for the proposed measures for the top 35,000 node pairs across four datasets with SAC D (Similarity based on Average Degree), SAC B (Similarity based on Average Betweenness), SAC C (Similarity based on Average Closeness), and SAC CC (Similarity based on Average Clustering Coefficient).

Figure 5. AUPR scores for the proposed measures for the top 35,000 node pairs across four datasets with SAC D (Similarity based on Average Degree), SAC B (Similarity based on Average Betweenness), SAC C (Similarity based on Average Closeness), and SAC CC (Similarity based on Average Clustering Coefficient).

Table 3. Basic properties of datasets.

Table 4. Performance of the proposed measures against existing measures in terms of AUROC for the top k predictions, at various thresholds of k.

Table 5. Performance of the proposed measures against existing measures in terms of AUPR for the top k predictions, at various thresholds of k.