Relative Entropy of Distance Distribution Based Similarity Measure of Nodes in Weighted Graph Data

In recent years, many algorithms that measure the similarity of nodes in weighted graph data have been proposed based on the degree of nodes. Although these algorithms achieve good results, they still have limitations; for instance, the strength of nodes is ignored. To address this issue, this paper proposes a similarity measure of nodes based on the relative entropy of their distance distributions. First, the structural weights of nodes are defined by integrating their degree and strength. Next, the distance between any two nodes is calculated from their structural weights with the Euclidean distance formula, which yields the distance distribution of each node. The probability distribution of each node is then constructed by normalizing its distance distribution. The relative entropy is applied to measure the difference between the probability distributions of the top d important nodes and all nodes in graph data. Finally, the similarity of two nodes is measured in terms of this difference. Experimental results demonstrate that the proposed algorithm, which takes the strength of nodes into account in the relative entropy, has clear advantages in the most similar node mining and link prediction.


Introduction
In the real world, numerous complex networks can be abstracted as graph data, such as social networks [1], protein interaction networks [2], traffic networks [3], and e-commerce networks [4]. The graph data can be used to not only portray the weight information of nodes, but also to describe the topological information between nodes. Therefore, the graph data have been given special attention in many fields due to their quantities of valuable information [5,6]. Especially in recent years, many scholars have gradually focused on the similarity measure of nodes in graph data [7,8]. As a necessary tool to determine the similarity between two nodes, the similarity measure plays a vital role in the most similar node mining [9], link prediction [10], cluster analysis [11], and so on [12][13][14].
Up to now, plenty of similarity measure algorithms have been proposed to calculate the similarity of nodes, and these algorithms can be roughly classified into three categories: local similarity indices [15,16], quasi-local similarity indices [17,18], and global similarity indices [19,20]. These three types of indices include some representative algorithms, such as the common neighbor (CN) index [15], Adamic-Adar (AA) index [16], local random walk (LRW) index [17], average commute time (ACT) index [19], and so on [18,20]. The CN index calculates the similarity between two nodes by counting their common neighbors. In order to distinguish the contribution of different common neighbors, the AA index is presented by employing the degree of common neighbors. LRW index and ACT index are constructed based on the random walk of particles between two nodes.
In recent years, some similarity measures have also been studied from the perspective of information theory. For example, Tan et al. [21] applied the mutual information to graph data and designed the mutual information (MI) index to calculate the similarity of nodes. Inspired by the MI index, Zhu et al. [22] applied the mutual information to weighted graph data and proposed a weighted mutual information model to explore the influence of strong and weak tie effects. However, if these mutual information-based indices are used to calculate the similarity of nodes, then the nodes of a larger degree will become general similar nodes, i.e., nodes to which many other nodes are most similar [23]. With this problem in mind, Zhang et al. [24] presented a local relative entropy (LRE) index to calculate the similarity of nodes. In the definition of the LRE index, the relative entropy is used to measure the difference between the degree distributions of two nodes, from which their similarity is obtained. Moreover, Zheng et al. [25] utilized the relative entropy to measure the difference between the transition probability distributions of two nodes and then constructed the RE-model to calculate the similarity of nodes.
In the measurement results of the mutual information-based indices, the nodes of a larger degree easily become general similar nodes. The relative entropy-based indices avoid the situation in which many nodes are similar to the nodes of a larger degree. However, the LRE index utilizes the degree distribution of each node, and the RE-model uses the transition probability distributions between two nodes, so only the degree of nodes is utilized in the definition of these two indices. Because the LRE index and RE-model do not make full use of the strength of nodes in weighted graph data, their performance is limited. In particular, the relative entropy-based similarity measures perform poorly when they are applied to link prediction [26].
Generally speaking, the strength of a node represents its ability to collect information, and the degree of a node represents its ability to diffuse information. Thus, if a similarity measure algorithm is constructed by properly integrating the degree and strength of nodes, then its performance may be further enhanced [27]. To our knowledge, however, few studies have examined how to improve the performance of the similarity measure by using both the degree and strength of nodes. The relative entropy-based similarity measure has also received little attention for link prediction in weighted graph data.
Based on the above analysis and discussion, this paper proposes a similarity measure of nodes based on the relative entropy of their distance distributions. The distance distribution of each node is obtained by calculating the Euclidean distance between the structural weights of two nodes, where the structural weight of each node comprehensively considers its degree and strength in weighted graph data. After that, the probability distribution of each node is constructed by normalizing the elements of its distance distribution. Finally, the relative entropy is applied to measure the difference between the probability distributions of the top d important nodes and all nodes in graph data, which allows the similarity of nodes to be calculated at a lower time cost. We numerically simulated the proposed algorithm and verified its effectiveness and efficiency in the most similar node mining and link prediction. The main contributions of this paper are as follows.

• The structural weights of nodes are defined by integrating their degree and strength, and then the structural weights-based distance between two nodes can be calculated.
• The difference between the probability distributions of the top d important nodes and all nodes in the graph data is measured by using the relative entropy, which ensures that the similarity of nodes can be calculated at a lower time cost.
• The proposed similarity measure algorithm has a great advantage in mining the most similar nodes and performing the link prediction, compared with the majority of benchmark algorithms.
The remainder of this paper is organized as follows. Some basic knowledge of weighted graph data and the similarity measure is reviewed in Section 2. The relative entropy-based similarity measure algorithm is defined in Section 3. Some experimental materials are introduced in Section 4. Experimental results are demonstrated in Section 5. The conclusion of this paper is drawn in Section 6.

Preliminaries
In this section, some necessary knowledge is introduced, including the concepts of weighted graph data, the relationship between the node similarity and link prediction, and the definition of relative entropy.

Weighted Graph Data
Formally, the so-called weighted graph data can be expressed as a 3-tuple $G = (V, E, W)$, where $V = \{v_x \mid x = 1, 2, \cdots, n\}$ represents the set of nodes, $E = \{e_{xy} \mid x, y = 1, 2, \cdots, n\}$ indicates the set of edges, and $W = \{w_{xy} \mid x, y = 1, 2, \cdots, n\}$ denotes the set of weights. It is not difficult to find that the weighted graph data will degenerate to the unweighted form $G = (V, E)$ if $w_{xy} = 1$ for $x, y = 1, 2, \cdots, n$. Moreover, $e_{xy}$ represents the edge that connects nodes $v_x$ and $v_y$, and $w(e_{xy}) = w_{xy}$ denotes the weight of the edge $e_{xy}$.
Considering two nodes $v_x, v_y \in V$, they are adjacent to each other if they are the two end nodes of an edge $e_{xy} \in E$. Let $a_{xy} = 1$ and $a_{xy} = 0$ respectively denote that an edge between $v_x$ and $v_y$ exists and does not exist. Then, the adjacency matrix of the graph $G = (V, E)$ is defined by $A = \{a_{xy}\}_{n \times n}$. For a weighted graph $G = (V, E, W)$, its weighted adjacency matrix can be expressed as $A_w = \{w_{xy}\}_{n \times n}$.
In order to facilitate the understanding of the content of this article, some relevant notations are summarized in Table 1.

Notations | Descriptions
$k_x$ | The degree of node $v_x$, i.e., $k_x = \sum_{y=1}^{n} a_{xy}$ is the number of edges connected to node $v_x$
$s_x$ | The strength of node $v_x$, i.e., $s_x = \sum_{y=1}^{n} w_{xy}$ is the sum of weights on all edges connected to node $v_x$
$s_{xy}$ | The similarity of node $v_x$ and node $v_y$
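As a small illustration of the degree and strength in Table 1, the following sketch computes both quantities from a weighted adjacency matrix. The function name `degree_and_strength` and the 4-node matrix `W_toy` are hypothetical and not taken from the paper; the sketch assumes an undirected graph with positive weights and no self-connections.

```python
import numpy as np

def degree_and_strength(W):
    """Return (k, s) for a symmetric weighted adjacency matrix W.

    Assumes every existing edge has a strictly positive weight, so that
    a_xy = 1 exactly where w_xy > 0.
    """
    W = np.asarray(W, dtype=float)
    A = (W > 0).astype(int)   # adjacency matrix: a_xy = 1 if an edge exists
    k = A.sum(axis=1)         # degree   k_x = sum_y a_xy
    s = W.sum(axis=1)         # strength s_x = sum_y w_xy
    return k, s

# Hypothetical toy graph with 4 nodes (illustrative data only).
W_toy = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 3.0, 0.0],
    [1.0, 3.0, 0.0, 4.0],
    [0.0, 0.0, 4.0, 0.0],
])
k_toy, s_toy = degree_and_strength(W_toy)
```

In this toy graph, node 2 has the largest degree (three edges) and also the largest strength, while node 3 has a single but heavy edge, which is the kind of difference between degree and strength that the proposed measure is meant to capture.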

Relationship between the Node Similarity and Link Prediction
In the real world, many graph data are incomplete or inaccurate. These graph data are collected from a wide range of information systems and can only reflect a part of the real information. Thus, link prediction technology becomes a significant and useful tool during the analysis of graph data, as its task is to detect and mine the missing information in graph data. Generally speaking, the link prediction technology aims at quantifying the existence likelihood of a candidate edge between two nodes. In related research, this kind of existence likelihood can be measured by using the similarity between two nodes [28][29][30]. Therefore, the similarity measure algorithm of nodes is an efficient and effective method for performing the link prediction.
In the process of performing link prediction, the observed edge set $E$ is randomly divided into the training set $E_t$ and the probe set $E_p$, where $E_t \cup E_p = E$ and $E_t \cap E_p = \emptyset$. The edges in $E_t$ are regarded as known information, which is used to calculate the similarity between two nodes. The edges in $E_p$ are used to test the performance of similarity measure algorithms by comparing their similarity scores with those of the edges in $U - E$, where $U$ denotes the universal set of all possible edges and $U - E$ is the set of unknown edges. Thus, the edges in $E_p$ and the edges in $U - E$ make up the set of all edges that are missing from the training set, i.e., $E_p \cup (U - E)$.
From the above analysis, the relationship between the node similarity and link prediction can be briefly described as follows. Each edge in $E_p \cup (U - E)$ is assigned a score by using a similarity measure. After that, all edges in $E_p \cup (U - E)$ are sorted in decreasing order according to their scores. Finally, the edges with the highest scores are regarded as the most likely to exist.

Relative Entropy
In information theory, the relative entropy is also called the Kullback-Leibler divergence, which measures the difference between two probability distributions [31,32]. Considering two different probability distributions $P = (p_1, \cdots, p_r)$ and $Q = (q_1, \cdots, q_r)$, their relative entropy can be described in the following form:
$$D_{KL}(P \| Q) = \sum_{i=1}^{r} p_i \ln \frac{p_i}{q_i}, \tag{1}$$
where $r$ is the number of components in the two probability distributions $P$ and $Q$. The greater the relative entropy between $P$ and $Q$, the greater the difference between them, and vice versa. Note that the relative entropy is asymmetrical, that is, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general. Therefore, this paper redefines the Kullback-Leibler divergence for the calculation of node similarity. In order to make the relative entropy symmetric, the redefined formula is written as
$$RD(P \| Q) = D_{KL}(P \| Q) + D_{KL}(Q \| P). \tag{2}$$
From the above analysis, $RD(P \| Q) = RD(Q \| P)$. Therefore, $RD(P \| Q)$ is symmetrical. According to the properties of the Kullback-Leibler divergence, $RD(P \| Q)$ is also non-negative and equals zero only when $P = Q$, so it can be used as a distance-like measure of the difference between two distributions.
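The asymmetry of the Kullback-Leibler divergence and the symmetry of the redefined form can be checked numerically. The sketch below assumes that the redefined formula is the sum of the two directed divergences and uses the natural logarithm; the function names and the two example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Directed relative entropy D_KL(P || Q) with the natural logarithm.

    Terms with p_i = 0 contribute nothing; q_i is assumed to be strictly
    positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_re(p, q):
    """Symmetrized relative entropy, assumed to be D_KL(P||Q) + D_KL(Q||P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical two-component distributions.
P = [0.5, 0.5]
Q = [0.9, 0.1]

forward = kl_divergence(P, Q)    # differs from the reverse direction
backward = kl_divergence(Q, P)
sym_pq = symmetric_re(P, Q)      # equal to symmetric_re(Q, P)
sym_qp = symmetric_re(Q, P)
```

The two directed values differ for this pair of distributions, whereas the symmetrized value does not depend on the order of the arguments and vanishes when both arguments are the same distribution.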

Method
Aiming at the problem of the node similarity in weighted graph data, a similarity measure algorithm is proposed in this section. The proposed algorithm employs relative entropy to measure the difference between the probability distributions of two nodes. The probability distribution of each node is obtained in terms of its distance distribution. The distance distribution of each node is defined by calculating the Euclidean distance between the structural weights of two nodes. The structural weights of nodes can be defined by utilizing their degree and strength information.

Structural Weight Set of Nodes
In weighted graph data, the connections between nodes are varied, and the degree and strength of different nodes vary greatly. Generally, the strength of a node represents its ability to collect information, while the degree of a node denotes its ability to diffuse information. Bearing in mind this specificity of nodes, the structural weight set of nodes is given in this paper. Before giving the structural weight set of nodes, we define three kinds of structural weights of nodes.
Definition 1 (Unit weight of nodes). The unit weight of a node is defined as the average weight of all the edges connected to this node. Considering a node $v_x$ in $G = \langle V, E, W \rangle$, its unit weight can be defined by
$$uw(v_x) = \frac{s_x}{k_x}, \tag{3}$$
where $s_x$ and $k_x$ are described in Table 1, and $uw(v_x)$ represents the unit weight of $v_x$. Clearly, $uw(v_x)$ simply combines the abilities of $v_x$ to collect and diffuse information.
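Definition 1 describes the unit weight as the average weight of the edges attached to a node, which corresponds to dividing the strength by the degree. A minimal sketch follows; the function name `unit_weight` and the input values are hypothetical, and assigning 0 to isolated nodes is an assumption that the text does not address.

```python
def unit_weight(strength, degree):
    """Average edge weight per node: strength divided by degree.

    Isolated nodes (degree 0) are given a unit weight of 0.0, which is an
    assumption made only to keep the sketch well defined.
    """
    return [s / k if k > 0 else 0.0 for s, k in zip(strength, degree)]

# Hypothetical strengths and degrees of four nodes.
uw_toy = unit_weight([3.0, 5.0, 8.0, 4.0], [2, 2, 3, 1])
```

Here the last node has the largest unit weight although its degree is the smallest, because its single edge carries a large weight.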
Definition 2 (Degree weight of nodes). The degree weight of a node fully takes into account its ability to diffuse information while suppressing its ability to collect information. Considering a node $v_x$ in $G = \langle V, E, W \rangle$, its degree weight can be defined by where $dw(v_x)$ represents the degree weight of $v_x$. It is not difficult to find that $dw(v_x)$ considers the information-diffusing ability of all nodes in $G = \langle V, E, W \rangle$.
Definition 3 (Strength weight of nodes). The strength weight of a node fully considers its ability to collect information while suppressing its ability to diffuse information. Given a node $v_x$ in $G = \langle V, E, W \rangle$, its strength weight can be defined by where $sw(v_x)$ denotes the strength weight of $v_x$. It can be found that $sw(v_x)$ takes into account the information-collecting ability of all nodes in $G = \langle V, E, W \rangle$.
Definition 4 (Structural weight set of nodes). Considering the specificity of the degree and strength of nodes, we define the structural weight set of nodes as
$$SA(v_x) = \{uw(v_x), dw(v_x), sw(v_x)\}, \tag{6}$$
where $SA(v_x)$ indicates the structural weight set of $v_x$, which consists of the unit weight, degree weight, and strength weight of $v_x$.

Distance Distribution of Nodes
As a frequently used distance measure in mathematics, the Euclidean distance is also widely employed in research on similarity measures. It is simple to compute and applies directly to low-dimensional feature vectors. Therefore, in this paper the Euclidean distance is applied to calculate the distance between the structural weights of two nodes, from which the distance distribution of each node is obtained.
Since three kinds of structural weights of nodes are defined in this paper, the distance between two nodes can be calculated by using the formula of Euclidean distance in three-dimensional space. In the following, we define the distance between two nodes.
Definition 5 (Distance between two nodes). Considering two nodes $v_x, v_y$ in $G = \langle V, E, W \rangle$, the distance between them can be calculated from their structural weight sets by
$$d(v_x, v_y) = \sqrt{\left(uw(v_x) - uw(v_y)\right)^2 + \left(dw(v_x) - dw(v_y)\right)^2 + \left(sw(v_x) - sw(v_y)\right)^2}, \tag{7}$$
where $d(v_x, v_y)$ expresses the Euclidean distance between $v_x$ and $v_y$. It is not difficult to find that the distance between any two nodes in $G = \langle V, E, W \rangle$ can be calculated, whether or not they are connected.
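The distance of Definition 5 can be sketched as a pairwise Euclidean distance between three-component structural weight vectors. In the sketch below, each row of the hypothetical array `SA_toy` holds the unit weight, degree weight, and strength weight of one node; the values are made up for illustration, since the sketch does not compute the degree weight or strength weight itself.

```python
import numpy as np

def pairwise_structural_distance(SA):
    """Euclidean distance between all pairs of structural weight vectors.

    SA is an (n, 3) array whose row x is [uw, dw, sw] of node v_x.
    Returns an (n, n) symmetric matrix D with D[x, y] = d(v_x, v_y).
    """
    SA = np.asarray(SA, dtype=float)
    diff = SA[:, None, :] - SA[None, :, :]   # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical structural weights of three nodes (illustrative values).
SA_toy = np.array([
    [1.5, 0.2, 0.1],
    [2.5, 0.2, 0.3],
    [1.5, 0.2, 0.1],
])
D_toy = pairwise_structural_distance(SA_toy)
```

Because the distance depends only on the structural weights, two nodes with identical weights (rows 0 and 2 here) have distance zero even if they are not connected in the graph.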
Based on the above discussion, the distance distribution of each node can be obtained. In the following, we define the distance distribution of nodes.
Definition 6 (Distance distribution of nodes). The distance distribution of each node is defined in terms of the distances between this node and the other nodes. Given a node $v_x$ in $G = \langle V, E, W \rangle$, its distance distribution can be defined as
$$DD(v_x) = [d(v_x, v_1), d(v_x, v_2), \cdots, d(v_x, v_n)], \tag{8}$$
where $DD(v_x)$ denotes the distance distribution of $v_x$. One can observe that the distance distribution of each node has $n$ components.
To use the relative entropy to measure the node similarity, the probability distribution of each node needs to be obtained. Thus, the distance between two nodes should be normalized within the range of [0, 1]. In view of the similarity between two nodes being inversely proportional to the distance between them, the probability of an edge existing between two nodes can be calculated by using a constant 1 to subtract the normalized distance between them. On these bases, we define the probability distribution of nodes.
Definition 7 (Probability distribution of nodes). The probability distribution of each node is constructed by utilizing the normalized distances between it and all other nodes. Given a node $v_x$ in $G = \langle V, E, W \rangle$, its probability distribution can be expressed as
$$PD(v_x) = [p(v_x, v_1), p(v_x, v_2), \cdots, p(v_x, v_n)], \tag{9}$$
where $PD(v_x)$ denotes the probability distribution of $v_x$, and $p(v_x, v_y)$ is the probability that an edge exists between $v_x$ and $v_y$, $y = 1, 2, \cdots, n$.
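The construction of the probabilities (one minus the normalized distance) can be sketched as follows. The text only states that distances are normalized to the range [0, 1]; dividing by the largest distance in the whole graph is an assumption made here, and the function name and the toy matrix are hypothetical. The rows are not rescaled to sum to one, because the text does not mention such a step.

```python
import numpy as np

def probability_distribution(D):
    """Turn a distance matrix into p(v_x, v_y) = 1 - normalized distance.

    Assumption: distances are normalized by the global maximum distance.
    Row x of the result plays the role of PD(v_x).
    """
    D = np.asarray(D, dtype=float)
    d_max = D.max()
    if d_max == 0:            # all nodes coincide: every probability is 1
        return np.ones_like(D)
    return 1.0 - D / d_max

# Hypothetical distance matrix of three nodes.
D_example = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
])
PD_example = probability_distribution(D_example)
```

With this normalization, the farthest pair of nodes receives probability 0 and every node receives probability 1 with respect to itself.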

Design of Algorithm
From the above discussion, the structural weight set of each node is given by its unit weight, degree weight, and strength weight. Then, the distance distribution of nodes is obtained by calculating the Euclidean distance between the structural weights of any two nodes. Furthermore, considering that the similarity between any two nodes is inversely related to the distance between them, the probability distribution of nodes is constructed by subtracting the normalized distance between them from the constant 1, which better describes the similarity of the two nodes.
Equation (9) includes the probability between $v_x$ and all nodes in graph data. Thus, applying the relative entropy to measure the difference between the probability distributions of two nodes may consume a large amount of resources, because the dimension of the probability distribution of each node increases with the size of the graph data. In addition, some non-significant nodes also affect the accuracy of the algorithm. In practice, using only the nodes that have a greater impact on the similarity calculation may achieve good results at a low computational cost [24]. Considering the specificity of the degree and strength of nodes, this paper ranks all nodes by their unit weight and constructs the similarity measure from the probabilities between any node in the graph data and the top d important nodes.
First, the top d important nodes are found by arranging all nodes in descending order according to their unit weights. Next, the set $S = \{v_1, v_2, \cdots, v_d\}$ of the top d important nodes is constructed. Then, the probability between $v_x$ and each node in $S$ is obtained from the elements of $PD(v_x)$, which gives a d-dimensional vector for $v_x$. From the above analysis, the difference between the probability distributions of the top d important nodes and any node in graph data can be measured by utilizing the relative entropy. In the following, the relative entropy value between the probability distributions of the top d important nodes and any node in graph data is calculated.
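The selection of the top d important nodes and the reduction of each probability distribution to d components can be sketched as below. The function name `reduce_to_top_d` and the toy inputs are hypothetical; breaking ties between equal unit weights by node index is an assumption.

```python
import numpy as np

def reduce_to_top_d(PD, uw, d):
    """Keep only the probabilities towards the d nodes of largest unit weight.

    PD is the (n, n) matrix whose row x is PD(v_x); uw holds the unit
    weights. Returns the indices of the top d nodes and the (n, d) matrix
    of reduced distributions. Ties are broken by node index (assumption).
    """
    uw = np.asarray(uw, dtype=float)
    order = np.argsort(-uw, kind="stable")   # descending unit weight
    top = order[:d]
    return top, np.asarray(PD, dtype=float)[:, top]

# Hypothetical inputs: 4 nodes, keep the top d = 2.
PD_full = np.arange(16, dtype=float).reshape(4, 4)
top_nodes, PD_reduced = reduce_to_top_d(PD_full, [1.5, 2.5, 8.0 / 3.0, 4.0], d=2)
```

After this step each node is described by d numbers instead of n, so the cost of comparing two nodes no longer grows with the size of the graph.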
Definition 8 (Relative entropy value between the probability distributions of two nodes). Considering a pair of nodes $(v_x, v_y)$ in $G = \langle V, E, W \rangle$, the relative entropy value between their probability distributions can be defined as
$$RE(v_x, v_y) = \sum_{k=1}^{d} p(v_x, v_k) \ln \frac{p(v_x, v_k)}{p(v_y, v_k)}, \tag{10}$$
where $RE(v_x, v_y)$ denotes the relative entropy value between the probability distributions of $v_x$ and $v_y$, and $v_k \in S$ ranges over the top d important nodes. In this paper, the ln
Definition 9 (Difference between two nodes). The difference between two nodes is calculated by employing the relative entropy value between them. Considering two nodes $v_x$ and $v_y$ in $G = \langle V, E, W \rangle$, the difference between them can be expressed by
$$d_{xy} = RE(v_x, v_y) + RE(v_y, v_x), \tag{11}$$
where $d_{xy}$ denotes the difference between $v_x$ and $v_y$. Clearly, $d_{xy}$ is symmetrical in terms of Equation (2), and it can therefore be used to calculate the similarity of nodes.
Generally speaking, the greater the difference between the probability distributions of two events, the smaller their similarity. Thus, the relative entropy of distance distribution based similarity measure of nodes is proposed to transform the difference between two nodes into their similarity.
Definition 10 (Relative entropy of distance distribution based similarity measure of nodes, REDD). The similarity of two nodes is obtained from their difference by the REDD index. Considering two nodes $v_x$ and $v_y$ in $G = \langle V, E, W \rangle$, the REDD index can be expressed as
$$s^{REDD}_{xy} = 1 - \frac{d_{xy}}{d_{\max}}, \tag{12}$$
where $d_{\max}$ is the maximum of the difference between any two nodes in graph data, and $s^{REDD}_{xy}$ is the similarity of $v_x$ and $v_y$ calculated by the REDD index. REDD is the abbreviation of the algorithm we proposed, and its corresponding pseudo-code is outlined in Algorithm 1.
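The last steps, from the reduced probability distributions to a similarity matrix, can be sketched as follows. This is a sketch under stated assumptions rather than the authors' implementation: the difference between two nodes is taken as the sum of the two directed relative entropies, the similarity as one minus the difference divided by the largest difference, and probabilities are clipped at a small hypothetical constant `eps` so that the logarithm stays finite. The function name and the toy input are also hypothetical.

```python
import numpy as np

def redd_similarity(P_red, eps=1e-12):
    """Similarity matrix from (n, d) reduced probability distributions.

    RE[x, y] = sum_k p_xk * ln(p_xk / p_yk); D = RE + RE^T; S = 1 - D / max(D).
    Clipping at eps is an assumption used to avoid ln(0).
    """
    P = np.clip(np.asarray(P_red, dtype=float), eps, None)
    logP = np.log(P)
    self_term = (P * logP).sum(axis=1)   # sum_k p_xk ln p_xk
    cross = P @ logP.T                   # cross[x, y] = sum_k p_xk ln p_yk
    RE = self_term[:, None] - cross      # directed relative entropy values
    D = RE + RE.T                        # symmetric difference d_xy
    d_max = D.max()
    if d_max <= 0:                       # all nodes identical
        return np.ones_like(D)
    return 1.0 - D / d_max

# Hypothetical reduced distributions of three nodes with d = 2.
P_toy = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
    [1.0, 0.5],
])
S_toy = redd_similarity(P_toy)
```

In this toy input, nodes 0 and 2 have identical reduced distributions and therefore similarity 1, while the pair (0, 1) attains the largest difference and therefore similarity 0.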

Algorithm description:
The input is the weighted graph data $G = \langle V, E, W \rangle$ and dimension d; the output is the similarity matrix $S^{REDD}_{n \times n}$. The construction procedure of the REDD index is operated in the following three phases: initialization phase (line 2), computation phase (lines 4-12), and update phase (line 14). The initialization phase refers to assigning certain storage to the matrix $S^{REDD}_{n \times n}$. The computation phase iteratively calculates the similarity of two nodes by using the previous definitions. The purpose of the update phase is to store the similarity of all node pairs in the matrix $S^{REDD}_{n \times n}$.

Algorithm 1
The construction procedure of REDD index. Input: Weighted graph data G = <V, E, W> and dimension d.

Materials
In this section, we introduce some experimental materials, such as the experiment datasets, benchmark algorithms, and evaluation metrics. Moreover, the experimental environment used in this paper is listed in Table 2.

Datasets Description
In this article, we consider 12 real-world weighted graph data that are freely downloaded from some public academic websites. These weighted graph data include the transportation network, citation network, ecology network, and biological network. The detailed information on the related networks is given below.
The topological statistical characteristics of these real-world weighted graph data are listed in Table 3, where each row from left to right is the network name, number of nodes $n$, number of edges $m$, average degree $\langle k \rangle$, average strength $\langle s \rangle$, average clustering coefficient $\langle c \rangle$, average weighted clustering coefficient $\langle c_w \rangle$, and graph density $\rho$, respectively. Note that the self-connections and multiple edges in these weighted graph data are removed before calculating their topological statistical characteristics.

Benchmark Algorithms
Here, we introduce several similarity measures that are usually used for experimental comparison in the most similar node mining and link prediction. The basic motivation and definition of these similarity measures are given below.
The CN index directly regards the number of all common neighbors between two nodes as their similarity, which is
$$s^{CN}_{xy} = |N(v_x) \cap N(v_y)|,$$
where $N(v_x) \cap N(v_y)$ represents the common neighbor set of $v_x$ and $v_y$, $N(v_x)$ is the set of neighbors of $v_x$, and $|\cdot|$ denotes the cardinality of a set.
The WCN index is the weighted version of the CN index, which is defined as
$$s^{WCN}_{xy} = \sum_{v_z \in N(v_x) \cap N(v_y)} (w_{xz} + w_{zy}),$$
where $w_{xz} = w_{zx}$ expresses the weight of the edge connecting $v_x$ and $v_z$.
The AA index is the extended version of the CN index, whose advantage is to refine the simple count of common neighbors. The AA index gives less weight to the common neighbors with a greater degree, and it is defined as
$$s^{AA}_{xy} = \sum_{v_z \in N(v_x) \cap N(v_y)} \frac{1}{\log k_z}.$$
The WAA index is the weighted version of the AA index, which is defined as
$$s^{WAA}_{xy} = \sum_{v_z \in N(v_x) \cap N(v_y)} \frac{w_{xz} + w_{zy}}{\log(1 + s_z)},$$
where $s_z$ may be smaller than 1, so we use $\log(1 + s_z)$ in the above equation to avoid a negative score.
The LRW index is a similarity measure based on the local random walk of particles between two nodes, and its calculation expression is
$$s^{LRW}_{xy}(t) = \frac{k_x}{2|E|} \pi_{xy}(t) + \frac{k_y}{2|E|} \pi_{yx}(t),$$
where $|E|$ is the number of the edges in graph data, and $\pi_{xy}(t)$ is obtained according to the density vector evolution equation:
$$\vec{\pi}_x(t + 1) = P^{T} \vec{\pi}_x(t).$$
In the density vector evolution equation, $P$ is the transition probability matrix, $T$ denotes the matrix transpose, and $t > 0$ is the number of steps the particle takes to walk between two nodes. In this paper, $t$ is specified as 3.
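The four local indices can be computed for all node pairs at once with matrix products. The sketch below uses the commonly cited forms of these indices (a count of common neighbors for CN, a sum of the two edge weights towards each common neighbor for WCN, and the same terms divided by log k_z or log(1 + s_z) for AA and WAA); the natural logarithm, the function name `local_indices`, and the toy graph are assumptions made for illustration.

```python
import numpy as np

def local_indices(W):
    """CN, WCN, AA and WAA scores for all node pairs of a weighted graph.

    W is a symmetric weighted adjacency matrix with positive weights and
    no self-connections. Diagonal entries of the results are not meaningful.
    """
    W = np.asarray(W, dtype=float)
    A = (W > 0).astype(float)
    k = A.sum(axis=1)                      # degrees
    s = W.sum(axis=1)                      # strengths

    # 1 / log(k_z) and 1 / log(1 + s_z), set to 0 where undefined.
    inv_log_k = np.zeros_like(k)
    inv_log_k[k > 1] = 1.0 / np.log(k[k > 1])
    inv_log_s = np.zeros_like(s)
    inv_log_s[s > 0] = 1.0 / np.log(1.0 + s[s > 0])

    cn = A @ A.T                           # number of common neighbors
    wcn = W @ A.T + A @ W.T                # sum of w_xz + w_zy over common z
    aa = (A * inv_log_k) @ A.T             # sum of 1 / log k_z
    waa = (W * inv_log_s) @ A.T + (A * inv_log_s) @ W.T
    return {"CN": cn, "WCN": wcn, "AA": aa, "WAA": waa}

# Hypothetical 4-node weighted graph (illustrative data only).
W_bench = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 3.0, 0.0],
    [1.0, 3.0, 0.0, 4.0],
    [0.0, 0.0, 4.0, 0.0],
])
scores = local_indices(W_bench)
```

In this toy graph, nodes 0 and 1 share the single common neighbor 2, so their CN score is 1 and their WCN score is the sum of the two edge weights towards node 2.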
The RE-LRW index is a similarity measure of the local random walk based on the relative entropy. In the definition of the RE-LRW index, the probability distribution P(v x ) of v x is constructed in terms of the transition probability that it reaches other nodes after a three-step walk. Then, according to the degree centrality of each node, the transition probability of the top d important nodes is selected to form the d-dimensional probability distribution. Finally, the relative entropy is used to measure the difference between the transition probability distributions of the top d important nodes and all nodes in the graph data.
The LRE index is proposed with the help of the local structure of each node and relative entropy. Thereinto, the local structure of each node can be represented by utilizing the degree distribution of each node. After that, the probability distribution of each node can be obtained by normalizing all elements in their degree distribution. Finally, the difference between the probability distributions of two nodes can be measured by employing relative entropy, and then their similarity can be calculated accordingly.

Evaluation Metrics
In the experiments, the performance of all similarity measures in the most similar node mining and link prediction is tested. For this reason, some evaluation metrics need to be introduced. In the most similar node mining, the ratio of mutual most similar nodes (abbreviated as MS) is used to quantify the effectiveness of all similarity measures. In the link prediction, the area under the receiver operating characteristic curve (abbreviated as AUC) is employed to quantify the prediction performance of all similarity measures.
MS rests on the idea that if the most similar node of $v_x$ is $v_y$, then the most similar node of $v_y$ is likely to be $v_x$. Therefore, if the most similar node of $v_x$ is $v_y$ and the most similar node of $v_y$ is $v_x$, then $v_y$ and $v_x$ are mutually the most similar. For example, in a small-scale graph with 10 nodes, suppose that the most similar node of $v_1$ is $v_2$ and the most similar node of $v_2$ is $v_1$, but no other nodes are mutually most similar. Then, the number of the mutually most similar nodes in this graph is equal to 2 and the MS is 0.2. Thus, the calculation expression of MS is
$$MS = \frac{n_{ms}}{n},$$
where $n_{ms}$ denotes the number of the mutually most similar nodes. In general, the better the performance of a similarity measure, the larger the MS value obtained.
AUC can be interpreted as the probability that an edge randomly selected in the test set is assigned a higher similarity than an edge randomly selected in the unknown edge set. After $r$ independent comparisons, if there are $r_1$ times that the similarity of the test edge is greater than that of the unknown edge and $r_2$ times that they have the same similarity, then the AUC value can be calculated as
$$AUC = \frac{r_1 + 0.5\, r_2}{r},$$
where $r = 10{,}000$ is the number of comparisons carried out in this paper.
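Both metrics are straightforward to compute from similarity scores. The sketch below assumes that MS is the number of mutually most similar nodes divided by the number of nodes and that AUC is estimated by repeated random comparisons; the function names, the fixed random seed, and the toy scores are hypothetical. When several nodes tie for the highest similarity, the one with the smallest index is taken, which is also an assumption.

```python
import random
import numpy as np

def mutual_most_similar_ratio(S):
    """MS: fraction of nodes whose most similar node points back to them."""
    S = np.array(S, dtype=float)          # copy, so the input is untouched
    np.fill_diagonal(S, -np.inf)          # a node is not its own candidate
    best = S.argmax(axis=1)               # most similar node of each node
    n = len(best)
    n_ms = sum(1 for x in range(n) if best[best[x]] == x)
    return n_ms / n

def auc_by_sampling(probe_scores, unknown_scores, r=10000, seed=0):
    """AUC estimated from r random comparisons of probe and unknown edges."""
    rng = random.Random(seed)
    r1 = r2 = 0
    for _ in range(r):
        a = rng.choice(probe_scores)
        b = rng.choice(unknown_scores)
        if a > b:
            r1 += 1                       # probe edge scored higher
        elif a == b:
            r2 += 1                       # tie counts as half
    return (r1 + 0.5 * r2) / r

# Hypothetical similarity matrix: only nodes 0 and 1 are mutually most similar.
S_example = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.2, 0.2],
    [0.5, 0.1, 0.0, 0.3],
    [0.6, 0.2, 0.1, 0.0],
]
ms_value = mutual_most_similar_ratio(S_example)
auc_perfect = auc_by_sampling([0.9, 0.8], [0.1, 0.2])
auc_tied = auc_by_sampling([0.5], [0.5])
```

A measure that always ranks probe edges above unknown edges reaches an AUC of 1, while a measure that cannot separate them at all stays at 0.5.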

Results
In this section, the performance of the REDD index and seven benchmark indices in the most similar node mining and link prediction is evaluated by employing two evaluation metrics: MS and AUC. There may be some statistical errors in the prediction accuracy because the training set and test set are randomly divided during the link prediction. To reduce these errors, the final prediction accuracy of each index is the average value of 30 independent experiments in all graph data. Furthermore, the training set proportion is specified as 0.9 in performing the link prediction.

Analysis of MS Results
First of all, the performance of the REDD and RE-LRW indices is evaluated by using the MS metric. Figure 1 shows the impact of different dimensions d on the MS values of the REDD index and RE-LRW index. From the results, one can observe that the MS curves of the RE-LRW index vary widely in 10 out of 12 graph data. The MS curves of the RE-LRW index are relatively flat in MEET and USAI, but they are also considerably low there; in MEET, the MS value of the RE-LRW index is close to 0. This may be because there are some non-connected subgraphs in MEET, which interrupt the random walk between nodes and result in the poor performance of the RE-LRW index. In contrast, the REDD index is not affected by the disconnection between nodes and achieves good results in MEET. It can also be found that the REDD index maintains high MS curves in most graph data, while keeping the variation of the MS curves small. In particular, the MS curves of the REDD index are clearly higher than those of the RE-LRW index in FOBA, MEET, and USAI. From the above discussion, the REDD index performs better in the most similar node mining.
To compare the effectiveness of the REDD index and the seven benchmark indices, Table 4 lists the MS results of all indices in 12 weighted graph data. Note that the best MS value of each row is highlighted in boldface. Furthermore, RE-LRW$_{opt}$ and REDD$_{opt}$ are used to represent the RE-LRW and REDD indices with the optimal MS value over different dimensions d, respectively. From the results, it can be found that the MS values of the relative entropy-based indices are higher than those of the local structure-based indices and the random walk-based index. This indicates that similarity measures based on relative entropy are indeed effective in the similarity calculation.
Among the local structure-based indices, the MS values of the CN and WCN indices differ considerably in FOBA, MEET, FWFW, REHA, CELE, and USAI. For instance, the MS values of the CN index are lower than those of the WCN index in FOBA and REHA, while the MS values of the CN index are higher than those of the WCN index in MEET, FWFW, CELE, and USAI. This may be caused by the strong and weak ties in the weighted graph data. The same phenomenon holds for the AA and WAA indices. Thus, it is necessary to comprehensively consider the degree and strength of nodes to avoid the influence of strong and weak ties. Compared with the RE-LRW index, the LRW index has lower MS values in most graph data, except MEET. Thus, there are many general similar nodes when the LRW index is used for the similarity calculation. At the same time, this reflects that the similarity measure based on the relative entropy can reduce the dependence on the large-degree nodes, so the similarity of nodes can be better characterized.
The LRE and REDD indices maintain higher MS values in all 12 graph data. From the results, one can find that there are no general similar nodes when the LRE and REDD indices are applied to calculate the similarity of nodes. Although there are some non-connected subgraphs in MEET, the LRE and REDD indices still perform well. The MS values of the LRE and REDD indices are more than 0.4000, and the MS values of the REDD index reach 0.7048 in USAI and 0.8000 in FOBA. Taken together, the REDD index has better performance in the most similar node mining.

Analysis of Scatter Diagram
In the most similar node mining, scatter diagrams are also used to validate the performance of the similarity measure. To make the experimental results distinguishable, the scatter diagrams of all indices are given only for the graph data with more than 100 nodes. In the scatter diagrams, the horizontal coordinate represents the label of a node, and the vertical coordinate denotes the label of the node most similar to it. Therefore, the points should be scattered over the two-dimensional plane as much as possible in the scatter diagram. If the points are concentrated near the diagonal line or form a straight line (i.e., a large number of nodes are most similar to the same node), then the performance of this similarity measure is poor. Figure 2 shows the scatter diagrams of eight indices in MEET, where the degree of node v 19 is the largest, and the degree of node v 48 is the second largest. From the scatter diagrams of the CN, WCN, AA, WAA, and LRW indices, one can see that many nodes are most similar to the nodes v 19 and v 48. This indicates that there are generally similar nodes in the measurement results of these indices. Additionally, there are no generally similar nodes in the measurement results of the RE-LRW index, but its scatter diagram has poor symmetry. In the scatter diagrams of the LRE and REDD indices, there are neither many nodes similar to the large-degree nodes nor many nodes clustered on a straight line. Thus, the performance of the LRE and REDD indices is outstanding in MEET.
Figure 4 shows the scatter diagrams of eight indices in EMAI, where the degree of node v 38 is the largest, the degree of node v 37 is the second largest, and the degree of node v 45 is the third largest. Clearly, there are still generally similar nodes in the scatter diagrams of the CN, WCN, AA, WAA, and LRW indices.
From the scatter diagrams of the RE-LRW, LRE, and REDD indices, one can find that the indices that use relative entropy can effectively avoid generally similar nodes. From the scatter diagrams of the three indices, it can also be seen that the REDD index performs better than the RE-LRW and LRE indices.
In Figure 7, the RE-LRW index indeed avoids the situation in which the large-degree nodes become generally similar nodes. Unfortunately, the most similar nodes of many nodes are clustered on a straight line in the scatter diagram of the RE-LRW index. This indicates that the symmetry of the RE-LRW index still has great room for improvement in USAI. In the scatter diagram of the LRE index, few nodes are distributed near the diagonal line, so the LRE index has relatively better symmetry. As for the REDD index, most of the nodes are not distributed near the diagonal line in its scatter diagram, and many large-degree nodes do not become generally similar nodes. Overall, the REDD index performs better in USAI than the other benchmark indices.
In this subsection, the performance of the REDD index and seven benchmark indices is analyzed. From the results of the CN, WCN, AA, and WAA indices, one can see that these indices, which use only the degree or the strength of nodes, are greatly affected by the strong and weak ties. Because the large-degree nodes are more likely to be visited during the random walk, there are generally similar nodes in the measurement results of the LRW index. From the results of the RE-LRW, LRE, and REDD indices, it can be found that the indices that use relative entropy show superior performance in the most similar node mining. Judging from its results in MEET and USAI, the RE-LRW index still has room for improvement. The REDD index comprehensively considers the degree and strength of nodes, while the LRW index only uses the degree of nodes. Hence, one can observe from their results that the performance of the former is better than that of the latter.
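The reading of the scatter diagrams described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' plotting code: the similarity matrix `toy_sim`, the function names, and the tie-breaking rule are all hypothetical. Each point pairs a node with its most similar node, and `hub_share` reports the fraction of nodes that share the single most frequent target, which is large when generally similar nodes are present.

```python
from collections import Counter

def most_similar_nodes(sim):
    """For each node i, return the index j != i with the highest sim[i][j]."""
    n = len(sim)
    result = []
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        # Ties are broken by the smallest index (an arbitrary, assumed rule).
        result.append(max(candidates, key=lambda j: (sim[i][j], -j)))
    return result

def scatter_points(sim):
    """Points (node label, label of its most similar node) of a scatter diagram."""
    return list(enumerate(most_similar_nodes(sim)))

def hub_share(sim):
    """Fraction of nodes whose most similar node is the most frequent target."""
    targets = most_similar_nodes(sim)
    _, count = Counter(targets).most_common(1)[0]
    return count / len(targets)

# Hypothetical similarity matrix in which node 0 attracts most other nodes.
toy_sim = [
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.2, 0.1],
    [0.8, 0.2, 0.0, 0.3],
    [0.7, 0.1, 0.3, 0.0],
]
points = scatter_points(toy_sim)
share = hub_share(toy_sim)
```

In this toy matrix three of the four nodes point to node 0, so the points line up on a horizontal line, which is the pattern the text associates with poor performance.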

Analysis of AUC Results
To test the effectiveness of the similarity measure in link prediction, the AUC metric is further used to evaluate the performance of the REDD index and seven benchmark indices. Figure 8 shows the AUC curves of the REDD and RE-LRW indices when the dimension d changes from 2 to 7. From the results, one can observe that, except in USAI, the variation amplitude of the AUC curves of the RE-LRW index is almost the same as that of the REDD index, while the accuracy of the RE-LRW index is far lower than that of the REDD index under any dimension d. Although the AUC values of the RE-LRW index are higher than those of the REDD index when d is equal to 2, 3, 4, and 5 in USAI, the AUC values of the REDD index are clearly higher than those of the RE-LRW index when d is greater than 6. This indicates that the REDD index has greater potential in link prediction. On the whole, compared with the RE-LRW index, the REDD index is more suitable for link prediction. In the following, we analyze the effectiveness of the REDD index and seven benchmark indices in link prediction. Table 5 lists the AUC results of all indices in 12 weighted graph data. From the results, one can observe that the AUC values of the REDD index are the highest in 11 out of 12 graph data. Although the AUC value of the REDD index is lower than those of the four local indices in LESM, it is still superior to those of the LRW, RE-LRW, and LRE indices.
From the results of the CN, WCN, AA, and WAA indices, one can also find that the strong and weak ties have a great influence on the similarity measure in link prediction. For instance, the AUC values of the WCN and WAA indices are significantly greater than those of the CN and AA indices in FOBA, STAM, JAMA, FWEW, LESM, MEET, FWFW, EMAI, REHA, and CELE. This indicates that the similarity measures using the strength of nodes more readily capture the formation of edges in these graph data. In contrast, the AUC values of the WCN and WAA indices are lower than those of the CN and AA indices in FWMW and USAI. This indicates that weak ties need to be emphasized in these two graph data. Therefore, it may be more effective to construct the similarity measure by combining the degree and strength of nodes, as in the REDD index we designed.
From the results of the LRW and RE-LRW indices, although the RE-LRW index can enhance the performance of the LRW index in the most similar node mining, the effects of the RE-LRW index are inferior to those of the LRW index in link prediction. In other words, the RE-LRW index has a good performance in the most similar node mining, but its AUC results are quite poor. Therefore, the similarity measure considering only the degree of nodes might perform well only unilaterally in the most similar node mining or link prediction.
In a nutshell, the REDD index not only achieves good results in the most similar node mining, but also performs well in link prediction. This further indicates that comprehensively considering the degree and strength of nodes is effective for constructing the similarity measure.
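For readers unfamiliar with the AUC metric in link prediction, the following sketch shows how it is commonly computed. It assumes the standard definition, AUC = (n' + 0.5 n'') / n, where among n comparisons a held-out (missing) link scores higher than a nonexistent link n' times and equally n'' times; this section does not restate the definition, so treat the formula and the function name `link_prediction_auc` as assumptions, and the scores below as made-up values.

```python
import random

def link_prediction_auc(missing_scores, nonexistent_scores,
                        n_comparisons=None, seed=0):
    """Assumed standard link prediction AUC: (n' + 0.5 * n'') / n.

    With n_comparisons=None every (missing, nonexistent) pair is compared;
    otherwise n_comparisons random pairs are sampled.
    """
    if n_comparisons is None:
        pairs = [(m, u) for m in missing_scores for u in nonexistent_scores]
    else:
        rng = random.Random(seed)
        pairs = [(rng.choice(missing_scores), rng.choice(nonexistent_scores))
                 for _ in range(n_comparisons)]
    higher = sum(1 for m, u in pairs if m > u)   # n'
    equal = sum(1 for m, u in pairs if m == u)   # n''
    return (higher + 0.5 * equal) / len(pairs)

# Hypothetical similarity scores of held-out links and of non-links.
missing = [0.9, 0.6, 0.4]
nonexistent = [0.5, 0.4, 0.1, 0.1]
auc = link_prediction_auc(missing, nonexistent)  # 10 wins and 1 tie in 12 pairs
```

A value of 0.5 corresponds to random scoring, so an index is useful for link prediction only to the extent that its AUC exceeds 0.5.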
Generally, low complexity is a vital factor in the design of an algorithm. Since the complexity of the local indices is relatively low, we only compare the running time of the LRE, RE-LRW, and REDD indices in 12 weighted graph data. The time consumption of these three indices is compared by using the normalized time consumption metric [33]. Figure 9 shows the normalized time consumption of the LRE, RE-LRW, and REDD indices in 12 weighted graph data. From the results, the following three phenomena can be found. The LRE index runs the slowest in 11 out of 12 graph data. The time consumption of the RE-LRW index increases with the number of nodes. The time consumption of the REDD index is at a medium level in LESM, MEET, and USAI. It is worth mentioning that the normalized time consumption of the REDD index is not the highest in any of the graph data. Hence, it is also feasible to apply the REDD index to large-scale weighted graph data if a better experimental environment is available. Above all, the REDD index has a satisfactory time consumption in the process of link prediction.

Application to Simulated Networks
As described in the process of link prediction, many real-world graph data may be incomplete. Hence, it is difficult to design a similarity measure applicable to all real-world graph data. To further verify the effectiveness of the REDD index, the NW small-world model is used to construct nine simulated graph data, which are similar to real-world graph data. The NW model can generate graph data with different topological characteristics by adjusting the parameters M and P. For example, parameter M can be applied to adjust the average degree of the network, and parameter P can be utilized to regulate the average clustering coefficient of the network. The topological statistical characteristics of the nine simulated networks are listed in Table 6. From Table 6, it can be observed that the number of nodes of each of the nine simulated graph data is fixed at 100, and the topological statistical characteristics of these graph data change with the parameters M and P. From the results of Figures 10-13, it is not difficult to find that the performance of the REDD index is hardly affected by the topological characteristics of graph data. Thus, in both the real-world graph data and the simulated graph data, the REDD index has better performance in the most similar node mining and link prediction.
Figure 10 shows the MS curves of the REDD and RE-LRW indices in nine simulated networks when the dimension d is changed from 2 to 7. Compared with their MS performance in real-world graph data, the REDD and RE-LRW indices show better MS performance in the simulated graph data. This indicates that the performance of the corresponding algorithm will be improved to some extent if the graph data can be collected more accurately. From Figure 10, one can observe that the MS curves of the RE-LRW index present a large fluctuation range in different graph data.
Thus, the robustness of the RE-LRW index still needs to be improved in simulated graph data.
Figure 11 shows the AUC curves of the REDD and RE-LRW indices in nine simulated networks when the dimension d is changed from 2 to 7. From the results in Figure 11, it can be observed that the REDD index is better than the RE-LRW index in accuracy and robustness. Therefore, if a similarity measure based on relative entropy only considers the degree of nodes, its performance may have no advantage in link prediction. Above all, these results in simulated graph data once again reflect that it is feasible to comprehensively take into account the degree and strength of nodes for enhancing the performance of the similarity measure based on the relative entropy.
Figure 12 gives the comparison of the MS values between the REDD index and seven benchmark indices in nine simulated networks. From the results, it can be seen that the performance of the CN and AA indices is better than that of their weighted versions. This indicates that the CN and AA indices are more suitable for performing the most similar node mining in simulated networks. Moreover, it can also be found that the MS values of the CN index are the highest in net4 and net7. This indicates that the CN index performs well in the graph data with a larger average clustering coefficient. Compared with the RE-LRW index, the LRW index seems to perform less than ideally in the most similar node mining.
Figure 13 reports the comparison of the AUC values between the REDD index and seven benchmark indices in nine simulated networks. From the results, it can be seen that the performance of the CN and AA indices may be inferior to that of their weighted versions. This indicates that the indices that only consider the degree or strength of nodes are also influenced by strong and weak ties in the weighted simulated graph data.
From the results, the LRW index may have a greater advantage than the RE-LRW index in link prediction. Thus, the performance of the RE-LRW and LRW indices needs to be further improved in the most similar node mining and link prediction. In addition, it can be found that although the MS performance of the RE-LRW index is almost the same as that of the REDD index, the AUC performance of the former is far lower than that of the latter. This may be because the REDD index comprehensively utilizes the degree and strength of nodes, which enhances the performance of the algorithm. From the above analysis and discussion, the REDD index can also achieve good results in simulated networks after considering the degree and strength of nodes.
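The construction of the simulated networks can be sketched in plain Python. This is a minimal sketch, not the authors' generator: it assumes that M corresponds to the number of nearest ring neighbours of each node (`k` below) and P to the shortcut probability (`p`) of the Newman-Watts model, and it leaves edge weights out because this section does not state how they are assigned. All function names are hypothetical, and a rejected shortcut is simply skipped rather than redrawn.

```python
import random

def nw_small_world(n, k, p, seed=None):
    """Newman-Watts graph as an adjacency dict {node: set(neighbours)}.

    Starts from a ring in which each node links to its k nearest neighbours
    (k // 2 on each side), then adds one shortcut per lattice edge with
    probability p. No lattice edge is removed.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    lattice_edges = []
    for v in range(n):
        for step in range(1, k // 2 + 1):
            u = (v + step) % n
            adj[v].add(u)
            adj[u].add(v)
            lattice_edges.append((v, u))
    for v, _ in lattice_edges:
        if rng.random() < p:
            w = rng.randrange(n)
            # Skip self-loops and duplicate edges (simplifying assumption).
            if w != v and w not in adj[v]:
                adj[v].add(w)
                adj[w].add(v)
    return adj

def average_degree(adj):
    return sum(len(nb) for nb in adj.values()) / len(adj)

def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for v, nb in adj.items():
        d = len(nb)
        if d < 2:
            continue
        nb_list = sorted(nb)
        links = sum(1 for i, a in enumerate(nb_list)
                    for b in nb_list[i + 1:] if b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)

ring = nw_small_world(100, 4, 0.0, seed=1)  # pure ring lattice, no shortcuts
net = nw_small_world(100, 4, 0.3, seed=1)   # ring lattice plus shortcuts
```

Calling `average_degree` and `average_clustering` on graphs generated with different `k` and `p` gives the kind of topological statistics that Table 6 reports for the nine simulated networks.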

Summarization and Discussion
In the previous subsections, we verified the performance of the REDD index in the most similar node mining and link prediction. The corresponding figures and tables show the experimental results of the REDD index and seven benchmark indices in 12 real-world weighted graph data and 9 simulated weighted graph data. From these results, we can draw the following summaries and observations.

• From the results in Figures 1 and 8, it can be seen that the MS and AUC curves of the RE-LRW and REDD indices change with the variation in dimension d. From the variation range of the corresponding curves, one can find that the REDD index has greater applicability in the most similar node mining and link prediction. This also supports the conjecture and motivation of this paper.
• From the results of Tables 4 and 5, one can observe that the REDD index has higher MS and AUC values than the seven benchmark indices. In particular, the AUC values of the REDD index are more than 94% in 12 weighted graph data. This is because the degree and strength of nodes are considered in the REDD index at the same time, which lets the REDD index fully combine the ability of nodes to collect and diffuse information. Moreover, the REDD index assumes that the similarity between two nodes is inversely related to the distance between them. Thus, the probability between nodes is obtained by subtracting their normalized distance from the constant 1 during the construction of the REDD index. This makes the REDD index better able to describe the generation of an edge between two nodes.
• According to the results from Figures 2-7, we can find that, in the measurement results of the CN, WCN, AA, WAA, and LRW indices, some large-degree nodes become generally similar nodes, while the relative entropy-based RE-LRW, LRE, and REDD indices better avoid this situation. Although the scatter diagrams of the RE-LRW and LRE indices show relatively better effects, they are not as good as that of the REDD index. This may be because the degree and strength of nodes are not fully considered in the definition of the RE-LRW and LRE indices.
• From the results in Figure 9, one can observe that the REDD index has a reasonable time cost compared with the RE-LRW and LRE indices. This indicates that the time consumption for measuring the difference between the probability distributions is less than the time consumption for measuring the difference between transition probability distributions and degree distributions.
• According to the results from Figures 10-13, we can see that the REDD index also has better performance in most simulated networks, compared with the seven benchmark indices. This indicates that the performance of the REDD index is hardly influenced by the type and structure of the network. It also indicates that the REDD index has good universality in the most similar node mining and link prediction.
• In this paper, we introduce the relative entropy into weighted graph data and propose a similarity measure of nodes. The proposed measure is tested in multiple graph data, including real-world and simulated graph data. According to the experimental results, we conjecture that the performance of the REDD index can be further analyzed by using statistical methods. For instance, the Monte Carlo approach can be used to describe the reliability and limits of the REDD index.
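The step mentioned in the summary above, in which a probability is obtained by subtracting a normalized distance from 1 and the resulting distributions are compared by relative entropy, can be illustrated as follows. This is a hedged sketch rather than the REDD definition itself: it assumes normalization by the largest distance, keeps the d largest values, rescales them to sum to 1, and compares two nodes with a symmetrized relative entropy. The function names, the `eps` guard against zero probabilities, and the example distances are all hypothetical.

```python
import math

def closeness_distribution(distances, d):
    """Map one node's distances to a length-d probability vector.

    Assumed steps: normalize by the largest distance, take 1 minus the
    normalized distance, keep the d largest values, rescale to sum to 1.
    """
    max_dist = max(distances)
    if max_dist == 0:
        closeness = [1.0] * len(distances)
    else:
        closeness = [1.0 - x / max_dist for x in distances]
    top = sorted(closeness, reverse=True)[:d]
    total = sum(top)
    if total == 0:
        return [1.0 / len(top)] * len(top)
    return [c / total for c in top]

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(p, q):
    """Relative entropy is not symmetric, so both directions are added."""
    return relative_entropy(p, q) + relative_entropy(q, p)

# Hypothetical distances from two nodes to the other nodes of a graph.
p_a = closeness_distribution([1.0, 2.0, 4.0, 8.0], d=3)
p_c = closeness_distribution([4.0, 4.0, 4.0, 8.0], d=3)
difference = symmetric_relative_entropy(p_a, p_c)
```

A smaller `difference` means the two nodes see the rest of the graph in a more similar way; how this difference is mapped to a similarity score is defined in the method section and is not reproduced here.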

Conclusions
To further enhance the performance of the similarity measure of nodes in weighted graph data, we designed the relative entropy of the distance distribution based similarity measure of nodes. Considering that the degree of nodes reflects their ability to diffuse information and the strength of nodes reflects their ability to collect information, the structural weights of nodes were defined by integrating their degree and strength. On this basis, the structural weights-based distance between two nodes was calculated with the help of the Euclidean distance formula, and then the distance distribution of each node was obtained. Because the relative entropy was used to measure the similarity of nodes in this paper, it is necessary to give the probability distribution of nodes. Hence, the probability distribution of nodes was defined by normalizing their distance distribution. To save time cost, the similarity of nodes was calculated by measuring the difference between the probability distributions of the top d important nodes and all nodes in graph data. In 12 real-world and 9 simulated weighted graph data, the performance of the proposed algorithm and 7 benchmark algorithms was compared by utilizing 2 evaluation metrics. The theoretical derivations and experimental analyses demonstrated that the proposed similarity measure of nodes was more advantageous in both the most similar node mining and link prediction.
In a large amount of graph data with complex structures, the status of many nodes may be disturbed, as discussed in Ref. [34] on the problem of graph node perturbation. Therefore, the influence of graph node perturbation will be considered in our algorithm framework to further validate the effectiveness of the proposed algorithm in future work.