Complex networks provide convenient models for complex systems in biology, physics, and social sciences [1]. Owing to the heterogeneous nature of complex networks, the roles played by individual nodes can differ greatly [6]. Many mechanisms of complex networks, such as spreading dynamics, cascading reactions, and network synchronization, are strongly affected by a tiny fraction of so-called important nodes [7]. Identifying the most important nodes, or ranking node importance by quantitative analysis in large-scale networks, is therefore highly significant: it allows us to better control rumor and disease spreading [8], design viral marketing strategies [10], rank the reputation of scientists and publications [11], optimize limited resource allocation [13], and protect critical regions from intended attacks [14].
Since the pioneering works of several social scientists interested in quantifying the centrality and prestige of actors in social networks, the identification of the most central nodes in complex networks has come a long way [15], and diverse indices have been proposed to measure it; to name some well-known ones: Degree [16], Closeness [17], Coreness [8], Eccentricity [18], Herfindahl index (H-index) [19], Eigenvector [20], PageRank [21], HITS [22], Subgraph centrality [23], Information centrality [24], Bonacich Alpha centrality [25], Betweenness [26], Load [27], Stress [28], etc. As Borgatti and Everett summarized from the graph-theoretic perspective [29], all measures of centrality assess a node's involvement in the walk structure of a network: the walk type of degree-like centralities, such as closeness and eigenvector, is radial, while the walk type of betweenness-like centralities, such as load and stress, is medial. In other words, the majority of known methods for identifying or ranking important nodes make use of structural information [6]. For more details about the above-mentioned centralities, we refer to [6].
Structural complexity is perhaps the most important property of a complex network [1]. Network entropy is usually utilized to characterize the amount of information encoded in the network structure and to measure structural complexity [30]. Recently, considerable effort has focused on quantifying network complexity using entropy measures [31], and several entropic metrics have been proposed; to name a few: network connectivity entropy [32], cyclic entropy [33], mapping entropy [34], hotspot entropy [35], Riemannian-geometric entropy [36],
and q-entropy [37]. These measures have been shown to be extremely successful in quantifying the level of organization encoded in the structural features of networks.
As mentioned before, we often resort to network structural information for ranking node importance, while entropy can serve as a fundamental tool to capture the structural information of complex networks. However, to the best of our knowledge, few entropic metrics have been developed to measure node importance. Two recent exceptions are Relative Entropy [38] and Expected Force [39]. Relative entropy [38] was proposed as an integrated evaluation approach for node importance, which generates an optimal solution from different individual centrality indices by linear programming. Thus relative entropy serves not as a direct metric for node importance ranking, but as a hybrid approach for synthesizing existing ones. Expected Force [39] is a node property derived from the local network topology to depict the force of infection, and can be approximated by the entropy of the onward connectivity of each transmission cluster. It outperforms the accessibility, k-shell, and eigenvalue centralities in predicting epidemic outcomes. However, it only utilizes local structural information to depict node importance; the structural information of the whole network remains unused. It is generally assumed that metrics for ranking node importance should be defined at the local level for each node: "The one thing that all agree on is that centrality is a node-level construct [29]". This leaves open the question of whether graph-level structural information can be utilized to evaluate local-level node importance more efficiently.
The main contribution of this work is to borrow network entropy, which is defined to depict structural complexity at the graph level, to characterize node importance at the node level. Node importance is defined as the variation of network entropy before and after a node's removal. Like other state-of-the-art node importance ranking methods, the proposed method also makes use of structural information. However, the structural information here is at the system level, not the local level. Since more structural information of the network is utilized, better performance is expected. We conduct empirical investigations on several real-life complex networks to demonstrate the feasibility and superiority of the proposed method. Results show that this novel entropic metric outperforms other state-of-the-art metrics in identifying the top-k most important nodes.
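The removal-based definition can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes degree as the information function and takes the network entropy to be the Shannon entropy of the normalized degree sequence (the precise definitions are given in Section 2).

```python
import math

def network_entropy(adj):
    """Shannon entropy of the normalized degree sequence (assumed
    information function; the paper's exact definition is in Section 2)."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    total = sum(degrees)
    if total == 0:
        return 0.0
    return -sum((d / total) * math.log(d / total) for d in degrees if d > 0)

def entropy_variation(adj):
    """Node importance = variation of network entropy caused by the
    node's removal: |H(G) - H(G - v)| for each node v."""
    base = network_entropy(adj)
    scores = {}
    for v in adj:
        reduced = {u: nbrs - {v} for u, nbrs in adj.items() if u != v}
        scores[v] = abs(base - network_entropy(reduced))
    return scores

# On a star graph, removing the hub changes the entropy far more than
# removing any leaf, so the hub ranks as the most important node.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
scores = entropy_variation(star)
```

The loop rebuilds the reduced graph for every node, which already hints at the computational cost discussed later: one full entropy computation per node removal.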
The remainder of this paper is organized as follows. Section 2 describes the proposed entropic metric to quantify node importance and the algorithm to rank the top-k most important nodes; it also describes the eight real-life networks, the Snake Idioms Network and seven other well-known networks, on which our empirical investigations are conducted. The experimental results are described in Section 3. The necessary conclusions, interpretations, and further directions are discussed in Section 4.
3.1. On the Snake Idioms Network
We can calculate the entropy variation of each node in the Snake Idioms Network with the methods described in Section 2.1; the results are shown in Figure 2.

We depict the entropy variations with both barplots and boxplots, as in Figure 2. The height of each bar represents the entropy variation after the removal of the corresponding node, while the boxplot visualizes the distribution of the entropy variations. The five important percentiles of these entropy variations are listed in Table 2.
Although the Entropy Variation is calculated on the basis of degree (or betweenness), it is of higher resolution than degree (or betweenness) itself. Taking the in-degree as an example, there are 1268 distinct Entropy Variation values for the 4234 nodes in the Snake Idioms Network, compared to only 29 distinct in-degree values. On average, with the proposed method, about 4234/1268 ≈ 3.34 nodes share the same node importance, while with in-degree centrality, 4234/29 ≈ 146 nodes have to share the same node importance. For two nodes with the same node importance, all we can do is rank them randomly. In this sense, the proposed entropic metric has greater resolving power than the underlying degree function. The reason may be that degree only counts how many neighbors each node has in the local view, while the Entropy Variation gains a deeper insight into the connectivity pattern and reveals more structural information in the global view.
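The resolution comparison above amounts to counting distinct score values; a hypothetical helper (not from the paper) makes the computation explicit:

```python
def resolution(scores):
    """Average number of nodes forced to share one score value;
    lower values mean a finer-grained (higher-resolution) ranking."""
    return len(scores) / len(set(scores.values()))

# Four nodes but only three distinct scores: on average 4/3 nodes per score.
example = {"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.3}
```

Applied to the figures quoted above, 4234 nodes with 1268 distinct values give a resolution of about 3.34, versus about 146 for the 29 distinct in-degree values.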
Considering that several nodes share the same centrality scores, we conduct a large number of randomizations while calculating the r(k) scores and observe their disturbance. Figure 3 shows the r(k) disturbances for in-degree centrality and for the proposed Entropy Variation (also with in-degree as its information function).
One can observe that the in-degree (DEin) shows much greater divergence than the Entropy Variation (EnVin). In this sense, the Entropy Variation calculated on the basis of in-degree is much more robust than in-degree itself. In view of the possible r(k) disturbances, the results in the following parts of this work will be the mean values over 50 randomizations (random permutations of the nodes that share the same centrality scores).
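The tie-breaking randomization can be sketched as follows; this is a hypothetical illustration, with `evaluate` standing in for the r(k) score defined in Section 2:

```python
import random

def tie_randomized_ranking(scores, rng):
    """Sort nodes by descending score; a pre-shuffle followed by a
    stable sort randomly permutes nodes that share the same score."""
    nodes = list(scores)
    rng.shuffle(nodes)
    return sorted(nodes, key=lambda v: -scores[v])

def mean_over_randomizations(scores, evaluate, runs=50, seed=42):
    """Average an evaluation function (e.g., r(k)) over `runs`
    tie-randomized rankings."""
    rng = random.Random(seed)
    return sum(evaluate(tie_randomized_ranking(scores, rng))
               for _ in range(runs)) / runs
```

Only the order among tied nodes varies between runs, so a metric with few ties (such as the Entropy Variation) shows little disturbance, as observed in Figure 3.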
Since the top-k largest-degree nodes are usually considered the benchmark of node importance ranking in many applications [6], we first compare the Entropy Variation with the Degree centrality; the results are shown in Figure 4.
The comparative plot in Figure 4 suggests that the performance of the Entropy Variation (whether based on in-degree, out-degree, or betweenness) is notably better than that of the Degree centralities, even though the Entropy Variation is calculated on the basis of degree (or betweenness). For a comprehensive comparison, we sum r(k) from k = 1 to k = 800. The Entropy Variation with betweenness achieves the best performance, with a sum of r(k) of 548.27, while the sum of r(k) for out-degree is only 440.42. Thus r(k) is raised by almost a quarter (24.49%).
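The quoted improvement is the relative difference of the two r(k) sums reported above:

```python
# Sums of r(k) over k = 1..800, as quoted in the text (Figure 4)
env_betweenness = 548.27
out_degree = 440.42

# Relative improvement: (548.27 - 440.42) / 440.42, roughly a quarter
improvement = (env_betweenness - out_degree) / out_degree
```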
As mentioned before, several methods other than the basic degree centrality have been proposed to rank node importance. We conduct further comparative investigations on the Snake Idioms Network among the different methods, removing the top-10, top-50, top-100, and top-150 most important nodes according to each metric. The resulting r(k) values are depicted as a dodged bar plot in Figure 5. The numeric values are arranged in Table 3.
One can observe that, for the Snake Idioms Network, the performance of these methods can be ranked as: Entropy Variation (in) > Betweenness > Entropy Variation (all) > Load > Expected Force > Entropy Variation (betweenness) > Degree (all) > Entropy Variation (out) > Stress > Degree (out) > Closeness (all) > PageRank > H-index (out) > Eigenvector > H-index (all) > Information Centrality > Degree (in) > Bonacich Alpha centrality > HITS (hub) > Coreness (out) > HITS (authority) > H-index (in) > Coreness (in) > Subgraph centrality > Eccentricity (in) > Eccentricity (out) > Coreness (all) > Eccentricity (all) > Closeness (in) > Closeness (out).
If we pick only the best of the three modes "in", "out", and "all" for each method, the methods can be ranked as: Entropy Variation > Betweenness > Load > Expected Force > Degree > Stress > Closeness > PageRank > H-index > Eigenvector > Information Centrality > Bonacich Alpha centrality > HITS > Coreness > Subgraph centrality > Eccentricity.
The proposed ranking method outperforms all the other state-of-the-art methods, and the advantage is notable. Compared with the second-best method, betweenness, the proposed Entropy Variation raises r(k = 10), r(k = 50), r(k = 100), and r(k = 150) by 154.13%, 101.08%, 16.94%, and 22.57%, respectively. Overall, the sum of r(k) is raised by 42.95% (as in the penultimate column of Table 3).
3.2. On Other Well-Known Networks
In order to validate the feasibility of the proposed method over a broader range, we conduct our investigations on more well-known networks, which have been reported in [57].
Considering that the orders of the networks (#node, the number of nodes) differ greatly, ranging from 32 (the Hens network) to 6301 (the Gnutella network), we only calculate r(k = 10). Given that the r(k = 10) values are not scaled by the orders of the networks, our attention is focused mainly on each column, comparing the r(k) values within the same column, which reveals the performance of the different methods on the same network. The r(k = 10) values of the different methods on the different networks are arranged as a matrix, as shown in Figure 6.
It can be seen that the darkest parts of most of the columns lie in the last four rows, which reveals the advantage of the proposed Entropy Variation, regardless of its information function. More specifically, the last row, Entropy Variation with in-degree as its information function (EnVin), is marked with "**" five times, which indicates that for five of the eight networks, namely Blogs, Hens, High School, Physicians, and the Snake Idioms Network, the entropic metric EnVin outperforms all the other ranking methods. For two other networks, Air Traffic Control (AirTraffic) and Gnutella, Entropy Variation with betweenness (EnVbtw) performs best. It should be mentioned that the proposed Entropy Variation is not always the best, as in the case of the Neural network, where betweenness gains the highest r(k) score.
The above results were all implemented in the R language [52]. The code and data for this experiment are available at [67].
4. Discussion and Conclusions
Network entropy is usually utilized to characterize the amount of information encoded in the network structure and to measure structural complexity at the graph level. In this study, we borrow network entropy to quantify and rank node importance at the node level. We propose an entropic metric, Entropy Variation, defining node importance as the variation of network entropy before and after a node's removal, under the assumption that the removal of a more important node is likely to cause more structural variation. Like other state-of-the-art methods for ranking node importance, the proposed entropic metric also utilizes structural information, but at the system level, not the local level. Empirical investigations on the Snake Idioms Network and seven other well-known real-life networks demonstrate the superiority of the proposed entropic metric. The results show that the proposed method notably outperforms other methods such as Degree, Closeness, Coreness, Eccentricity, H-index, Eigenvector, PageRank, HITS, Subgraph centrality, Information centrality, Bonacich Alpha centrality, Betweenness, Load, Stress, and the recently proposed entropic metric Expected Force.
As for the empirical investigation, we model the educational game Snake Idioms as a complex network, which helps to capture a holistic view of the complex system. Applying the proposed ranking method to this real-life network, we are able to pick out the most important idioms, whose absence would cause the widest failure of the game. These results can provide an important reference for the teaching of Chinese idioms.
The point of the proposed Entropy Variation is to take advantage of holistic structural information to evaluate the importance of each node. Being a graph-level metric, it utilizes more structural information, and the empirical investigations show that its performance is better. However, the whole structure of the network cannot always be captured. In that case, the Entropy Variation is not suitable, and we might resort to local-level methods such as Degree [16] and Expected Force [39].
Another drawback of the proposed method is its computational complexity in comparison with some existing centrality metrics such as degree. The proposed algorithm may take some time to run, since we have to delete the nodes one by one and recalculate the network entropy after each removal. Although the algorithm has a loop structure, parallel computation is recommended, because the entropy variation computations for the individual nodes are mutually independent. In practical terms, we ran our program, written in the R language with the add-on packages doParallel [68] and foreach [69], on a laptop (Lenovo T450 with an Intel Core i7-5500U CPU), and it took 28.476 s for the Snake Idioms Network.
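The paper's implementation uses R with doParallel and foreach; as a hypothetical sketch in Python, the same embarrassingly parallel structure looks like this, where `score_without` is an assumed helper computing one node's entropy variation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_entropy_variations(nodes, score_without, workers=4):
    """Per-node entropy variations are mutually independent, so the
    removals can be evaluated concurrently.  For CPU-bound entropy
    computations, ProcessPoolExecutor would replace the thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(nodes, pool.map(score_without, nodes)))
```

Because no removal depends on any other, the speedup scales with the number of workers up to the number of nodes, minus scheduling overhead.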
It must be remarked that the response to node removal differs from system to system, and complex networks display different levels of vulnerability to node removal. A centrality that is optimal for one application is sometimes sub-optimal for another, and it is impossible to find a universal index that best quantifies node importance in every situation [6]. In order to seek more universal conclusions, we will conduct our investigations on more empirical and theoretical networks in the future.