Weighted H-index for identifying influential spreaders

Spreading is a ubiquitous process in social, biological and technological systems. Identifying influential spreaders, which is important for preventing epidemic spreading and for establishing effective vaccination strategies, is therefore of both theoretical and practical significance. In this paper, a weighted h-index centrality based on virtual node extension is proposed to quantify the spreading influence of nodes in complex networks. Simulation results on real-world networks show that the proposed method provides more accurate and more consistent rankings than five classical methods. Moreover, the measure also performs well in terms of monotonicity and computational complexity.


Introduction
Many spreading phenomena in the real world, such as cascading failures [1], virus transmission [2] and rumor diffusion [3], can be described as spreading processes on complex networks. Understanding the role that a single node plays provides valuable insight into network structure and function [4]. Identifying influential spreaders in complex networks has therefore attracted much attention, and the fundamental problem in this line of research is how to identify and rank the most efficient spreaders.
Degree centrality, the simplest indicator, focuses on the number of links per node and assumes that the most connected nodes are hubs [5]. There are also many classic topological metrics, such as betweenness centrality [6], closeness centrality [7] and Katz centrality [8]. These measures distinguish nodes of different influence well, but their computational complexity is prohibitive on large-scale networks. Kitsak et al. [9] argued, using k-shell decomposition analysis, that the most influential spreaders are the nodes that reside in the core of the network. However, k-shell decomposition tends to assign the same coreness index to many nodes with different spreading ability [10]. Researchers have therefore proposed improved methods to overcome this shortcoming. For example, Zeng et al. proposed a mixed degree decomposition (MDD) method that considers both the residual degree and the exhausted degree [11], but its optimal parameter λ is uncertain. Liu et al. took into account the shortest distance from a target node to the nodes with the highest k-shell values and obtained a more distinguishable ranking list [1_0]. Bae and Kim proposed a measure called neighborhood coreness centrality $C_{nc+}$, which uses the neighborhood coreness of a node's neighbors [16]. Wang et al. used the iteration information produced during k-shell decomposition to evaluate the influence capability of nodes [13]. Shuang et al. designed an iterative neighbor information gathering (ING) process to rank node influence [18]. Other node ranking algorithms have also been introduced to improve ranking performance [14, 15, 17, 1_1, 1_2, 1_3, 1_7, 1_6].
Apart from improved k-shell decomposition algorithms, many other effective algorithms exist. Liu et al. [1_4] used the total asymmetric link weights to quantify the impact of a node in spreading processes. Zhang et al. [1_5] proposed the VoteRank method to identify a set of decentralized spreaders with the best spreading ability. Ma et al. [1_8] modified local centrality and integrated it with DC by considering the spreading probability.
In this paper, we argue that the edges in a network can be quite different [1_9] and have different significance for network structure and function [1_10]. Many measures use the importance of edges to define the importance of a node [1_4, 1_12, 1_13, 1_14, 1_15]. For example, the number of shortest paths that pass through an edge can be regarded as the weight of that edge [1_11]. The measure $w_{ij} = (k_i k_j)^{\theta}$ [1_16], which was found to correlate positively with the volume of passengers traveling between two airports, has been adopted in many works to distinguish among edges in unweighted networks. Recently, Lü et al. [1_17] constructed an operator ℋ acting on the degrees of a node's neighbors and obtained an h-index for each node. The h-index was the best overall performer when compared with three typical centralities on undirected networks. Inspired by these results, we weight each edge by the product of the degrees of the two nodes it connects. We then apply the operator ℋ to the neighbors of each node after extending each neighbor $j$ into $k_j$ weighted edges, where $k_j$ is the degree of that neighbor. The sum of the neighbors' weighted h-index values defines the importance of a node. To evaluate the effectiveness of the proposed measure, we use the susceptible-infected-recovered (SIR) model to simulate an epidemic spreading process on sixteen real-world networks. Performance is compared through the rank correlation between the ranking list of each centrality measure and the SIR simulation results. The results show that the proposed method generally ranks the spreading ability of nodes better than five other centralities. Moreover, calculating the weighted h-index centrality has a complexity of $O(m)$, where $m$ is the number of edges. The weighted h-index centrality is therefore more efficient than time-consuming measures such as betweenness centrality and closeness centrality.
The remainder of this paper is organized as follows. Section 2 reviews the definitions of the centrality measures used for comparison and introduces our method. Section 3 reports the evaluation methodologies and the experimental results. The conclusions are presented in Section 4.

Centrality measures
In this part, we introduce the classic centrality measures and the proposed method. For a given unweighted complex network $G = (V, E)$, $n = |V|$ is the number of nodes and $m = |E|$ is the number of edges. Let $w_{ij}$ be the value of edge $e_{ij}$ if node $i$ is connected to node $j$. We use $\Gamma_i$ to denote the set of neighbors of node $i$.

Degree centrality
The degree centrality (D) is the simplest indicator of node importance. It focuses on the number of links per node and assumes that the most connected nodes are hubs. Let $D(i)$ denote the D of node $i$, which is defined as:

$$D(i) = k_i \qquad (1)$$

where $k_i$ is the degree of node $i$.

Betweenness centrality
The betweenness centrality (B) of a node $i$ is the sum, over all pairs of nodes, of the fraction of shortest paths that pass through node $i$. We denote by $B(i)$ the B of node $i$, which is given by:

$$B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}} \qquad (2)$$

where $\sigma_{st}(i)$ and $\sigma_{st}$ represent the number of shortest paths between $s$ and $t$ that pass through node $i$ and the total number of shortest paths between $s$ and $t$, respectively. A node with higher B has more control over the network because more information passes through it, as in a telecommunications network.

Closeness centrality
The closeness centrality (C) of a node is calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. Thus the more central a node is, the closer it is to all other nodes. The C of node $i$ is defined as:

$$C(i) = \frac{1}{\sum_{j \neq i} d_{ij}} \qquad (3)$$

where $d_{ij}$ is the shortest distance between node $i$ and node $j$.
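As an illustration, closeness on an unweighted graph can be computed with one breadth-first search per node. The sketch below assumes the graph is connected, is stored as a dictionary mapping each node to the set of its neighbors, and takes closeness as the reciprocal of the summed distances; the function name `closeness` and the choice of normalization are assumptions made for this example.

```python
from collections import deque

def closeness(adj):
    """Closeness of every node in a connected, unweighted graph.

    adj: dict mapping node -> set of neighbor nodes (assumed representation).
    """
    result = {}
    for source in adj:
        # Breadth-first search gives shortest-path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total = sum(dist.values())
        # A single isolated node has no distances to sum; report 0.0 (assumption).
        result[source] = 1 / total if total else 0.0
    return result
```

On a path of three nodes, the middle node is at distance 1 from both ends and therefore obtains the largest value.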

K-shell centrality
The k-shell centrality (KS) is obtained through the k-shell decomposition process. Each node is assigned a k-shell index by recursively pruning nodes with degree less than or equal to $k$, and the nodes removed at level $k$ receive the index $k$. The pruning process continues with increasing $k$ until all nodes in the network are removed. As a result, each node is associated with exactly one k-shell index.
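The pruning procedure can be sketched as follows. This is a minimal illustration rather than an optimized implementation; the adjacency-dictionary representation and the function name `k_shell` are assumptions made for the example.

```python
def k_shell(adj):
    """Return {node: k-shell index} for an undirected graph.

    adj: dict mapping node -> set of neighbor nodes (assumed representation).
    """
    degree = {v: len(nb) for v, nb in adj.items()}  # residual degrees
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Move to the next level at which at least one node can be pruned.
        k = max(k, min(degree[v] for v in remaining))
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue  # already pruned via another neighbor
            remaining.discard(v)
            shell[v] = k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    # Pruning v may push u down to the current level.
                    if degree[u] <= k:
                        stack.append(u)
    return shell
```

For a triangle with one pendant node attached, the pendant node is removed at level 1 and the three triangle nodes at level 2.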

H-index
The h-index (H) of node $i$ in a network is defined as the maximum value $h$ such that node $i$ has at least $h$ neighbors of degree larger than or equal to $h$. It is obtained from an operator ℋ that acts on a finite sequence of integers $(k_{j_1}, k_{j_2}, \dots, k_{j_{k_i}})$, the degrees of the neighbors of node $i$, and returns the h-index value of node $i$. Hence, we define the h-index $h_i$ of node $i$ as follows:

$$h_i = \mathcal{H}(k_{j_1}, k_{j_2}, \dots, k_{j_{k_i}}) \qquad (4)$$

where $k_{j}$ is the degree of neighbor node $j$.
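A minimal sketch of the operator and of the resulting node h-index is given below. The names `h_operator` and `h_index`, and the adjacency-dictionary representation of the graph, are assumptions made for illustration.

```python
def h_operator(values):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    for rank, value in enumerate(sorted(values, reverse=True), start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h

def h_index(adj):
    """h-index of every node: the operator applied to its neighbors' degrees.

    adj: dict mapping node -> set of neighbor nodes (assumed representation).
    """
    return {v: h_operator(len(adj[u]) for u in adj[v]) for v in adj}
```

For instance, a node whose neighbors have degrees 5, 3, 3 and 1 has h-index 3, because three neighbors have degree at least 3 but there are not four neighbors with degree at least 4.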

Weighted H-index
The edges in an unweighted network are all treated as having the same value. In fact, edges have different significance for network structure and function. We define the edge weights from the node degrees to quantify the diffusion capacity of links.

Definition I. The weight of edge $e_{ij}$ is defined as:

$$w_{ij} = k_i k_j \qquad (5)$$

where $k_i$ and $k_j$ are the degrees of node $i$ and node $j$, respectively, and node $i$ is directly connected to node $j$. The edge weights obtained from expression (5) for an example network are shown in Fig. 1.

Fig. 1. A schematic representation of a network.
Definition II. In order to apply the ℋ operator to a weighted network, we decompose each weighted edge into multiple weighted edges. We take Fig. 2 as an example to illustrate the computational procedure and focus on the importance of node 3. The edge $e_{37}$ is one of the edges connected to node 3, and its weight $w_{37}$ is 20 according to expression (5). Based on the degree of node 7, $w_{37}$ is extended into 4 weighted edges. The process continues until all neighbors of node 3 have been counted. Expression (4) can then be modified as follows:

$$h^{w}_i = \mathcal{H}\big(\{w_{j,1}, w_{j,2}, \dots, w_{j,k_j}\}_{j \in \Gamma_i}\big) \qquad (6)$$

where $j \in \Gamma_i$ ranges over the neighbors of node $i$, $w_{j,1} = w_{j,2} = \cdots = w_{j,k_j} = w_{ij}$, and $k_j$ is the degree of node $j$. The h-index values $h_1$ and $h_3$ are both 3, whereas the weighted h-index gives $h^{w}_3 = 15$ and $h^{w}_1 = 12$. It is worth mentioning that the spreading abilities of node 1 and node 3 simulated with the SIR model are 0.30963 and 0.30145, respectively. So the weighted h-index can distinguish the diffusion importance of nodes and rank them correctly.

Definition III. We define the weighted h-index centrality (WH) of node $i$ as follows:

$$WH(i) = \sum_{j \in \Gamma_i} h^{w}_j \qquad (7)$$

In Table 1, we show the values measured by the six methods for each node in the schematic network. The spreading ability of each node is simulated in the SIR model using the spreading probability $\beta = 0.29 > \beta_c = 0.28846$, where $\beta_c$ is the epidemic threshold of the network. Ranked by the weighted h-index alone, nodes 7 and 5 appear in the reverse order of their spreading ability, but the WH measure corrects this. The ranking of spreading ability given by WH agrees completely with the simulation results, whereas the rankings of the other centralities are inconsistent with them to varying degrees. The proposed measure therefore ranks the spreading ability of nodes better than the other centrality methods considered.
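Definitions I to III can be sketched directly in code. The version below builds the extended list of virtual edge weights explicitly, which is easy to follow but is not tuned for efficiency; the function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
def h_operator(values):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    for rank, value in enumerate(sorted(values, reverse=True), start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h

def weighted_h_index(adj):
    """Weighted h-index of every node (Definitions I and II).

    adj: dict mapping node -> set of neighbor nodes (assumed representation).
    """
    degree = {v: len(adj[v]) for v in adj}
    result = {}
    for i in adj:
        extended = []
        for j in adj[i]:
            w_ij = degree[i] * degree[j]           # Definition I: product of degrees
            extended.extend([w_ij] * degree[j])    # k_j virtual edges of weight w_ij
        result[i] = h_operator(extended)
    return result

def wh_centrality(adj):
    """WH of every node: sum of the neighbors' weighted h-index values."""
    hw = weighted_h_index(adj)
    return {i: sum(hw[j] for j in adj[i]) for i in adj}
```

On a path of three nodes, every edge has weight 2 and every node has weighted h-index 2, so the middle node obtains WH = 4 and each end node obtains WH = 2.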

Evaluation methodologies
To study the spreading process, we use the SIR model to assess the correctness of the different measures. Initially, there is only one infected seed node (I) and all other nodes are in the susceptible state (S). At each time step, infected nodes attempt to infect each of their susceptible neighbors with probability $\beta$ and then enter the recovered state (R). This process is repeated until there are no longer any infected nodes. In our simulation, we use relatively small values of $\beta$, so that the infected percentage of the population is small. When $\beta$ is high, any seed node can infect a large percentage of the population, and the importance of an individual node cannot be measured. Based on the heterogeneous mean-field theory [21,22,23], we set the infection probability to be slightly greater than the epidemic threshold $\beta_c \sim \langle k \rangle / \langle k^2 \rangle$.
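The simulation described above can be sketched as follows. The number of runs, the inclusion of the seed node in the outbreak size, the normalization by the network size and all function names are assumptions made for this illustration.

```python
import random

def sir_spread(adj, seed_node, beta, rng):
    """Run one SIR outbreak from seed_node and return the number of recovered nodes."""
    infected = {seed_node}
    recovered = set()
    while infected:
        newly_infected = set()
        for v in infected:
            for u in adj[v]:
                susceptible = (u not in infected and u not in recovered
                               and u not in newly_infected)
                if susceptible and rng.random() < beta:
                    newly_infected.add(u)
        recovered |= infected        # infected nodes recover after one time step
        infected = newly_infected
    return len(recovered)

def spreading_ability(adj, beta, runs=1000, rng_seed=0):
    """Average outbreak size per seed node, as a fraction of the network size."""
    rng = random.Random(rng_seed)
    n = len(adj)
    return {v: sum(sir_spread(adj, v, beta, rng) for _ in range(runs)) / (runs * n)
            for v in adj}

def epidemic_threshold(adj):
    """Mean-field estimate <k> / <k^2> of the epidemic threshold."""
    degrees = [len(nb) for nb in adj.values()]
    return sum(degrees) / sum(k * k for k in degrees)
```

A typical use is to compute `beta_c = epidemic_threshold(adj)` and then call `spreading_ability(adj, beta)` with `beta` slightly above `beta_c`.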
To evaluate the correctness of the different methods, we adopt Kendall's tau coefficient as the rank correlation coefficient. In statistics, Kendall's tau coefficient measures the ordinal association between two measured quantities [24]. The Kendall's tau coefficient of two rank vectors $X$ (ranking method) and $Y$ (SIR model) is defined as:

$$\tau = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \qquad (8)$$

where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs, respectively, and $n_0 = n(n-1)/2$, $n_1 = \sum_i t_i(t_i - 1)/2$, $n_2 = \sum_j u_j(u_j - 1)/2$. Here $n$ is the size of the rank vectors, $t_i$ is the number of tied values in the $i$-th group of ties in $X$, and $u_j$ is the number of tied values in the $j$-th group of ties in $Y$.
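The tie-corrected coefficient can be computed directly from its definition. The quadratic-time sketch below is intended only to make the terms concrete; the function name `kendall_tau_b` and the convention of returning 0.0 when the denominator vanishes are assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-corrected Kendall's tau between two equally long score lists."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # pairs tied in x or in y count as neither
    n0 = n * (n - 1) / 2
    n1 = sum(t * (t - 1) / 2 for t in Counter(x).values())  # ties in x
    n2 = sum(u * (u - 1) / 2 for u in Counter(y).values())  # ties in y
    denominator = sqrt((n0 - n1) * (n0 - n2))
    # If one list is constant the coefficient is undefined; 0.0 is returned here.
    return (concordant - discordant) / denominator if denominator else 0.0
```

Here `x` would hold the centrality values of the nodes and `y` their simulated spreading abilities, listed in the same node order.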
We use the imprecision function, initially proposed by Kitsak et al. [9], to quantify the accuracy in pinpointing the most influential spreaders. This function measures the difference between the average spreading of the $pn$ nodes with the highest centrality values and the average spreading of the $pn$ most efficient spreaders according to the SIR dynamics. The imprecision function is

$$\varepsilon(p) = 1 - \frac{M(p)}{M_{\mathrm{eff}}(p)} \qquad (9)$$

where $p$ is a fraction of the network size $n$, and $M(p)$ and $M_{\mathrm{eff}}(p)$ are the average spreading efficiency of the $pn$ nodes with the highest centrality values and of the $pn$ nodes with the highest actual spreading efficiency according to the SIR simulation results, respectively. A smaller $\varepsilon$ represents a higher accuracy of the centrality in identifying the most influential spreaders.
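A minimal sketch of the imprecision function is given below. The truncation of $pn$ to an integer, the arbitrary handling of ties in the centrality ranking and the function name `imprecision` are assumptions made for illustration.

```python
def imprecision(centrality, spreading, p):
    """Imprecision of a centrality ranking for the top fraction p of nodes.

    centrality: dict node -> centrality value
    spreading:  dict node -> simulated spreading ability (assumed positive)
    """
    n = len(centrality)
    top = max(1, int(p * n))  # number of nodes considered, at least one
    # Nodes ranked by centrality; ties are broken arbitrarily in this sketch.
    chosen = sorted(centrality, key=centrality.get, reverse=True)[:top]
    m = sum(spreading[v] for v in chosen) / top
    # Best achievable average: the `top` most efficient spreaders.
    m_eff = sum(sorted(spreading.values(), reverse=True)[:top]) / top
    return 1 - m / m_eff
```

A centrality that selects exactly the most efficient spreaders yields an imprecision of 0.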
The monotonicity of a ranking vector described in Ref. [16] is adopted to quantify the resolution of the different ranking measures as follows:

$$M(R) = \left[1 - \frac{\sum_{r \in R} |R_r|\,(|R_r| - 1)}{|R|\,(|R| - 1)}\right]^2 \qquad (10)$$

where $R$ denotes the ranking vector of the network nodes, $|R|$ is the number of ranked nodes in $R$, and $|R_r|$ represents the number of ties with the same rank $r$. The monotonicity $M(R)$ is 1 if the vector $R$ is perfectly monotonic, and it becomes 0 if all nodes have the same rank in $R$.

Figure 3 shows the imprecision function of the rankings based on the proposed method and the other five centralities. Recall that a lower imprecision implies a higher accuracy in identifying the most influential spreaders. WH (gold circles) gives an imprecision below 0.1 for all $p$ ranging from 0.01 to 0.30 in nearly all cases. Only in the networks Karate Club and HePth does the imprecision get close to 0.20, when $p$ is near 0.02. More notably, WH performs better than the other five measures in most cases, except at some smaller values of $p$ in a few of the sixteen networks. B (blue stars) and C (red xs) always perform worst in terms of the imprecision function. H (green triangles), which is regarded as a trade-off between degree and coreness, performs best among the D, KS and H centralities, although the three are very close in most subfigures of Figure 3. Overall, the imprecision function demonstrates the improved performance of WH in identifying the most influential spreaders: the WH method yields consistently lower imprecision than the benchmark methods.
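The monotonicity measure defined in this section depends only on how many nodes share each centrality value, so it can be sketched in a few lines. The function name `monotonicity` and the convention of returning 1.0 for fewer than two nodes are assumptions made for this example.

```python
from collections import Counter

def monotonicity(scores):
    """Monotonicity of a ranking induced by a list of centrality values."""
    values = list(scores)
    n = len(values)
    if n < 2:
        return 1.0  # no pair of nodes can be tied (assumed convention)
    # Each group of c tied nodes contributes c * (c - 1) to the numerator.
    tied = sum(c * (c - 1) for c in Counter(values).values())
    return (1 - tied / (n * (n - 1))) ** 2
```

Passing the values of a centrality dictionary, for example `monotonicity(centrality.values())`, gives 1.0 when all values are distinct and 0.0 when all values are equal.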

Comparison of rank correlation coefficient
Spreading dynamics is one of the most common processes in many domains, such as physics and society. We use the SIR model to simulate the spreading process and to evaluate how effectively WH quantifies spreading influence, and we apply Kendall's tau correlation coefficient to evaluate the prediction accuracy. A greater absolute value of $\tau$ implies a higher correlation between two sample vectors. A higher correlation between the WH value vector and the spreading range vector therefore indicates better prediction accuracy. The Kendall's tau between the node influence obtained from the SIR model and the six centrality indices is summarized in Table 3. The proposed method is highly correlated with the size of the infected population in the SIR model and outperforms the other ranking methods in most networks. Very similar results are found in Table 4 ($\beta = 1.5\beta_c$) and Table 5 ($\beta = 2\beta_c$). The rank correlation of WH may not be the highest among all centralities for a small value of $\beta$, as in Oregon, but the proposed method achieves the best performance at the higher spreading probabilities of Table 4 and Table 5.
Next, we investigate the Kendall's tau correlation coefficient between the node influence obtained from the SIR model and the six centrality indices by varying the infection probability $\beta$ from 0.01 to 0.2 in eight real networks. The results, which show the effect of the infection probability $\beta$, are given in Fig. 4. WH is clearly accurate over a wide range of probabilities in the eight real networks, especially when the infection probability is around the epidemic threshold $\beta_c$ (the black dotted line). When the infection probability is very small, the spreading is typically confined to the neighborhood of the initially infected node, so nodes with larger degree can infect more nodes. That is why D always achieves the largest $\tau$ values when $\beta$ is less than the epidemic threshold for Karate Club, Dolphins, GrQc and Email. When $\beta$ becomes larger, our method begins to show better performance. Although WH becomes less effective than the D, KS and H centralities when $\beta$ grows much larger than $\beta_c$ in Jazz, US Air, C.elegans and Email, the proposed method still achieves better performance over a wide range of $\beta$. These results demonstrate that WH is a better indicator of spreading influence in complex networks when the infection probability $\beta$ is greater than the epidemic threshold $\beta_c$.

Monotonicity and efficiency
The monotonicity $M$ quantifies the fraction of ties in the ranking list. A higher value of $M$ means that the ranking method has a better resolution of nodes with different influence. The monotonicity of the different ranking methods is summarized in Table 6. WH and C are the best of the six measures. To clarify the ranking distribution, we plot the complementary cumulative distribution function (CCDF) in Fig. 5. The CCDFs of D, KS and H decrease rapidly because many nodes share the same ranking value, whereas the CCDFs of WH and C decline gradually in the different types of networks. Although both WH and C distinguish the spreading capability of influential nodes very well, combining this with the previous analysis shows that WH is the more effective measure for identifying influential nodes. Interestingly, the CCDF of B centrality drops abruptly. Although the CCDF of B centrality declines more slowly than the others in many networks, B has a worse monotonicity than WH and C.
In this part, we also compare the computational complexity of the proposed method and the other five measures. The computational complexity of the six methods is shown in Table 7, where $n$ is the total number of nodes and $m$ is the number of edges in a network. Different methods have different computational complexity and require different network information. The proposed method of calculating WH has a complexity of $O(m)$, given that the degree of each node is known.
Both Floyd's algorithm [3_1] and Brandes' algorithm [3_2] are used to count the number of shortest paths. Calculating the shortest paths can be done using Floyd's algorithm in time $O(n^3)$ and Brandes' algorithm [3_2] in time $O(nm)$. C can be computed in time $O(n^3)$ using Floyd's algorithm [3_1], and in time $O(n^2 \log n + nm)$ on a sparse graph using Johnson's algorithm [3_3]. D, WH and H only need the local information of a node, whereas KS needs global information. The computational complexity of D, WH, H and KS is $O(m)$, which indicates that the proposed method has a low computational complexity and can be used in large-scale networks.
Therefore our measure provides an effective way to rank influential nodes with no increase in complexity, in contrast to the more time-consuming measures. We can thus argue that WH is efficient and accurate for identifying and ranking influential nodes.

Discussion
Spreading, for example of epidemics and information, is a ubiquitous process in social, biological and technological networks. Identifying influential nodes, which can help to optimize and conserve spreading resources in large-scale complex networks, is therefore of both theoretical and practical significance.
In this paper, we proposed a novel method, the weighted h-index centrality, to identify and rank the spreading ability of nodes in complex networks. This measure collects centrality information by adding together the weighted h-index values of a node's neighbors. To evaluate its performance, we applied the proposed method to sixteen real networks and compared it with the size of the infected population in the SIR model. Using the imprecision function $\varepsilon(p)$ and the Kendall's tau correlation coefficient $\tau$ to measure rank imprecision and rank correlation, we find that WH outperforms the other five methods. The proposed weighted h-index centrality exhibits lower imprecision, higher $\tau$ and more competitive monotonicity than the other methods. Moreover, we analyzed the computational complexity of the six methods, and the results show that our method is more suitable for large-scale networks because it uses local information and has a lower complexity. The proposed method therefore discriminates the influence capability of nodes well and provides a more reasonable and efficient ranking in complex networks.
The proposed method is also easily extensible, as the algorithm is based on a node's nearest neighbors. We have concentrated only on the centrality information of nodes in undirected networks, and extensions of our method may be worth studying, as in Refs. [40, 41, 42, 1_4, 1_12, 1_13, 1_14, 1_15]. Recently, some studies have begun to address the diverse structure of networks and spreading dynamics, which play a significant role in communication behavior [43, 44, 4_1]. Further work will therefore apply our method to these issues and seek more efficient ways to identify and rank the spreading influence of nodes.