Identifying Influential Nodes in Complex Networks Based on Multiple Local Attributes and Information Entropy

Identifying influential nodes in complex networks has attracted the attention of many researchers in recent years. However, due to the high time complexity, methods based on global attributes have become unsuitable for large-scale complex networks. In addition, compared with methods considering only a single attribute, considering multiple attributes can enhance the performance of the method used. Therefore, this paper proposes a new multiple local attributes-weighted centrality (LWC) based on information entropy, combining degree and clustering coefficient; both one-step and two-step neighborhood information are considered for evaluating the influence of nodes and identifying influential nodes in complex networks. Firstly, the influence of a node in a complex network is divided into direct influence and indirect influence. The degree and clustering coefficient are selected as direct influence measures. Secondly, based on the two direct influence measures, we define two indirect influence measures: two-hop degree and two-hop clustering coefficient. Then, the information entropy is used to weight the above four influence measures, and the LWC of each node is obtained by calculating the weighted sum of these measures. Finally, all the nodes are ranked based on the value of the LWC, and the influential nodes can be identified. The proposed LWC method is applied to identify influential nodes in four real-world networks and is compared with five well-known methods. The experimental results demonstrate the good performance of the proposed method on discrimination capability and accuracy.


Introduction
Many systems in the real world can be modeled as complex networks to facilitate the study of the systems by complex network analysis. In a complex network, compared with other general nodes, some nodes can greatly affect the function of the whole network. These nodes are called influential nodes, and the quantity of these nodes is small. Ranking and identifying influential nodes in complex networks is one of the most important and basic problems in complex network research.
A variety of measures and methods have been proposed in recent decades for identifying the influential nodes of complex networks [1][2][3]. The most commonly used measures are centrality measures, such as degree centrality [4], closeness centrality [5,6], betweenness centrality [7,8], Katz centrality [9,10], and k-shell [11]. Degree centrality is the simplest metric that considers only the direct neighbors of a target node. Therefore, the degree centrality has low time complexity. However, its accuracy is also low since it ignores the global structure information of the network. Path-based centralities, such as closeness centrality, betweenness centrality, and Katz centrality can get better results. However, as global metrics, they are not suitable for large-scale networks due to their high computational complexity. Kitsak et al. [11] found that the influence of a node is highly related to the position of the node in a network, so they proposed to apply the k-shell decomposition method to identify the influential nodes of the network. However, k-shell decomposition is a coarse-grained method; the same k-shell value is assigned to a large number of nodes. Therefore, scholars have proposed a series of improved methods based on k-shell decomposition [12], such as mixed degree decomposition [13], improved neighbors' k-core [14], neighborhood coreness [15], weighted k-shell [16], k-shell iteration factor [17], extended local k-shell sum [18], influence capability [19], hierarchical k-shell [20], integral k-shell [21], improved k-shell hybrid method [22], etc. As for the field of the search engine, there are also some useful methods, such as PageRank [23], LeaderRank [24], and Hits [25]. Other centrality measures include H-index [26,27] and several measures based on the clustering coefficient, such as local centrality [28] and clustered local-degree [29], etc. In addition, information entropy has also been used to evaluate the influence of nodes in complex networks [30][31][32][33][34].
In recent years, some scholars have made practical explorations from new perspectives. Some researchers have found that the influence of nodes includes direct local influence and indirect influence based on neighbors. Therefore, these researchers constructed influential node identification models based on direct influence and indirect influence [30,31,33,35,36]. Some other researchers have found that considering multiple attributes can achieve better node influence evaluation results than considering only a single attribute. Wang et al. [19] applied a multi-attribute ranking method based on a node's position and neighborhood information. Yang et al. [37] applied TOPSIS to identify influential nodes and combined it with grey relational analysis to rank the importance of complex network nodes. Mo and Deng [38] adopted D-S evidence theory to comprehensively consider degree centrality, betweenness centrality, efficiency centrality, and correlation centrality for evaluating the importance of nodes in complex networks. These multi-attributes methods have proved to be promising strategies to evaluate the influence of nodes. However, there are still some limitations in existing multi-attribute methods, such as some methods being based on global attributes during the evaluation, which results in high computational complexity, unsuitable for current fast-growing social networks. In some methods, the contribution of each attribute is viewed as equal, or the weights of different attributes are set manually, which results in limited improvement in evaluating performance.
Degree and clustering coefficient are two critical attributes in node influence evaluation methods. In particular, degree and clustering coefficient belong to local attributes and have low computational complexity. Meanwhile, Liu et al. [39] deeply studied the role of neighbors on the influence of nodes and concluded that considering the one-step or two-step neighborhood of a node can achieve the best performance in most networks. Inspired by the above discussion, we propose a new multi-attribute weighted centrality based on information entropy, combining degree and clustering coefficient; both one-step and two-step neighborhood information are considered. The main contributions of this paper are summarized as follows. First, this paper considers the indirect influence of nodes and proposes a new measure of indirect influence, i.e., the two-hop clustering coefficient. Second, this paper improves the calculation formula of the two-hop degree. In the existing two-hop degree calculation, the degrees of one-hop neighbors are directly added. In our proposed two-hop degree formula, we take a more reasonable process, eliminating duplicate neighbors and then adding the neighbors of one-hop neighbors. Last but not least, this paper proposes a new multi-attribute weighted centrality with a low computational complexity O(n), as it is calculated based on four local attributes, including degree, two-hop degree, clustering coefficient, and two-hop clustering coefficient. Additionally, the entropy weighting method is exploited to weight the contributions of the four local attributes, rather than weight them equally.
The rest of this paper is organized as follows. In Section 2, relevant knowledge is introduced as a preliminary, including the six kinds of well-known centrality measures and the information entropy theory. In Section 3, the proposed multi-attribute weighted centrality is described. In Section 4, the proposed method is applied to four real world networks and compared with five well-known methods to verify the effectiveness and practicability of our proposed method. Section 5 concludes the paper.

Typical Centrality Measures
The influence of a node can be referred to as centrality. Scholars have proposed a variety of centrality measures to evaluate and rank the influence of nodes. In this section, we briefly introduce six typical centrality measures, including degree, betweenness centrality, closeness centrality, local centrality, clustering coefficient, and clustered local degree. In the following part of this paper, new centrality measures based on the degree and clustering coefficient are proposed, and some of these centrality measures will be used as comparison methods for experiments. [4]). The degree of node i, denoted as k i , which is defined as

Definition 1. (Degree
where N is the node set, node i, j ∈ N, and a ij refers to the connection between node i and node j.
The value of a ij is defined as a ij = 1, node i connected with j 0, otherwise .

Definition 2.
Betweenness centrality [7,8]. The betweenness centrality of node i, denoted as b i , is defined as where n js refers to the number of all the shortest paths between node j and node s, and n js (i) denotes the number of those shortest paths which go through node i.

Definition 3.
Closeness centrality [5,6]. The closeness centrality of node i, denoted as c i , is defined as where d ij refers to the distance between node i and node j. If there is no path between node i and node j, d ij = ∞, then 1 d ij = 0 [5,40].

Definition 4.
Local centrality [28]. The local centrality of node v, denoted as C L (v), is defined as where Γ(u) is the set of the one-hop neighbors of node u, Γ(v) is the set of the one-hop neighbors of node v, and N(w) is the number of the one-hop and the two-hop neighbors of node w.

Definition 5.
Clustering coefficient [37]. The clustering coefficient of node i, denoted as clc i , is defined as where k i refers to the degree of node i, which can be calculated as Formula (1), and e i represents the number of edges connected between the neighbors of node i. Definition 6. Clustered local degree [29]. The clustered local degree of node i, denoted as cld i , is defined as where N(i) is the set of the one-hop neighbors of node i, cc i is the clustering coefficient of node i, which can be calculated as Formula (6), and k i refers to the degree of node i.

Information Entropy and Entropy Weighting Method
Information entropy is a famous concept of information theory, which was first proposed by Claude Elwood Shannon in his work "A Mathematical Theory of Communication" in 1948. It is a useful tool for measuring the uncertainty of information. In general, the more uncertain or random an event is, the more information it will contain. Shannon's information entropy has been applied in many areas, such as data mining, statistical inference, machine learning, image processing, and so on.
According to Shannon's information theory, the information entropy is defined as follows: where X = {x 1 , x 2 , . . . , x n } is a set of possible events, and p(x i ) is the probability of event x i . The entropy weighting method is one of the most important applications of information entropy in the field of social science. Due to its excellent performance, the entropy weighting method is widely used to determine the weights of different attributes in multiattribute decision-making problems [41,42], and it is also widely used in node importance evaluation in complex networks. Therefore, in this paper, we also use the entropy weighting method to determine the weight of the four local attributes.
Suppose there are n attributes to be considered. The weight of attribute i, denoted as w i , is calculated as: where H i (i = 1, 2, . . . , n) is the information entropy of each attribute [19,43].

Direct Influence of a Node with Respect to Local Attributes
As introduced in Section 1, the total influence of a node can be divided into direct influence and indirect influence. Direct influence is the influence of a node on the nodes directly connected to it, that is, the influence exerted on its one-hop neighbors. Considering the degree and clustering coefficient, these two indicators belong to local indicators, which are representative and are simple to calculate. In this study, we selected the degree and clustering coefficient to represent the direct influence of nodes. For the calculation of degree k i and clustering coefficient clc i , see Definition 1 and Definition 5, respectively.

Indirect Influence of a Node on Two-Hop Attributes
The indirect influence of a node is the influence it exerts on nodes that are not directly connected to it. Liu et al. [40] have pointed out that the importance of a node is not only determined by its direct neighbors, but also depends on its multi-step neighbor information. According to their experimental results, for the sake of balancing accuracy and efficiency, considering two-step neighbor information is the best choice. Based on the one-hop neighborhood degree and clustering coefficient, we propose the two-hop degree and two-hop clustering coefficient as indirect influence measures. Wang et al. [19] have also utilized two-hop degree information in their research. However, in their two-hop degree formula, they directly sum the degree of a node's one-hop neighbors, which ignores the overlap of different node's neighbors. Therefore, in our two-hop degree formula, duplicate nodes are eliminated and calculated only once. Definition 7. Two-hop degree. The two-hop degree of node i, denoted as thk i , is defined as where NN i is the number of two-hop neighbors of node i, and NS i denotes the number of same neighbors.
Definition 8. Two-hop clustering coefficient. The two-hop clustering coefficient of node i, denoted as thclc i , is defined as where clc j is the clustering coefficient of node j, which is defined in Definition 5; N(i) denotes the neighbors of node i.

The Multiple Local Attributes Weighted Centrality
In this section, we propose a new multiple local attributes weighted centrality based on the above-mentioned direct influence measures and indirect influence measures for identifying influential nodes in a complex network. For the sake of brevity, we will abbreviate the multiple local attributes weighted centrality as LWC. The basic idea of the LWC is to view each node of the network as a decision scheme and each attribute of the node as an index of the decision scheme. Since different indices have different contributions to the influence of a node, we apply the entropy weight method to calculate the weight of each index; then, the influence of a node is the weighted sum of all indices. The steps of the LWC are described as follows.
Step 1: Construct the multi-attribute decision-making matrix of the node influence. For a complex network with n nodes and the node set V = {v 1 , v 2 , . . . , v n }, the decision scheme can be expressed as D = {d 1 , d 2 , . . . , d n }. In this paper, we adopt the above mentioned four local attributes, degree, two-hop degree, clustering coefficient, and two-hop clustering coefficient, as influence evaluation indices, so the evaluation indices can be expressed as A = {a 1 , a 2 , a 3 , a 4 } = {k, thk, clc, thclc}. Then, d i (a j )(i = 1, 2, . . . , n; j = 1, 2, 3, 4) represents the jth index of node i, and the multi-attribute decision-making matrix P can be obtained as follows.
Step 2: Normalize the multi-attribute decision-making matrix. Due to each index having a different dimension, the matrix P should be standardized into a normalized matrix. First, set d ij = d i (a j )/max d 1 (a j ), d 2 (a j ), . . . , d n (a j ) , then normalize the indices Step 3: Calculate the information entropy of each evaluation index. According to Formula (8), the information entropy of index j can be calculated by Step 4: Determine the weight of each evaluation index. According to Formula (9), the weight of index j can be calculated by Entropy 2022, 24, 293 6 of 13 Step 5: Calculate the influence of each node. According to the indices normalized in Step 2 and the weight obtained in Step 4, the LWC value of node i can be calculated by Step 6: Rank the value of the LWC i in descending order to obtain the node influence ranking list.
Next, the time complexity of the LWC is analyzed.
Step 1 needs to calculate degree k, clustering coefficient clc, two-hop degree thk, and two-hop clustering coefficient thclc.
Four real-world networks selected from different fields are used in our experiments, which consist of sparse graphs and dense graphs: (1) football; (2) netscience; (3) email; and (4) power. The datasets of the football, netscience, and power networks are provided on http://www-personal.umich.edu/~mejn/netdata/ (accessed on 8 February 2022), and the dataset of the email network is provided on https://networkrepository.com/email-univ.php (accessed on 8 February 2022). The characteristics of these four networks are listed in Table 1, including node number |V|, edge number |E|, average degree k , max degree k max , and clustering coefficient clc ave . We conducted the experiments on a computer with an Intel Core i5 2.4 GHz processor and 8 GB RAM. All the codes were implemented in Python 3.9. Table 2 shows the average execution time of 100 runs for each method on the four real-world networks.

Discrimination Capability Analysis
Firstly, we investigate the discriminability of the ranking results obtained by different methods. The monotonicity metric is used for this purpose. The monotonicity metric M [44] is defined as follows: where S represents the ranking list of network nodes, |V| s is the number of nodes with the same rank s of the ranking list S, and |V| represents the total number of nodes of a network. The value of M(S) ∈ [0, 1]. The higher value of M, the better the discrimination capacity the ranking list has. For example, if each node has a unique rank, and the value of |V| s is 1, then M(S) = 1. In this case, the ranking list has a perfect discrimination capacity. Table 3 shows the monotonicity values of different methods on four real-world networks. In Table 3, the column M(X) refers to the monotonicity metric value calculated based on the ranking list obtained by corresponding method X on each network dataset. Bold numbers represent values that can be achieved by the method with the best monotonicity on each network. The results show that the proposed LWC has the most significant monotonicity values on the football, netscience, and email networks. On the power network, the monotonicity performance of the LWC is inferior only to the CC method. These results indicate that the proposed LWC method has excellent monotonicity capacity. Secondly, in order to obtain a more precise specification of ranking distribution by different methods on the four real-world networks, the complementary cumulative distribution function (CCDF) has been plotted in addition to the monotonicity metric. Figure 1 shows the CCDF plots of different methods on the football, netscience, email, and power networks. The slower slope of the CCDF means the more distinct ranks assigned and the better performance of the ranking distribution. It can be observed in Figure 1 that the CCDF plot of the LWC decreases slowly, yielding the best monotonicity and distribution performance on the football, netscience, and email networks. For the power network, as global information methods, the BC and CC methods have a better ranking distribution performance than the LWC method. However, according to Table 2, the execution time of BC and CC are 115.32 s and 38.92 s, respectively, which is much more than the 1.39 s of the LWC method. Furthermore, if focused on the top 1000 ranked nodes, the LWC method has a similar discrimination performance to the BC and CC methods. Therefore, considering both time cost and discrimination performance, the LWC is a competitive method compared with degree, BC, CC, LocalC, and CLD. performance on the football, netscience, and email networks. For the power network, as global information methods, the BC and CC methods have a better ranking distribution performance than the LWC method. However, according to Table 2, the execution time of BC and CC are 115.32 s and 38.92 s, respectively, which is much more than the 1.39 s of the LWC method. Furthermore, if focused on the top 1000 ranked nodes, the LWC method has a similar discrimination performance to the BC and CC methods. Therefore, considering both time cost and discrimination performance, the LWC is a competitive method compared with degree, BC, CC, LocalC, and CLD.

Accuracy Analysis
This section investigates the accuracy of the performance of the above mentioned methods in evaluating the influence of nodes on the four real-world networks. To this end, the susceptible-infected-recovered (SIR) model is adopted to simulate the spreading influence of each node and extract the ground-truth ranking of node influence. Based on the ground-truth influence ranking, we use the Kendall coefficient, imprecision function, and Jaccard coefficient to evaluate the accuracy of different methods. Firstly, we correlate it with the ranking measure produced by each algorithm. Then, we focus on the top influential nodes to calculate the imprecision function and the Jaccard similarity coefficient.
Researchers have proposed a large number of methods to simulate the spreading process in social networks [45]. The SIR model is one of the most popular spreading models, which has been frequently used to identify influential nodes in complex networks [46]. In the SIR model, each node has three states: susceptible, infected, or removed. To simulate the influence range of node i v , we first set the state of node i v as infected, while all

Accuracy Analysis
This section investigates the accuracy of the performance of the above mentioned methods in evaluating the influence of nodes on the four real-world networks. To this end, the susceptible-infected-recovered (SIR) model is adopted to simulate the spreading influence of each node and extract the ground-truth ranking of node influence. Based on the ground-truth influence ranking, we use the Kendall coefficient, imprecision function, and Jaccard coefficient to evaluate the accuracy of different methods. Firstly, we correlate it with the ranking measure produced by each algorithm. Then, we focus on the top influential nodes to calculate the imprecision function and the Jaccard similarity coefficient.
Researchers have proposed a large number of methods to simulate the spreading process in social networks [45]. The SIR model is one of the most popular spreading models, which has been frequently used to identify influential nodes in complex networks [46]. In the SIR model, each node has three states: susceptible, infected, or removed. To simulate the influence range of node v i , we first set the state of node v i as infected, while all other nodes in the network are in a susceptible state. At each timestamp, the infected node tries to infect its neighbors, whose states are susceptible with the probability β, and then the infected node recovers with the probability γ. If a node is in a recovered state, this node is immune and cannot be infected again. Without loss of generality, the probability γ is set to 1. These infecting and recovering processes continue until a steady state is obtained. The simulation experiments are performed 100 times, and the average over these simulations is defined as a node's influence. Finally, the ranking list based on the simulation results is marked as σ for the following accuracy analysis.
Firstly, Kendall's τ correlation coefficient [47] is used to quantify the correlation between ranking list R obtained by each method and the ground-truth ranking list σ obtained by the SIR model. The τ coefficient is calculated as follows where sgn(x) refers to the sign function. N is the total number of nodes of the network, i.e., the length of the ranking list. x i and x j represent the ranks corresponding to node i and node j in the ranking list σ. Similarly, y i and y j represent the rank corresponding to node i and node j in ranking list R. Table 4 shows the τ correlation coefficient between the ground-truth influence ranking list σ and one produced by different ranking methods. As seen from Table 4, the correlation performance of the LWC is the best on four networks compared with the other five methods. Since the τ coefficient considers all node pairs in the network, including a mess of insignificant nodes, the value of τ is usually not very large with the increase in the number of nodes. In practice, people are only interested in important nodes in the network. Therefore, in the next experiment, the imprecision function is adopted to quantify the accuracy performance of the most influential nodes. The imprecision function [40] is defined as (19) where p is the ratio of network nodes we will investigate, p ∈ [0, 1]. M(p) is the average spreading influence of top pN nodes in ranking list R, and Me f f (p) is the average spreading influence of top pN nodes in ranking list σ. The smaller the value of ε(p), the more accurate the method is to identify the most influential top pN nodes. Figure 2 shows the imprecision function results of different methods on the four networks. As we only focus on the most influential nodes, the value of p is set at a range from 0.01 to 0.1. It can be observed that, in the netscience and power networks, the LWC always outperforms the other methods. In the football network, only when p = 0.07 and 0.08 is the performance of the LWC inferior to the degree method but better than the other four methods. In the email network, the performance of the LWC is better than the other five methods, except when p = 0.09 and 0.1. Therefore, on the whole, the LWC method is more accurate than the other five methods.
The third metric for evaluating the accuracy of the proposed method is the Jaccard similarity coefficient [44]. Here, we consider the top T nodes of the ranking list σ and ranking list R. The Jaccard similarity coefficient J is obtained as where σ(T) and R(T) are sets containing the top T ranked nodes of ranking list σ and ranking list R, respectively. Figure 3 shows the Jaccard coefficient of top T influential nodes on the football, netscience, email, and power networks. The value of T is set as 10, 20, 100, and 100, respectively, on the four networks. As can be seen, the LWC is again the top performer in the datasets compared with the other five methods. The third metric for evaluating the accuracy of the proposed method is the Jaccard similarity coefficient [44]. Here, we consider the top T nodes of the ranking list σ and ranking list R. The Jaccard similarity coefficient J is obtained as

(T) R(T) J(T) = (T) R(T)
where (T) σ and R(T) are sets containing the top T ranked nodes of ranking list σ and ranking list R, respectively. Figure 3 shows the Jaccard coefficient of top T influential nodes on the football, netscience, email, and power networks. The value of T is set as 10, 20, 100, and 100, respectively, on the four networks. As can be seen, the LWC is again the top performer in the datasets compared with the other five methods.

Conclusions
Identifying influential nodes in a complex network is a fundamental question that has many promising applications in various fields, such as identifying super-spreaders or opinion leaders in social networks, carrying out viral marketing, controlling disease propagation, stopping the spread of rumors, accelerating information propagation in a network, and so on. However, with the booming of large-scale networks, methods based on

Conclusions
Identifying influential nodes in a complex network is a fundamental question that has many promising applications in various fields, such as identifying super-spreaders or opinion leaders in social networks, carrying out viral marketing, controlling disease propagation, stopping the spread of rumors, accelerating information propagation in a network, and so on. However, with the booming of large-scale networks, methods based on a global attribute with high time complexity have become inapplicable. In addition, some studies have pointed out that methods considering multiple attributes can achieve better performance than methods that consider only a single attribute. Therefore, this paper proposed a low time complexity method based on four local attributes, i.e., the multiple local attributes weighted centrality (LWC), to evaluate and identify node influence in complex networks. In the LWC method, we improved the calculation formula of the two-hop degree; proposed a two-hop clustering coefficient; applied the information entropy to determine the weight of the degree, the clustering coefficient, the two-hop degree, and the two-hop clustering coefficient; and then calculated the weighted sum of these four local attributes. Compared with the degree, the betweenness centrality, the closeness centrality, the local centrality, the lustering coefficient, and the clustered local-degree methods, the experiment results on four real-world networks have shown that the proposed LWC can identify influential nodes effectively and accurately. However, there are still some challenges that need to be further overcome to improve our work. Firstly, the performance of our proposed method is better than the comparison method, but it is not very good and still has great potential for improvement. Secondly, this paper only considers the node influence in static networks. However, complex networks, in reality, sometimes change dynamically. Therefore, in future research, we will continue to try to improve the performance of the algorithm and explore dynamic network characteristics.