Identifying Influential Nodes in Complex Networks Based on Node Itself and Neighbor Layer Information

Abstract: Identifying influential nodes in complex networks is of great significance for clearly understanding network structure and maintaining network stability. Researchers have proposed many classical methods to evaluate the propagation impact of nodes, but there is still some room for improvement in the identification accuracy. Degree centrality is widely used because of its simplicity and convenience, but it has certain limitations. We divide the nodes into neighbor layers according to the distance between the surrounding nodes and the measured node. Considering that the node's neighbor layer information directly affects the identification result, we propose a new node influence identification method by combining degree centrality information about itself and neighbor layer nodes. This method first superimposes the degree centrality of the node itself with neighbor layer nodes to quantify the effect of neighbor nodes, and then takes the nearest neighborhood several times to characterize node influence. In order to evaluate the efficiency of the proposed method, the susceptible–infected–recovered (SIR) model was used to simulate the propagation process of nodes on multiple real networks. These networks are unweighted and undirected networks, and the adjacency matrix of these networks is symmetric. Comparing the calculation results of each method with the results obtained by SIR model, the experimental results show that the proposed method is more effective in determining the node influence than seven other identification methods.


Introduction
The research direction of node influence in complex networks has attracted wide attention in recent years. The main research focus is to rank the influence of nodes to evaluate their importance. In fact, many systems in real life can be considered as complex network systems [1], such as animal networks [2], traffic networks [3], social networks [4], and economic networks. For example, identifying critical nodes in social networks and monitoring them can prevent the mass spread of infectious diseases. Identifying effective drug targets in cellular network systems allows for faster drug action in the body [5]. Numerous studies [6][7][8][9][10][11][12][13] have shown the significant theoretical and practical importance of identifying influential nodes that have a role in network structure and function. In the network, there are often a small number of nodes that support the entire network architecture, and the accurate identification of these nodes helps to maintain the stability of the network and ensure the normal operation of network functions.
Researchers have proposed many classical methods for identifying influential nodes, including degree centrality [6], betweenness centrality [7], closeness centrality [8], eigenvector centrality [9], bridging centrality [10], LeaderRank [11], k-shell decomposition [12], and H-index [13]. Among these methods, the most widely used are degree centrality and k-shell decomposition. Degree centrality is simple and intuitive because it is expressed directly as the number of connected nodes, and it can be easily applied to large-scale networks. However, degree centrality and k-shell decomposition also have limitations. Degree centrality considers only the most local information of a node, while k-shell decomposition assigns many nodes to the same layer; both tend to produce rankings that are too coarse-grained. Researchers have made many efforts to address this coarse-graining problem. Joonhyun et al. [14] proposed an improved coreness method that introduces information on neighborhoods to define node influence. Ahmed et al. [15] proposed a method to identify the influence of nodes according to the circular area density formula by taking the node's own degree centrality and the shortest path distance between two nodes as mass and radius, respectively. Li et al. [16] used the gravity model to propose a method that uses neighborhood information and path information to measure the influence of nodes in the spreading process. Li et al. [17] considered the role of the clustering coefficient, combining it with the sum of the degree centrality of the nearest neighbor nodes to quantify node influence. Sheng et al. [18] combined the global and local structural characteristics of the network, with global information reflecting a node's proximity to the other nodes in the network and local information reflecting the contribution of the nearest neighbor nodes to the measured node. Yang et al. [19] first proposed an improved k-shell decomposition method based on the k-shell value and the number of iterations of node removal in the k-shell decomposition, and then combined the improved method with degree centrality and the shortest path length to characterize the node influence. Zareie et al. [20] proposed an improved clustering ranking approach, which takes the common hierarchical structure of nodes and their neighborhood set into account. Yan et al. [21] proposed a new method that considers the local topological characteristics of nodes, the central position of nodes, and the effect of neighbor nodes.
The above methods address the problem of insufficient differentiation in node identification from different perspectives, but there is still room for improvement in identification accuracy. To improve this accuracy, we propose a new method of identifying node influence by combining information on the node itself and neighbor layers. The method first superimposes the degree centrality of the node itself and of the neighborhood nodes within a certain range, and then repeatedly aggregates nearest neighbor information to refine the result. To verify the effectiveness of the proposed method, this paper uses the SIR model [22,23] to perform 1000 independent simulations on multiple real networks to obtain the information dissemination ability of nodes. The proposed method is compared with degree centrality, betweenness centrality, closeness centrality, area density centrality [15], the GM method [16], the CLD method [17], and the GLI method [19] in terms of discrimination and accuracy. The experimental results show that the proposed method can distinguish the influence among nodes and improve the identification accuracy.
This section clarified the background of the study and the current research progress in the field. The rest of the paper is framed as follows: Section 2 explains the definition of comparison methods and the detailed idea of the proposed method based on information on the node itself and neighbor layers. The datasets used in this paper and the evaluation criteria for the experiments are given in Section 3. In Section 4, we present verification experiments in terms of the discrimination and accuracy comparing the proposed method with others. The final summary of this paper is given in Section 5.

Materials and Methods
An unweighted and undirected network is represented by $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the set of n nodes, and $E = \{e_1, e_2, \ldots, e_m\}$ represents the set of m edges. We can use the adjacency matrix $A = (a_{ij})_{n \times n}$ to represent the structural characteristics of a complex network, where the adjacency matrix is symmetric. The element $a_{ij}$ in the matrix represents the edge information between nodes i and j. If $a_{ij} = 1$, there is a connection between the two nodes; otherwise, there is no connection. Hereafter, $k_i$ represents the degree centrality of node i.
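As a minimal sketch of this representation, the adjacency matrix and the degrees k_i can be built from an edge list as follows. The helper names and the four-node edge list are illustrative assumptions, not part of the paper.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected, unweighted graph."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1  # undirected: mirror every edge so that A stays symmetric
    return a


def degrees(a):
    """The degree k_i is the sum of row i of the adjacency matrix."""
    return [sum(row) for row in a]


# Hypothetical toy graph with nodes 0..3 (0-indexed), used only for illustration.
edges = [(0, 1), (1, 2), (1, 3)]
A = adjacency_matrix(4, edges)
k = degrees(A)
```

Because every edge is written in both directions, the matrix equals its own transpose, which is the symmetry property the text relies on.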

Node Influence Based on Node and Neighbor Layer Information
Degree centrality represents the most local information of nodes in the network. It reflects the importance of a node according to the node itself and ignores the influence of the surrounding nodes. Numerous studies have shown that the neighbor layer information of a node has an indispensable effect on node influence. During the spreading process, a node spreads information to its neighbor nodes through their connections in sequence until the information can no longer be transmitted. The surrounding environment of a node therefore directly affects its spreading ability. In order to explore the role of neighbor layer information in node spreading, this paper proposes a node influence identification method based on the node itself and neighbor layer information, called the NINL method. The main idea of this method is to consider the local information within a certain range of a node. First, a radius is defined according to the average path length of the network, and the main consideration is the influence of the set of nodes within this radius. Then, the information of the node itself is combined with the information of the nodes within this range, and the initial local information is defined as
$$NINL_0(i) = k_i + \sum_{j \in \Gamma_i^{\lceil L \rceil}} k_j \quad (1)$$
where $\Gamma_i^{\lceil L \rceil}$ represents the set of nodes within distance $\lceil L \rceil$ of node i (excluding i itself), L refers to the average path length of the network, and $\lceil x \rceil$ refers to the smallest integer not smaller than x.
To reflect the influence of nearest neighbor information more comprehensively, Equation (1) is extended to define the influence of nodes by using recursion, and the recursive formula is shown below.
$$NINL_p(i) = \sum_{j \in \Gamma_i} NINL_{p-1}(j) \quad (2)$$
where p refers to the number of iterations, and $\Gamma_i$ represents the set of nearest neighbors of node i. The value of p can be chosen empirically by testing on a set of networks. In this paper, we chose p = 3, for reasons discussed in the "Selection of p-Value" section. We use an example network below to illustrate the calculation process in detail.
Then, according to $\lceil L \rceil = 3$, we can get that the first-level neighbor node set of node 1 is {3}, the second-level neighbor node set is {4, 5, 6}, and the third-level neighbor node set is {2, 7, 8, 9}. To sum up, the set of neighbor nodes within the three layers is {2, 3, 4, 5, 6, 7, 8, 9}; thus, $NINL_0(1) = k_1 + (k_2 + k_3 + k_4 + k_5 + k_6 + k_7 + k_8 + k_9) = 1 + (1 + 4 + 6 + 4 + 3 + 1 + 4 + 5) = 1 + 28 = 29$. Table 1 shows the calculation results of the node influence of $NINL_0$ to $NINL_3$ for the example network. From the table, we can see that $NINL_0$ defines nodes 2, 3, 5, 6, 7, 10, 11, and 12 as the same value, and nodes 4, 8, and 9 as the same value. After getting the nearest neighbor node information for the first time, $NINL_1$ defines nodes 1 and 13 as the same value, nodes 2 and 7 as the same value, nodes 5 and 8 as the same value, and nodes 10 and 11 as the same value. $NINL_2$ also distinguishes the influence of nodes 1 and 13. The influence of nodes 5 and 8 was also identified. As for nodes 2 and 7, they could not be distinguished because they had the same information about themselves and their neighbors as nodes 10 and 11. At the same time, we can observe that nodes 2 and 7 were locally symmetrical, just like nodes 10 and 11. The above phenomenon shows that the proposed NINL method can effectively distinguish the influence of the nodes in the sample network.
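The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is connected with at least two nodes, it is stored as a dictionary of neighbor lists, and the function names and the four-node path graph are hypothetical (the paper's 13-node example network is not reproduced here). The initial score adds the node's own degree to the degrees of all nodes within ceil(L) hops, and each iteration replaces a node's score with the sum of its nearest neighbors' scores.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def ninl(adj, p=3):
    """Return NINL_p for every node of a connected graph given as {node: [neighbors]}."""
    dist = {u: bfs_distances(adj, u) for u in adj}
    # Average path length L over all ordered pairs of distinct nodes.
    total = sum(d for u in adj for v, d in dist[u].items() if v != u)
    pairs = sum(len(dist[u]) - 1 for u in adj)
    radius = math.ceil(total / pairs)
    degree = {u: len(adj[u]) for u in adj}
    # NINL_0: own degree plus the degrees of all nodes within ceil(L) hops.
    score = {
        u: degree[u] + sum(degree[v] for v, d in dist[u].items() if 0 < d <= radius)
        for u in adj
    }
    # NINL_p: sum the previous scores of the nearest neighbors, repeated p times.
    for _ in range(p):
        score = {u: sum(score[v] for v in adj[u]) for u in adj}
    return score


# Hypothetical path graph 0-1-2-3; its average path length is 10/6, so ceil(L) = 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ninl0 = ninl(path, p=0)
ninl1 = ninl(path, p=1)
```

On this toy graph the two end nodes receive the same score and the two inner nodes receive the same score, which mirrors the locally symmetric node pairs discussed for the example network.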
The specific computational flow of Algorithm 1 is shown below.

Algorithm 1. The NINL p Method
Input: the network G = (V, E)
Output: node influence of NINL_p centrality
for i = 1 to |V|
    for j = 1 to |V|
        calculate the shortest path length between node i and node j
    end for
end for
calculate the average path length L
for i = 1 to |V|
    calculate the degree centrality of node i
end for
for i = 1 to |V|
    find the neighbor nodes within a radius of ceil(L) from node i
    calculate NINL_0 of node i according to Equation (1)

Degree Centrality
Degree centrality [6] in a complex network reflects the node's connection information, which can be expressed by the number of nearest neighbor nodes. A greater value of the degree denotes a greater influence of the node. The degree centrality of node i can be defined as
$$k_i = \sum_{j=1}^{n} a_{ij}$$
In addition, degree centrality can be normalized to
$$DC(i) = \frac{k_i}{n - 1}$$

Betweenness Centrality
Betweenness centrality [7] refers to the ratio of the number of shortest paths through a node to the number of shortest paths for all pairs of nodes in the network. The betweenness centrality of node i can be expressed as
$$BC(i) = \sum_{s \neq i \neq t} \frac{g_{st}^{i}}{g_{st}}$$
where $g_{st}^{i}$ represents the number of shortest paths between nodes s and t that pass through node i, and $g_{st}$ refers to the total number of shortest paths between nodes s and t.

Closeness Centrality
Closeness centrality [8] refers to the reciprocal of the sum of the shortest path lengths between a node and other nodes. This method takes the global information of the network into account. The formula is as follows:
$$CC(i) = \frac{1}{\sum_{j \neq i} d_{ij}}$$
where $d_{ij}$ refers to the shortest path length between node i and node j.
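A minimal sketch of this definition, taking closeness literally as the reciprocal of the sum of shortest path lengths, is given below. It assumes a connected, unweighted graph stored as a dictionary of neighbor lists; the function names and the path graph are illustrative assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj):
    """CC(i) = 1 / sum_j d_ij, assuming a connected graph with at least two nodes."""
    result = {}
    for i in adj:
        dist = bfs_distances(adj, i)
        result[i] = 1.0 / sum(d for j, d in dist.items() if j != i)
    return result


# Hypothetical path graph 0-1-2-3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cc = closeness(path)
```

The inner nodes of the path are closer to everything else than the end nodes, so they receive the larger closeness values.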

Density Centrality
The principle of density centrality [15] is the circular area density formula. The degree centrality of the node and the path length between two nodes are used as the mass and radius in the density formula, which highlights the influence of the number of neighbor nodes. The density centrality of node i can be defined as
$$DNC(i) = \sum_{j \in \Gamma_i^{r}} \frac{k_i}{\pi d_{ij}^{2}}$$
where $\Gamma_i^{r}$ refers to the set of nodes whose shortest path length from node i is less than or equal to r; r = 3 was applied in the original paper.

Gravity Model
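The gravity model [16] is derived from the law of gravity. It considers the relationship between node neighbor layer information and path length. The degree centrality is regarded as the mass, and the shortest path length is taken as the distance. The influence of node i is expressed as
$$GM(i) = \sum_{j \in \Gamma_i^{r}} \frac{k_i k_j}{d_{ij}^{2}}$$
where $\Gamma_i^{r}$ refers to the set of nodes whose shortest path length from node i is less than or equal to r.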
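A minimal sketch of a gravity-style score follows. It assumes the usual form of the model, in which node i accumulates k_i k_j / d_ij^2 over all nodes j within r hops; the truncation radius r = 3, the function names, and the path graph are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def gravity(adj, r=3):
    """Sum k_i * k_j / d_ij^2 over all nodes j with 0 < d_ij <= r (r is an assumed default)."""
    degree = {i: len(adj[i]) for i in adj}
    scores = {}
    for i in adj:
        dist = bfs_distances(adj, i)
        scores[i] = sum(
            degree[i] * degree[j] / d ** 2 for j, d in dist.items() if 0 < d <= r
        )
    return scores


# Hypothetical path graph 0-1-2-3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
gm = gravity(path)
```

Dividing by the squared distance means that distant nodes still contribute, but far less than direct neighbors of the same degree.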

Clustered Local-Degree (CLD) Method
The clustered local-degree method [17] first obtains the sum of the degree centrality of the nearest neighbor nodes, and then links the obtained result with the clustering coefficient of the node to identify the influence of nodes, as expressed below.
$$CLD(i) = (1 + C_i) \sum_{j \in \Gamma_i} k_j$$
where $C_i$ represents the clustering coefficient of node i, and $\Gamma_i$ represents the set of nearest neighbors of node i.
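The sketch below illustrates one way to combine the two ingredients named above. It assumes the combination takes the form (1 + C_i) multiplied by the sum of the neighbors' degrees, and it uses the standard local clustering coefficient (the fraction of neighbor pairs that are themselves connected). The function names and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering(adj, i):
    """Local clustering coefficient: fraction of neighbor pairs of i that are linked."""
    k = len(adj[i])
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair can form a triangle
    links = sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def cld(adj):
    """Assumed form: CLD(i) = (1 + C_i) * sum of the degrees of i's nearest neighbors."""
    return {
        i: (1.0 + clustering(adj, i)) * sum(len(adj[j]) for j in adj[i])
        for i in adj
    }


# Hypothetical graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
scores = cld(graph)
```

Neighbor sets are stored as Python sets so that the membership test inside the clustering coefficient stays cheap.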

GLI Method
The GLI method [19] contains both global location information and local structure information. First, considering that the k-shell decomposition method divides many nodes into the same layer, an improved k-shell decomposition method (Iks) was defined as follows: where ks(i) refers to the level of node i after k-shell decomposition, and nit(i) refers to the number of iterations when the node is deleted during the iteration. Then, researchers proposed a new node influence identification method named the GLI method by linking the obtained improved method with degree centrality and path information, expressed as follows: In this case, the neighbors within three hops of the node are considered, i.e., r = 3.

Datasets
In order to verify the effectiveness of the method proposed in this paper, several real networks with different structures were selected for experiments. The real networks used in this article were the Contiguous network [24], Dolphin network [25], Polbooks network [26], Word network [27], Jazz network [28], Slavko Facebook network [26], USAir network [29], Netscience network [27], Infectious network [30], and Email network [31]. The datasets can also be found at https://github.com/Ismileo/Datasets, accessed on 19 July 2021 [32]. Table 2 shows the basic topological properties of the networks. In the table, n and m denote the number of nodes in the network and the number of connected edges, respectively. k max refers to the maximum degree, <k> refers to the average degree of all network nodes, D is the network diameter, L is the average path length, C refers to the average clustering coefficient, and r refers to the assortative coefficient of the network [33].
The SIR model, which is widely used in the field of infectious diseases, can also describe the process of information dissemination. In this paper, the SIR model was used to simulate the node infection process to obtain the influence of nodes. Nodes in the SIR spreading model can exist in the following three states: (i) susceptible (S), (ii) infected (I), and (iii) recovered (R). The susceptible state means that the node is not currently infected but can be infected, the infected state means that the node has been infected and can infect other nodes with a certain probability, and the recovered state means that the node can neither infect other nodes nor be infected by them. The specific process is to first set one node to the infected state and all other nodes to the susceptible state. Each infected node then infects its susceptible neighbors with a spreading probability β, and each infected node is converted to the recovered state with a recovery probability λ = 1. The process continues until it reaches a stable state. In this paper, the spreading probability β was set to a value around the spreading threshold [34], and the total number of nodes infected during the spreading started from a node was considered as the influence of that node. We carried out 1000 independent simulations, and the average value was used to represent the node influence according to the SIR model. The proposed method and the other methods were compared on the basis of the ranking results obtained from the SIR spreading model. A ranking closer to that of the SIR model denotes a higher accuracy.
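The simulation procedure described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes synchronous discrete time steps, counts the seed node in the outbreak size, and uses a fixed random seed for reproducibility. The function name, its parameters, and the toy graph are hypothetical.

```python
import random


def sir_influence(adj, seed_node, beta, runs=1000, rng=None):
    """Average outbreak size started from seed_node, with recovery probability 1."""
    rng = rng or random.Random(0)  # fixed seed: an assumption made for reproducibility
    total = 0
    for _ in range(runs):
        infected = {seed_node}
        recovered = set()
        while infected:
            newly = set()
            for u in infected:
                for v in adj[u]:
                    susceptible = (
                        v not in infected and v not in recovered and v not in newly
                    )
                    # Each infected node infects a susceptible neighbor with probability beta.
                    if susceptible and rng.random() < beta:
                        newly.add(v)
            recovered |= infected  # lambda = 1: infected nodes recover after one step
            infected = newly
        total += len(recovered)  # outbreak size, including the seed node (assumption)
    return total / runs


# Hypothetical path graph 0-1-2-3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
full_spread = sir_influence(path, 0, beta=1.0, runs=10)
no_spread = sir_influence(path, 0, beta=0.0, runs=10)
```

Repeating the call for every node and sorting the averages gives the reference ranking against which a centrality measure can be compared.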

CCDF Method
In the process of obtaining node influence, multiple nodes may receive the same value. The complementary cumulative distribution function (CCDF) can show the probability distribution of the sorting results, and the effect of each method can be observed through the change trend. The specific formula is as follows:
$$CCDF(r) = \frac{n - \sum_{i=1}^{r} n_i}{n}$$
where $n_i$ refers to the number of nodes that occupy the i-th place in a ranking list, and n refers to the number of all nodes. When the number of different values in a ranking list is closer to n, the method distinguishes the influence of each node more effectively, and the CCDF decreases more slowly.
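A minimal sketch of the CCDF computation is shown below. It assumes that nodes sharing the same score share one rank, that n_i is the number of nodes at rank i, and that CCDF(r) is the fraction of nodes ranked below rank r. The function name and the sample scores are illustrative.

```python
from collections import Counter


def ccdf(scores):
    """Return CCDF(r) for r = 1, 2, ... where rank r is the r-th largest distinct score."""
    n = len(scores)
    counts = Counter(scores)  # number of nodes sharing each distinct score
    remaining = n
    curve = []
    for value in sorted(counts, reverse=True):
        remaining -= counts[value]  # remove the nodes tied at this rank
        curve.append(remaining / n)
    return curve


# Hypothetical scores: a tie at the top versus a fully distinct ranking.
tied = ccdf([3, 3, 2, 1])
distinct = ccdf([4, 3, 2, 1])
```

With ties the curve drops in larger steps and reaches zero after fewer ranks, whereas a fully distinct ranking descends in n equal steps, which is the near-straight line described in the text.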

Kendall Correlation Coefficient
The Kendall correlation coefficient [35] is usually used to evaluate the correlation of two sorting results, with a value range of [−1, 1]; a value of 1 means that the two groups have the same sorting result, a value of −1 means that the two sets of numbers are completely negatively correlated, and a value of 0 means that the sorting results are independent of each other. Suppose there exist two sequences X and Y with n elements, whereby the sequence $XY_i = (x_i, y_i)$ is formed by taking the elements at the same position in the sequences X and Y. For any two elements $XY_i$ and $XY_j$ in the newly composed sequence, there are three cases: (i) if $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$, the pair is concordant; (ii) if $x_i > x_j$ and $y_i < y_j$, or $x_i < x_j$ and $y_i > y_j$, the pair is discordant; (iii) if neither condition is met, the pair is neither concordant nor discordant. The Kendall correlation coefficient can be expressed by the following formula:
$$\tau = \frac{2(C - D)}{n(n - 1)}$$
where C refers to the number of concordant pairs, and D is the number of discordant pairs. It is worth noting that, in addition to the Kendall correlation coefficient, there are the Pearson correlation coefficient, Spearman's rank correlation coefficient, and Gamma correlation coefficient [36].
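The pair-counting definition above translates directly into code. The sketch below is a straightforward quadratic-time version in which tied pairs count as neither concordant nor discordant; the function name and the sample sequences are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall coefficient: 2 * (C - D) / (n * (n - 1)) over all pairs of positions."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both sequences order the pair the same way
        elif sign < 0:
            discordant += 1  # the sequences order the pair in opposite ways
        # sign == 0: a tie in x or y, counted as neither
    return 2.0 * (concordant - discordant) / (n * (n - 1))


same = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
reverse = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
partial = kendall_tau([1, 2, 3], [1, 3, 2])
```

In the experiments, x would hold the scores from a centrality method and y the average outbreak sizes from the SIR model, taken over the same nodes in the same order.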

Jaccard Similarity Coefficient
Jaccard similarity [37] can be used to evaluate the similarity of two methods in their sorting results, expressed as the ratio of the number of nodes in the intersection to the number of nodes in the union. The specific formula is
$$J_r = \frac{|X(r) \cap Y(r)|}{|X(r) \cup Y(r)|}$$
where X(r) and Y(r) refer to the first r elements in the two lists X and Y, respectively. A value of $J_r$ closer to 1 denotes that the two lists are more similar, which can be used to evaluate the identification accuracy for a given proportion of top-ranked nodes.
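A minimal sketch of the top-r Jaccard comparison follows. It assumes each method's result is stored as a dictionary from node to score and that r is at least 1; the function names and the sample scores are hypothetical.

```python
def top_r(scores, r):
    """Set of the r highest-scoring nodes (ties are broken by dictionary order)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:r])


def jaccard_top_r(scores_x, scores_y, r):
    """Ratio of the intersection size to the union size of the two top-r sets."""
    x = top_r(scores_x, r)
    y = top_r(scores_y, r)
    return len(x & y) / len(x | y)


# Hypothetical scores from a centrality method and from SIR simulations.
method = {"a": 5, "b": 4, "c": 3, "d": 1}
sir = {"a": 9, "c": 8, "b": 2, "d": 1}
j2 = jaccard_top_r(method, sir, 2)
j3 = jaccard_top_r(method, sir, 3)
```

Here the two top-2 sets share only node "a", while the top-3 sets coincide, which shows why the coefficient is usually plotted over a range of r values.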

Experiment and Analysis
In this section, the CCDF, Kendall correlation coefficient, influence consistency, and Jaccard similarity coefficient were used to evaluate the proposed method. The experimental process and results are presented below.

Discrimination Experiment
The easy accessibility of degree centrality has led to it being widely used. Degree centrality can be expressed only by the number of nearest neighbor nodes. This is its advantage and its disadvantage. The disadvantage is that it is too simple to represent node information using degree centrality, which will cause multiple nodes to be defined with the same influence. The information of the nodes is also influenced by the environmental elements, which leads to the majority of nodes having different influence results. Therefore, the first step of an effective method should be able to distinguish the importance of each node. Only by distinguishing the influence of each node can an effective ranking be carried out. To verify the discernibility of each method on the node, this paper used the CCDF method to determine the discrimination effect of the obtained sorting results, as shown in Figure 2.  Figure 2 shows the change trend of CCDF obtained using each method on the six networks of Contiguous, Dolphin, Jazz, Slavko, Infectious, and Email. The downward trend of CCDF at a certain point is determined by the frequency of the corresponding ranking point. If the number of corresponding ranking nodes is higher, the change trend is more drastic. The closer the CCDF is to a straight line, the better the effect is of distinguishing node influence. From the results in Figure 2, we can see that DC and CC methods declined faster among the six networks compared to other methods. DC represents the most local information of nodes; hence, it is very easy to define multiple nodes as the same value, which also highlights the biggest shortcoming of the DC method. Furthermore, CC considers the global shortest path length, and the result of superposition is likely to have the same value. 
The CLD method has a certain limitation because it only considers the node's own clustering coefficient and the degree centrality of the nearest neighbor nodes, which results in it not being able to distinguish the influence of nodes well. It can also be observed that the BC method can make an effective distinction at an early stage, but there is a sudden decrease at a later stage. The BC method considers the pivotal role of nodes in the network, but there are also several nodes with insignificant pivotal roles in the network. Such nodes will also vary in importance when they are affected by the environment; however, the BC method cannot distinguish this phenomenon. Overall, GM, GLI, and the method proposed in this paper best discriminated the influence of network nodes, basically demonstrating a straight line. It can be concluded that the method proposed in this paper can successfully distinguish the spreading influence of each node.

Selection of p-Value
In order to ensure that the value of p is reasonable, this paper used the Kendall coefficient to carry out simulations on 10 networks, and the results are shown in Figure 3. The x-axis is the value of p, and the y-axis is the Kendall coefficient. To observe the changes more clearly, a partially enlarged subfigure is presented. According to the experimental results, the Kendall coefficient increased with p at the beginning, but stopped increasing after reaching a threshold. In the partially enlarged image, we can see that the Kendall values of all networks except Jazz and Infectious began to decrease after p = 3, whereas those of the Jazz and Infectious networks began to decrease after p = 4. Therefore, this paper selected p = 3 for subsequent experiments.
Table 3 shows the Kendall values obtained by comparing each method with the SIR model on the 10 networks. The spreading probability β in the SIR model was set to a value near the spreading threshold β th , and the influence of a node was taken as the average over 1000 independent SIR simulations. According to the experimental results in Table 3, the BC method was the least accurate in identifying nodes. Comparing the Kendall coefficients of the CLD and DC methods, it can be found that superimposing the information of the nearest neighbor nodes improved the accuracy of node identification. Comparing the GM, GLI, and DC methods, it can be found that combining the node itself with the neighbor layer information identified influential nodes more effectively. From this table, we can see that the NINL method proposed in this paper was more accurate than the other methods, which verifies that accumulating the nearest neighbor information several times after combining node and neighbor layer information reflects the node influence more accurately and achieves effective identification.

Influence Consistency Experiment
The influence consistency experiment was used to show the correlation between each method and the node influence and the correlation between the proposed method and each method. A higher correlation denotes a higher accuracy of node recognition. Figure 4 shows the experimental results of six methods on the Word, Jazz, and USAir networks. The x-axis is the node influence calculated by each method, and the y-axis is the node influence calculated by 1000 independent SIR spreading models. The experimental plots show that all six methods had positive correlations with node influence. Among them, the DNC and GM methods had an obvious upward convex trend, the GLI method had an obvious downward convex trend, and the DC, CLD, and NINL methods had a straight-line trend. The NINL method proposed in this paper showed better linearity in the three networks, and the scattered points were basically maintained around the straight line, suggesting a relatively strong correlation with the influence of nodes. In summary, the method proposed in this paper is more advantageous for discovering influential nodes.
To explore the intrinsic connection between the methods, Figure 5 shows the nodal influence correlation diagram between the proposed method and the other five methods in this paper. Each node in the figure represents the value obtained by a node in the network according to different methods. The x-axis refers to the NINL method values, and the y-axis refers to five method values of DC, DNC, CLD, GM, and GLI. As shown in Figure 5, the NINL method showed a positive correlation with the other five methods, and in the Netscience network, a more divergent scatter plot was obtained. In the Email network, the NINL method and CLD method seemingly presented a straight line. There was no such correlation when using the other methods. By observing the scatter plots of the NINL method and the DC method in the three networks, we can find that the value of DC method was concentrated in a very small interval, highlighting that the weak distinguishing ability of degree centrality for nodes was the main reason for the divergence of the results.

Recognition Effect of Each Method under a Certain Range of Propagation Probability
In order to more comprehensively evaluate the influence of the spreading probability on the ranking accuracy, this paper used the SIR model to obtain the node influence within a certain range of spreading probability, and the effectiveness of the proposed method was verified by the Kendall correlation coefficient. The results are shown in Figure 6. It can be seen from the figure that the BC method had the weakest ability to accurately identify influential nodes on the six networks. When the spreading probability was small, it can be seen that the DC, DNC, and GM methods had higher Kendall correlation coefficients, which was due to the fact that the infected node could only infect a portion of nodes close to it. At this time, the spreading process was limited to a certain local area, and the degree of a node in the local area had a great impact on the spreading influence. As the spreading probability was near the spreading threshold, the NINL method fully considered the relationship between the nearest neighbor and neighbor layer information within a certain range. Compared with other methods, NINL had a higher correlation with node influence, reflecting a better recognition effect. When the spreading probability was increased, we can see that the correlation curve had a significant decline in the USAir network. This is because the clustering coefficient of the network was very high, which led to the connection between nodes becoming closer. Under these circumstances, the information between nodes could be easily transmitted through the closely connected nodes layer by layer. As a result, the recognition accuracy of the NINL method approached that of the other methods. In general, the proposed NINL method could more accurately evaluate the influence of nodes than other methods around the propagation threshold.

Recognition Effect of Each Method under a Certain Percentage of Ranking Results
There are a very small number of nodes in the network responsible for the normal operation of the entire network. It is of great significance to identify the most influential nodes. Table 4 shows the top 10 nodes obtained using each method in the Word and USAir networks. In this table, the Φ value given in the last column of the table is the ranking result calculated by the SIR model under the condition of spreading probability β. By observing the results of each method and the SIR model, it can be found that, in the Word network, the top 10 nodes identified by the DC, DNC, GM, GLI, and NINL methods had nine identical nodes in terms of Φ, whereas eight of the CC and CLD methods were identical. The BC method had only seven identical nodes. In the USAir network, the GM, GLI, and NINL methods identified nine identical nodes in terms of Φ, whereas DC and DNC only identified eight identical nodes, CC identified six identical nodes, and BC and CLD only identified five identical nodes. According to the above analysis, the NINL method could accurately identify the top 10 nodes.
In order to explore the recognition accuracy of the top nodes, the Jaccard similarity coefficient was used as the evaluation standard to carry out related experiments. Figure 7 presents the results of Jaccard similarity experiments on six types of networks. The x-axis represents the range of ranking results considered, and the y-axis represents the Jaccard similarity coefficient. A larger Jaccard coefficient denotes a higher similarity and, thus, a more effective recognition result. It can be seen from Figure 7 that, as the range of the ranking results increased, the Jaccard similarity coefficients obtained by each method became more and more stable. At the same time, the NINL method proposed in this paper clearly showed a superior similarity curve to the other seven methods in the Slavko, Netscience, Infectious, and Email networks. Thus, it can be considered that the NINL method was highly correlated with the influence of nodes and, thus, could more accurately identify the first r nodes.

Conclusions
In this paper, we studied the problem of identifying influential nodes in the network. Identifying influential nodes accurately in the network can provide a clearer understanding of the overall network function implementation process and the information dissemination between nodes. This paper mainly uses the symmetric adjacency matrix of the unweighted and undirected network to obtain various information. The paper first defines a node's own information and neighbor layer nodes within a certain range as the initial node influence, and then takes the nearest neighbor information multiple times as the final influence of nodes. In different networks, the number of times to fetch the nearest neighbor information is different. In order to ensure the rationality of the p-value, a verification experiment of p selection was done. At the same time, this paper used the CCDF curve to perform a discrimination experiment on multiple real networks, while the Kendall coefficient and Jaccard similarity coefficient were used to carry out recognition and accuracy experiments. The method proposed in this paper effectively avoided the phenomenon of most nodes having the same value. Furthermore, it had higher accuracy in identifying the influence of nodes near the propagation threshold. The experimental results show that the proposed method was more effective than other methods in identifying influential nodes, which is significant for understanding the node information dissemination process.