4.1. Data
To demonstrate the validity of the node PE metric for representing the importance of nodes, we evaluated it on twelve real networks from different domains. None of the networks contains self-loops, i.e., edges whose two endpoints are the same vertex. The twelve real networks comprised (i) two human social networks, the Train [
34] and Karate [
35] networks; (ii) a collaboration network, Ca_Sandi_Auth [
36]; (iii) an animal network, Dolphins [
34]; (iv) a DIMACS10 and a bio-c. elegans neural network [
36]; (v) an email network, Email-Enron [
36]; (vi) two miscellaneous networks, PolBooks [
34] and AdjNoun [
36]; (vii) an interaction network, Crime [
34]; (viii) a metabolic network, Yeast [
34]; (ix) a co-authorship network, Netscience [
34]; and (x) an infrastructure network, Uspowergrid [
34]. These networks are publicly available and were downloaded from http://konect.cc/networks/ (accessed on 13 December 2021) and https://networkrepository.com/networks.php (accessed on 13 December 2021).
The topological features of the networks are presented in Table 2.
4.2. Evaluation of the Susceptible–Infected–Removed Model
This section describes the principles and implementation of the susceptible–infected–removed (SIR) model to set the stage for the subsequent experiments. The SIR model [37], developed by Kermack and McKendrick in 1927 while studying the transmission patterns of the Black Death and plague, originates in the study of infectious disease dynamics and is frequently employed to analyze the influence of crucial nodes in complex networks. The model mimics the natural course of disease transmission by dividing the population into three categories: S for susceptible, I for infected, and R for removed (recovered) (Figure 1).
The proportions of individuals in the susceptible, infected, and recovered states change over time as follows:
$$\frac{ds(t)}{dt} = -\beta s(t) i(t), \qquad \frac{di(t)}{dt} = \beta s(t) i(t) - \gamma i(t), \qquad \frac{dr(t)}{dt} = \gamma i(t),$$
where s(t), i(t), and r(t) denote the proportions of nodes in the susceptible, infected, and recovered states at time t, respectively; β is the infection rate; and γ is the recovery rate.
In the SIR model, the proportion of nodes in each state varies with time (Figure 2). When the probability of infection is high, the susceptible nodes eventually become infected over time, and all infected individuals eventually enter the recovered state.
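The time evolution described above can be illustrated numerically. The following is a minimal sketch, assuming the standard Kermack–McKendrick mean-field equations (ds/dt = −βsi, di/dt = βsi − γi, dr/dt = γi) and a simple forward-Euler scheme; the function name `integrate_sir`, the step size, and the parameter values are illustrative choices, not taken from the experiments.

```python
def integrate_sir(beta, gamma, i0=0.01, dt=0.01, steps=10000):
    """Forward-Euler integration of the mean-field SIR equations.

    Returns a list of (s, i, r) proportions, one entry per time step.
    All argument values are illustrative assumptions.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt  # flow from S to I
        new_recoveries = gamma * i * dt     # flow from I to R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# A high infection rate relative to the recovery rate (illustrative values).
trajectory = integrate_sir(beta=0.5, gamma=0.1)
s_end, i_end, r_end = trajectory[-1]
print(f"final proportions: s={s_end:.3f}, i={i_end:.3f}, r={r_end:.3f}")
```

With these values the infected proportion rises, peaks, and decays, while the recovered proportion approaches one, which is the qualitative behavior the paragraph describes.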
We applied the SIR model to simulate the propagation of information between nodes and thereby measure the impact of critical nodes in the complex networks. The propagation process was as follows. First, a single node was infected and all other nodes were susceptible. In each time step, every infected node infected each of its susceptible neighbors with probability β (here, set to the prevalence threshold of the network). Second, each previously infected node entered the recovered state with probability γ (set to 1) [18,38]; recovered nodes could not be reinfected. Propagation was repeated until no infected nodes remained in the network. The propagation capability of node i is denoted Ri, the mean number of recovered nodes at the end of the process over 1000 independent runs, each with node i as the only infected seed. The higher the value of Ri, the greater the propagation capability of node i. Third, the nodes were ranked in descending order of propagation capability to obtain a list of nodes ordered by importance.
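The propagation procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the network is an adjacency dictionary, and the names `sir_spread`, `propagation_capability`, and `sir_ranking`, as well as the toy star network, are hypothetical.

```python
import random


def sir_spread(adj, seed, beta, gamma=1.0, rng=random):
    """One stochastic SIR run from a single seed; returns the number of recovered nodes.

    Assumes gamma > 0, otherwise the loop would never terminate.
    """
    infected = {seed}
    recovered = set()
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # Only susceptible neighbors can be infected.
                if v not in infected and v not in recovered and rng.random() < beta:
                    newly_infected.add(v)
        # Nodes infected at the start of the step recover with probability gamma.
        newly_recovered = {u for u in infected if rng.random() < gamma}
        recovered |= newly_recovered
        infected = (infected - newly_recovered) | newly_infected
    return len(recovered)


def propagation_capability(adj, node, beta, runs=1000, seed=0):
    """Mean final number of recovered nodes over independent runs seeded at `node`."""
    rng = random.Random(seed)
    return sum(sir_spread(adj, node, beta, rng=rng) for _ in range(runs)) / runs


def sir_ranking(adj, beta, runs=1000):
    """Rank nodes in descending order of propagation capability."""
    scores = {v: propagation_capability(adj, v, beta, runs) for v in adj}
    return sorted(adj, key=lambda v: scores[v], reverse=True), scores


# Toy star network (hypothetical): node 0 is the hub, nodes 1..5 are leaves.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
ranking, scores = sir_ranking(star, beta=0.3)
print(ranking[0], scores[0])
```

On the star network the hub is ranked first, since a seed at the hub can reach every leaf in one step, whereas a seed at a leaf must first infect the hub.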
Based on the theory of heterogeneous mean fields [
39], the prevalence threshold of the SIR model was approximated as
where
<k> denotes the mean degree of community
c.
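As an illustration of how such a threshold can be computed from a degree sequence, the sketch below assumes the commonly used heterogeneous mean-field approximation for the SIR model, β_c ≈ ⟨k⟩ / (⟨k²⟩ − ⟨k⟩), where ⟨k⟩ is the mean degree and ⟨k²⟩ the mean squared degree. This form is an assumption made for the example; the function name `hmf_sir_threshold` and the toy network are hypothetical.

```python
def hmf_sir_threshold(adj):
    """Approximate SIR epidemic threshold from the degree sequence.

    Assumes the heterogeneous mean-field form <k> / (<k^2> - <k>).
    Raises ZeroDivisionError if every node has degree 0 or 1.
    """
    degrees = [len(neighbors) for neighbors in adj.values()]
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(k * k for k in degrees) / len(degrees)
    return mean_k / (mean_k2 - mean_k)


# Toy star network (hypothetical): degrees are [5, 1, 1, 1, 1, 1].
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(hmf_sir_threshold(star))
```

For the star network, ⟨k⟩ = 10/6 and ⟨k²⟩ = 5, so the sketch evaluates to (10/6) / (5 − 10/6) = 0.5.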
4.4. Kendall Coefficient (τ)
We used the SIR and SIRS models to produce a ranking list of nodes in descending order of importance based on their propagation ability in the network. The Kendall coefficient, τ [41], was applied to estimate the correlation between the importance ranking list obtained by each importance measure and the reference importance ranking list generated by the SIR model. The higher the τ value, the stronger the correlation between the two ranking lists; the closer τ is to one, the more accurate the ranking result and the more effective the method is in identifying important nodes.
The Kendall coefficient considers a set of paired observations (Xi, Yi) of two random variables, X and Y. Any two pairs (Xi, Yi) and (Xj, Yj) are said to be consistent (concordant) if Xi > Xj and Yi > Yj, or if Xi < Xj and Yi < Yj. They are inconsistent (discordant) if Xi > Xj and Yi < Yj, or if Xi < Xj and Yi > Yj; if Xi = Xj or Yi = Yj, they are neither consistent nor inconsistent. The Kendall coefficient, τ, is calculated as follows:
$$\tau = \frac{2(n_c - n_i)}{n(n - 1)},$$
where nc and ni denote the numbers of consistent and inconsistent pairs, respectively, and n denotes the number of observations (Xi, Yi), so that n(n − 1)/2 is the total number of pairs compared. The Kendall coefficient τ lies in the range [−1, 1]. Ideally, if τ = 1, the ranking list produced by an importance metric is identical to the ranking list produced by the actual propagation process.
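The pair-counting definition above translates directly into code. The following is a minimal sketch, assuming the common normalization by the total number of pairs, n(n − 1)/2, with tied pairs counted as neither consistent nor inconsistent; the function name `kendall_tau` and the example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall coefficient of two equally long score lists (assumes at least two items)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    consistent = inconsistent = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            consistent += 1    # both orderings agree
        elif product < 0:
            inconsistent += 1  # orderings disagree
        # product == 0 is a tie in x or y: counted as neither
    return 2 * (consistent - inconsistent) / (n * (n - 1))


# Hypothetical scores: a metric's node scores versus simulated spreading ability.
metric_scores = [0.9, 0.7, 0.4, 0.1]
spread_scores = [12.0, 8.5, 9.1, 2.3]
print(kendall_tau(metric_scores, spread_scores))
```

In the example, five of the six pairs are consistent and one is inconsistent, so the coefficient is 2(5 − 1)/(4 · 3) = 2/3.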
4.5. Epidemic Models Experiment
This section compares the node PE metric with nine other node importance metrics: K-shell++, DC, BC, CC, EC, PE(N2), H-index, GIN, and LGC. To compare the performance of cn and N2, we write PE for the metric calculated with cn and PE(N2) for the metric calculated with N2. First, the SIR and SIRS models were applied to determine the impact of each node on the dynamic propagation process, yielding a ranking list of nodes by the propagation ability observed in the simulated spreading process. Then, the Kendall coefficient was applied to estimate how closely the ranking given by the node PE metric matched the propagation capability of individual nodes. The performance of the other nine metrics was measured with the Kendall coefficient in the same way.
The correlations between the ranking lists provided by the ten importance measures and the ranking list obtained from the SIR model, for nine real networks in different domains and a varying infection rate β, are depicted in Figure 4.
Figure 4 shows that the curve of the node PE metric lies at the top of each comparison plot, especially near the threshold, where the value of τ for PE is largest, indicating its effectiveness in identifying vital nodes. When 0.1 ≤ β ≤ 0.4, PE obtained larger τ values in the nine real networks, especially in the AdjNoun, Ca_Sandi_Auth, and PolBooks networks, indicating that PE identified important nodes in these networks more accurately. In large networks, such as Crime and Netscience, PE performed well and obtained the maximum τ value for both networks, indicating that PE has an advantage in identifying important nodes. In contrast to certain centrality measures, whose performance fluctuated widely from network to network, the node PE metric performed well across networks, indicating its stability.
When the infection probability, β, is very low in the SIR model, the disease barely spreads because an infected node has only a small probability of infecting its neighbors; the seed node therefore infects only a limited area or no nodes at all, which makes its spreading ability difficult to measure. Conversely, when the transmission rate is high, the disease infects a large proportion of nodes regardless of the node from which it started, so the outcome says little about the impact of individual nodes. Therefore, we focused on the range in which the transmission rate was around the epidemic threshold [39].
Table 3 lists the τ values of each importance metric at the prevalence threshold of each network. The node PE metric performed better than the other importance metrics; PE performed best for eight of the nine networks.
The nine real networks used in the SIR model experiments are representative of various fields, which supports the generality of the experimental results. Figure 4 and Table 3 show that PE obtained high τ values for all nine networks, especially the Ca_Sandi_Auth, Email-Enron, Dolphins, Crime, and PolBooks networks, for which its value was higher than that of the other nine methods. The closer τ is to one, the more accurate the ranking results. We found that PE tends to perform better than PE(N2), indicating that cn is better suited to identifying important nodes than N2. The results show that PE identified the critical nodes in the network more accurately and had strong applicability and good performance in most networks.
The correlations between the ranking lists provided by nine importance measures and the ranking lists obtained from the SIRS model, for six real networks in different domains and a varying infection rate β, are shown in Figure 5.
Figure 5 shows that PE outperformed the other eight importance measures overall on both small and large networks. Degree centrality performed best when the value of β was small. A possible explanation is that, when the infection rate is small, infected nodes rarely infect other nodes; in this regime, the more neighbors a node has, the more likely it is to infect at least one other node, which is consistent with reality.
Figure 5 also shows that the ranking list obtained from PE correlated more strongly with that obtained from the SIRS model as the infection rate increased, especially when β was near the prevalence threshold, where its τ value was higher than that of the other importance measures. Therefore, the ranking list of nodes obtained by PE was more accurate, and PE identified important nodes in the network accurately.
Table 4 lists the τ values of each importance metric at the prevalence threshold of each network. The node PE metric performed better than the other importance metrics; PE performed best on all six networks.
Figure 5 and Table 4 show that in the six real networks used in the SIRS model experiment, regardless of size, the importance ranking list obtained by PE was more closely related to the reference importance ranking list simulated by the SIRS model, and its τ value was higher than that of the other eight importance measures. The results thus show that PE can identify the important nodes in the network under the SIRS model.
4.6. Robustness Experiment
This section evaluates the accuracy of the algorithm in identifying important nodes from the perspective of robustness [42]; that is, the significance of a node is judged by the impact of its removal on network connectivity. We assess the impact of node failure on network connectivity through the largest connectivity coefficient [43]; the greater the impact, the more important the failed node is.
A connected component is a subgraph in which any two nodes are connected by a path. Many real-world networks are disconnected and break down into multiple connected components, of which the one with the largest number of nodes is called the largest connected component [43]. The size of the largest connected component reflects the connectivity of a complex network, and it changes when nodes are removed. After important nodes are removed, the largest connected component becomes smaller; the greater the change, the more crucial the removed nodes. Hence, the robustness of the network was estimated with the largest connectivity coefficient.
The largest connectivity coefficient (denoted r) is defined as the ratio of the number of nodes in the network's largest connected component to the total number of nodes in the network. It is formulated as follows:
$$r = \frac{n_c}{n},$$
where nc denotes the number of nodes in the network's largest connected component after the removal of some nodes and n denotes the total number of nodes in the network. The value of r varies with f, the ratio of the number of nodes removed from the network to the total number of nodes in the network. A gradual decrease in r is observed as the number of removed nodes increases.
For each importance evaluation algorithm, the nodes were removed one by one in descending order of importance, and the resulting change in the largest connectivity coefficient of the network was plotted as a curve on two-dimensional coordinates. The more pronounced the downward trend of the curve, the better the algorithm. Figure 6 shows these curves for eight real-world networks, with the nodes ranked by each of six importance measures.
We used the robustness value, R [42,44], to estimate the performance of the method; R is calculated as follows:
$$R = \frac{1}{n} \sum_{j=1}^{n} r_j, \tag{19}$$
where n denotes the total number of nodes in the original network and rj denotes the largest connectivity coefficient after removing j nodes. Each time a node is removed, the largest connectivity coefficient of the network is calculated and added to the sum; this process iterates until the network is empty. Consequently, the smaller the final R value, the faster the network collapses, indicating that the algorithm identified the important nodes more accurately.
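The node-removal procedure can be sketched as follows. This is a minimal illustration, assuming that R is the mean of the largest connectivity coefficients recorded after each removal (the sum of rj divided by n) and that the network is given as an adjacency dictionary; the names `largest_component_size` and `robustness` and the toy path network are hypothetical.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component once the nodes in `removed` are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        # Depth-first search over the remaining nodes.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def robustness(adj, removal_order):
    """Return (R, curve), where curve[j - 1] is r after removing the first j nodes."""
    n = len(adj)
    removed = set()
    curve = []
    for node in removal_order:
        removed.add(node)
        curve.append(largest_component_size(adj, removed) / n)
    return sum(curve) / n, curve


# Toy path network 0-1-2-3-4 (hypothetical): removing the middle node first
# fragments the network faster than removing nodes from one end.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
r_middle_first, _ = robustness(path, [2, 0, 1, 3, 4])
r_end_first, _ = robustness(path, [0, 1, 2, 3, 4])
print(r_middle_first, r_end_first)
```

In this toy case the middle-first order gives the smaller R value, mirroring the argument that a better importance ranking makes the network collapse faster.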
We analyzed the robustness of the six importance measures above on the basis of connectivity. Table 5 presents their robustness values, R.
From Equation (19), the smaller the robustness value, the faster the network collapses, which indicates better performance of the algorithm in identifying important nodes.
Table 5 shows that PE rapidly reduced the largest connectivity coefficient, r, on all eight networks and yielded the smallest R value, i.e., it identified the important nodes in the network. The results verify that removing nodes in descending order of PE importance minimized the robustness; i.e., the node PE metric accurately identified the important nodes in the network.