Identifying Influential Nodes of Complex Networks Based on Trust-Value

Abstract: The real world contains many kinds of complex networks. Influential nodes in complex networks can be used to promote or inhibit the spread of information, so identifying them has become a widely studied topic. Most existing algorithms for influential node identification are based on the structure of the network, such as the degree of the nodes. However, the attribute information of nodes also affects the ranking of nodes' influence. In this paper, we consider both the attribute information of nodes and the structure of networks. We therefore propose a trust-value built from two quantities: the similarity ratio, based on attribute information, and the degree ratio, based on network structure. On this basis, the trust-PageRank (TPR) algorithm is proposed to identify influential nodes in complex networks. Finally, several real networks from different fields are selected for experiments. Compared with some existing algorithms, the results suggest that TPR identifies the influential nodes in networks more rationally and effectively.


Introduction
Complex networks are simplified representations of the complex systems found in the real world. Research on complex networks helps people understand these systems more deeply, for example their internal dynamic evolution, and supports behavior prediction [1,2,3]. Some nodes in complex networks play key roles in the information dissemination of entire networks; these are called influential nodes. Identifying influential nodes in complex networks is an important direction in the field of complex network research [4], and has practical value for information dissemination in real-world networks [5], the spread of infectious diseases [6], product promotion [7,8], etc. This task can effectively reduce economic cost and avoid economic loss to a certain extent [9]. For example, in traffic networks [10], the identification of influential road nodes can serve as an important reference for road resource allocation and flow diversion. Identifying the influential nodes in biological networks [11] can provide auxiliary means for disease treatment and for understanding biological information.
In recent decades, many algorithms for identifying influential nodes have been proposed in the field of complex networks, ranging from structural centrality and iterative refinement of centrality to dynamic operations based on nodes. For instance, the degree centrality (DC) [12] algorithm judges the importance of a node by counting the number of its directly connected neighbor nodes. Chen et al. [13] considered the semi-local information of nodes and proposed semi-local centrality. Betweenness centrality (BC) [14] describes the ability of nodes to control the information flow along the shortest paths in the network. Newman [15] considered that the shortest paths were more important than the non-shortest paths in the process of information propagation, and proposed the random walk betweenness centrality. The K-shell [16] algorithm strips away the nodes in the outer layers of the network, and considers the nodes at the core of the network as having strong influence.

Basic Idea
PageRank originated in web page ranking. It is the core of Google and other search engines, and it is now also used to rank data in many other network settings. PageRank is popular because of both its perceived effectiveness and its easy-to-understand philosophy, which avoids ranking objects based on difficult-to-measure intrinsic qualities [27]. In PageRank, pages are abstracted as nodes and hyperlinks as edges. The influence indicator of a node is its PR value, which depends on the importance of the other nodes pointing to it: a node has a higher PR value if it has many edges from other high-PR nodes. Each node starts with an initial PR value, which stabilizes after repeated iteration. In this process, a web user may not only jump from the current page to an adjacent page with some probability, but also jump to any page in the network with an additional probability. The PR value of node i at time t is defined as
$$PR_i(t) = \frac{1-\alpha}{n} + \alpha \sum_{j \in N_i} \frac{PR_j(t-1)}{k_j} \tag{1}$$
where n is the number of nodes in the network, N_i is the set of neighbors of node i, α is the jump probability, and k_j is the number of nodes to which node j points.
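The PageRank iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: it assumes the common form in which a node keeps a share (1 − α)/n from random jumps and receives α times the evenly split PR values of the nodes pointing to it, it assumes every node has at least one outgoing edge, and the default `alpha=0.85` and the function name `pagerank` are illustrative choices.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Iterate PR values on a graph given as {node: [nodes it points to]}.

    Assumptions (not taken from the paper): alpha multiplies the link-following
    term, every node has at least one outgoing edge, and a fixed number of
    iterations stands in for a convergence test.
    """
    nodes = list(out_links)
    n = len(nodes)
    out_degree = {v: len(out_links[v]) for v in nodes}

    # Invert the adjacency lists so each node knows who points to it.
    incoming = {v: [] for v in nodes}
    for source, targets in out_links.items():
        for target in targets:
            incoming[target].append(source)

    pr = {v: 1.0 / n for v in nodes}  # uniform initial PR value
    for _ in range(iterations):
        pr = {
            i: (1 - alpha) / n
            + alpha * sum(pr[j] / out_degree[j] for j in incoming[i])
            for i in nodes
        }
    return pr


# Undirected star: node 0 is linked with nodes 1, 2 and 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
scores = pagerank(star)
```

In the star example the hub (node 0) ends with the largest PR value, because every leaf passes its whole PR value to the hub while the hub splits its own value three ways.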
In the PageRank algorithm, each node distributes its own PR value evenly among its neighboring nodes. In real networks, however, the probability and amount of information exchanged between nodes are rarely distributed evenly. To simplify the description of information dissemination, assume that only one node needs to transmit information through the network. To reach all nodes as quickly as possible, the node spreads the information to all of its neighbors, and neighbors that are more "trustworthy" receive more of it. This continues until the network reaches a steady state or the algorithm reaches a set number of iterations, which is consistent with our perception of reality.
Inspired by this phenomenon, we propose a trust-value concept that consists of the similarity ratio and the degree ratio. In this section, we explain similarity ratio, degree ratio, and trust-value in detail. Finally, we introduce the proposed trust-PageRank algorithm.
Herein, the network is represented by G = (V, E), where V is the node set of the network and E is the edge set. The adjacency matrix of the network G is A = {a(i, j)}. In an undirected network, a(i, j) = 1 if there is an edge between the nodes v_i and v_j, and a(i, j) = 0 otherwise. n = |V| is the number of nodes in the network and m = |E| is the number of edges. Table 1 provides the definitions of the symbols used below.

Table 1. Definitions of symbols.
Symbol       Definition
H(i)         The hobby set of node i.
s(i, j)      The similarity between nodes i and j.
S_v          The sum of the similarities between node v and its neighbor nodes.
R_s(i,j)     The ratio of the similarity between nodes i and j to the sum of the similarities between node j and its neighbors.
D_i          The sum of the degrees of all adjacent nodes of node i.
R_d(i,j)     The ratio of the degree of node i to the sum of the degrees of all adjacent nodes of node j.
T(i, j)      The trust-value of node i to node j.

The Similarity Ratio
In social networks, two individuals who share hobbies are similar to some extent, so they tend to interact more frequently and exchange more information with each other. Conversely, the similarity between individuals is relatively low if they have nothing in common; accordingly, they may not interact frequently. Thus, we define the similarity ratio as follows.
Definition 1. The similarity ratio of a node to an adjacent node is the similarity between the two nodes divided by the sum of the similarities between the adjacent node and its neighbor nodes. The similarity ratio of node i to adjacent node j is defined in Equation (2):
$$R_{s(i,j)} = \frac{s(i,j)}{S_j}, \qquad S_j = \sum_{u \in N_j} s(j,u) \tag{2}$$
Assume there is a network as shown in Figure 1a. The hobby sets of nodes v, 1, 2, 3, and 4 are H(v), H(1), H(2), H(3), and H(4), respectively. Their values are shown in Table 2. If the similarity s(i, j) of nodes i and j is defined as the number of hobbies they share,
$$s(i,j) = |H(i) \cap H(j)| \tag{3}$$
then s(1, v) = 2, s(2, v) = 1, s(3, v) = 0, and s(4, v) = 1, and the similarity ratios of nodes 1, 2, 3, and 4 to node v are R_s(1,v) = 0.5, R_s(2,v) = 0.25, R_s(3,v) = 0, and R_s(4,v) = 0.25, respectively. If the hobby set of each node is treated as a set of neighbor nodes, as shown in Figure 1b, the similarity between two nodes is determined by their common neighbor nodes: the more similar the neighbor nodes of two nodes, the more similar the two nodes. Equation (3) serves only to illustrate the idea that the similarity of nodes depends on the similarity of their neighbor nodes. In practice, we use the SimRank [28] algorithm to calculate similarity.
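The worked example above can be reproduced with a short Python sketch. The hobby sets below are hypothetical stand-ins, chosen only so that the shared-hobby counts match the values quoted in the text (2, 1, 0, and 1); they are not the contents of Table 2. The function names are likewise illustrative.

```python
def similarity(hobbies, i, j):
    """Number of hobbies shared by nodes i and j (the illustrative measure)."""
    return len(hobbies[i] & hobbies[j])


def similarity_ratio(hobbies, neighbors, i, j):
    """Similarity of i and j divided by the total similarity of j to its neighbors."""
    total = sum(similarity(hobbies, j, u) for u in neighbors[j])
    return similarity(hobbies, i, j) / total if total else 0.0


# Hypothetical hobby sets (NOT the values of Table 2).
hobbies = {
    "v": {"reading", "running", "chess"},
    1: {"reading", "running"},
    2: {"chess", "cooking"},
    3: {"cooking"},
    4: {"reading", "music"},
}
# Node v is adjacent to nodes 1, 2, 3 and 4.
neighbors = {"v": [1, 2, 3, 4], 1: ["v"], 2: ["v"], 3: ["v"], 4: ["v"]}

ratios = {i: similarity_ratio(hobbies, neighbors, i, "v") for i in neighbors["v"]}
```

With these sets the total similarity of node v to its neighbors is 4, so the ratios come out as 0.5, 0.25, 0, and 0.25, matching the figures in the text.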
In SimRank, two objects are similar if they are related to similar objects. As in a social network, the more common or similar neighbors two individuals have, the more similar they are to each other. In SimRank, the similarity between a and b is defined as
$$s(a,b) = \frac{C}{|N_a||N_b|} \sum_{i=1}^{|N_a|} \sum_{j=1}^{|N_b|} s(N_a(i), N_b(j)) \tag{4}$$
where C is a constant that represents an attenuation factor. The value of C does not affect the results, since only the ratio of similarity is calculated in the experiments. In this paper, C is taken as 1. N_a is the set of neighbor nodes of node a, and N_a(i) represents the ith neighbor node of node a. If N_a or N_b is an empty set, then |N_a||N_b| = 0.
To prevent division by zero in Equation (4), s(a, b) = 0 is defined in that case. As Equation (4) shows, calculating s(a, b) in SimRank is an iterative process. Thus, before starting the iteration, we initialize s(a, b) = 1 if a = b and s(a, b) = 0 otherwise.
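The iterative SimRank computation can be sketched as follows. This is a minimal, unoptimized illustration under stated assumptions: similarities start from the usual SimRank initialization (1 for a node with itself, 0 otherwise), the self-similarity s(a, a) is held at 1 in every round as in standard SimRank, and the fixed iteration count `iterations` is an illustrative parameter rather than a value from this paper.

```python
def simrank(neighbors, C=1.0, iterations=10):
    """Naive SimRank on a graph given as {node: [neighbor nodes]}.

    Assumptions: s(a, a) stays 1, pairs with an empty neighbor set get
    similarity 0, and a fixed number of rounds replaces a convergence test.
    """
    nodes = list(neighbors)
    # Initialization: a node is fully similar to itself, dissimilar to others.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}

    for _ in range(iterations):
        updated = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    updated[a][b] = 1.0
                    continue
                na, nb = neighbors[a], neighbors[b]
                if not na or not nb:
                    # Avoid dividing by |N_a||N_b| = 0.
                    updated[a][b] = 0.0
                    continue
                total = sum(sim[x][y] for x in na for y in nb)
                updated[a][b] = C * total / (len(na) * len(nb))
        sim = updated
    return sim


# Path 0 - 1 - 2: nodes 0 and 2 share their only neighbor.
path = {0: [1], 1: [0, 2], 2: [1]}
scores = simrank(path, C=0.8, iterations=5)
```

On the path graph, nodes 0 and 2 have exactly the same neighborhood, so their similarity equals the attenuation factor C, whereas nodes 0 and 1 never share a similar neighbor and stay at 0. The cost is quadratic in the number of nodes per round, so this form is only suitable for small networks.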

The Degree Ratio
In a network, the higher the degree of a node, the wider its range of contact with surrounding nodes. If a node needs to propagate information, then propagating it to the neighbor nodes with larger degrees benefits the spread. However, it is one-sided to propagate the information only to the neighbor with the largest degree. For example, as shown in Figure 2, if node v only propagates information to node 1, which has the highest degree, then regions b, c, and d will not receive any of the transmitted information. To spread the information to all corners of the network as quickly as possible, we propose the degree ratio, which takes into account both the number of neighbor nodes and their degrees.
Definition 2. The degree ratio of node i to its adjacent node j is the ratio of the degree of node i to the sum of the degrees of the adjacent nodes of j, as shown in Equation (5):
$$R_{d(i,j)} = \frac{d_i}{D_j}, \qquad D_j = \sum_{u \in N_j} d_u \tag{5}$$
where d_i denotes the degree of node i.
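Definition 2 translates directly into code. The sketch below computes the degree ratio on a small hypothetical graph (it is not the network of Figure 2); the function name `degree_ratio` and the adjacency-list representation are illustrative choices.

```python
def degree_ratio(neighbors, i, j):
    """Degree of node i divided by the summed degrees of all neighbors of j."""
    degree_sum = sum(len(neighbors[u]) for u in neighbors[j])
    return len(neighbors[i]) / degree_sum if degree_sum else 0.0


# Hypothetical graph: v has three neighbors with degrees 3, 2 and 1.
graph = {
    "v": ["1", "2", "3"],
    "1": ["v", "a", "b"],
    "2": ["v", "c"],
    "3": ["v"],
    "a": ["1"],
    "b": ["1"],
    "c": ["2"],
}

shares = {i: degree_ratio(graph, i, "v") for i in graph["v"]}
```

Here the neighbor degrees sum to 6, so node 1 receives a share of 1/2, node 2 a share of 1/3, and node 3 a share of 1/6. Every neighbor receives something, and the shares over all neighbors of v add up to 1.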

The Trust-Value
The degree of trust between nodes in the network is composed of the similarity ratio, based on attribute information, and the degree ratio, based on topology.
Definition 3. The trust-value of node i to adjacent node j depends on the similarity ratio and degree ratio of j to i. The greater the degree of the adjacent node and the more similar the two nodes, the higher the trust-value. The trust-value T(i, j) of node j to i is defined in Equation (6) as a combination of the similarity ratio and the degree ratio, weighted by a parameter k. Normally, k ∈ [0, 1]. We discuss the value of k in Section 3.
In the network shown in Figure 3, there are 5 nodes. The sums of the degrees of all their adjacent nodes are D_1 = 9, D_2 = 7, D_3 = 9, D_4 = 7, and D_5 = 10. Table 3 shows the similarities between nodes. In Figure 3, nodes 1 and 2, and nodes 3 and 4, have similar characteristics, so the two pairs have the same similarity of 0.6, as shown in Table 3. This is consistent with intuitive observation. Combining the similarity ratio and the degree ratio, Figure 3 depicts the calculation process of the trust-value between nodes 1 and 2.
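A possible way to combine the two ratios is sketched below. It rests on an explicit assumption: the trust-value is taken to be the convex combination k · R_s + (1 − k) · R_d, with k weighting the similarity ratio. Which of the two ratios k multiplies is an assumption of this sketch, as are the function name, the example graph, and the similarity values (none of them are the data of Figure 3 or Table 3). The default k = 0.85 is the value adopted in the experiments.

```python
def trust_values(neighbors, sim, j, k=0.85):
    """Trust-value T(i, j) for every neighbor i of node j.

    ASSUMPTION: T = k * similarity_ratio + (1 - k) * degree_ratio.
    The assignment of k to the similarity term is a guess, not a fact
    taken from the paper.
    """
    sim_total = sum(sim[j][u] for u in neighbors[j])
    degree_total = sum(len(neighbors[u]) for u in neighbors[j])
    trust = {}
    for i in neighbors[j]:
        r_s = sim[j][i] / sim_total if sim_total else 0.0
        r_d = len(neighbors[i]) / degree_total if degree_total else 0.0
        trust[i] = k * r_s + (1 - k) * r_d
    return trust


# Hypothetical graph: a triangle 1-2-3 with a tail node 4 attached to 3.
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
pair_similarity = {(1, 2): 0.5, (1, 3): 0.6, (2, 3): 0.3, (3, 4): 0.1}
sim = {v: {} for v in graph}
for (a, b), value in pair_similarity.items():
    sim[a][b] = value
    sim[b][a] = value

trust_to_neighbors_of_3 = trust_values(graph, sim, 3)
```

Because both ratios are normalized over the neighbors of j, the trust-values of all neighbors of a node sum to 1 under this assumption, so they can be read as the shares in which node j divides what it passes on.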

Trust-PageRank Algorithm
Information is transmitted to adjacent nodes according to the trust-value. Nodes with higher trust-value receive more information. By introducing trust-value into PageRank to identify the influential nodes, the trust-PageRank (TPR) algorithm was constructed.

Definition 4. In the iterative voting process of trust-PageRank, the influence of node i at time t is defined as
$$TPR_i(t) = \frac{1-\alpha}{n} + \alpha \sum_{j \in N_i} T(i,j)\, TPR_j(t-1) \tag{7}$$
where T(i, j) denotes the trust-value of node j to node i obtained by Equation (6), n is the number of nodes in the network, N_i represents the set of neighbors of node i, and α is the jump probability. In the experiments, α is held at a fixed value.
The algorithm initializes the input network data, iteratively obtains the nodes' similarity set, and then calculates the similarity and degree ratios according to the similarity and initialization sets. Then, the trust-value between the nodes is obtained and combined with PageRank to compute the TPR indicator value of each node. The specific steps of the algorithm are shown in Algorithm 1.

Algorithm 1 Trust-PageRank Algorithm
Input: G = (V, E)
Output: rankList
for v_s ∈ V do
    Initialize the network and count the set of adjacent nodes of v_s
end for
…
    end if
    simRatioMap ← s(v_u, v_v)  // put the initial similarity values of nodes into the simRatioMap
end for
for v_i ∈ V do
    for v_j ∈ N_i do
        Calculate s(v_i, v_j) according to Equation (4)
        simRatioMap ← s(v_i, v_j)
    end for
end for
while ite < Iteration do
    for v_k ∈ V do
        for v_m ∈ N_k do
            Calculate the sum of the degrees of the adjacent nodes of v_m and the similarity between v_m and its adjacent nodes
            Calculate the similarity ratio R_s(m,k) by Equation (2)
            Calculate the degree ratio R_d(m,k) according to Equation (5)
            Calculate the trust-value T_mk of the node v_k to v_m from R_s(m,k) and R_d(m,k) using Equation (6)
            Calculate the TPR_k value of node v_k according to Equation (7)
            rankList ← TPR_k  // put the value into the rankList
        end for
    end for
    ite++
end while
rankList.sort()  // return the sorted rankList
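The overall flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than the implementation used in the experiments: the pairwise similarities are passed in as a precomputed dictionary `sim` (for example from SimRank), the trust-value is assumed to be k · R_s + (1 − k) · R_d, the update is assumed to take the PageRank-like form (1 − α)/n + α Σ T(i, j) · TPR_j, and `alpha=0.85` and `iterations=100` are illustrative defaults.

```python
def trust_pagerank(neighbors, sim, k=0.85, alpha=0.85, iterations=100):
    """Rank nodes of an undirected graph {node: [neighbors]} by a TPR-style score.

    ASSUMPTIONS: trust = k * similarity_ratio + (1 - k) * degree_ratio, and
    the score update mirrors PageRank with trust-weighted shares.
    """
    nodes = list(neighbors)
    n = len(nodes)
    degree = {v: len(neighbors[v]) for v in nodes}

    # trust[i][j]: share of node j's score that is passed on to neighbor i.
    trust = {v: {} for v in nodes}
    for j in nodes:
        sim_total = sum(sim[j][u] for u in neighbors[j])
        degree_total = sum(degree[u] for u in neighbors[j])
        for i in neighbors[j]:
            r_s = sim[j][i] / sim_total if sim_total else 0.0
            r_d = degree[i] / degree_total if degree_total else 0.0
            trust[i][j] = k * r_s + (1 - k) * r_d

    tpr = {v: 1.0 / n for v in nodes}  # uniform initial scores
    for _ in range(iterations):
        tpr = {
            i: (1 - alpha) / n
            + alpha * sum(trust[i][j] * tpr[j] for j in neighbors[i])
            for i in nodes
        }

    # Sorted rank list, most influential node first.
    return sorted(tpr.items(), key=lambda item: item[1], reverse=True)


# Star graph with uniform similarities: the hub should rank first.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
uniform_sim = {a: {b: 1.0 for b in star} for a in star}
rank_list = trust_pagerank(star, uniform_sim)
```

Unlike the pseudocode, the sketch computes the trust-values once before the iteration, because they depend only on the similarities and degrees and do not change between rounds. With uniform similarities and k = 1 the trust shares become 1/|N_j| and the procedure reduces to ordinary PageRank on an undirected graph.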

Preparation
Thirteen representative real networks were used to evaluate the TPR algorithm and to estimate the parameter k: two small-scale networks (Kite [29] and Karate), the email network (Email), the international E-road network (Euroroad), the power grid of the United States (Power Grid), networks of protein-protein interactions (Protein, PDZBase, and Propro), the U.S. political books network (PolBooks), the football club network (Football), an animal social network (Dolphins), a website user network (Hamster), and the metabolic network of the roundworm (Elegans). These networks were selected from distinct fields, including social, transportation, protein-protein interaction, and correspondence networks, for validation. The statistics of the networks are summarized in Table 4; all of them are available from KONECT or NETWORK.
The TPR algorithm was compared with the following existing algorithms:
(1) DC [12]: An intuitive algorithm based on network topology that judges the importance of nodes by counting the number of their neighboring nodes.
(2) BC [14]: The importance of node i depends on the percentage of all shortest paths in a network that contain node i.
(4) HITS [20]: Each node has an authority value and a hub value, which affect each other and together serve as the indicator used to evaluate the influence of nodes.

SIR Spreading Model
A node identification method should produce influential nodes that match the actual propagation process as closely as possible. However, most complex networks lack labels for the influence ability of nodes. Thus, we employ a mature epidemic transmission model, the SIR [16] model, to evaluate the performance and effectiveness of TPR; the SIR model has also been used as a criterion in other relevant papers [30,31]. There are three states in SIR: (1) susceptible: nodes in this state can be infected by other nodes; (2) infected: infected nodes infect other nodes with a probability of β and move to the recovered state with a probability of γ; (3) recovered: nodes that have recovered from the infected state can neither infect other nodes nor be infected again. In the experiment, one node is selected as a seed node, and then t steps of infection occur. The experiment is repeated many times, and the average number of infected and recovered nodes after infection is taken as the propagation ability of the node, denoted as F(t). The influence value K_i of node i is defined as
$$K_i = \frac{1}{N_{ite}} \sum_{ite=1}^{N_{ite}} (n_I + n_R) \tag{8}$$
where N_ite is the number of iterations, and n_I and n_R represent the numbers of infected and recovered nodes, respectively. We set N_ite = 100. Because the SIR model is random, the whole process needs to run many times, each run requiring multiple steps. Thus, applying the SIR model to a large-scale network is time-consuming [31,32]. However, we can still use it as a performance metric.
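The SIR evaluation described above can be sketched as a small simulation. This is an illustrative sketch, not the experimental code: it assumes synchronous updates in which every infected node first tries to infect each susceptible neighbor with probability `beta` and is then tested for recovery with probability `gamma`, and it reports the plain average of infected plus recovered nodes over the runs. The function names and the `seed` argument are illustrative.

```python
import random


def sir_spread(neighbors, seeds, beta, gamma, steps, rng):
    """Run one SIR simulation; return the count of infected plus recovered nodes."""
    infected = set(seeds)
    recovered = set()
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in neighbors[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        # Nodes infected before this step may now recover.
        healed = {u for u in infected if rng.random() < gamma}
        infected = (infected - healed) | newly_infected
        recovered |= healed
    return len(infected) + len(recovered)


def influence(neighbors, node, beta=0.3, gamma=1.0, steps=10, n_ite=100, seed=0):
    """Average outbreak size over n_ite runs seeded at a single node."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    total = sum(
        sir_spread(neighbors, [node], beta, gamma, steps, rng) for _ in range(n_ite)
    )
    return total / n_ite


# Path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
estimate = influence(path, 2)
```

Passing a list of several nodes as `seeds` to `sir_spread` gives the multi-seed variant used when the top-ranked nodes of an algorithm are infected together.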

Kendall Correlation Coefficient
The Kendall correlation coefficient [33] is used to measure the correlation between ranked variables, and is defined as
$$\tau = \frac{2(C - D)}{N(N-1)} \tag{9}$$
Consider two element pairs (A_i, B_i) and (A_j, B_j) drawn from the two sequences A and B. The two pairs are considered consistent only if A_i > A_j and B_i > B_j, or A_i < A_j and B_i < B_j. In Equation (9), C and D represent the numbers of consistent pairs and inconsistent pairs in the two sorted sequences, respectively, and N is the number of statistical objects. The coefficient satisfies τ ∈ [−1, 1]; the larger the value of τ, the more consistent the two rankings are. τ = 0 indicates that the two rankings are independent of each other, and τ = −1 indicates that their orderings are opposite.
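The coefficient can be computed by counting consistent and inconsistent pairs directly. The sketch below uses the form τ = 2(C − D) / (N(N − 1)); tied pairs are counted as neither consistent nor inconsistent, which is a choice of this sketch, and the function name is illustrative.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall correlation between two equally long score sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    consistent = inconsistent = 0
    for i, j in combinations(range(n), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        if product > 0:
            consistent += 1      # both sequences order the pair the same way
        elif product < 0:
            inconsistent += 1    # the sequences disagree on this pair
        # product == 0 is a tie and is counted in neither group
    return 2.0 * (consistent - inconsistent) / (n * (n - 1))


# Scores from a ranking algorithm versus SIR influence for four nodes.
algorithm_scores = [0.40, 0.30, 0.20, 0.10]
sir_scores = [12.0, 9.5, 10.1, 3.2]
tau = kendall_tau(algorithm_scores, sir_scores)
```

In the example, five of the six pairs are ordered the same way by both sequences and one pair is swapped, so τ = 2(5 − 1)/12 ≈ 0.67. The pairwise loop is quadratic in N, which is acceptable for the network sizes considered here.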

Estimate for Parameter k
The trust-value proposed above consists of both the similarity ratio and the degree ratio. To determine their proportions in the trust-value and to ensure the accuracy of the experimental results, we varied the parameter k from 0.1 to 1. The infection probability and recovery probability in the SIR model were set to 0.3 and 0.6, respectively, 10 steps were propagated, and 100 iterations were run. The Kendall correlation coefficients between the SIR and TPR rankings were calculated in the PolBooks, Dolphins, Hamster, and Elegans networks. The experimental results are shown in Figure 4. Figure 4 shows that in the Dolphins, Hamster, and Elegans networks, the values of the correlation coefficient τ increase slightly as k grows. In the PolBooks network, the maximum of τ is reached when k is between 0.85 and 0.9. To accurately evaluate the influence of nodes in the network and to avoid adopting too many parameters in the experiments, the value of k was set to 0.85.

Preliminary Effectiveness Analysis
To clearly visualize the internal structure of the network, we selected the small-scale networks Kite [29] and Karate, shown in Figure 5. Table 5 shows that the results of these algorithms for the symmetric Kite network are similar. The influential nodes identified by TPR are consistent with those identified by DC. In addition, compared with PageRank, TPR considers nodes 6 and 8 to be more influential than node 2. From the perspective of TPR, the TPR value of node 2 derives from nodes 1 and 3, which have lower TPR values. Node 2 has a small degree, and its similarity with node 3 is not dominant compared to that of the other neighbors of node 3 (nodes 4 and 5). By contrast, the nodes adjacent to node 6 rank highly and have higher TPR values, and node 6 has a higher degree and is more similar to its neighbors; the same holds for node 8. In addition, node 2 only connects node 3 to the leaf node 1, whereas nodes 6 and 8 are intuitively more influential, which also suggests that TPR is more rational.
In the Karate network, the top 10 nodes selected by TPR are basically consistent with those selected by the other algorithms, differing only in their order. The top 10 nodes picked by DC, PageRank, and TPR are nodes 1, 2, 3, 4, 9, 14, 24, 32, 33, and 34, whereas BC selects nodes 6 and 20 instead of nodes 4 and 24. HITS considers node 31 to be more important than node 24.

Analysis of SIR Spreading Model
In the SIR propagation verification, because the propagation ability of a node is more representative in the initial propagation phase [13], the seed node propagates for only 10 steps (t = 10) rather than until the infection reaches a steady state. First, in each iteration, the infected nodes infect their neighbor nodes with a probability of 0.3 and then move to the recovered state with a probability of 1, that is, β = 0.3 and γ = 1. The TPR algorithm was compared with the DC, BC, PageRank, and HITS algorithms. Figure 6 plots the influence values of the nodes calculated by these algorithms against the nodes' propagation ability obtained from the SIR model. In general, the stronger the positive correlation in the scatter plot, the closer the algorithm's results are to the SIR model and the more appropriate the algorithm. In the Email network (Figure 6a), the DC, PageRank, HITS, and TPR algorithms all reflect that the influence and the SIR propagation ability of nodes are positively related. The distributions of nodes under the DC, PageRank, and TPR algorithms are concentrated, whereas the distribution under BC is more dispersed overall. In the Euroroad network (Figure 6b), each algorithm exhibits a positive correlation. This feature is more obvious for TPR and PageRank, whereas DC cannot distinguish nodes with the same influence. BC performed poorly, with the nodes gathered on the left side of the horizontal axis. Under the HITS algorithm, some nodes with high propagation ability in the SIR model have low HITS values. The DC, PageRank, HITS, and TPR algorithms performed similarly in the Protein network (Figure 6c), and the distribution of nodes under the BC algorithm was relatively scattered. On the Power Grid network (Figure 6d), the algorithms performed much as they did on the Euroroad network.
From the above experimental results, in the Email, Euroroad, and Protein networks, the centrality value calculated by the TPR was more linearly consistent with the propagation ability obtained by the SIR model; that is, the nodes with similar influence in the SIR model have little difference from the influence in the TPR algorithm. In Power Grid, the result of the HITS algorithm was better than that of TPR.
Then, to evaluate the infection ability of the influential nodes identified by each algorithm, the top 10 influential nodes identified by the DC, BC, PageRank, HITS, and TPR algorithms were selected as the seed nodes of the SIR model. The infection probability β was 0.3, the recovery probability γ was 1, 20 steps were propagated, and 100 iterations were performed. Finally, the total number of infected and recovered nodes in the network, F(t), was counted to represent the propagation ability of the top 10 nodes selected by each algorithm. The higher the F(t), the more influential the 10 selected seed nodes. The results are shown in Figure 7. In the Email (Figure 7a) and Protein (Figure 7c) networks, the top 10 nodes identified by each algorithm have similar propagation ability. When t < 5, the slopes of the curves for the DC, BC, PageRank, and TPR algorithms are equivalent, whereas the slope of the curve for the HITS algorithm is slightly smaller than those of the other algorithms. In the Euroroad network (Figure 7b), after 20 steps of propagation, the numbers of infected and recovered nodes and the infection rate of the seed nodes rank as TPR > DC > PageRank > HITS > BC. In the Power Grid network (Figure 7d), the top 10 nodes identified by DC have strong abilities to spread. The number of infected nodes and the infection rate of the top 10 nodes rank as DC > TPR > PageRank > BC > HITS and DC > TPR > PageRank > HITS > BC, respectively.
In the above experimental results, when 0 < t < 15, the numbers of infected and recovered nodes for each algorithm increase. When 15 < t < 20, each algorithm reaches a stable state. In the Email, Euroroad, and Protein networks, the F(t) curve of the TPR algorithm has a higher initial slope and a clearly visible growth rate. TPR is second only to DC in Power Grid, indicating that the top 10 nodes identified by TPR have advantages in initial propagation.

Analysis of Kendall Correlation Coefficient
To compare the algorithms more clearly, we applied them to the PolBooks, PDZBase, Football, and Propro networks, compared the results obtained by each algorithm with those of the SIR model, and calculated the Kendall correlation coefficient, setting β = 1/(⟨k⟩ − 1) and γ = 1. Table 6 shows the Kendall coefficients between the ranking of the top 30% of nodes obtained by each of the DC, BC, PageRank, HITS, and TPR algorithms and the ranking obtained by the SIR model. The higher the Kendall coefficient, the more effective the algorithm. The table shows that in the PDZBase, Football, and Propro networks, the TPR algorithm obtains higher correlation coefficient values than the other algorithms. In the PolBooks network, DC obtains the highest correlation coefficient.
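The infection probability used here depends only on the degree sequence of the network. The sketch below assumes that ⟨k⟩ denotes the average node degree, which is the usual meaning of this notation; the function name and the example graph are illustrative.

```python
def infection_probability(neighbors):
    """Return beta = 1 / (<k> - 1), where <k> is assumed to be the mean degree."""
    if not neighbors:
        raise ValueError("empty network")
    mean_degree = sum(len(adjacent) for adjacent in neighbors.values()) / len(neighbors)
    if mean_degree <= 1.0:
        raise ValueError("mean degree must exceed 1 for beta to be defined")
    return 1.0 / (mean_degree - 1.0)


# Complete graph on four nodes: every node has degree 3.
k4 = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
beta = infection_probability(k4)
```

For the complete graph on four nodes the mean degree is 3, so β = 0.5. Denser networks yield a smaller β, which keeps the outbreak from saturating the network immediately and leaves room for the rankings to differ.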

Conclusions
In this paper, we considered both the topological structure of networks and the attribute information of nodes to define the degree ratio and the similarity ratio, from which the trust-value was constructed. Combining the trust-value with PageRank, we proposed the trust-PageRank algorithm to identify influential nodes. In the experimental part, TPR was first applied to the Kite and Karate networks to verify the effectiveness of the algorithm. To evaluate the accuracy and infection ability of the nodes identified by the TPR algorithm, eight real networks were selected, and the SIR model and Kendall correlation coefficients were used to evaluate the propagation rate and the ability of the algorithm to identify influential nodes in the networks. The influential nodes of each network identified by the DC, BC, PageRank, HITS, and TPR algorithms were compared with the results obtained by the SIR model, and the ranking correlation coefficients between the selected algorithms and the SIR model were calculated for the top 30% of nodes. The results showed that the TPR algorithm can effectively identify the influential nodes, and that it has advantages in the initial propagation of SIR and in the identification of nodes with similar influence.
In general, we proposed an algorithm to identify influential nodes that considers not only the structure of the network but also the attribute information of nodes. The experimental results suggest that this approach deserves more attention and that it may be helpful for controlling information dissemination in real-world networks, the spread of infectious diseases, etc. In addition, nodes in some networks have specific features. For example, in social networks, people, as nodes, have features such as age, gender, and behaviors that may also affect the trust-value; these are worthy of further attention.

Conflicts of Interest:
The authors declare no conflict of interest.