Community Detection Based on Node Influence and Similarity of Nodes

Abstract: Community detection is a fundamental topic in network science, with a variety of applications. However, there are still fundamental questions about how to detect more realistic network community structures. To address this problem, and considering the structure of a network, we propose an agglomerative community detection algorithm based on node influence and the similarity of nodes. The proposed algorithm consists of three essential steps: identifying the central node based on node influence, selecting a candidate neighbor to expand the community based on the similarity of nodes, and merging small communities based on the similarity of communities. The performance and effectiveness of the proposed algorithm were tested on real and synthetic networks and evaluated using modularity and normalized mutual information (NMI). The experimental results show that the proposed algorithm is effective in community detection and is comparable to existing classic methods.


Introduction
Complex networks play an important role in representing complex systems in subject areas such as social networks, biology, psychology, informatics, and management [1]. A complex network can be represented by a graph G = (V, E), where V and E represent the sets of nodes and edges in the network, respectively [2]. In real networks, nodes and edges can represent various individuals and relationships. For example, identifying influential nodes is one of the research hotspots in the study of complex networks and is used to analyze the network structure [3,4]. However, using only a group of influential nodes does not enable one to grasp a network's hidden information completely. The community structure that exists in real networks can help us to analyze the network structure in depth [5]. It is generally believed that a community is a network subgraph with close internal connections and sparse external connections, which may have a certain independent structure or specific function [6]. In practice, community detection is useful in solving numerous problems affecting human communities, such as analyzing networks of social opinion [7], recommending products to customers [8], finding users displaying malicious activity to protect system security [9], and identifying influential nodes [10,11].
The question of how to detect a community is an important issue that attracts researchers from all over the world. There are signs of great progress in the research on community detection in complex networks, including GN [12], LPA [13], EM [14], and so on. GN is based on edge betweenness; LPA is based on label propagation; and EM is a clustering algorithm. However, these algorithms have certain limitations. The accuracy of the GN algorithm is high, but its complexity costs are much higher. LPA runs quickly, but its accuracy is unstable. EM displays a better ability to cluster, but it is unable to detect highly modular communities. Therefore, it is necessary to detect communities accurately and stably.
To address this problem, we propose a community detection algorithm named NINS (a community detection algorithm based on node influence and the similarity of nodes). In this paper, the NINS algorithm consists of three parts: identifying the central node, selecting a candidate neighbor to expand the community, and merging the community. The proposed algorithm is stable, and can detect modular and realistic communities. Experiments on real and LFR networks show that the proposed NINS is effective and quite comparable to existing community detection algorithms.
The structure of this paper is as follows: Section 2 reviews the related work. In Section 3, the proposed algorithm is introduced, including the detailed steps of the algorithm, complexity analysis, and a description of the implementation process. The network data descriptions and numerical results based on various methods applied to real and synthetic networks, respectively, are shown in Section 4. Moreover, the experimental results are discussed in Section 5. Finally, our conclusions are presented in Section 6.

Related Work
Community detection is a hot topic in network science. The earliest study on this subject was reported in 1970 [15]. Next, Girvan and Newman proposed the network community structure [12]. From that time until now, research on community detection has gradually developed. It has been proven that community detection is an NP-hard problem [16,17]. Some classic algorithms are listed in Table 1, where N is the number of nodes in the network, M is the number of edges in the network, and m is the number of iterations. From the perspective of the detected community structure, some algorithms (e.g., GN [12], Louvain [18], and CDIA [19]) detect non-overlapping communities, where each node belongs to only one community; other algorithms (e.g., CPM [20] and ONES [21]) can detect overlapping communities, where one node can belong to two or more communities. From the perspective of hierarchical clustering, these algorithms can be divided into two categories [16,22]. Some of them are agglomerative algorithms, where each node is initially assigned to its own community and the smaller communities are iteratively merged according to their similarities [18,23,24]. Other hierarchical clustering algorithms are divisive methods, where the whole network is taken as one community and is divided into smaller communities. GN [12] is the most popular divisive algorithm, assessing the central edges based on their shortest path centrality. However, it only accurately detects communities in small or medium networks of 10,000 nodes at most. Thus, Arasteh et al. [25] proposed a fast divisive algorithm based on edge degrees. Chen et al. [26] detected communities in complex networks using an edge-deleting algorithm with restrictions.
From another perspective, the label propagation algorithm (LPA) [13] is also a famous community detection algorithm. In LPA, each node has a unique label, which is updated based on the most common labels among its neighbors. The complexity of LPA is O(N). The convergence of LPA is provable mathematically but it requires the exact algorithm iteration number, which is always dependent on the network parameters (e.g., node and edge numbers) [27]. In 2019, Basuchowdhuri et al. [27] proposed a community detection algorithm named LINCOM, which involves two steps: selecting the broker node by means of an objective function and merging this node into the community with the majority of its neighbors.
From the perspective of deep learning, Al-Andoli et al. [28] introduced a deep-learning algorithm for community detection. To decrease the trainable parameters needed for the deep-learning model, they first divided the network into some smaller parts and proposed a novel similarity constraint function that improved the algorithm's effectiveness. Agrawal and Patel [29] proposed a community detection algorithm named SAG based on topological structure and node attributes. Tsitseklis et al. [30] proposed a scalable community detection method for complex data graphs via hyperbolic network embedding and graph databases. He et al. [31] used the modularity function to sample node sequences and learn node representation by using the skip-gram model to detect communities.
In recent years, some other algorithms have been proposed to detect communities. For example, Feng et al. [32] proposed a community detection algorithm based on node betweenness and structure similarity. Majid Arasteh [33] proposed a gravity algorithm to detect the communities of large-scale networks; the proposed algorithm runs quickly but its accuracy is not ideal. Pourabbasi [34] proposed a new single-chromosome evolutionary algorithm for community detection in complex networks by combining content and structural information. Newman [35] proposed an information-theoretic method for discovering the building blocks in specific networks to show the consistency of community structure in complex networks. Cauteruccio et al. [36] proposed an algorithm to identify virtual communities based on user stereotypes. Mengoni et al. [37] proposed an algorithm to identify hidden communities based on history analysis and session analysis of co-occurrence of activities.
In a bid to further support and enhance the study, discussion, and understanding of community detection, systematic literature reviews have been performed to analyze community detection approaches. Naik [38] surveyed parallel and distributed paradigms for community detection in social networks. Yassine et al. [39] reviewed community detection methods using social network analysis in online learning environments. Attea et al. [40] performed a review of heuristics and metaheuristics for community detection. Huang et al. [41] summarized the community detection methods in multilayer networks. Calderer [42] reviewed community detection in large-scale bipartite biological networks. Gasparetti et al. [43] reviewed community detection in social recommender systems. Rosvall [44] provided a focused review of community detection methods with different motivations, including the cut-based perspective, clustering perspective, stochastic equivalence perspective, and dynamical perspective. Dao et al. [45] conducted a comparative evaluation of community detection methods.

Algorithm
In this paper, based on node influence and the similarity of nodes, an agglomerative community detection algorithm named NINS is proposed to detect modular communities by producing groups of densely connected nodes. The proposed algorithm works on unweighted and undirected networks and detects non-overlapping communities, the number of which does not need to be set before the execution of the algorithm. NINS consists of the following three essential steps: identifying the central node based on node influence, selecting a candidate neighbor to expand the community based on the similarity of nodes, and merging small communities based on the similarity of communities. Table 2 summarizes the symbols and notations used in the paper.

| Notation | Description |
|---|---|
| $\Gamma(i)$ | neighbor set of node i |
| $k_i$ | degree of node i |
| $S_{ij}$ | the similarity of nodes i and j |
| $\mathrm{aveS}(\cdot, i)$ | the average similarity of node i with its neighbors |
| $\Gamma(i) \cap \Gamma(j)$ | common neighbor set of nodes i and j |
| $S$ | maximum number of nodes in the small community |
| $C_i$ | the i-th community |
| $n_q$ | the number of nodes that belong to the community $C_j$ and are connected to the community $C_i$ |
| $N$ | number of nodes in the network |
| $M$ | number of edges in the network |
| $t$ | number of small communities |
| $\langle k \rangle$ | average degree of nodes |
| $m$ | number of iterations |
| $d$ | average distance of nodes |
| $D$ | network diameter |
| $C$ | clustering coefficient of the network |
| $r$ | assortative coefficient of the network |
| $Q$ | modularity |
| $a_{ij}$ | represents whether nodes i and j are connected or not: if nodes i and j are not connected, $a_{ij} = 0$; otherwise, $a_{ij} = 1$ |
| $\delta(C_i, C_j)$ | represents whether nodes i and j are in the same community or not: $\delta(C_i, C_j) = 1$ if they are, and 0 otherwise |

Three Essential Steps of the Algorithm
Step 1: Identifying the Central Node Based on Node Influence
The central node is the node with the highest influence in the network, and it attracts its neighbors. In a network, the greater the degree of a node, the less it is affected by any single neighbor. For example, in a rumor-spreading process, the more neighbors a node has, the less it is influenced by any one of them. Thus, $1/k_j$ is used to represent node i's influence on its neighbor j. The influence of node i can be calculated as the sum of its influence on its neighbors:
$$I(i) = \sum_{j \in \Gamma(i)} \frac{1}{k_j} \tag{1}$$
where $\Gamma(i)$ is the neighbor set of node i and $k_j$ is the degree of node j.
In step 1, we sort the nodes by node influence and choose the first node as the central node of the community.
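The influence score of step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names `build_adjacency` and `node_influence` and the toy edge list are assumptions introduced here, and the graph is stored as a plain dictionary of neighbor sets.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def node_influence(adj):
    """I(i) = sum over neighbors j of 1 / k_j, where k_j is the degree of j."""
    return {i: sum(1.0 / len(adj[j]) for j in adj[i]) for i in adj}


# Hypothetical toy graph: hub 0 with leaves 1, 2, 3, plus the edge 1-2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
adj = build_adjacency(edges)
influence = node_influence(adj)

# Nodes sorted by decreasing influence; the first one is the central node.
ranking = sorted(adj, key=lambda i: influence[i], reverse=True)
```

On this toy graph the hub (node 0) ranks first. A useful sanity check is that, in a graph without isolated nodes, the influence scores sum to the number of nodes, because each node j hands out $1/k_j$ to each of its $k_j$ neighbors.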

Step 2: Expanding the Community Based on the Similarity of the Nodes
It is known that nodes within a community are more closely connected than those outside a community. In step 2, the similarity of the nodes and the average similarity of each node with its neighbors are used to select a candidate neighbor, which is used to expand the community. The AA [46] index is selected to measure the similarity of two nodes, as it reflects the degree of node connection better than other node similarity indices based on local structure information:
$$S_{ij} = \sum_{t \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log k_t} \tag{2}$$
where t is a common neighbor of nodes i and j.
The average similarity is proposed as a measure that reflects the average similarity of each node with its neighbors, and it can be calculated as follows:
$$\mathrm{aveS}(\cdot, i) = \frac{1}{k_i} \sum_{j \in \Gamma(i)} S_{ij} \tag{3}$$
In step 2, node i's neighbor j is added to the community when $\mathrm{aveS}(\cdot, j) < S_{ij}$. The reason for this is that the similarity among nodes in the same community is larger than their similarity to nodes that do not belong to that community. A higher similarity ensures that the community consists of more densely connected nodes. A neighbor of the central node does not necessarily belong to the central node's community; it is more likely to belong with the nodes to which it is most similar. The similarity of connected nodes in the same community is generally larger than the average similarity.
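The step-2 test can be illustrated with a small Python sketch. It assumes the standard Adamic-Adar form for the AA index (natural logarithm of the common neighbor's degree) and takes the average over all neighbors of a node; the function names and the toy graph (two triangles joined by one bridge edge) are hypothetical. The leaf shortcut follows the rule stated for Algorithm 1, where a neighbor with a single neighbor is merged directly.

```python
import math


def aa_similarity(adj, i, j):
    """Adamic-Adar index: sum over common neighbors t of 1 / log(k_t)."""
    # A common neighbor of two distinct nodes has degree >= 2, so log(k_t) > 0.
    return sum(1.0 / math.log(len(adj[t])) for t in adj[i] & adj[j])


def average_similarity(adj, j):
    """Mean AA similarity between node j and each of its neighbors."""
    return sum(aa_similarity(adj, j, n) for n in adj[j]) / len(adj[j])


def joins_community(adj, i, j):
    """Neighbor j joins i's community if j is a leaf or S(i, j) > aveS(., j)."""
    return len(adj[j]) == 1 or aa_similarity(adj, i, j) > average_similarity(adj, j)


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

inside = joins_community(adj, 2, 0)   # 0 shares neighbor 1 with node 2
across = joins_community(adj, 2, 3)   # 2 and 3 share no neighbors
```

Here node 0 passes the test relative to node 2, while node 3 (on the other side of the bridge) does not, because it shares no neighbors with node 2.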

Step 3: Merging Small Communities Based on Community Similarity
In step 3, a community $C_i$ with at most S nodes is treated as a small community. Because nodes within the same community are more closely connected, a small community is merged with the community to which it is most similar. Modularity maximization and label propagation are two measures of community similarity that are often used in optimization methods for detecting community structures in networks. However, it has been shown that modularity suffers from a resolution limit [47]; therefore, it is unable to detect small communities [46]. Thus, label propagation [13] is introduced to measure the similarity between a small community and the other communities (Equation (4)), in terms of $n_q$, the number of nodes that belong to the community $C_j$ and are connected to the community $C_i$.
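The merging step can be sketched as follows. Because the exact form of Equation (4) is not reproduced here, this sketch assumes that the similarity between a small community and a neighboring community is simply the count $n_q$ described above, and that ties are broken by list order; both choices, along with the function names and the toy graph, are assumptions made for illustration.

```python
from collections import defaultdict


def connected_node_count(adj, small, other):
    """n_q: number of nodes in `other` with at least one neighbor in `small`."""
    return sum(1 for v in other if adj[v] & small)


def merge_small_communities(adj, communities, size_limit=3):
    """Merge every community with at most `size_limit` nodes into the
    neighboring community that has the largest n_q (assumed similarity)."""
    comms = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for idx, small in enumerate(comms):
            if len(small) > size_limit:
                continue
            scores = [(connected_node_count(adj, small, c), k)
                      for k, c in enumerate(comms) if k != idx]
            scores = [s for s in scores if s[0] > 0]
            if not scores:
                continue  # no neighboring community to merge into
            best = max(scores, key=lambda s: s[0])[1]
            comms[best] |= small
            del comms[idx]
            changed = True
            break  # restart the scan, since the list was modified
    return comms


# Hypothetical toy graph: two 4-cliques joined by edge 3-4, plus node 8
# linked to node 0 in the first clique and to nodes 4 and 5 in the second.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),
         (8, 0), (8, 4), (8, 5)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

merged = merge_small_communities(adj, [{0, 1, 2, 3}, {4, 5, 6, 7}, {8}])
```

Node 8 touches one node of the first clique and two nodes of the second, so under the assumed measure it is merged into the second clique.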

The Proposed Algorithm and Complexity
First, the proposed algorithm calculates node influence and chooses the most influential node as the central node. Then, it selects a candidate neighbor to expand the community based on the similarity of nodes and the average similarity (a neighbor that has only one neighbor is merged into the community directly). The process runs iteratively. Next, according to the number of nodes in each community, the initial communities are divided into small and large communities. Finally, each small community is merged into the community with the highest number of neighbors. The algorithm stops when the number of nodes of each community in the network is greater than S. In this paper, S = 3. The pseudocode of NINS is shown in Algorithm 1. A real network, Karate [48], was selected to illustrate the algorithm.
The NINS algorithm takes $O(N\langle k \rangle)$ to calculate node influence and repeats this process N times in steps 1 and 2. According to the changes in the network, choosing the central node and selecting a candidate neighbor at each iteration takes $O(1 + \log N)$. In step 3, beginning with t small communities and merging these small communities into other communities ($t \le N$) takes $O(t\langle k \rangle)$ time. The algorithm does not stop until the number of nodes in every community is larger than S.
The complexity cost of the proposed algorithm is $O(N\langle k \rangle + N(1 + \log N) + St\langle k \rangle)$. When $N \log N > M$, the complexity of NINS is $O(N \log N)$; otherwise, it is $O(M)$. The complexity cost of GN [12] is $O(NM^2)$, that of Louvain [18] is $O(N \log N)$, that of LPA [13] is $O(N)$, that of CDIA [19] is $O(M)$, and that of EM [14] is $O(mN^3)$, where m is the number of iterations.
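The reduction from the three-term bound to the two cases can be made explicit. The following short derivation assumes that the degree symbol in the bound is the average degree, so that $\langle k \rangle = 2M/N$, and that S is treated as a constant (S = 3 in this paper):

$$\begin{aligned}
N\langle k \rangle &= N \cdot \frac{2M}{N} = 2M, \\
S\,t\,\langle k \rangle &\le S\,N\,\langle k \rangle = 2SM \quad (t \le N), \\
O\big(N\langle k \rangle + N(1 + \log N) + S\,t\,\langle k \rangle\big) &= O\big(M + N \log N\big) = O\big(\max(M,\ N \log N)\big).
\end{aligned}$$

Under these assumptions, the $N \log N$ term dominates when $N \log N > M$, and the $M$ term dominates otherwise.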

Algorithm 1: NINS Algorithm.
Input: Network G = (V, E), maximum number of nodes in the small community S
Output: communities
for each i in G.nodes do
    Calculate I(i) by Equation (1)
end
Sort in descending order (V, key = I(i))
for each i in G.nodes that does not belong to a community do
    Choose i with the highest influence as the central node
    for each j in i.neighbors that does not belong to a community do
        Calculate the node similarity S(i, j) by Equation (2)
        Calculate the average node similarity aveS(., j) by Equation (3)
        if node j has one neighbor or S(i, j) > aveS(., j) do
            Merge node j into the community C_i that node i belongs to
            for each b in j.neighbors that does not belong to a community do
                Repeat the similarity test and merge steps above, with j in place of i and b in place of j
            end
        end
    end
    communities.append(C_i)
end
for each C_i in communities do
    if n(C_i) ≤ S do
        Calculate similarities of C_i with its neighbor communities by Equation (4)
        Merge C_i into the community with the highest similarity with C_i
        Remove C_i from communities and update communities
    end
end
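The first two loops of Algorithm 1 (central node selection and community expansion) can be sketched in Python as below. This is one possible reading, not the authors' code: it assumes the standard Adamic-Adar form for Equation (2), the neighbor average for Equation (3), and it interprets the nested expansion as an explicit stack in which each newly merged node becomes the anchor for testing its own unassigned neighbors. The function name, the tie-breaking by node label, and the toy graph are hypothetical. The merging of small communities (step 3) is left out.

```python
import math


def nins_initial_communities(adj):
    """Steps 1 and 2 of NINS on an adjacency map {node: set(neighbors)}."""
    influence = {i: sum(1.0 / len(adj[j]) for j in adj[i]) for i in adj}

    def sim(i, j):
        # Assumed Adamic-Adar similarity over common neighbors.
        return sum(1.0 / math.log(len(adj[t])) for t in adj[i] & adj[j])

    def ave(j):
        # Average similarity of j with its neighbors.
        return sum(sim(j, n) for n in adj[j]) / len(adj[j])

    label = {}
    communities = []
    # Visit nodes by decreasing influence (ties broken by node label).
    for center in sorted(adj, key=lambda i: (-influence[i], i)):
        if center in label:
            continue
        cid = len(communities)
        members = {center}
        label[center] = cid
        stack = [center]
        while stack:
            i = stack.pop()
            for j in sorted(adj[i]):
                if j in label:
                    continue
                if len(adj[j]) == 1 or sim(i, j) > ave(j):
                    label[j] = cid
                    members.add(j)
                    stack.append(j)  # j becomes an anchor for its neighbors
        communities.append(members)
    return communities


# Hypothetical toy graph: two 4-cliques joined by the bridge edge 3-4.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
initial = nins_initial_communities(adj)
```

On this toy graph the two bridge endpoints have the highest influence, and each one collects its own clique, since the endpoints share no common neighbors across the bridge.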

Example of the Algorithm
A real network, Karate, was selected to illustrate the algorithm. The Karate network is shown in Figure 1. The central node was first selected by sorting node influence based on Equation (1). Node 34 was the most influential node and its influence value was 5.767. Next, the similarity of the node and its neighbors was calculated. Then the community was expanded. The initial communities are shown in Table 3. Small and large communities were distinguished based on whether the number of nodes in a community was greater than three or not. If the number of nodes in a community was greater than three, it was considered a large community; otherwise, it was considered a small community. The initial results showed two small communities, consisting of node 10 and node 17, respectively. For node 10, only node 34 was its neighbor. It was thus merged into the community $C_1$, to which node 34 belongs. For node 17, its neighbors were node 6 and node 7. Both nodes 6 and 7 were in the community $C_2$, and node 17 was merged into community $C_2$ accordingly. The final community result for Karate produced by the proposed algorithm is shown in Figure 2, and it was the same as the real community structure. It is worth noting that the size of nodes in the figure is positively correlated with node influence; that is, the greater the influence of the node, the larger the node in the figure.

Experiment
In this section, first, we describe the datasets used in our experiments. Then, we explain the evaluation criteria, referred to as modularity Q [23] and normalized mutual information (NMI) [49]. The results are explained at the end.

Data Description
To test the performance of NINS, seven real networks with different sizes were used for comparison with several classic methods. Dolphin [50] is a social network of 62 dolphins. Karate [48] is a real social network containing the network of friendships among the 34 members of a karate club at a US university. Football [12] is a network of American football games among Division IA colleges during the regular season of fall 2000. Course registration [19] is a record of college students at Northeastern University. NS [51] is a co-authorship network of scientists working on network science. Power [52] is the power grid of the western United States. Router [53] is a symmetrized snapshot of the structure of the Internet at the level of autonomous systems. Table 4 summarizes the key properties of the selected datasets.
Nine LFR networks [54] were generated and used to test the performance of NINS. The mixing parameter was fixed at 0.5, and the community size was bounded by minc = 10 and maxc = 20. The number of nodes in these LFR networks ranged from 1000 to 9000.

Evaluation Criterion
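The modularity Q [23] was used as one of the evaluation criteria to compare the performance of NINS with different algorithms on the considered real datasets:
$$Q = \frac{1}{2M} \sum_{i,j} \left( a_{ij} - \frac{k_i k_j}{2M} \right) \delta(C_i, C_j)$$
where −1 ≤ Q ≤ 1; M is the number of edges; $k_i$ is the degree of node i; and $a_{ij}$ represents whether nodes i and j are connected or not. If they are not connected, $a_{ij} = 0$; otherwise, $a_{ij} = 1$. $C_i$ is the community that node i belongs to. $\delta(C_i, C_j)$ represents whether nodes i and j are in the same community or not. If $\delta(C_i, C_j) = 1$, this means that nodes i and j are in the same community, that is,
$$\delta(C_i, C_j) = \begin{cases} 1, & C_i = C_j \\ 0, & \text{otherwise} \end{cases}$$
The normalized mutual information (NMI) [47] was the other evaluation criterion used to determine the performance of the proposed NINS. It can be calculated as follows:
$$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} N_{ij} \log\left( \dfrac{N_{ij} N}{N_{i\cdot} N_{\cdot j}} \right)}{\sum_{i=1}^{C_A} N_{i\cdot} \log\left( \dfrac{N_{i\cdot}}{N} \right) + \sum_{j=1}^{C_B} N_{\cdot j} \log\left( \dfrac{N_{\cdot j}}{N} \right)}$$
where A is the real partition, B is the detected partition, $C_A$ is the number of real communities, $C_B$ is the number of detected communities, $N_{ij}$ represents the number of nodes shared by real community i and the detected community j, the number of nodes is denoted as N, $N_{i\cdot}$ is the sum over row i of matrix $N_{ij}$, and $N_{\cdot j}$ is the sum over column j of matrix $N_{ij}$.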
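Both criteria are straightforward to compute directly from their definitions. The sketch below assumes the standard Newman modularity and the confusion-matrix form of NMI with natural logarithms; the function names, the toy graph, and the convention of returning 1 when both partitions consist of a single community are assumptions made for this illustration. The modularity loop is quadratic in the number of nodes and is meant for clarity, not speed.

```python
import math
from collections import Counter


def modularity(adj, label):
    """Q = (1/2M) * sum_ij (a_ij - k_i * k_j / 2M) * delta(C_i, C_j)."""
    two_m = sum(len(nb) for nb in adj.values())  # each edge counted twice
    q = 0.0
    for i in adj:
        for j in adj:
            if label[i] == label[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


def nmi(true_label, found_label):
    """Normalized mutual information between two partitions of the same nodes."""
    n = len(true_label)
    joint = Counter((true_label[v], found_label[v]) for v in true_label)
    row = Counter(true_label.values())   # sizes of real communities
    col = Counter(found_label.values())  # sizes of detected communities
    numerator = -2.0 * sum(
        n_ij * math.log(n_ij * n / (row[a] * col[b]))
        for (a, b), n_ij in joint.items()
    )
    denominator = (sum(c * math.log(c / n) for c in row.values())
                   + sum(c * math.log(c / n) for c in col.values()))
    # Assumed convention: two trivial one-community partitions agree fully.
    return 1.0 if denominator == 0 else numerator / denominator


# Hypothetical toy graph: two 4-cliques joined by the bridge edge 3-4.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
truth = {v: 0 if v < 4 else 1 for v in adj}
shuffled = {0: 0, 1: 0, 4: 0, 5: 0, 2: 1, 3: 1, 6: 1, 7: 1}

q = modularity(adj, truth)
same = nmi(truth, truth)
unrelated = nmi(truth, shuffled)
```

For this graph, splitting along the bridge gives Q = 11/26. A partition identical to the reference yields an NMI of 1, and the deliberately mixed partition `shuffled`, which is statistically independent of the reference, yields an NMI of 0.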

Experiment on Real Networks
The proposed algorithm was compared with several classic algorithms in seven real networks in terms of modularity Q and runtime. The results of the proposed NINS algorithm are shown in Table 5. For the Karate, Dolphin, Football and Course registration networks, the community results of the proposed algorithm are shown in Figures 2-5, respectively. The greater the influence of the node, the larger the node in the figure.
To verify the performance of NINS, comparison experiments were conducted with five different algorithms in five real networks. The modularity and runtime of the six methods in five real networks are shown in Figures 6 and 7, respectively. As shown in Figure 6, the modularity of the proposed NINS was better than that of LPA [13] and EM [14] in five real networks, close to that of CDIA [19], and lower than that of GN [12] and Louvain [18]. As shown in Figure 7, the runtime of the proposed NINS was lower than that of the GN and EM algorithms, and close to that of the Louvain, LPA, and CDIA algorithms. The NMI of the communities found in the Karate network using our method was 1, that of GN was 0.0819, that of Louvain was 0.6782, that of LPA was 0.4213, that of CDIA was 0.8372, and that of EM was 0.8372. The results obtained for the real networks indicate that the proposed NINS is effective in community detection and is comparable to existing classic methods. Two large networks (Router and Power) were selected to verify the performance of NINS, along with three different algorithms (LPA [13], EM [14], and Louvain [18]). The results of the three classic algorithms are shown in Table 6. As shown in Tables 5 and 6, Louvain achieved a higher modularity than NINS but took longer to run; LPA ran faster than NINS but achieved a lower modularity; and the modularity and runtime of EM were the worst. The results obtained in large real networks also indicate that NINS is effective in community detection and is comparable to existing classic methods.

Experiment on LFR Benchmark Networks
To verify the performance of NINS, comparison experiments were conducted with three classic algorithms (LPA [13], EM [14], and Louvain [18]) in nine LFR networks. Due to the high complexity of GN [12], we did not choose GN as a competitor in this section. The NMI and runtime of four methods in nine LFR networks are shown in Table 7 and Figure 8, respectively. A higher NMI value reflects better performance by the algorithm, which indicates the detection of a more realistic community.
As shown in Table 7, the proposed NINS performed better than the other three algorithms in detecting realistic communities. As shown in Figure 8, the runtime of the proposed NINS stayed close to that of the Louvain algorithm as the number of nodes increased; it was lower than that of the EM algorithm but higher than that of the LPA algorithm. The results obtained in LFR networks also indicate that the proposed NINS is effective and comparable to existing classic methods.

Discussion
There are some algorithms that are similar to our method. These algorithms all include identifying the central node, expanding communities, and merging small communities, but the specific methods are different. For example, reference [32] proposes a community detection algorithm that chooses central nodes based on the betweenness and average betweenness, and expands the community by adding each node to the community to which its most similar neighbor belongs. In general, measuring the influence of nodes using betweenness involves a high time complexity, as well as a low accuracy. The sum of a node's influence on its neighbors better reflects a node's influence. The similarity of connected nodes in the same community is generally larger than the average similarity. Thus, the NINS algorithm proposed in this paper first identifies the central node by calculating the influence of each node, then expands the community by computing the similarity of nodes and the average similarity, and thereafter merges the small communities into the community with the highest similarity. The complexity of NINS is lower than that of the algorithm proposed in [32].
Comparing the experiments on the Karate, Dolphin, and Football networks, where the networks represent real communities, the NMI values of the proposed NINS and the algorithm proposed in [32] are (1, 0.2767), (0.603, 0.4699), and (0.8921, 0.8677), respectively; the modularity values of the proposed NINS and the algorithm proposed in [32] are (0.3715, 0.2018), (0.4907, 0.4346), and (0.5684, 0.5857), respectively. The proposed NINS algorithm performs better than the algorithm proposed in [32] in the detection of more realistic and modular communities.
The performance and effectiveness of the proposed algorithm were tested on real and synthetic networks. First, the proposed algorithm was compared with five different algorithms in five small real networks based on modularity and runtime. As shown in Figures 6 and 7, the modularity of the proposed NINS was better than that of LPA [13] and EM [14], close to that of CDIA [19], and lower than that of GN [12] and Louvain [18]. The runtime of the proposed NINS was lower than that of the GN and EM algorithms, and close to that of the Louvain, LPA, and CDIA algorithms. Next, two large real networks were selected to test the performance of NINS. As shown in Table 6, the modularity of NINS was higher than that of LPA and EM but lower than that of Louvain. The runtime of NINS was better than that of EM and Louvain but worse than that of LPA. Finally, the proposed algorithm was compared with three different algorithms in nine LFR networks based on NMI and runtime. As shown in Table 7 and Figure 8, NINS was effective in detecting more realistic communities when compared with three classic methods (Louvain, EM, and LPA). Although it was slower than LPA, its runtime remained competitive. In general, the experimental results show that the proposed algorithm is effective in community detection and is comparable to existing classic methods.

Conclusions
In this paper, considering node influence and the similarity of nodes, we propose a community detection algorithm named NINS. The proposed algorithm detects non-overlapping communities in unweighted and undirected networks. NINS consists of the following three essential steps: first identifying the central node based on node influence, then selecting a candidate neighbor to expand the community based on the similarity of nodes, and finally merging the small communities into the community with the highest similarity. We compared the proposed algorithm with several classic algorithms in five small real networks and two large real networks, analyzing their modularity and runtime. To show the algorithm's performance and efficiency in networks that have a real community structure, the proposed algorithm was compared with three different algorithms in nine LFR networks with 1000 to 9000 nodes. The results show that NINS performs well in detecting more realistic communities, with a lower computational cost.
Furthermore, during the process of detecting communities, the influential nodes can be obtained, which is a great benefit for better understanding the network. At present, the algorithm detects only non-overlapping communities. It cannot detect overlapping communities. In the future, we will pay attention to the detection of overlapping communities.