Overlapping Community Discovery Method Based on Two Expansions of Seeds

: The real world can be characterized as a complex network sto in symmetric matrix. Community discovery (or community detection) can effectively reveal the common features of network groups. The communities are overlapping since, in fact, one thing often belongs to multiple categories. Hence, overlapping community discovery has become a new research hotspot. Since the results of the existing community discovery algorithms are not robust enough, this paper proposes an effective algorithm, named Two Expansions of Seeds (TES). TES adopts the topological feature of network nodes to ﬁnd the local maximum nodes as the seeds which are based on the gravitational degree, which makes the community discovery robust. Then, the seeds are expanded by the greedy strategy based on the ﬁtness function, and the community cleaning strategy is employed to avoid the nodes with negative ﬁtness so as to improve the accuracy of community discovery. After that, the gravitational degree is used to expand the communities for the second time. Thus, all nodes in the network belong to at least one community. Finally, we calculate the distance between the communities and merge similar communities to obtain a less-undant community structure. Experimental results demonstrate that our algorithm outperforms other state-of-the-art algorithms.


Introduction
Many complex systems exist in the form of networks in the real world, such as social networks [1,2], traffic networks [3,4], network sparsification [5] and protein interaction networks [6,7]. These complex systems can be characterized as complex networks sto in symmetric matrix for analysis and research. Entities in the complex network are represented by nodes, and the relationships between the entities are represented by edges [8,9]. Many researches based on complex networks have been investigated, such as social computing [10], network computation [11], and community discovery [12]. The community structure (module or cluster) is an important feature of a complex network, which means that the network is composed of several communities. The connections between the nodes in the community are very close, while the connections between the communities are relatively sparse [13]. The purpose of community discovery (or community detection [14]) is to mine community structures in a complex network. Community discovery can reveal the universal features of a complex network and help in understanding its topology accurately, which provides guidance for the use and transformation of the network and promotes the practical application of the network. Hence, community discovery has become one of the hotspots of complex network research [15] and various researches have been investigated, such as disjoint community detection [16,17], overlapping community detection [18], and multiobjective community detection [19].
Early researches on community discovery mainly focused on nonoverlapping communities, which assumed that each node belongs to only one community and there is no overlap of any two communities. Many representative algorithms have been proposed, such as the graph-partitioning-based method [20], label-propagation-based method [21], clustering method [22,23], and optimization method [24,25]. However, in the real world, things often have the characteristics of diversity. One thing often belongs to multiple categories and there may be overlap between communities. Therefore, overlapping community discovery has become a new research hotspot in recent years. Researches on overlapping community discovery can be divided into two categories: global-network-informationbased and local-network-information-based methods.
The methods based on global network information aim to find the community structure in the whole network by optimizing a certain global objective function using whole connection information, which mainly include the link-based method [26,27] and the clique percolation method [28]. These methods can get better results in community discovery, but they have high time complexity and are not suitable for large-scale complex networks with numerous nodes. The methods based on local information aim to find the community structure starting from a node in the network by optimizing a certain local objective function using local connection information, which mainly include the label propagation method [29,30] and the local community expansion method [31,32]. Since the process of community discovery is only related to the local information in the network, the time complexity is low. Thus, these methods are suitable for large-scale complex networks. However, their disadvantage is that when the parameters of the algorithms change slightly, the results of community discovery change remarkably.
To tackle this problem, this paper proposes an overlapping community discovery method based on Two Expansions of Seeds (TES). The main features of this method are that the topological feature of the network (node degree centrality) is used to define the gravitational degree and the local maximum node is taken as the seed. The reason is that the greater the gravitational degree of the node, the greater its influence and the stronger its information transmission ability in the network is, which is beneficial for robust community discovery. Then, the seed is expanded by the greedy strategy based on the fitness function. When new nodes are added to the community, the community structure may be changed, thereby, there may be nodes with negative fitness. To avoid such nodes, this paper adopts the community cleaning strategy. After the expansion based on the fitness function, a community can cover most of the nodes in the network, but there are still a small number of nodes that cannot be assigned to any community because of the uction of community fitness. To solve this problem, this paper uses a gravitational function to expand the nodes that are not included in any community for the second time. Thus, all nodes belong to at least one community. Finally, by calculating the distance between communities and merging similar communities, we effectively uce the undant communities. The main contributions of this paper are as follows: • We propose an overlapping community discovery algorithm named TES. • TES employs the gravitational degree to find the local maximum nodes as the seeds and expands these seeds by the greedy strategy. • Experimental results verify that our algorithm has better performance than other competitive algorithms.
The rest of this paper is organized as follows: Section 2 briefly summarizes the related work. Section 3 proposes our algorithm, named TES, which is composed of three parts: seed selecting, twice node expanding, and overlapping community merging. Section 4 reports the performance of TES. We draw the conclusion in Section 5.

Related Work
In this section, we will briefly review the categories of the overlapping community discovery methods first. Then, we will introduce the methods of local community optimization and expansion in detail and analyze the shortcomings of the-state-of-the-art algorithms. This paper aims to deal with the problem of unreasonable seed selection for local community optimization and expansion.
The overlapping community discovery methods can be divided into four categories: link-based method, clique percolation method, label-propagation-based method, and localcommunity-optimization-and-expansion-based method.

•
The link-based method converts the cluste objects into network edges (or links) and deals with these edges by nonoverlapping partitions. Since a node is usually a vertex of multiple edges, if these edges belong to different linked communities, the node is an overlapping node. The LINK algorithm [27] is representative of this method. In addition, k-means was employed to expand seeds twice in dynamic community detection [33]. • The clique percolation method considers that a community is composed of a number of fully connected subgraphs. defined as a clique, and an adjacent clique forms a community. Since a node may belong to more than one clique, it is an overlapping node. However, the algorithm has higher constraints on interconnected conditions and depends on the selection of parameter k. The CPM algorithm [28] is representative of this method. • The label-propagation-based method assigns a unique label to each node during initialization; updates the label and its membership by iteration; and finally, assigns the nodes with the same label to the same community. Apparently, if a node has multiple labels, the node is an overlapping node. The COPRA algorithm [29] is a representative of this method. • The local-community-optimization-and-expansion-based method starts from the local communities, expands the communities gradually based on the optimization function, and forms cross-regions between multiple extensions, thus finding overlapping community nodes. The representative algorithms are LFM [31] and GCE [32]. In addition to the above algorithms, there are some classical methods, such as the semisupervised learning method [34]; deep learning method [35]; and the CONGA algorithm [36], which splits the clone node by itself and adds a virtual edge between the split nodes to find the overlapping nodes.
Among the abovementioned methods, the fourth one-local community optimization and expansion-becomes more and more popular. For example, the research in [21] found that taking the local maximum node defined by the degree centrality as the seed can discover higher quality communities and avoid instability at the same time. The research in [37] was about two methods to define the node influence: the community structure of social networks and the influence-based measure of node intimacy center, and took the nodes with great influence as the seeds. The EAGLE algorithm took the largest clique in the network as the seed and igno the second largest one, which has high time complexity [38]. Another paper [39] selected a group of nodes as seeds that were closely connected in the network, namely, an Egonet (hawk-eye network), but this method is more suitable for networks with a large global clustering coefficient. A seed set expansion method based on graph partitioning was proposed in [40] to find a group of nodes with low conductivity, and the node closest to the cluster was taken as a seed. The online social network (OSN) algorithm, as a multilever community discovery algorithm, combined user interests and cohesiveness to coarsen the initial network and found an initial community assignment using stochastic inference in the coarsest network [41]. All these methods use the local topology information of the network to optimize the local optimization function to find the community structure in the network. It does not need to know the global topology of the network, and shows certain advantages in large-scale networks. Therefore, seed selection is the foundation of this kind of method, which will affect the quality of community structure mining. The LFM algorithm [31] and the DEMON algorithm [32] expand the community by random seed selection, which inevitably causes the instability of community discovery. The GCE algorithm improved the LFM algorithm by mining k-cliques as the seed through the classic Bron-Kerbosch algorithm in the network [42]. In this method, cliques are fixed, but the seed selection depends on the selection of parameter k, which can easily cause the problem of low network coverage.
To solve the problem of unreasonable seed selection for local community optimization and expansion, this paper proposes an overlapping community discovery algorithm based on two expansions of seeds. A node with the local maximum gravitational degree defined by degree centrality is taken as a seed. This method has the advantages of a high-quality community and robust results, but the disadvantage is that these communities cannot cover the whole network. To overcome this problem, the communities are expanded for the second time to ensure that each node belongs to at least one community.

Proposed Method
In this section, we propose the TES algorithm, which is composed of three parts. The first part employs the gravitational degree defined by the network topological feature (degree centrality) to find the local maximum nodes as the seeds. The second part expands these seeds by the greedy strategy based on the fitness function. Then, the communities are expanded for the second time based on the gravitational function. The third part calculates the distance between the communities and merges the similar communities to get the final communities.

Seed Selection
In actual networks, some nodes are usually closely connected with other nodes, called central nodes, which contribute greatly to information transmission. They are usually scatte across the whole network and located in regions where the nodes are more closely connected. This is consistent with the fact that the nodes in a community are closely connected, while the connections between communities are sparse. Hence, the central nodes can be taken as the seeds. The centrality of a node reflects its centrality and importance in the network [43]. Inspi by the gravitational relationships in the dynamic social network [44], this paper proposes a gravitational degree based on degree centrality to measure the influence of the central nodes on other nodes.
Newton's law of universal gravitation holds that any two particles are attracted by a force in the direction of the line between them. The gravitation is proportional to the product of their masses and inversely proportional to the square of their distance, as shown in Equation (1).
where g is the gravitational constant, m 1 and m 2 are the masses of two particles, and r is the distance between two particles. In this paper, a network is represented by an undirected graph G = (V, E), where V = {v 1 , v 2 , . . . , v n } is a set of n vertices and E = {e 1 , e 2 , . . . , e m } is a set of m edges.

Definition 1.
Node centrality is the degree of a node, denoted by d(v i ).

Definition 2.
If there is an edge between nodes v i and v j , then node v j is a neighbor of node v i . All neighbors of node v i are denoted by n(v i ).

Definition 3.
To measure the similarity between nodes v i and v j , this paper employs the Jaccard similarity coefficient [45], denoted by s(v i , v j ).
Definition 4. The distance between node v i and its neighbor v j is d Definition 5. The gravitation of node v i to its neighbor v j is Gr(v i , v j ).
Using the node degree to measure the quality of a node can reflect the ability of information transmission to its neighbor. The gravitational degree of v i to its neighbor v j is directly proportional to the node degree and inversely proportional to the distance between them. Definition 6. The gravitational degree of node v i is the sum of its gravitation to all nodes in the network.
The greater the gravitational degree of node v i , the greater its influence on the network. The stronger the information transmission ability of a node, the more likely it is to become a seed node.

Definition 7.
If the gravitational degree of a node is no less than that of all its neighbors, the node will be called the local maximum degree node of the network.
The local maximum node has a large gravitational degree and strong information transmission ability. Most of them are scatte in the network. Therefore, this paper selects the local maximum nodes as the seeds. The seed selection algorithm is shown in Algorithm 1. First, all nodes are marked as 0 and the gravitational degree of each node is calculated. The node with the largest gravitational degree is put into the seed set. Then, the node with the local maximum degree is marked as 1, and the node and its neighbors are moved out of the vertex set. Search for the next seed iteratively until all nodes have been marked and moved out of the vertex set.

Community Discovery
For each seed in seed set S, this paper iteratively adds its neighbors to the community to discover natural communities. There are many ways to expand the community, including the minimum one norm [20], the label propagation method [29], and the fitness function method [31,42]. This paper employs the fitness function method since it can provide good results on real datasets.
Definition 9. For community C in network G = (V, E), its fitness f (C) is defined as where d C in and d C out are the sum of the degrees of the nodes that are inside and outside community C, respectively. d C in = 2 * e(C) and d C out = |E| − e(C), where e(C) is the number of edges inside community C. α > 0 is an adjustment parameter. α in the fitness function is the resolution parameter, which can adjust the scale of the community discove. The smaller α is, the greater the influence of d C in . This will lead to a rapid increase of f (C) after adding node v i to community C. Therefore, community C can accept more nodes. When α tends to be 0, the community may expand to cover the entire network. On the contrary, the larger α is, the smaller the impact of d C in . This will lead to the tiny increase of f (C) after adding node v i . Therefore, a small community is formed. When . The more sparse the connection between community C and outside is, the smaller d C out is and the larger f (C) is, which can reflect the local connection density of community C. Figure 1, suppose community C is composed of nodes 5, 6, and 7. N(C) = {1} since node 1 connects with community C. Node 8 does not belong to N(C) since it does not connect with community C. Suppose there is an edge between nodes 5 and 8, node 8 belongs to N(C). d C in is 4 since the degrees of nodes 5, 6, and 7 in community C are 1, 1, and 2, respectively, and 1 + 1 + 2 = 4. Another method is to count the number of edges in community C. There are 2 edges in community C, thus, d C in = 2 * 2 = 4. Similarly, d C out = 14 − 2 = 12.

Example 2. In
Definition 10. The fitness f (v i ) of node v i can be obtained as follows: The disadvantage of this method is that although most of nodes can be assigned to the corresponding communities, some nodes fail to be assigned, thus resulting in low network coverage. Therefore, this paper expands the nodes that have not been assigned to the community for the second time. This is in accordance with the actual situation. For example, in a social network, everyone has friends and belongs to a circle of friends [37]. This paper assumes that each node belongs to at least one community. A gravitational function is defined by the ratio of the gravitation between nodes and the gravitational degree of nodes. The gravitation of node v i is the sum of the gravitational degrees between node v i and its neighbors. The more neighbors of node v i the community C contains, the greater the gravitation between the community and node v i is. The gravitational function is given as follows: Definition 11. The gravitation of community C to node v i is measu by the gravitational degree, and the gravitational function GF(C, v i ) is When the seed set is found in the first stage, the seed is expanded by the greedy strategy, that is, the local objective function of the community is maximized by adding node to the temporal community or deleting it from the community. We will show the principle of the algorithm as follows: We put a seed into temporal community C first. Then, we calculate the fitness of all its neighbors and add the maximum fitness neighbor v max into C, as shown in lines 3-7 of Algorithm 2. After adding the maximum fitness neighbor, the structure of the community will be changed. At this time, the fitness of each node for the new temporary community should be updated. If a node has a negative fitness, it will be removed from the community, as shown in lines 9-14 of Algorithm 2. Iterate the above expansion until the fitness decreases when any node is added. We store temporal community C into community set CS and remove these nodes from the network.
Obviously, when a community is expanded, the fitness of the nodes in the community and the neighbors need to be recalculated. To solve this problem, we adopt the following steps. If there is an edge between v i and v j , then d ij is 1; otherwise, it is 0. The initials of d C in and d C out are 0. If node v i is added into the community, we adopt Equations (10) and (11). If node v i is removed from the community, we adopt Equations (12) and (13).
Example 3. In Figure 1, according to Example 2, community C ={5, 6, 7}. Let node 1 be added into community C. According to Equation (10), d C∪{v 1 } in = 4 + 2 * 3 = 10, since when node 1 is added, there are three edges that are added into community C. According to Equation (11), d C∪{v 1 } out = 12 − 3 = 9. Let node 5 be removed from community C. According to Equation (12), since there is one edge in community C that connects with node 5.
According to Equation (13) Equation (8) is used to update the community fitness. In this way, we only need to know the degree of node v i and calculate d ij of the nodes which are both in community C and neighbors of v i . To further speed up the calculation, we store d C in and d C out , which will be updated when temporal community C adds a new node or removes a node, as shown in lines 7-8 and 11-12 of Algorithm 2.

Algorithm 2 GetNaturalcoms.
Require: network G = (V, E), seed set S, and parameter α; Ensure: community set CS; if f (v max ) > 0 then 7: C ← C ∪ v max ; 8: Update d C in and d C out ; 9: for each v j ∈ C do 10: if f (v j ) < 0 then 11: V ← V − C; 21: end for 22: return CS Finally, we expand nodes for the second time. If a node does not belong to any community, the node is merged into the community with the greatest gravitation, as shown in Algorithm 3.

Algorithm 3 ExpandingSecond.
Require: node set V, and community set CS; Ensure: community set CS; 1: if V = ∅ then 2: for each v j ∈ V do 3: for C i ∈ CS do 4: i max ← argmax({GF(C i , v j )}); 5: end for 6: end for 8: end if 9: return CS

Merging Overlapping Communities
In a nonoverlapping community, a node belongs to only one community [46], while a node may belong to multiple communities in an overlapping community. Therefore, there may be similarities between two communities. When a certain similarity is reached, the excessive overlapping phenomenon will occur, resulting in a undant community [47]. Hence, after discovering the communities, this paper defines a measure of community distance which is used to discover and merge the overlapping communities to simplify the community structure. Definition 12. The distance between communities C 1 and C 2 is In this paper, is the threshold of the distance parameter. If δ(C 1 , C 2 ) < , communities C 1 and C 2 are merged into one community since they overlap excessively. The Merge_Overlap algorithm is shown in Algorithm 4.
To avoid invalid calculations, we adopt the principle of inverted index to prune invalid detection of overlapping communities. Therefore, set Cp(v i ) is used to store the communities in which node v i belongs. An illustrative example is shown as follows: Suppose we have 3 communities: C 1 = {a, b}, C 2 = {b}, and C 3 = {c}. We know that Cp(a) = {1}, Cp(b) = {1, 2}, and Cp(c) = {3}. To obtain the overlapping community of C 1 , we calculate Cp(a) ∪ Cp(b) = {1, 2} since C 1 = {a, b}. Therefore, communities C 1 and C 2 are two overlapping communities. It is not necessary to calculate the distance between communities C 1 and C 3 . Therefore, the inverted index is an effective pruning strategy. According to the above example, we should create set Cp(v i ) at first, as shown in lines 1-7 of Algorithm 4. Apparently, if the number of elements in Cp(v i ) is greater than 1, it indicates that node v i belongs to multiple communities and is an overlapping node. We determine whether the communities in Cp(v i ) overlap or not, as shown in lines 8-19 of Algorithm 4.
To sum up, Algorithm 5 presents the overlapping community discovery algorithm based on two expansions of seeds.

Algorithm 4 MergeOverlap.
Require: network G = (V, E), community set CS, and parameter ; Ensure: the new community set CS; 1: for v i ∈ V do 2: for each C j ∈ CS do 3: if v i ∈ C j then 4: end if 6: end for 7: end for 8: for each C j ∈ CS do 9: for each v i ∈ C j do 10: if length(Cp(v i )) > 1 then 11: Cv ← Cv ∪ Cp(v i ); 12: end if 13: end for 14: for cv i ∈ Cv do 15: if dis(C j , C cv i ) < then 16:

Theoretical Analysis
The space complexity and time complexity of TES are O(k * n + m) and O(k * n 2 + m)), respectively, where k, n, and m are the number of seeds, nodes, and edges in G, respectively. The reason is shown as follows: The space complexity of network G is O(n + m). The space complexity of all neighbors of each node is O(m) since each edge should be calculated. Thus, the time complexity of n(v i ) of each node is also O(m). Further, the space complexity and time complexity of Suppose we find k seeds, where k << n. When a node is added into or removed from a community, no more than n edges are checked. Thus, the time complexity of Equations (10)- (13) are O(n). Hence, the time complexity of Equations (7)-(9) are also O(n). A node can be assigned into no more than k communities. Therefore, the time complexity of Algorithm 2 is O(k * n 2 ).
Suppose there are t nodes which are expanded twice, where t << n. Each node will be added into each community once. Thus, the time complexity of lines 3-5 is O(k * n). Therefore, the time complexity of Algorithm 3 is O(t * k * n).
Obviously, the time complexity of lines 1-7 of Algorithm 4 is O(k * n) since there are k communities and n nodes. Similarly, the time complexity of lines 8-19 of Algorithm 4 is also O(k * n).
Apparently, each community has no more than n nodes. Thus, the space complexities of these communities are O(k * n). Hence, the space complexities of Algorithms 2, 3, and 4 are O(k * n).
Since t << n, the time complexity of TES is O(n + m + k * n 2 + t * k * n + k * n) = O(k * n 2 + m) and the space complexity of TES is O(n + m + k * n) = O(k * n + m).

Baseline Methods
To verify the performance of TES, three state-of-the-art algorithms are selected: CONGA [36], COPRA [29], and LFM [31]. In addition, the TES algorithm has three key steps: searching for seeds, discovering communities based on two expansions, and merging overlapping communities. The two expansions of communities include the first expansion of the community based on the fitness function and the second expansion of the community based on the gravitational function. The community expansion based on the fitness function includes community cleaning. To verify the reasonability of these parts, four comparative algorithms-TES_Seed, TES_Unclean, TES_Fitness, and TES_Unmergeare constructed, and their specific descriptions are shown in Table 1.

Algorithms Description
TES_Seed Nodes are randomly selected as seeds. TES_Unclean Community cleaning is not performed after the first expansion. TES_Fitness The community is expanded only once based on the fitness function. TES_Unmerge Overlapping communities are not detected.

Benchmark Datasets
In this paper, we compare the performance of the TES algorithm on five real network datasets. The real network datasets are shown in Table 2.

Evaluation Criteria
To evaluate the performance of the proposed algorithm, this paper employs extended modularity [38] and overlapping modularity [53] as the evaluation criteria.
The main idea of modularity (Q) is that if a subgraph is a community, the number of edges of its internal nodes is greater than that of a randomly generated subgraph [54].
Unfortunately, the Q function can only be used to evaluate nonoverlapping communities. To evaluate the overlapping community structure, extend modularity (EQ) was proposed based on the Q function [38]. The EQ function is shown as Equation (15).
where m is the total number of edges of the network, K is the number of communities discove, d i is the degree of node v i , O i is the number of communities to which node v i belongs, and A is the adjacency matrix of the network. If there is an edge between v i and v j , then A ij = 1; otherwise, A ij = 0.
Overlapping modularity (Q ov ) is another method to evaluate the structure of overlapping communities [53], as shown in Equation (16): where m is the total number of edges of the network, A is the adjacency matrix of the network, β is the strength of an edge l = (i, j) which belongs to community C, k in j is the in-degree of node j, and k out i is the out-degree of node i. EQ and Q ov are both in the interval [0,1]. The greater they are, the better the community discovery results will be.

Parameter Selection
The selection of parameters will affect the results of community discovery. The TES algorithm has two parameters, α and . According to Equation (7), α = 1 is a special value. According to Equation (14), is in the range of (0,1). Thus, we select α in the range of    Figures 2 and 3 show the trend of EQ along with the increase of parameters α and , respectively. In general, the influence of α on community discovery is greater than that of . In Figure 2, it can be seen that with the increase of α, EQ increases first and then decreases. For different networks, the maximum EQ is obtained with different α. The maximum EQ values are achieved at α = 1.3 on the Karate and Football networks, α = 1 on the Dolphin and Lesmis networks, and α = 0.9 on the Power network. From Figure 3, it can be seen that has little influence on the Karate network and the Football network, but has the greatest impact on the Dolphin network. EQ values of the Dolphin and Lesmis networks are not significantly changed when ∈ [0.1, 0.5]. However, when ∈ [0.5, 0.9], EQ decreases rapidly as increases. All five networks obtain the maximum EQ when = 0.5.
In conclusion, for five real networks, when α = 1.3 and = 0.5 for the Karate and Football networks, α = 1 and = 0.5 for the Dolphin and Lesmis networks, and α = 0.9 and = 0.5 for the Power network, the optimal community discovery results can be achieved. Therefore, in the rest of this paper, TES selects the above parameters for different networks.

Module Performance Evaluation
In this subsubsection, we verify that each module has an effect on the improvement of the proposed algorithm. The experiments are carried out on five real networks, and evaluation criteria EQ is selected to evaluate the influence of each module on the TES algorithm. The parameters of TES_Seed, TES_Unclean, TES_Fitness, and TES_Unmerge are the same as those of TES. The experimental results are shown in Table 3. The coverage rates of the nodes with only one expansion and two expansions are calculated, respectively. Therefore, TES_Fitness with one expansion and TES with two expansions are selected. The coverage rates of the two algorithms are reported in Table 4.
From Table 3, it can be seen that all four parts of the TES algorithm have impacts on the TES algorithm and have different influence on different networks. Therefore, TES outperforms the other four algorithms. For example, TES gets 0.675 on Power dataset, which is larger than that obtained by the other four algorithms. According to Equation (15), we know that the greater EQ is, the better the community discovery results will be. The reasons are as follows: It should be noticed that the results of TES_Seed in Table 3 are not robust. The reason is that TES_Seed randomly selects the seed to expand, resulting in different community discovery results. Thus, the results are different even under the same parameters. Hence, the results of TES_Seed in Table 3 are the average value of 20 times. After the first community expansion based on the fitness function, TES_Unclean does not clean the community. When the community structure changes, there may be negative fitness nodes in the community, which will effect the quality of the community discovery results. TES_Unmerge has the most significant impact on the algorithm, which proves that excessive overlapping between communities has a great impact on community structure. TES_Fitness expands the community based on the seeds only once, which leads to the decrease of EQ and affects the coverage rate of network nodes. From Table 4, we know that the coverage rate of TES_Fitness are all less than 1. The reason is as follows: For complex networks with fewer nodes, the coverage rate of the nodes can be high with only one expansion. However, with the increase of the nodes, the network scale becomes larger and larger, and the coverage rate with only one expansion becomes lower and lower. After two expansions of the community, the TES algorithm can cover all nodes in the network completely, and a high coverage rate of 1.00 can be achieved for a large network such as Power.
Hence, we can safely say that the four parts of the TES algorithm are all very important. The community discovery result is robust since the local maximum node is selected as the seed based on the gravitational degree. Community cleaning can avoid negative fitness nodes when the community structure changes. The natural community can significantly increase the coverage rate of the network nodes through two expansions, and the merging of the overlapping communities can deal with undant communities effectively. The four parts can effectively improve the quality of community discovery.

Algorithm Performance Evaluation
To report the performance of the TES algorithm, this paper selects three state-of-theart algorithms: the CONGA algorithm, based on the splitting method for overlapping community discovery; the COPRA algorithm, based on the label propagation method; and the LFM algorithm, based on local community optimization and expansion. The parameter of CONGA is community number c, which needs to be determined according to the modularity degree function. The parameter of COPRA is the label length v, which is from 2 to 8 with steps of 1. The parameter of LFM is the resolution parameter α, which is from 0.8 to 1.5 with steps of 0.1. For each algorithm, we select the best results as the final results shown in Figures 4 and 5.  From Figures 4 and 5, TES outperforms all three competitive algorithms since both EQ and Q ov obtained by TES are better than those of the other three algorithms on the five datasets. For example, from Figure 4, we know that EQ of TES is 0.417 on the Karate network, while the other three algorithms are all less than 0.4. Similarly, Q ov of TES is 0.697 on the Power network, while the other three algorithms are all less than 0.56. As we know, the greater EQ and Q ov are, the better the community discovery results will be. Hence, the community discovery results of TES are significantly improved compa with the other three algorithms. The reason is that the natural community discovery is based on local community optimization, and the expansion is only related to the local topology structure of the network, not the global topology of the whole network. Although the LFM algorithm is based on local community optimization, EQ and Q ov values achieved by the LFM algorithm are lower than that of the COPRA algorithm on the Lesmis and Football networks, but higher than that of the COPRA algorithm on the other three networks. The reason is that the LFM algorithm randomly selects seeds, meaning that the community structure discove is not robust.
In summary, TES has better performance than all competing algorithms.

Case Study
To further clarify the performance of TES, the Karate network is employed to show the community discovery results. Figure 6 shows the community discovery results obtained by the TES algorithm. The seeds are nodes 1, 17, 26, and 34, and four communities are obtained. Node 10 is the overlapping node of the grey and yellow communities. It can be seen from Figure 6 that the TES algorithm finds four communities, while the CONGA and COPRA algorithms both discover two communities and the LFM algorithm discovers five communities. Compa with the CONGA and COPRA algorithms, the partition of the network by TES algorithm is more detailed. For example, community {5,6,7,11,17} is closely related to node 1, but the nodes inside community {5,6,7,11,17} have stronger connection relationship with each other. The TES algorithm can mine small communities in large-scale communities, mainly because in the first part of the algorithm, the center node with strong information transmission ability is taken as the seed. Although the LFM algorithm discovers five communities, the seed does not have centrality since the LFM algorithm randomly selects seeds. The expanded community structure locality is poor, and a community is included in another community. The reason for this kind of situation is that the LFM algorithm does not detect the merged undant community, which illustrates the importance of detecting overlapping community in the TES algorithm.

Conclusions
In this paper, we propose an overlapping community discovery algorithm, named TES, which has three parts. In the first part, the local maximum node is taken as the seed based on the gravitational degree. The second part discovers the natural community by two expansions. The community is expanded based on the fitness function. After adding a new node, the community is cleaned. The second expansion is based on the gravitational function. The third part examines and merges the overlapping communities. To verify the reasonability of these parts, four comparative algorithms, TES_Seed, TES_Unclean, TES_Fitness, and TES_Unmerge, are proposed. Besides these four algorithms, three state-of-the-art algorithms: CONGA, COPRA, and LFM, are employed. Experimental results on five real networks report that TES outperforms all these competitive algorithms. Informed Consent Statement: Informed written consent was obtained from the authors for publication of this paper. Data Availability Statement: Data was obtained from http://www-personal.umich.edu/mejn/netdata/.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this article.