A Central Edge Selection Based Overlapping Community Detection Algorithm for the Detection of Overlapping Structures in Protein–Protein Interaction Networks

Overlapping structures are prevalent in protein–protein interaction (PPI) networks across different biological processes, and they reflect the sharing of common functional components. The overlapping community detection (OCD) algorithm based on central node selection (CNS) is a traditional and accepted algorithm for OCD in networks. CNS consists of two main parts: the central node selection and the clustering procedure. However, the original CNS considers neither the influence among the nodes nor the importance of dividing the network by its edges. In this paper, an OCD algorithm based on central edge selection (CES) is proposed for detecting overlapping communities in PPI networks. Unlike the traditional CNS algorithms for OCD, the proposed algorithm uses community magnetic interference (CMI) to obtain more reasonable central edges in the process of CES, and employs a new distance between a non-central edge and the set of central edges to assign each non-central edge to the correct cluster during the clustering procedure. In addition, the proposed CES improves the strategy of overlapping nodes pruning (ONP) to make the division more precise. The experimental results on three benchmark networks and three biological PPI networks of Mus musculus, Escherichia coli, and Cerevisiae show that the CES algorithm performs well.


Introduction
Most biological processes are carried out by groups of densely connected proteins [1]. The protein–protein interaction (PPI) network captures the communications among protein groups that interact closely with each other [2], and it can be used to predict the complexity or function of normal proteins. The structures of PPI networks can reflect some principles of cellular organization [3]. Recently, graph theory has been widely used to detect potential biological significance in PPI networks by regarding the proteins as nodes and the interactions

Data Source
To assess the viability of CES and compare its performance with other algorithms, six real networks were selected: three benchmark networks (Zachary's Karate Club Network [32], Dolphins Social Network [33], and American College Football Network [9]) and three protein interaction networks (E. coli Network, M. musculus Network, and Cerevisiae Network) (Table 1). The three benchmark networks describe communities related to social communication or animal groups. (1) The Karate network describes the interactions between pairs of members, affected by two coaches, in a karate club at a university in the United States. The nodes and edges refer to students and the communications among them, respectively. The resulting network includes 34 nodes and 78 edges. (2) The Dolphin network describes the relationships within two groups of bottlenose dolphins. After seven years of observation by Lusseau et al., a community including 62 nodes and 158 edges was obtained. Each edge represents the interaction between two dolphins, and the relationships in the community are relatively stable. According to the real situation, these dolphins can be divided into two categories. (3) The Football network, with 115 nodes and 612 edges, describes the American college football matches played in 2000 among 115 teams belonging to 12 different clubs. The nodes, edges, and categories represent the teams, the matches between pairs of teams, and the 12 clubs, respectively.
The other three datasets are as follows.
(1) E. coli: This dataset describes the interactions between the proteins in E. coli. Each node in the network represents a protein, and an edge between two nodes represents a relationship between the two proteins. The final network has 1396 nodes and 2092 edges. After removing these networks, the network with 344 nodes and 513 edges can be constructed. This dataset is the core protein interaction dataset of the E. coli species, and the dataset name is Ecoli20170205.
(2) M. musculus: This dataset describes the interactions between the proteins in M. musculus. Each node in the network represents a protein, and an edge between two nodes represents a relationship between the two proteins. The final network has 1883 nodes and 2597 edges. After removing these networks, the network with 941 nodes and 1149 edges can be built. This dataset is the core protein interaction dataset of the M. musculus species, and the dataset name is Mmusc20170205.
(3) Cerevisiae: This dataset describes the interactions between the proteins in Cerevisiae. Each node in the network represents a protein, and an edge between two nodes represents a relationship between the two proteins. The final network has 2172 nodes and 5124 edges. After removing these networks, the network with 2110 nodes and 4936 edges can be built. This dataset is the core protein interaction dataset of the Cerevisiae species, and the dataset name is Scere20170205.

Procedure of the CNS
In 2017, Qi Jinshan and Liang Xun proposed CNS to detect overlapping communities [21]. It includes two main steps: the central node selection and the clustering procedure.
(1) In the first step, the central nodes are obtained by evaluating the influence of each node. Suppose that a network G = (V, E) is given, where V(G) and E(G) represent the sets of nodes and edges in the graph G, respectively.
The neighboring nodes N(v) of node v are defined by Formula (1). The influence IB(v_1, v_2) between node v_1 and node v_2 is defined by Formula (2), which is based on the Jaccard distance between node v_1 and node v_2.
The total influence ALL(v) of node v is defined by Formula (3). The strategy of the central node selection is that if the ALL value of node v is larger than that of each of its neighbors, then v is selected as a central node in the community.
(2) In the second step, the non-central nodes are clustered into the correct categories. The clustering procedure extends the communities, each of which is initialized from one central node. The relationship between a community and a node is defined by Formula (4), where EC_i represents a community to be extended, u represents a neighboring node of EC_i, and v represents a node that is both a neighbor of u and a member of EC_i. EC_i is enlarged by adding the neighboring nodes whose attract value is higher than the threshold ε = 0.4 [21]. The new community is obtained by iterating this search-and-add of neighboring nodes.
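The selection rule in step (1) can be sketched as follows. This is a minimal illustration, not the implementation of [21]: the ALL values are taken as given inputs because Formulas (2) and (3) are not reproduced here, the strict comparison with every neighbor is an assumed reading of the rule, and the function names and toy graph are hypothetical.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def jaccard_distance(adj: Graph, v1: Hashable, v2: Hashable) -> float:
    """Jaccard distance between the neighbor sets N(v1) and N(v2)."""
    union = adj[v1] | adj[v2]
    if not union:
        return 0.0  # convention chosen here for two isolated nodes
    return 1.0 - len(adj[v1] & adj[v2]) / len(union)


def select_central_nodes(adj: Graph, all_values: Dict[Hashable, float]) -> Set[Hashable]:
    """Return the nodes whose ALL value is strictly larger than that of every neighbor."""
    return {
        v for v, nbrs in adj.items()
        if nbrs and all(all_values[v] > all_values[u] for u in nbrs)
    }


# Toy path graph a - b - c with hypothetical ALL values.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
all_values = {"a": 1.0, "b": 3.0, "c": 2.0}
central = select_central_nodes(adj, all_values)
```

On the toy path graph only node b exceeds both of its neighbors, so it is the single central node.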

Limitation of CNS
Although the OCD algorithms based on central node selection have many advantages in detecting overlapping communities, such as combining the local and global information of the regulatory social networks, both the accuracy of the central node selection and the overlapping degree of the detected networks can still be improved. Specifically, the process of central node selection focuses only on the node itself and ignores the influence among the nodes, which may lead CNS to select incorrect central nodes. In addition, many constraints must be satisfied before a node is accepted as an overlapping node of each community it belongs to, which makes it difficult for CNS to obtain overlapping nodes. Therefore, the degree of overlap found by CNS is insufficient. In either case, the result of the community detection can hardly match the real network well. For instance, in a small benchmark demo network containing 8 nodes and 12 edges (Figure 1a), the ALL values of the nodes (Figure 1b) can be calculated by the CNS algorithm, and node 3 will be regarded as the only central node, whereas in the benchmark division two nodes, node 3 and node 6, are central nodes.
Molecules 2018, 23, x FOR PEER REVIEW 5 of 18

OCD Algorithm Based on Central Edge Selection (CES)
To avoid the shortcomings of CNS, we propose CES, which uses the information of edges to detect the overlapping communities. The workflow of the CES algorithm, shown in Figure 2, contains three major parts: a procedure of central edge selection, a clustering procedure, and an ONP step. The theory of CMI, introduced in Section 2.3.1, takes into consideration the influence among nodes to make the selected central nodes more reliable. The network is divided by edges to reduce the difficulty of obtaining overlapping nodes, and the division is then optimized by ONP.

Central Edge Selection
The process of central edge selection is composed of two parts: An improved central node selection integrated with CMI, and the central edge selection.
(1) In the first part, the CMI theory is used to improve the central node selection by letting each confirmed central node affect its neighboring nodes. The ALL value of a node is revised by Formula (5), where v and u refer to the confirmed central node and one of its neighboring nodes, respectively, and GF is a coefficient used to revise the ALL value according to CMI. The influence between nodes in the network is calculated using Formula (2); each time one central node is determined using the selection strategy of the CNS algorithm, the ALL values are updated by Formula (5).
The pseudo-code of the improved central node selection is given in Algorithm 1, whose inner loop revises the ALL values according to the CMI. In Algorithm 1, CN refers to the set of central nodes, and N(CN) represents all the neighboring nodes of the confirmed central nodes. Taking N(CN) into account reduces the possibility of two adjacent nodes both becoming central nodes, a case that should not occur in the real network.
(2) In the second part, after the central nodes are selected, all the edges connected to a central node are classified as central edges, and the remaining edges are classified as non-central edges.
For each central node, the central edges category (CEC) is determined by CEC(CE_i) = i, where CE_i = {e(v_1, v_2) | v_1 = v or v_2 = v} represents the set of central edges linked to the central node v with index i, and e(v_1, v_2) represents the edge between node v_1 and node v_2.
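The two parts above can be sketched together as follows. This is an illustration under stated assumptions rather than Algorithm 1 itself: the `revise` callback stands in for Formula (5), which is not reproduced here (the toy call simply halves a neighbor's value), the ALL values are hypothetical inputs, and excluding N(CN) from later candidates is an assumed reading of the pseudo-code. The toy graph is not the demo network of Figure 1a.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]
Edge = Tuple[Hashable, Hashable]


def select_central_nodes_cmi(
    adj: Graph,
    all_values: Dict[Hashable, float],
    revise: Callable[[float, Hashable, Hashable], float],
) -> List[Hashable]:
    """Pick the best remaining candidate, revise its neighbors, and repeat."""
    values = dict(all_values)          # keep the caller's ALL values untouched
    central: List[Hashable] = []
    excluded: Set[Hashable] = set()    # CN together with N(CN)
    while True:
        candidates = [
            v for v in adj
            if v not in excluded and adj[v]
            and all(values[v] > values[u] for u in adj[v])
        ]
        if not candidates:
            return central
        v = max(candidates, key=lambda c: values[c])
        central.append(v)
        excluded.add(v)
        for u in adj[v]:
            values[u] = revise(values[u], v, u)  # placeholder for Formula (5)
            excluded.add(u)


def central_edge_categories(edges: List[Edge], central: List[Hashable]) -> Dict[Edge, int]:
    """Map every edge incident to the i-th central node to category i."""
    return {e: i for i, v in enumerate(central) for e in edges if v in e}


edges = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)]
adj: Graph = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

all_values = {1: 1.0, 2: 1.0, 3: 5.0, 4: 4.5, 5: 2.0, 6: 4.0, 7: 1.0}  # hypothetical
central = select_central_nodes_cmi(adj, all_values, lambda value, v, u: value * 0.5)
cec = central_edge_categories(edges, central)
```

With the unrevised values, node 6 is never larger than its neighbor node 4, so only node 3 would be selected. After the neighbors of node 3 are revised, node 6 becomes the second central node, and edge (4, 5) is the only non-central edge.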
Considering the same demo network constructed in Section 2.2.2, more reasonable results are obtained after recalculating the benchmark network (Figure 1a) with the CES algorithm. In the first round, we calculate the ALL(v) values (Figure 3a), which are the same as the CNS results (Figure 1b), and regard node 3 as the first central node. Then the values of node 3's neighboring nodes are revised (Figure 3b) according to the theory of CMI, which is introduced in Section 2.3.1. Hence, the other central node, node 6, can be selected, because the values of node 6's neighboring nodes are now smaller than that of node 6, and node 1, node 2, node 4, and node 5 are not taken into account. Then, two overlapping nodes, node 4 and node 5, can be identified. The result is the same as the benchmark network division (Figure 1a).

Clustering Procedure
The clustering procedure takes the result of the central edge selection and categorizes the non-central edges in three steps: calculating the distance between each non-central edge and the central edges, allocating the non-central edge to the correct category, and converting the edge division into the node division.
(1) In the first step, a novel edge similarity measure ELC(e_k, e_j) [24], given by Formula (6), is used to calculate the distance between edges from edge information. In Formula (6), e(a, b) represents the edge with the two end nodes a and b, and N(a) represents the neighboring nodes of node a. The distance DNC(e_k, CE_i) between the non-central edge e_k and the set of central edges CE_i is then defined by Formula (7), where e_m and e_j represent central edges belonging to category i.
(2) After all the distances DNC(e_k, CE_i) of e_k are calculated, the minimum value of DNC(e_k, CE_i) is found, and the non-central edge e_k is assigned to the corresponding category i based on the NN algorithm [27].
(3) Finally, the edge division is converted to the node division. Each edge passes its category to its two end nodes. In this way, the node division of the network is obtained as the final result.
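The three steps can be sketched as below. The sketch makes two loudly labeled assumptions, because Formulas (6) and (7) are not reproduced here: `edge_distance` is a placeholder (a Jaccard distance over the neighborhoods of the two edges) rather than ELC, and DNC is taken as the mean distance to the central edges of a category. All names and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]
Edge = Tuple[Hashable, Hashable]


def edge_distance(adj: Graph, e1: Edge, e2: Edge) -> float:
    """Placeholder for ELC: Jaccard distance between the neighborhoods of two edges."""
    n1 = adj[e1[0]] | adj[e1[1]]
    n2 = adj[e2[0]] | adj[e2[1]]
    return 1.0 - len(n1 & n2) / len(n1 | n2)


def assign_non_central_edges(adj: Graph, edges: List[Edge], cec: Dict[Edge, int]) -> Dict[Edge, int]:
    """Give every non-central edge the category with the smallest DNC (nearest neighbor)."""
    by_category: Dict[int, List[Edge]] = {}
    for e, c in cec.items():
        by_category.setdefault(c, []).append(e)
    division = dict(cec)
    for e in edges:
        if e in cec:
            continue
        # Assumed DNC: mean distance from e to the central edges of each category.
        dnc = {
            c: sum(edge_distance(adj, e, ce) for ce in ces) / len(ces)
            for c, ces in by_category.items()
        }
        division[e] = min(dnc, key=dnc.get)
    return division


def edges_to_nodes(division: Dict[Edge, int]) -> Dict[Hashable, Set[int]]:
    """Each edge passes its category to both end nodes; several categories mean overlap."""
    nodes: Dict[Hashable, Set[int]] = {}
    for (a, b), c in division.items():
        nodes.setdefault(a, set()).add(c)
        nodes.setdefault(b, set()).add(c)
    return nodes


edges = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7)]
adj: Graph = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Central edges of central node 3 (category 0) and central node 6 (category 1).
cec = {(1, 3): 0, (2, 3): 0, (3, 4): 0, (4, 6): 1, (5, 6): 1, (6, 7): 1}
division = assign_non_central_edges(adj, edges, cec)
node_division = edges_to_nodes(division)
```

In this toy network the only non-central edge, (4, 5), is closer to the central edges of category 1, and node 4 ends up in both categories because it also touches the central edge (3, 4).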

ONP Procedure
In this paper, we improve the ONP algorithm [28] by combining two strategies. The two strategies are related: the first strategy is a special case of the second, and applying it first eliminates some pruning steps and saves running time of the CES algorithm.
(1) In the first strategy, an overlapping node whose connections to some category consist entirely of central edges can be removed from that category; that is, con(v_i, C_j) ∈ CE_i, where C_j represents the edges in the category j, and con(v_i, C_j) represents the connections between the central node v_i and C_j. It is not necessary to calculate the number of non-central edges between central nodes and categories.
For the example in Figure 4, suppose that node 1 and node 2 are central nodes and node 3 is the overlapping node. According to the first strategy, node 3 is assigned to the left category only, because the connection between node 3 and the right category is the edge between node 2 and node 3, which is entirely a central edge.
(2) In the second strategy, the connections of each overlapping node are distributed over its categories in different proportions, and an overlapping node is removed from every category in which its proportion is less than prop. Here clus(v_i) represents the categories of the node v_i, and the empirical value prop represents the threshold used during the ONP.
A simple network is shown in Figure 5 in which node 2 and node 7 are central nodes and node 3 is the overlapping node. The connection between node 3 and the right category has only one non-central edge, while the connection between node 3 and the left category has many non-central edges. Therefore, node 3 will be included in the left category only.
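Both pruning strategies can be sketched in a few lines. The formulas of the two strategies are not reproduced here, so the sketch assumes that the proportion of a node in a category is the share of the node's edges that fall in that category, and it additionally keeps at least one category per node; both choices, like the function name, are assumptions for illustration.

```python
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def prune_overlapping_nodes(
    division: Dict[Edge, int], central_edges: Set[Edge], prop: float
) -> Dict[Hashable, Set[int]]:
    """Return node -> categories after applying both ONP strategies."""
    # counts[v][c] = (edges of v in category c, how many of them are non-central)
    counts: Dict[Hashable, Dict[int, Tuple[int, int]]] = {}
    for e, c in division.items():
        for v in e:
            total, non_central = counts.setdefault(v, {}).get(c, (0, 0))
            counts[v][c] = (total + 1, non_central + (e not in central_edges))

    result: Dict[Hashable, Set[int]] = {}
    for v, per_category in counts.items():
        categories = set(per_category)
        if len(categories) > 1:  # only overlapping nodes are pruned
            degree = sum(total for total, _ in per_category.values())
            for c, (total, non_central) in per_category.items():
                only_central = non_central == 0      # strategy 1
                low_share = total / degree < prop    # strategy 2 (assumed proportion)
                if (only_central or low_share) and len(categories) > 1:
                    categories.discard(c)
        result[v] = categories
    return result


# Edge division of a toy network; every edge except (4, 5) is a central edge.
division = {(1, 3): 0, (2, 3): 0, (3, 4): 0, (4, 6): 1, (5, 6): 1, (6, 7): 1, (4, 5): 1}
central_edges = set(division) - {(4, 5)}
pruned = prune_overlapping_nodes(division, central_edges, prop=0.3)
```

Node 4 touches category 0 only through the central edge (3, 4), so the first strategy removes it from that category and it stays in category 1 only.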

Time Complexity Analysis
If the network is scale-free, as a PPI network is, then its degrees obey a power-law distribution [34]. Suppose n represents the number of nodes, m represents the number of edges, seed represents the number of central nodes, and adj(i) represents the number of node i's neighboring nodes. In the procedure of central edge selection, time is mainly spent in calculating the ALL values of all nodes, which is O(n^2) based on Formula (3); in the improved central node selection based on CMI, which is O(n × adj(i)) according to the improved CNS pseudo-code; and in the selection of central edges based on the central nodes, which is O(n). In the clustering procedure, time is mainly spent in dividing the non-central edges into appropriate categories, which is O(seed × m^2) according to Formulas (6) and (7). In the ONP procedure, time is mainly spent in finding the connections of overlapping nodes in different categories, which is O(n × m). In the power-law distribution, the probability that a node has degree k is P(degree = k) ∝ 1/k^γ. In 2001, Béla Bollobás et al. found γ = 3 in a big network [35]. The total degree of the network therefore satisfies

$$DN = n \left( 1 \times \frac{1}{1^3} + 2 \times \frac{1}{2^3} + \dots + n \times \frac{1}{n^3} \right) = n \sum_{k=1}^{n} \frac{1}{k^2} \le \frac{\pi^2}{6} \times n,$$

and the number of edges is m = DN/2 ≤ (π^2/12) × n. So, the final time complexity is O(n^2 + seed × m^2 + n + n × adj(i) + n × m), that is, O(n^2). Table 2 compares the running times of the algorithms. RT(s) in the table denotes the runtime in seconds, and the bold numbers mark the best RT among all algorithms.

CPM Algorithm
In 2005, Palla et al. proposed CPM, based on the theory of clique percolation, to analyze the overlapping community structure of networks [18]. CPM is built on the concept of a K-clique, which is a set of K nodes that are all connected with each other; two K-cliques are adjacent if they have (K − 1) common nodes. For a given K, CPM starts from any K-clique, searches all the K-cliques that can be reached through adjacent K-cliques, and places them in the same cluster. CPM then starts again from any K-clique that has not yet been assigned and repeats the search. The CFinder package [18] (version 2.0.6, Eötvös University, Budapest, Hungary) was used to run CPM.
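The K-clique adjacency rule can be illustrated with a brute-force sketch. It is only suitable for very small graphs and is not the CFinder implementation; the function name and the toy graph are hypothetical, and nodes are assumed to be sortable.

```python
from itertools import combinations
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]


def k_clique_communities(adj: Graph, k: int) -> List[Set[int]]:
    """Merge K-cliques that share K-1 nodes into communities (brute force)."""
    cliques = [
        frozenset(c) for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]
    parent = list(range(len(cliques)))  # union-find over the cliques

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:  # adjacent K-cliques
            parent[find(i)] = find(j)

    groups: Dict[int, Set[int]] = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return list(groups.values())


# Two triangles that share only node 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
adj: Graph = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

communities = k_clique_communities(adj, 3)
```

The two triangles share one node rather than two, so for K = 3 they are not adjacent and form two communities that overlap in node 3.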

LC Algorithm
In 2011, Kim Y et al. proposed LC based on hierarchical clustering [19]. The advantage of LC is that the node community scheme and the link community scheme can be compared quantitatively by measuring the unknown information left in the network besides the community structure. It can therefore be used to determine quantitatively whether link community schemes should be used rather than node community schemes. However, LC is easily trapped in a local minimum and tends to divide the communities into small clusters.

Evaluation
To evaluate the performance of our CES-based algorithm, we used three evaluation standards, EQ, NMI, and CR, and compared the results with those of CNS, CPM, and LC. For the PPI networks, an additional Gene Ontology (GO) enrichment analysis was introduced to evaluate the biological meaning of the networks constructed by the four algorithms.

EQ Algorithm
In 2004, Newman et al. proposed the modularity Q, which can be used to evaluate the result of non-overlapping community detection but is not suitable for evaluating overlapping communities. To amend it, in 2009, Shen et al. proposed the extended modularity EQ [29]:

$$EQ = \frac{1}{2m} \sum_{i} \sum_{v \in C_i,\, w \in C_i} \frac{1}{CN_v \, CN_w} \left[ EB_{vw} - \frac{D(v)\, D(w)}{2m} \right]$$

In the formula, m refers to the number of edges in the network, C_i refers to the i-th community, and CN_v and CN_w refer to the number of categories that node v and node w belong to, respectively. EB_vw is a logical value that represents the existence of the edge between node v and node w: 1 if the edge exists and 0 if it is missing. D(v) and D(w) represent the degrees of node v and node w, respectively. The EQ value ranges from 0 to 1, and a higher value indicates a structure closer to the standard division. In the exceptional case, when the result of the community structure is identical to the original standard division, the EQ value is 1.
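As a worked illustration of EQ, the sketch below evaluates the commonly cited form of the extended modularity of Shen et al. [29], written with the symbols defined above (CN, EB, D). It assumes that m is the number of edges in the whole network and that the double sum runs over ordered node pairs, including v = w; the function name and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]


def extended_modularity(adj: Graph, communities: List[Set[Hashable]]) -> float:
    """EQ = 1/(2m) * sum_i sum_{v,w in C_i} [EB_vw - D(v)D(w)/(2m)] / (CN_v * CN_w)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2            # edges in the network
    cn = {v: sum(v in c for c in communities) for v in adj}    # CN_v
    total = 0.0
    for community in communities:
        for v in community:
            for w in community:
                eb = 1.0 if w in adj[v] else 0.0               # EB_vw
                expected = len(adj[v]) * len(adj[w]) / (2 * m)
                total += (eb - expected) / (cn[v] * cn[w])
    return total / (2 * m)


# Two disconnected triangles, each taken as one community.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6)]
adj: Graph = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

eq = extended_modularity(adj, [{1, 2, 3}, {4, 5, 6}])
```

For two disconnected triangles the value is 0.5, which agrees with the ordinary modularity Q because no node is shared and every CN value is 1.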

NMI Algorithm
In 2009, Lancichinetti et al. proposed an evaluation measure called NMI [30,31], which is used here to evaluate the agreement between the CES result and the standard division. The NMI score ranges from 0, for completely different divisions, to 1, for identical divisions. The following equation defines NMI:

$$NMI(X|Y) = 1 - \frac{1}{2} \left[ H(X|Y)_{norm} + H(Y|X)_{norm} \right]$$

where X refers to the standard division of the community and Y refers to the community division constructed by CES. H(X|Y)_norm is the normalized conditional entropy of X with respect to Y, with

$$H(X|Y)_{norm} = \frac{1}{NC} \sum_{k=1}^{NC} \frac{H(X_k|Y)}{H(X_k)}$$

where NC represents the number of categories in the network, and X_k represents the network of category k; H(Y|X)_norm is defined likewise.
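A compact sketch of this measure for overlapping covers is given below. It follows the usual formulation of Lancichinetti et al., in which H(X_k|Y) is the smallest H(X_k|Y_l) over the clusters Y_l that pass a consistency constraint and falls back to H(X_k) otherwise; skipping clusters with zero entropy is a convention chosen here, and the function names are hypothetical.

```python
from math import log2
from typing import List, Set


def h(p: float) -> float:
    """Entropy contribution -p * log2(p), with h(0) = 0."""
    return -p * log2(p) if p > 0 else 0.0


def cond_entropy_norm(xs: List[Set[int]], ys: List[Set[int]], n: int) -> float:
    """H(X|Y)_norm: mean over the clusters X_k of H(X_k|Y) / H(X_k)."""
    total, used = 0.0, 0
    for x in xs:
        px = len(x) / n
        hx = h(px) + h(1 - px)
        if hx == 0:
            continue  # empty or all-covering cluster carries no information
        best = hx     # fallback when no Y_l passes the constraint
        for y in ys:
            p11 = len(x & y) / n
            p10 = (len(x) - len(x & y)) / n
            p01 = (len(y) - len(x & y)) / n
            p00 = 1 - p11 - p10 - p01
            if h(p11) + h(p00) > h(p01) + h(p10):
                py = len(y) / n
                joint = h(p11) + h(p10) + h(p01) + h(p00)
                best = min(best, joint - (h(py) + h(1 - py)))
        total += best / hx
        used += 1
    return total / used if used else 0.0


def overlapping_nmi(xs: List[Set[int]], ys: List[Set[int]], n: int) -> float:
    """NMI = 1 - (H(X|Y)_norm + H(Y|X)_norm) / 2 for two covers of n nodes."""
    return 1 - 0.5 * (cond_entropy_norm(xs, ys, n) + cond_entropy_norm(ys, xs, n))


standard = [{0, 1, 2}, {3, 4, 5}]
same = overlapping_nmi(standard, [{0, 1, 2}, {3, 4, 5}], 6)
different = overlapping_nmi(standard, [{0, 1, 3}, {2, 4, 5}], 6)
```

Identical covers give a score of 1, and the second cover, which swaps one node between the two groups, scores strictly lower.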

CR Algorithm
The CR describes the coverage of nodes in the detected community division compared to those in the original network. It is defined as CR = 100 × n'/n, where n' refers to the number of nodes in the produced community division, and n refers to the number of nodes in the original network.
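The definition amounts to a one-line computation, sketched here with a hypothetical function name and toy data; a node that appears in several communities is counted once.

```python
from typing import Hashable, Iterable, List, Set


def cover_rate(communities: List[Set[Hashable]], all_nodes: Iterable[Hashable]) -> float:
    """CR in percent: share of the network's nodes that appear in any community."""
    nodes = set(all_nodes)
    covered = set().union(*communities) & nodes
    return 100.0 * len(covered) / len(nodes)


cr = cover_rate([{1, 2}, {2, 3}], [1, 2, 3, 4])
```

Here three of the four nodes are covered, so CR is 75.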

GO Enrichment Analysis
In biological network studies, GO enrichment analysis is a common method for comparing the proteins (or genes) in a predicted network with the known, annotated functional groups and for evaluating how closely they are connected. Three major aspects are involved in the GO analysis: (1) biological process (BP) describes the functions or final outcomes of proteins from specific gene sets that carry the same function; (2) molecular function (MF) describes the biochemical activity of the given protein sets; and (3) cellular component (CC) describes the location of the proteins in a cell and the cellular anatomy. For each GO enrichment analysis, a p-value is calculated to evaluate the probability that a predicted protein module matches the protein list annotated to a particular term. Significant p-values indicate a strong association of the proteins with a group. In this paper, we adopt the p-value provided by the R package clusterProfiler [36] to analyze the PPI network division.
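To make the meaning of such a p-value concrete, the sketch below computes a one-sided hypergeometric tail probability, which is the usual basis of over-representation tests. It is an illustration only: the values reported in this paper come from clusterProfiler, and the function name, parameter names, and numbers here are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, module: int, hits: int) -> float:
    """P(X >= hits) when `module` proteins are drawn from `population` proteins,
    of which `annotated` carry the GO term."""
    upper = min(annotated, module)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, module - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, module)


# A module of 5 proteins, all annotated to a term carried by 5 of 10 proteins.
p = enrichment_p_value(population=10, annotated=5, module=5, hits=5)
```

Only one of the 252 possible modules of size 5 consists entirely of annotated proteins, so the p-value is 1/252.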

Benchmark Network
The four OCD algorithms (CES, CNS, CPM, and LC) were tested on the three benchmark networks (Karate, Dolphin, and Football), and the resulting divisions were evaluated by three criteria (EQ, NMI, and CR). The evaluation results of the four OCD algorithms on the three benchmark networks can be seen in Table 3. During the procedure of central edge selection, GF = 4.2 × node_num/edge_num, and during the overlapping nodes pruning, prop = node_num/edge_num, where node_num represents the number of nodes in the network and edge_num represents the number of edges in the network. Figure 6 shows the selection of GF, which is based on the value of EQ on the three networks; GF is finally selected as 4.2 × node_num/edge_num. BCN refers to the number of categories of each benchmark network as recorded in its publication, and the evaluation category number (ECN) represents the number of categories produced by the algorithms. The bold numbers are the best values among all algorithms.
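For the node and edge counts given in the Data Source section, these two settings work out as in the short calculation below; the function name and the dictionary are hypothetical helpers.

```python
from typing import Dict, Tuple

# (node_num, edge_num) of the three benchmark networks as stated in the text.
benchmarks: Dict[str, Tuple[int, int]] = {
    "Karate": (34, 78),
    "Dolphin": (62, 158),
    "Football": (115, 612),
}


def ces_parameters(node_num: int, edge_num: int) -> Tuple[float, float]:
    """Return (GF, prop) with GF = 4.2 * node_num / edge_num and prop = node_num / edge_num."""
    prop = node_num / edge_num
    return 4.2 * prop, prop


params = {name: ces_parameters(n, e) for name, (n, e) in benchmarks.items()}
for name, (gf, prop) in params.items():
    print(f"{name}: GF = {gf:.3f}, prop = {prop:.3f}")
```

Rounded to three decimals, this gives GF ≈ 1.831, 1.648, and 0.789 and prop ≈ 0.436, 0.392, and 0.188 for the Karate, Dolphin, and Football networks, respectively.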
(Karate Network, Dolphin Network, and Football Network). During the procedure of central edge selection, 4.2 GF   nodenum/edgenum and prop  nodenum/edgenum during the overlapping nodes pruning, where node_num represents the number of nodes in the network and edge_num represents the number of edges in the network. Figure 6 represents the selection of GF, which is based on the value of EQ on the three networks. GF is finally selected as 4.2 × nodenum/edgenum. BCN refers to the number of categories on benchmark networks that are recorded in each publication, and evaluation category number (ECN) represents the number of the category which is produced from the algorithms. The bold numbers are the best values among all algorithms. For the three datasets, the CES method achieved high scores for all three evaluations, and most of them surpassed the CNS, CPM, and LC methods. Furthermore, the ECN described by CES were identical to the known BCN.
In the Karate Network, CES has a better result than CNS, CPM, and LC for all three evaluation methods. The EQ value is 0.37 and the NMI value is 0.92. In addition, CES has a total cover rate. Additionally, the division of the CES has two categories, which is the same as the standard category. In the Dolphin Network, CES has a better result than CNS, CPM, and LC in NMI, with a value of 0.76. CES's EQ value is 0.38, which is slightly lower than CNS, as a result of which the number of the category CES has is the same as the number of the standard category, while CNS is inconsistent. Therefore, CNS is inaccurate in getting the correct number of categories, and the high EQ value of CNS has no significance. In addition, CES has a total cover rate. In the Football Network, CES has a better result than CNS, CPM, and LC in EQ, with a value of 0.4. CES's NMI value is 0.52 which is slightly lower than CNS, as a result of which the number of the category CES For the three datasets, the CES method achieved high scores for all three evaluations, and most of them surpassed the CNS, CPM, and LC methods. Furthermore, the ECN described by CES were identical to the known BCN.
In the Karate Network, CES has a better result than CNS, CPM, and LC for all three evaluation methods. The EQ value is 0.37 and the NMI value is 0.92. In addition, CES has a total cover rate. Additionally, the division of the CES has two categories, which is the same as the standard category. In the Dolphin Network, CES has a better result than CNS, CPM, and LC in NMI, with a value of 0.76. CES's EQ value is 0.38, which is slightly lower than CNS, as a result of which the number of the category CES has is the same as the number of the standard category, while CNS is inconsistent. Therefore, CNS is inaccurate in getting the correct number of categories, and the high EQ value of CNS has no significance. In addition, CES has a total cover rate. In the Football Network, CES has a better result than CNS, CPM, and LC in EQ, with a value of 0.4. CES's NMI value is 0.52 which is slightly lower than CNS, as a result of which the number of the category CES has is the same as the number of the standard category, while CNS is inconsistent. Hence, CNS is inaccurate on getting the correct number of categories, and the high NMI value of CNS has no significance. In addition, CES has a 99% cover rate and is almost completely covered. The visualization of the four algorithms' (CES, CNS, CPM, and LC) results on the three benchmark networks (Karate Network, Dolphin Network, and Football Network) is shown in Table 4, and Cytoscape [37] is used to visualize the network division. In addition, the results of LC on the three benchmark networks have big differences in the number of categories from the benchmark, so the results are meaningless and we do not show the results of LC.     significance. In addition, CES has a 99% cover rate and is almost completely covered. 
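As an illustration of how an EQ score of the kind reported above can be computed, the following is a minimal sketch that assumes EQ denotes the extended modularity commonly used for overlapping communities, in which each node pair inside a community is down-weighted by the number of communities each node belongs to. The function name and the toy graph are hypothetical and are not taken from the experiments.

```python
from itertools import product


def extended_modularity(edges, communities):
    """Extended modularity (EQ) of an overlapping cover of an undirected graph.

    Assumed definition:
      EQ = 1/(2m) * sum_c sum_{i,j in c} (A_ij - k_i*k_j/(2m)) / (O_i*O_j)
    where m is the number of edges, k_i the degree of node i, and O_i the
    number of communities that contain node i.
    """
    # Build an adjacency map from the undirected edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    m = sum(len(neigh) for neigh in adj.values()) / 2

    # O_i: how many communities each node belongs to.
    memberships = {}
    for community in communities:
        for node in set(community):
            memberships[node] = memberships.get(node, 0) + 1

    total = 0.0
    for community in communities:
        members = set(community)
        # Ordered pairs (including i == j), as in standard modularity.
        for i, j in product(members, repeat=2):
            a_ij = 1.0 if j in adj.get(i, ()) else 0.0
            k_i = len(adj.get(i, ()))
            k_j = len(adj.get(j, ()))
            total += (a_ij - k_i * k_j / (2 * m)) / (memberships[i] * memberships[j])
    return total / (2 * m)


# Hypothetical toy graph: two triangles joined by one bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_cover = [{0, 1, 2}, {3, 4, 5}]
toy_eq = extended_modularity(toy_edges, toy_cover)
```

For a cover without overlapping nodes every O_i equals 1, so the expression reduces to ordinary modularity; this makes it easy to sanity-check an implementation on a small graph before applying it to a real network.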

PPI Network
Three PPI networks, from M. musculus, E. coli, and Cerevisiae, were used to test and compare the performance of the four OCD algorithms (CES, CNS, CPM, and LC). The GF values for the three datasets were set to 0.9, 0.8, and 0.5, respectively, and prop was set to 0.1 for all datasets. In each dataset, CES showed higher EQ and CR than CNS, CPM, and LC (Table 5). The categories found by CES covered all nodes (proteins) in each of the three datasets, whereas CNS covered only 65%, 72%, and 55%, respectively, LC covered only 78%, 60%, and 92%, respectively, and CPM covered even less. Table 5 displays all categories found by the four algorithms; the bold numbers represent the best result among all algorithms. The LC results show much more overlap among categories in each dataset, which introduces higher network redundancy and therefore departs from the actual protein network structures. The PPI network divisions predicted by the four algorithms are visualized in Figure 7.
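The cover rate (CR) and the degree of overlap discussed above can be made concrete with a short sketch. It assumes that CR is the fraction of network nodes assigned to at least one category, and it uses the mean number of categories per covered node as a simple, hypothetical proxy for redundancy; neither helper name comes from the original implementation.

```python
def cover_rate(nodes, communities):
    """Fraction of nodes that belong to at least one community (assumed CR)."""
    all_nodes = set(nodes)
    if not all_nodes:
        raise ValueError("the network has no nodes")
    covered = set()
    for community in communities:
        covered |= set(community)
    return len(covered & all_nodes) / len(all_nodes)


def mean_memberships(communities):
    """Average number of communities per covered node (1.0 means no overlap)."""
    counts = {}
    for community in communities:
        for node in set(community):
            counts[node] = counts.get(node, 0) + 1
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Hypothetical toy example: six proteins, two categories sharing protein "c".
proteins = ["a", "b", "c", "d", "e", "f"]
categories = [{"a", "b", "c"}, {"c", "d"}]
cr = cover_rate(proteins, categories)
redundancy = mean_memberships(categories)
```

A cover with a high CR but a mean membership far above 1 assigns many proteins to many categories at once, which is the kind of redundancy attributed to LC in the text.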
GO enrichment analysis was performed, and the p-values for BP, MF, and CC were calculated to evaluate the connections between the predicted categories and biological functional protein groups (see details in Supplementary Table S1). Taking the overall performance of the algorithms into account, we considered categories (protein modules) with a p-value < 0.001 as significant; the total number of significant categories is summarized in Table 6. In most cases, CES predicted more significant categories than CNS, CPM, and LC. CPM showed a higher rate of significant categories, but because of its low CR it captures only relatively local relationships. LC over-partitioned the nodes, which led to higher numbers of total and significant categories with higher biases. When the overall CR is also taken into account, CES still showed the best results for community category prediction. The individual p-values were log-normalized, and their distributions are shown in Supplementary Figures S1-S3 to give an overall comparison among algorithms and datasets. Two categories predicted by the CES algorithm, No. 3 in the M. musculus dataset and No. 1 in the E. coli dataset, were selected to investigate the relationships among categories and overlapping nodes. The No. 1 significant category in E. coli includes six proteins: iscA, ECs3391, ECs3395, HSCB, hscA, and ISCU. Protein hscA, which is responsible for the transfer of iron-sulfur clusters, was considered the central node and contributed to the enriched category function. The No. 1 category was found to overlap with the 10th and 13th categories, with which it shares a common overlapping protein, ISCU, which assembles the Fe-S clusters. The 1st and 10th categories also overlap at two further proteins, ECs3391 and ECs3395, in addition to ISCU. 
ECs3391 is an iron-sulfur protein that helps the assembly of Fe-S clusters, and ECs3395 is a scaffold protein that works with ISCU in the formation of Fe-S clusters. The overall relationships of the three categories are shown in Figure 8. The individual protein functions can be found in Supplementary Table S2, along with the overlapping investigation in M. musculus.
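The text does not state which statistical test produced the GO enrichment p-values. As a hedged illustration, the sketch below uses the one-sided hypergeometric test, a common choice for GO term over-representation; the function name, argument names, and the numbers in the usage lines are hypothetical and do not reproduce any value from Table 6 or the Supplementary Materials.

```python
from math import comb


def go_enrichment_p_value(population, annotated, category_size, hits):
    """One-sided hypergeometric p-value for GO term over-representation.

    population    : number of proteins in the background network (N)
    annotated     : background proteins annotated with the GO term (K)
    category_size : number of proteins in the predicted category (n)
    hits          : proteins in the category annotated with the term (k)

    Returns P(X >= hits) when category_size proteins are drawn without
    replacement from the population.
    """
    upper = min(annotated, category_size)
    denominator = comb(population, category_size)
    numerator = sum(
        comb(annotated, i) * comb(population - annotated, category_size - i)
        for i in range(hits, upper + 1)
    )
    return numerator / denominator


# Hypothetical numbers: a 6-protein category in a 1000-protein background,
# where 20 background proteins carry the term and 4 of them fall in the category.
p = go_enrichment_p_value(population=1000, annotated=20, category_size=6, hits=4)
is_significant = p < 0.001  # threshold used in the text
```

In practice, one p-value is computed per GO term and per category, so a multiple-testing correction is often applied before comparing against the threshold; whether such a correction was used here is not stated in the text.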

Conclusions
In this study, a CES-based OCD algorithm was introduced to construct community networks. The improved CES method applies the CMI algorithm in the traditional central node selection step and combines it with central edge selection, so that both node and edge information are used for the main community construction. The clustering procedure then calculates the distance between each non-central edge and the central edges to allocate the non-central edges to the right categories. Finally, an improved ONP algorithm assigns the overlapping nodes to an appropriate community to complete the network construction. To evaluate the performance of network construction, the proposed CES method was tested on three benchmark networks and three protein-protein interaction networks, and compared with the CNS, CPM, and LC methods. The results indicated excellent performance of the CES algorithm on communities with moderate complexity. We therefore believe that the CES algorithm has the potential to produce more accurate and more complete networks for community studies, especially in sociology and systems biology. Our future work will focus on improving the efficiency and accuracy of the CES algorithm and on adapting it to dynamic network analyses.
Supplementary Materials: Figure S1: Comparison of three levels on M. musculus Network, Figure S2: Comparison of three levels on E. coli Network, Figure S3: Comparison of three levels on Cerevisiae Network.