# Graph-Based Community Detection for Decoy Selection in Template-Free Protein Structure Prediction


## Abstract


## 1. Introduction

#### 1.1. Related Work

## 2. Results

#### 2.1. Evaluation Setup

#### 2.2. Evaluation of Community Structure from Community Detection Methods

#### 2.3. Evaluation of Community Selection Strategies for Decoy Selection

#### 2.3.1. Rank-Based Comparison of Selection Strategies

#### 2.3.2. Impact of Graph Density on Purity

#### 2.4. Entropy-Based Evaluation of Identified Communities

A low entropy value indicates that the near-native decoys are concentrated in a few communities, whereas a high entropy value indicates that the near-native decoys are spread over all or most of the communities. In the extreme, if all of the near-native decoys are present in a single community, then the value of entropy is 0 (the minimum possible value). A uniform distribution of near-native decoys over all communities yields the maximum value for entropy.
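The behavior described above can be checked with a minimal sketch. It assumes the entropy is the Shannon entropy (base 2) of the fraction of near-native decoys falling in each community; the function name `near_native_entropy`, the `community_of` mapping, and the choice of logarithm base are illustrative assumptions rather than details taken from the text.

```python
import math
from collections import Counter


def near_native_entropy(community_of, near_native):
    """Shannon entropy (base 2, an assumption) of near-native decoys over communities.

    community_of: dict mapping decoy id -> community id (hypothetical layout)
    near_native:  iterable of decoy ids considered near-native
    """
    counts = Counter(community_of[d] for d in near_native)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total  # fraction of near-native decoys in this community
        entropy -= p * math.log2(p)
    return entropy


# Toy data: 8 decoys assigned to 4 communities.
community_of = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C", 6: "D", 7: "D"}
print(near_native_entropy(community_of, [0, 1]))        # all in one community
print(near_native_entropy(community_of, [0, 2, 4, 6]))  # spread uniformly
```

Communities that contain no near-native decoys contribute nothing to the sum, so under these assumptions the maximum for k communities is log2(k).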

## 3. Discussion

## 4. Materials and Methods

#### 4.1. A Graph-Based Embedding of Decoys

#### 4.2. Community Detection Methods

**Edge betweenness (Girvan-Newman):** This approach was introduced to sidestep the drawbacks of hierarchical clustering. It rests on the intuition that edges linking communities are expected to have high edge betweenness, a measure that generalizes Freeman’s betweenness centrality [30] from vertices to edges. To reveal the underlying community structure of the network, the Girvan-Newman method successively removes edges with high edge betweenness. Measuring edge betweenness takes $O(|E| \cdot |V|)$ time. Since this step has to be carried out repeatedly (for each edge), the entire approach runs in $O(|E|^{2} \cdot |V|)$ time.
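As a rough illustration of the edge-removal idea, the sketch below computes edge betweenness with a breadth-first (Brandes-style) accumulation and removes the highest-scoring edge until the graph splits. It assumes an unweighted, undirected graph without self-loops, stored as a dict of neighbor sets; the helper names and the toy graph (two triangles joined by one bridge edge) are hypothetical and not part of the original method description.

```python
from collections import deque


def build_adj(edges):
    """Undirected adjacency (dict of neighbor sets) from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_betweenness(adj):
    """Number of shortest paths through each edge (unweighted, undirected)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:  # BFS that also counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    # every vertex pair was counted from both endpoints
    return {e: b / 2 for e, b in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, queue = {s}, deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v] - part:
                part.add(w)
                queue.append(w)
        seen |= part
        parts.append(part)
    return parts


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the component count increases."""
    work = {u: set(nb) for u, nb in adj.items()}
    start = len(components(work))
    while len(components(work)) == start:
        scores = edge_betweenness(work)
        if not scores:  # no edges left to remove
            break
        u, v = max(scores, key=scores.get)
        work[u].discard(v)
        work[v].discard(u)
    return components(work)


adj = build_adj([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
print(girvan_newman_split(adj))
```

Recomputing all betweenness scores after every removal is what makes the full procedure expensive, in line with the running time quoted above.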

**Leading Eigenvector (LE):** The prime objective of this method is modularity maximization (in terms of the eigenspectrum of the modularity matrix) across possible subdivisions of a network [31]. With repeated divisions, the method computes a leading eigenvector that partitions the graph into two subgroups, so that the maximal improvement of modularity is achieved at every step. The process terminates when the change in modularity for a sub-network becomes negative. The method also yields two additional outcomes: a spectral measure of bipartite structure in the network and a centrality measure that detects the vertices occupying central positions in communities. In general, the partitioning step takes $O(|V|(|E|+|V|))$ time.

**Walktrap (WT):** This method employs random walks to capture the structural similarity between vertices (or groups of vertices). The underlying intuition is that vertices within the same community are expected to be separated by shorter random-walk distances [32]. The method uses an agglomerative approach that starts from $|V|$ singleton communities and hierarchically merges two adjacent communities at each step. This is an effective approach to handle dense subgraphs of sparse graphs, which is most often the case for real-world complex networks. The method runs in time $O(|E||V|^{2})$ and space $O(|V|^{2})$ in the worst case.

**Label Propagation (LP):** This method is based on the intuition that each vertex in the network is supposed to follow the majority of its neighbors when joining a community [33]. The method relies only on the network structure, rather than on a predefined objective function (to optimize) or a-priori information on the communities. At the beginning, a unique label is assigned to each vertex; that is, the method initializes $|V|$ singleton communities. In subsequent steps, each vertex adopts the label held by the majority of its neighbors at that instant. This iterative process propagates labels through the network and helps densely connected vertices reach a consensus on a unique label. The process halts when each vertex carries the same label as most of its neighbors. The algorithm takes linear time in the number of edges ($O(|E|)$).
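The update rule described above can be sketched as follows. This is an illustrative, asynchronous variant under stated assumptions: ties among equally frequent neighbor labels are broken at random, a vertex keeps its label if that label is already among the most frequent ones, and a sweep cap (`max_sweeps`) guards against non-termination. The function name, seed, and cap are hypothetical choices, not details from the text.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=1000):
    """Asynchronous label propagation on a dict of neighbor sets."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # every vertex starts in its own community
    nodes = list(adj)

    def top_labels(v):
        counts = Counter(labels[w] for w in adj[v])
        best = max(counts.values())
        return [lab for lab, c in counts.items() if c == best]

    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # visit vertices in random order
        for v in nodes:
            if adj[v]:
                best = top_labels(v)
                if labels[v] not in best:
                    labels[v] = rng.choice(best)
        # stop once every vertex holds a most-frequent label of its neighbors
        if all(not adj[v] or labels[v] in top_labels(v) for v in nodes):
            break
    return labels


# Toy graph: two disconnected triangles.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(adj))
```

Because the outcome depends on visiting order and tie-breaking, different seeds can give different partitions on less clear-cut graphs.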

**Louvain (Lo):** This is a heuristic method for modularity optimization [34]. The method iterates over two stages. The first stage starts from an initial partition in which each vertex is assigned to its own (singleton) community. For each vertex, the modularity gain of moving it to a neighboring community is measured, and the vertex is moved to the community that yields the maximum positive gain. The order in which vertices are explored does not significantly affect the resulting modularity but may affect computation time. The second stage constructs a new weighted network whose vertices are the communities generated by the first stage. The two stages are repeated until maximum modularity is achieved.

**InfoMap (IM):** This method identifies communities by using random walks along with information flow analysis [35]. The vertices and their connections are decomposed into modules so that the resulting representation retains as much information about the actual network as possible. The method assigns codewords to vertices so that the dynamics on the network can be described efficiently. A signal is transmitted (via a limited-capacity channel) to a decoder, which tries to decode the message and to form viable candidates for the actual network. The lower the number of candidates, the more information about the actual network has been transmitted. The method runs in $O(|E|)$ time.

**Greedy Modularity Maximization (GMM):** This is a hierarchical agglomeration method that makes use of a greedy optimization approach. The underlying assumption is that high values of modularity are associated with good communities [36]. Initially, each vertex forms its own community. Then, the two communities whose merger yields the maximum modularity gain are combined. This step is repeated $(|V|-1)$ times. The process is represented as a hierarchical tree-like structure (a dendrogram), whose leaves represent the vertices of the actual network and whose internal vertices correspond to the merges; that is, the dendrogram shows a hierarchical (level-wise) decomposition of the network into communities. The method runs in $O(|E| d \log |V|)$ time, where d is the depth of the dendrogram representing the network’s community structure.
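The agglomeration loop can be illustrated with a deliberately naive sketch. It assumes the standard partition modularity $Q=\sum_{c}\left[\frac{L_c}{m}-\left(\frac{d_c}{2m}\right)^{2}\right]$ (with $L_c$ the internal edges and $d_c$ the total degree of community c), recomputes Q from scratch for every candidate merge, and therefore does not reproduce the efficient data structures behind the running time quoted above. All function names and the toy graph are hypothetical.

```python
def modularity(adj, communities):
    """Partition modularity Q for an unweighted, undirected graph."""
    m = sum(len(nb) for nb in adj.values()) / 2
    q = 0.0
    for comm in communities:
        internal = sum(len(adj[u] & comm) for u in comm) / 2
        degree = sum(len(adj[u]) for u in comm)
        q += internal / m - (degree / (2 * m)) ** 2
    return q


def greedy_modularity(adj):
    """Merge the best pair |V|-1 times; return the best partition seen."""
    communities = [frozenset([v]) for v in adj]
    best, best_q = list(communities), modularity(adj, communities)
    while len(communities) > 1:
        top = None
        for i in range(len(communities)):
            for j in range(i + 1, len(communities)):
                merged = (communities[:i] + communities[i + 1:j]
                          + communities[j + 1:]
                          + [communities[i] | communities[j]])
                q = modularity(adj, merged)
                if top is None or q > top[0]:
                    top = (q, merged)
        q, communities = top  # apply the merge with the largest gain
        if q > best_q:  # remember the best level of the dendrogram
            best, best_q = list(communities), q
    return best, best_q


# Toy graph: two triangles joined by the bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
partition, q = greedy_modularity(adj)
print(partition, round(q, 4))
```

Keeping the best-scoring level, rather than the final single community, corresponds to cutting the dendrogram where modularity peaks.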

#### 4.3. Metrics for Evaluating Community Detection Methods

**Fraction Over Median Degree (fomd):** Let $d(u)$ denote the degree of each vertex $u \in S$, and let $d_m$ be the median across the degrees $d(u)$. Then, fomd is determined as the fraction of vertices in S with an internal degree greater than $d_m$; that is, $f(S)=\frac{|\{u : u\in S,\ |\{(u,v) : v\in S\}| > d_m\}|}{n_S}$. The denser and more cohesive the communities, the higher the associated fomd scores.
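A small sketch of this score follows. It assumes, as in the formulation of [37], that the median $d_m$ is taken over the degrees of all vertices in the graph; if the median is meant to be taken over S only, the first line of `fomd` would change accordingly. The graph (a 4-clique attached to a path) and all names are illustrative.

```python
from statistics import median


def fomd(adj, S):
    """Fraction of vertices in S whose internal degree exceeds the median degree."""
    S = set(S)
    d_m = median(len(adj[u]) for u in adj)  # assumption: median over the whole graph
    above = sum(1 for u in S if len(adj[u] & S) > d_m)
    return above / len(S)


def build_adj(edges):
    """Undirected adjacency (dict of neighbor sets) from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# 4-clique {0,1,2,3} attached to the path 3-4-5-6-7.
adj = build_adj([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                 (3, 4), (4, 5), (5, 6), (6, 7)])
print(fomd(adj, {0, 1, 2, 3}), fomd(adj, {4, 5, 6, 7}))
```

On this toy graph the clique scores higher than the path, matching the statement that denser, more cohesive sets obtain higher fomd.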

**Max odf (out degree fraction):** Max odf evaluates the maximum fraction of edges of a vertex in community S that point outward from S. That is, $f(S)=\max_{u\in S}\frac{|\{(u,v)\in E : v\notin S\}|}{d(u)}$. According to Max odf, a community is characterized as a set of vertices that connect to more vertices within the set than to vertices outside of it. As a result, better communities are associated with lower Max odf scores.

**Triangle (Triad) Participation Ratio (tpr):** Let $T_c$ denote the number of vertices that form a triangle in S. The tpr metric measures the fraction of vertices in S that belong to a triangle and can be formulated as: $f(S)=\frac{|\{u : u\in S,\ \{(v,w) : v,w\in S,\ (u,v)\in E,\ (u,w)\in E,\ (v,w)\in E\}\ne \varnothing \}|}{n_S}$. Better community clustering yields higher tpr scores.
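The definition translates fairly directly into code. The sketch below assumes an undirected graph stored as a dict of neighbor sets and counts a vertex when two of its neighbors inside S are themselves adjacent; the function name and toy graph (a triangle with one pendant vertex) are illustrative.

```python
def tpr(adj, S):
    """Fraction of vertices in S that lie on a triangle fully inside S."""
    S = set(S)
    in_triangle = 0
    for u in S:
        nbrs = adj[u] & S  # neighbors of u that are also in S
        if any(w in adj[v] for v in nbrs for w in nbrs if v != w):
            in_triangle += 1
    return in_triangle / len(S)


# Triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(tpr(adj, {0, 1, 2}), tpr(adj, {0, 1, 2, 3}))
```

Adding the pendant vertex lowers the score because it participates in no triangle.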

**Internal Edge Density:** For a set S, let us denote the maximum number of possible edges by $m_{S\max}=n_S(n_S-1)/2$. The internal edge density is the ratio of the edges that are actually in S, denoted by $m_S$, over $m_{S\max}$; that is, $f(S)=\frac{m_S}{n_S(n_S-1)/2}$. This metric represents the internal connectivity of a cluster (community), and a higher score indicates that there are more connections among the vertices of that community.

**Average Internal Degree:** This metric determines the average internal degree of the members of set S and can be formulated as: $f(S)=\frac{2m_S}{n_S}$. The denser a community, the higher its average internal degree score.
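Both internal-connectivity scores depend only on $n_S$ and $m_S$, as the following sketch shows. It assumes an undirected graph stored as a dict of neighbor sets and a set S with at least two vertices (otherwise the density denominator is zero); the function names and toy graph are hypothetical.

```python
def internal_edges(adj, S):
    """m_S: number of edges with both endpoints in S."""
    return sum(len(adj[u] & S) for u in S) // 2


def internal_edge_density(adj, S):
    S = set(S)
    n_S = len(S)
    return internal_edges(adj, S) / (n_S * (n_S - 1) / 2)


def average_internal_degree(adj, S):
    S = set(S)
    return 2 * internal_edges(adj, S) / len(S)


# Triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
S = {0, 1, 2, 3}
print(internal_edge_density(adj, S), average_internal_degree(adj, S))
```

Here S holds 4 of the 6 possible edges, so the density is 4/6, while the average internal degree is 2 · 4 / 4 = 2.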

**Cut Ratio:** Let $C_S$ denote the number of edges going outward from a set S. The cut ratio measures the ratio of $C_S$ over all possible such edges and is defined as: $f(S)=\frac{C_S}{n_S(n-n_S)}$, where n is the number of vertices in the graph. Better communities are associated with lower scores.

**Expansion:** This metric calculates the number of edges (per vertex) going out of a set S and can be formulated as: $f(S)=\frac{C_S}{n_S}$. Lower scores correspond to better communities.

**Edges Inside:** This metric measures the internal connectivity of a set S as $f(S)=m_S$. Better communities are associated with higher scores.

**Conductance:** This metric is based on the combination of internal and external connectivity and is measured as: $f(S)=\frac{C_S}{2m_S+C_S}$. Lower scores correspond to well-separated communities.

**Normalized Cut:** This metric is defined as: $f(S)=\frac{C_S}{2m_S+C_S}+\frac{C_S}{2(m-m_S)+C_S}$, where m is the total number of edges in the graph. The metric accounts simultaneously for the dissimilarity across communities and the similarity within them, which avoids the unnatural bias toward splitting off small sets. Lower values of normalized cut indicate a better balance between these two objectives.
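The boundary-based scores above (cut ratio, expansion, conductance, and normalized cut) share the same ingredients, so they can be computed together. The sketch below assumes an unweighted, undirected graph stored as a dict of neighbor sets and a set S that is neither empty nor the whole graph; the function name and toy graph are illustrative.

```python
def boundary_scores(adj, S):
    """Cut ratio, expansion, conductance, and normalized cut of a vertex set S."""
    S = set(S)
    n = len(adj)                                      # vertices in the graph
    m = sum(len(nb) for nb in adj.values()) // 2      # edges in the graph
    n_S = len(S)
    m_S = sum(len(adj[u] & S) for u in S) // 2        # edges inside S
    c_S = sum(len(adj[u] - S) for u in S)             # edges leaving S
    conductance = c_S / (2 * m_S + c_S)
    return {
        "cut_ratio": c_S / (n_S * (n - n_S)),
        "expansion": c_S / n_S,
        "conductance": conductance,
        "normalized_cut": conductance + c_S / (2 * (m - m_S) + c_S),
    }


# Toy graph: two triangles joined by the bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(boundary_scores(adj, {0, 1, 2}))
```

For the triangle {0, 1, 2}, $C_S=1$, $m_S=3$, $n_S=3$, $n=6$, and $m=7$, so the four scores are 1/9, 1/3, 1/7, and 1/7 + 1/9, respectively.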

**Coverage:** This metric measures the ratio of the number of intra-community edges to the number of edges in the graph and is defined as: $f(S)=\frac{\omega(C)}{\omega(G)}$. Here, $\omega(C)=\sum_{i=1}^{k}\omega\left(E(v_x,v_y)\right),\ v_x,v_y\in C_i$. Higher coverage values indicate that there are more connections within communities than edges linking different communities. In fact, the ideal scenario is that communities are completely separated from one another, which would correspond to a coverage of 1 (the maximum possible value).
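A minimal sketch of coverage for a weighted edge list follows. It assumes that $\omega(G)$ is the total edge weight of the graph and that every vertex belongs to exactly one community; the data layout (a dict from vertex pairs to weights) and the function name are hypothetical.

```python
def coverage(edge_weights, communities):
    """Intra-community edge weight divided by total edge weight."""
    member = {v: i for i, comm in enumerate(communities) for v in comm}
    total = sum(edge_weights.values())
    intra = sum(w for (u, v), w in edge_weights.items()
                if member[u] == member[v])
    return intra / total


# Two triangles joined by the bridge edge (2, 3), all weights equal to 1.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0}
print(coverage(edges, [{0, 1, 2}, {3, 4, 5}]))
print(coverage(edges, [{0, 1, 2, 3, 4, 5}]))
```

Only the bridge crosses communities in the first partition, giving 6/7; placing all vertices in one community trivially gives 1, which is why coverage is usually read alongside other scores.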

**Average odf:** This metric provides the average fraction of edges that point outward from S, taken over the vertices in S, and is defined as: $f(S)=\frac{1}{n_S}\sum_{u\in S}\frac{|\{(u,v)\in E : v\notin S\}|}{d(u)}$. Lower values of average odf correspond to better communities.

**Modularity:** This metric is based on a network model and determines the difference between the number of edges within S and the expected number of such edges in a random graph with exactly the same degree sequence. Modularity can be defined as: $f(S)=\frac{1}{4}\left(m_S-E(m_S)\right)$. Higher values of modularity correspond to denser connections within a community than expected at random.

**Flake odf:** This metric combines internal and external connectivity and determines the fraction of vertices in S that have fewer connections within the community than with the outside. Flake odf is defined as: $f(S)=\frac{|\{u : u\in S,\ |\{(u,v)\in E : v\in S\}| < d(u)/2\}|}{n_S}$. Since the score counts vertices that are more strongly connected to the outside than to the inside, better communities are associated with lower values.
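The three out-degree-fraction scores (Max odf, Average odf, and Flake odf) can be computed from the same per-vertex quantities. The sketch below assumes an undirected graph stored as a dict of neighbor sets and that every vertex in S has at least one edge; the function name and toy graph are illustrative.

```python
def odf_scores(adj, S):
    """Max, average, and Flake out-degree fraction of a vertex set S."""
    S = set(S)
    # fraction of each vertex's edges that leave S
    fractions = [len(adj[u] - S) / len(adj[u]) for u in S]
    # vertices with fewer than half of their edges inside S
    flake = sum(1 for u in S if len(adj[u] & S) < len(adj[u]) / 2)
    return {
        "max_odf": max(fractions),
        "average_odf": sum(fractions) / len(S),
        "flake_odf": flake / len(S),
    }


# Toy graph: two triangles joined by the bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(odf_scores(adj, {0, 1, 2}))  # a triangle
print(odf_scores(adj, {2, 3}))     # the two bridge endpoints
```

The triangle has only one outward edge, whereas the pair of bridge endpoints sends most of its edges outside, which the three scores reflect in different ways.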

**Separability:** This is a community-goodness metric [37] based on the intuition that good communities are well-separated (have relatively few edges from set S to the rest of the network). Separability is the ratio of the number of edges inside S to the number of edges pointing outside of S and is defined as: $f(S)=\frac{m_S}{C_S}$. Higher values indicate better communities.

#### 4.4. Community Selection for Decoy Selection

#### 4.5. Evaluating Selected Communities

#### 4.6. Implementation Details

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
GMM | Greedy Modularity Maximization |
IM | InfoMap |
LE | Leading Eigenvector |
Lo | Louvain |
LP | Label Propagation |
PDB | Protein Data Bank |
CASP | Critical Assessment of protein Structure Prediction |
lRMSD | least root-mean-squared-deviation |
ML | Machine Learning |
PC | Pareto Count |
PR | Pareto Rank |
SVM | Support Vector Machines |
WT | Walktrap |

## References

1. Boehr, D.D.; Wright, P.E. How do proteins interact? Science **2008**, 320, 1429–1430.
2. Boehr, D.D.; Nussinov, R.; Wright, P.E. The role of dynamic conformational ensembles in biomolecular recognition. Nat. Chem. Biol. **2009**, 5, 789–796.
3. Maximova, T.; Moffatt, R.; Ma, B.; Nussinov, R.; Shehu, A. Principles and overview of sampling methods for modeling macromolecular structure and dynamics. PLoS Comp. Biol. **2016**, 12, e1004619.
4. Leaver-Fay, A.; Tyka, M.; Lewis, S.M.; Lange, O.F.; Thompson, J.; Jacak, R.; Kaufman, K.W.; Renfrew, P.D.; Smith, C.A.; Sheffler, W.; et al. ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. **2011**, 487, 545–574.
5. Xu, D.; Zhang, Y. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins Struct. Funct. Bioinform. **2012**, 80, 1715–1735.
6. Olson, B.; Shehu, A. Multi-objective stochastic search for sampling local minima in the protein energy surface. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Washington, DC, USA, 22–25 September 2013; pp. 430–439.
7. Clausen, R.; Shehu, A. A multiscale hybrid evolutionary algorithm to obtain sample-based representations of multi-basin protein energy landscapes. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Newport Beach, CA, USA, 20–23 September 2014; pp. 269–278.
8. Kryshtafovych, A.; Fidelis, K.; Tramontano, A. Evaluation of model quality predictions in CASP9. Proteins **2011**, 79, 91–106.
9. Kryshtafovych, A.; Barbato, A.; Fidelis, K.; Monastyrskyy, B.; Schwede, T.; Tramontano, A. Assessment of the assessment: Evaluation of the model quality estimates in CASP10. Proteins **2014**, 82, 112–126.
10. Hassan, L.; Rajabi, Z.; Akhter, N.; Shehu, A. Community detection for decoy selection in template-free protein structure prediction. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Washington, DC, USA, 29 August–1 September 2018; pp. 621–625.
11. Moult, J.; Fidelis, K.; Kryshtafovych, A.; Schwede, T.; Tramontano, A. Critical assessment of methods of protein structure prediction (CASP) - round X. Proteins Struct. Funct. Bioinform. **2014**, 82, 109–115.
12. Uziela, K.; Wallner, B. ProQ2: Estimation of model accuracy implemented in Rosetta. Bioinformatics **2016**, 32, 1411–1413.
13. Liu, T.; Wang, Y.; Eickholt, J.; Wang, Z. Benchmarking deep networks for predicting residue-specific quality of individual protein models in CASP11. Sci. Rep. **2016**, 6, 19301.
14. Felts, A.K.; Gallicchio, E.; Wallqvist, A.; Levy, R.M. Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the surface generalized Born solvent model. Proteins Struct. Funct. Bioinform. **2002**, 48, 404–422.
15. Ben-Naim, A. Statistical potentials extracted from protein structures: Are these meaningful potentials? J. Chem. Phys. **1997**, 107, 3698–3706.
16. Lorenzen, S.; Zhang, Y. Identification of near-native structures by clustering protein docking conformations. Proteins Struct. Funct. Bioinform. **2007**, 68, 187–194.
17. Zhang, Y.; Skolnick, J. SPICKER: A clustering approach to identify near-native protein folds. J. Comput. Chem. **2004**, 25, 865–871.
18. Jing, X.; Wang, K.; Lu, R.; Dong, Q. Sorting protein decoys by machine-learning-to-rank. Sci. Rep. **2016**, 6, 31571.
19. He, Z.; Alazmi, M.; Zhang, J.; Xu, D. Protein structural model selection by combining consensus and single scoring methods. PLoS ONE **2013**, 8, e74006.
20. Pawlowski, M.; Kozlowski, L.; Kloczkowski, A. MQAPsingle: A quasi single-model approach for estimation of the quality of individual protein structure models. Proteins Struct. Funct. Bioinform. **2016**, 84, 1021–1028.
21. Cao, R.; Wang, Z.; Wang, Y.; Cheng, J. SMOQ: A tool for predicting the absolute residue-specific quality of a single protein model with support vector machines. BMC Bioinform. **2014**, 15, 120.
22. Nguyen, S.P.; Shang, Y.; Xu, D. DL-PRO: A novel deep learning method for protein model quality assessment. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 2071–2078.
23. Manavalan, B.; Lee, J.; Lee, J. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS ONE **2014**, 9, e106542.
24. Mirzaei, S.; Sidi, T.; Keasar, C.; Crivelli, S. Purely structural protein scoring functions using support vector machine and ensemble learning. IEEE/ACM Trans. Comput. Biol. Bioinform. **2016**.
25. Berman, H.M.; Henrick, K.; Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. **2003**, 10, 980.
26. McLachlan, A.D. A mathematical procedure for superimposing atomic coordinates of proteins. Acta Crystallogr. A **1972**, 26, 656–657.
27. Akhter, N.; Shehu, A. From extraction of local structures of protein energy landscapes to improved decoy selection in template-free protein structure prediction. Molecules **2018**, 23, 216.
28. Fisher, R.A. On the interpretation of χ² from contingency tables, and the calculation of P. J. R. Stat. Soc. **1922**, 85, 87–94.
29. Barnard, G.A. A new test of 2 × 2 tables. Nature **1945**, 156, 177.
30. Girvan, M.; Newman, M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA **2002**, 99, 7821–7826.
31. Newman, M.E. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E **2006**, 74, 036104.
32. Pons, P.; Latapy, M. Computing communities in large networks using random walks. In International Symposium on Computer and Information Sciences; Springer: Berlin, Germany, 2005; pp. 284–293.
33. Raghavan, U.N.; Albert, R.; Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E **2007**, 76, 036106.
34. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. **2008**, 2008, P10008.
35. Rosvall, M.; Axelsson, D.; Bergstrom, C.T. The map equation. Eur. Phys. J. Spec. Top. **2009**, 178, 13–23.
36. Clauset, A.; Newman, M.E.J.; Moore, C. Finding community structure in very large networks. Phys. Rev. E **2004**, 70, 066111.
37. Yang, J.; Leskovec, J. Defining and evaluating network communities based on ground-truth. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM), Brussels, Belgium, 10–13 December 2012; pp. 745–754.

Sample Availability: Not available.

**Figure 1.** Comparison of community detection methods (encoded by different colors) on directed nngraphs embedding each of the 10 decoy datasets along (**a**) modularity, (**b**) flake odf, (**c**) conductance, and (**d**) separability.

**Figure 2.** Comparison of the various selection strategies on the purity of the top community $C_{1}$ selected over communities detected with the Louvain method on directed nngraph embeddings of decoy data in (**a**), the Louvain method on undirected nngraph embeddings of decoy data in (**b**), and the GMM method on undirected nngraph embeddings of decoy data in (**c**).

**Figure 3.** Comparison of community detection methods based on the quality of the top community selected by Sel-S and Sel-S+E. In the legend, Lo-D refers to the Louvain method applied to directed nngraphs that embed the decoy datasets. The -S and -S+E refer to the Sel-S and Sel-S+E selection strategies.

**Figure 4.** Comparison of community detection methods based on the quality of the top three communities selected by Sel-S and Sel-S+E. In the legend, Lo-D refers to the Louvain method applied to directed nngraphs that embed the decoy datasets. The -S and -S+E refer to the Sel-S and Sel-S+E selection strategies.

**Table 1.** Column 2 shows the PDB ID of a known native structure for each test case. Columns 3 and 4 show the fold (* indicates native structures with a predominant $\beta$ fold and a short helix) and the length (number of amino acids), respectively. Column 5 shows the size of the decoy set $\Omega$ generated via Rosetta, and column 6 shows the lowest lRMSD from the known native structure over the decoy ensemble.

Test case | PDB ID | Fold | Length (# aas) | $\lvert\Omega\rvert$ | min_dist (Å) |
---|---|---|---|---|---|
1 (Easy) | 1dtdb | $\alpha+\beta$ | 61 | 57,839 | 0.51 |
2 (Easy) | 1tig | $\alpha+\beta$ | 88 | 52,099 | 0.60 |
3 (Easy) | 1dtja | $\alpha+\beta$ | 74 | 53,526 | 0.68 |
4 (Medium) | 1hz6a | $\alpha+\beta$ | 64 | 57,474 | 0.72 |
5 (Medium) | 1c8ca | $\beta^{\ast}$ | 64 | 53,322 | 1.08 |
6 (Medium) | 1bq9 | $\beta$ | 53 | 53,663 | 1.30 |
7 (Medium) | 1sap | $\beta$ | 66 | 51,209 | 1.75 |
8 (Hard) | 2ezk | $\alpha$ | 93 | 50,192 | 2.56 |
9 (Hard) | 1aoy | $\alpha$ | 78 | 52,218 | 3.26 |
10 (Hard) | 1isua | coil | 62 | 60,360 | 5.53 |

**Table 2.** The s, n, p of the sets of communities selected by different selection strategies over communities identified by the **Louvain** algorithm on decoy datasets embedded as directed nngraphs. We refer to this setting as **Louvain$_{Directed}$**. Recall that s stands for size (number of decoys), and n and p are the two performance metrics described above (and in more detail in Section 4). Each cell lists s, n, p (%).

Dataset | Communities | Sel-S | Sel-S+E | Sel-PR | Sel-PR+PC |
---|---|---|---|---|---|
1dtdb | $C_{1}$ | 5.3, 23.4, 100 | 5.3, 23.4, 100 | 5.3, 23.4, 100 | 0.005, 0, 0 |
1dtdb | $C_{1-2}$ | 10.3, 45.2, 100 | 5.5, 24.3, 100 | 5.3, 24.3, 99.9 | 0.009, 0, 0 |
1dtdb | $C_{1-3}$ | 15, 65.7, 100 | 10.2, 44.7, 100 | 5.4, 23.4, 99.7 | 5.3, 23.4, 99.8 |
1tig | $C_{1}$ | 10.4, 10.1, 14.6 | 10.4, 10.1, 14.6 | 10.4, 10.1, 14.6 | 10.4, 10.1, 14.6 |
1tig | $C_{1-2}$ | 18.7, 18.4, 14.8 | 15.9, 15.6, 14.7 | 10.4, 10.1, 14.6 | 10.4, 10.1, 14.6 |
1tig | $C_{1-3}$ | 24.2, 23.9, 14.9 | 17.7, 17.3, 14.8 | 10.4, 10.1, 14.6 | 10.4, 10.1, 14.6 |
1dtja | $C_{1}$ | 3.6, 16, 100 | 0.9, 3.9, 100 | 3.6, 16, 100 | 0.004, 0, 0 |
1dtja | $C_{1-2}$ | 6.6, 29.6, 100 | 3.6, 16, 100 | 6.3, 28.1, 100 | 0.007, 0, 0 |
1dtja | $C_{1-3}$ | 9.4, 41.7, 100 | 5.2, 23, 100 | 7.2, 32, 100 | 0.01, 0.02, 33.3 |
1hz6a | $C_{1}$ | 6.4, 6.7, 11.7 | 3.9, 4.3, 12.6 | 6.4, 6.7, 11.7 | 0.02, 0.02, 9.1 |
1hz6a | $C_{1-2}$ | 12, 12.4, 11.7 | 9.4, 8.9, 10.7 | 11.9, 11.2, 10.6 | 0.02, 0.02, 7.7 |
1hz6a | $C_{1-3}$ | 17.5, 17, 11 | 13.9, 13.1, 10.6 | 15.8, 15.6, 11.1 | 0.03, 0.02, 6.7 |
1c8ca | $C_{1}$ | 4.6, 0.1, 0.1 | 2.9, 8.3, 31.7 | 4.6, 0.1, 0.1 | 0.006, 0, 0 |
1c8ca | $C_{1-2}$ | 8.8, 6, 7.4 | 6, 32.8, 59.8 | 8.8, 6, 7.4 | 2.9, 8.3, 31.6 |
1c8ca | $C_{1-3}$ | 12.5, 13.3, 11.6 | 7.5, 33.6, 48.8 | 11.9, 30.4, 27.8 | 6, 32.8, 59.7 |
1sap | $C_{1}$ | 11.4, 0, 0 | 8.5, 59.9, 16.3 | 11.4, 0, 0 | 8.5, 60, 16.3 |
1sap | $C_{1-2}$ | 21.7, 0, 0 | 13.5, 62.5, 10.7 | 21.7, 0, 0 | 18.7, 60, 7.4 |
1sap | $C_{1-3}$ | 30.6, 0.1, 0.01 | 23.8, 62.5, 6.1 | 30.2, 59.9, 4.6 | 18.7, 60, 7.4 |
1bq9 | $C_{1}$ | 11.3, 9.9, 1.4 | 10.3, 11, 1.7 | 11.3, 9.9, 1.4 | 0.01, 0, 0 |
1bq9 | $C_{1-2}$ | 21.6, 20.9, 1.5 | 21.6, 20.9, 1.5 | 21.6, 20.9, 1.5 | 0.02, 0, 0 |
1bq9 | $C_{1-3}$ | 25.7, 25.2, 1.6 | 25.7, 25.5, 1.6 | 21.6, 20.9, 1.5 | 10.3, 11, 1.7 |
2ezk | $C_{1}$ | 30.1, 50.7, 22 | 30.1, 50.7, 22 | 30.1, 50.7, 22 | 30.1, 50.7, 22 |
2ezk | $C_{1-2}$ | 56.5, 56.2, 13 | 56.5, 56.2, 13 | 30.1, 50.7, 22 | 30.1, 50.7, 22 |
2ezk | $C_{1-3}$ | 73, 94.5, 16.9 | 60.8, 56.6, 12.1 | 30.1, 50.7, 22 | 30.1, 50.7, 22 |
1aoy | $C_{1}$ | 38, 22.6, 6.5 | 38, 22.6, 6.5 | 38, 22.6, 6.5 | 38, 22.6, 6.5 |
1aoy | $C_{1-2}$ | 54.9, 29.7, 5.9 | 52.5, 90.7, 18.9 | 38, 22.6, 6.5 | 38, 22.6, 6.5 |
1aoy | $C_{1-3}$ | 69.4, 97.9, 15.4 | 59.5, 92.7, 17 | 38, 22.6, 6.5 | 38, 22.6, 6.5 |
1isua | $C_{1}$ | 15.6, 57.3, 19.5 | 5, 0.7, 0.8 | 15.6, 57.3, 19.5 | 0.003, 0, 0 |
1isua | $C_{1-2}$ | 26.3, 70.2, 14.2 | 9.9, 2, 1.1 | 22.5, 62.3, 14.7 | 0.007, 0, 0 |
1isua | $C_{1-3}$ | 33.2, 75.2, 12 | 15.9, 3.6, 1.2 | 28.5, 63.9, 11.9 | 0.01, 0, 0 |

**Table 3.** The s, n, p of the communities selected by different selection strategies over communities identified by the **Louvain** algorithm on decoy datasets embedded as undirected nngraphs. Each cell lists s, n, p (%).

Dataset | Communities | Sel-S | Sel-S+E | Sel-PR | Sel-PR+PC |
---|---|---|---|---|---|
1dtdb | $C_{1}$ | 4.7, 20.4, 100 | 2.5, 10.8, 100 | 3.1, 13.7, 100 | 0.01, 0, 0 |
1dtdb | $C_{1-2}$ | 8.2, 35.9, 99.9 | 5.1, 22.2, 100 | 7.8, 34.1, 100 | 0.01, 0, 0 |
1dtdb | $C_{1-3}$ | 11.3, 49.5, 99.9 | 7.2, 31.5, 100 | 10.2, 44.9, 100 | 2.5, 10.8, 99.6 |
1tig | $C_{1}$ | 12.6, 12.3, 14.7 | 12.6, 12.3, 14.7 | 12.6, 12.3, 14.7 | 12.6, 12.3, 14.7 |
1tig | $C_{1-2}$ | 23, 23, 15 | 16.4, 16, 14.7 | 12.6, 12.3, 14.7 | 12.6, 12.4, 14.8 |
1tig | $C_{1-3}$ | 28, 28.1, 15.1 | 18, 17.7, 14.8 | 12.6, 12.3, 14.7 | 12.6, 12.4, 14.8 |
1dtja | $C_{1}$ | 3.2, 14.2, 100 | 1.4, 6.2, 100 | 3.2, 14.2, 100 | 0.004, 0, 0 |
1dtja | $C_{1-2}$ | 6.3, 28, 100 | 3.1, 13.6, 100 | 6.3, 28, 100 | 0.007, 0, 0 |
1dtja | $C_{1-3}$ | 9, 40.1, 100 | 5.8, 25.7, 100 | 9, 40.1, 100 | 0.01, 0.02, 33.3 |
1hz6a | $C_{1}$ | 7, 7.1, 11.4 | 5.8, 5.9, 11.5 | 6, 6, 11.3 | 0.02, 0.02, 9.1 |
1hz6a | $C_{1-2}$ | 13, 13.1, 11.3 | 11.1, 10.7, 10.9 | 11.8, 11.9, 11.4 | 0.02, 0.02, 7.7 |
1hz6a | $C_{1-3}$ | 18.8, 19, 11.4 | 17.1, 16.6, 11 | 18.8, 19, 11.4 | 0.03, 0.02, 6.7 |
1c8ca | $C_{1}$ | 4.7, 0.1, 0.2 | 2.7, 8.5, 34.8 | 2.7, 8.5, 34.8 | 0.006, 0, 0 |
1c8ca | $C_{1-2}$ | 8.6, 32.4, 41.1 | 6.6, 40.8, 67.5 | 7.3, 8.6, 12.8 | 2.7, 8.5, 34.7 |
1c8ca | $C_{1-3}$ | 12.2, 42.8, 38.2 | 9.9, 47.3, 51.8 | 11.2, 40.9, 39.6 | 6.6, 40.8, 67.4 |
1sap | $C_{1}$ | 10.1, 0, 0 | 8.8, 59.6, 15.6 | 9.9, 0, 0 | 8.8, 59.6, 15.6 |
1sap | $C_{1-2}$ | 20, 0, 0 | 14.2, 64.1, 10.4 | 20, 0, 0 | 18.7, 59.6, 7.3 |
1sap | $C_{1-3}$ | 28.8, 59.6, 4.8 | 24.1, 64.1, 6.1 | 28.8, 59.6, 4.8 | 28.8, 59.6, 4.8 |
1bq9 | $C_{1}$ | 11.3, 9.7, 1.4 | 10.3, 11.2, 1.7 | 10.3, 11.2, 1.7 | 0.01, 0, 0 |
1bq9 | $C_{1-2}$ | 21.7, 20.9, 1.5 | 21.7, 20.9, 1.5 | 21.7, 20.9, 1.5 | 0.02, 0, 0 |
1bq9 | $C_{1-3}$ | 26.2, 25.6, 1.5 | 26.2, 25.7, 1.6 | 21.7, 20.9, 1.5 | 10.4, 11.2, 1.7 |
2ezk | $C_{1}$ | 26.9, 47.2, 22.9 | 26.9, 47.2, 22.9 | 26.9, 47.2, 22.9 | 26.9, 47.2, 22.9 |
2ezk | $C_{1-2}$ | 53.1, 52.3, 12.8 | 53.2, 52.3, 12.8 | 26.9, 47.2, 22.9 | 26.9, 47.2, 22.9 |
2ezk | $C_{1-3}$ | 69.2, 81.8, 15.4 | 58.1, 52.7, 11.8 | 26.9, 47.2, 22.9 | 26.9, 47.2, 22.9 |
1aoy | $C_{1}$ | 36.8, 27.6, 8.2 | 36.8, 27.6, 8.2 | 36.8, 27.6, 8.2 | 36.8, 27.6, 8.2 |
1aoy | $C_{1-2}$ | 53.1, 33.7, 6.9 | 52.4, 91.6, 19.1 | 36.8, 27.6, 8.2 | 36.8, 27.6, 8.2 |
1aoy | $C_{1-3}$ | 68.8, 97.7, 15.5 | 65, 91.9, 15.5 | 36.8, 27.6, 8.2 | 36.8, 27.6, 8.2 |
1isua | $C_{1}$ | 14.3, 56, 20.7 | 5.5, 1.3, 1.2 | 5.6, 1.5, 1.4 | 0.003, 0, 0 |
1isua | $C_{1-2}$ | 23.9, 64.5, 14.3 | 11.2, 2.8, 1.3 | 13.5, 6.4, 2.5 | 0.007, 0, 0 |
1isua | $C_{1-3}$ | 31.8, 69.3, 11.6 | 16.3, 6, 2 | 27.8, 62.4, 11.9 | 0.01, 0, 0 |

**Table 4.** The s, n, p of the communities selected by different strategies over communities identified by the Greedy Modularity Maximization (GMM) algorithm on decoy datasets embedded as undirected graphs.

| GMM | Sel-S: s, n, p (%) | Sel-S+E: s, n, p (%) | Sel-PR: s, n, p (%) | Sel-PR+PC: s, n, p (%) |
|---|---|---|---|---|
| 1dtdb | C$_{1}$: 11.1, 48.7, 100 | C$_{1}$: 11.1, 48.7, 100 | C$_{1}$: 11.1, 48.7, 100 | C$_{1}$: 0.005, 0, 0 |
| | C$_{1-2}$: 18.3, 80.1, 99.9 | C$_{1-2}$: 14.8, 65, 100 | C$_{1-2}$: 11.1, 48.7, 99.9 | C$_{1-2}$: 0.009, 0, 0 |
| | C$_{1-3}$: 22, 96.4, 99.9 | C$_{1-3}$: 15.3, 65, 96.8 | C$_{1-3}$: 11.1, 48.7, 99.9 | C$_{1-3}$: 0.02, 0, 0 |
| 1tig | C$_{1}$: 19.2, 18.7, 14.7 | C$_{1}$: 19.2, 18.7, 14.7 | C$_{1}$: 19.2, 18.7, 14.7 | C$_{1}$: 19.2, 18.7, 14.7 |
| | C$_{1-2}$: 31.6, 31.5, 15 | C$_{1-2}$: 20.4, 20, 14.8 | C$_{1-2}$: 19.2, 18.7, 14.7 | C$_{1-2}$: 19.2, 18.7, 14.7 |
| | C$_{1-3}$: 43, 42.3, 14.8 | C$_{1-3}$: 22.9, 22.3, 14.7 | C$_{1-3}$: 19.2, 18.7, 14.7 | C$_{1-3}$: 19.2, 18.7, 14.7 |
| 1dtja | C$_{1}$: 7.6, 33.7, 100 | C$_{1}$: 0.03, 0.1, 100 | C$_{1}$: 7.6, 33.7, 100 | C$_{1}$: 0.004, 0, 0 |
| | C$_{1-2}$: 14.5, 64.6, 100 | C$_{1-2}$: 7.6, 33.9, 100 | C$_{1-2}$: 7.6, 33.9, 100 | C$_{1-2}$: 0.007, 0, 0 |
| | C$_{1-3}$: 17.6, 78.5, 100 | C$_{1-3}$: 14.5, 64.7, 100 | C$_{1-3}$: 7.6, 33.9, 100 | C$_{1-3}$: 0.01, 0.02, 33.3 |
| 1hz6a | C$_{1}$: 24.7, 24.8, 11.3 | C$_{1}$: 24.7, 24.8, 11.3 | C$_{1}$: 24.7, 24.8, 11.3 | C$_{1}$: 0.02, 0.02, 9.1 |
| | C$_{1-2}$: 48.7, 48.8, 11.3 | C$_{1-2}$: 25.7, 25.7, 11.3 | C$_{1-2}$: 24.7, 24.8, 11.3 | C$_{1-2}$: 0.03, 0.03, 12.5 |
| | C$_{1-3}$: 59, 59.2, 11.3 | C$_{1-3}$: 28.3, 28.4, 11.3 | C$_{1-3}$: 24.7, 24.8, 11.3 | C$_{1-3}$: 0.03, 0.05, 15.8 |
| 1c8ca | C$_{1}$: 11.6, 6.9, 6.5 | C$_{1}$: 6.4, 15.7, 26.5 | C$_{1}$: 11.6, 6.9, 6.5 | C$_{1}$: 0.006, 0, 0 |
| | C$_{1-2}$: 23.1, 60.9, 28.7 | C$_{1-2}$: 17.9, 69.7, 42.3 | C$_{1-2}$: 23.1, 60.9, 28.7 | C$_{1-2}$: 0.06, 0.5, 91.2 |
| | C$_{1-3}$: 29.5, 76.6, 28.3 | C$_{1-3}$: 18.3, 69.7, 41.4 | C$_{1-3}$: 29.5, 76.6, 28.3 | C$_{1-3}$: 6.5, 16.2, 27.2 |
| 1sap | C$_{1}$: 26, 0, 0 | C$_{1}$: 24.6, 99.3, 9.3 | C$_{1}$: 26, 0, 0 | C$_{1}$: 0.01, 0, 0 |
| | C$_{1-2}$: 50.6, 99.3, 4.5 | C$_{1-2}$: 40.5, 99.3, 5.7 | C$_{1-2}$: 50.6, 99.3, 4.5 | C$_{1-2}$: 24.6, 99.3, 9.3 |
| | C$_{1-3}$: 66.5, 99.3, 3.4 | C$_{1-3}$: 66.6, 99.3, 3.4 | C$_{1-3}$: 50.6, 99.3, 4.5 | C$_{1-3}$: 24.6, 99.3, 9.3 |
| 1bq9 | C$_{1}$: 24.5, 23.8, 1.5 | C$_{1}$: 24.5, 23.8, 1.5 | C$_{1}$: 24.5, 23.8, 1.5 | C$_{1}$: 0.01, 0, 0 |
| | C$_{1-2}$: 32.6, 32.4, 1.6 | C$_{1-2}$: 24.9, 24.1, 1.5 | C$_{1-2}$: 24.5, 23.9, 1.6 | C$_{1-2}$: 0.02, 0, 0 |
| | C$_{1-3}$: 38.7, 38.7, 1.6 | C$_{1-3}$: 26.9, 26.1, 1.5 | C$_{1-3}$: 24.5, 23.9, 1.6 | C$_{1-3}$: 0.03, 0.1, 7.1 |
| 2ezk | C$_{1}$: 43.1, 7, 2.1 | C$_{1}$: 22.4, 40, 23.3 | C$_{1}$: 43.1, 7, 2.1 | C$_{1}$: 22.4, 40, 23.3 |
| | C$_{1-2}$: 76.6, 60, 10.2 | C$_{1-2}$: 65.5, 47, 9.4 | C$_{1-2}$: 65.5, 47, 9.4 | C$_{1-2}$: 65.5, 47, 9.4 |
| | C$_{1-3}$: 99, 100, 13.2 | C$_{1-3}$: 65.5, 47, 9.4 | C$_{1-3}$: 65.5, 47, 9.4 | C$_{1-3}$: 65.5, 47, 9.4 |
| 1aoy | C$_{1}$: 45.6, 66.8, 16 | C$_{1}$: 35.6, 33, 10.1 | C$_{1}$: 45.6, 66.8, 16 | C$_{1}$: 35.6, 33, 10.1 |
| | C$_{1-2}$: 81.2, 99.7, 13.1 | C$_{1-2}$: 81.2, 99.7, 13.4 | C$_{1-2}$: 81.2, 99.7, 13.4 | C$_{1-2}$: 35.6, 33, 10.1 |
| | C$_{1-3}$: 96.8, 100, 11.3 | C$_{1-3}$: 81.2, 99.7, 13.4 | C$_{1-3}$: 81.2, 99.7, 13.4 | C$_{1-3}$: 81.2, 99.7, 13.4 |
| 1isua | C$_{1}$: 39.6, 70, 9.4 | C$_{1}$: 14.1, 1.9, 0.7 | C$_{1}$: 39.6, 70, 9.4 | C$_{1}$: 0.007, 0, 0 |
| | C$_{1-2}$: 78.7, 97.7, 6.6 | C$_{1-2}$: 14.8, 1.9, 0.7 | C$_{1-2}$: 53.7, 71.9, 7.1 | C$_{1-2}$: 0.01, 0, 0 |
| | C$_{1-3}$: 92.9, 99.5, 5.7 | C$_{1-3}$: 54.4, 71.9, 7 | C$_{1-3}$: 53.8, 71.9, 7.1 | C$_{1-3}$: 0.02, 0, 0 |
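The cumulative entries C$_{1}$, C$_{1-2}$, C$_{1-3}$ can be illustrated with a short sketch. It assumes, as a reading of the table rather than a definition restated here, that s is the percentage of the dataset covered by the top-k selected communities, n is the percentage of all near-native decoys they contain, and p is the percentage of their members that are near-native. The function name, the input layout, and the counts in the usage lines are hypothetical.

```python
def cumulative_snp(communities, dataset_size, total_near_native, k):
    """Return (s, n, p) in percent for the union of the top-k communities.

    communities: list of (size, near_native_count) tuples, already ordered
    by the selection strategy (best first). ASSUMPTION: s, n, p follow the
    definitions given in the lead-in; they are not taken from the authors' code.
    """
    top = communities[:k]
    size = sum(s for s, _ in top)   # decoys in the selected communities
    hits = sum(h for _, h in top)   # near-native decoys among them
    s = 100.0 * size / dataset_size
    n = 100.0 * hits / total_near_native if total_near_native else 0.0
    p = 100.0 * hits / size if size else 0.0
    return s, n, p


# Hypothetical dataset: 1000 decoys, 100 of them near-native.
ranked = [(100, 50), (100, 0), (50, 25)]
for k in (1, 2, 3):
    print(k, cumulative_snp(ranked, 1000, 100, k))
```

Adding a community with few near-native decoys raises s while leaving n nearly flat and lowering p, which is the pattern seen in several rows of the table.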

**Table 5.** Rank (by Size (S), Size and Energy (S+E), Pareto rank (PR), Pareto rank and Pareto count (PR+PC)) of the community with the highest purity among those identified by Louvain (Lo), Louvain$_{Directed}$ (Lo$_{D}$) and GMM.

| | Rank by (Lo): S, S+E, PR, PR+PC | Rank by (Lo$_{D}$): S, S+E, PR, PR+PC | Rank by (GMM): S, S+E, PR, PR+PC |
|---|---|---|---|
| 1dtdb | 3, 4, 1, 9 | 1, 1, 1, 3 | 1, 1, 1, 8 |
| 1tig | 691, 396, 7069, 7073 | 229, 112, 2287, 2289 | 283, 44, 962, 963 |
| 1dtja | 71, 64, 26735, 26736 | 1, 7, 1, 12 | 1, 3, 1, 9 |
| 1hz6a | 647, 639, 10160, 10166 | 280, 49, 673, 670 | 337, 70, 748, 740 |
| 1c8ca | 818, 572, 9700, 9736 | 42, 31, 540, 542 | 15, 1, 4, 2 |
| 1bq9 | 1230, 267, 4816, 4836 | 1223, 268, 4810, 4827 | 1271, 269, 4826, 4853 |
| 1sap | 3301, 137, 538, 541 | 3298, 137, 538, 551 | 3369, 142, 566, 566 |
| 2ezk | 6, 5, 13, 12 | 3, 9, 14, 16 | 3, 1, 2, 1 |
| 1aoy | 3, 2, 12, 11 | 3, 3, 14, 13 | 1, 3, 1, 3 |
| 1isua | 135, 117, 1519, 1527 | 136, 117, 1520, 1525 | 194, 193, 1236, 1241 |
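How such a rank might be obtained is sketched below for one ordering criterion. The sketch assumes that purity is the fraction of a community's decoys that are near-native and that ties in purity resolve to the earliest position in the ordering; neither detail is confirmed in this section. The function name and the dictionary keys are hypothetical.

```python
def rank_of_purest(communities, key):
    """1-based position of the highest-purity community after sorting by `key`.

    communities: list of dicts with 'size' and 'near_native' counts.
    key: sort key under which the best community comes first.
    ASSUMPTION: purity = near_native / size; ties go to the earliest position.
    """
    ordered = sorted(communities, key=key)
    purities = [c["near_native"] / c["size"] for c in ordered]
    return purities.index(max(purities)) + 1


# Hypothetical communities; rank by size, largest first.
comms = [
    {"size": 50, "near_native": 5},
    {"size": 30, "near_native": 30},
    {"size": 20, "near_native": 2},
]
print(rank_of_purest(comms, key=lambda c: -c["size"]))
```

A rank of 1 means the selection strategy places the purest community first; large ranks, as for 1tig or 1bq9 in the table, mean the purest community is buried deep in the ordering.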

**Table 6.** Comparison of Size + Energy (S+E) to other selection strategies on best rank via 1-sided Fisher’s and Barnard’s tests. The top panel evaluates the null hypothesis that Sel-S+E does not provide the best rank (based on reported p-values), considering each of the other three selection strategies in turn. Similarly, the lower panel evaluates the null hypothesis that Sel-S+E does not provide a better rank with respect to another particular selection strategy, considering each in turn.

| Panel | Test | Sel-S | Sel-PR | Sel-PR+PC |
|---|---|---|---|---|
| Best Rank | Fisher’s | 6.621 × 10$^{-7}$ | 1.626 × 10$^{-7}$ | 9.388 × 10$^{-12}$ |
| Best Rank | Barnard’s | 2.314 × 10$^{-7}$ | 6.33 × 10$^{-8}$ | 2.128 × 10$^{-12}$ |
| Better Rank | Fisher’s | 0.0001154 | 7.744 × 10$^{-7}$ | 4.194 × 10$^{-15}$ |
| Better Rank | Barnard’s | 6.738 × 10$^{-5}$ | 3.811 × 10$^{-7}$ | 8.075 × 10$^{-16}$ |
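The mechanics of a 1-sided versus 2-sided Fisher's exact test on a 2×2 contingency table can be sketched in pure Python. How the counts behind the reported p-values were tabulated is not stated in this section, so the table layout (rows: two selection strategies; columns: cases where the strategy did or did not attain the best rank) and the counts in the usage lines are hypothetical. Barnard's test is not implemented here.

```python
from math import comb


def hypergeom_pmf(x, row1, row2, col1):
    """P(top-left cell == x) for a 2x2 table with fixed margins."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)


def fisher_exact(table, alternative="greater"):
    """p-value of Fisher's exact test for table [[a, b], [c, d]].

    'greater' sums the probabilities of tables whose top-left cell is at
    least as large as observed; 'two-sided' sums the probabilities of all
    tables that are no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: hypergeom_pmf(x, row1, row2, col1) for x in range(lo, hi + 1)}
    if alternative == "greater":
        p = sum(pr for x, pr in probs.items() if x >= a)
    elif alternative == "two-sided":
        p_obs = probs[a]
        # Small relative tolerance guards against floating-point ties.
        p = sum(pr for pr in probs.values() if pr <= p_obs * (1 + 1e-7))
    else:
        raise ValueError("alternative must be 'greater' or 'two-sided'")
    return min(1.0, p)


# HYPOTHETICAL counts: strategy A attains the best rank in 20 of 30 cases,
# strategy B in 3 of 30 cases.
counts = [[20, 10], [3, 27]]
print(fisher_exact(counts, "greater"), fisher_exact(counts, "two-sided"))
```

When both rows have equal totals, the null distribution is symmetric and the 2-sided p-value is twice the 1-sided one, which is one way a doubling between 1-sided and 2-sided results can arise.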

**Table 7.** Comparison of **Size + Energy** to other selection strategies on best rank via **2-sided** Fisher’s and Barnard’s tests. The tests evaluate the null hypothesis (based on reported p-values) that **Sel-S+E** (or, **Size+Energy**) provides similar ranking in comparison to other selection strategies.

| Panel | Test | Sel-S | Sel-PR | Sel-PR+PC |
|---|---|---|---|---|
| Best Rank | Fisher’s | 1.324 × 10$^{-6}$ | 3.252 × 10$^{-7}$ | 1.878 × 10$^{-11}$ |
| Best Rank | Barnard’s | 4.629 × 10$^{-7}$ | 1.266 × 10$^{-7}$ | 4.255 × 10$^{-12}$ |
| Better Rank | Fisher’s | 0.000231 | 1.549 × 10$^{-6}$ | 8.388 × 10$^{-15}$ |
| Better Rank | Barnard’s | 0.0001348 | 7.621 × 10$^{-7}$ | 1.615 × 10$^{-15}$ |

| | Entropy$_{\mathit{Lo(Undirected)}}$ | Entropy$_{\mathit{GMM(Undirected)}}$ | Entropy$_{\mathit{Lo(Directed)}}$ |
|---|---|---|---|
| 1dtdb | 2.054332 | 1.335355 | 2.007146 |
| 1tig | 5.571811 | 5.19006 | 5.660448 |
| 1dtja | 3.679291 | 2.990441 | 3.670991 |
| 1hz6a | 4.2847 | 3.472866 | 4.400298 |
| 1c8ca | 3.323009 | 2.713008 | 3.480973 |
| 1bq9 | 4.920814 | 4.575263 | 4.92005 |
| 1sap | 0.961949 | 0.054711 | 0.914548 |
| 2ezk | 1.222933 | 0.888501 | 1.065169 |
| 1aoy | 0.905185 | 0.652051 | 0.873628 |
| 1isua | 1.725124 | 0.715579 | 1.680263 |
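The entropy values above summarize how the near-native decoys are spread over the identified communities. A minimal sketch of such a computation follows, assuming Shannon entropy over the fraction of near-native decoys falling in each community. The logarithm base is not stated in this section, so base 2 is used as an assumption and exposed as a parameter; the function name and the counts are hypothetical.

```python
from math import log


def near_native_entropy(counts, base=2.0):
    """Shannon entropy of near-native decoys distributed over communities.

    counts: number of near-native decoys in each community.
    ASSUMPTION: base-2 logarithm; the base used for the reported values
    is not given here.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:  # empty communities contribute nothing (0 * log 0 := 0)
            p = c / total
            entropy -= p * log(p, base)
    return entropy


# Hypothetical distributions of 20 near-native decoys over 4 communities.
print(near_native_entropy([20, 0, 0, 0]))  # all in one community
print(near_native_entropy([5, 5, 5, 5]))   # uniform spread
```

The two usage lines reproduce the extremes described in Section 2.4: the entropy is 0 when a single community holds every near-native decoy and reaches its maximum when they are spread uniformly.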

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kabir, K.L.; Hassan, L.; Rajabi, Z.; Akhter, N.; Shehu, A.
Graph-Based Community Detection for Decoy Selection in Template-Free Protein Structure Prediction. *Molecules* **2019**, *24*, 854.
https://doi.org/10.3390/molecules24050854
