Selecting Near-Native Protein Structures from Predicted Decoy Sets Using Ordered Graphlet Degree Similarity

Effective prediction of protein tertiary structure from sequence is an important and challenging problem in computational structural biology. Ab initio protein structure prediction is based on amino acid sequence alone, thus, it has a wide application area. With the ab initio method, a large number of candidate protein structures called decoy set can be predicted, however, it is a difficult problem to select a good near-native structure from the predicted decoy set. In this work we propose a new method for selecting the near-native structure from the decoy set based on both contact map overlap (CMO) and graphlets. By generalizing graphlets to ordered graphs, and using a dynamic programming to select the optimal alignment with an introduced gap penalty, a GR_score is defined for calculating the similarity between the three-dimensional (3D) decoy structures. The proposed method was applied to all 54 single-domain targets in CASP11 and all 43 targets in CASP10, and ensemble clustering was used to cluster the protein decoy structures based on the computed CR_scores. The most popular centroid structure was selected as the near-native structure. The experiments showed that compared to the SPICKER method, which is used in I-TASSER, the proposed method can usually select better near-native structures in terms of the similarity between the selected structure and the true native structure.


Introduction
The human genome project was first proposed by American scientists in 1985 and officially launched in 1990 [1]. Its purpose is to determine the nucleotide sequence consisting of three billion base pairs contained in a human chromosome, thereby mapping the human genome and identifying the genes and their sequences to decipher humans. With the completion of the program, the gene sequence can be obtained by measuring the obtained map, and the sequence of the corresponding protein can also be inferred using the genetic central dogma [2]. Since the function of genes can be studied via the study of the corresponding proteins produced through gene expression, the use of bioinformatics to discover the function of a protein product of a gene becomes more and more significant. In fact, determining protein functions from genomic sequences is a central goal of bioinformatics [3]. Since the function of proteins is determined by its tertiary structure, the prediction of tertiary structure based on protein sequences is a very important problem.
It is known that the number of known protein structures increases exponentially. By the end of the decade, the PDB [4] database size will be more than 150,000 structures at the current rate. However, the newly published UniProtKB/TrEMBL [5] protein database in Jan, 2019 contains 139,694,261 sequence entries. Hence, only a very small part of them have experimentally solved structures.
(in terms of Euclidean distance) [20]. One of the typical flexible alignment methods is FATCAT (flexible structure alignment by chaining aligned fragment pairs with twists) [20]. FATCAT is an improvement based on CE. It first identifies the local AFPs and then produces an optimal combination of these AFPs using dynamic programming, where twists and gap penalty are used to allow flexible alignments.
Another type of the protein structure comparison methods is not based on the model superposition. One of the methods is Contact Area Difference (CAD) [21], which evaluates the structure similarities based on contacts. It computes the structure similarity by measuring the differences between the physical contacts of a model and a reference structure, without supposition of the two models. The local Distance Difference Test (lDDT) [22] is another superposition free score that evaluates local distance differences of all atoms in a model, including validation of stereochemical plausibility. The reference can be a single structure, or an ensemble of equivalent structures. It is computed over all pairs of atoms in the reference structure at a distance closer than a predefined threshold, and not belonging to the same residue.
There are also methods developed specially for evaluating predicted decoys using both energy functions and the structure information. The random forest-based model quality assessment (RFMQA) [23] predicts a relative score of a decoy model by using its secondary structure, solvent accessibility and knowledge-based potential energy terms. The support-vector-machine-based single-model quality assessment (SVMQA) [24] is trained to predict TM-score and GDT_TS score based on both statistical potential energy terms and structure consistency features.
In this article, a new protein structure similarity score, called the GR_score, was defined based on maximum Contact Map Overlap (CMO) [25] which is a superposition free protein structure alignment method defined by Godzik and Skolnick, and the ordered graphlet degree [26] which is a new systematic measure of a network's local structure similarity. The superposition free structure alignment methods based on contact maps may capture both the local structure similarities from contact maps and the global structure similarities using dynamic programming. Using the ordered graphlet degree can further improve the measuring of the local structure similarities by comparing the local topology structures. Thus, the proposed GR_score can help in measuring the decoy structure similarities, and in selecting the near-native models from a large number of predicted decoy models in ab initio structure prediction.

Maximum Contact Map Overlap (CMO)
A contact map is an ordered graph, CM = (V, E), where nodes V and edges E are defined as follows. Each node in V represents an amino acid of a protein. It leads to a strict total ordering of the nodes: for two different nodes u and w, either u < w if u is before w in the protein sequence or u > w otherwise. The two nodes u and w are connected by an edge (u, w) ∈ E, if and only if the Euclidean distance between the C α atoms of the corresponding amino acids is less than a given threshold . This is presented in Figure 1 [27].

Graphlets and Graphlet Degrees
Graphlets are small, connected, non-isomorphic and induced subgraphs of a larger graph G = (V, E) having n ≥ 2 nodes [27]. Some nodes are identical to each other topologically within each graphlet, which is considered to belong to the same automorphism orbit to represent that a graphlet can touch a node in V by different ways topologically. The concepts used to summarize the graphlets degree are: the graphlet degree of node n, represented by d i n , is the number of times a graphlet touches node n at orbit i. In the graph degree distribution protocol, the degree distribution is extended to 73 graph degree distributions by using all 2-5 nodes and their corresponding 73 automorphism orbits (the first of the 73 graph degree distributions is the degree distribution) [28]. The i th ordered graphlet degree of node u, represented by d i u, , is the number of times an ordered graphlet touches the node u at orbit i. To reduce the calculation times, the five 2-node and 3-node ordered graphlets have been chosen to define 14 orbits (see Figure 2) [27]. Therefore, a 14-dimensional vector (d 1 u, d 2 u , . . . , d 14 u ) could describe each node u of a contact map. For a given contact map CM = (V, E), there would be a limitation of the degree of a node by the number of residues that can fit in a sphere with radius . In fact, a linear worst time complexity could be led by using a distance threshold of 7.5 Å.
Genes 2018, 9, x FOR PEER REVIEW  4 of 13 Therefore, a 14-dimensional vector ( , , … , ) could describe each node of a contact map. For 145 a given contact map = ( , ), there would be a limitation of the degree of a node by the number 146 of residues that can fit in a sphere with radius ɛ. In fact, a linear worst time complexity could be led 147 by using a distance threshold ɛ of 7.5 Å.

157
The TM-score [19] is intended as a more accurate measure of the protein structure similarity

165
SPICKER is an iterative clustering method to identify near-native protein folds developed by

166
Zhang and Jeffery [29]. The procedure of selecting protein structure by this clustering method is as 167 follows. First, a self-adjusting cutoff between 7.5 to 12 Å is found in an iterative way to make sure 168 that the largest cluster contains less than 70% and more than 15% of total decoys. Second, another 169 iterated approach is applied to identify the cluster with the most neighbors under the cutoff

157
The TM-score [19] is intended as a more accurate measure of the protein structure similarity

165
SPICKER is an iterative clustering method to identify near-native protein folds developed by

166
Zhang and Jeffery [29]. The procedure of selecting protein structure by this clustering method is as 167 follows. First, a self-adjusting cutoff between 7.5 to 12 Å is found in an iterative way to make sure 168 that the largest cluster contains less than 70% and more than 15% of total decoys. Second, another 169 iterated approach is applied to identify the cluster with the most neighbors under the cutoff

TM-Score
The TM-score [19] is intended as a more accurate measure of the protein structure similarity than RMSD and GDT-TS. It gives the residue pairs at smaller distance higher weights than those at larger distances and normalized by the length of the target proteins, thus, it can represent the global structure similarities better than RMSD or GDT-TS measures. The TM-score is between 0 and 1, where 1 indicates a perfect match between two structures. Generally, scores below 0.2 correspond to randomly chosen unrelated proteins. The score of the structures roughly having the same fold is higher than 0.5.

SPICKER
SPICKER is an iterative clustering method to identify near-native protein folds developed by Zhang and Jeffery [29]. The procedure of selecting protein structure by this clustering method is as follows. First, a self-adjusting cutoff between 7.5 to 12 Å is found in an iterative way to make sure that the largest cluster contains less than 70% and more than 15% of total decoys. Second, another iterated approach is applied to identify the cluster with the most neighbors under the cutoff excluding the members of cluster found in the previous iterations. Finally, an averaging model, called final model, is built from all the decoy members of the cluster in the current iteration.

Ensemble Clustering
Using the ensemble clustering method as introduced in [30] can avoid local optimality. The most popular centroid structure identified in the ensemble clustering is selected as the near-native structure in the proposed method. The method includes two steps: constructing a distance matrix for the decoy set using a similarity score, and finding the most possible largest cluster centroid using an ensemble k-medoids. A confidence score as described in [30] is used to select the cluster centroid with the maximum score as the near-native structure.
Only C α atoms were used in the structure comparison in the proposed method. For two proteins A and B, u and w are the different C α atoms of the two proteins. Based on graphlet degrees, between two nodes u and w, the order graphlet degree similarity is defined as follows [27]: the range of the similarity score is from 0 to 1. The two nodes having similar local topologies will have a high similarity score.
The alignment between two structures having, respectively, n 1 and n 2 nodes was computed using the Needleman-Wunsch dynamic programming algorithm [31] as in the original CMO, where the score of mapping two nodes is their ordered graphlet degree similarity defined in (1). It corresponds to the following dynamic programming procedure: where the gap penalty g is defined as follows: where α is a constant parameter that will be discussed in Section 3.2.
The dynamic programming algorithm introduced in the above section produces the T[u, w] matrix, where u ∈ [1, n 1 ] and w ∈ [1, n 2 ]. Thus, the final similarity score of the two proteins is defined as follows: The range of the similarity score is also from 0 to 1. The closer the value of the GR_score is to 1, the higher the similarity of the two structures; the closer the value of the GR_score is to 0, the lower the structural similarity of the two proteins.

Constructing the Distance Matrix
To get the distance matrix for the clustering method, a similarity matrix for the decoys needed to be constructed, and then we can get the distance matrix by defining distance = 1 − similarity. The distance matrix is a symmetric matrix whose diagonal elements are all 0. The element in l th row and j th column represents the dissimilarity between two decoys l and j.
2.8. Select the Near-native Structure using Ensemble Clustering K-medoids was ran m = 500 times, which was enough to ensure statistical stability, with random initialization. The times a decoy became the centroid of the largest cluster was counted. It was found that a reasonable value for parameter k used in k-medoids was five. Finally, to consider both the size and the internal similarity of a cluster in selecting the near-native structure, a confidence score as defined in [30] was used. The centroid with the maximum confidence score within the cluster centroids whose count was more than 70% of the maximum count was selected as the near-native structure, where the count was the times a decoy became the centroid of the largest cluster.

Dataset
Up to 54 decoys sets (from CASP11) [12] and 43 decoys sets (from CASP10) [13], which are single-domain targets and have experimental native structures, were downloaded from Zhang Lab website [14]. These decoy sets contain structurally non-redundant set of protein structures from the raw decoy sets. The native structure, the generated model by SPICKER used in I-TASSER [32] sever, and the best TM-score for the target in the decoy set were also downloaded from the Zhang Lab website [14] (Supplementary materials).

Parameter Selection
In the dynamic programming, to select a good parameter α, four values of α, 0.2, 0.5, 1, and 2, were compared. For each decoy set, the similarity matrix was obtained by using the proposed GR_score in Section 2.6.3 using each α value. Then, the most popular centroid structure was selected as the near-native structure by the proposed method. The near-native structures selected by the proposed method and the corresponding native structures were compared using the TM-score.
In the experiments, 54 targets from CASP11 were used. For each target, four different TM-scores were produced from four α values, and the α value that produced the highest TM-score was recorded. Finally, for each α value, the number of the targets for which the highest TM-scores were produced using the α value was counted. The numbers of the targets with the highest TM-scores for four α values are shown in Figure 3. The range of the similarity score is also from 0 to 1. The closer the value of the GR_score is to 1, the 202 higher the similarity of the two structures; the closer the value of the GR_score is to 0, the lower the 203 structural similarity of the two proteins.

205
To get the distance matrix for the clustering method, a similarity matrix for the decoys needed 206 to be constructed, and then we can get the distance matrix by defining = 1 − .

207
The distance matrix is a symmetric matrix whose diagonal elements are all 0. The element in row 208 and column represents the dissimilarity between two decoys and . confidence score as defined in [30] was used. The centroid with the maximum confidence score 215 within the cluster centroids whose count was more than 70% of the maximum count was selected as

220
Up to 54 decoys sets (from CASP11) [12] and 43 decoys sets (from CASP10) [13], which are   The number of the targets with the highest TM-scores It can be seen from Figure 3 that when α = 0.5, the selected near-native structures were more similar to the corresponding native structure, compared to the other α values. Thus, the parameter α was set to 0.5 in the proposed method.

The Experimental Results for Datasets from CASP10.
For the proposed method, the GR_score was used to calculate the similarity matrix of the 43 decoy sets from CASP10. Then, the ensemble clustering was used to select the near native structures for each target. The near-native structure selected by the proposed method and the near-native structure generated by the SPICKER method used in I-TASSER sever were compared. The TM-sore and the GR_score between the selected near-native structures and the native structure were computed. The results are shown in the scatter plots in Figure 4, in which each target protein is represented as one point. The x-axis represents the GR_score or TM-score produced by the proposed method, and the y-axis represents the scores produced by the SPICKER method for the same target. The blue diagonal line in Figure 4 represents y=x. The same score does not necessarily mean the same model.

257
The details of the comparison can also be found in Table 1  The details of the comparison can also be found in Table 1, in which the first column is the ID of the target protein, the second column and the third column are the GR_scores of the selected near-native models by the proposed method and the SPICKER method, the fourth column and the fifth column are the TM-scores of the selected near-native model by the proposed method and the SPICKER method. All the scores were computed between the selected near-native model and the corresponding native structure.
To better understand the results, the number of the targets for which each method produced the better results was counted. The results are shown in Figure 5, where the white bar represents the number of decoy sets for which our method produces better results than SPICKER, the gray bar represents the number of decoy sets for which our method produces worse results than SPICKER, and the black bar represents the number of the similar results produced by the two methods. It can be seen from the left part of Figure 5 that the proposed method selected more near-native structures with higher GR_scores, compared to the SPICKER method. However, when measuring the similarity using the TM-score, the SPICKER method produced more near-native structures with higher scores, as can be seen from the right part of Figure 5, although the difference was smaller compared to the GR_score result on the left part of Figure 5. This may be due to fact that the similarity measure used in the proposed method is GR_score, instead of the TM-score.

282
To further evaluate the proposed method, it was also applied to the 54 decoy sets from CASP11.

283
The near-native structure selected by the proposed method and the near-native structure generated

289
Detailed results with scores for all the targets are shown in Table 2.   Figure 5. The comparison of the two methods using both GR_score and TM-score for datasets from CASP10.

The Experimental Results for Datasets from CASP11.
To further evaluate the proposed method, it was also applied to the 54 decoy sets from CASP11. The near-native structure selected by the proposed method and the near-native structure generated by the SPICKER method used in I-TASSER sever were compared. The results of the GR_score are shown in the left scatter plot in Figure 6, while the results of the TM-score are shown in the left scatter plot in Figure 6.

282
To further evaluate the proposed method, it was also applied to the 54 decoy sets from CASP11.

283
The near-native structure selected by the proposed method and the near-native structure generated

289
Detailed results with scores for all the targets are shown in Table 2.  Detailed results with scores for all the targets are shown in Table 2. To clearly represent the results, the number of the targets for which each method produces the better results was counted. The results are shown in Figure 7. It can be seen from the Figure 7 that the proposed method can select better near-native structures for more targets compared to the SPICKER method, evaluated either with GR_scores or with TM-scores. structure and the near-native structure found by the proposed method and the near-native structure 302 selected by SPICKER. The red model is the native structure and the blue is the structure selected by It can be seen from Figure 8

311
In this paper, we have proposed a new similarity score, GR_score, for comparing two protein 312 structures based on both CMO and order graphlet degrees. The introduced GR_score can serve as a 313 new assessment criterion for protein structure comparison. It is shown that the proposed GR_score 314 along with the ensemble clustering can be used to select the near-native structures from the decoy 315 sets. Compared to the state-of-the-art SPICKER method, the proposed method can select more high 316 quality near-native structures if evaluated using the GR_score for datasets from both CASP10 and 317 CASP11. In future work, we will continue to improve the computation of the similarity scores 318 between protein structures, and to evaluate the similarity scores from more aspects. Taking target T0851 as an example, Figure 8 shows the superposition between the native structure and the near-native structure found by the proposed method and the near-native structure selected by SPICKER. The red model is the native structure and the blue is the structure selected by the proposed method in Figure 8a, the other blue structure is generated by SPICKER in Figure 8b. It can be seen from Figure 8 that the SPICKER model has an obvious mismatch in the right half part of the protein.

300
Taking target T0851 as an example, Figure 8 shows the superposition between the native 301 structure and the near-native structure found by the proposed method and the near-native structure 302 selected by SPICKER. The red model is the native structure and the blue is the structure selected by 303 the proposed method in Figure 8(a), the other blue structure is generated by SPICKER in Figure 8(b).

304
It can be seen from Figure

311
In this paper, we have proposed a new similarity score, GR_score, for comparing two protein 312 structures based on both CMO and order graphlet degrees. The introduced GR_score can serve as a 313 new assessment criterion for protein structure comparison. It is shown that the proposed GR_score 314 along with the ensemble clustering can be used to select the near-native structures from the decoy 315 sets. Compared to the state-of-the-art SPICKER method, the proposed method can select more high 316 quality near-native structures if evaluated using the GR_score for datasets from both CASP10 and 317 CASP11. In future work, we will continue to improve the computation of the similarity scores 318 between protein structures, and to evaluate the similarity scores from more aspects.

Conclusions
In this paper, we have proposed a new similarity score, GR_score, for comparing two protein structures based on both CMO and order graphlet degrees. The introduced GR_score can serve as a new assessment criterion for protein structure comparison. It is shown that the proposed GR_score along with the ensemble clustering can be used to select the near-native structures from the decoy sets. Compared to the state-of-the-art SPICKER method, the proposed method can select more high quality near-native structures if evaluated using the GR_score for datasets from both CASP10 and CASP11. In future work, we will continue to improve the computation of the similarity scores between protein structures, and to evaluate the similarity scores from more aspects.