A Benchmark Dataset for Evaluating Practical Performance of Model Quality Assessment of Homology Models

Protein structure prediction is an important issue in structural bioinformatics. In this process, model quality assessment (MQA), which estimates the accuracy of the predicted structure, is also practically important. Currently, the most commonly used dataset to evaluate the performance of MQA is the critical assessment of the protein structure prediction (CASP) dataset. However, the CASP dataset does not contain enough targets with high-quality models, and thus cannot sufficiently evaluate the MQA performance in practical use. Additionally, most application studies employ homology modeling because of its reliability. However, the CASP dataset includes models generated by de novo methods, which may lead to the mis-estimation of MQA performance. In this study, we created new benchmark datasets, named a homology models dataset for model quality assessment (HMDM), that contain targets with high-quality models derived using homology modeling. We then benchmarked the performance of the MQA methods using the new datasets and compared their performance to that of the classical selection based on the sequence identity of the template proteins. The results showed that model selection by the latest MQA methods using deep learning is better than selection by template sequence identity and classical statistical potentials. Using HMDM, it is possible to verify the MQA performance for high-accuracy homology models.


Introduction
Protein structure prediction, which predicts the three-dimensional structure of a protein from its amino acid sequence, is important in structural bioinformatics. Various prediction methods have been developed to date [1,2], two major types of which are homology modeling and de novo modeling. Homology modeling predicts structures using the three-dimensional structure of a template protein, of which the amino acid sequence is similar to the target sequence. In contrast, de novo modeling predicts structures without using a template. While homology modeling cannot make predictions without the presence of a template structure, de novo modeling can make predictions regardless of the presence of a template. Furthermore, homology modeling is generally accurate when a good template exists [3,4], and the computational cost is much lower than that of de novo modeling. Therefore, homology modeling is often used in practical applications, such as drug discovery [5][6][7][8].
There are many methods for predicting protein structures. Each method is capable of generating multiple predicted structures in many cases. Therefore, it is necessary to select a structure from several structures for use in subsequent applications. Additionally, the user needs to judge whether the structure model has sufficient quality. However, CASP datasets include both single-domain and multi-domain proteins, most models are for single-domain proteins. The number of multi-domain proteins is insufficient.
The CAMEO dataset has also been often used as an evaluation dataset in recent MQA studies. The CAMEO dataset has more frequent updates and a larger number of targets than the CASP dataset. In addition, CAMEO's Model Quality Estimation category contains 1280 predicted structures with global local distance difference test (lDDT) score [29] greater than 0.8 out of 6690 structures in one year (19 February 2021-12 February 2022, which means that CAMEO has more structures with higher accuracy than CASP. However, CAMEO has the problem that the number of predicted structures per target is small. The number of predicted structures per target in CAMEO is about 10, and the performance of selecting the best structure from among the structures for a single target cannot be fully evaluated. In this study, we constructed a dataset named homology models dataset for model quality assessment (HMDM) for benchmarking MQA methods in practical situations. We used a single homology modeling method for tertiary structure prediction. The protein targets were selected to include the most accurate models. We then created two datasetsone containing single-domain proteins and another containing multi-domain proteins. After constructing the datasets, we compared the performance of the existing MQA methods using our datasets and determined their performance for highly accurate homology models.

Methods
We created a single-domain dataset and a multi-domain dataset to evaluate the MQA performance for single-domain and multi-domain proteins. We designed these datasets to contain a large number of high-quality models to evaluate the MQA performance in practical scenarios. To generate the high-quality models, we used a homology modeling method to predict the structure and selected target proteins with rich template structures. Then, to ensure an unbiased distribution of model quality for each target, structures were modeled using various templates, which were sampled to create the final dataset. Once the datasets were completed, we compared the MQA performance of the datasets using various MQA methods and indices of alignment quality, such as identity, between the target and template sequence.
The workflow for creating the datasets is shown in Figure 1. First, we selected template-rich entries as targets from the structural classification of proteins (SCOP) [30] and PISCES [31] databases, respectively. Then, the template was searched against protein data bank (PDB) [32] and modeled. Next, sampling was performed to ensure that the distribution of the model quality was not biased. Low-quality models were excluded. Finally, each target was confirmed to meet the criteria described later, and targets that did not meet the criteria were re-selected.

Target Selection
We selected 100 targets from the SCOP version 2 (SCOP2) (released on 30 March 2021) database for the single-domain dataset and from the subset of PISCES server (released on 25 February 2021) for the multi-domain dataset to avoid redundancy among the targets.
SCOP2 classifies protein domains based on their evolutionary and structural relationships. Because SCOP2 has entries for each protein domain, we used it as the target selection source for the single-domain dataset. We selected one target from each protein superfamily to avoid target redundancy. There are four protein types in SCOP: globular, fibrous, membrane, and intrinsically disordered. We selected only globular proteins as targets, because fibrous and membrane proteins are not stable in their own protein structure domain, and intrinsically disordered proteins are inappropriate targets for structure prediction. Furthermore, SCOP classifies entries into five classes based on their secondary structure: all alpha, all beta, alpha/beta, alpha+beta, and small proteins. We excluded small proteins because of few entries. Then, we selected 25 targets equally from the other four classes, with 100 targets in total. When choosing a superfamily for each class, we chose superfamilies with a high number of entries. When selecting targets from the superfamily entries, we selected an entry with the highest number of hits in homology searching using three iterations of PSI-BLAST [33] v2.9.0.

Homology modeling
Sampling models *** Figure 1. Workflow of dataset creation. First, we selected entries with rich templates from the SCOP and PISCES databases for both single-domain and multi-domain datasets as targets. Next, we performed a template search and homology modeling. In this process, templates with coverage of less than 60% are excluded. The models were sampled for each target so that the distribution of the quality of the models was not biased. When sampling, models with GDT_TS less than 0.4 were excluded. Finally, each target was confirmed to meet the criteria. Targets not meeting the criteria were re-selected.
PISCES is a server that can extract subsets of proteins using sequence identity and structural quality criteria. We used the subset which was precompiled using the following parameters: identity less than 20%, resolution less than 2.0, and R-factor less than 0.25. Since entries in PISCES are listed by amino acid sequence and contain both single-domain and multi-domain proteins, we extracted only multi-domain proteins based on the CATH [34] classification. To select template-rich entries as targets, we performed a homology search on each multi-domain entry in the same way as for the single-domain dataset, and selected 100 targets in the order of the number of hits.

Homology Modeling
We used MODELLER [35], which is a commonly used homology modeling method, as a structure prediction method. Although there are several homology modeling methods, we chose a single method to evaluate whether the MQA methods capture the quality of the model, rather than the characteristics of the predicted structure for each structure prediction method.
We performed three iterations of PSI-BLAST against PDB (released on 7 April 2021) and selected template structures from each iteration. We used two methods to select template structures from the results of each iteration of PSI-BLAST. One method was to simply select the hits in the order of their e-values to select the template structure that is most similar to the target sequence. The other method was to cluster the hit sequences using CD-HIT [36,37] with a 95% threshold and select hits in the order of their e-values so that the clusters do not overlap, which allowed us to select various template structures. Up to 10 templates were selected from each iteration by each of these two methods, totaling up to 60 templates. Note that the templates were selected in order starting from the first iteration, and the hits selected in the previous iteration were not re-selected. Even if the template structure was the same, if the sequence alignment changed, the template was selected.
We excluded some hits when selecting the templates. First, proteins with the same PDB ID as the target proteins were excluded. Next, hits with less than 60% coverage were excluded because high-quality model structures were not generated from such templates. Finally, hits with more than 95% identity to the target sequence were excluded because their structures could have been determined for the exact same protein or they may have the same structure with only a few residue differences.
After selecting the templates, we generated five models with different random seeds of optimization for each template. At that time, we set automodel.md_level to the default (refine.very_fast) for model refinement. Thus, up to 300 models were generated for a single target.

Sampling
After modeling, there was bias in the quality distribution of the model structures. Therefore, we sampled the models to reduce the bias. Before sampling, models with GDT_TS less than 0.4 were removed because their prediction quality was low and they were not suitable for evaluating the MQA performance.
Sampling was performed by limiting the number of models around the best model that had the best GDT_TS. If there were many models close to the best model, it was easy to select the model that is close to the best, and it was not possible to accurately evaluate the MQA performance when selecting the best model. Specifically, models whose GDT_TS difference from the best model was within 0.03 were defined as the models around the best. Up to 10 of these models were randomly sampled. Note that the best model was always included in the dataset separately from the models around the best model. The other models were randomly sampled so that the maximum number of models was 150, including the models already selected in the above procedure.

Verification
Finally, a subset of each target was evaluated to determine whether it met the following criteria. Targets that did not meet the criteria were re-selected because they were not suitable for the purpose of this study. First, targets with fewer than 50 models after sampling were replaced for both single-domain and multi-domain datasets. Then, targets whose GDT_TS of the best model was less than 0.7 were replaced. When replacing a single-domain target, the entry in the same superfamily that had the next highest number of hits in the PSI-BLAST template search was selected as a target. When replacing multi-domain targets, we selected the entry with the next highest number of hits among the multi-domain entries for a target. We iteratively replaced targets until all targets met the criteria.

MQA Performance Evaluation for the Constructed Datasets
We compared the MQA performance of several MQA methods and template quality metrics, such as identity, in the datasets created using the procedure described above.

Evaluation Metrics
We chose metrics that are important in practical situations as the main evaluation metrics. Some metrics that are important in practical situations included the ability to select the best model from a set of models and how accurately we could predict the value of GDT_TS. Therefore, we used the average of GDT_TS loss and the average of mean absolute error (MAE) per target as the main evaluation metrics. GDT_TS loss is the difference between the GDT_TS of the best model and the model with the highest MQA score. The lower the GDT_TS loss, the model that is closest to the best model is selected. If there were multiple highest MQA scores, we averaged their losses. The MAE is the average of the errors between the GDT_TS and MQA scores for each model. We also used the average Pearson and Spearman correlations between the GDT_TS and MQA scores for each target. GDT_TS was calculated using the TM-score [38]. To confirm statistical significance, Wilcoxon signed-rank tests were conducted using a significance level of 0.01.

Evaluation Methods
We compared three method types: indices of the alignment quality between target and template sequence, statistical potential function-based MQA methods, and machine learning-based MQA methods. All of these methods were run in our local environment.
We used three indices of the alignment quality between the target and template sequences: identity, positive, and coverage. Identity generally indicates the percentage of residues that match within the aligned sequence, but in this study, identity was calculated by dividing the number of residues that match between the template and target sequences by the length of the target sequence. Positive indicates the percentage of residues with a positive alignment score, which is also generally dependent on the length of the alignment. In this study, we calculated the positive percentage using the length of the target sequence. We calculated the coverage as the ratio of the template sequence coverage to the length of the target sequence.
For statistical potential function-based MQA methods, we used discrete optimized protein energy (DOPE) [39] and statistically optimized atomic potentials (SOAP) [40]. DOPE is a method based on the distance potential between atoms and is used as the internal score function in MODELLER. SOAP uses the atomic distance and orientation between a pair of covalent bonds.
As machine learning-based methods, we selected ProQ3D [41], SBROD [42], P3CMQA [43], and DeepAccNet [44]. ProQ3D is a standard MQA method and is a neural network-based method that uses sequence profiles and Rosetta energy as inputs. Several versions were trained with different labels. We used the S-score version. SBROD is a ridge regressionbased method that uses the structural features of the main chain as the input. P3CMQA is a three-dimensional convolutional neural network (3DCNN)-based method that uses atom-type features and sequence profiles as inputs. DeepAccNet is a method composed of 3DCNN and 2DCNN that uses distance-based and sequence-based features as inputs. Two versions of DeepAccNet were used: DeepAccNet, which is the standard version, and DeepAccNet-Bert, which uses bert-embeddings by ProtTrans [45]. ProQ3D and SBROD were selected because of their good performance in CASP13 [46]. P3CMQA and DeepAccNet were selected because they had good results in CASP14 [47].

Results
First, we described details on the constructed datasets. Then, we compared the prediction performance of the MQA methods on the datasets.

Constructed Datasets
We constructed single-domain and multi-domain datasets containing 100 targets each. The list of targets for the single-domain and multi-domain datasets are shown in Tables S1 and S2, respectively. Figure 2 shows the maximum GDT_TS value distribution for each target. For the both datasets, we generated structural models with GDT_TS greater than 0.9 for more than half of the targets. Compared to the CASP11-13 [16][17][18] dataset, both constructed datasets contain more accurate prediction structures (see Supporting Information). More detailed information, such as the distribution of GDT_TS of models for each target, is available in the supporting information.

Correlation between GDT_TS and the Alignment Quality
In homology modeling, the alignment quality between the target and template sequences is an important factor that affects the model quality. Therefore, the template is often selected based on the alignment quality. In this study, we used various template structures to model a single target and examined the relationship between the alignment and model quality. Figure 3 shows a scatter plot of GDT_TS and the three indicators of alignment quality: identity, positive, and coverage. There was a positive correlation between the alignment quality and GDT_TS. However, even if the identity and positive values were high, GDT_TS was low in some cases. In contrast, there were cases where the GDT_TS was high, even if the identity was low. Therefore, it was difficult to judge the quality of the predicted model only on the alignment quality.

MQA Performance Evaluation for the Constructed Datasets
Previous MQA studies generally evaluated their performance using the CASP dataset, and their performance for high-quality homology models was unclear. Thus, we benchmarked the performance of the existing MQA methods. Additionally, most practical studies using homology modeling often used sequence identity or the e-value to select the best template and structure model. We therefore compared the selection performance of alignment quality indices with those of the MQA methods of the constructed datasets.
The results for the single-domain dataset are listed in Table 1. Among the three indices of alignment quality (identity, positive, and coverage), the performance of identity was the best in terms of loss. The performance of the positive was better in terms of the Pearson and Spearman correlations. The performance of the coverage was lower than those of the other indices. Statistical potential-based methods (DOPE and SOAP) performed slightly better than identity for all indices. Classical machine learning-based methods (ProQ3D and SBROD) performed worse to loss. This is because structures with high identity are often accurate, which makes this is a good index for selecting the best model. In contrast, structures with low identity were not always inaccurate and had decreased performance for the Pearson and Spearman correlations. Newer deep learning-based methods (P3CMQA and DeepAccNet) performed better than alignment quality, both for loss and correlations. However, the improved performance of the latest deep learning-based methods compared to alignment quality (identity) was not significantly different (p > 0.01) for loss. From the viewpoint of MAE, the performances of classical machine learning-based methods were comparable to those of the latest deep learning-based methods, likely because most of these methods were not designed to output GDT_TS itself. Instead, most methods estimate an average lDDT. The MAE values were not sufficiently small, but they could be used as guidelines to estimate the absolute quality of the structure models.
The results for the multi-domain dataset are listed in Table 2. There was no significant differences in the results for the single domain dataset. However, all MQA methods performed better than alignment quality for loss. Indeed, the latest deep learning-based methods were significantly different from alignment quality (identity) in terms of all metrics.
The detailed results with p-values are shown in Tables S4 and S5. In addition, the results when RMSD is used as a label are shown in Tables S6 and S7.

Differences in the Quality of Models with the Same Template
In this study, we generated multiple predicted structures from a single template structure. It is possible to select a template that is most likely to be similar to the target from multiple template candidates based on the alignment quality between the template and the target sequences. However, the quality of the template alignment cannot be used to rank multiple structures predicted from the same template and alignment. Thus, we first analyzed the extent to which the quality differed among the model structures predicted from the same template. We also examined whether the MQA method can rank model structures predicted from the same template.
The difference between the best and worst template models in the single-domain dataset is shown in Figure S4. Note that we only show the difference in quality between the structures predicted from the template for which the best model was derived. The difference for most of the templates was small (less than 0.025), but some were large, with a maximum difference of approximately 0.175. Therefore, it is important to rank multiple structures predicted from the same template. The results for all templates are shown in Figure S5. The same trend was observed for all templates.
We tested whether the MQA method can select the best model among the model structures generated from the same template. For testing purposes, we used the seven templates in Figure S4 where the difference between the best and worst models was greater than 0.05. GDT_TS loss was used as the evaluation metric. To compare with the case of random selection, we calculated and compared the expected value. We also compared the case where the worst model was selected. The results of this test are shown in Figure 4. As shown in Figure 4, most MQA methods tend to select better models than random selection. However, there were some cases where the worst model was selected, and there was no method that could always select a model close to the best model.

Situation-Specific MQA Performance Analysis
The results of the MQA performance comparison show that the current MQA methods perform better overall. However, in some cases, it is better to choose a model based on alignment quality rather than MQA methods. Therefore, we analyzed when the alignment quality was effective in selecting a model and when the MQA method was effective.
First, we created categories based on the distribution of the alignment quality of the templates for each target and then compared the performance for each category. We created the following three categories. Here, identity Nth denotes the Nth highest identity, and len(templates) is the number of the template structures for a target. These categories were created based on the assumption that it would be better to select a model based on its identity when there is a template with an outstandingly high identity among the templates and to select a model based on the MQA score when there are multiple templates with high identity. The results for each category in the single-domain dataset are shown in Table 3, and the results for the multi-domain dataset are listed in Table S9. From the loss of each category, it was found that it is best to select a model based on identity when there is a template with exceptionally high identity. In the other cases, MQA methods were found to better select a structure model. Therefore, although identity is an important indicator that affects the model quality, it does not represent the detailed model quality, so model selection using MQA method is better when the difference in identity is less than approximately 10%.
In addition to the distribution of identity for each target, the following four categories were created based on the maximum identity value, and the performance of each category was compared. The results for each category in the single-domain and multi-domain datasets are shown in Tables 4 and S11, respectively. In the single-domain dataset, identity-based selection was best when the identity was 0.8 or higher. Otherwise, MQA method-based selection was better. In many cases, when sequence identity is quite high (over 0.8), the quality of the predicted structure is high, thus if we select the model with the best identity, the model is quite close to the best model. If not, it is better to use the MQA method because it is difficult to identify which models are highly accurate.  We also compared the performance for each class of targets in the single-domain dataset and for each number of domains of the targets in the multi-domain dataset, but there was no significant difference in performance. Therefore, there is little difference between MQA methods that are superior in estimating the quality of a particular structure (e.g., a structure with many alpha helix). This may be due to the many methods using the same dataset for training. As for the number of domains, it is difficult to discuss the performance difference between the MQA methods because the domain number of most targets was two. Detailed results are shown in Tables S12 and S13.

Conclusions and Future Work
In this study, we constructed two MQA benchmark datasets: a single-domain dataset named HMDM-single and a multi-domain dataset named HMDM-multi. Homology modeling was used as the modeling method, and the datasets contained high-quality models for each target. We compared the performance of the existing MQA methods using the constructed datasets and showed that the latest deep learning-based methods performed better. Thus, it is better to consider the score of the MQA method when selecting one structure from multiple predicted structures. However, the deep learning-based method is not superior for all targets. The sequence identity is a good indicator for selecting the best model when there are templates whose identity is much higher than other templates or when there are templates whose identity is quite high. Therefore, it is better to use the MQA method or sequence identity to select the best predicted structural model depending on the situation. The constructed datasets can be downloaded from http://www.cb.cs.titech.ac.jp/hmdm (accessed on 12 March 2022).
One of the future works is to improve the dataset construction protocol. The current number of targets is acceptable considering the target selection method, but increasing the number of targets in the future will enable more reliable evaluation. The sampling method of the model structure can also be improved. There are no significant problems with the current sampling method, but since it is a simple method, further reducing the bias in accuracy will enable better evaluation. Other improvements could be achieved in the selection of multi-domain targets. In homology modeling, we assume that there is no significant difference in orientation between domains in a family, and we do not assign any specific restrictions to the targets. However, there are cases where the orientation differs, thus it is required to select targets based on the strength of the interaction between domains. Furthermore, the addition of targets with no high-identity templates can be considered as future work. There are few targets in this dataset for which only low identity (<30%) templates exist. The ability to select the best structure for such targets is important in real-life applications, thus we will need to add such targets in the future.
Another possible future work is to verify the accuracy estimation performance for AlphaFold2 structures. Recently, the release of AlphaFold2 has made it possible to predict 3-dimensional structures with high accuracy without using homology modeling. Since it is expected that AlphaFold2 structures will be used more in practical applications, it is important to verify the accuracy estimation performance for AlphaFold2 structures.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/bioengineering9030118/s1, Figure S1: Sequence length distribution of targets in the datasets; Figure S2: The distibution of the GDT_TS for each target in the singledomain dataset; Figure S3: The distibution of the GDT_TS for each target in the multi-domain dataset Figure S4: The distribution of the difference in GDT_TS between the best and the worst models from the same template and alignment in the single-domain dataset (For only the best template); Figure S5: The distribution of the difference in GDT_TS between the best and the worst models from the same template and alignment in the single-domain dataset (For all templates); Table S1: List of 100 targets in the single-domain dataset; Table S2: List of 100 targets in the multi-domain dataset; Table S3: Number of targets for which the maximum GDT_TS value exceeds the threshold in each dataset compared with CASP dataset; Table S4: MQA performance for the single-domain dataset; Table S5: MQA performance for the multi-domain dataset; Table S6: MQA performance for the single-domain  dataset with RMSD as a label; Table S7: MQA performance for the multi-domain dataset with RMSD  as a label; Table S8: MQA performance for each category based on the distribution of identity for the  single-domain dataset; Table S9: MQA performance for each category based on the distribution of identity for the multi-domain dataset; Table S10: MQA performance for each category based on the maximum identity for the single-domain dataset; Table S11: MQA performance for each category based on the maximum identity for the multi-domain dataset; Table S12: MQA performance for each class of the single-domain dataset; Table S13: