A Computational Method to Predict Effects of Residue Mutations on the Catalytic Efﬁciency of Hydrolases

: With scientiﬁc and technological advances, growing research has focused on engineering enzymes that acquire enhanced efﬁciency and activity. Thereinto, computer-based enzyme modiﬁcation makes up for the time-consuming and labor-intensive experimental methods and plays a signiﬁcant role. In this study, for the ﬁrst time, we collected and manually curated a data set for hydrolases mutation, including structural information of enzyme-substrate complexes, mutated sites and Kcat/Km obtained from vitro assay. We further constructed a classiﬁcation model using the random forest algorithm to predict the effects of residue mutations on catalytic efﬁciency (increase or decrease) of hydrolases. This method has achieved impressive performance on a blind test set with the area under the receiver operating characteristic curve of 0.86 and the Matthews Correlation Coefﬁcient of 0.659. Our results demonstrate that computational mutagenesis has an instructive effect on enzyme modiﬁcation, which may expedite the design of engineering hydrolases. these results show that our method does have a noticeable ability to predict the effect of residue mutation on hydrolysis reaction, especially on glucoside hydrolysis, phosphohydrolysis, lactam hydrolysis, etc. However, there is still one question that the number of increasing-mutation predicted by our method was much less than decreasing-mutation because of our imbalanced data set. Therefore, we will measure our model in more cases that contain increasing-mutations in the further work to make our method more rigorous.


Introduction
Enzymes, acting as biological catalysts, have a remarkable capacity to accelerate chemical reactions by 10 7 -10 19 fold [1], and make important contributions to shaping and controlling cellular life. For instance, kinases are indispensable for signal transduction and cell regulation [2]. ATPases have to do with activating transport to export toxins, wastes, and solutions, which block the cellular process [3]. Proteases implement a breakdown of large molecules into smaller ones to be absorbed, related to digestion of ingested proteins and protein catabolism [4,5]. Glycoses [6] and oxidoreductases [7] collaborate to create metabolic pathways and maintain a normal life. By virtue of their highly efficient catalytic activity, awesome biodegradability, extremely strong environmental tolerance, and exquisite substrate selectivity, enzymes are widely employed in the biofuel industry [8,9], biological detergents [10], brewing industry [11,12], food processing [13,14], etc. However, enzymes have finite productivity in vitro with a limitation that they only convert intrinsic substrates at high rates in vivo. Not only that, but wild-type enzymes also have challenges catalyzing non-natural substrates in the application of industry.
On the theoretical basis that amino acid sequence determines protein tertiary structure [15], a wide range of amino acid residues that form temporary bonds with substrates and make the catalytic reaction faster by lowering the activation energy, make up the functional sites and mediate various functions of enzymes. For this reason, the alteration of residues may give rise to the functional change of enzymes. Valine in position 56 of glutathione S-transferases (GSTs) formed hydrogen bonds with its substrate glutathione and a dramatic decrease in the thermodynamic stability would be seen when mutating the residue to alanine [16]. Mutations of the catalytic site of Candida Antarctica lipase B (CALB) particularly meaningful to enhance the activity of enzymes, especially those beneficial to the environment like PETases. Moreover, these methods are requisite for the reason that it is quite labor-intensive and time-consuming to verify the effect of mutants by wet experiments. However, as shown above, computational methods to improve and tailor enzymes are mainly focused on enzyme stability based on thermodynamics calculation or the enzyme activity executed by molecular dynamics. Little attention is paid to the catalytic efficiency (Kcat/Km value) of the enzyme catalytic reaction. Hence, in this study, we proposed a computational method to predict the effect of residue mutations on the catalytic efficiency of hydrolases by comparing the generally acknowledged used assessment metric Kcat/Km ratio, where Kcat represents rate constants for the catalytic conversion of substrate into product and Km represents Michaelis constant. Here, we built a classification model based on historical data that described the catalytic effectiveness change between wildtype and mutated enzymes and then predicted new output (increasing-mutation and decreasing-mutation). The method may contribute to speeding up the engineering process of hydrolases.

Data Set
The completed data set was composed of 314 mutations (77 increasing-mutation and 237 decreasing-mutation, see Supplementary File), distributed across 65 kinds of proteins from 33 organisms, combined with 68 kinds of various reaction substrates. To prevent bias on the method, we made the proportion of increasing-mutation and decreasing-mutation on training and test set, respectively, which was consistent with that on the primitive data set when separating the data. Eventually, the training set contained 257 mutations (65 increasing-mutation and 192 decreasing-mutation) and the test set embodied 57 mutations (12 increasing-mutation and 45 decreasing-mutation). Meanwhile, the total data were primarily distributed in 13 categories based on its specific hydrolysis reaction type (Figure 1), such as phosphohydrolysis (27.4%), proteolysis (8.3%), glutathione hydrolysis (0.3%), and dehalogenation (0.3%).
stitutions of butyrylcholinesterase (BChE) successfully accomplished the improvement of drug metabolism and clearance rate [44].
The above research illustrates that the directed evolution of hydrolases is indeed of essential significance yet to modify hydrolases and computational methods seem to be particularly meaningful to enhance the activity of enzymes, especially those beneficial to the environment like PETases. Moreover, these methods are requisite for the reason that it is quite labor-intensive and time-consuming to verify the effect of mutants by wet experiments. However, as shown above, computational methods to improve and tailor enzymes are mainly focused on enzyme stability based on thermodynamics calculation or the enzyme activity executed by molecular dynamics. Little attention is paid to the catalytic efficiency (Kcat/Km value) of the enzyme catalytic reaction. Hence, in this study, we proposed a computational method to predict the effect of residue mutations on the catalytic efficiency of hydrolases by comparing the generally acknowledged used assessment metric Kcat/Km ratio, where Kcat represents rate constants for the catalytic conversion of substrate into product and Km represents Michaelis constant. Here, we built a classification model based on historical data that described the catalytic effectiveness change between wild-type and mutated enzymes and then predicted new output (increasing-mutation and decreasing-mutation). The method may contribute to speeding up the engineering process of hydrolases.

Data Set
The completed data set was composed of 314 mutations (77 increasing-mutation and 237 decreasing-mutation, see Supplementary File), distributed across 65 kinds of proteins from 33 organisms, combined with 68 kinds of various reaction substrates. To prevent bias on the method, we made the proportion of increasing-mutation and decreasing-mutation on training and test set, respectively, which was consistent with that on the primitive data set when separating the data. Eventually, the training set contained 257 mutations (65 increasing-mutation and 192 decreasing-mutation) and the test set embodied 57 mutations (12 increasing-mutation and 45 decreasing-mutation). Meanwhile, the total data were primarily distributed in 13 categories based on its specific hydrolysis reaction type (Figure 1), such as phosphohydrolysis (27.4%), proteolysis (8.3%), glutathione hydrolysis (0.3%), and dehalogenation (0.3%).

Model Performance
A classification model based on the descriptors was constructed to predict whether the mutated residue would increase or decrease catalytic activity. To begin with, we trained several classification algorithms to compare their performance on a five-fold cross-validation data set: Ensemble algorithm like the random forest, decision tree algorithm, support vector machine algorithm, clustering algorithm that is K-nearest neighbors. Bayesian algorithms involve naive Bayes and Gaussian naive Bayes and neural net like multilayer perceptron. It is apparent from Figure 2a and Table 1 that the choice of algorithms does affect the performance of the model, where the AUC, accuracy, and MCC tend to be larger in ensemble modeling techniques (random forest) than as compared to the other models. Random forest classifier constructs a multitude of decision trees, of which every decision tree will give a class prediction, and afterward, the class with the most votes will be the final prediction result of the method [45]. Obviously, the random forest has the best result in our study, with the highest MCC of 0.382 and the maximum AUC of 0.8, which shows a significant advantage in prediction. We thus perceived the random forest as a suitable model to predict the effect of residue mutation on the catalytic efficiency of hydrolases.

Case Study
To analyze the prediction capability of the method, we performed several case studies on increasing-mutation or decreasing-mutation predicted by this method. Figure  3 displays the prediction effects of residue mutation on glycoside hydrolysis reaction efficiency, which revolved with the hydrolysis of glycosidic bonds in complex sugars [46]. Enzymes represented here are Beta-D-glucosidase from Maize (PDB: 1HXJ) and Rye (PDB: 3AIU). We noted that the method predicted correctly with two increasing-mutations (V205L and P377A) and three decreasing-mutations (F198V, D261N, and M263F) of Beta-D-glucosidase from Maize [47], plus with two increasing-mutations (G464F and S465L) and two decreasing-mutations (F198A, Y378A) of Beta-D-glucosidase from Rye were predicted by the method [48], among which F198V and D261N resulted in an almost inactive enzyme. Insights into the structure of the enzymes, mutation at position 198 motivated the rearrangement of Phe205, Phe466, and Glu464, the three residues that all related to the glycoside and aglycone binding pocket and mutation of residue Asp261 destabilized the protonation states of the acid-base catalyst. Phosphohydrolysis reactions catalyzed by Phosphonoacetaldehyde hydrolase (Figure 4a) and RNA helicase (Figure 4b) are the process of hydrolysis of organic phosphate [49]. From graphs, we uncovered the model predicted rightly with five decreasing-mutations of Phosphonoacetaldehyde hydrolase (C22A, M49L, G50A, H56A, and Y128F) [50], and three decreasing-mutations of RNA helicase (S228A, T230A, and H375A) [51]. In Phosphonoacetaldehyde hydrolase, the Kcat/Km ratio of M49L was reduced 15,000-fold, along with Kcat/Km of G50A was reduced 11,000-fold. On the view of protein structure, Met49 was located in the catalytic site and bound with a water molecule, which was proved to be linked with the formation of the hydrogen bond with the carbonyl oxygen of substrate and the transfer of protons. However, Gly50 enabled to aid the hairpin turn  Nevertheless, just as the data ( Figure 2a and Table 1) revealed, the recall, AUC, and MCC of the model were not optimal, especially the MCC. Therefore, we successively adopted several measures to achieve a higher-quality prediction model. On the one hand, to balance the model's prediction ability of different classes of samples, we harmonized the slightly unequal dataset (the ratio of positive and negative samples was close to 1:3) by oversample way synthetic minority oversampling technique (SMOTE), which synthesizes new examples for the minority class. On the other hand, we did hyper parameter optimization to improve the model's performance targeting numbers of trees in the forest, the depth of every decision tree, the numbers of samples allowed in the leaf node, and the numbers of samples when placed in the node before the node is split. Eventually, we improved the prediction ability of this random forest model on the aspect of MCC (Table 2), increasing to 0.448 from 0.382, which meant the model had a better prediction result on both increasing-mutation and decreasingmutation. The final model was exploited to predict the blind test set ultimately, and what stands out in Figure 2b and Table 2 is the model was unexpectedly satisfactory on the test set, with AUC of 0.86 and MCC of 0.659.

Case Study
To analyze the prediction capability of the method, we performed several case studies on increasing-mutation or decreasing-mutation predicted by this method. Figure 3 displays the prediction effects of residue mutation on glycoside hydrolysis reaction efficiency, which revolved with the hydrolysis of glycosidic bonds in complex sugars [46]. Enzymes represented here are Beta-D-glucosidase from Maize (PDB: 1HXJ) and Rye (PDB: 3AIU). We noted that the method predicted correctly with two increasing-mutations (V205L and P377A) and three decreasing-mutations (F198V, D261N, and M263F) of Beta-D-glucosidase from Maize [47], plus with two increasing-mutations (G464F and S465L) and two decreasing-mutations (F198A, Y378A) of Beta-D-glucosidase from Rye were predicted by the method [48], among which F198V and D261N resulted in an almost inactive enzyme. Insights into the structure of the enzymes, mutation at position 198 motivated the rearrangement of Phe205, Phe466, and Glu464, the three residues that all related to the glycoside and aglycone binding pocket and mutation of residue Asp261 destabilized the protonation states of the acid-base catalyst. Phosphohydrolysis reactions catalyzed by Phosphonoacetaldehyde hydrolase (Figure 4a) and RNA helicase (Figure 4b) are the process of hydrolysis of organic phosphate [49]. From graphs, we uncovered the model predicted rightly with five decreasing-mutations of Phosphonoacetaldehyde hydrolase (C22A, M49L, G50A, H56A, and Y128F) [50], and three decreasing-mutations of RNA helicase (S228A, T230A, and H375A) [51]. In Phosphonoacetaldehyde hydrolase, the Kcat/Km ratio of M49L was reduced 15,000-fold, along with Kcat/Km of G50A was reduced 11,000-fold. On the view of protein structure, Met49 was located in the catalytic site and bound with a water molecule, which was proved to be linked with the formation of the hydrogen bond with the carbonyl oxygen of substrate and the transfer of protons. However, Gly50 enabled to aid the hairpin turn at the helix-turn-helix motif and stabilize the closed conformation state of enzyme, which could interpret the drastic descend of G50A mutant catalytic efficiency.
All these results show that our method does have a noticeable ability to predict the effect of residue mutation on hydrolysis reaction, especially on glucoside hydrolysis, phosphohydrolysis, lactam hydrolysis, etc. However, there is still one question that the number of increasing-mutation predicted by our method was much less than decreasing-mutation because of our imbalanced data set. Therefore, we will measure our model in more cases that contain increasing-mutations in the further work to make our method more rigorous.  Table 3.
All these results show that our method does have a noticeable ability to predict the effect of residue mutation on hydrolysis reaction, especially on glucoside hydrolysis, phosphohydrolysis, lactam hydrolysis, etc. However, there is still one question that the number of increasing-mutation predicted by our method was much less than decreasing-mutation because of our imbalanced data set. Therefore, we will measure our model in more cases that contain increasing-mutations in the further work to make our method more rigorous.  shown in Table 3.
All these results show that our method does have a noticeable ability to predict the effect of residue mutation on hydrolysis reaction, especially on glucoside hydrolysis, phosphohydrolysis, lactam hydrolysis, etc. However, there is still one question that the number of increasing-mutation predicted by our method was much less than decreasing-mutation because of our imbalanced data set. Therefore, we will measure our model in more cases that contain increasing-mutations in the further work to make our method more rigorous.

Discussion
Research on the engineering of hydrolases has always been the hotspot in enzyme engineering. In this paper, we exhibited a computational approach to predict the change of catalytic efficiency of hydrolases after residue mutations. Importantly, the dataset employed in this method was a manually curated, literature-derived dataset, comprising 314 single residue mutations, and associated for the first-time experimental information on alterations in catalytic hydrolysis reaction activity with three-dimensional structures of protein-ligand complexes. Nevertheless, there are still some unanswered questions. As the data (Supplementary File) show, we only focused on individual residue mutations of enzymes, whereas iterative mutations are more integrant in protein engineering. Previous studies assumed that mutation effects of proteins have additivity [58,59], and the ProSAR algorithm [60] was also applied to identify synergistic effects by statistical analysis way. Therefore, it is still a rough but worthwhile road to study the effects of it-

Discussion
Research on the engineering of hydrolases has always been the hotspot in enzyme engineering. In this paper, we exhibited a computational approach to predict the change of catalytic efficiency of hydrolases after residue mutations. Importantly, the dataset employed in this method was a manually curated, literature-derived dataset, comprising 314 single residue mutations, and associated for the first-time experimental information on alterations in catalytic hydrolysis reaction activity with three-dimensional structures of protein-ligand complexes. Nevertheless, there are still some unanswered questions. As the data (Supplementary File) show, we only focused on individual residue mutations of enzymes, whereas iterative mutations are more integrant in protein engineering. Previous studies assumed that mutation effects of proteins have additivity [58,59], and the ProSAR algorithm [60] was also applied to identify synergistic effects by statistical analysis way. Therefore, it is still a rough but worthwhile road to study the effects of iterative mutation on proteins. Besides, features utilized in our methods were focused on sequence, structure, and assay environment, which were predominantly molecular-level descriptors. Nevertheless, lots of studies have confirmed that the hydrophobic effect and hydrogen bonding are the dominant force in protein folding, which contributes to the stability of proteins [61][62][63] and the reduction of free energy and may impact the transition of proteins state and the kinetic properties of enzymes. Therefore, more attention should be paid to the atomic-level characteristics [64], which facilitate the establishment of causal relationships between protein-ligand complexes and catalytic activity. Meanwhile, chemical reactions catalyzed by enzymes are dynamic processes associated with activation energy, bond break, and transformation of chemical groups. Hence, comprehensive understanding of reaction mechanisms, dynamics, structures and sequences is essential for in-silico enzyme engineering. At the same time, just as shown in Figure 1, hydrolases are an enzyme family containing diverse enzymes that catalyze various hydrolytic reactions, e.g., lipases break ester bonds and phosphatases act analogously upon phosphate. Thus, different subsets of hydrolyses-class may catalyze the substrate with different attributes and characters. Hence, prediction models on the catalytic efficiency of hydrolyses should make a difference between certain subsets of hydrolyses. Future studies on the topics mentioned above are, therefore, recommended.

Method Workflow
As schematically represented by Figure 8, our method was made up of four major steps, including data collection, data preparation, feature construction, and model prediction.

Method Workflow
As schematically represented by Figure 8, our method was made up of four major steps, including data collection, data preparation, feature construction, and model prediction.

Data Collection and Preparation
The availability of high-quality data is essential for computational methods because its confidence and attribute determine the universality of the model. However, current kinetic data and structural information of protein-ligand complexes of both wild and mutant proteins are unsystematic and nonstandard. To fill this gap, we collected and manually curated experimental-based data from MuteinDB [65] and SABIO-RK [66], two databases that include information about reaction substrate, products, kinetic parameters, and enzymatic reactions directly with variants of enzymes, and also checked it for accuracy from the literature. We manually aggregated exceed 300 data from MuteinDB, including protein name, mutated site, reaction substrate, reaction condition and crawled about 60,000 data by Python (Python Software Foundation, Scotts Valley, CA, USA, 2009) from SABIO-RK, and then removed some data against criteria that include requir-

Data Collection and Preparation
The availability of high-quality data is essential for computational methods because its confidence and attribute determine the universality of the model. However, current kinetic data and structural information of protein-ligand complexes of both wild and mutant proteins are unsystematic and nonstandard. To fill this gap, we collected and manually curated experimental-based data from MuteinDB [65] and SABIO-RK [66], two databases that include information about reaction substrate, products, kinetic parameters, and enzymatic reactions directly with variants of enzymes, and also checked it for accuracy from the literature. We manually aggregated exceed 300 data from MuteinDB, including protein name, mutated site, reaction substrate, reaction condition and crawled about 60,000 data by Python (Python Software Foundation, Scotts Valley, CA, USA, 2009) from SABIO-RK, and then removed some data against criteria that include requiring enzyme kinetics data of single residue mutation and the three-dimensional structure had been determined. About 3000 data pieces of ultimately matched our criteria.
All the data were distributed at different reaction types and the data related to hydrolysis reaction were taken out and thereafter, corresponding mutated residues were localized onto protein structures deposited in the Protein Data Bank (PDB) [67]. Meanwhile, the reaction substrate was also mapped onto experimentally solved protein-ligand complexes if the crystal complex exists. Otherwise, the ligand was docked into the catalytic sites of proteins by SMINA [68]. Then, we divided data into two groups based on the catalytic efficiency change of a given single-point residue mutation on a logarithm scale: where Kcat/Km (mut) indicates the catalytic efficiency of the mutant protein-ligand complex and Kcat/Km (wt) indicates the catalytic efficiency of the wild-type complex. The residue mutation was classified as increasing-mutation (to increase catalytic efficiency) when efficiency change was positive and the decreasing-mutation (to decrease catalytic efficiency) when the change was negative.

Feature Construction
In an attempt to predict increasing-mutation or decreasing-mutation with the computational method, we turned it into a classification problem and constructed 50 features of protein and ligand at the molecule and atomic level to describe mutation information. Overall, features were categorized as sequence features, structural features, and assay conditions. According to our previous research [69], sequence features that have been the most preferred descriptor applied in computational methods, here, included Shannon entropy to depict conservation of mutated residue, position-specific scoring matrix to describe mutated residue evolutionary distance and other physicochemical properties of residues like hydropathy, hydrophobicity, and polarity. All the sequence features were computed by Prody (Bahar Lab, Pittsburgh, PA, USA, 2011) [70], Biopython (Open Bioinformatics Foundation, Toronto, Canada, 2020) [71] toolkit, and KAlign (Stockholm Bioinformatics Centre, Sweden, 2011) [72] software.
Structural features were divided into four categories.
(1) Basic structural features of proteins consisted of geometric characterization, secondary structure assignments, residue solvent-accessible surface area calculated by DSSP (Wolfgang Kabsch and Chris Sander, Heidelberg, Germany, 2003) [73] software and Biopython (Open Bioinformatics Foundation, Toronto, Canada, 2020) [71] toolkit, along with B-factor [74] attained from protein 3D structure file to reflect displacement fluctuation of atoms.
(2) Residue network features were also included in this method, which involved the change of non-covalent interactions of mutated amino acids based on residue network, generated by RING [75] software (BiocomputingUP Lab, Italy, 2016).
(4) In order to describe the difference of characterization of interactions in proteinligand complexes, we drew on Protein-Ligand Interaction Profiler (PLIP) [76] tool (Michael Schroeder group, Tatzberg, Dresden, Germany, 2015) to excavate relevant to non-covalent contacts in protein-ligand structure and calculated quantity changes of seven interaction types (pi-stacking, pi-cation interactions, hydrogen bonds, hydrophobic interactions, salt bridges, water bridges, hydrogen and halogen bonds) for mutated residues.
Considering in vitro assays of enzyme catalytic activity is easily affected by experimental conditions that could lead to different Kcat values in different conditions, we added pH and experimental temperature as two descriptors to address this issue.

Comparison of Different Classifiers
With the purpose of improving the accuracy of the method, we compared seven different classifiers exploited to select the best model. They were random forest classifier, Gaussian process classifier, neural net classifier, naive Bayes, nearest neighbor classifier, decision tree, and support vector machine (SVM). All the classifiers were trained on the training set, and according to the performance metrics, we chose the best method to construct the model in the future. All the machine learning models mentioned above were implemented by Python 3 and the scikit-learn [77] library.

Model Evaluation
Commonly, the model will be fitted on a training set and then utilized to predict unknown data of the test set. However, due to the limitation of our data set and to prevent overfitting, we employed five-fold cross-validation on the training set. The training set was randomly split into five complementary subsets, and each of the subsets would be exploited as the validation set to validate analysis or find the best parameters after the model was fitted on the other four subsets left. Afterward, the independent and brand-new test set was used to provide an impartial assessment of the final model. The performance of the model was evaluated by several metrics, including accuracy, precision, recall, Matthews Correlation Coefficient (MCC), the receiver operating characteristic (ROC) curve, along with false positive rate (FPR), true positive rate (TPR, also defined as recall), and the area under the curve (AUC).
The measurements above were defined as: where TP, FP, TN, FN represent true positive, false positive, true negative, and false negative. The ROC curve is plotted with TPR against the FPR at diverse threshold values, and AUC provides a cumulative measure of performance across all possible classification thresholds.
When it comes to classification problems, accuracy, precision, recall, and AUC are commonly applied to assess the model. Whereas just as the equation is shown, when the dataset was imbalanced, which means the distribution of data across the known classes is skewed, accuracy became invalid. Precision, recall, and even the ROC curve were in the same situation as they just focused on the single category (positive or negative class) which we were interested in. Thereby, to estimate the model roundly, we took into account other metrics MCC [78], which took into account all four values FN and regarded the true class (whether it is a positive or negative class) and the predicted class as two variables to calculate their correlation coefficient. The higher the correlation between true and predicted values, the better the prediction for all classes.

Conclusions
Engineering of hydrolyses to strengthen their activity by mutating certain residues has always been the hotspot for researchers. Here, in our study, to discriminate the effects (increase and decrease) of residue mutation on the catalytic activity of hydrolyses, we constructed a classifier executed by Random Forest algorithm on 257 single mutations of hydrolyses with properties of enzymes on sequence, structure, and assay condition and to predict the residue effect by assessing the Kcat/Km of enzymes. After training on the fivefold cross-validation set and optimization of parameters, ultimately, the independent test set had a dazzling result with the AUC of 0.86 and the MCC of 0.659. Our study not only provides a solid theoretical basis for related research but also expedites the engineering process of hydrolyses.