1. Introduction
The global protein engineering market size was valued at USD 2.16 billion in 2022 and is expected to grow at a compound annual growth rate (CAGR) of 16.60% from 2023 to 2030 [
1]. The increasing demand for efficient, selective and stable enzymes at industrial and research levels during the past decade has contributed to the blooming of the computational protein engineering (CPE) field. Rational design using computational methods assists experimental approaches to speed up protein engineering campaigns aimed at developing novel biocatalysts.
With the increasing availability of software, web servers and methods for CPE [
2,
3,
4,
5], the decision on which of these methods to use has become a challenge for scientists from many different disciplines. Moreover, the selection of suitable CPE tools strongly depends on the degree of knowledge of the protein structure and the enzymatic mechanism [
4].
Overall, CPE strategies consist of the following three differentiated steps [
6]: (i) mutation selection (library design), (ii) mutant model generation (modeling software), and (iii) evaluation of the target property of the mutant enzyme, typically changes in stability, substrate affinity or reactivity. A wide range of software and methods covers these three steps, and some tools can perform more than one. However, choosing the appropriate software from this extensive range can be difficult for researchers outside the computational biology field, especially because new methods are often presented as advancements in the theoretical framework they rely on rather than in terms of the property they optimize. We therefore think that, from a practical perspective, the key criteria driving software/method selection should be the biochemical properties to be optimized. This involves classifying or ranking mutant enzymes according to the desired property, which in turn requires a suitable scoring function that measures and ranks the target property reliably. Enzymatic properties of biotechnological interest evaluated by the most common scoring functions include affinity/selectivity for a given substrate, catalytic efficiency, stability under different operating conditions (mainly thermostability and performance at a specific pH) and solubility, which matters for optimizing recombinant biocatalyst production and for application in homogeneous systems. Following this criterion, the present review groups the available software, tools and strategies according to the measured biochemical property and categorizes these groups by the methodology used to evaluate the scoring function. It is not our intention to exhaustively review the exact methodology behind the different software and scoring functions but to give insights into the method/theory they are based on and, more importantly, their possible usage.
The main objective of this review is to assist both dry and wet lab researchers—with no or hardly any experience using CPE tools—in the use of these scoring functions for protein engineering research. To simplify the extensive methodology, we grouped the CPE tools and scoring functions according to the following properties: (i) protein–ligand affinity, (ii) enzyme reactivity, (iii) thermostability and (iv) solubility (see
Figure 1). Notice that de novo design is excluded from the classical CPE strategies. This approach is based on creating a new enzyme from scratch and thus, the strategies used are very different from those used for tuning the enzyme’s catalytic performance. In this sense, de novo protein design is out of the scope of this review.
2. Engineering Enzyme–Substrate Recognition
In CPE campaigns aimed at evolving biocatalysts, substrate binding affinity is the property that scientists try to modulate the most. Binding affinity can be directly correlated to enzymatic efficiency because it accounts for the molecular recognition of the substrate and its availability for the enzymatic reaction to take place. General methods to evaluate molecular recognition include computational docking and molecular dynamics, as well as machine learning. The most relevant computational tools to engineer enzyme–substrate recognition, grouped by methodology, are listed in
Table 1.
2.1. Virtual Docking Tools to Assess and Predict Enzyme–Substrate Complexes
Molecular docking has been the most used computational method for improving protein–ligand binding affinity during the past two decades [
47,
48]. Docking is based on the generation of large ensembles of ligand poses followed by their evaluation by a suitable scoring function capable of ranking these binding modes. Molecular docking has traditionally been used for the virtual screening of ligand libraries in drug design. Nevertheless, docking can also be used for CPE in the same way as virtual screening; the difference is that instead of docking thousands of possible ligands upon the same receptor, hundreds or thousands of mutant protein structures are docked with the same target ligand/substrate. Structure-based docking may be a useful tool to annotate enzyme function [
49].
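The mutant-centred use of docking described above can be sketched in a few lines. This is a minimal illustration, not a protocol from the cited works: the function `rank_mutants`, the variant names and the scores are all hypothetical, and `dock_score` stands in for whatever docking program and scoring function is actually used. It assumes that a more negative score means tighter predicted binding.

```python
from typing import Callable, Dict, List, Tuple


def rank_mutants(
    mutants: List[str],
    dock_score: Callable[[str], float],
    wild_type: str = "WT",
) -> List[Tuple[str, float]]:
    """Rank variants by their score change relative to the wild type.

    Assumes a more negative score means tighter predicted binding.
    """
    reference = dock_score(wild_type)
    deltas = [(name, dock_score(name) - reference) for name in mutants]
    return sorted(deltas, key=lambda item: item[1])


# Hypothetical scores (kcal/mol) standing in for real docking runs
# of the same substrate against each variant structure.
toy_scores: Dict[str, float] = {
    "WT": -6.0,
    "A45G": -6.8,
    "L90F": -5.5,
    "A45G/L90F": -7.4,
}
mutant_names = [name for name in toy_scores if name != "WT"]
ranking = rank_mutants(mutant_names, toy_scores.__getitem__)
for name, delta in ranking:
    print(f"{name}: {delta:+.2f} kcal/mol vs WT")
```

In practice the lookup table would be replaced by a call that docks the substrate into each mutant model and returns the best pose score.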
There are many different approaches to computational docking, the key ingredient being the theoretical framework on which the ligand binding affinity is evaluated. Liu and Wang [
50] made a remarkable effort to classify the vast number of scoring functions for evaluating binding affinity reported in the literature (more than 100 have been published in the past two decades according to their research). These can be classified into five categories: physics-based, empirical, semi-empirical, knowledge-based and descriptor-based scoring functions. We provide a selection of the most used scoring functions (
Table 2) in the field of CPE. These are described together with the computational tool they are implemented in.
2.1.1. Physics-Based Scoring Functions
These scoring schemes rely on force field-based molecular mechanics (MM), solvation models and even quantum mechanics (QM). Docking software using force field-based scoring functions such as DOCK [
72] and GOLD [
11] calculates the protein–ligand binding affinity by summing the direct pairwise van der Waals and electrostatic interactions between atoms. DOCK uses the AMBER force field [
51], which contains parameters for the nonbonded interactions but does not include a parameter for the hydrogen bonds. DOCK has been extensively used in enzyme design for different purposes, including the identification of candidate enzymes for the construction of a synthetic pathway for Acetyl-CoA production [
8] and deciphering herbicide-resistance mutations on Acetohydroxyacid synthase (AHAS) [
9]. In this last example, DOCK 6.7 was used to dock herbicides into AHAS structures and then several hybrid methods were tested (e.g., MM, MM-PBSA, QM/MM-GBSA) for predicting the resistance-leading mutations. The MM-PBSA [
73] method was notably accurate in predicting the mutations that experimentally alter herbicide resistance.
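As a rough illustration of what "summing the van der Waals and electrostatic pairwise interactions" means, the sketch below evaluates a Lennard-Jones 12-6 term plus a Coulomb term over all protein–ligand atom pairs. It is not the implementation used by DOCK, GOLD or any force field named above: the `Atom` fields, the Lorentz-Berthelot combination rules and the constant dielectric are assumptions chosen for simplicity.

```python
import math
from typing import Iterable, NamedTuple


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    charge: float   # partial charge in units of e
    epsilon: float  # LJ well depth, kcal/mol
    sigma: float    # LJ diameter, Angstrom


COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2)


def pair_energy(a: Atom, b: Atom, dielectric: float = 1.0) -> float:
    """Lennard-Jones 12-6 plus Coulomb energy for one atom pair."""
    r = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
    # Lorentz-Berthelot combination rules (an assumption of this sketch).
    epsilon = math.sqrt(a.epsilon * b.epsilon)
    sigma = 0.5 * (a.sigma + b.sigma)
    ratio6 = (sigma / r) ** 6
    lennard_jones = 4.0 * epsilon * (ratio6 ** 2 - ratio6)
    coulomb = COULOMB_CONSTANT * a.charge * b.charge / (dielectric * r)
    return lennard_jones + coulomb


def interaction_energy(protein: Iterable[Atom], ligand: Iterable[Atom]) -> float:
    """Sum pair energies over every protein-ligand atom pair."""
    ligand_atoms = list(ligand)
    return sum(pair_energy(p, l) for p in protein for l in ligand_atoms)
```

Real docking programs precompute such terms on grids and apply distance cutoffs, which this sketch omits.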
GOLD Docking Software [
11] is included within CSD-Discovery software as a third-party application. GOLD is based on the Tripos force field [
52], which lacks hydrogen-bonding terms; where necessary, this can be compensated for with a hydrogen-bond term taken from SYBYL 8.1 software (see empirical scoring functions).
ICM [
36,
74] Docking (MolSoft) uses an ECEPP/3 force field [
53] with the addition of solvation-free energy and entropic contribution terms as the scoring functions. The software uses Monte Carlo-derived movements and minimization of interaction potentials for ligand pose generation. ICM has been widely used for the virtual screening of protein structures including the MOR42-3 receptor [
37], LSD1 inhibitors [
75] and the world’s largest virtual screening assay, in which 10 million chemical compounds were screened in 11 h, which led to the identification of three lead compounds [
76].
2.1.2. Empirical Scoring Functions
These are usually developed by linear regression using a training dataset to reduce the function to a linear equation accounting for only a few physicochemical descriptors. Discovery Studio [
17] is a comprehensive suite of validated science applications built on BIOVIA Pipeline Pilot. It presents two sibling scoring functions named LigScore1 and LigScore2, both are linear equations based on three descriptors: van der Waals interaction, a polar attraction term and a desolvation penalty.
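The idea of deriving an empirical scoring function by linear regression can be sketched as follows. This is an illustrative least-squares fit in plain Python, not the procedure used to derive LigScore or any other published function: the three descriptor columns merely echo the kind of terms mentioned above, and the training values are synthetic.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def fit_linear_score(
    descriptors: Sequence[Sequence[float]], affinities: Sequence[float]
) -> List[float]:
    """Least-squares weights [intercept, w1, w2, ...] via the normal equations."""
    rows = [[1.0, *d] for d in descriptors]
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(rows, affinities)) for i in range(m)]
    return solve_linear_system(xtx, xty)


def score(weights: Sequence[float], descriptor: Sequence[float]) -> float:
    """Evaluate the fitted linear scoring function for one complex."""
    return weights[0] + sum(w * d for w, d in zip(weights[1:], descriptor))


# Synthetic training set: (vdW term, polar term, desolvation penalty).
train_x = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (2, 1, 3), (0, 2, 1)]
true_w = [1.0, -0.5, 0.3, 0.8]  # hypothetical "ground truth" weights
train_y = [score(true_w, d) for d in train_x]
fitted = fit_linear_score(train_x, train_y)
```

With real data the fit would not be exact, and the quality of the function depends heavily on the size and diversity of the training set.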
ChemScore and ChemPLP are empirical scoring functions provided by GOLD that are derived by regression against experimentally determined binding affinities of protein–ligand complexes [
54]. GOLD has been successfully used for inverse virtual screening [
77], lead optimization [
16] and identifying the correct binding mode of molecules [
78].
FlexX [
25] is historically one of the most widely used docking programs. It provides an automatic method for docking organic ligands using an empirical scoring function derived from Böhm’s work [
79]. The scoring function accounts for H-bonds, ionic interactions, aromatic group interactions, lipophilic contacts and the number of rotatable bonds, all of them being adjustable parameters except for the number of rotatable bonds. The FlexX scoring function also includes a scaling function that penalizes deviations from the ideal geometry. FlexX4 is very fast in virtual screening assays [
80,
81].
Other less widespread virtual docking tools that implement empirical scoring functions are Surflex and MolDock (Molegro Virtual Docker). Surflex [
55] is a fully automatic flexible molecular docking algorithm that provides a scoring function able to effectively model protein–ligand noncovalent interactions. The scoring function is the sum of hydrophobic and polar complementarity (dominant terms), entropic and solvation terms. Molegro Virtual Docker [
56] (MVD) scoring function is based on piecewise linear potential (PLP) extended with an H-bond directionality term. MVD was tested using the GOLD dataset [
82] for docking accuracy. The results showed better performance for MVD (87% accuracy) [
56] than GLIDE, GOLD, FlexX and Surflex. MVD presents an easy-to-use interface.
2.1.3. Semi-Empirical Scoring Functions
The most popular and versatile scoring functions are those that combine both physical terms (force fields) and experimentally derived parameters (empirical scores). These are known as semi-empirical scoring functions.
GLIDE [
20] (Schrödinger) uses GlideScore 2.5, a semi-empirical scoring function that combines empirical and force field-based terms. It presents two distinct scores derived from the ChemScore function: Standard-Precision (SP) and Extra-Precision (XP) Glide. SP can be used in virtual screening assays to minimize false negatives, as it is more tolerant than the XP version. Conversely, the XP score is a stricter function that severely penalizes poses violating physical chemistry principles, thus minimizing false positives.
GOLD also implements force field-based scoring functions complemented with empirical terms (termed semi-empirical scoring functions). The GoldScore scoring function presents an H-bond term, a pairwise dispersion potential describing contribution to the hydrophobic energy of binding and an MM term accounting for the ligand’s internal energy.
AutoDock4 [
29] is the latest version of the original AutoDock [
83] software, the crown jewel among docking software with more than 10,000 citations to date. The wide applicability of AutoDock relies on the continuous upgrade and addition of new functionalities in each released version. The AutoDock4 scoring function [
58] uses a semi-empirical force field including an improved thermodynamic model of the binding process, an empirical method to estimate surrounding water contribution and a full desolvation model. The presence of the thermodynamic model allows the incorporation of moderate protein flexibility and the use of the scoring function for protein–protein docking. AutoDock4 provides high-quality predictions of ligand conformations and good correlations between predicted inhibition constants and experimental data [
84]. Besides AutoDock4 and AutoDockTools [
29], AutoDock Vina [
30] is also available; it is about two orders of magnitude faster than AutoDock4, calculates grid maps automatically, and is better at finding accurate binding poses [
84]. Vina scoring function is very similar to the X-Score [
57] scoring function (Xtool v1.2 software) except for including intramolecular contributions apart from intermolecular. Both scoring functions are considered empirical but they were also calibrated using experimental affinity measurements from the PDBbind database. X-Score scoring is commonly a combination of three individual scoring functions named HPScore, HMScore and HSScore, and it is possible for the user to modify this combination. Nevertheless, X-score does not perform molecular docking by itself, so it is necessary to apply it combined with a molecular docking program as a generator of binding poses. X-score can be used to re-rank binding poses from other docking software.
AutoDock and GOLD use the Lamarckian genetic algorithm during the pose prediction procedure, which considerably increases the computational cost. GLIDE, on the other hand, uses anchor and growth strategies to reduce the time cost and make it affordable for virtual screening assays.
2.1.4. Knowledge-Based Scoring Functions
These are derived from statistical information on frequently observed intermolecular close contacts by using the Potential of Mean Force (PMF) [59] principle defined by the inverse Boltzmann relation [85].
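The inverse Boltzmann relation converts how often an atom-pair distance is observed, relative to a reference distribution, into an effective energy: A(r) = -kT ln[p_obs(r) / p_ref(r)]. The sketch below applies it to binned counts. It is a generic illustration rather than the PMF99 or PMF04 derivation; the choice of reference state and the handling of empty bins are assumptions of this example, and the counts are synthetic.

```python
import math
from typing import List, Optional, Sequence

KT_KCAL_PER_MOL = 0.593  # k_B * T at roughly 298 K


def inverse_boltzmann(
    observed_counts: Sequence[int],
    reference_counts: Sequence[int],
    kt: float = KT_KCAL_PER_MOL,
) -> List[Optional[float]]:
    """Per-bin potential A(r) = -kT * ln(p_obs / p_ref).

    Bins with no observed or reference counts are returned as None,
    because the logarithm is undefined there.
    """
    total_obs = sum(observed_counts)
    total_ref = sum(reference_counts)
    potentials: List[Optional[float]] = []
    for obs, ref in zip(observed_counts, reference_counts):
        if obs == 0 or ref == 0:
            potentials.append(None)
            continue
        ratio = (obs / total_obs) / (ref / total_ref)
        potentials.append(-kt * math.log(ratio))
    return potentials


# Synthetic counts per distance bin for one atom-pair type.
observed = [10, 40, 30, 20]
reference = [25, 25, 25, 25]
pmf = inverse_boltzmann(observed, reference)
```

Distances seen more often than in the reference receive a negative (favourable) potential, and rarer ones a positive potential.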
The PMF [
59] score is a measure of the binding free energy of a protein–ligand complex calculated as the sum of the atom pair interactions as a function of distance derived from the Brookhaven PDB database. A revisited version of PMF (PMF04 [
60]) was derived from 7152 protein–ligand complexes, a 10-fold increase compared to the original PMF score (PMF99) derived from 697 complexes.
The DrugScorePPI web server [
62] performs alanine scanning for protein–protein interactions. The scoring function includes distance-dependent pair potentials derived from 851 complex structures and 309 experimental results from alanine scanning.
Other less widespread virtual docking tools that implement knowledge-based scoring functions are FRED, HYBRID and POSIT, implemented in OEDocking (OpenEye) [
40]. FRED is a docking software that uses multiple knowledge-based scoring functions at different stages during the exhaustive search, including Shapegauss, PLP, Chemgauss, Chemscore, Screenscore, Chemical Gaussian Overlay (CGO) and Chemical Gaussian Tanimoto (CGT). However, the default scoring function used in FRED to rank poses is Chemgauss, which uses a Gaussian-smoothed potential [
61] for measuring ligand pose complementarity to the active site. One successful application of FRED is the discovery of BChE inhibitors in the nanomolar range [
41]. Similarly, HYBRID makes use of the same scoring functions as FRED except during the exhaustive search, where it uses the CGO ligand-based scoring function. Interestingly, HYBRID allows for the use of multiple conformations of the protein where the best structure is selected and used based on the docking database. Finally, POSIT [
86] is a ligand-guided docking method that uses existing information about bound ligands for improving pose prediction. Interestingly, POSIT can determine the best-suited protein structure when provided with multiple structures, a potential application for CPE.
2.1.5. Machine Learning-Based Scoring Functions
These scoring functions are based on a nonlinear fit between experimental measurements of binding affinity and the features describing the protein–ligand complex. They tend to perform better than the scoring systems described above, given that a linear fit may not capture the real relationship between the score and the binding affinity. In addition, the computational cost of using ML algorithms for protein design is much lower than that of traditional rational design methods. However, the accuracy of an ML scoring function strongly depends on the training dataset used. In this sense, a vast number of datasets, derived from experimental studies of mutant proteins, are available to benchmark binding affinity results. The PDBbind Database [
87] is the most widely used with binding affinity data for 19,588 complexes or compounds. Other examples of available databases are BindingDB [
88], PubChem [
89] and ChEMBL [
90], which are commonly used for developing scoring functions based on specific protein–ligand complex targets, whereas the DUD [
91], DUD-E [
92] and MUV [
93] databases were designed for virtual screening purposes. However, using ML-derived scoring functions for engineering catalytic properties is challenging given the extremely large diversity of reaction types, mechanisms, cofactors, experimental reaction conditions, substrate specificities and promiscuities [
94].
ML scoring functions for docking are based on different methods such as Random Forest, Support Vector Machine, Artificial Neural Networks and Deep Learning Neural Networks [
95]. Random forests, or random decision forests [
96] consist of a learning method that constructs a multitude of decision trees giving as an output a categorical/classification response (i.e., active or inactive) or an average continuous prediction (numerical value for regression scoring) of the individual trees. Importantly, Random Forest-based scoring functions have been seen to perform poorly for binding pose prediction and virtual screening [
97] but still outperform classical scoring functions [
98] (e.g., RF-Score-v3 [
63], RF-Score-VS [
64] and ΔvinaRF20 [65]). An accurate prediction of the binding mode is necessary to use these scoring functions. An example of application is the use of ΔvinaRF20 for deciphering the enzyme promiscuity of cytochrome c in the formation of the cytochrome c-cyclo[6]aramide binding complex, which exhibited higher activity than unmodified cytochrome c in the oxidation of benzhydrol to benzophenone [
99].
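The way a random forest turns many individual trees into a single output, as described above, can be shown with a toy example. The three "trees" below are hand-written decision stumps with hypothetical features and thresholds; they are not trained models and have no relation to RF-Score or ΔvinaRF20. Only the aggregation step (averaging for regression, majority vote for classification) is the point.

```python
from collections import Counter
from statistics import mean
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], float]


def forest_regression(trees: List[Tree], features: Sequence[float]) -> float:
    """Regression output: the average of the individual tree predictions."""
    return mean(tree(features) for tree in trees)


def forest_classification(
    trees: List[Tree], features: Sequence[float], threshold: float
) -> str:
    """Classification output: majority vote over per-tree active/inactive calls."""
    votes = Counter(
        "active" if tree(features) >= threshold else "inactive" for tree in trees
    )
    return votes.most_common(1)[0][0]


# Hypothetical stumps on features = (number of H-bonds, buried area in A^2);
# each returns a predicted affinity on a pKd-like scale.
toy_trees: List[Tree] = [
    lambda f: 7.0 if f[0] >= 2 else 5.0,
    lambda f: 6.5 if f[1] >= 300 else 4.5,
    lambda f: 8.0 if f[0] >= 3 and f[1] >= 350 else 5.5,
]
complex_features = (2, 320)
predicted_affinity = forest_regression(toy_trees, complex_features)
predicted_class = forest_classification(toy_trees, complex_features, threshold=6.0)
```

In a real forest each tree is grown on a bootstrap sample of the training complexes with a random subset of features, which is what decorrelates the individual predictions.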
Support Vector Machines were originally designed for classification and have been mainly employed for discriminating active and non-active ligand poses. A derived method named support vector regression can be used for regression analysis [
100]. These ML methods rely on supervised learning using nonlinear kernel functions, which can describe the covariance structure of the fitness landscape in a similar way to Gaussian processes [
101]. Examples of this type of scoring functions are SVMGen [
66] (classification of protein kinases), ID-Score [
67] (used for re-scoring and predicting the sensitivity spectrum of various serine hydrolases to OP pesticides [
102]) and PLEIC-SVM [
68] (virtual screening of protein kinases, proteases and GPCR outperforming GLIDE predictions).
Artificial Neural Networks simulate a model of brain function, with neurons organized in layers. DDFA [
69], BgN-Score/BsN-Score [
70] and NNScore 2.0 [
71] are examples of these types of scoring functions for docking. DDFA (docking data feature analysis) was developed for virtual screening and re-ranking purposes with minimal extra computing time. It consists of five types of features derived from Autodock, Vina and Rosetta Ligand showing a similar performance to other docking software such as ICM, Vina and Glide using DUD dataset [
71].
Overall, these ML approaches have been used in drug discovery, but we foresee the use of this type of scoring functions for the engineering of substrate specificity by coupling them with experimental library generation methods, binding affinity measurements and re-scoring the most promising binding poses. Moreover, these methods are recommended to be used for a more precise description of substrate true binding pose and the subsequent analysis of a more accurate pharmacophore for rational design of the binding pocket.
Su et al. performed a comparative assessment of scoring functions using a set of 285 protein–ligand complexes. This study revealed that ΔvinaRF20 (a Random Forest-tuned version of Vina) [
65] performs best at computing binding scores that correlate with experimental binding constants (scoring power). The work of Su et al. also showed that scoring functions such as those implemented in X-Score, ChemPLP (GOLD), ChemScore (SYBYL) and Discovery Studio perform very well in terms of scoring power, whereas London-dG (MOE), PMF (SYBYL) and PMF04 (Discovery Studio) perform worse. When assessing ranking power (the ability to correctly rank known ligands), ΔvinaRF20 again performs best, and ChemPLP (GOLD), DrugScore (CSD), LigScore (Discovery Studio) and X-Score also rank known ligands well.
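Scoring power and ranking power are commonly quantified with a linear (Pearson) and a rank (Spearman) correlation coefficient, respectively. The sketch below computes both in plain Python. It is a simplified illustration: the exact protocol of the benchmark discussed above (for example, how complexes are grouped by target before ranking) is not reproduced here, and the function names are our own.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: a common measure of scoring power."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks (ranking power)."""
    return pearson(ranks(x), ranks(y))
```

A scoring function can rank ligands perfectly (Spearman of 1) while still correlating imperfectly with the measured affinities, which is why the two metrics are reported separately.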
2.2. Examples of Substrate Specificity Engineering Using Virtual Docking
One of the main issues when it comes to in silico enhancing substrate specificity is the generation of multiple mutant protein structures on which to measure binding affinity of the targeted ligand and compare it to WT. When the enzymatic reaction is well known and established, there are two main approaches for computationally measuring mutational effects on protein–ligand affinity/selectivity. The first approach is to rationally propose mutations by visually inspecting protein–ligand complex [
23,
38]. Usually this is performed using crystallographic structures of WT proteins or mutants (if available) or with homology models. By docking target substrate and refining binding poses with a suitable scoring function, a pharmacophore can be established, and then rational mutations are proposed (rational design). The second approach involves the generation of a smart or exhaustive mutant protein library to measure the affinity and/or selectivity for a given substrate with a precise orientation.
2.2.1. Engineering of a Lipase for Omega-3 Fatty Acid Selectivity [32]
Two approaches, rational and semi-rational design, were tested for improving the fatty acid selectivity (using two substrates, EPA and DHA) of a Geobacillus thermovalorans lipase (GTL). Rational design involved identifying, based on the literature, the binding-responsible residues in one of the four pockets (the acyl-binding site). V171 and L183 were replaced by bulkier amino acids, resulting in the double mutant V171L/L183F, named DM-GTL. A two-step procedure was used, consisting of (i) calculating the best binding mode presenting a reactive orientation and (ii) calculating the binding energy with no further flexibility by activating the AutoDock Vina “score-only” option. Semi-rational design consisted of identifying the residues interacting with both substrates by virtual docking using AutoDock Vina with the flexible side chain method. The 10 best binding modes for each substrate were analyzed. Residues most frequently observed to interact with the substrates (in at least 15 out of 20 modes) were considered important for substrate selectivity. Amino acids playing an important role in the activity of the enzyme were excluded and the remaining positions were selected for site saturation mutagenesis (SSM). Six positions were screened: 170, 171, 244, 319, 358 and 359. From these six positions, 960 individual mutants were generated, of which 210 colonies showed lipolytic activity and, of these, 28 showed improved fatty acid hydrolysis. After sequencing, 13 different mutations were revealed at positions 170, 171 and 359.
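The two selection steps of the semi-rational design above (keeping residues that contact the substrate in most docking poses, then enumerating a saturation library at the chosen positions) can be sketched as follows. The helper names and the toy contact lists are hypothetical; only V171 is taken from the text, and how contacts are extracted from docking poses is left open.

```python
from collections import Counter
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequent_contacts(
    pose_contacts: Sequence[Sequence[str]], min_count: int
) -> List[str]:
    """Residues that contact the substrate in at least `min_count` poses."""
    counts = Counter(
        residue for contacts in pose_contacts for residue in set(contacts)
    )
    return sorted(res for res, n in counts.items() if n >= min_count)


def saturation_library(wild_type_residues: Dict[int, str]) -> List[str]:
    """All 19 single substitutions at each position, e.g. 'V171A'."""
    return [
        f"{wt}{position}{aa}"
        for position, wt in sorted(wild_type_residues.items())
        for aa in AMINO_ACIDS
        if aa != wt
    ]


# Toy contact lists for three poses (residue labels other than V171 are made up).
poses = [["V171", "X200"], ["V171"], ["V171", "X300"]]
selected = frequent_contacts(poses, min_count=2)
library = saturation_library({171: "V"})
```

With the threshold from the study, `min_count` would be 15 over the 20 analyzed modes; six positions would give 6 × 19 = 114 distinct single mutants at the protein level.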
2.2.2. Rational Re-Design of Candida Antarctica Lipase B (CALB) Towards Diels–Alder Activity [13]
Lipases are versatile biocatalysts commonly used to catalyze a myriad of organic reactions. Candida antarctica lipase B (CALB) was engineered for a Diels–Alder reaction. Six enzyme variants of CALB were considered: WT, S105A, I189A, S105A/I189A, I285A and S105A/I285A. These variants were selected based on previous studies and insights from visual inspection. The CALB PDB structure (1LBT) was used for building homology models with SwissPDB. The GOLD suite (genetic algorithm) was used to generate different ligand poses on each variant, which were evaluated with the ChemScore scoring function (with H-bond restrictions between the dienophile carbonyl oxygen and T40-NH, T40-OγH and Q106-NH). In total, 50 different poses were generated for each molecule and scored using a composite scoring function. After numerical and visual analysis, the best poses were saved for docking of the diene. Enzyme variants were MD-relaxed and docked in the same way as the non-relaxed structures. Near attack conformation (NAC) analysis and DFT calculations were used to elucidate activation energy barriers. Results showed that the S105A/I189A variant gives up to 5% ‘loose’ NAC geometries in acetonitrile for one given simulation, and even more in water (8.5%). The redesigned enzyme showed effective Diels–Alder activity, in accordance with these computational results.
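A NAC percentage such as the one quoted above is, in essence, the fraction of simulation frames in which the reacting atoms satisfy a geometric criterion. The sketch below uses a distance-only criterion on the atom pairs that would form the new bonds. The cutoff, the toy coordinates and the function names are assumptions for illustration; the criteria actually used in the cited study (which may also include angles) are not given in the text.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
AtomPair = Tuple[Point, Point]


def is_nac(frame: Sequence[AtomPair], max_distance: float) -> bool:
    """A frame counts as a NAC if every reacting atom pair is within the cutoff."""
    return all(math.dist(a, b) <= max_distance for a, b in frame)


def nac_fraction(frames: Sequence[Sequence[AtomPair]], max_distance: float) -> float:
    """Fraction of frames that satisfy the distance-only NAC criterion."""
    if not frames:
        raise ValueError("at least one frame is required")
    return sum(is_nac(frame, max_distance) for frame in frames) / len(frames)


origin: Point = (0.0, 0.0, 0.0)
# Four toy frames, each with the two forming-bond atom pairs of a Diels-Alder
# reaction reduced to a single inter-atomic distance along x (Angstrom).
toy_frames = [
    [(origin, (3.0, 0.0, 0.0)), (origin, (3.2, 0.0, 0.0))],
    [(origin, (3.0, 0.0, 0.0)), (origin, (4.0, 0.0, 0.0))],
    [(origin, (5.0, 0.0, 0.0)), (origin, (5.0, 0.0, 0.0))],
    [(origin, (3.4, 0.0, 0.0)), (origin, (3.3, 0.0, 0.0))],
]
fraction = nac_fraction(toy_frames, max_distance=3.5)  # hypothetical cutoff
```

Comparing this fraction across variants and solvents gives a cheap geometric proxy for reactivity before more expensive DFT calculations are run.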
2.2.3. Identification and Engineering of the Key Residues at the Crevice-like Binding Site of Lipases Responsible for Activity and Substrate Specificity [34]
The process of molecular docking was performed by AutoDock 4.0 to explore the binding space of the enzyme–substrate complex. Flexible docking was carried out to evaluate ligand binding energies over the conformations search space using the Lamarckian genetic algorithm. The synergistic effects between Phe207 and Phe259 led to higher activity of the double mutant P207F/L259F than that of the single mutants. Amino acid residues located at the crevice-like binding sites of four representative lipases were rationally engineered and the obtained double mutants exhibited significantly improved activity towards p-nitrophenyl esters.
2.2.4. Constructing a Synthetic Pathway for Acetyl-Coenzyme A from One-Carbon Through Enzyme Design [8] (Exhaustive Library Search)
A synthetic Acetyl-CoA (SACA) pathway was constructed by repurposing a glycolaldehyde synthase and an acetyl-phosphate synthase. First, a theozyme was constructed including thiamine diphosphate (ThDP), glycolaldehyde and glutamic acid (acid/base). A PDB search using this theozyme resulted in the identification of 37 non-redundant protein structures including the ThDP ligand. Then, the distance between C2 and glycolaldehyde was computed using DOCK 6 software, as this distance plays a critical role in the catalytic reaction. The docking procedure revealed six enzyme candidates presenting short distances with clear function annotations. Experimentally, three of the six candidates exhibited the desired reaction activity and were selected for a directed evolution study. The engineered glycolaldehyde synthase exhibited more than 70-fold increased catalytic activity.
2.2.5. Creating Space for Large Acceptors: Rational Biocatalyst Design for Resveratrol Glycosylation in an Aqueous System [48]
Polyphenols display several interesting properties but their low solubility limits practical applications. Sucrose phosphorylase (SP) can produce α-glucosides through a transglycosylation reaction with sucrose as donor substrate. Glycosylation of resveratrol can dramatically improve its solubility and bioavailability. However, resveratrol binding to SP is hindered by an active-site loop, according to docking and modeling studies. Indeed, the unbolted loop variant R134A showed useful affinity for resveratrol (Km = 185 mM) and could be used for the quantitative production of resveratrol 3-α-glucoside in an aqueous system. In silico mutagenesis and docking studies indicated that substitution of R134 with smaller residues (e.g., alanine) would leave an opening in the enzyme’s closed conformation, enabling the second ring of resveratrol to be accommodated.
2.3. Machine-Learning Tools for In Silico Enzyme Engineering
Shen et al. thoroughly reviewed the topic of Machine-Learning (ML) developments for protein–ligand docking [
100]. Moreover, Mazurenko et al. also reviewed ML methods and databases for enzyme engineering [
94]. In this section, we give a comprehensive overview of the available tools and databases as well as remarkable examples of ML use for biocatalyst design.
Similarly to QSAR models for lead optimization, ProSAR (Protein Sequence Activity Relationship) models can be used to infer the contributions of mutational effects on protein function coupled with efficient and minimal mutational experimental data [
44]. Fox [
42] developed a partial least-squares (PLS) regression ML methodology using a genetic algorithm (GA) for the directed evolution of proteins, which has been a major inspiration for other ProSAR models [
44,
45]. Interestingly, these models are sequence-based and do not need a three-dimensional structure, on the assumption that phenotypic information is encoded in the protein sequence. This method was further developed with halohydrin dehalogenase for improving the volumetric productivity of ethyl (R)-4-cyano-3-hydroxybutyrate (HN) [
43] up to 4000-fold with respect to WT. In this case, after 18 ProSAR-driven iterative cycles of directed evolution and subsequent HTS activity assay, a 99.9% HN R-enantiomer was obtained with 99.5% purity. It was demonstrated that the ProSAR approach can be useful for individual or multi-objective engineering of biocatalyst properties such as enantioselectivity, activity, thermostability and others. It is worth noting that only additive effects were considered when formulating the equation correlating mutations with enzyme function, so only linear terms were computed. If necessary, nonlinear terms can be added to account for synergistic interactions between mutations. Following ProSAR models aiding enzyme selectivity engineering, Berland et al. developed a web tool for the rational screening of mutant libraries using ProSAR [
44] based on the previously established Fox strategy for ProSAR model building. This method was successfully tested for the engineering of (i) dextransucrase synthetic specificity towards α(1 → 3) or α(1 → 6) linkages in polysaccharide products and (ii) cytochrome P450 thermostability. In both cases, the model was demonstrated to be reliable enough to enable the prediction of new sequences: R² = 0.60 for the dextransucrase and R² = 0.94 for the cytochrome P450. Later, Berland and co-workers used this same strategy for the engineering of a transglucosylase for the production of kojibiose with controlled selectivity [
45]. The semi-rational mutagenesis strategy resulted in a double mutant (L341I/Q345S) with 95% selectivity for kojibiose production and final purity of >99.5%.
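The additive ProSAR formulation mentioned in this section, with its optional nonlinear extension, can be written compactly. The notation below is a generic sketch rather than the exact equation of the cited works: y is the measured property of a variant, x_i is 1 if mutation i is present and 0 otherwise, and the coefficients c are the quantities fitted by regression (for example, PLS). All symbols are assumptions introduced for illustration.

```latex
% Additive (linear) model: each mutation contributes independently.
y = c_0 + \sum_{i=1}^{N} c_i \, x_i

% Optional extension with pairwise terms for synergistic mutations.
y = c_0 + \sum_{i=1}^{N} c_i \, x_i + \sum_{i<j} c_{ij} \, x_i \, x_j
```

Mutations with large positive c_i are carried into the next round, while the pairwise coefficients c_{ij} are only identifiable if the library contains enough variants carrying both mutations.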
Another example of ML methods successfully applied to specificity engineering is the GT-predict [
46] tool for the identification of Glycosyl Transferase Superfamily 1 (GT1) potential novel substrates and functional annotation of uncharacterized GT1 members. The method is based on a decision tree approach trained on a varied combination of physicochemical properties and structural parameters. Its use in conjunction with structural approaches allowed for the identification of possibly important structural motifs and their roles within active sites. However, this method required a small but broad dataset of GT1’s activity performance on different substrates.
In order to reduce the necessity of a large dataset, Duan and Sun [
103] developed an ML workflow to generate mutant libraries with a high enrichment ratio for the recognition of specific substrates using M. jannaschii tyrosyl-tRNA synthetase (TyrRS). In this case, by using Rosetta modeling in combination with a target-specific scoring function and ML (lightGBM) model calibration, the library enrichment ratio was increased 11-fold compared with random mutation. By using the Rosetta EnzymeDesign method (de novo) to model the backbone changes and amino acid side chain packing upon reported mutations, they were able to predict the binding specificity of unnatural amino acids for every TyrRS mutant pair complex. The results showed that D158G/P mutants strongly disrupt the backbone of the α-helix at residues 158–163, opening the pocket to accommodate bulky unnatural amino acids.
3. Optimization of the Catalytic Efficiency of Enzymes
Simulating and numerically predicting enzymatic reactivity is a complex multi-objective challenge because it depends on different properties such as substrate binding selectivity [
104], electrostatic environment (redox potential and electron transfer [
104,
105]), pocket hydrophobicity, and even substrate surface diffusion [
106]. QSAR/QSPR (Quantitative structure−activity/property relationships) is one of the most common prediction models used for computational catalyst design. These models try to correlate hundreds of descriptors of the catalytic reaction with target properties to modulate, such as reactivity and selectivity [
21,
107]. Mainly, these models are built using regression analysis or ML methods for the description of a chemical space region accommodating the reaction. However, building QSAR/QSPR models for specific reactions requires high expertise and thus, its use is out of the scope of this review. Moreover, the use of QSAR models is not especially suitable for the automated design of catalysts. Several examples of QSAR have been described [
108].
Computational reactivity modeling and prediction are extremely challenging given the high complexity of the electronic structure of the catalyst, as well as the conformational and configurational landscapes of the reaction’s transition state. Several protocols have been designed based on different simulation approaches, such as MD, QM and hybrid MM/QM [
109]. Several scoring functions exist for predicting enzymatic reactivity, ranging from protein–ligand geometries that favor the reaction to electron transfer probability and electron density prediction. Geometrical approaches rely on predicting the enzyme–substrate complex structure and measuring angles or distances between the designated reactant atoms to check whether the catalytic requirements are fulfilled. The Empirical Valence Bond (EVB) theory calculates reaction free energies in the condensed phase. It uses potential surfaces for calculating the probability of electron transfer, using a calibrated Hamiltonian (operator corresponding to the total energy of the system in QM). The use of a Hamiltonian allows the approximation of the potential energy surface of a given reaction. To simplify, the catalytic reaction is modeled using two states corresponding to the reactants and products. Thus, EVB requires the reaction mechanism to be well characterized. Density-Functional Theory (DFT) is a QM modeling method capable of calculating electronic structure by means of functionals of the spatially dependent electron density. Although DFT calculations are sometimes unaffordable, given the necessary timescale of the simulations, their use in enzyme engineering has risen in recent years [
110]. The combination of DFT + MD is a promising strategy to study structure and reactions [
111]. The benchmark case of citrate synthase [
110] illustrates the applicability of DFT for engineering enzyme reactivity. However, the use of QM methods for engineering enzyme reactivity still requires great handling expertise and careful system model building [
112]. In this section, we give an overview of protocols and methods successfully developed and applied to engineer enzyme catalytic efficiency in a comprehensive way. A list of the common computational tools and selected applications is collected in
Table 3.
3.1. Computational Methods to Engineer the Catalytic Efficiency of Enzymes
The CASCO [
113,
115] protocol (CAtalytic Selectivity by COmputational design) uses high-throughput multiple independent MD (HTMI-MD) simulations to engineer the enantioselective transformation of cyclopentene oxide by limonene epoxide hydrolase [
115], making it possible to replace experimental assays. This approach involves the design of a mutant enzyme with RosettaDesign [
127] for the identification of low-energy structures. The scoring function approximating the enzyme’s reactivity measures the fraction of MD simulation time during which the complex presents transition state-like structures (pro-RR or pro-SS). Mutant structures are evaluated in terms of Near Attack Conformations (NAC), which satisfy geometry-based restraints, such as the angle of nucleophilic attack and the distance between reactant atoms. This protocol also allows approximating protein–ligand binding affinity by measuring the ratio of NAC frequencies for each enantiomer. The use of HTMI-MD (ultra-short simulations) allows for increasing the protein conformational search space by screening thousands of RosettaDesign mutants while reducing computational cost, as demonstrated with epoxide hydrolase [
115].
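The NAC-based scoring idea can be illustrated with a minimal Python sketch. It assumes that each MD snapshot has already been reduced to the coordinates of three atoms (nucleophile, electrophilic carbon and leaving atom); the function names and the distance and angle cutoffs are illustrative assumptions, not the values used in CASCO.

```python
import math

def distance(a, b):
    """Euclidean distance between two 3D points (in Å)."""
    return math.dist(a, b)

def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def nac_fraction(frames, max_dist=3.5, min_angle=150.0):
    """Fraction of snapshots that satisfy both geometric NAC criteria.

    frames: list of (nucleophile, electrophile, leaving_atom) coordinates.
    max_dist and min_angle are hypothetical cutoffs chosen for illustration.
    """
    hits = 0
    for nuc, elec, leaving in frames:
        close_enough = distance(nuc, elec) <= max_dist
        well_aligned = angle_deg(nuc, elec, leaving) >= min_angle
        if close_enough and well_aligned:
            hits += 1
    return hits / len(frames) if frames else 0.0
```

Under these assumptions, computing `nac_fraction` separately for the pro-RR and pro-SS restraint definitions and comparing the two values gives the kind of NAC frequency ratio described above.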
Houk et al. [
119] used Density-Functional Theory (DFT) calculations and subsequent MD simulations to study the substrate binding mechanism of P450 monooxygenase. DFT calculations approximate electron density by means of QM (theozyme), which allows for the prediction of enzymatic site selectivity. The MD simulations (0.5 µs) were compared to the ideal geometry (H-O distance and O-H-C angles) of the stabilized TS via DFT calculations in order to propose rational mutations. Previously, Houk et al. had already used the DFT calculation for the design and optimization of a new dirhodium catalyst with high enantioselectivity [
120] for the most accessible primary C-H bond by using ONIOM calculations. In this case, the “inside-out” protocol was used, which had already been applied to the so-called spiroligozymes. In this example, the protocol consisted of the de novo design of a transesterification catalyst and subsequent mutations improving its catalytic performance. Following QM calculations for computational reactivity, Cerqueira et al. reviewed these types of approaches and proposed a protocol based on catalytic geometry optimization [
109]. The strategy includes locating the TS of the enzyme by generating intermediate structures of the catalytic pathway, which can be obtained by restraining one or more internal coordinates. Then, the potential energy surface is calculated for this ensemble of structures which allows for the determination of the TS.
Sherman’s group [
21] combined MD simulations, docking and MM-GBSA scoring to approximate the catalytic reactivity of mutant enzymes. An MD simulation was used to generate an ensemble of bound configurations, which were scored by means of Induced Fit Docking (IFD) using GLIDE (GlideScore). IFD accounts for protein flexibility during docking. In this case, the protocol was applied for the optimization of an ω-aminotransferase, identifying mutations increasing reactivity up to 20–60-fold for an imagabalin precursor with respect to WT. The protocol allows the binary predictive classification of mutant enzymes as active or inactive. Moreover, they developed a tuned IFD protocol including multiple iterations to filter poses based on a distance cutoff between the reactive PMP amine and the substrate ketone group, thereby accounting for reactive poses.
Maranas et al. developed IPRO and OptZyme [
122] (derived from the IPRO suite of programs), a computational procedure for the redesign of
E. coli β-glucuronidase (GUS) towards the use of novel substrate pNP-Gal. The protocol allows enzyme redesign in those cases where the TS structure of the reaction is unknown. In this case, it makes use of QM calculations to approximate a TS analogue for the identification of the rate-limiting step of the reaction. The idea behind this approach is to design mutations that lower the TS analogue energetic barrier. Results validated the correlation of the Interaction Energy upon a substrate (IE_S) with KM (R² = 0.960) and the IE_TSA with kcat/KM (R² = 0.864). Moreover, this procedure is particularly useful for systems where solute entropy is negligible. IEs were calculated using IPRO.
The IPRO [
124,
125,
128] suite of programs has been extensively used for different enzyme redesign purposes. It incorporates OptZyme (improvement of catalytic properties), OptGraft (design of the novel binding site) and OptCDR (antibody novel-complementarity design). The core functionality of IPRO is to randomly perturb the protein’s backbone around mutated residues for the identification of a new design with lower binding energy than the WT enzyme based on Interaction Energy calculations. IPRO allows for an iterative search of the mutations enhancing enzymatic activity/specificity. (Currently, IPRO only supports the use of the CHARMM force field). IPRO requires users to provide extensive information on how to run the experiment.
PELE [
116] combines a Monte Carlo stochastic algorithm using a localized steered perturbation with side-chain prediction and energy minimization based on Metropolis acceptance/rejection criteria. The acceptance criteria ensure that perturbation does not lead further along the coordinates of a given reaction and/or large interaction potential energy increase, resulting in a series of local minima with a high structural correlation. This approach enables a large sampling of configurational space and thus permits efficient CPE towards target-property. The scoring function used for ranking is an OPLS-AA force field in which only the ligand and the backbone of the protein are considered. Desolvation effects are not considered in this case, which may be necessary for some CPE campaigns. PELE structure prediction capability reproduces long time scale processes efficiently reducing computational time–cost. In this way, PELE enables obtaining an atomic detailed mechanism of the protein–ligand-induced fit of its recognition process and of the ligand migration. PELE could also have been introduced in the previous section as a computational tool to optimize enzyme–substrate interactions. We present this method here given the possibility to perform single-point mutations mixed QM/MM calculations to update the charges of complex ligands or to obtain quick estimates of a biochemical reaction [
111], which makes it useful for CPE. PELE was benchmarked by studying aspirin binding to phospholipase A2 and ligand refinement in nuclear hormone receptors [
117]. Example cases of CPE for reactivity enhancement are presented in the following section.
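The Metropolis acceptance/rejection step mentioned above can be sketched generically in Python. This is a textbook Metropolis Monte Carlo loop and not PELE's implementation; the `energy` and `propose` callables, the temperature and the energy units (kcal/mol) are assumptions made for illustration.

```python
import math
import random

# Boltzmann constant in molar units, kcal/(mol*K)
K_B = 0.0019872041

def metropolis_accept(e_old, e_new, temperature=300.0, rng=random):
    """Return True if a move from e_old to e_new (kcal/mol) is accepted."""
    delta = e_new - e_old
    if delta <= 0.0:
        return True  # downhill moves are always accepted
    # Uphill moves are accepted with probability exp(-dE / kT).
    return rng.random() < math.exp(-delta / (K_B * temperature))

def monte_carlo(energy, propose, x0, steps, temperature=300.0, seed=0):
    """Generic Metropolis loop: perturb, evaluate, accept or reject.

    energy(x) returns the energy of state x; propose(x, rng) returns a
    perturbed candidate state. Both are hypothetical user-supplied callables.
    """
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    trajectory = [(x, e)]
    for _ in range(steps):
        candidate = propose(x, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e, e_candidate, temperature, rng):
            x, e = candidate, e_candidate
            trajectory.append((x, e))
    return trajectory
```

In a PELE-like setting, `propose` would stand for the perturbation, side-chain prediction and minimization steps, and `energy` for the force-field score; here both are left abstract.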
FuncLib [
118] extends the PROSS protocol (see later) by designing stable networks of interacting residues within the active-site pocket of an enzyme aimed at increasing both protein stability and catalytic efficiency. Unlike other methods, FuncLib neither targets specific substrates nor relies on models of enzymatic transition states. Rather, it exhaustively enumerates combinations of three to six mutations, and models each mutant using Rosetta. Designs are ranked by all-atom energy, prioritizing those that encode diverse stereochemical complementarities for alternative substrates, which do not need to be predefined. The method’s output is a repertoire of stable, highly efficient enzymes amenable to low-throughput experimental screening for desired activities, offering a practical solution for enzyme engineering and functional diversification. The generality of the method was demonstrated by broadening the substrate selectivity of Acyl-CoA synthetases [
118] towards larger aliphatic acids.
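The exhaustive enumeration step can be pictured with a short Python sketch. It only illustrates the combinatorics of building multipoint mutants from a per-position list of allowed substitutions; the data layout, the function names and the `energy_fn` callable (a stand-in for a Rosetta all-atom energy) are hypothetical and do not reproduce FuncLib itself.

```python
from itertools import combinations, product

def enumerate_multipoint_mutants(allowed, min_mut=3, max_mut=6):
    """Yield every mutant carrying between min_mut and max_mut substitutions.

    allowed: dict mapping position -> iterable of allowed amino acids
             (wild-type residue excluded). Each mutant is a tuple of
             (position, amino_acid) pairs.
    """
    positions = sorted(allowed)
    for n in range(min_mut, max_mut + 1):
        for subset in combinations(positions, n):
            # Cartesian product of the allowed residues at the chosen positions.
            for residues in product(*(allowed[p] for p in subset)):
                yield tuple(zip(subset, residues))

def rank_designs(mutants, energy_fn, top=10):
    """Return the `top` lowest-energy designs according to energy_fn."""
    return sorted(mutants, key=energy_fn)[:top]
```

Because the number of combinations grows quickly with the number of positions and allowed residues, restricting `allowed` beforehand is what keeps such an enumeration tractable.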
CADEE [
123] (Computer-Aided Directed Evolution of Enzymes) is a computational framework used for the screening of thousands of enzyme variants based on the EVB approach. EVB can be used for large screening assays as it is fast and efficient, allowing us to obtain free energy calculations describing chemical reactivity in a physically meaningful way. CADEE requires a well-characterized system, and thus a good-quality EVB force field, to obtain reliable results. CADEE can introduce mutations via alanine scanning. The CADEE framework was validated by comparing experimental results of Triosephosphate isomerase (
S. cerevisiae) kcat to calculated values of free energy, showing a correlation with activation free energies [
123].
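To compare an experimental kcat with a calculated activation free energy, the two quantities can be related through the Eyring equation of transition state theory, k = (kB·T/h)·exp(−ΔG‡/RT). The Python sketch below assumes a transmission coefficient of 1 and a temperature of 298.15 K; it is a unit-conversion aid and not part of CADEE.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J*s
R = 8.314462618      # gas constant, J/(mol*K)
J_PER_KCAL = 4184.0

def activation_free_energy(kcat, temperature=298.15):
    """Activation free energy (kcal/mol) from kcat (s^-1) via the Eyring equation.

    Assumes a transmission coefficient of 1.
    """
    dg_joule = -R * temperature * math.log(kcat * H / (K_B * temperature))
    return dg_joule / J_PER_KCAL

def rate_from_barrier(dg_kcal, temperature=298.15):
    """Rate constant (s^-1) from an activation free energy in kcal/mol."""
    prefactor = K_B * temperature / H
    return prefactor * math.exp(-dg_kcal * J_PER_KCAL / (R * temperature))
```

At this temperature, a 10-fold change in kcat corresponds to a barrier change of roughly 1.4 kcal/mol, which gives a sense of the accuracy that calculated activation free energies need in order to rank variants.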
3.2. Examples of Enzymatic Reactivity Engineering Using Computational Methods
3.2.1. Computational Design of Enantio-Complementary Epoxide Hydrolases for Asymmetric Synthesis of Aliphatic and Aromatic Diols (CASCO) [115]
The substrate was docked in the active site of limonene epoxide hydrolase and placed in a reactive configuration (NAC) using orientation/distance restraints. Next, the Rosetta Monte Carlo search algorithm was used to optimize side chain geometries of amino acids surrounding the active site for either pro-RR or pro-SS attack of the nucleophilic water on the epoxide carbon. A large number of parallel MD simulations with independently assigned initial atom velocities (HTMI-MD) were performed. The reactivity and selectivity of each mutant were predicted by scoring the fraction of snapshots in which the enzyme–substrate complex is in a pro-RR or pro-SS near-attack conformation (NAC).
3.2.2. Insights into Laccase Engineering from Molecular Simulations: Toward a Binding-Focused Strategy (PELE) [104]
The objective of this study was to computationally design an evolved laccase with increased reactivity. The enzyme studied was Pycnoporus cinnabarinus laccase (PcL), and the substrates employed to screen activity were 2,2′-azino-bis(3-ethylbenzothiazoline-6-sulfonic acid) (ABTS) and 2,6-dimethoxyphenol (DMP). The CPE strategy is based on a combination of conformational sampling and quantum-chemical reactivity scoring based on changes in the substrate’s spin density (electron transfer). The conformational space of the binding pocket is sampled using PELE. Subsequently, 20 mutant structures showing low binding energy poses were selected and their reactivity was scored by evaluating the amount of spin density localized on the substrate (evaluated using the Mulliken partitioning method) with hybrid quantum mechanics−molecular mechanics (QM−MM) calculations. The QM region consisted of the substrate and the first shell of residues, while the rest of the protein structure was treated with an OPLS-AA force field (classical MM). Desolvation effects were neglected to speed up calculations, as the main objective was to screen large numbers of protein mutants at a feasible computational cost. Mutant structure “hits” can be visualized as a two-dimensional plot showing binding energy versus copper-substrate distance (substrate’s center of mass). Two different substrate-binding modes were used for the DMP substrate (resulting from docking studies using GLIDE). The evolved laccase carries five mutations: P394H and N208S, located in the T1 pocket, N331D and D341N, relatively close to the substrate entrance, and R280H, located far away on the protein surface. It showed improved kcat for both substrates (13-fold for ABTS and ~19-fold for DMP). Correlation studies between the rate constant (kcat) and the redox potential difference (ΔE°) suggested that the reduction is the rate-limiting step of the catalytic process, determined by the free energy difference between products and reactants.
3.2.3. Computational Redesign of Acyl-ACP Thioesterase with Improved Selectivity Towards Medium-Chain-Length Fatty Acids (IPRO) [126]
The IPRO algorithm was used to design thioesterase (TesA) variants with enhanced C12 or C8 specificity while maintaining high activity. After four rounds of structure-guided mutagenesis, the authors identified three variants with enhanced production (reactivity) of dodecanoic acid (C12) and 27 variants with enhanced production of octanoic acid (C8). The top variants reached up to 49% C12 and 50% C8 while exceeding native levels of total free fatty acids. The potential of the IPRO algorithm to aid in protein engineering efforts was demonstrated using a Design–Build–Test–Learn approach to alter the substrate preference of TesA.
4. Engineering Protein Stability
Protein folding is mainly driven by intramolecular interactions between residues and hydrophobic effects leading to a well-defined native protein structure [
129]. However, the native conformation co-exists with misfolded and unfolded states. Free energy differences between the different conformational states of a protein determine which of the states is most populated. Protein conformational stability is thus defined as the free energy equilibrium between folded and misfolded states. Protein structures presenting lower energies in misfolded states can lead to aggregation.
Protein stability also refers to the resistance capacity of the protein’s native structure to high temperatures, denaturant agents, proteases, and non-physiological pH. The overall stability of a protein is determined by non-covalent interactions (e.g., hydrophobicity, van der Waals interactions, hydrogen bonding, and electrostatics) forming interaction networks that stabilize the native structure.
The topic of engineering biocatalysts for improved stability was thoroughly reviewed by Bommarius and Paye [
130] and more recently by Musil et al. [
5] from a computational perspective. In the following sections, we present a selection of computational methods and scoring functions for the rational and automated computational design of biocatalysts with enhanced stability as well as presenting successful examples of use (see the list in
Table 4).
4.1. Computational Methods to Engineer the Protein Stability
4.1.1. Phylogenetic Analysis-Based Methods
Ancestral sequence reconstruction (ASR) is based on the assumption that ancestral enzymes existed in a much hotter environment billions of years ago, with thermophilic organisms present on the earliest branches of the tree of life. Under this assumption, reconstructing ancestral sequences by phylogenetic analysis should reveal thermostable enzymes. BAli-Phy [
131] implements ASR for enzyme optimization. The application of this method to adenylate kinase (Adk) resulted in the improvement of thermostability at 35 °C and near 2-fold catalytic activity enhancement [
132].
E. coli expression of ancestral and modern Adk sequences revealed salt bridges as the primary source for differential stability. In a similar manner, Damborsky et al. used ASR theory for the improvement of Haloalkane dehalogenase thermostability (∆Tm up to 24 °C) [
134]. On the other hand, consensus design (CD) relies on the assumption that the consensus residue at a given position in a multiple sequence alignment must be contributing the most to protein stabilization (not considering catalytic residues) compared to non-conserved residues [
156]. CD differs from ASR in that it does not reconstruct ancestral sequences or the entire phylogeny; instead, it uses a Multiple Sequence Alignment (MSA) to extract the most conserved residues.
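The consensus idea can be sketched in a few lines of Python. The example assumes a pre-computed alignment supplied as equal-length strings with "-" for gaps; the frequency cutoff, the function names and the `protected` argument (for example, catalytic positions) are illustrative assumptions and do not correspond to any specific tool.

```python
from collections import Counter

def consensus_sequence(msa, gap="-"):
    """Most frequent non-gap residue per alignment column, with its frequency."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append((gap, 0.0))
            continue
        residue, n = counts.most_common(1)[0]
        result.append((residue, n / len(msa)))
    return result

def back_to_consensus(target, msa, min_freq=0.6, protected=()):
    """Suggest substitutions where the aligned target differs from a strong consensus.

    Positions are reported as 1-based alignment columns, not as the
    numbering of the unaligned target sequence.
    """
    suggestions = []
    for i, (residue, freq) in enumerate(consensus_sequence(msa)):
        if i in protected or target[i] == "-" or residue == "-":
            continue
        if target[i] != residue and freq >= min_freq:
            suggestions.append(f"{target[i]}{i + 1}{residue}")
    return suggestions
```

In practice, such suggestions would still need to be filtered, for instance by excluding catalytic residues as noted above.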
4.1.2. Rational Design by Molecular Modeling
Rational design involves the study and characterization of the contribution of each residue to protein stability. One approximation is the modification [
157] or even deletion [
158] of flexible loops or residues, which can lead to the improvement of enzyme thermostability. A remarkable strategy is ‘loop grafting’, which stands for the accommodation/transfer of validated thermostable loops from other proteins to the target [
159]. This strategy was successfully applied to enhance the thermostability of subtilisin E-S7 (SES7) peptidase [
160] and proline 4-hydroxylase [
161]. Other rational approaches for the enhancement of thermostability involve protein surface-charge optimization [
162,
163], mutation of surface residues following the proline rule [
164,
165] and the introduction of disulfide bonds [
166,
167,
168]. Rational design is also applied for the enhancement of enzyme stability for detergent formulation, which is a major challenge in laundry industries. For example,
Bacillus stearothermophilus neopullulanase [
169] (bsNpl) was rationally engineered for improved activity at elevated temperatures and high surfactant concentrations. Protein structure was visually inspected for determination of internal cavities and residue positions for which an amino acid exchange could be beneficial. This rational design resulted in a drastic stabilization of bsNpl against inactivation by heat and detergents derived from five mutations. Importantly, the catalytic activity of the enzyme remained identical to the WT enzyme.
Available software for protein stability prediction based on Gibbs free energy calculation includes FoldX [
140], ERIS [
138] and PoPMuSiC [
135]. FoldX is an empirical force field developed for the prediction of mutational effects on the stability, folding and dynamics of proteins. The force field consists of a linear combination of empirical terms, including non-bonded terms (H-bonds, VdW and electrostatics), solvent interactions accounting for (de)solvation effects and explicit treatment of water molecules with persistent interactions (more than two hydrogen bonds). A unique feature of FoldX among other force fields is the estimation of the entropy derived from statistical analysis of the phi–psi distribution of a given amino acid throughout a set of high-resolution crystal structures.
Another powerful tool for stability prediction is ERIS [
138], a web server using a physical force field with atomic modeling and implemented backbone flexibility capabilities, allowing for higher predictive power on “small-to-large” mutations. The scoring function is expressed as a weighted sum of van der Waals forces, solvation, hydrogen bonding and backbone-dependent statistical energies. ERIS showed a correlation of 0.75 with experimental ∆∆G for 595 mutants of five proteins [
138]. It also presents a pre-relaxation option for low-resolution structures; therefore, its use is recommended for homology modeling-derived protein structures.
PoPMuSiC [
135] is a web server presented as a Protherm [
170] subset, for the prediction of mutational effects on protein stability based on the use of statistical potentials (knowledge-based). It uses a force field equation based on 13 physical and biochemical terms, including amino acid type, solvent accessibility, torsion angles, backbone conformation and distance between geometric centers of the side chains for every pair of atoms. PoPMuSiC only requires the WT protein or peptide structure in PDB format as input.
4.1.3. Knowledge-Based Scoring Functions
DFIRE [
143] (Distance-scaled, Finite-Ideal gas REference state) is a knowledge-based potential for the prediction of folding stability. It is an all-atom, distance-dependent, pairwise statistical energy function used to calculate the Potential of Mean Force (PMF) for mutations with a decreased number of atoms (avoiding small-to-large mutation predictions). The predicted free energy change due to mutation is calculated by assuming no structural relaxation after mutations. An extension of DFIRE called dDFIRE (dipolar DFIRE) was developed by Yang and Zhou [
129] based on the orientation angles involved in dipole–dipole interactions which significantly improved DFIRE performance in segment refolding. DFIRE has been successfully implemented into DMUTANT [
144].
4.1.4. Machine-Learning Methods and Scoring Functions
I-Mutant 3.0 [
145] (an extension of I-Mutant 2.0 [
171]) is a support vector machine (SVM)-based tool. It presents two different capabilities: (i) discrimination between stabilizing, destabilizing and neutral effects upon single point mutations and (ii) a regression estimator for predicting ∆∆G. I-Mutant 3.0 can use both protein sequence and structure with a prediction power of 56% and 61%, respectively, using data extracted from the Protherm database. The 3.0 version uses an input vector consisting of 42 values, including temperature, pH, residue type and residue environment. The last value accounts for the spatial environment when structure is available and for the nearest sequence neighbors when only using sequence data. On the other hand, MAESTRO [
146] (multi-agent stability prediction upon point mutations) is a more complex ML-based software for protein stability prediction. MAESTRO is structure-based and was also trained using data from the Protherm [
170] database. It combines neural networks with SVM, regression analysis and statistical potentials providing additional sequence and structural information (such as protein size or solvent accessibility) which can be used to select specific mutation sites. Individual results from the different agent predictors are combined in order to provide a consensus prediction for point mutations resulting in the multi-agent method. Moreover, MAESTRO software also presents running modes for disulfide-bond introduction-site prediction and multiple point mutation greedy scan. The results are presented as ∆∆G prediction with associated confidence estimation.
Deep learning methods have become increasingly prominent in enzyme engineering due to their ability to learn complex patterns from data to address not only protein stability but catalytic efficiency and substrate specificity as well. These are out of the scope of this review but have been excellently reviewed by many authors [
94,
172].
4.1.5. Hybrid Approaches
Computational methods combining different theoretical frameworks also exist aimed at enhancing protein stability. PROSS [
154,
173] is a web server combining multiple sequence alignment analysis and Rosetta modeling for the calculation of energy differences upon single-point mutation to define a space of potentially stabilizing protein mutations. From these, the optimal combination of mutations is identified by combinatorial sequence design with Rosetta.
The FRESCO [
152] (Framework for Rapid Enzyme Stabilization by Computational Libraries) strategy uses FoldX and Rosetta_ddg for the prediction of the free energy change (∆∆G) derived from point mutations. It then uses the Dynamic Disulfide Discovery (DDD) algorithm (based on an ensemble of structures from an MD simulation) to search for stabilizing disulfide bonds, as demonstrated on limonene epoxide hydrolase. In another application, the use of the FRESCO strategy on glucose oxidase [
142] was reported to enhance its thermostability by 8.5 °C with increased pH tolerance (up to pH 8), where the WT becomes inactive. Moreover, the combination of these stabilizing mutations resulted in a 2-fold activity increase for gluconic acid production at industrially viable conditions.
FireProt [
153] is a web server combining energy- and evolution-based approaches for predicting highly stable multiple-point mutants. For the energy-based approach, FireProt performs a conservation and correlation analysis with subsequent filtering using Rosetta and FoldX predictions. On the other hand, the evolution-based approach performs back-to-consensus analysis and then uses FoldX for filtering.
4.2. Examples of Protein Stability Engineering Using Computational Methods
4.2.1. Computation-Aided Engineering of Starch-Debranching Pullulanase from Bacillus Thermoleovorans for Enhanced Thermostability [141]
In this work, the authors combined FoldX, DFIRE and I-Mutant 3.0, resulting in a 3.8 °C increase in Tm and a 2.1-fold longer half-life than the wild type at 70 °C. First, FoldX was used to perform Site Saturation Mutagenesis on a list of MD-predicted flexible residues. Subsequently, DFIRE and I-Mutant 3.0 were used to verify predicted stable mutants. The procedure resulted in six experimentally confirmed mutants enhancing thermostability from 17 computational designs.
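The logic of screening with one predictor and verifying with others can be sketched as follows. This Python example is a hypothetical filter, not the authors' pipeline: it assumes that all ∆∆G values have already been converted to a common sign convention (negative means stabilizing, in kcal/mol), since the individual tools report their scores differently, and the cutoff is an arbitrary illustrative value.

```python
def saturation_library(position, wild_type, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """All 19 single substitutions at one position, e.g. 'A10V'."""
    return [f"{wild_type}{position}{aa}" for aa in alphabet if aa != wild_type]

def consensus_stabilizing(primary, verifiers, cutoff=-1.0):
    """Select mutations that pass the primary screen and all verifiers.

    primary:   dict mutation -> predicted ddG from the screening tool.
    verifiers: list of dicts mutation -> predicted ddG from other tools.
    Assumes negative ddG means stabilizing for every input (hypothetical
    pre-normalized values). A mutation missing from a verifier is treated
    as not confirmed.
    """
    selected = []
    for mutation, ddg in primary.items():
        if ddg > cutoff:
            continue  # not stabilizing enough in the primary screen
        if all(v.get(mutation, 0.0) < 0.0 for v in verifiers):
            selected.append(mutation)
    # Most stabilizing (most negative primary ddG) first.
    return sorted(selected, key=primary.get)
```

Requiring agreement between predictors trades recall for precision, which suits campaigns in which only a small number of designs can be tested experimentally.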
4.2.2. Engineering a Thermostable Fungal GH10 Xylanase [137]
PoPMuSiC was used to predict potential key regions that might be crucial for enhancing Xyn10A_ASPNG thermostability. The flexibility of each residue of the modeled Xyn10A was evaluated from the computation of protein folding free energy changes (−∆∆G) resulting from all possible amino acid substitutions. Four rounds of iterative saturation mutagenesis generated a quintuple mutant 4S1 (R25W/V29A/I31L/L43F/T58I), which exhibited a thermal inactivation half-life (t1/2) at 60 °C that was prolonged 30-fold in comparison with the wild-type enzyme. Furthermore, the mutant melting temperature (Tm) increased by 17.4 °C compared to the wild type. The notable improvement in thermostability of 4S1 was attributed to the synergistic effects of the five mutations.
4.2.3. Thermostability Improvement of the Glucose Oxidase from Aspergillus Niger for Efficient Gluconic Acid Production [142]
The FRESCO workflow was used to design variants of a glucose oxidase from Aspergillus niger for industrial applications with minimal experimental screening. Energy calculations with FoldX, Rosetta_ddg and ABACUS were performed to identify the potentially stabilizing mutations for further evaluation. The relative folding free energy changes (ΔΔGFold) were predicted by the FoldX and Rosetta_ddg algorithms. To enrich the beneficial mutations in the in silico library, the mutations were subsequently screened by visual inspection and molecular dynamics (MD) simulation. For each mutant, five independent 100-ps MD simulations with different randomly assigned initial atom velocities were performed using the Yamber3 force field. The combined mutant AnGOD-m containing five stabilizing mutations (T10K, A36M, R145N, G274S and E374Q) showed a +8.5 °C higher Tm value compared to the wild-type enzyme. At 40 °C, the variant maintained 85% residual activity at pH 5.5 and 6.0 and 75% residual activity at pH 7.0, while the wild type maintained approximately 75% residual activity at pH 5.5, and 60% residual activity at pH 6.0 and 7.0.
4.2.4. Disulfide Bond Engineering of an Endoglucanase from Penicillium Verruculosum to Improve Its Thermostability [168]
A structure-based design of disulfide bonds was performed through Cys scanning to identify potential mutations that can result in disulfide bonds using Schrödinger’s BioLuminate software. Two improved enzyme variants, S127C-A165C (DSB2) and Y171C-L201C (DSB3), were obtained. Both engineered enzymes displayed a 15–21% increase in specific activity against carboxymethylcellulose and β-glucan compared to the wild-type. After incubation at 70 °C for 2 h, they retained 52–58% of their activity, while EGLII-wt retained only 38% of its activity.
5. Improving Protein Solubility
Protein solubility is a complex feature involving different physical and biological properties. Solubility is mainly related to the aggregation or self-association propensity of proteins which can be explained as an alternative and thermodynamically stable protein folding [
174]. Solubility property can be described quantitatively by measuring protein expression (expression yield) or qualitatively (soluble/insoluble). There exists little knowledge about which descriptors can be used to predict protein solubility. It is known that negative surface charge correlates with increased solubility [
175] and that protein aggregation is directly correlated to the number of aggregation-prone regions (APRs) present in protein sequences [
174,
176,
177,
178]. APRs are short stretches of 10–15 residues with a tendency to self-associate into ordered intermolecular beta-sheet or “cross-beta” spines [
179]. In fact, proteins have evolved to be soluble in native physiological conditions and, as a result, recombinant proteins over-expressed for industrial or therapeutic uses present high aggregation rates. Globular proteins present higher aggregation rates with approximately 2–4 APRs per domain [
180] given their need for a hydrophobic core for secondary structure organization, thus generating aggregation-prone amino acid sequences. On the other hand, monoclonal antibodies (mAb) also present high aggregation rates promoted by APRs mainly located at complementarity-determining regions [
181,
182,
183]. However, only solvent-accessible APRs can form stable interactions leading to protein aggregation. In contrast, buried APRs often contribute to protein structure and function, and disrupting them without knowledge of their contributions can lead to protein destabilization and/or loss of function. Regarding surface charge, a key strategy for reducing aggregation propensity is to modulate the isoelectric point [
184], since reducing the protein’s total net charge decreases protein–protein repulsion and thus increases the probability of aggregation.
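The link between pH, net charge and the isoelectric point can be illustrated with a minimal sketch. It assumes a simple Henderson–Hasselbalch model with approximate, textbook-style pKa values; the function names and the pKa table are illustrative assumptions and are not taken from any of the tools reviewed here.

```python
# Approximate pKa values for ionizable groups (illustrative assumption).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_term": 2.0}


def _group_counts(sequence, groups):
    """Count each ionizable group; a chain has one N- and one C-terminus."""
    return {g: (1 if g.endswith("_term") else sequence.count(g)) for g in groups}


def net_charge(sequence, ph):
    """Henderson-Hasselbalch estimate of the net charge at a given pH."""
    pos = _group_counts(sequence, PKA_POSITIVE)
    neg = _group_counts(sequence, PKA_NEGATIVE)
    charge = sum(n / (1 + 10 ** (ph - PKA_POSITIVE[g])) for g, n in pos.items())
    charge -= sum(n / (1 + 10 ** (PKA_NEGATIVE[g] - ph)) for g, n in neg.items())
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tol=1e-4):
    """Bisection search for the pH at which the net charge is zero."""
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2
```

Under this simplified model, comparing `net_charge(wild_type, 7.0)` with `net_charge(mutant, 7.0)` gives a rough indication of whether a substitution moves the protein away from or towards its isoelectric point at the working pH.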
5.1. Computational Methods to Improve Protein Solubility
Given the difficulty of obtaining other quantitative solubility measurements, no other measurable properties apart from APRs influencing solubility have been characterized to date [
185]. In this sense, solubility prediction tools are mostly based on machine learning approaches, ranging from simple statistical approaches to modern nonlinear methods such as support vector machines, random forests, or deep neural networks for APR detection developed using available data [
5]. For a more detailed review of the re-design of proteins for increased solubility, we recommend reviews by Navarro and Ventura [
177,
186].
The most common software tools for protein solubility design are listed in
Table 5 grouped by the theoretical framework they rely on. Independent of the methodology, these tools can be divided into three categories: (i) tools based on the analysis of protein primary sequences, (ii) methods based on the evaluation of sequence solubility profiles and (iii) tools measuring the effect of point mutations on protein solubility. The first group comprises tools scoring protein sequences with a single value. Examples of these are SolubiS (APR identification and stability prediction), ESPRESSO (expression and solubility estimation), Periscope (
E. coli soluble expression in the periplasm), SoluProt (training dataset restricted to
E. coli expression) and other tools trained using the TargetTrack [
187] database such as SOLpro, PROSO II, ccSOL omics and DeepSol, which are very similar. The second group consists of tools scoring each protein residue with a single score indicating its contribution to the overall solubility of the protein. These tools can present aggregation-proneness predictions or non-dimensional scores. Zyggregator and AGGRESCAN3D 2.0 are aggregation-proneness predictors, while TANGO, WALTZ and PASTA 2.0 provide probability scores based on training with amyloid-forming proteins. Finally, the third group comprises software tools specifically designed for measuring the effect of mutations on protein solubility. However, independently of the group, the outputs of these software tools are typically expressed as non-dimensional arbitrary scores with no correlation with measurable physical properties. Even so, generating quantitative scores for single residues or fixed-size fragments is very useful for the rational design of soluble proteins, whereas whole-protein single solubility scores are useful for genomic projects [
5]. Following this, the last group of software tools is the main object of discussion in this section as we intend to present a comprehensive review of software tools for CPE. The different software and methods presented in this section are divided according to the method they use for generating solubility prediction (
Table 5): sequence analysis, structure analysis, machine learning or hybrid approaches.
5.1.1. Sequence-Based Analysis
The GAP [
188] (Generalized Aggregation Proneness) method is one of the first methods developed for the identification of APRs. It classifies hexapeptide sequences as amyloid fibril-forming or amorphous β-aggregate-forming. The method relies on the observation that amyloid fibril-forming hexapeptides present position-specific amino acid propensities distinct from those of amorphous β-aggregating hexapeptides. Although the GAP method takes the protein sequence as input, it was developed using statistical analysis of computed frequencies of residue pair types from a dataset consisting of 139 amyloid and 168 amorphous peptides. The scoring function is a thermodynamic energy difference potential for each residue pair type (i,j) occurring at alternate and adjacent positions. First, residue pairs at nine different positions of hexapeptides were converted into propensities by normalization with the overall residue pair composition in globular proteins. Then, these propensities were treated as partition functions and converted into thermodynamic energy potentials. The scoring function for predicting hexapeptides as fibril forming or amorphous β-aggregating is highly accurate for most of the peptides. Peptides showing high aggregation propensities are identified as APRs and thus, the method is suitable for the rational design of mutant proteins with enhanced solubility.
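The conversion of residue-pair propensities into energy-like potentials described above can be sketched as follows. This is only an illustration of the general inverse-Boltzmann idea, assuming the usual relation E = −RT ln(propensity); the function names are hypothetical and the exact normalization used in GAP may differ.

```python
import math

RT = 0.593  # kcal/mol at roughly 298 K


def pair_propensity(count_in_class, total_in_class, background_fraction):
    """Frequency of a residue pair in one peptide class (amyloid or amorphous),
    normalized by its fraction in a background set such as globular proteins."""
    return (count_in_class / total_in_class) / background_fraction


def pair_energy(propensity):
    """Inverse-Boltzmann conversion: enriched pairs get negative energies."""
    return -RT * math.log(propensity)


def energy_difference(propensity_amyloid, propensity_amorphous):
    """Energy difference potential for one residue pair type at one position."""
    return pair_energy(propensity_amyloid) - pair_energy(propensity_amorphous)
```

With this sign convention, a negative difference means that the residue pair is more characteristic of fibril-forming than of amorphous β-aggregating hexapeptides.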
TANGO [
179] calculates the partition function of the conformational phase space, following the principle that any peptide segment can populate any of the structural states (β-turn, α-helix, β-sheet aggregation and α-helical aggregation) according to a Boltzmann distribution. This partition function depends on the energy of each state for a given peptide segment and permits the identification of APRs from the protein primary sequence. TANGO results are normalized between 0 and 1 and presented as a non-dimensional score over protein segments, which was validated against experimental data with an aggregation prediction accuracy of 0.7 [
179]. TANGO incorporates four conformational states and different energy terms, considering hydrophobicity and solvation energetics, electrostatic interactions and hydrogen bonding. The model is designed to predict β-aggregation in peptides and proteins, and its phase space encompasses the random coil and native conformations as well as the other major conformational states, namely β-turn, α-helix and β-aggregate. TANGO is one of the first methods developed for the computational prediction of protein aggregation and it has been widely applied as a stand-alone tool. For example, TANGO was used as the core algorithm for the design of anti-amyloid cyclic peptides with increased solubility for therapeutic use as an anti-Alzheimer treatment [
190]. It was also applied for the identification of Src homology 2 (SH2) domain mutations increasing the yield of soluble TSAd-SH2 domains [
191]. A 9-residue-long sequence (SAVTFVLTY) was identified as the key factor leading to beta-sheet aggregation. The TFV to GYT sequence mutation doubled the yield of soluble protein expression.
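The statistical-mechanics idea behind this type of predictor can be illustrated with a short sketch: given an energy for each conformational state of a segment, the partition function yields the population of each state. The state names follow the text, but the energies in the usage note are made-up placeholders, and the real TANGO energy function is not reproduced here.

```python
import math


def boltzmann_populations(state_energies, rt=0.593):
    """Return the Boltzmann population of each state.

    state_energies maps a state name to its energy (kcal/mol);
    rt is the thermal energy RT at roughly 298 K.
    """
    weights = {s: math.exp(-e / rt) for s, e in state_energies.items()}
    partition_function = sum(weights.values())
    return {s: w / partition_function for s, w in weights.items()}
```

For example, `boltzmann_populations({"random_coil": 0.0, "beta_turn": 1.2, "alpha_helix": 0.8, "beta_aggregate": -0.5})["beta_aggregate"]` gives the fraction of the segment expected in the aggregated state for these hypothetical energies; a segment would be flagged as aggregation-prone when this fraction is high.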
SODA [
193] is a sequence-based method combining PASTA [
195] aggregation energy score, ESpritz [
211] intrinsic disorder propensity score, the negative Kyte–Doolittle hydrophobicity profile [
212] and FESS [
213] estimated secondary structure propensities for α-helix and β-strand. The individual score differences ∆S are summed and normalized over the full sequence. SODA is very efficient at predicting solubility decreases, with a prediction accuracy of 72%. SODA provides two types of analysis, namely ‘mutation mode’ and ‘full-protein mode’. The first provides the solubility change upon sequence mutation. The second generates a profile describing the contribution of each sequence position to solubility, deduced from the effect of all possible mutations. SODA was used in the structural analysis of the STN1 gene involved in Coats plus syndrome for the solubility prediction of pathogenic mutations [
194]. The results showed that 10 out of 30 pathogenic mutations decrease protein solubility. These findings are expected to be useful for the development of novel strategies for the therapeutic treatment of Coats plus syndrome.
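As an illustration of one of the score differences mentioned above, the following sketch computes only the hydrophobicity term for a point mutation, using the Kyte–Doolittle scale. It is a simplified stand-in rather than the SODA implementation: the aggregation, disorder and secondary-structure terms are omitted, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def negative_hydropathy(sequence):
    """Mean negative hydropathy: higher values mean a more hydrophilic chain."""
    return -sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def apply_mutation(sequence, position, new_residue):
    """Return the sequence with the residue at a 1-based position replaced."""
    i = position - 1
    return sequence[:i] + new_residue + sequence[i + 1:]


def delta_s_hydropathy(wild_type, position, new_residue):
    """Hydrophobicity contribution to the mutant-minus-wild-type difference."""
    mutant = apply_mutation(wild_type, position, new_residue)
    return negative_hydropathy(mutant) - negative_hydropathy(wild_type)
```

In this sketch a positive value indicates that the substitution makes the sequence more hydrophilic, which is only one of several contributions that a tool such as SODA combines.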
PASTA 2.0 [
195] is a web server that evaluates the stability of putative intermolecular β-strand pairings between different sequence stretches and predicts amyloid-forming regions from protein sequences. PASTA discriminates between parallel and anti-parallel β-strand orientations. The PASTA 2.0 scoring function was calibrated using the TESE dataset, extending the energy parameters of the previous PASTA version (hydrogen bonding statistics on β-strands). In this version, an ML method was implemented for the detection of secondary structure while maintaining a disorder predictor. These two outputs provide easily interpretable structural information that can be related to the aggregation prediction. PASTA 2.0 performance in assigning aggregation to sequence stretches was tested using the AMYLPRED2 test set, consisting of 33 proteins with 1260 annotated aggregating residues. Results were compared to other aggregation predictors such as Aggrescan and TANGO. PASTA 2.0 showed 41% sensitivity with 85% specificity, in contrast with sensitivities of 35% and 14% for Aggrescan and TANGO, respectively. Moreover, PASTA 2.0 reached the highest MCC (Matthews correlation coefficient, which measures the quality of binary classifications) with 0.24, compared with Aggrescan (0.13) and TANGO (0.14). These results established PASTA 2.0 as a ‘hot spot’ predictor with high confidence, with triosephosphate isomerase [
196] and the HelD helicase [
197] being examples of use. In the first case, eight regions from human triosephosphate isomerase (HsTPI) were predicted as fibrillogenic regions, five of them located at β-strand regions, and were selected for experimental aggregation studies. Among these regions, four were experimentally validated as fibrillogenic, corresponding to the β3, β6, β7 and α8 of the TIM barrel. Likewise, PASTA 2.0 was used along with Aggrescan and FoldAmyloid for the prediction of aggregation regions in RNA Polymerase Interacting Helicase (HelD) from
B. subtilis, where 20 regions spread across the sequence were suggested to aid the formation of amyloid-like fibrils.
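For readers less familiar with the benchmark metrics quoted above, the sketch below shows how sensitivity, specificity and the Matthews correlation coefficient are obtained from per-residue confusion-matrix counts. The function name is an illustrative choice, and no counts from the cited benchmark are reproduced.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of aggregating residues recovered
    specificity = tn / (tn + fp)   # fraction of non-aggregating residues rejected
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc
```

Because aggregating residues are usually a small minority of a protein, MCC is a more informative summary than accuracy alone, which is why modest-looking values such as those reported above can still separate predictors.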
ESPRESSO [
198] (EStimation of PRotein ExpreSsion and Solubility) is a web server developed for the estimation of protein expression and solubility in
E. coli and wheat germ. The training datasets consisted of 5100 proteins (1774 soluble and 3326 insoluble) and 2939 (1941 soluble and 998 insoluble) for
E. coli and wheat germ, respectively. For the
E. coli dataset, SVM was used for training statistical models for expression and solubility prediction, while a sequence pattern-based method was used for wheat germ. ESPRESSO presents two prediction methods: (i) a sequence/predicted structural property-based method and (ii) a sequence pattern-based method. The second method enables users to mutate candidate predicted regions to improve expression and/or solubility. For this method, the scoring function is calculated as the difference between the frequencies of sequence patterns in the positive (soluble) and negative (insoluble) datasets and is presented as a normalized dimensionless value from 0 to 1 with an associated
p-value for each score. In this way, predictions can discriminate soluble from insoluble regions or motifs. For insoluble regions, ESPRESSO suggests point mutations for enhancing solubility, so it can be used for the rational redesign of insoluble proteins. An example of this use is the selection for experimental testing of different anthranilate phosphoribosyltransferase (AnPRT) protein variants with potentially high solubility [
199].
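The pattern-based score described above can be sketched in a few lines. This is a toy version under stated assumptions: patterns are treated as plain substrings, the linear rescaling to the 0 to 1 range is an assumption made here for illustration, and the p-value calculation is omitted.

```python
def pattern_frequency(pattern, sequences):
    """Fraction of sequences that contain the pattern at least once."""
    return sum(pattern in seq for seq in sequences) / len(sequences)


def pattern_score(pattern, soluble, insoluble):
    """Frequency difference between soluble and insoluble sets, mapped to 0..1.

    0.5 means the pattern is equally common in both sets; values above 0.5
    mean it is enriched in soluble proteins.
    """
    difference = pattern_frequency(pattern, soluble) - pattern_frequency(pattern, insoluble)
    return (difference + 1.0) / 2.0   # assumed rescaling from [-1, 1] to [0, 1]
```

A region whose patterns score well below 0.5 would be a candidate for mutation, after which the mutated region can be rescored in the same way.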
5.1.2. Structure-Based Analysis
The CamSol [
200] web server consists of three algorithms that can be used individually for specific tasks or together to rationally design protein variants with enhanced solubility: (i) a sequence-based predictor of intrinsic solubility profiles and solubility scores, (ii) an algorithm exploiting knowledge of the native structure to perform structural corrections to the intrinsic solubility profile, and (iii) an algorithm analyzing the solubility profile to identify suitable sites for amino acid substitution or insertion. Increasingly negative profiles represent increasingly insoluble regions, while positive profiles represent increasingly soluble ones. The first algorithm can be used to screen protein-variant libraries for enhanced solubility. When combining the three algorithms, CamSol performs a systematic screening of thousands of substitutions and insertions while preserving fundamental properties to identify the most soluble variant. Interestingly, CamSol accepts (i) low-resolution and homology modeling-derived structures and (ii) a list of non-mutable residues (catalytic or structurally important). Pascal et al. used the CamSol web server for the assessment of the solubility profile and subsequent rational design of a plant rhabdovirus glycoprotein for the production of immunoreactive murine anti-sera [
201]. The native signal peptide of the Lettuce necrotic yellow virus (LNYV) rhabdovirus glycoprotein was substituted with that of the Rabies virus glycoprotein based on CamSol solubility predictions. His6 and FLAG tags, which were also predicted to enhance solubility, were added at the N- and C-termini, respectively. In this case, increased glycoprotein solubility had previously been related to higher expression yields. Another example is the rational design of β-2 microglobulin (β-2m) mutants with reduced aggregation propensity, which is calculated as the inverse of the CamSol solubility score. Selected substitution sites were mutated to all possible amino acids. The results identified the V85E mutation as aggregation-resistant but with reduced thermostability (with a Tm value decreased by about 3 °C relative to WT). More recently, CamSol was used for the solubility prediction of a multi-epitope vaccine against SARS-CoV-2 developed by Martin and Cheng [
203].
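The way a per-residue solubility profile and a list of non-mutable residues can be combined to shortlist substitution sites is sketched below. The profile values, the threshold of −1.0 and the function name are assumptions chosen for illustration; they do not reproduce the CamSol algorithm or its parameters.

```python
def candidate_sites(profile, protected=frozenset(), threshold=-1.0):
    """Return 1-based positions whose profile value is below the threshold,
    skipping protected (e.g., catalytic) residues, most insoluble first."""
    hits = [
        (score, position)
        for position, score in enumerate(profile, start=1)
        if score < threshold and position not in protected
    ]
    return [position for score, position in sorted(hits)]
```

For a hypothetical profile `[0.5, -1.5, -2.0, 0.1, -1.2]` with residue 3 protected, the sketch returns positions 2 and 5 as the sites to explore by substitution or insertion.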
AGGRESCAN3D [
204,
205] (A3D) is a structure-based solubility prediction method, developed by combining the sequence-based AGGRESCAN [
214] residue aggregation propensity and structural information. A3D allows for the detection of APRs while incorporating a mutation module that allows the design of proteins with increased solubility by mutating the detected aggregation-prone residues or their surroundings. The aggregation propensity is calculated for spherical regions centered on every residue Cα carbon. Moreover, A3D presents a ‘Dynamic Mode’ to analyze the impact of structural fluctuations on the aggregation propensity by minimizing the structure using FoldX, followed by CABS-flex [
215] simulation of protein structure flexibility. The resulting trajectory is automatically processed to provide a set of protein models (in an all-atom resolution) reflecting the most dominant structural fluctuations in the near-native ensemble. Recently, an extension of A3D was released as a 2.0 version with a larger analysis range for proteins of up to 4000 residues long, a feature for simultaneous prediction of changes in protein solubility and stability and an ‘automated mutations’ tool for suggesting protein variants with optimized solubility.
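The idea of evaluating aggregation propensity over spherical regions centered on each Cα atom can be illustrated with a small sketch. It simply averages per-residue propensities within a sphere; the 10 Å radius is an assumed value, and the solvent-exposure and distance weighting that a tool such as A3D applies are not modeled here.

```python
import math


def sphere_averaged_scores(ca_coords, residue_scores, radius=10.0):
    """Average the intrinsic scores of all residues whose C-alpha atom lies
    within `radius` (in angstroms) of each residue's own C-alpha atom."""
    averaged = []
    for center in ca_coords:
        neighbors = [
            score
            for coord, score in zip(ca_coords, residue_scores)
            if math.dist(center, coord) <= radius
        ]
        averaged.append(sum(neighbors) / len(neighbors))  # always includes self
    return averaged
```

The effect is that an aggregation-prone residue surrounded by soluble neighbors in space is down-weighted, whereas residues that are distant in sequence but clustered in the structure can reinforce each other.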
A more recent solubility prediction tool is AggScore [
206]. The algorithm is entirely based on a three-dimensional structure and is able to identify APRs by quantifying the energetic contribution of each residue to respective hydrophobic and electrostatic surface patches. After the identification of APRs, these surface patches are further classified into three categories, based on their surface potential values: hydrophobic, positive and negative regions. The hydrophobic potential is calculated for each atom based on logP parameters projected onto the interaction surface. On the other hand, positive and negative hydrophilic APRs are calculated using atom partial charges for the surface projections. The scoring function for aggregation propensity (AggScore) calculates the intensity and relative orientation of these respective APRs as the sum of the aggregation propensity values at each residue.
5.1.3. Machine Learning
ML methods have historically been the most widely used for predicting protein solubility [
94,
216], mainly for the enhancement of recombinant overexpression in
E. coli [
216]. PON-sol [
207] web server is a 2-layer Random Forest (ML) predictor that classifies mutant protein variants into three classes: increased, decreased and no effect on solubility. The software performance was tested and compared against CamSol and OptSolmut [
208], where it showed higher predictive accuracy in a blind test, with 43% of correct predictions against 35% and 28% for CamSol and OptSolmut, respectively, although the dataset consisted of fewer than 400 protein variants. PON-sol was applied to Interleukin-1β as a case example, resulting in 1030 mutations predicted to increase solubility from a total set of 2907 variants (35.4%). Among these, seven positions were characterized as hotspots contributing the most to increasing the predicted solubility score when mutated.
OptSolmut [
208] is an ML-derived scoring function capable of measuring the “degree of buriedness” for three-body contacts under the framework of Delaunay Tessellation (DT), a geometry-based construct defining clusters of nearest neighboring points or four-body contacts. This construct was used for the successful identification of mutants improving stability and enzyme reactivity. The degree of buriedness is a coarse-grained estimation of surface exposure with no measure of surface areas, allowing for a better definition of neighboring residues, with the assumption that solubility is predominantly a surface property. The scoring function for solubility prediction is based on measuring the frequencies of amino acid triplets presenting a low ‘buriedness degree’, meaning that these triplets are located predominantly at the protein surface. The three-body contacts or triangles are classified as buried when forming part of two tetrahedrons in DT. The total score of a protein structure conformation is the sum of the individual scores of amino acid triplets. Importantly, this scoring function based on groups of surface residues allows for the prediction of single and multiple-point mutational effects on solubility in a unified way, in contrast with most solubility prediction tools. The scoring function was trained using the ML Linear Programming (LP) approach for binary classification of amino acid triangles as buried and non-buried. The training was carried out using a dataset consisting of 137 single- and multiple-point mutants for changes in solubility extracted from the literature. In cross-validation studies, OptSolmut was compared with two classification methods, SVM and Lasso, and outperformed both with an 81% overall accuracy. However, results should be taken with care given the small dataset used for validation.
One of the most recently developed ML tools for the prediction of solubility enhancement is Cordax [
209], a structure-based machine learning approach that explores sequence determinants of amyloid propensity. Cordax explores amyloid sequence beyond the identification of APRs. First, a curated dataset was built consisting of 78 short-segment fibril core high-resolution structures from PDB, which were grouped into distinct classes based on topology and their overall structural properties. Cordax followed the same initial approximation as used in GAP, dividing the amino acid sequences into hexapeptides for the training dataset, yielding 179 peptide fragment structures. Then, amyloid interaction interfaces were analyzed in detail following energy refinement by the FoldX force field. Free energies calculated with FoldX were used to train a logistic regression model with binary classification. Cordax provides two outputs. First, the logistic regression predicts whether the segment is an amyloid core sequence. Second, for the sequences classified as amyloid core-forming, the most likely amyloid core model is provided. Hexapeptides presenting scores equal to or above the aggregation propensity threshold (0.71) are considered APRs. In this way, Cordax enables the prediction of APRs with different feature properties such as high solubility, high net charge, surface exposure in protein native folds, composition similarity to phase transition sequences and disorder or α-helix propensity (conformational switches). Cross-validation accuracy studies for Cordax were compared with different methods for amyloid aggregation propensity prediction such as TANGO, PASTA, AGGRESCAN and GAP, previously explained in this section. The receiver operating characteristic (ROC) curves generated showed that Cordax performed best among all predictors tested, with an accuracy of 0.81.
This high accuracy in identifying amyloid fibril-forming regions can be explained, as Cordax is able to detect APRs not presenting typical sequence propensities detected by sequence-based predictors, such as hydrophobicity or β-structure tendency. Cordax accuracy was further validated by synthesizing a subset of 96 peptides from detected APRs of the protein initial training dataset with more than half (55.3%) being predicted specifically by Cordax. Apart from a large cluster corresponding to sequences found in the hydrophobic core of globular proteins, Cordax also found clusters corresponding to surface-exposed amyloid sequences, small aliphatic functional amyloids, N/Q/Y prions, strongly helical and intrinsically disordered sequences which could be compatible with liquid–liquid phase responsive sequences.
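The windowing and thresholding steps described for Cordax can be sketched as follows. Only the hexapeptide window size and the 0.71 threshold come from the text; the single-feature logistic model, its coefficients and the function names are hypothetical, and the FoldX energy calculation itself is not reproduced.

```python
import math

APR_THRESHOLD = 0.71  # aggregation propensity threshold quoted in the text


def hexapeptide_windows(sequence):
    """Return (1-based start position, hexapeptide) for every sliding window."""
    return [(i + 1, sequence[i:i + 6]) for i in range(len(sequence) - 5)]


def amyloid_probability(interface_energy, weight=-1.0, bias=0.0):
    """Logistic model mapping an interface free energy (kcal/mol) to a score
    between 0 and 1; weight and bias are placeholder coefficients."""
    return 1.0 / (1.0 + math.exp(-(weight * interface_energy + bias)))


def is_apr(interface_energy, weight=-1.0, bias=0.0):
    """Flag a hexapeptide as an APR when its score reaches the threshold."""
    return amyloid_probability(interface_energy, weight, bias) >= APR_THRESHOLD
```

In a full workflow each hexapeptide returned by `hexapeptide_windows` would be threaded onto fibril core templates, scored energetically, and only then passed to the classifier.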
5.1.4. Hybrid Approaches
SolubiS [
210] is a hybrid method for the prediction of mutants reducing aggregation tendency implemented as a plugin in YASARA [
217] software (molecular graphics, modeling and simulation program). It combines TANGO [
179] (detailed in the sequence-based section) and FoldX [
218] tools to guide the design of aggregation-resistant protein sequences. The implementation of FoldX allows for the calculation of the overall energetic effects on protein stability for every APR detected. Importantly, SolubiS presents functionality that allows mutating one or several residues to calculate the effects (increase or decrease) over aggregation tendency presenting results on an easy-to-interpret ‘stretch-plot’. It is worth noting that, while TANGO can detect multiple APRs in a protein, SolubiS score is able to discriminate which of these APRs are solvent-accessible, determining aggregation propensity and thus, mutation targetable. In addition, SolubiS allows for the evaluation of the influence of temperature, ionic strength and pH on the aggregation prediction. SolubiS methodology has successfully been used for decreasing the aggregation propensity of human lysosomal hydrolase α-galactosidase (α-Gal) and protective antigen (PA) of
Bacillus anthracis (anthrax vaccine formulation) [
180], as well as for the engineering of human antibody variable domains with increased aggregation resistance [
182].
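The logic of combining an aggregation predictor with a stability calculation can be sketched as a simple filter. The data class, the field names and the 1.0 kcal/mol tolerance are assumptions for illustration; in practice the two values would come from tools such as TANGO and FoldX, and the sign convention assumed here is that positive ΔΔG values are destabilizing.

```python
from dataclasses import dataclass


@dataclass
class MutationEffect:
    name: str
    delta_aggregation: float  # change in APR score; negative = less aggregation-prone
    ddg_stability: float      # kcal/mol; positive = destabilizing (assumed convention)


def select_mutations(effects, max_ddg=1.0):
    """Keep mutations that lower the aggregation score without destabilizing
    the fold beyond max_ddg, sorted by largest aggregation decrease first."""
    kept = [
        m for m in effects
        if m.delta_aggregation < 0 and m.ddg_stability <= max_ddg
    ]
    return sorted(kept, key=lambda m: m.delta_aggregation)
```

Filtering on both criteria reflects the point made earlier in this section: suppressing an APR is only useful if the mutation does not compromise the stability of the native structure.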
5.2. Examples of Protein Solubility Enhancement Using Computational Tools
5.2.1. Computational Design and Biophysical Characterization of Aggregation-Resistant Point Mutations for Human γD Crystallin [219]
Human γD crystallin is a stable protein expressed in the eye and responsible for lens transparency. However, this protein is susceptible to aggregation during the refolding process. RosettaDesign in combination with aggregation-propensity calculations (AGGRESCAN, PASTA, and TANGO) was used to predict aggregation-resistant mutants by measuring the effect of protein mutations on relative unfolding free energies (ΔΔGun) and intrinsic aggregation propensity (IAP). Despite being the least conformationally stable mutant, S130P was the most aggregation-resistant variant of human γD crystallin, showing a significant decrease in aggregation propensity compared to WT.
5.2.2. Prediction of Hotspots for the Reduction of Aggregation Propensity of Human α-Galactosidase and Protective Antigen of Bacillus Anthracis
Schymkowitz et al. [
180] used SolubiS methodology to decrease the aggregation propensity of human lysosomal hydrolase α-galactosidase (α-Gal) and protective antigen (PA) of
Bacillus anthracis (anthrax vaccine formulation). α-Gal deficiency causes Fabry disease, which can be treated by enzyme replacement therapy. α-Gal was engineered using SolubiS methodology to decrease its aggregation tendency by suppressing APRs. Mutational scanning of gatekeeper residues (charged or proline residues reducing aggregation [
220]) resulted in the identification of A348R and A368P stabilizing APR without lowering intrinsic aggregation. In addition, an exhaustive mutation scan was performed looking for enhanced thermodynamic stability. The results showed S405L tightening the interaction of the edge β-strand. Overall, single mutants showed experimental solubility increased up to 80–90%, compared with 70% for the WT protein. Moreover, double and triple mutants showed a decrease in the insoluble fraction by Western blot analysis. On the other hand, the rational design of aggregation-resistant PA using SolubiS demonstrated that domain 3 is responsible for in vitro aggregation by identifying the double mutant T576E/S559L, which presents improved thermodynamic stability and increased aggregation resistance at 40 °C. Overall, Schymkowitz et al.’s [
180] research demonstrated that alteration in the number of APRs correlates directly with protein solubility and abundance. Another example of SolubiS use is the engineering of human antibody variable domains with increased aggregation resistance [
182]. In this study, APRs present at complementarity-determining regions (CDRs) of monoclonal antibodies were demonstrated to determine the aggregation behavior under mild temperatures. SolubiS was applied to vascular endothelial growth factor (VEGF) for the rational design of mutations targeting APRs, resulting in the identification of different mutant antibodies with improved aggregation resistance under temperature stress.
6. Concluding Remarks and Outlook
This review has presented an overview of the computational methods used to enhance key biocatalytic properties of enzymes, focusing specifically on stability, solubility, substrate specificity, and catalytic efficiency. These methods have proven to be robust and versatile, enabling detailed exploration of enzyme behavior and properties at atomic and molecular levels. They were grouped by the catalytic property they optimize, with the aim of helping non-expert users select suitable tools and move from theoretical to experimental workflows for enzyme optimization.
Despite the advancements in the field, challenges remain. The accuracy of the current computational predictions often relies on the availability of high-quality structural and mechanistic data, which is not always accessible. Additionally, modeling enzyme behavior under non-standard or highly variable conditions, such as extreme temperatures or complex environments, continues to be a limitation. Improvements in scoring functions, particularly those capable of capturing dynamic properties like allosteric effects or solvent interactions, are essential for furthering the utility of these tools. Moreover, greater integration of experimental data into computational frameworks will enhance the reliability of the predictions.
This review deliberately excluded discussions on deep learning methods and de novo protein design, both of which have been reviewed extensively elsewhere. These areas represent a paradigm shift in computational enzyme design. Deep learning approaches, in particular, have demonstrated the ability to process large datasets, uncover hidden patterns in protein sequences, and generate innovative designs beyond the reach of traditional methods. Similarly, de novo design allows for the creation of entirely new enzymes, expanding the boundaries of what is possible in protein engineering.
Looking ahead, the future of computational enzyme design lies in the integration of traditional molecular modeling approaches with the transformative potential of artificial intelligence (AI). Combining the mechanistic insight and interpretability of molecular modeling with the predictive power and data-driven nature of AI offers an unprecedented opportunity to overcome current limitations. For instance, AI-driven tools could refine molecular dynamics simulations by identifying key conformational changes or accelerate sequence space exploration by prioritizing mutations with high potential for success.
Additionally, hybrid approaches that leverage both rational design principles and machine learning models could improve the accuracy of scoring functions, allowing for better prediction of enzymatic properties under diverse conditions. These synergies could also facilitate the design of multi-functional enzymes or enzymes tailored to highly specific industrial applications, such as green chemistry, pharmaceutical synthesis, or bioenergy production.
In conclusion, while traditional computational methods remain a cornerstone of enzyme engineering, the rapid evolution of AI and deep learning is set to redefine the field. The integration of these approaches will enable more efficient, accurate, and innovative enzyme designs, unlocking new possibilities in biotechnology.
Author Contributions
Conceptualization, X.B. and A.P.; methodology, A.V.; investigation, A.V.; writing—original draft preparation, A.V.; writing—review and editing, X.B. and A.P.; supervision, X.B. and A.P.; project administration, A.P.; funding acquisition, A.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by “Ministerio de Ciencia, Innovación y Universidades” (MICIU), Spain, with grant numbers GLYCODESIGN (PID2019-104350RB-I00) and GLYCOENGIN (PID2022-138252OB-I00) to A.P. The APC was funded by MDPI.
Acknowledgments
A.V. acknowledges a predoctoral fellowship from IQS.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
CPE | Computational protein engineering |
ML | Machine learning |
MM | Molecular mechanics |
QM | Quantum mechanics |
NAC | Near attack conformation analysis |
DFT | Density functional theory |
ASR | Ancestral sequence reconstruction |
APRs | Aggregation-prone regions |
References
- Protein Engineering Market Size and Share Report, 2024–2030. Available online: https://www.grandviewresearch.com/industry-analysis/protein-engineering-market (accessed on 30 December 2024).
- Verma, R.; Schwaneberg, U.; Roccatano, D. Computer-Aided Protein Directed Evolution: A Review of Web Servers, Databases and Other Computational Tools for Protein Engineering. Comput. Struct. Biotechnol. J. 2012, 2, e201209008. [Google Scholar] [CrossRef]
- Falivene, L.; Cao, Z.; Petta, A.; Serra, L.; Poater, A.; Oliva, R.; Scarano, V.; Cavallo, L. Towards the Online Computer-Aided Design of Catalytic Pockets. Nat. Chem. 2019, 11, 872–879. [Google Scholar] [CrossRef]
- Ali, M.; Ishqi, H.M.; Husain, Q. Enzyme Engineering: Reshaping the Biocatalytic Functions. Biotechnol. Bioeng. 2020, 117, 1877–1894. [Google Scholar] [CrossRef]
- Musil, M.; Konegger, H.; Hon, J.; Bednar, D.; Damborsky, J. Computational Design of Stable and Soluble Biocatalysts. ACS Catal. 2019, 9, 1033–1054. [Google Scholar] [CrossRef]
- Romero-Rivera, A.; Garcia-Borràs, M.; Osuna, S. Computational Tools for the Evaluation of Laboratory-Engineered Biocatalysts. Chem. Commun. 2017, 53, 284–297. [Google Scholar] [CrossRef]
- Allen, W.J.; Balius, T.E.; Mukherjee, S.; Brozell, S.R.; Moustakas, D.T.; Lang, P.T.; Case, D.A.; Kuntz, I.D.; Rizzo, R.C. DOCK 6: Impact of New Features and Current Docking Performance. J. Comput. Chem. 2015, 36, 1132–1156. [Google Scholar] [CrossRef]
- Lu, X.; Liu, Y.; Yang, Y.; Wang, S.; Wang, Q.; Wang, X.; Yan, Z.; Cheng, J.; Liu, C.; Yang, X.; et al. Constructing a Synthetic Pathway for Acetyl-Coenzyme A from One-Carbon through Enzyme Design. Nat. Commun. 2019, 10, 1378. [Google Scholar] [CrossRef]
- Li, Y.; Netherland, M.D.; Zhang, C.; Hong, H.; Gong, P. In Silico Identification of Genetic Mutations Conferring Resistance to Acetohydroxyacid Synthase Inhibitors: A Case Study of Kochia Scoparia. PLoS ONE 2019, 14, e0216116. [Google Scholar] [CrossRef]
- Cheng, L.; Zhang, H.; Cui, H.; Wang, W.; Yuan, Q. Efficient Production of the Anti-Aging Drug Cycloastragenol: Insight from Two Glycosidases by Enzyme Mining. Appl. Microbiol. Biotechnol. 2020, 104, 9991–10004. [Google Scholar] [CrossRef]
- Jones, G.; Willett, P.; Glen, R.C.; Leach, A.R.; Taylor, R. Development and Validation of a Genetic Algorithm for Flexible Docking. J. Mol. Biol. 1997, 267, 727–748. [Google Scholar] [CrossRef]
- Huang, W.C.; Cullis, P.M.; Raven, E.L.; Roberts, G.C.K. Control of the Stereo-Selectivity of Styrene Epoxidation by Cytochrome P450 BM3 Using Structure-Based Mutagenesis. Metallomics 2011, 3, 410–416. [Google Scholar] [CrossRef]
- Linder, M.; Hermansson, A.; Liebeschuetz, J.; Brinck, T. Computational Design of a Lipase for Catalysis of the Diels-Alder Reaction. J. Mol. Model. 2011, 17, 833–849. [Google Scholar] [CrossRef]
- Ali, A.; Azam, M.W.; Khan, A.U. Non-Active Site Mutation (Q123A) in New Delhi Metallo-β-Lactamase (NDM-1) Enhanced Its Enzyme Activity. Int. J. Biol. Macromol. 2018, 112, 1272–1277. [Google Scholar] [CrossRef]
- Tran, L.T.; Blay, V.; Luang, S.; Eurtivong, C.; Choknud, S.; González-Díaz, H.; Ketudat Cairns, J.R. Engineering Faster Transglycosidases and Their Acceptor Specificity. Green Chem. 2019, 21, 2823–2836. [Google Scholar] [CrossRef]
- Tautermann, C.S. GPCR Homology Model Generation for Lead Optimization. Methods Mol. Biol. 2018, 1705, 115–131. [Google Scholar] [CrossRef]
- Krammer, A.; Kirchhoff, P.D.; Jiang, X.; Venkatachalam, C.M.; Waldman, M. LigScore: A Novel Scoring Function for Predicting Binding Affinities. J. Mol. Graph. Model. 2005, 23, 395–407. [Google Scholar] [CrossRef] [PubMed]
- Pandey, R.P.; Parajuli, P.; Shin, J.Y.; Lee, J.; Lee, S.; Hong, Y.S.; Park, Y.I.; Kim, J.S.; Sohng, J.K. Enzymatic Biosynthesis of Novel Resveratrol Glucoside and Glycoside Derivatives. Appl. Environ. Microbiol. 2014, 80, 7235–7243. [Google Scholar] [CrossRef] [PubMed]
- Li, Q.; Huang, X.; Zhu, Y. Evaluation of Active Designs of Cephalosporin C Acylase by Molecular Dynamics Simulation and Molecular Docking. J. Mol. Model. 2014, 20, 2314. [Google Scholar] [CrossRef]
- Friesner, R.A.; Banks, J.L.; Murphy, R.B.; Halgren, T.A.; Klicic, J.J.; Mainz, D.T.; Repasky, M.P.; Knoll, E.H.; Shelley, M.; Perry, J.K.; et al. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47, 1739–1749. [Google Scholar] [CrossRef]
- Sirin, S.; Kumar, R.; Martinez, C.; Karmilowicz, M.J.; Ghosh, P.; Abramov, Y.A.; Martin, V.; Sherman, W. A Computational Approach to Enzyme Design: Predicting ω-Aminotransferase Catalytic Activity Using Docking and MM-GBSA Scoring. J. Chem. Inf. Model. 2014, 54, 2334–2346. [Google Scholar] [CrossRef]
- Sun, Z.; Lonsdale, R.; Ilie, A.; Li, G.; Zhou, J.; Reetz, M.T. Catalytic Asymmetric Reduction of Difficult-to-Reduce Ketones: Triple-Code Saturation Mutagenesis of an Alcohol Dehydrogenase. ACS Catal. 2016, 6, 1598–1605. [Google Scholar] [CrossRef]
- Lee, H.S.; Park, J.; Yoo, Y.J.; Yeon, Y.J. Engineering D-Lactate Dehydrogenase from Pediococcus Acidilactici for Improved Activity on 2-Hydroxy Acids with Bulky C3 Functional Group. Appl. Biochem. Biotechnol. 2019, 189, 1141–1155. [Google Scholar] [CrossRef]
- Dong, Q.; Yuan, S.; Wu, L.; Su, L.; Zhao, Q.; Wu, J.; Huang, W.; Zhou, J. Structure-Guided Engineering of a Thermobifida fusca Cutinase for Enhanced Hydrolysis on Natural Polyester Substrate. Bioresour. Bioprocess. 2020, 7, 37. [Google Scholar] [CrossRef]
- Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. A Fast Flexible Docking Method Using an Incremental Construction Algorithm. J. Mol. Biol. 1996, 261, 470–489. [Google Scholar] [CrossRef]
- Kersten, C.; Fleischer, E.; Kehrein, J.; Borek, C.; Jaenicke, E.; Sotriffer, C.; Brenk, R. How to Design Selective Ligands for Highly Conserved Binding Sites: A Case Study Using N-Myristoyltransferases as a Model System. J. Med. Chem. 2020, 63, 2095–2113. [Google Scholar] [CrossRef]
- Kandasamy, S.; Duraisamy, S.; Chinnappan, S.; Balakrishnan, S.; Thangasamy, S.; Muthusamy, G.; Arumugam, S.; Palanisamy, S. Molecular Modeling and Docking of Protease from Bacillus Sp. for the Keratin Degradation. Biocatal. Agric. Biotechnol. 2018, 13, 95–104. [Google Scholar] [CrossRef]
- Srinivasan, S.; Sadasivam, S.K.; Gunalan, S.; Shanmugam, G.; Kothandan, G. Application of Docking and Active Site Analysis for Enzyme Linked Biodegradation of Textile Dyes. Environ. Pollut. 2019, 248, 599–608. [Google Scholar] [CrossRef]
- Morris, G.M.; Huey, R.; Lindstrom, W.; Sanner, M.F.; Belew, R.K.; Goodsell, D.S.; Olson, A.J. AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. J. Comput. Chem. 2009, 30, 2785–2791. [Google Scholar] [CrossRef]
- Trott, O.; Olson, A.J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010, 31, 455–461. [Google Scholar] [CrossRef]
- Dehghanian, F.; Kay, M.; Kahrizi, D. A Novel Recombinant AzrC Protein Proposed by Molecular Docking and in Silico Analyses to Improve Azo Dye’s Binding Affinity. Gene 2015, 569, 233–238. [Google Scholar] [CrossRef]
- Moharana, T.R.; Rao, N.M. Substrate Structure and Computation Guided Engineering of a Lipase for Omega-3 Fatty Acid Selectivity. PLoS ONE 2020, 15, e0231177. [Google Scholar] [CrossRef]
- Durmaz, E.; Kuyucak, S.; Sezerman, U.O. Modifying the Catalytic Preference of Tributyrin in Bacillus Thermocatenulatus Lipase through In-Silico Modeling of Enzyme-Substrate Complex. Protein Eng. Des. Sel. 2013, 26, 325–333. [Google Scholar] [CrossRef]
- Ding, X.; Tang, X.-L.; Zheng, R.-C.; Zheng, Y.-G. Identification and Engineering of the Key Residues at the Crevice-like Binding Site of Lipases Responsible for Activity and Substrate Specificity. Biotechnol. Lett. 2018, 41, 137–146. [Google Scholar] [CrossRef]
- Li, T.; Li, M.-S.; Mei, X.-J.; Sun, L.-C.; Liu, H.; Zhang, L.-J.; Cao, M.-J.; Liu, G.-M. Altering the Substrate Specificity of a Myofibril-Bound Serine Proteinase from Crucian Carp by Site-Directed Mutagenesis. J. Chem. Technol. Biotechnol. 2019, 94, 136–146. [Google Scholar] [CrossRef]
- Abagyan, R.; Totrov, M.; Kuznetsov, D. ICM—A New Method for Protein Modeling and Design: Applications to Docking and Structure Prediction from the Distorted Native Conformation. J. Comput. Chem. 1994, 15, 488–506. [Google Scholar] [CrossRef]
- Bavan, S.; Sherman, B.; Luetje, C.W.; Abaffy, T. Discovery of Novel Ligands for Mouse Olfactory Receptor MOR42-3 Using an in Silico Screening Approach and in Vitro Validation. PLoS ONE 2014, 9, e92064. [Google Scholar] [CrossRef]
- Chacón, M.G.; Kendrick, E.G.; Leak, D.J. Engineering Escherichia Coli for the Production of Butyl Octanoate from Endogenous Octanoyl-CoA. PeerJ 2019, 7, e6971. [Google Scholar] [CrossRef]
- Stewart, N.K.; Bhattacharya, M.; Toth, M.; Smith, C.A.; Vakulenko, S.B. A Surface Loop Modulates Activity of the Bacillus Class D β-Lactamases. J. Struct. Biol. 2020, 211, 107544. [Google Scholar] [CrossRef]
- McGann, M. FRED and HYBRID Docking Performance on Standardized Datasets. J. Comput. Aided Mol. Des. 2012, 26, 897–906. [Google Scholar] [CrossRef]
- Brus, B.; Košak, U.; Turk, S.; Pišlar, A.; Coquelle, N.; Kos, J.; Stojan, J.; Colletier, J.P.; Gobec, S. Discovery, Biological Evaluation, and Crystal Structure of a Novel Nanomolar Selective Butyrylcholinesterase Inhibitor. J. Med. Chem. 2014, 57, 8167–8179. [Google Scholar] [CrossRef]
- Fox, R. Directed Molecular Evolution by Machine Learning and the Influence of Nonlinear Interactions. J. Theor. Biol. 2005, 234, 187–199. [Google Scholar] [CrossRef] [PubMed]
- Fox, R.J.; Davis, S.C.; Mundorff, E.C.; Newman, L.M.; Gavrilovic, V.; Ma, S.K.; Chung, L.M.; Ching, C.; Tam, S.; Muley, S.; et al. Improving Catalytic Function by ProSAR-Driven Enzyme Evolution. Nat. Biotechnol. 2007, 25, 338–344. [Google Scholar] [CrossRef] [PubMed]
- Berland, M.; Offmann, B.; Andre, I.; Remaud-Simeon, M.; Charton, P. A Web-Based Tool for Rational Screening of Mutants Libraries Using ProSAR. Protein Eng. Des. Sel. 2014, 27, 375–381. [Google Scholar] [CrossRef]
- Verhaeghe, T.; De Winter, K.; Berland, M.; De Vreese, R.; D’Hooghe, M.; Offmann, B.; Desmet, T. Converting Bulk Sugars into Prebiotics: Semi-Rational Design of a Transglucosylase with Controlled Selectivity. Chem. Commun. 2016, 52, 3687–3689. [Google Scholar] [CrossRef]
- Yang, M.; Fehl, C.; Lees, K.V.; Lim, E.K.; Offen, W.A.; Davies, G.J.; Bowles, D.J.; Davidson, M.G.; Roberts, S.J.; Davis, B.G. Functional and Informatics Analysis Enables Glycosyltransferase Activity Prediction. Nat. Chem. Biol. 2018, 14, 1109–1117. [Google Scholar] [CrossRef] [PubMed]
- Barradas-Bautista, D.; Rosell, M.; Pallara, C.; Fernández-Recio, J. Structural Prediction of Protein–Protein Interactions by Docking: Application to Biomedical Problems. In Advances in Protein Chemistry and Structural Biology; Academic Press Inc.: Cambridge, MA, USA, 2018; Volume 110, pp. 203–249. ISBN 9780128143445. [Google Scholar]
- Dirks-Hofmeister, M.E.; Verhaeghe, T.; De Winter, K.; Desmet, T. Creating Space for Large Acceptors: Rational Biocatalyst Design for Resveratrol Glycosylation in an Aqueous System. Angew. Chem. Int. Ed. 2015, 54, 9289–9292. [Google Scholar] [CrossRef] [PubMed]
- Hermann, J.C.; Marti-Arbona, R.; Fedorov, A.A.; Fedorov, E.; Almo, S.C.; Shoichet, B.K.; Raushel, F.M. Structure-Based Activity Prediction for an Enzyme of Unknown Function. Nature 2007, 448, 775–779. [Google Scholar] [CrossRef] [PubMed]
- Liu, J.; Wang, R. Classification of Current Scoring Functions. J. Chem. Inf. Model. 2015, 55, 475–482. [Google Scholar] [CrossRef]
- Weiner, S.J.; Kollman, P.A.; Singh, U.C.; Case, D.A.; Ghio, C.; Alagona, G.; Profeta, S.; Weiner, P. A New Force Field for Molecular Mechanical Simulation of Nucleic Acids and Proteins. J. Am. Chem. Soc. 1984, 106, 765–784. [Google Scholar] [CrossRef]
- Clark, M.; Cramer, R.D.; Van Opdenbosch, N. Validation of the General Purpose Tripos 5.2 Force Field. J. Comput. Chem. 1989, 10, 982–1012. [Google Scholar] [CrossRef]
- Némethy, G.; Gibson, K.D.; Palmer, K.A.; Yoon, C.N.; Paterlini, G.; Zagari, A.; Rumsey, S.; Scheraga, H.A. Energy Parameters in Polypeptides. 10. Improved Geometrical Parameters and Nonbonded Interactions for Use in the ECEPP/3 Algorithm, with Application to Proline-Containing Peptides. J. Phys. Chem. 1992, 96, 6472–6484. [Google Scholar] [CrossRef]
- Eldridge, M.D.; Murray, C.W.; Auton, T.R.; Paolini, G.V.; Mee, R.P. Empirical Scoring Functions: I. The Development of a Fast Empirical Scoring Function to Estimate the Binding Affinity of Ligands in Receptor Complexes. J. Comput. Aided Mol. Des. 1997, 11, 425–445. [Google Scholar] [CrossRef]
- Jain, A.N. Surflex: Fully Automatic Flexible Molecular Docking Using a Molecular Similarity-Based Search Engine. J. Med. Chem. 2003, 46, 499–511. [Google Scholar] [CrossRef] [PubMed]
- Thomsen, R.; Christensen, M.H. MolDock: A New Technique for High-Accuracy Molecular Docking. J. Med. Chem. 2006, 49, 3315–3321. [Google Scholar] [CrossRef]
- Wang, R.; Lai, L.; Wang, S. Further Development and Validation of Empirical Scoring Functions for Structure-Based Binding Affinity Prediction. J. Comput. Aided Mol. Des. 2002, 16, 11–26. [Google Scholar] [CrossRef] [PubMed]
- Huey, R.; Morris, G.M.; Olson, A.J.; Goodsell, D.S. A Semiempirical Free Energy Force Field with Charge-Based Desolvation. J. Comput. Chem. 2007, 28, 1145–1152. [Google Scholar] [CrossRef] [PubMed]
- Muegge, I.; Martin, Y.C.; Hajduk, P.J.; Fesik, S.W. Evaluation of PMF Scoring in Docking Weak Ligands to the FK506 Binding Protein. J. Med. Chem. 1999, 42, 2498–2503. [Google Scholar] [CrossRef]
- Muegge, I. PMF Scoring Revisited. J. Med. Chem. 2006, 49, 5895–5902. [Google Scholar] [CrossRef]
- McGann, M.R.; Almond, H.R.; Nicholls, A.; Grant, J.A.; Brown, F.K. Gaussian Docking Functions. Biopolymers 2003, 68, 76–90. [Google Scholar] [CrossRef]
- Krüger, D.M.; Gohlke, H. DrugScorePPI Webserver: Fast and Accurate in Silico Alanine Scanning for Scoring Protein-Protein Interactions. Nucleic Acids Res. 2010, 38, W480–W486. [Google Scholar] [CrossRef] [PubMed]
- Li, H.; Leung, K.-S.; Wong, M.-H.; Ballester, P.J. Improving AutoDock Vina Using Random Forest: The Growing Accuracy of Binding Affinity Prediction by the Effective Exploitation of Larger Data Sets. Mol. Inform. 2015, 34, 115–126. [Google Scholar] [CrossRef] [PubMed]
- Wójcikowski, M.; Ballester, P.J.; Siedlecki, P. Performance of Machine-Learning Scoring Functions in Structure-Based Virtual Screening. Sci. Rep. 2017, 7, 46710. [Google Scholar] [CrossRef] [PubMed]
- Wang, C.; Zhang, Y. Improving Scoring-Docking-Screening Powers of Protein–Ligand Scoring Functions Using Random Forest. J. Comput. Chem. 2017, 38, 169–177. [Google Scholar] [CrossRef] [PubMed]
- Xu, D.; Meroueh, S.O. Effect of Binding Pose and Modeled Structures on SVMGen and GlideScore Enrichment of Chemical Libraries. J. Chem. Inf. Model. 2016, 56, 1139–1151. [Google Scholar] [CrossRef]
- Li, G.B.; Yang, L.L.; Wang, W.J.; Li, L.L.; Yang, S.Y. ID-Score: A New Empirical Scoring Function Based on a Comprehensive Set of Descriptors Related to Protein-Ligand Interactions. J. Chem. Inf. Model. 2013, 53, 592–600. [Google Scholar] [CrossRef] [PubMed]
- Yan, Y.; Wang, W.; Sun, Z.; Zhang, J.Z.H.; Ji, C. Protein-Ligand Empirical Interaction Components for Virtual Screening. J. Chem. Inf. Model. 2017, 57, 1793–1806. [Google Scholar] [CrossRef]
- Arciniega, M.; Lange, O.F. Improvement of Virtual Screening Results by Docking Data Feature Analysis. J. Chem. Inf. Model. 2014, 54, 1401–1411. [Google Scholar] [CrossRef] [PubMed]
- Ashtawy, H.M.; Mahapatra, N.R. BgN-Score and BsN-Score: Bagging and Boosting Based Ensemble Neural Networks Scoring Functions for Accurate Binding Affinity Prediction of Protein-Ligand Complexes. BMC Bioinform. 2015, 16 (Suppl. 4), S8. [Google Scholar] [CrossRef]
- Durrant, J.D.; McCammon, J.A. NNScore 2.0: A Neural-Network Receptor-Ligand Scoring Function. J. Chem. Inf. Model. 2011, 51, 2897–2903. [Google Scholar] [CrossRef]
- Meng, E.C.; Shoichet, B.K.; Kuntz, I.D. Automated Docking with Grid-based Energy Evaluation. J. Comput. Chem. 1992, 13, 505–524. [Google Scholar] [CrossRef]
- Kollman, P.A.; Massova, I.; Reyes, C.; Kuhn, B.; Huo, S.; Chong, L.; Lee, M.; Lee, T.; Duan, Y.; Wang, W.; et al. Calculating Structures and Free Energies of Complex Molecules: Combining Molecular Mechanics and Continuum Models. Acc. Chem. Res. 2000, 33, 889–897. [Google Scholar] [CrossRef] [PubMed]
- Totrov, M.; Abagyan, R. Flexible Protein-Ligand Docking by Global Energy Optimization in Internal Coordinates. Proteins 1997, 29 (Suppl. 1), 215–220. [Google Scholar] [CrossRef]
- Sorna, V.; Theisen, E.R.; Stephens, B.; Warner, S.L.; Bearss, D.J.; Vankayalapati, H.; Sharma, S. High-Throughput Virtual Screening Identifies Novel N′-(1-Phenylethylidene)-Benzohydrazides as Potent, Specific, and Reversible LSD1 Inhibitors. J. Med. Chem. 2013, 56, 9496–9508. [Google Scholar] [CrossRef]
- MolSoft ICM Used in World’s Largest Ever Virtual Screen Resulting in 3 New Lead Compounds. Available online: http://molsoft.com/~jack/www/icm-cloud.html (accessed on 1 December 2020).
- Muller, P.; Lena, G.; Boilard, E.; Bezzine, S.; Lambeau, G.; Guichard, G.; Rognan, D. In Silico-Guided Target Identification of a Scaffold-Focused Library: 1,3,5-Triazepan-2,6-Diones as Novel Phospholipase A2 Inhibitors. J. Med. Chem. 2006, 49, 6768–6778. [Google Scholar] [CrossRef]
- Martin, S.J.; Chen, I.J.; Chan, A.W.E.; Foloppe, N. Modelling the Binding Mode of Macrocycles: Docking and Conformational Sampling. Bioorg. Med. Chem. 2020, 28, 115143. [Google Scholar] [CrossRef]
- Böhm, H.J. The Development of a Simple Empirical Scoring Function to Estimate the Binding Constant for a Protein-Ligand Complex of Known Three-Dimensional Structure. J. Comput. Aided Mol. Des. 1994, 8, 243–256. [Google Scholar] [CrossRef] [PubMed]
- Warren, G.L.; Andrews, C.W.; Capelli, A.M.; Clarke, B.; LaLonde, J.; Lambert, M.H.; Lindvall, M.; Nevins, N.; Semus, S.F.; Senger, S.; et al. A Critical Assessment of Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49, 5912–5931. [Google Scholar] [CrossRef] [PubMed]
- Gastreich, M.; Lilienthal, M.; Briem, H.; Claussen, H. Ultrafast de Novo Docking Combining Pharmacophores and Combinatorics. J. Comput. Aided Mol. Des. 2006, 20, 717–734. [Google Scholar] [CrossRef]
- Nissink, J.W.M.; Murray, C.; Hartshorn, M.; Verdonk, M.L.; Cole, J.C.; Taylor, R. A New Test Set for Validating Predictions of Protein-Ligand Interaction. Proteins Struct. Funct. Genet. 2002, 49, 457–471. [Google Scholar] [CrossRef]
- Morris, G.M.; Goodsell, D.S.; Halliday, R.S.; Huey, R.; Hart, W.E.; Belew, R.K.; Olson, A.J. Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. J. Comput. Chem. 1998, 19, 1639–1662. [Google Scholar] [CrossRef]
- Nguyen, N.T.; Nguyen, T.H.; Pham, T.N.H.; Huy, N.T.; Van Bay, M.; Pham, M.Q.; Nam, P.C.; Vu, V.V.; Ngo, S.T. Autodock Vina Adopts More Accurate Binding Poses but Autodock4 Forms Better Binding Affinity. J. Chem. Inf. Model. 2020, 60, 204–211. [Google Scholar] [CrossRef]
- Koppensteiner, W.A.; Sippl, M.J. Knowledge-Based Potentials—Back to the Roots. Biochemistry 1998, 63, 247–252. [Google Scholar]
- Kelley, B.P.; Brown, S.P.; Warren, G.L.; Muchmore, S.W. POSIT: Flexible Shape-Guided Docking for Pose Prediction. J. Chem. Inf. Model. 2015, 55, 1771–1780. [Google Scholar] [CrossRef] [PubMed]
- Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem. 2004, 47, 2977–2980. [Google Scholar] [CrossRef] [PubMed]
- Gilson, M.K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A Public Database for Medicinal Chemistry, Computational Chemistry and Systems Pharmacology. Nucleic Acids Res. 2016, 44, D1045–D1053. [Google Scholar] [CrossRef] [PubMed]
- Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2019 Update: Improved Access to Chemical Data. Nucleic Acids Res. 2019, 47, D1102–D1109. [Google Scholar] [CrossRef] [PubMed]
- Gaulton, A.; Hersey, A.; Nowotka, M.L.; Bento, A.P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L.J.; Cibrian-Uhalte, E.; et al. The ChEMBL Database in 2017. Nucleic Acids Res. 2017, 45, D945–D954. [Google Scholar] [CrossRef]
- Huang, N.; Shoichet, B.K.; Irwin, J.J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801. [Google Scholar] [CrossRef]
- Mysinger, M.M.; Carchia, M.; Irwin, J.J.; Shoichet, B.K. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 2012, 55, 6582–6594. [Google Scholar] [CrossRef] [PubMed]
- Rohrer, S.G.; Baumann, K. Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data. J. Chem. Inf. Model. 2009, 49, 169–184. [Google Scholar] [CrossRef]
- Mazurenko, S.; Prokop, Z.; Damborsky, J. Machine Learning in Enzyme Engineering. ACS Catal. 2020, 10, 1210–1223. [Google Scholar] [CrossRef]
- Wang, J.; Cao, H.; Zhang, J.Z.H.; Qi, Y. Computational Protein Design with Deep Learning Neural Networks. Sci. Rep. 2018, 8, 6349. [Google Scholar] [CrossRef]
- Cutler, A.; Cutler, D.R.; Stevens, J.R. Random Forests. In Ensemble Machine Learning; Springer: Boston, MA, USA, 2012; pp. 157–175. [Google Scholar]
- Gabel, J.; Desaphy, J.; Rognan, D. Beware of Machine Learning-Based Scoring Functions-on the Danger of Developing Black Boxes. J. Chem. Inf. Model. 2014, 54, 2807–2815. [Google Scholar] [CrossRef]
- Su, M.; Yang, Q.; Du, Y.; Feng, G.; Liu, Z.; Li, Y.; Wang, R. Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model. 2019, 59, 895–913. [Google Scholar] [CrossRef]
- Pan, W.; Mao, L.; Shi, M.; Fu, Y.; Jiang, X.; Feng, W.; He, Y.; Xu, D.; Yuan, L. The Cytochrome c-Cyclo[6]aramide Complex as a Supramolecular Catalyst in Methanol. New J. Chem. 2018, 42, 3857–3866. [Google Scholar] [CrossRef]
- Shen, C.; Ding, J.; Wang, Z.; Cao, D.; Ding, X.; Hou, T. From Machine Learning to Deep Learning: Advances in Scoring Functions for Protein–Ligand Docking. WIREs Comput. Mol. Sci. 2020, 10, e1429. [Google Scholar] [CrossRef]
- Romero, P.A.; Krause, A.; Arnold, F.H. Navigating the Protein Fitness Landscape with Gaussian Processes. Proc. Natl. Acad. Sci. USA 2013, 110, E193–E201. [Google Scholar] [CrossRef]
- Yang, L.L.; Yang, X.; Li, G.B.; Fan, K.G.; Yin, P.F.; Chen, X.G. An Integrated Molecular Docking and Rescoring Method for Predicting the Sensitivity Spectrum of Various Serine Hydrolases to Organophosphorus Pesticides. J. Sci. Food Agric. 2016, 96, 2184–2192. [Google Scholar] [CrossRef]
- Duan, B.; Sun, Y. Integration of Machine Learning Improves the Prediction Accuracy of Molecular Modelling for M. jannaschii Tyrosyl-tRNA Synthetase Substrate Specificity. bioRxiv 2020. [Google Scholar] [CrossRef]
- Monza, E.; Lucas, M.F.; Camarero, S.; Alejaldre, L.C.; Martínez, A.T.; Guallar, V. Insights into Laccase Engineering from Molecular Simulations: Toward a Binding-Focused Strategy. J. Phys. Chem. Lett. 2015, 6, 1447–1453. [Google Scholar] [CrossRef]
- Santiago, G.; De Salas, F.; Lucas, M.F.; Monza, E.; Acebes, S.; Martinez, Á.T.; Camarero, S.; Guallar, V. Computer-Aided Laccase Engineering: Toward Biological Oxidation of Arylamines. ACS Catal. 2016, 6, 5415–5423. [Google Scholar] [CrossRef]
- Wilding, M.; Scott, C.; Warden, A.C. Computer-Guided Surface Engineering for Enzyme Improvement. Sci. Rep. 2018, 8, 11998. [Google Scholar] [CrossRef]
- Chan, H.C.S.; Pan, L.; Li, Y.; Yuan, S. Rationalization of Stereoselectivity in Enzyme Reactions. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2019, 9, e1403. [Google Scholar] [CrossRef]
- Foscato, M.; Jensen, V.R. Automated in Silico Design of Homogeneous Catalysts. ACS Catal. 2020, 10, 2354–2377. [Google Scholar] [CrossRef]
- Cerqueira, N.M.F.S.A.; Fernandes, P.A.; Ramos, M.J. Protocol for Computational Enzymatic Reactivity Based on Geometry Optimisation. ChemPhysChem 2018, 19, 669–689. [Google Scholar] [CrossRef]
- Lee, S.J.R.; Welborn, M.; Manby, F.R.; Miller, T.F. Projection-Based Wavefunction-in-DFT Embedding. Acc. Chem. Res. 2019, 52, 1359–1368. [Google Scholar] [CrossRef]
- Mateljak, I.; Monza, E.; Lucas, M.F.; Guallar, V.; Aleksejeva, O.; Ludwig, R.; Leech, D.; Shleev, S.; Alcalde, M. Increasing Redox Potential, Redox Mediator Activity, and Stability in a Fungal Laccase by Computer-Guided Mutagenesis and Directed Evolution. ACS Catal. 2019, 9, 4561–4572. [Google Scholar] [CrossRef]
- Chung, L.W.; Sameera, W.M.C.; Ramozzi, R.; Page, A.J.; Hatanaka, M.; Petrova, G.P.; Harris, T.V.; Li, X.; Ke, Z.; Liu, F.; et al. The ONIOM Method and Its Applications. Chem. Rev. 2015, 115, 5678–5796. [Google Scholar] [CrossRef]
- Wijma, H.J.; Marrink, S.J.; Janssen, D.B. Computationally Efficient and Accurate Enantioselectivity Modeling by Clusters of Molecular Dynamics Simulations. J. Chem. Inf. Model. 2014, 54, 2079–2092. [Google Scholar] [CrossRef]
- Wijma, H.J.; Floor, R.J.; Bjelic, S.; Marrink, S.J.; Baker, D.; Janssen, D.B. Enantioselective Enzymes by Computational Design and In Silico Screening. Angew. Chem. Int. Ed. 2015, 54, 3726–3730. [Google Scholar] [CrossRef] [PubMed]
- Arabnejad, H.; Bombino, E.; Colpa, D.I.; Jekel, P.A.; Trajkovic, M.; Wijma, H.J.; Janssen, D.B. Computational Design of Enantiocomplementary Epoxide Hydrolases for Asymmetric Synthesis of Aliphatic and Aromatic Diols. ChemBioChem 2020, 21, 1893–1904. [Google Scholar] [CrossRef]
- Borrelli, K.W.; Vitalis, A.; Alcantara, R.; Guallar, V. PELE: Protein Energy Landscape Exploration. A Novel Monte Carlo Based Technique. J. Chem. Theory Comput. 2005, 1, 1304–1311. [Google Scholar] [CrossRef]
- Madadkar-Sobhani, A.; Guallar, V. PELE Web Server: Atomistic Study of Biomolecular Systems at Your Fingertips. Nucleic Acids Res. 2013, 41, W322–W328. [Google Scholar] [CrossRef] [PubMed]
- Khersonsky, O.; Lipsh, R.; Avizemer, Z.; Ashani, Y.; Goldsmith, M.; Leader, H.; Dym, O.; Rogotner, S.; Trudeau, D.L.; Prilusky, J.; et al. Automated Design of Efficient and Functionally Diverse Enzyme Repertoires. Mol. Cell 2018, 72, 178. [Google Scholar] [CrossRef] [PubMed]
- Haatveit, K.C.; Garcia-Borràs, M.; Houk, K.N. Computational Protocol to Understand P450 Mechanisms and Design of Efficient and Selective Biocatalysts. Front. Chem. 2019, 7, 663. [Google Scholar] [CrossRef]
- Liao, K.; Yang, Y.F.; Li, Y.; Sanders, J.N.; Houk, K.N.; Musaev, D.G.; Davies, H.M.L. Design of Catalysts for Site-Selective and Enantioselective Functionalization of Non-Activated Primary C–H Bonds. Nat. Chem. 2018, 10, 1048–1055. [Google Scholar] [CrossRef] [PubMed]
- Kheirabadi, M.; Çelebi-Ölçüm, N.; Parker, M.F.L.; Zhao, Q.; Kiss, G.; Houk, K.N.; Schafmeister, C.E. Spiroligozymes for Transesterifications: Design and Relationship of Structure to Activity. J. Am. Chem. Soc. 2012, 134, 18345–18353. [Google Scholar] [CrossRef] [PubMed]
- Grisewood, M.J.; Gifford, N.P.; Pantazes, R.J.; Li, Y.; Cirino, P.C.; Janik, M.J.; Maranas, C.D. OptZyme: Computational Enzyme Redesign Using Transition State Analogues. PLoS ONE 2013, 8, e75358. [Google Scholar] [CrossRef]
- Amrein, B.A.; Steffen-Munsberg, F.; Szeler, I.; Purg, M.; Kulkarni, Y.; Kamerlin, S.C.L. CADEE: Computer-Aided Directed Evolution of Enzymes. IUCrJ 2017, 4, 50–64. [Google Scholar] [CrossRef] [PubMed]
- Saraf, M.C.; Moore, G.L.; Goodey, N.M.; Cao, V.Y.; Benkovic, S.J.; Maranas, C.D. IPRO: An Iterative Computational Protein Library Redesign and Optimization Procedure. Biophys. J. 2006, 90, 4167–4180. [Google Scholar] [CrossRef] [PubMed]
- Pantazes, R.J.; Grisewood, M.J.; Li, T.; Gifford, N.P.; Maranas, C.D. The Iterative Protein Redesign and Optimization (IPRO) Suite of Programs. Available online: http://www.maranasgroup.com/submission/ipro2014.htm (accessed on 29 October 2020).
- Grisewood, M.J.; Hernández-Lozada, N.J.; Thoden, J.B.; Gifford, N.P.; Mendez-Perez, D.; Schoenberger, H.A.; Allan, M.F.; Floy, M.E.; Lai, R.Y.; Holden, H.M.; et al. Computational Redesign of Acyl-ACP Thioesterase with Improved Selectivity toward Medium-Chain-Length Fatty Acids. ACS Catal. 2017, 7, 3837–3849. [Google Scholar] [CrossRef]
- Richter, F.; Leaver-Fay, A.; Khare, S.D.; Bjelic, S.; Baker, D. De Novo Enzyme Design Using Rosetta3. PLoS ONE 2011, 6, e19230. [Google Scholar] [CrossRef]
- Fazelinia, H.; Cirino, P.C.; Maranas, C.D. Extending Iterative Protein Redesign and Optimization (IPRO) in Protein Library Design for Ligand Specificity. Biophys. J. 2007, 92, 2120–2130. [Google Scholar] [CrossRef] [PubMed]
- Yang, Y.; Zhou, Y. Specific Interactions for Ab Initio Folding of Protein Terminal Regions with Secondary Structures. Proteins Struct. Funct. Genet. 2008, 72, 793–803. [Google Scholar] [CrossRef]
- Bommarius, A.S.; Paye, M.F. Stabilizing Biocatalysts. Chem. Soc. Rev. 2013, 42, 6534–6565. [Google Scholar] [CrossRef]
- Redelings, B.D.; Suchard, M.A. Joint Bayesian Estimation of Alignment and Phylogeny. Syst. Biol. 2005, 54, 401–418. [Google Scholar] [CrossRef] [PubMed]
- Nguyen, V.; Wilson, C.; Hoemberger, M.; Stiller, J.B.; Agafonov, R.V.; Kutter, S.; English, J.; Theobald, D.L.; Kern, D. Evolutionary Drivers of Thermoadaptation in Enzyme Catalysis. Science 2017, 355, 289–294. [Google Scholar] [CrossRef]
- Guindon, S.; Dufayard, J.-F.; Lefort, V.; Anisimova, M.; Hordijk, W.; Gascuel, O. New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Syst. Biol. 2010, 59, 307–321. [Google Scholar] [CrossRef]
- Babkova, P.; Sebestova, E.; Brezovsky, J.; Chaloupkova, R.; Damborsky, J. Ancestral Haloalkane Dehalogenases Show Robustness and Unique Substrate Specificity. ChemBioChem 2017, 18, 1448–1456. [Google Scholar] [CrossRef]
- Dehouck, Y.; Kwasigroch, J.M.; Gilis, D.; Rooman, M. PoPMuSiC 2.1: A Web Server for the Estimation of Protein Stability Changes upon Mutation and Sequence Optimality. BMC Bioinform. 2011, 12, 151. [Google Scholar] [CrossRef]
- Deng, Z.; Yang, H.; Li, J.; Shin, H.D.; Du, G.; Liu, L.; Chen, J. Structure-Based Engineering of Alkaline α-Amylase from Alkaliphilic Alkalimonas Amylolytica for Improved Thermostability. Appl. Microbiol. Biotechnol. 2014, 98, 3997–4007. [Google Scholar] [CrossRef] [PubMed]
- Song, L.; Tsang, A.; Sylvestre, M. Engineering a Thermostable Fungal GH10 Xylanase, Importance of N-Terminal Amino Acids. Biotechnol. Bioeng. 2015, 112, 1081–1091. [Google Scholar] [CrossRef] [PubMed]
- Yin, S.; Ding, F.; Dokholyan, N.V. Eris: An Automated Estimator of Protein Stability. Nat. Methods 2007, 4, 466–467. [Google Scholar] [CrossRef]
- Mirzaei, M.; Latifi, A.M.; Jafari, R. Improvement of Thermal Stability of DFPase by In Silico Methods. J. Appl. Biotechnol. Rep. 2014, 1, 155–159. [Google Scholar]
- Schymkowitz, J.; Borg, J.; Stricher, F.; Nys, R.; Rousseau, F.; Serrano, L. The FoldX Web Server: An Online Force Field. Nucleic Acids Res. 2005, 33, W382. [Google Scholar] [CrossRef]
- Bi, J.; Chen, S.; Zhao, X.; Nie, Y.; Xu, Y. Computation-Aided Engineering of Starch-Debranching Pullulanase from Bacillus Thermoleovorans for Enhanced Thermostability. Appl. Microbiol. Biotechnol. 2020, 104, 7551–7562. [Google Scholar] [CrossRef] [PubMed]
- Mu, Q.; Cui, Y.; Tian, Y.; Hu, M.; Tao, Y.; Wu, B. Thermostability Improvement of the Glucose Oxidase from Aspergillus Niger for Efficient Gluconic Acid Production via Computational Design. Int. J. Biol. Macromol. 2019, 136, 1060–1068. [Google Scholar] [CrossRef]
- Zhou, H.; Zhou, Y. Distance-Scaled, Finite Ideal-Gas Reference State Improves Structure-Derived Potentials of Mean Force for Structure Selection and Stability Prediction. Protein Sci. 2002, 11, 2714–2726. [Google Scholar] [CrossRef]
- Zhou, H.; Zhang, C.; Liu, S.; Zhou, Y. Web-Based Toolkits for Topology Prediction of Transmembrane Helical Proteins, Fold Recognition, Structure and Binding Scoring, Folding-Kinetics Analysis and Comparative Analysis of Domain Combinations. Nucleic Acids Res. 2005, 33, W193–W197.
- Capriotti, E.; Fariselli, P.; Rossi, I.; Casadio, R. A Three-State Prediction of Single Point Mutations on Protein Stability Changes. BMC Bioinform. 2008, 9 (Suppl. 2), S6.
- Laimer, J.; Hofer, H.; Fritz, M.; Wegenkittl, S.; Lackner, P. MAESTRO - Multi Agent Stability Prediction upon Point Mutations. BMC Bioinform. 2015, 16, 116.
- Fakhravar, A.; Hesampour, A. Rational Design-Based Engineering of a Thermostable Phytase by Site-Directed Mutagenesis. Mol. Biol. Rep. 2018, 45, 2053–2061.
- Winter, P.; Stubenvoll, S.; Scheiblhofer, S.; Joubert, I.A.; Strasser, L.; Briganser, C.; Soh, W.T.; Hofer, F.; Kamenik, A.S.; Dietrich, V.; et al. In Silico Design of Phl p 6 Variants with Altered Fold-Stability Significantly Impacts Antigen Processing, Immunogenicity and Immune Polarization. Front. Immunol. 2020, 11, 1824.
- Sumbalova, L.; Stourac, J.; Martinek, T.; Bednar, D.; Damborsky, J. HotSpot Wizard 3.0: Web Server for Automated Design of Mutations and Smart Libraries Based on Sequence Input Information. Nucleic Acids Res. 2018, 46, W356–W362.
- Klermund, L.; Riederer, A.; Hunger, A.; Castiglione, K. Protein Engineering of a Bacterial N-Acyl-D-Glucosamine 2-Epimerase for Improved Stability under Process Conditions. Enzym. Microb. Technol. 2016, 87–88, 70–78.
- Wang, X.; Ma, R.; Xie, X.; Liu, W.; Tu, T.; Zheng, F.; You, S.; Ge, J.; Xie, H.; Yao, B.; et al. Thermostability Improvement of a Talaromyces leycettanus Xylanase by Rational Protein Engineering. Sci. Rep. 2017, 7, 15287.
- Wijma, H.J.; Floor, R.J.; Jekel, P.A.; Baker, D.; Marrink, S.J.; Janssen, D.B. Computationally Designed Libraries for Rapid Enzyme Stabilization. Protein Eng. Des. Sel. 2014, 27, 49–58.
- Bednar, D.; Beerens, K.; Sebestova, E.; Bendl, J.; Khare, S.; Chaloupkova, R.; Prokop, Z.; Brezovsky, J.; Baker, D.; Damborsky, J. FireProt: Energy- and Evolution-Based Computational Design of Thermostable Multiple-Point Mutants. PLoS Comput. Biol. 2015, 11, e1004556.
- Goldenzweig, A.; Goldsmith, M.; Hill, S.E.; Gertman, O.; Laurino, P.; Ashani, Y.; Dym, O.; Unger, T.; Albeck, S.; Prilusky, J.; et al. Automated Structure- and Sequence-Based Design of Proteins for High Bacterial Expression and Stability. Mol. Cell 2016, 63, 337.
- Hettiaratchi, M.H.; O’Meara, M.J.; O’Meara, T.R.; Pickering, A.J.; Letko-Khait, N.; Shoichet, M.S. Reengineering Biocatalysts: Computational Redesign of Chondroitinase ABC Improves Efficacy and Stability. Sci. Adv. 2020, 6, eabc6378.
- Porebski, B.T.; Buckle, A.M. Consensus Protein Design. Protein Eng. Des. Sel. 2016, 29, 245–251.
- Liu, L.; Yu, H.; Du, K.; Wang, Z.; Gan, Y.; Huang, H. Enhanced Trypsin Thermostability in Pichia pastoris through Truncating the Flexible Region. Microb. Cell Fact. 2018, 17, 165.
- Farnoosh, G.; Khajeh, K.; Mohammadi, M.; Hassanpour, K.; Latifi, A.M.; Aghamollaei, H. Catalytic and Structural Effects of Flexible Loop Deletion in Organophosphorus Hydrolase Enzyme: A Thermostability Improvement Mechanism. J. Biosci. 2020, 45, 54.
- Tawfik, D.S. Loop Grafting and the Origins of Enzyme Species. Science 2006, 311, 475–476.
- Tang, H.; Shi, K.; Shi, C.; Aihara, H.; Zhang, J.; Du, G. Enhancing Subtilisin Thermostability through a Modified Normalized B-Factor Analysis and Loop-Grafting Strategy. J. Biol. Chem. 2019, 294, 18398–18407.
- Liu, C.; Zhao, J.; Liu, J.; Guo, X.; Rao, D.; Liu, H.; Zheng, P.; Sun, J.; Ma, Y. Simultaneously Improving the Activity and Thermostability of a New Proline 4-Hydroxylase by Loop Grafting and Site-Directed Mutagenesis. Appl. Microbiol. Biotechnol. 2019, 103, 265–277.
- Mortazavi, M.; Hosseinkhani, S. Surface Charge Modification Increases Firefly Luciferase Rigidity without Alteration in Bioluminescence Spectra. Enzym. Microb. Technol. 2017, 96, 47–59.
- Schweiker, K.L.; Zarrine-Afsar, A.; Davidson, A.R.; Makhatadze, G.I. Computational Design of the Fyn SH3 Domain with Increased Stability through Optimization of Surface Charge-Charge Interactions. Protein Sci. 2007, 16, 2694–2702.
- Zhang, H.; Sang, J.; Zhang, Y.; Sun, T.; Liu, H.; Yue, R.; Zhang, J.; Wang, H.; Dai, Y.; Lu, F.; et al. Rational Design of a Yarrowia lipolytica Derived Lipase for Improved Thermostability. Int. J. Biol. Macromol. 2019, 137, 1190–1198.
- Fang, H.; Lü, C.; Hua, Y.; Hu, S.; Zhao, W.; Fang, W.; Song, K.; Huang, J.; Mei, L. Increasing the Thermostability of Glutamate Decarboxylase from Lactobacillus brevis by Introducing Proline. Sheng Wu Gong Cheng Xue Bao 2019, 35, 636–646.
- Niu, C.; Zhu, L.; Xu, X.; Li, Q. Rational Design of Disulfide Bonds Increases Thermostability of a Mesophilic 1,3-1,4-β-Glucanase from Bacillus terquilensis. PLoS ONE 2016, 11, e0154036.
- Nakamura, H.; Oda-Ueda, N.; Ueda, T.; Ohkuri, T. A Novel Engineered Interchain Disulfide Bond in the Constant Region Enhances the Thermostability of Adalimumab Fab. Biochem. Biophys. Res. Commun. 2018, 495, 7–11.
- Bashirova, A.; Pramanik, S.; Volkov, P.; Rozhkova, A.; Nemashkalov, V.; Zorov, I.; Gusakov, A.; Sinitsyn, A.; Schwaneberg, U.; Davari, M.D. Disulfide Bond Engineering of an Endoglucanase from Penicillium verruculosum to Improve Its Thermostability. Int. J. Mol. Sci. 2019, 20, 1602.
- Ece, S.; Evran, S.; Janda, J.-O.; Merkl, R.; Sterner, R. Improving Thermal and Detergent Stability of Bacillus stearothermophilus Neopullulanase by Rational Enzyme Design. Protein Eng. Des. Sel. 2015, 28, 147–151.
- Kumar, M.D.S.; Bava, K.A.; Gromiha, M.M.; Prabakaran, P.; Kitajima, K.; Uedaira, H.; Sarai, A. ProTherm and ProNIT: Thermodynamic Databases for Proteins and Protein-Nucleic Acid Interactions. Nucleic Acids Res. 2006, 34, D204–D206.
- Capriotti, E.; Fariselli, P.; Casadio, R. I-Mutant2.0: Predicting Stability Changes upon Mutation from the Protein Sequence or Structure. Nucleic Acids Res. 2005, 33, W306–W310.
- Yang, J.; Li, F.Z.; Arnold, F.H. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Cent. Sci. 2024, 10, 226–241.
- Goldenzweig, A.; Fleishman, S.J. Principles of Protein Stability and Their Application in Computational Design. Annu. Rev. Biochem. 2018, 87, 105–129.
- Trainor, K.; Broom, A.; Meiering, E.M. Exploring the Relationships between Protein Sequence, Structure and Solubility. Curr. Opin. Struct. Biol. 2017, 42, 136–146.
- Kramer, R.M.; Shende, V.R.; Motl, N.; Pace, C.N.; Scholtz, J.M. Toward a Molecular Understanding of Protein Solubility: Increased Negative Surface Charge Correlates with Increased Solubility. Biophys. J. 2012, 102, 1907–1915.
- Buck, P.M.; Kumar, S.; Singh, S.K. On the Role of Aggregation Prone Regions in Protein Evolution, Stability, and Enzymatic Catalysis: Insights from Diverse Analyses. PLoS Comput. Biol. 2013, 9, e1003291.
- Navarro, S.; Ventura, S. Computational Re-Design of Protein Structures to Improve Solubility. Expert Opin. Drug Discov. 2019, 14, 1077–1088.
- Meric, G.; Robinson, A.S.; Roberts, C.J. Driving Forces for Nonnative Protein Aggregation and Approaches to Predict Aggregation-Prone Regions. Annu. Rev. Chem. Biomol. Eng. 2017, 8, 139–159.
- Fernandez-Escamilla, A.M.; Rousseau, F.; Schymkowitz, J.; Serrano, L. Prediction of Sequence-Dependent and Mutational Effects on the Aggregation of Peptides and Proteins. Nat. Biotechnol. 2004, 22, 1302–1306.
- Ganesan, A.; Siekierska, A.; Beerten, J.; Brams, M.; Van Durme, J.; De Baets, G.; Van Der Kant, R.; Gallardo, R.; Ramakers, M.; Langenberg, T.; et al. Structural Hot Spots for the Solubility of Globular Proteins. Nat. Commun. 2016, 7, 10816.
- Dudgeon, K.; Rouet, R.; Kokmeijer, I.; Schofield, P.; Stolp, J.; Langley, D.; Stock, D.; Christ, D. General Strategy for the Generation of Human Antibody Variable Domains with Increased Aggregation Resistance. Proc. Natl. Acad. Sci. USA 2012, 109, 10879–10884.
- van der Kant, R.; Karow-Zwick, A.R.; Van Durme, J.; Blech, M.; Gallardo, R.; Seeliger, D.; Aßfalg, K.; Baatsen, P.; Compernolle, G.; Gils, A.; et al. Prediction and Reduction of the Aggregation of Monoclonal Antibodies. J. Mol. Biol. 2017, 429, 1244–1261.
- Kuroda, D.; Tsumoto, K. Engineering Stability, Viscosity, and Immunogenicity of Antibodies by Computational Design. J. Pharm. Sci. 2020, 109, 1631–1651.
- Tan, P.H.; Chu, V.; Stray, J.E.; Hamlin, D.K.; Pettit, D.; Wilbur, D.S.; Vessella, R.L.; Stayton, P.S. Engineering the Isoelectric Point of a Renal Cell Carcinoma Targeting Antibody Greatly Enhances scFv Solubility. Immunotechnology 1998, 4, 107–114.
- van der Kant, R.; van Durme, J.; Rousseau, F.; Schymkowitz, J. Solubis: Optimizing Protein Solubility by Minimal Point Mutations. In Methods in Molecular Biology; Humana Press Inc.: Totowa, NJ, USA, 2019; Volume 1873, pp. 317–333.
- Santos, J.; Pujols, J.; Pallarès, I.; Iglesias, V.; Ventura, S. Computational Prediction of Protein Aggregation: Advances in Proteomics, Conformation-Specific Algorithms and Biotechnological Applications. Comput. Struct. Biotechnol. J. 2020, 18, 1403–1413.
- Berman, H.M.; Gabanyi, M.J.; Kouranov, A.; Micallef, D.I.; Westbrook, J.; Protein Structure Initiative Network of Investigators. Protein Structure Initiative - TargetTrack 2000–2017 - All Data Files. 2017. Available online: https://zenodo.org/records/821654 (accessed on 30 December 2024).
- Thangakani, A.M.; Kumar, S.; Nagarajan, R.; Velmurugan, D.; Gromiha, M.M. GAP: Towards Almost 100 Percent Prediction for β-Strand-Mediated Aggregating Peptides with Distinct Morphologies. Bioinformatics 2014, 30, 1983–1990.
- Tompa, D.R.; Kadhirvel, S. Changes in Hydrophobicity Mainly Promotes the Aggregation Tendency of ALS Associated SOD1 Mutants. Int. J. Biol. Macromol. 2020, 145, 904–913.
- Lu, X.; Brickson, C.R.; Murphy, R.M. TANGO-Inspired Design of Anti-Amyloid Cyclic Peptides. ACS Chem. Neurosci. 2016, 7, 1264–1274.
- Andersen, T.C.B.; Lindsjø, K.; Hem, C.D.; Koll, L.; Kristiansen, P.E.; Skjeldal, L.; Andreotti, A.H.; Spurkland, A. Solubility of Recombinant Src Homology 2 Domains Expressed in E. coli Can Be Predicted by TANGO. BMC Biotechnol. 2014, 14, 3.
- Nichols, P.; Li, L.; Kumar, S.; Buck, P.M.; Singh, S.K.; Goswami, S.; Balthazor, B.; Conley, T.R.; Sek, D.; Allen, M.J. Rational Design of Viscosity Reducing Mutants of a Monoclonal Antibody: Hydrophobic versus Electrostatic Inter-Molecular Interactions. MAbs 2015, 7, 212–230.
- Paladin, L.; Piovesan, D.; Tosatto, S.C.E. SODA: Prediction of Protein Solubility from Disorder and Aggregation Propensity. Nucleic Acids Res. 2017, 45, W236–W240.
- Amir, M.; Mohammad, T.; Kumar, V.; Alajmi, M.F.; Rehman, M.T.; Hussain, A.; Alam, P.; Dohare, R.; Islam, A.; Ahmad, F. Structural Analysis and Conformational Dynamics of STN1 Gene Mutations Involved in Coat Plus Syndrome. Front. Mol. Biosci. 2019, 6, 41.
- Walsh, I.; Seno, F.; Tosatto, S.C.E.; Trovato, A. PASTA 2.0: An Improved Server for Protein Aggregation Prediction. Nucleic Acids Res. 2014, 42, W301–W307.
- Carcamo-Noriega, E.N.; Saab-Rincon, G. Identification of Fibrillogenic Regions in Human Triosephosphate Isomerase. PeerJ 2016, 4, e1676.
- Kaur, G.; Kapoor, S.; Thakur, K.G. Bacillus subtilis HelD, an RNA Polymerase Interacting Helicase, Forms Amyloid-Like Fibrils. Front. Microbiol. 2018, 9, 1934.
- Hirose, S.; Noguchi, T. Espresso: A System for Estimating Protein Expression and Solubility in Protein Expression Systems. Proteomics 2013, 13, 1444–1456.
- Schlee, S.; Straub, K.; Schwab, T.; Kinateder, T.; Merkl, R.; Sterner, R. Prediction of Quaternary Structure by Analysis of Hot Spot Residues in Protein-Protein Interfaces: The Case of Anthranilate Phosphoribosyltransferases. Proteins Struct. Funct. Bioinform. 2019, 87, 815–825.
- Sormanni, P.; Aprile, F.A.; Vendruscolo, M. The CamSol Method of Rational Design of Protein Mutants with Enhanced Solubility. J. Mol. Biol. 2015, 427, 478–490.
- Ibrahim, A.E.C.; Reljic, R.; Drake Pascal, M.W.; Ma, J.K.C. Rational Design and Expression of a Recombinant Plant Rhabdovirus Glycoprotein for Production of Immunoreactive Murine Anti-Sera. Protein Expr. Purif. 2020, 175, 105691.
- Camilloni, C.; Sala, B.M.; Sormanni, P.; Porcari, R.; Corazza, A.; De Rosa, M.; Zanini, S.; Barbiroli, A.; Esposito, G.; Bolognesi, M.; et al. Rational Design of Mutations That Change the Aggregation Rate of a Protein While Maintaining Its Native Structure and Stability. Sci. Rep. 2016, 6, 25559.
- Martin, W.R.; Cheng, F. A Rational Design of a Multi-Epitope Vaccine Against SARS-CoV-2 Which Accounts for the Glycan Shield of the Spike Glycoprotein. J. Biomol. Struct. Dyn. 2021, 40, 7099–7113.
- Zambrano, R.; Jamroz, M.; Szczasiuk, A.; Pujols, J.; Kmiecik, S.; Ventura, S. AGGRESCAN3D (A3D): Server for Prediction of Aggregation Properties of Protein Structures. Nucleic Acids Res. 2015, 43, W306–W313.
- Kuriata, A.; Iglesias, V.; Pujols, J.; Kurcinski, M.; Kmiecik, S.; Ventura, S. Aggrescan3D (A3D) 2.0: Prediction and Engineering of Protein Solubility. Nucleic Acids Res. 2019, 47, W300–W307.
- Sankar, K.; Krystek, S.R.; Carl, S.M.; Day, T.; Maier, J.K.X. AggScore: Prediction of Aggregation-Prone Regions in Proteins Based on the Distribution of Surface Patches. Proteins Struct. Funct. Bioinform. 2018, 86, 1147–1156.
- Yang, Y.; Niroula, A.; Shen, B.; Vihinen, M. PON-Sol: Prediction of Effects of Amino Acid Substitutions on Protein Solubility. Bioinformatics 2016, 32, 2032–2034.
- Tian, Y.; Deutsch, C.; Krishnamoorthy, B. Scoring Function to Predict Solubility Mutagenesis. Algorithms Mol. Biol. 2010, 5, 33.
- Louros, N.; Orlando, G.; De Vleeschouwer, M.; Rousseau, F.; Schymkowitz, J. Structure-Based Machine-Guided Mapping of Amyloid Sequence Space Reveals Uncharted Sequence Clusters with Higher Solubilities. Nat. Commun. 2020, 11, 3314.
- De Baets, G.; Van Durme, J.; Van Der Kant, R.; Schymkowitz, J.; Rousseau, F. Solubis: Optimize Your Protein. Bioinformatics 2015, 31, 2580–2582.
- Walsh, I.; Martin, A.J.M.; Di Domenico, T.; Tosatto, S.C.E. Espritz: Accurate and Fast Prediction of Protein Disorder. Bioinformatics 2012, 28, 503–509.
- Kyte, J.; Doolittle, R.F. A Simple Method for Displaying the Hydropathic Character of a Protein. J. Mol. Biol. 1982, 157, 105–132.
- Piovesan, D.; Walsh, I.; Minervini, G.; Tosatto, S.C.E. FELLS: Fast Estimator of Latent Local Structure. Bioinformatics 2017, 33, 1889–1891.
- Conchillo-Solé, O.; de Groot, N.S.; Avilés, F.X.; Vendrell, J.; Daura, X.; Ventura, S. AGGRESCAN: A Server for the Prediction and Evaluation of “Hot Spots” of Aggregation in Polypeptides. BMC Bioinform. 2007, 8, 65.
- Jamroz, M.; Kolinski, A.; Kmiecik, S. CABS-Flex: Server for Fast Simulation of Protein Structure Fluctuations. Nucleic Acids Res. 2013, 41, W427–W431.
- Habibi, N.; Mohd Hashim, S.Z.; Norouzi, A.; Samian, M.R. A Review of Machine Learning Methods to Predict the Solubility of Overexpressed Recombinant Proteins in Escherichia coli. BMC Bioinform. 2014, 15, 134.
- Land, H.; Humble, M.S. YASARA: A Tool to Obtain Structural Guidance in Biocatalytic Investigations. In Methods in Molecular Biology; Humana Press Inc.: Totowa, NJ, USA, 2018; Volume 1685, pp. 43–67.
- Buß, O.; Rudat, J.; Ochsenreither, K. FoldX as Protein Engineering Tool: Better Than Random Based Approaches? Comput. Struct. Biotechnol. J. 2018, 16, 25–33.
- Sahin, E.; Jordan, J.L.; Spatara, M.L.; Naranjo, A.; Costanzo, J.A.; Weiss, W.F., IV; Robinson, A.S.; Fernandez, E.J.; Roberts, C.J. Computational Design and Biophysical Characterization of Aggregation-Resistant Point Mutations for γD Crystallin Illustrate a Balance of Conformational Stability and Intrinsic Aggregation Propensity. Biochemistry 2011, 50, 628–639.
- Sant’Anna, R.; Braga, C.; Varejão, N.; Pimenta, K.M.; Graña-Montes, R.; Alves, A.; Cortines, J.; Cordeiro, Y.; Ventura, S.; Foguel, D. The Importance of a Gatekeeper Residue on the Aggregation of Transthyretin: Implications for Transthyretin-Related Amyloidoses. J. Biol. Chem. 2014, 289, 28324–28337.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).