3.1. Ensemble-Based Approaches
The generation of molecular ensembles using MD and Monte Carlo (MC) simulations has become affordable for a wider group of users, creating a means to tackle novel protein design challenges. By utilizing conformational ensembles, protein design algorithms can take the dynamic nature of protein structures into account, providing a biologically sound strategy and frequently improving the performance of the employed methods [
98,
99].
We start this section by reviewing insights from two studies benchmarking generic procedures for ensemble generation in terms of the success of protein design or redesign tasks. In the first comparative study by Ludwiczak and colleagues, 10 protocols combining methods from the Rosetta software [
100] with MD simulations were applied to 12 diverse proteins [
54]. For protein redesign, three distinct structural ensembles were obtained using MD simulation, MD simulation followed by the introduction of small backbone perturbations with Rosetta Backrub [
101], or Rosetta Backrub alone. Subsequently, the protein sequences were redesigned using either the fixed backbone (FixBB) or design-and-relax (D&R) methods on each ensemble [
102,
103]. We note here that the employed simulations were run for only 4 ns, albeit with 50 replicas, representing somewhat limited sampling around the conformational minima even though the target proteins were relatively small (up to 103 residues). The designed sequences were analyzed based on entropy, covariation, profile similarity, and packing quality in the corresponding generated structures. The best performance was observed for the protocol using MD simulation in combination with Rosetta Backrub for the ensemble generation, followed by redesign with the D&R method. Analogous protocols were then tested for de novo design purposes using only the more efficient D&R method, confirming that the procedure based on MD simulation coupled with Rosetta Backrub yielded the best results. In the second benchmarking study, Loshbaugh and Kortemme performed a comprehensive evaluation of four different flexible backbone design methods available within the Rosetta software using six datasets [
104]. Comparing FastDesign [
105,
106], Backrub Ensemble Design [
107], CoupledMoves with Backrub [
52], and CoupledMoves with kinematic closure, the authors concluded that the CoupledMoves method recapitulates sequences of known proteins better than the other two alternatives. This finding highlights the importance of incorporating side-chain and backbone flexibility simultaneously during the design. Interestingly, all methods performed poorly on two deep sequencing datasets, a result that warrants caution when applying Rosetta for such purposes. Overall, both studies emphasize that flexible backbone approaches combined with side-chain flexibility can significantly outperform methods utilizing only a single conformation.
The predictive performance of the Flex ddG method in estimating the change in binding free energy after mutation (ΔΔG) at protein–protein interfaces has also been boosted when using a structural ensemble instead of a single static structure [
92]. In this method, an ensemble of up to 50 structures is generated by conformational sampling around the mutated sites with the Rosetta Backrub program. Then, the wild-type ensemble is optimized by repacking side-chains and performing energy minimization. To generate a mutant ensemble, the mutation of interest is introduced into each structure before conducting the analogous side-chain repacking and minimization. Finally, both ensembles are scored to calculate the ensemble-averaged ΔΔG. The method was validated using the ZEMu dataset of 1240 mutations [
108] derived from the SKEMPI database [
109]. For this dataset, the Flex ddG method reached a Pearson correlation coefficient (PCC) of 0.63 and an average absolute error of 0.96 Rosetta energy units. The enhanced performance was especially prominent in the case of small-to-large mutations, emphasizing that backbone flexibility constitutes a key factor during the modeling of these mutations. Relevant improvements were also achieved for modeling stabilizing mutations and mutating antibody–antigen interfaces. Interestingly, the enhanced performance over a fixed backbone approach was observed already when averaging over 20–30 conformations, a relatively low number in contrast to previous ensemble-based methods, for which thousands of structural models were required [
110].
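The core idea of averaging a score difference over paired wild-type and mutant ensembles can be illustrated with a minimal sketch. This is not the Rosetta implementation: the `score` function is a hypothetical stand-in for an interface score, and the toy "structures" carry nothing but a precomputed energy.

```python
import random

def score(structure):
    """Stand-in for a Rosetta-style interface score (hypothetical)."""
    return structure["energy"]

def ensemble_ddg(wt_ensemble, mut_ensemble):
    """Average the per-member score difference over paired WT/mutant models."""
    diffs = [score(m) - score(w) for w, m in zip(wt_ensemble, mut_ensemble)]
    return sum(diffs) / len(diffs)

# Toy ensembles: each "structure" is reduced to a scored model.
random.seed(0)
wt  = [{"energy": -50.0 + random.gauss(0, 0.5)} for _ in range(30)]
mut = [{"energy": -48.5 + random.gauss(0, 0.5)} for _ in range(30)]

ddg = ensemble_ddg(wt, mut)  # positive -> mutation predicted destabilizing
```

Averaging over the ensemble damps the noise of any single conformation, which is why 20–30 members already suffice in practice.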
Notably, the Flex ddG method was evaluated in three comprehensive benchmarking studies focusing on different engineering scenarios. Aldeghi and coworkers evaluated alchemical free-energy calculations and three Rosetta protocols including Flex ddG in combination with different force fields for the prediction of changes in the binding affinity of ligands upon mutation [
111]. In total, 134 mutations were considered for 27 ligands and 17 proteins, showing that Flex ddG can reach quantitative agreement with such experimental data with a root-mean-square error (RMSE) of 1.46 kcal/mol and a PCC of 0.25, which was on par with the best performing alchemical calculations (an RMSE of 1.39 kcal/mol and a PCC of 0.43) [
111]. At this point, it is worth comparing the computational resources required for such predictions. The alchemical calculations were reported to take two to five days using 20 CPU threads and one GPU, while Flex ddG computations were usually finished within a day on a single CPU core [
111]. The same authors also evaluated the use of these methods for the prediction of 31 drug resistance-conferring mutations for eight tyrosine kinase inhibitors of the human kinase ABL [
112]. For this dataset, Flex ddG was found to be highly accurate with an RMSE of 0.72 kcal/mol and a PCC of 0.67, even outperforming the much more demanding alchemical calculations [
112]. Interestingly, significant improvements in ΔΔG prediction could be reached with a consensus of predictions from Flex ddG and alchemical calculations in both studies [
111,
112]. Another comparative study investigated the performance of five predictive tools when applied for alanine scanning to identify hotspot residues at protein–protein interfaces [
113]. For a dataset of 748 single-point mutations to alanine from the SKEMPI database, Flex ddG ranked the best (PCC of 0.51) from the tools that were not trained using this database [
113].
The advantages of incorporating conformational ensembles during design have also been noted during the development of a multistate framework that enables the adoption of reliable methods implemented in the Rosetta package for single-state design (SSD) and also for multistate design (MSD) [
93]. Briefly, the input for the framework consists of a set of multiple states (structural conformations) and a population of sequences generated by randomly introduced single-point mutations, which are processed and altered by a genetic algorithm. Next, each sequence–state pair is evaluated and scored with the Rosetta SSD protocol of the user’s choice. The score of each sequence is communicated back to a sequence optimizer to perform the next iteration, until the fitness criteria are satisfied, finally giving a population of optimized sequences. This contrasts with the standard SSD, which uses an MC algorithm and produces only a single sequence. The performance of MSD was evaluated from several design perspectives. Firstly, the performances of MSD and SSD in recapitulating the binding site of the human intestinal fatty acid-binding protein were compared utilizing its ensemble obtained by NMR spectroscopy. Here, the SSD approach was applied separately to each conformation, while MSD was run on the whole ensemble at once. The MSD procedure achieved higher average native sequence recovery (NSR) and native sequence similarity recovery (NSSR) rates. Additionally, de novo ligand-binding design was performed for 16 proteins using SSD and MSD, where conformational ensembles of 20 and 1000 structures were generated by the Rosetta Backrub algorithm and a 10 ns long MD simulation, respectively. In this comparison, the MSD approach mostly produced sequences with higher NSR and NSSR rates and slightly lower energies, demonstrating the advantages of ensemble utilization. Interestingly, the quality of the designs originating from Rosetta Backrub and MD simulations was comparable, even though the mean Cα RMSDs over the ensembles differed notably (0.17 and 0.62 Å, respectively).
Finally, the multistate framework was tested by introducing retro-aldolase activity into protein scaffolds, yielding nine proteins with experimentally confirmed activities [
93].
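The genetic-algorithm loop described above can be sketched in a few lines. Everything here is a toy stand-in: `score_pair` replaces the Rosetta SSD scoring of a sequence threaded onto one state, and the "states" are reduced to preferred sequences of hypothetical conformations.

```python
import random

random.seed(1)
AA = "ACDEFGHIKLMNPQRSTVWY"

def score_pair(seq, state):
    """Stand-in for a Rosetta SSD score of one sequence-state pair.
    Toy energy: reward matches to the state's preferred sequence."""
    return -sum(a == b for a, b in zip(seq, state))

def fitness(seq, states):
    """Multistate fitness: average score over all states in the ensemble."""
    return sum(score_pair(seq, s) for s in states) / len(states)

def point_mutate(seq):
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(AA) + seq[i + 1:]

def msd(states, pop_size=20, generations=50):
    length = len(states[0])
    pop = ["".join(random.choice(AA) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Score every sequence against all states, keep the fitter half,
        # and refill the population with point mutants of the survivors.
        pop.sort(key=lambda s: fitness(s, states))
        survivors = pop[:pop_size // 2]
        pop = survivors + [point_mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    pop.sort(key=lambda s: fitness(s, states))
    return pop  # final optimized population, best first

# Toy "states": preferred sequences of three hypothetical conformations.
states = ["ACDEF", "ACDEY", "ACDEF"]
final = msd(states)
best = final[0]
```

Note how the output is a whole population of optimized sequences rather than the single sequence produced by a standard MC-based SSD run.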
A similar idea of combining an ensemble-based design and a multistate approach was behind the development of a meta-multistate design procedure (meta-MSD) to rationally design proteins that spontaneously switch between conformational states [
94]. In this case, the procedure started with the generation of an ensemble of backbone templates with Rosetta Backrub and PertMin approaches [
99,
114] to cover the conformational landscape, including all transition states of interest. Next, the whole ensemble was split into microstates that were energy-minimized. Then, these microstates were assigned to major, transition, and minor states based on their structural features. Finally, the sequences expected to transit between the states were identified based on their relative energies. Based on meta-MSD, several Streptococcal protein G domain β1 variants were engineered to obtain structures that can exchange conformations between two states spontaneously, producing experimentally validated protein exchangers capable of switching between the states on a millisecond timescale [
94], thereby highlighting the importance of the accurate modeling of a local energy landscape for designing protein dynamics.
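The two key steps of the meta-MSD workflow, binning microstates into major/transition/minor states and selecting sequences whose state energies are nearly degenerate, can be sketched as follows. The 1-D order parameter, the cutoffs, and the per-sequence energies are all illustrative assumptions, not values from the study.

```python
def assign_state(microstate, low=1.0, high=2.0):
    """Bin a microstate by a structural order parameter (a hypothetical
    1-D coordinate standing in for the real structural features)."""
    x = microstate["coord"]
    if x < low:
        return "major"
    if x < high:
        return "transition"
    return "minor"

def exchanger_candidates(sequences, max_gap=1.0):
    """Keep sequences whose major- and minor-state energies lie within
    max_gap (kcal/mol), so both states are expected to be populated."""
    keep = []
    for seq in sequences:
        gap = abs(seq["E_major"] - seq["E_minor"])
        if gap <= max_gap:
            keep.append(seq["name"])
    return keep

# Toy microstates binned by the order parameter.
micro = [{"coord": 0.2}, {"coord": 1.4}, {"coord": 2.7}]
bins = [assign_state(m) for m in micro]  # -> ["major", "transition", "minor"]

# Toy candidate sequences with per-state energies.
seqs = [
    {"name": "wt",   "E_major": -20.0, "E_minor": -15.0},  # minor too high
    {"name": "var1", "E_major": -18.5, "E_minor": -18.0},  # near-degenerate
    {"name": "var2", "E_major": -17.0, "E_minor": -19.5},  # minor dominant
]
picked = exchanger_candidates(seqs)  # -> ["var1"]
```

The near-degenerate energies of `var1` are what make spontaneous exchange plausible, which is precisely the property the meta-MSD selection targets.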
3.2. Knowledge-Based Approaches
Following the expansion of protein structure databases, which contain a considerable amount of data related to structure–dynamics–function relationships in proteins, new methods to assess backbone flexibility have been designed, benefiting from this wealth of knowledge. The methods introduced here are implemented in the Rosetta software and represent an exciting direction for improving protein design processes by more efficiently exploring alternative backbone conformations.
The first among the reviewed data-driven approaches is the flexible backbone learning by Gaussian processes (FlexiBaL-GP) method [
95], which uses multiple structures of a given protein and a Gaussian process latent variable model to learn the most probable global backbone movements specific to the training structures. These learned movements are then applied to guide the search for proteins with alternative backbone conformations by Markov chain Monte Carlo sampling, where at each step 95% of the time is devoted to selecting the optimal side-chain rotamers and 5% to generating protein backbone movements. FlexiBaL-GP can utilize various sources of training data, including X-ray structures, NMR models, and MD simulations. When learning from a set of 28 crystal structures of ubiquitin and using two latent variables, the FlexiBaL-GP method generated an ensemble of structures for native ubiquitin with an RMSD range of 0.5–0.65 Å from a reference structure. Notably, the ensemble recovered over 40% of the conformational diversity of the ensemble obtained by NMR spectroscopy. Moreover, the method’s ability to enrich a library of ubiquitin variants towards those with improved affinity to ubiquitin carboxyl-terminal hydrolase 21 was evaluated. For this task, the FlexiBaL-GP method was trained on two wild-type complexes only or combined with either a structure of a tightly binding mutant or MD-based ensembles starting from the two wild-type structures. All three derived models outperformed flexible designs with Rosetta Backrub, as well as designs based on ensembles generated with MD simulations and the constraint-based method CONCOORD [
115].
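The 95/5 mixture of side-chain and backbone moves within a Metropolis sampler can be illustrated generically. This sketch is not FlexiBaL-GP itself: the energy function and both move functions are hypothetical stand-ins, with a toy 2-D state whose second coordinate plays the role of a learned backbone latent variable.

```python
import math
import random

random.seed(2)

def metropolis_design(energy_fn, sidechain_move, backbone_move,
                      x0, steps=2000, kT=1.0, p_backbone=0.05):
    """Metropolis sampler mixing two move types: with probability
    p_backbone propose a (learned) backbone move, otherwise a
    side-chain rotamer move. Move functions are user-supplied stand-ins."""
    x, e = x0, energy_fn(x0)
    n_backbone = 0
    for _ in range(steps):
        if random.random() < p_backbone:
            x_new = backbone_move(x)
            n_backbone += 1
        else:
            x_new = sidechain_move(x)
        e_new = energy_fn(x_new)
        # Accept downhill moves always, uphill moves with Boltzmann weight.
        if e_new < e or random.random() < math.exp(-(e_new - e) / kT):
            x, e = x_new, e_new
    return x, e, n_backbone

# Toy 2-D state: x[0] ~ "side-chain" coordinate, x[1] ~ "backbone" latent.
energy = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
sc_move = lambda x: (x[0] + random.gauss(0, 0.3), x[1])
bb_move = lambda x: (x[0], x[1] + random.gauss(0, 0.3))

x, e, n_bb = metropolis_design(energy, sc_move, bb_move, (0.0, 0.0))
```

Keeping backbone moves rare reflects the method's design: most computational effort goes into rotamer optimization, while occasional learned backbone moves let the sampler escape the fixed-backbone basin.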
A different approach to harnessing knowledge from structural databases and to navigating sequence space sampling with a flexible backbone has been explored by the structural homology algorithm for protein design (SHADES) [
96]. This approach relies on the libraries of In-contact amino acid residue TErtiary Motifs (ITEMs) derived from curated protein structures, in which local contacts were analyzed for each residue. Analogously, target ITEMs are then identified for each position in the target structure in a position-specific manner and matched to the ITEMs database in order to generate candidate ITEMs libraries. Finally, these libraries are exploited by an iterative population-based optimization method that substitutes all residues in each target ITEM position with all residues from a candidate ITEM. The structure of the altered fragment is then adjusted by optimizing its backbone with the Rosetta Backrub method, repacking the side-chains and minimizing or relaxing the whole structure with or without backbone restraints. Using a dataset of 40 proteins from different families, the SHADES performance in recovering the native sequences of the proteins was evaluated, reaching a 30% average sequence recovery and a 46% sequence similarity between the designed and natural proteins, when candidate ITEMs derived from homologous proteins were excluded. When the homologs were retained in the candidate libraries, the sequence recovery rate increased up to 93%. Notably, rather large conformational diversity was observed for the successfully designed models, in some instances exhibiting more than a 1 Å RMSD from their respective native structures. Overall, these tests indicated that SHADES could capture sequence–dynamics–structure relationships correctly while spending about 25 times less CPU time than the redesign mode of the Rosetta FastRelax method [
116].
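The substitution loop at the heart of this strategy, drawing residues from position-specific candidate libraries and keeping substitutions that improve the score, can be sketched as a greedy toy version. This is not SHADES: the libraries, the native-mismatch score, and the acceptance rule are all simplified assumptions standing in for the ITEM matching and Rosetta-based refinement described above.

```python
import random

random.seed(3)

def design_from_items(target_len, candidate_items, score_fn, iters=300):
    """Iteratively substitute residues drawn from position-specific
    candidate libraries (standing in for matched ITEMs), accepting a
    substitution when the score does not worsen."""
    seq = [random.choice(candidate_items[i]) for i in range(target_len)]
    best = score_fn(seq)
    for _ in range(iters):
        i = random.randrange(target_len)
        old = seq[i]
        seq[i] = random.choice(candidate_items[i])
        new = score_fn(seq)
        if new <= best:
            best = new
        else:
            seq[i] = old  # revert the worsening substitution
    return "".join(seq), best

# Toy setup: each position's library contains the native residue plus decoys,
# and the score simply counts mismatches against a hypothetical native sequence.
native = "ACDEFG"
libs = [[native[i], "K", "R"] for i in range(len(native))]
score = lambda s: sum(a != b for a, b in zip(s, native))
designed, mismatches = design_from_items(len(native), libs, score)
```

When the libraries contain the native residues, the loop recovers the native sequence, mirroring the sequence-recovery benchmark used to evaluate SHADES.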
3.3. Provable Algorithms
Due to the high complexity of protein design tasks, especially when employing ensemble-based approaches (
Section 3.1), the majority of the tools rely on heuristic algorithms as an expedient way to obtain the desired constructs. For more complicated tasks, these approaches often fail to find optimal solutions, which in turn can lead to the design of sequences that are not guaranteed to have the lowest energy [
117]. In response to these limitations, provable algorithms have been developed, creating a promising alternative for reaching solutions with mathematical guarantees [
117,
118]. Here, we briefly outline some of the most compelling developments that led to an advanced description of backbone flexibility. For a more comprehensive overview of provable algorithms and their evolution and application, please see the very insightful reviews published recently [
119,
120].
The development of provable algorithms started with the adaptation of the dead-end elimination (DEE) method [
121], which was later improved by introducing the minimization of rotamers before pruning to enable a more continuous description of side-chains, an essential component of several successful designs [
118,
122]. The initial approach to backbone flexibility was introduced with the dead-end elimination with perturbations (DEEPer) method [
123], relying on a predefined set of small local movements extracted from an experimental structure such as Backrub [
124] or shear moves. However, such motions are mostly restricted to subangstrom dimensions to avoid disruptive changes propagating to regions distant from the altered backbone segment. To enable larger motions in a predefined contiguous part of the backbone, such as the movement of a flexible loop, the coordinates of atoms by Taylor series (CATS) approach was recently introduced [
97]. The main idea of the approach lies in a new definition of the backbone internal coordinate system, which enables physically sensible, continuous, and strictly localized perturbations of a given segment of the backbone in a manner that is compatible with the advanced DEE workflows. The CATS method was tested on 28 different proteins with flexible backbone treatment enabled for five- to nine-residue-long segments. By introducing more pronounced changes in backbone conformations, almost 0.2 Å on average, CATS reached a mean improvement in design energies of 3.5 kcal/mol in comparison to the rigid-backbone approximation. Such an improvement is nearly twice as large as that observed previously for the restricted backbone perturbations introduced by the DEEPer method on the same set.
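For readers unfamiliar with DEE, the classic Goldstein pruning criterion underlying these methods can be sketched directly: rotamer r at position i is provably not part of the global minimum if some competitor t is better regardless of how all other positions are assigned. The energy tables below are a deliberately tiny, made-up example; real implementations add minimization, flexibility, and many optimizations on top of this core test.

```python
def goldstein_dee(self_E, pair_E, positions):
    """One pass of Goldstein dead-end elimination: rotamer r at position i
    is pruned if some competitor t at i satisfies
      E(i_r) - E(i_t) + sum_j min_s [E(i_r,j_s) - E(i_t,j_s)] > 0,
    i.e. t beats r no matter which rotamers the other positions adopt."""
    pruned = set()
    for i, rot_i in positions.items():
        for r in rot_i:
            for t in rot_i:
                if t == r or (i, t) in pruned:
                    continue
                total = self_E[(i, r)] - self_E[(i, t)]
                for j, rot_j in positions.items():
                    if j == i:
                        continue
                    # Worst case over the surviving rotamers at position j.
                    total += min(pair_E[(i, r, j, s)] - pair_E[(i, t, j, s)]
                                 for s in rot_j if (j, s) not in pruned)
                if total > 0:
                    pruned.add((i, r))
                    break
    return pruned

# Toy system: two design positions with two rotamers each; all pairwise
# energies are zero, and rotamer 0 at position 0 has a high self-energy.
positions = {0: [0, 1], 1: [0, 1]}
self_E = {(0, 0): 5.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
pair_E = {(i, r, j, s): 0.0
          for i in (0, 1) for j in (0, 1) if i != j
          for r in (0, 1) for s in (0, 1)}

pruned = goldstein_dee(self_E, pair_E, positions)  # -> {(0, 0)}
```

Because the criterion holds for every assignment of the remaining positions, pruning is provably safe: the global minimum-energy conformation can never contain a pruned rotamer, which is what distinguishes these algorithms from heuristics.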
Owing to persistent optimization efforts [
125,
126,
127,
128], provable algorithms can nowadays be applied for protein design while simultaneously employing both the continuous flexibility of side-chains and enhanced backbone flexibility efficiently at similar computational costs to more rigid approaches. These methods are available in OSPREY 3.0 [
129], in which the analysis speed has been further boosted by newly added support for GPUs and multicore CPUs for some of the modeling tasks that were prohibitively complicated for the previous version of the software. As underlined by several studies featuring various applications of provable algorithms [
130,
131,
132,
133], these algorithms have matured enough to be of practical utility for protein engineers. This trend will undoubtedly gain further momentum with the recent developments discussed herein, even though their computational demands might still be limiting for some applications.