From Extraction of Local Structures of Protein Energy Landscapes to Improved Decoy Selection in Template-Free Protein Structure Prediction

Due to the essential role that the three-dimensional conformation of a protein plays in regulating interactions with molecular partners, wet and dry laboratories seek biologically-active conformations of a protein to decode its function. Computational approaches are gaining prominence due to the labor and cost demands of wet laboratory investigations. Template-free methods can now compute thousands of conformations known as decoys, but selecting native conformations from the generated decoys remains challenging. Repeatedly, research has shown that the protein energy functions whose minima are sought in the generation of decoys are unreliable indicators of nativeness. The prevalent approach ignores energy altogether and clusters decoys by conformational similarity. Complementary recent efforts design protein-specific scoring functions or train machine learning models on labeled decoys. In this paper, we show that an informative consideration of energy can be carried out under the energy landscape view. Specifically, we leverage local structures known as basins in the energy landscape probed by a template-free method. We propose and compare various strategies of basin-based decoy selection that we demonstrate are superior to clustering-based strategies. The presented results point to further directions of research for improving decoy selection, including the ability to properly consider the multiplicity of native conformations of proteins.


Comparison of Computational Runtimes
The six selection strategies are compared in terms of running time on the shortest (in terms of number of amino acids) and longest protein target from each of the three categories (easy, medium, and hard). Figure 1 shows the running times in a log-scale plot, so as to accommodate both trivial strategies, such as Cluster-Random, and the most computationally expensive strategy, Cluster-Size. The results show that the basin-based selection strategies are more efficient than Cluster-Size.

Impact of Energy-based Filtering on Selection Strategies
One can consider a priori filtering of computed conformations before cluster- or basin-based selection, particularly as a strategy to reduce data size and thus computational time. Two opposite filtering strategies can be considered: one removes the x% lowest-energy conformations, whereas the other removes the x% highest-energy conformations. Figure 2 shows the basins obtained when removing the 10% lowest-energy conformations from three representative test cases (one for each of the easy, medium, and hard categories described in the main article). As Figure 2 demonstrates, the removal of the lowest-energy conformations results in an explosion of basins with very low purity, causing the Pareto-based selection strategies to select low-purity basins. This is expected: removing the lowest-energy conformations drastically changes the structure of the underlying landscape, as removing the lowest energies removes focal energies. Hence, such filtering results in many spurious basins on the deformed landscape.
The strategy of removing high-energy conformations (in other words, retaining the lowest-energy conformations), on the other hand, preserves enough of the structure of the landscape, as shown in Figures 3 and 4. Figure 3 shows the basins (marking those selected by the Pareto-based strategies) obtained when the 50% highest-energy conformations are removed, whereas Figure 4 shows the basins obtained after removing the 90% highest-energy conformations. As expected, removing higher-energy conformations does not drastically change the structure of the landscape and preserves enough basins for the basin-based selection strategies to home in on high-purity ones.
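The two filtering strategies described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `filter_by_energy`, its parameters, and the choice to round the number of removed conformations down are all assumptions made for the example.

```python
def filter_by_energy(energies, percent, remove="highest"):
    """Return the indices of conformations kept after energy-based filtering.

    energies: list of energy values, one per conformation (hypothetical input).
    percent:  the x in "remove the x% lowest/highest-energy conformations".
    remove:   "lowest" or "highest" (names chosen for this sketch).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if remove not in ("lowest", "highest"):
        raise ValueError("remove must be 'lowest' or 'highest'")

    # Assumption: the number of removed conformations is rounded down.
    n_remove = int(len(energies) * percent / 100)

    # Indices sorted by ascending energy (lowest energy first).
    order = sorted(range(len(energies)), key=lambda i: energies[i])

    if remove == "lowest":
        kept = order[n_remove:]
    else:
        kept = order[:len(order) - n_remove]

    # Restore the original ordering of the surviving conformations.
    return sorted(kept)


# Toy energies, invented for illustration only.
toy_energies = [-5.0, -1.0, -3.0, -4.0, -2.0, 0.0, -9.0, -8.0, -7.0, -6.0]
kept_after_low_removal = filter_by_energy(toy_energies, 10, remove="lowest")
kept_after_high_removal = filter_by_energy(toy_energies, 50, remove="highest")
```

The surviving indices would then be passed to whichever cluster- or basin-based selection strategy is being evaluated.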

Impact of Distance Threshold on Selection Strategies
We now analyze how the dist thresh parameter affects the metrics n and p. This parameter determines which conformations are deemed native: those whose lRMSD to an experimentally known native conformation is within dist thresh. The analysis is limited to the top cluster or basin selected and compares Cluster-Size, Basin-Size+Energy, and Basin-PR+PC. The plots below group the results on the different protein targets by the three categories (easy, medium, and hard), as the ranges for dist thresh differ across the categories. Figure 5 shows the impact on n (left panel) and p (right panel) for Cluster-Size and Basin-Size+Energy as dist thresh is varied, and Figure 6 does so for Cluster-Size and Basin-PR+PC. Similar observations can be drawn from these two comparisons. As dist thresh increases, n decreases and p increases. This is expected, as there is a scarcity of conformations sufficiently near the known native conformation (e.g., < 1Å) in the dataset, especially for the targets in the medium and hard categories. Therefore, allowing larger distances from the known native conformation (i.e., larger values of dist thresh) for a particular conformation to be deemed native increases the number of native conformations in a selection, which is reflected in higher purity.
For example, consider the target with known native conformation under PDB entry 2h5nd (bottom row of Figure 5). There are hardly any decoys closer than 6Å to the known native conformation. As a result, when dist thresh is set to 6Å, the ratio (n) of the very few deemed-native conformations in the top basin selected by Basin-Size+Energy to the total number of native conformations in the dataset (which is also very small) yields a noticeable percentage, and thus a high value of n. As dist thresh increases, this percentage drops quickly, since the limited number of native conformations in a cluster or basin is now compared with a larger number of native conformations in the decoy dataset. On the other hand, as the number of native conformations in a cluster or basin increases (with the increased value of dist thresh), the ratio of that number to the size of the cluster or basin also increases, resulting in higher p.
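The opposing behavior of n and p can be reproduced on a small example. The sketch below follows the ratios described above (n: natives in the selection over natives in the dataset; p: natives in the selection over the size of the selection). The function name `n_and_p`, the use of "less than or equal to" for the threshold test, the handling of empty denominators, and the toy lRMSD values are all assumptions made for illustration and are not taken from the paper.

```python
def n_and_p(selected_lrmsds, all_lrmsds, dist_thresh):
    """Compute n and p (as percentages) for one selected cluster or basin.

    selected_lrmsds: lRMSDs (in Angstroms) of the decoys in the selection.
    all_lrmsds:      lRMSDs of all decoys in the dataset.
    dist_thresh:     decoys with lRMSD <= dist_thresh are deemed native
                     (the inclusive comparison is an assumption).
    """
    natives_in_selection = sum(1 for d in selected_lrmsds if d <= dist_thresh)
    natives_in_dataset = sum(1 for d in all_lrmsds if d <= dist_thresh)

    # n: share of the dataset's natives captured by the selection.
    n = 100 * natives_in_selection / natives_in_dataset if natives_in_dataset else 0.0
    # p: purity, the share of the selection that is native.
    p = 100 * natives_in_selection / len(selected_lrmsds) if selected_lrmsds else 0.0
    return n, p


# Toy lRMSD values, invented for illustration only.
dataset = [5.5, 5.8, 6.5, 7.0, 7.5, 8.0, 9.0, 10.0]
top_basin = [5.5, 5.8, 6.5, 7.0]

for thresh in (6.0, 7.0, 8.0):
    n, p = n_and_p(top_basin, dataset, thresh)
    print(f"dist_thresh={thresh:.1f}  n={n:.1f}%  p={p:.1f}%")
```

In this toy dataset, raising the threshold from 6Å to 8Å makes every decoy in the selection native (higher p), while the natives outside the selection grow faster than those inside it (lower n).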
On the easy targets (top row in Figure 5), varying dist thresh does not significantly impact purity; with the exception of the target with known native conformation under PDB entry 1ail, satisfactory purity (around 85% at approximately 1.5Å) is achieved on all the easy targets. This behavior is expected, as the easy targets have many decoys that are closer than 2Å to the known native conformation. The medium and hard targets show varying growth of p in response to varying dist thresh. This is particularly the case for the medium targets, where most decoys are far away from the known native conformation. In the case of the target with known native conformation under PDB entry 1hz6a, good purity is achieved fairly quickly, whereas in the case of the target with known native conformation under PDB entry 1hhp, good purity is obtained only when dist thresh becomes large.
[Figure panels: n and p for the easy, medium, and hard targets]
Finally, in almost all targets (easy, medium, and hard), despite yielding a higher percentage of native conformations in the top cluster, Cluster-Size is outperformed by Basin-Size+Energy and Basin-PR+PC in terms of purity for varying dist thresh values.
In summary, Figures 5-6 show that n decreases and p increases as dist thresh increases. This implies that an evaluation of the performance of the selection strategies would yield comparatively similar results at any specific dist thresh. In the evaluation presented in the paper, we set dist thresh so that the largest group selected by at least one selection strategy is not devoid of native conformations.
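One way to read the criterion above is as a search over candidate thresholds. The following sketch is only an interpretation: the text does not say that the smallest qualifying value is chosen, nor how candidates are enumerated, so the function name `smallest_usable_thresh`, the candidate grid, the inclusive comparison, and the toy lRMSD values are all assumptions.

```python
def smallest_usable_thresh(top_groups, candidates):
    """Return the smallest candidate dist_thresh for which at least one
    strategy's selected group contains a deemed-native conformation.

    top_groups: dict mapping a strategy name to the lRMSDs (in Angstroms)
                of the decoys in the group that strategy selected.
    candidates: iterable of candidate threshold values (hypothetical grid).
    Returns None if no candidate qualifies.
    """
    for thresh in sorted(candidates):
        for lrmsds in top_groups.values():
            # A group is "not devoid of natives" if any decoy is within thresh.
            if any(d <= thresh for d in lrmsds):
                return thresh
    return None


# Toy values, invented for illustration only.
toy_groups = {
    "Cluster-Size": [4.2, 5.1, 6.3],
    "Basin-Size+Energy": [3.4, 6.0],
}
chosen = smallest_usable_thresh(toy_groups, [1.0, 2.0, 3.0, 4.0, 5.0])
```

With a fine enough grid, this reduces to the smallest lRMSD found in any of the selected groups, rounded up to the next candidate value.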