A Comparison between Enrichment Optimization Algorithm (EOA)-Based and Docking-Based Virtual Screening

Virtual screening (VS) is a well-established method in the initial stages of many drug and material design projects. VS is typically performed using structure-based approaches such as molecular docking, or various ligand-based approaches. Most docking tools were designed to be as global as possible, and consequently only require knowledge on the 3D structure of the biotarget. In contrast, many ligand-based approaches (e.g., 3D-QSAR and pharmacophore) require prior development of project-specific predictive models. Depending on the type of model (e.g., classification or regression), predictive ability is typically evaluated using metrics of performance on either the training set (e.g.,QCV2) or the test set (e.g., specificity, selectivity or QF1/F2/F32). However, none of these metrics were developed with VS in mind, and consequently, their ability to reliably assess the performances of a model in the context of VS is at best limited. With this in mind we have recently reported the development of the enrichment optimization algorithm (EOA). EOA derives QSAR models in the form of multiple linear regression (MLR) equations for VS by optimizing an enrichment-based metric in the space of the descriptors. Here we present an improved version of the algorithm which better handles active compounds and which also takes into account information on inactive (either known inactive or decoy) compounds. We compared the improved EOA in small-scale VS experiments with three common docking tools, namely, Glide-SP, GOLD and AutoDock Vina, employing five molecular targets (acetylcholinesterase, human immunodeficiency virus type 1 protease, MAP kinase p38 alpha, urokinase-type plasminogen activator, and trypsin I). We found that EOA consistently outperformed all docking tools in terms of the area under the ROC curve (AUC) and EF1% metrics that measured the overall and initial success of the VS process, respectively. This was the case when the docking metrics were calculated based on a consensus approach and when they were calculated based on two different sets of single crystal structures. Finally, we propose that EOA could be combined with molecular docking to derive target-specific scoring functions.


Introduction
Time and money are two of the most required resources in the design of new drugs and materials. Several techniques are available to expedite and lower the costs of these processes, such as functional genomics [1], high-throughput screening (HTS) [2] and combinatorial chemistry [3]. Over the years, computational methods have demonstrated their ability to complement and even replace experimental techniques for such tasks.
Among the computational methods, virtual screening (VS) stands out as a viable alternative to HTS in the initial stages of drug and material design projects. VS is typically performed using structure-based approaches employing molecular docking, and to a lesser extent, simulation methods, or various ligand-based approaches [4][5][6]. Most docking tools, scoring functions and force fields utilized in structure-based VS were designed to be as general as possible, as evident, e.g., from the large volume of experimental data used for their derivation [7][8][9][10][11], and consequently only require knowledge of the 3D structure of the biotarget, although additional experimental information, when available, could be used to select the most appropriate tool(s) and to optimize their performances. In contrast, many ligand-based approaches require the prior development of target-specific predictive models by using information on active (and when available inactive) compounds. Ligand-based approaches for VS are typically implemented by means of pharmacophore models [5], 3D-QSAR [12,13] or various similarity search strategies [14,15]. Both structurebased and ligand-based approaches have demonstrated notable success in multiple VS campaigns [16,17].
A particularly appealing yet somewhat less common approach to virtual screening is presented by QSAR equations derived from easy to calculate 1D, 2D and sometimes global 3D descriptors. Several such studies were reported in the literature, and in most cases, the descriptors were calculated for the ligands in their unbound states [18][19][20][21][22][23][24][25][26][27][28]. Some of these efforts were summarized in several review articles [29,30]. In other cases, ligand-based descriptors were combined with descriptors derived from the ligands' 3D conformations as obtained from molecular docking [31,32] or with the docking scores themselves [33]. However, these models were typically not used for virtual screening due to the computational resources required for large scale docking. An interesting combination of QSAR and docking for the purpose of VS was recently presented by Gentile et al., who used a deep neural network based on molecular fingerprints to predict docking scores of >1.3 billion compounds retrieved from the ZINC database [34]. This method was subsequently used for the VS of a similar number of compounds against the SARS-CoV-2 main protease [35]. All in all, QSAR-based VS holds promise for handling the increasingly large collections of commercially available or synthetically feasible compound collections offered by many vendors. This is because the computational resources needed for calculating such descriptors are significantly less than those required by other techniques.
A common theme of all the above-mentioned literature reports is that QSAR models derived for VS were validated using metrics of performance on either a training set (e.g., Q 2 CV ) or a test set (e.g., metrics derived from the confusion matrix for classification models or Q 2 F1/F2/F3 for regression models) [36]. In this respect, it is important to note the lack of correlation between internal and external validation, a phenomenon sometimes referred to as the "Kubinyi Paradox" [37,38]. This, together with similar observations, have led to the realization that models should be evaluated on external test sets only [39]. However, irrespective of the exact nature of the evaluation metric, there is no reason to a priori assume that any of these metrics could reliably assess the performances of a QSAR model in the context of VS. This is because the task faced by VS, namely, the identification of a set of weakly active compounds from within a large pool of diverse, mostly inactive compounds, is quite different from the ability to qualitatively or quantitatively predict the activities of a small set of similar compounds. Thus, we argue that the evaluation of QSAR equations (and by extension of any computational model) should reflect their intended usage. In particular, if QSAR equations are derived with VS in mind, they should be evaluated in a VS scenario. Moreover, if the derivation of QSAR equations is treated as an optimization problem in the space of the molecular descriptors, which is often the case, then for the purpose of VS, the metric to be optimized should be VS-aware.
With this in mind, we previously presented the enrichment optimization algorithm (EOA), which derives QSAR models in the form of multiple linear regression (MLR) equations by optimizing an enrichment-like metric, and demonstrated its superiority in small-scale VS campaigns over QSAR equations derived by optimizing according to a "classical" metric (mean averaged error) [40]. Still, the original EOA algorithm suffered from several drawbacks, and in particular from a high degree of redundancy in the optimized metric. Thus, many QSAR equations led to identical values of the evaluation metric with no way to further rank them.
In this work, we present an improved version of EOA and demonstrate its superior performances in the virtual screening of five protein targets, this time in comparison with the most common VS approach, namely, molecular docking. Table 1 presents the results obtained with the EOA models for all datasets, whereas  Table 2 presents a comparison between the results obtained with EOA and all the docking tools tested in this work for all test sets. Test set performances were evaluated using the AUC and EF 1% metrics. The docking results are based on the consensus approach across two crystal structures for each target. The results for the individual crystal structures are presented in Tables S4 and S5.  The EOA results presented in Table 1 demonstrate the expected yet small decrease in performance when going from the training sets to the validation sets (overall averaged percentages of active compounds found within the first L places of the ranked list of 85% and 82%, for the training sets and validation sets, respectively). A much larger decrease occurred for the test sets (average percentage = 41%), which could be attributed to the much smaller percentage of active compounds in these sets (see Materials and Methods section). In terms of the number of descriptors, we note a slight increase in performance when going from 7-descriptor to 10-descriptor and to 13-descriptor models. However, this increase was consistent across training, validation and test sets: The averaged percentages of active compounds found within the first L places of the ranked list were 83%, 86%, and 87% for training sets calculated with 7, 10 and 13-descriptor models. The corresponding scores for the validation sets were 81%, 83% and 84%; and for test sets, 40%, 42% and 42%. Taken together, these results suggest that our models are unlikely to be over-fitted. We note that in all cases, the number of descriptors in the final models was well below the number of compounds used for model derivation (see Table 1). Finally, in terms of the different protein targets, the best results were obtained for UROK and TRY1, followed by HIVPR, whereas models derived for ACES and MK14 gave overall poorer results (averaged percentages of active compounds found within the first L places of the ranked list obtained for training, validation and test sets were: UROK: 96%, 94%, 58%; TRY1: 91%, 88%, 62%; HIVPR: 87%, 86%, 34%; ACES: 76%, 69%, 28%; MK14: 76%, 74%, 26%). A list of the most common descriptors appearing in all EOA equations derived for each dataset together with the number of occurrences and a short explanation is provided in Table S2. A complete list of descriptors is provided in Table S3.

Results
The results of the VS are presented in Table 2. Similarly to Table 1, we see a slight increase in EOA performance when going from 7-descriptor to 10-descriptor models but no further increase when going to 13-descriptor models (averaged AUC and EF 1% values across all EOA models across all datasets and subsets for 7, 10 and 13-descriptor models  Table 1. Most significantly, the results in Table 2 clearly indicate that the EOA algorithm outperformed all docking programs tested in this study across all subsets and datasets in terms of both AUC and EF 1% values, even when the latter were based on the consensus approach (averaged AUC across all EOA models across all datasets and subsets: 0.93; averaged AUC across all docking tools across all datasets and subsets: 0.76 for the consensus approach, 0.73 based on the DUD-E associated structures (Table S4) and 0.74 based on the alternative structures (Table S5); the corresponding EF 1% values are 46.7, 17.8, 15.3, and 16.2 for EOA, consensus docking, DUD-E associated structures and alternative structures, respectively). Thus, while using a consensus approach across two structures improved the docking results relative to using a single structure, the performances of EOA were still better.
Finally, while a comparison between the different docking algorithms was not the focus of this study, we note that in terms of AUC values and using the consensus approach, AD Vina performed the best for ACES, HIVPR and MK14. For UROK and TRY1, the results are test set-dependent, with Glide and Gold outperforming AD Vina. In terms of EF 1% , GOLD performed the best for ACES and HIVPR, whereas Glide performed the best for MK14, UROK and TRY1. Thus, no clear-cut trends were observed. A comparison of Table 2 with Tables S4 and S5 suggests that out of the five targets considered in this work, the largest increases in AUC and EF 1% upon moving from single crystal-based VS to consensus-based VS occurred for the TRY1 dataset docked with Glide, and for the MK14 dataset also docked with Glide, respectively. Figure 1 presents, for each dataset, the EOA and docking-generated ROC curves, based on the consensus approach. As noted in the Materials and Methods section, each dataset gave rise to four subsets which led to four corresponding test sets. Thus, for each dataset, we chose as a reference the test set that yielded the best ROC curve for any of the docking programs and compared the results obtained with the other docking tools and with the EOA algorithm to it. A similar analysis based on using as a reference the test set that yielded the best ROC curve across the two crystal structures (DUD-E-associated and alternative structures) and the four test sets is presented in Figure S1. The complete set of ROC curves for both sets of structures and for the consensus approach is provided in Figures S2-S16. These results reinforce those presented in Table 2, Tables S4 and S5-namely, that for this set of targets, the EOA algorithm performed the best. Note that this is the case even though the EOA algorithm is not necessarily represented by its best result. For example, for the UROK case we chose to present the results obtained for set 3, since for this set, we obtained the best results from among the docking tools with GOLD (AUC = 0.86). The results obtained with the EOA algorithm for this set were AUC = 0.96. However, the results obtained with EOA for set 1 of this target were slightly superior, at 0.997. To further check that the results obtained with the EOA algorithm were not chancecorrelated, we performed Y-scrambling. For this purpose, the training set that gave rise to the best EOA model from among the four subsets for each of the five parent datasets was scrambled five times. Each scrambling was performed by tagging all active compounds as inactive and by randomly tagging the appropriate number of inactive compounds as active. Next, EOA models were derived from the scrambled datasets using 10 descriptors for the ACES, HIVPR and TRY1 datasets, and 13 descriptors for the MK14 and UROK datasets, and were applied to the test sets. The resulting ROC curves are presented in Figure 2 and demonstrate performances well below random selection. Thus, the original models derived from the unscrambled dataset are likely not chance-correlated.

Discussion
In this work we present an improved version of our recently reported enrichment optimization algorithm (EOA). EOA derives QSAR models by optimizing an enrichmentlike function, specifically by ranking a set of L active and O inactive compounds using an MLR equation and by maximizing the number of active compounds within the top L places of the ranked list. In this version we have augmented the scoring function by a secondary score which favors solutions in which active compounds, if found beyond the first L places, and inactive compounds, if found within the first L places, occupy positions as close as possible to position L. However, we deliberately restricted this secondary score to take values in the range of (0, 1). This effectively means that solutions that introduce more active compounds into the first L places of the list will always be preferred irrespective of the positions of the other active/inactive compounds. We argue that this is a viable strategy for virtual screening, wherein the purpose is to maximize the number of active compounds at the top of a list ranked by some scoring function, although other strategies could also be considered. From within solutions with equal numbers of active compounds in L, the secondary score favors those that "push" inactive compounds towards the bottom of the L (active) list and active compounds towards the top of the O (inactive) list.
Using the new algorithm, EOA models were derived according to common best practices [41][42][43]. Specifically, models were derived using a training set and validated on independent validation and test sets. All divisions into training/validation/test sets were performed at random and repeated four times. The only difference between the validation and test sets were the proportions between active and decoy compounds, tests sets being constructed with a small percentage of active compounds (i.e., 0.5-0.7%) which is the common case for datasets used in virtual screening [44,45]. To make sure that our models are not over-fitted we derived 7, 10 and 13-descriptor models and compared their performances. To make sure our models are not chance-correlated, we performed y-scrambling.
The results section suggests that the EOA models derived in this work are able to retrieve on average > 80% of the active compounds from training and validation sets and over 40% of the active compounds from test sets designed to include only a small fraction of active compounds, typical of VS campaigns. Importantly, these EOA models are unlikely to be chance-correlated or over-fitted. Looking at the most common descriptors appearing in the EOA equations derived for the different datasets (Table S2), we note that many of them are directly relevant to the ligand-protein recognition process (e.g., number of N-H groups, number of hydrogen donor groups, electrotopological state indices, molecular charge descriptors) whereas others are more related to ligands' ADME properties (e.g., ALOGP) or overall structure (e.g., Balaban index, Sum of topological distances between all nitrogen atoms in the molecule). All in all, the presence of these specific descriptors in the final EOA equations makes physical sense.
Having identified the most common descriptors, we wanted to see whether better EOA models could be developed using a subset of them and in particular such that emphasize protein-ligand interactions and key ADME ligand properties. For this purpose, we focused on a subset of seven descriptors, namely, PEOE3, HBD, MR1, ALOGP2, ALOGP7, ssCH2_Cnt and aaCH_Cnt (see Table S2 for a short explanation on their meaning) and constructed five EOA models, one for each target, using a total of three descriptors selected from this pool. The results are presented in Table 3 and are overall poorer than those obtained with the 7, 10 and 13-descriptor models (except of the 7-descriptors model for set 1 of the HIVPR target). Next, we wanted to test whether the performances of EOA-based models are correlated in any way with the ease of differentiating between active and decoy compounds in the different datasets. As a surrogate to ease of separation, we chose to look at average distances between actives and decoys. Since actives/decoys distances depend on the specific compounds and on the descriptors used to characterize them, different distances are expected for each of the three EOA models derived for each of the four subsets from each target. Thus, for each target we calculated distances based on the descriptors selected for the best 10-descriptors EOA model. This is because for three out of the five targets (MK14, UROK, TRY1), EOA models based on ten descriptors performed the best, in terms of the AUC metric, across all four subsets. Furthermore, since EOA models were derived using normalized descriptors, for the purpose of relevant comparison, distance calculations also employed the normalized descriptors. For ease of calculation, Euclidian distances were calculated using all principle components derived from principle component analysis (PCA) performed with the WEKA program [46]. The results of our calculations together with the corresponding AUC values are presented in Table 4 and suggest that: (1) The average distances calculated for the different subsets are not significantly different from one another and (2) Overall, there is no correlation between the actives/decoys distances and model performances. Thus, it seems that the better performances observed, e.g., for the UROK and TRY1 sets are not necessarily the result of a larger separation between active and decoy compounds. Finally, when compared against three common docking-based virtual screening tools, namely, Glide-SP, GOLD and AutoDock Vina EOA performances where significantly better both in terms of the AUC which informs on the overall success of the process and in terms of EF 1% values which inform on the success of the process in its initial stages. This was the case both when docking metrics were calculated based on the consensus approach or based on two different single crystal structures.
The comparison between EOA and docking requires some discussion. Cleves et al., have advocated the usage of multiple crystal structures for virtual screening, demonstrating a modest increase in performances relative to the usage of a single crystal structure (averaged AUC increased from 0.81 ± 0.11 to 0.84 ± 0.09 for 92 targets [47]). Admittedly, some of the targets in the DUD-E + database (including two utilized in this study, ACES and MK14) greatly benefited from the inclusion of multiple protein structures. In the case of ACES, we observed no increase in AUC on going from single structure-based VS to the consensus and only a small increase in EF 1% (average AUC across the three docking programs, four datasets and two single structure-based VS is 0.73 ± 0.05; the average AUC across the three docking programs and the four data sets for the consensus approach is 0.73 ± 0.06. The corresponding numbers for EF1% are 16.2 ± 8.2 and 17.6 ± 8.7). This could be attributed to the smaller number of structures used in our consensus (two vs. five). We note however, that the average AUC and EF 1% values obtained in this study for single structure-based VS, 0.73 and 16.2, are significantly higher than the numbers obtained by Cleves et al. for single structure-based VS (0.53 and 3) and in fact closely match the results of the five-structures consensus (0.74 and 19). Thus our single structure-based results leave much less room for improvement. The average (across all models and datasets) AUC and EF 1% values obtained by EOA are significantly higher at 0.89 ± 0.03 and 28 ± 19. In the case of MK14, we observed small increases in both AUC and EF 1% on going from single structure-based VS to the consensus (AUC: 0.68 ± 0.07 and 0.72 ± 0.04, respectively; EF 1% : 8.9 ± 2.3 and 12.4 ± 5.0, respectively). For this target, the lowest AUC value was obtained with Glide using the alternative crystal structure (0.55 ± 0.01). This number significantly increased to 0.74 ± 0.01 upon going to the two structure-based consensus. Cleves et al. reported a similar increase from 0.66 to 0.89 using a five-structure consensus. Still, our EOA results are significantly higher at 0.92 ± 0.02 and 30.2 ± 8.1 for the AUC and EF 1% , respectively. Finally, we note that while a consensus approach may benefit docking-based VS, in many projects, single structure-based VS is still the method of choice, either because of limited computational resources or because of lack of multiple crystal structures for the target of interest [34,35,[48][49][50][51][52][53][54][55].
A possible explanation for the improved performances of EAO over docking is that none of the docking tools we used were specifically trained on any of the specific targets, whereas the EOA models were. Thus, the better performances obtained with EOA could be regarded as another manifestation of the superiority of local over global models. While we cannot forego this argument, we note that in our previous work we demonstrated that, in the context of VS, EOA-derived models outperformed other regression and classification models trained on the same set of data [40]. Thus, using an enrichment-aware function for model derivation clearly has merit for the purpose of virtual screening.
Despite its good performances, two limitations of EOA and in fact of all ligand-based approaches for virtual screening should be noted: (1) Such approaches require information on active (and preferably also on inactive) compounds in order to derive and validate the models. While pharmacophore models could be developed using a rather small dataset, QSAR-based models typically require larger sets. EOA is no exception; however, since the method can utilize qualitative activity data, it has an advantage in this respect compared to regression-based models. In contrast, molecular docking could be performed with no a priori knowledge of active or inactive compounds. However, to validate and/or fine-tune the docking procedure for a specific target, this information is mandatory. In this respect, the difference between docking-based and EOA-based virtual screening is how to use available information, namely, for method validation and fine-tuning only (docking) or for method development and validation (EOA). (2) Ligand-based methods do not provide information on binding modes. EOA per se cannot overcome this limitation. However, in cases where information on active and inactive compounds is available, combining EOA with docking presents an appealing strategy whereby the components that make up the scoring function could be used as descriptors for the derivation of EOA-based models. This amounts to re-adjusting the weights of these components in a target-specific, virtual screening-aware manner. Depending on the specific docking tool, the new weights could be used either for pose re-scoring or even for the docking itself. In addition to providing target-specific scoring functions, this approach will also serve to more reliably compare between the performances of EOA and docking and to unveil the true advantage in using an enrichment-aware metric for model derivation by segregating the effect of the optimized function from that of model locality/generality. Work along these lines is currently being conducted in our laboratory.
Despite these limitations, we view EOA-based models as viable and practical tools for virtual screening. While model derivation might be time consuming, particularly for large datasets characterized by multiple descriptors, using such models for virtual screening is highly efficient in terms of computational resources. Thus, such models can easily handle the currently available large collections of commercial and proprietary screening compounds, thereby increasing the probability of identifying good starting points for drug and material design efforts.

Datasets
Datasets for five protein targets, namely, acetylcholinesterase (ACES), human immunodeficiency virus type 1 protease (HIVPR), MAP kinase p38 alpha (MK14), urokinase-type plasminogen activator (UROK) and trypsin I (TRY1), were retrieved from the DUD-E database [56]. These targets represent four different protein families according to the Pfam classifier [57] (carboxylesterase, retroviral aspartyl protease and P-kinase for ACES, HIVPR and MK14, respectively; and trypsin for UROK and TRY1). All datasets contain compounds which were experimentally determined to be active or decoy compounds. The PDB structures associated with each target in the DUD-E database are: 1e66, 1xl2, 2qd9, 1sqt and 2ayw, respectively. In addition, for each target we selected an alternative structure as suggested by the DUD-E + database [47], namely, 1acj, 2pwc, 3o8t, 4fue and 3rxl. PDB codes, and the numbers of active and decoy compounds for each target are listed in Table 5. All compounds and proteins considered in this work were prepared by Schrodinger's LigPrep program [58] and the Protein Preparation Wizard program [59], respectively. Protein preparation consisted of addition of hydrogen atoms, completion of missing side chains/residues and assignment of correct protonation states for ionizable residues. Ligand preparation consisted of obtaining reliable conformations, tautomeric forms and protonation states (at pH = 7).
Next, 1-dimensional and 2-dimentional (1D and 2D) molecular descriptors for all compounds from all datasets were calculated by the Canvas program [60,61]. The resulting 750 descriptors (see Table S1 for a listing of the main descriptor types) were preprocessed by removing correlated (r 2 > 0.7), constant and nearly constant (i.e., constant over 70% of the compounds) descriptors. The remaining descriptors were normalized using z-score scaling.
For the purpose of developing enrichment optimizer algorithm (EOA) models (see below), four subsets, each consisting of active and decoy compounds, were randomly selected from each parent dataset. These subsets were randomly divided into training and validation sets in an 80%/20% ratio, and the unselected compounds served as test sets. As a result of this selection procedure, the test sets contained much smaller percentages of active compounds (0.5-0.7%). This was done on purpose in order to mimic real VS campaigns. The compositions of all datasets are listed in Table 5.

Enrichment Optimizer Algorithm (EOA) Algorithm
In our previous work, we presented a novel algorithm for the derivation of multiple linear regression (MLR) equations suitable for usage in virtual screening, based on the optimization of an enrichment-like objective function. We termed the new algorithm enrichment optimizer algorithm (EOA) [40]. Briefly, EOA accepts as input a set of L active compounds together with a set of O inactive (either known inactive or decoy) compounds characterized by a set of N molecular descriptors. The algorithm then derives an MLR equation to rank the compounds, counts the number of active compounds within the first L places of the ranked list and then uses a Monte Carlo/simulated annealing (MC/SA) optimizer to maximize this number in the space of the descriptors and their weights.
One drawback of the original EOA algorithm was the very limited range of values the objective function could take. This led to a high degree of redundancy manifested as multiple MLR equations leading to the same final value of the objective function with no means to further rank them. To address this problem, we introduced a post processing mechanism whereby redundant solutions with the highest value for the objective function were further scored based on the positions of active compounds that were ranked beyond the first L places and of inactive compounds that were ranked within the first L places. Higher (i.e., better) scores were allocated to solutions where both types of compounds had ranks closer to L.
In the present study we generated a new scoring function by combining the number of active compounds located within the first L places of the ranked list (primary score) with the results of the above-described post processing mechanism (secondary score). The secondary score was normalized to be within the range of 0-1 by means of an inverse sigmoid function. This was done in order to give preference to solutions (i.e., MLR models) that gave rise to the maximal number of active compounds within the first L places, irrespective of the precise locations of other active or inactive compounds. As before, the new scoring function was optimized using MC/SA in the space of the descriptors and their weights. Figure 3 presents the flowchart of the modified algorithm, and a detailed description is provided in the Supporting information. For each of the four subsets selected from each of the five parent datasets, three EOA models were derived using 7, 10 and 13 descriptors. In addition, we developed five more models, one for each target, using three descriptors selected from an initial pool of seven descriptors. This was done in order to test whether good models could be generated from a more limited set of descriptors which focuses on the description of protein-ligand interactions and key ligand ADME properties. The specific descriptors for the initial pool were selected based on the data in Table S2 (descriptors occurrences) and included PEOE3, HBD, MR1, ALOGP2, ALOGP7, ssCH2_Cnt and aaCH_Cnt (see Table S2 for short explanations of these descriptors). Thus, a total of 5 × 4 × 3 + 5 = 65 EOA models were derived in this work. A typical MC/SA run for the derivation of an EOA model consisted of 1,000,000 MC steps. Simulated annealing was implemented by means of a saw-tooth procedure whereby repeated annealing cycles were performed. In each cycle the RT term was linearly decreased from 1 to 0.01 in 0.01 intervals, running 400 MC steps per interval. The range of values of the RT term led to an acceptance rate of roughly 2-6%.

Molecular Docking
We chose to compare the EOA algorithm with some of the docking tools commonly used for VS, including AutoDock Vina [10] using the PyRx platform version 0.9 [62], Glide-SP [9] and GOLD [63]. All docking calculations were performed with default parameters. More specifically: For Glide, input conformations for docking were generated using the "Canonicalize input conformation" option, and the energy window for ring sampling was set to 2.5 kcal/mol. The best 400 poses per ligand were kept for energy minimization using the OPLS3e force field and the pose with the lowest, standard precision Glide gscore was kept. For GOLD, the population size for the genetic algorithm was set to 100 and the maximal number of operations per ligand to 100,000. The pose with the lowest CHEMPLP value was kept. For AD Vina, the exhaustiveness argument was set to 8, and the maximum number of binding modes to generate was set to 9. The pose with the highest Vina score was kept.
In the interest of performing a fair comparison, docking was performed on those compounds included in the test sets used for the evaluation of the EOA models. For each target, all test sets were docked against the two crystal structures (see Table 5).

Evaluation Metrics
The performances of EOA and of the different docking methods were evaluated using two common metrics, namely, area under the ROC curve (AUC) and enrichment at 1% of the library (EF 1% ). AUC measures the overall performances of the VS procedure, whereas EF 1% measures performances at early stages of the screening process. In the case of docking, for each target, the two metrics were calculated for each PDB structure separately. In addition, we implemented a consensus approach whereby each ligand was ranked based on its best docking score across the two structures and the metrics calculated from this consensus ranking.