1. Introduction
Scoring functions (SFs) are widely used tools in early-stage drug discovery [1,2,3]. However, their accuracy is currently moderate, and improving it is clearly needed to make the drug discovery process less risky and less demanding of experimental resources. SFs can be categorized into four classes: (a) force-field or physics based; (b) empirical; (c) knowledge based (statistical); and (d) machine learning or feature based [4]. Other things being equal, the computational performance increases in the series (a)–(b)–(c)–(d), whereas the degree of generalization decreases in the same series. The classical SFs of types (a)–(c) have found, and will continue to find, numerous applications in drug discovery [5,6,7,8], despite all the known difficulties [1,9]. In many cases those SFs take into account (either explicitly or implicitly) the interactions that drive ligand–receptor affinity, including electrostatic complementarity, hydrogen bonding and hydrophobic interactions [10]. Whereas the traditional SFs of the first three classes seem to have reached their accuracy limits [11,12], the main recent focus is on the fourth class, the machine learning scoring functions [3]. Inspired by successes in image analysis, Big Data of social media and related fields, machine learning approaches have given a new impetus to SF development. Higher values of accuracy metrics have been reported for machine learning-based SFs in the literature [12,13].
Although machine learning approaches have brought a fresh impulse to the training of SF models, the increased flexibility of those models has introduced a new concern to the field: a greater ability of models to overfit [14,15,16,17]. Whereas this problem was less relevant to earlier, coarser SF models, machine learning approaches require additional effort to ensure that the resulting models are not overfitted to the available amount of input data. This is a fundamental problem in drug discovery, since the amount of reliable data relative to the available chemical space, as pointed out by Bender et al. [18,19], is orders of magnitude smaller than in the fields from which machine learning approaches originated and where they have had significant success. Thus, significant effort is needed to obtain a robust, non-overfitted model using such flexible machine learning tools [15,20].
The field in which SFs operate is intrinsically complicated: the free energy of ligand–receptor binding is affected by many different factors, whose combination differs between ligand–receptor pairs. For instance, a proper account of intramolecular ligand conformations and entropic terms was reported to be crucial [21,22]. The explicit account of water molecules is another crucial factor for certain complexes [23,24,25], but it cannot be straightforwardly performed for all simulation scenarios. The intrinsic mobility of the binding site, or of its parts, is another source of deviation of scoring predictions from the experimental affinity or activity, for which several approaches to sampling protein conformations have been proposed [26]. These difficulties cannot be straightforwardly solved without significantly complicating the SF and, hence, decreasing its computational efficiency. Computational efficiency is the cornerstone of the main application of an SF in drug discovery practice: the early stages of drug discovery, where fast screening is crucial to focus the attention of researchers on a tractable fraction of large datasets of potential molecules.
Although a direct account of the complex free energy effects is cumbersome, an indirect account is quite possible, as illustrated by the success and applicability of target-specific SFs [27]. In these, the parameters of the SF are specifically tuned to better reproduce the ligand–receptor interactions involving a single receptor or a limited set of receptors. Thus, the specifics of the interactions, governed by the specifics of the receptor, are taken into account implicitly. The same tendency of an SF to be better parameterized for one class of targets than for others is known to be one of the main difficulties limiting the accuracy of “reverse screening” or “target fishing” [8], in which a target is predicted for a given ligand. The implicit bias of the SF towards certain types of ligand–receptor complexes (in terms of the interactions involved) means that, for other types of complexes, the prediction of the binding free energy is systematically worse. In such a case, the choice of target for a ligand, based on the results of the virtual screening of a panel of targets, becomes complicated, since the scores that the SF produces seem to be dominated by some types of interactions. Yet another indication that an SF might be biased towards certain types of interactions is the better performance (in terms of robustness of predictions) of “consensus scoring” [28], in which several different SFs contribute to the final score value. In this way, the reduced accuracy of one SF for certain ligand–receptor complexes is offset by the other SFs, for which the same type of complex is statistically less likely to be problematic as well.
In contrast to target-specific SFs, ligand-specific SFs seem to be poorly represented in the literature as everyday practical tools [29]. This can be easily explained by comparing the diversity and cardinality of the spaces of receptors and ligands. The possible diversity, and hence the accessible chemical space of ligands, is immense [30]. Thus, it is not only difficult to sample its specific subspaces adequately, but overfitting to the specifics of the ligands included in the training set is also far more likely. The same applies to the descriptors/features defining ligand properties. The cardinality of the feature space that could reasonably explain the observed differences between ligands of the chemical space is also large. Therefore, a large number of structural features are required to discern the properties of all ligands, even in the drug discovery related subspace. On the other hand, only a few distinct types of interactions, which are observed experimentally and have physics-based explanations, are known and constantly used by medicinal chemists [10]. These are, e.g., the well-known hydrogen bonds, hydrophobic and aromatic interactions. Those interactions are not only readily interpretable, but also appear to largely define the total energy of the ligand–receptor interactions, which explains their wide applicability in practice at both qualitative and quantitative levels. On the one hand, the terms of the known SFs were in many cases specifically chosen to describe these basic interactions well (though in a high-throughput manner). On the other hand, to the best of our knowledge, the extent to which those interactions are properly accounted for has not been explicitly studied previously. In a broad formulation, the question can be cast as: to what extent do the current SFs describe these basic interactions? At a more technical level, the question is which features of the ligands (responsible for possible interactions with the receptor) are not fully accounted for in a given SF, and hence could be subject to focused optimization in order to arrive at a more accurate SF. The main assumption about the possibility of improving the existing SFs is that the means of increasing the accuracy should not require additional computational overhead; otherwise, the scope of applicability of the SFs would be limited. Thus, in the simplest and most advantageous cases, only focused parameter tuning of an SF might be required to achieve the goal.
The ability to detect the deficiencies of an SF in describing certain types of interactions represented by ligand features therefore paves the way for systematic studies aimed at improving the current SFs, and perhaps for devising ways to develop new ones with increased accuracy. The same approach can also be used hierarchically. Once the ligand features responsible for the basic types of interactions are well accounted for, the significance of more subtle and/or rare ligand features can be studied. For example, halogen bonding (XB) has received much attention during the last decades but is definitely not one of the main driving forces in drug discovery [31,32,33]. However, the proper account of XB by SFs might be crucial for the hit-to-lead or, especially, lead optimization stages. Similarly, one can study various types of more specific interactions represented by certain features of the ligand. Therefore, the enhancement can be performed systematically, following the natural priorities given by the significance/occurrence of the effects being taken into account.
In this work we thus hypothesized that the specific features of the ligands, corresponding to the interactions well appreciated in medicinal chemistry (e.g., hydrogen bonds, hydrophobic and aromatic interactions), might be responsible in part for the remaining SF error. This hypothesis provides a direction for the rational and systematic improvement of the accuracy of SFs. We also tested the ability of the proposed approach to assess the significance of the halogen bonding effect and how well it is accounted for.
In what follows, we first describe the choice of the dataset used in the study. Then, the features of the ligands relevant for the description of the basic interactions are defined at the structural level. The choice of a representative panel of SFs is explained next. After that, a set of correlation studies is performed to reveal how the presence of the features in ligands affects the description of the experimentally measured ligand–receptor affinities. Then, the correlation of the residual errors in describing the experimental affinities (by each of the SFs in the panel) with the presence of chemical features is analyzed. Finally, several useful interpretations of the results in a broader context of drug discovery are given.
3. Discussion
The abovementioned statistical results, combined with additional reference information (Table 9), admit a reasonable interpretation and discussion, which may help to advance the field of SF development for drug discovery.
3.1. AutoDock 4.2
It was shown that the AutoDock 4.2 SF tends to overestimate polar and ionic interactions (Figure 2) and thus requires a correction of the opposite sign for those components (Figure 3). This is due to the explicit treatment of electrostatic (Coulomb) interactions modeled by means of Gasteiger partial charges.
Gasteiger partial charges are known for their ability to predict and model chemical properties (such as an inductive effect). However, they are also known to be too low in magnitude (compared to charges that reasonably reproduce the electrostatic potential at the HF/6-31G* level) for use in molecular mechanics applications. It was also explicitly shown [20] that the use of charge models directly reproducing the HF/6-31G* molecular electrostatic potential (MEP), in combination with robust regression analysis and outlier exclusion, improves the ability of AutoDock 4.2 to reproduce experimental pK values. We assume this was not only due to the robust regression analysis of the AutoDock 4.2 energy terms. Both the AM1-BCC and RESP charge methods used in that work are capable not only of quantitatively reproducing the reference MEP, but also of redistributing the charge density in a qualitatively correct way compared to the Gasteiger charges, which should be especially noticeable in the case of formally charged molecules. We hypothesize that the main inconsistency in the use of Gasteiger charges for formally charged species lies in the combination of low-magnitude partial charges on neutral groups with formally charged groups whose charge values are integers. Thus, there is no single scaling factor for these two types of groups and their respective charges. Therefore, more consistent charges between the formally charged and neutral parts of a molecule should lead to a more consistent correlation with the experimental activities.
Another point is that none of the tested scoring functions other than AutoDock 4.2, ∆Vina RF20 and NNScore 2.0 explicitly takes electrostatic interactions into account; however, the others perform at the same level or even better in terms of pK reproduction metrics (R2, SD, Table 3). The work also showed that the most important (Figure 1) and most undervalued (Figure 3) interactions are hydrophobic in nature. Thus, the question arises: is it necessary to explicitly take into account electrostatic interactions at all? It is a known concept that directional interactions, electrostatic ones in particular, are necessary not to increase affinity, but rather to ensure specificity and selectivity of binding with respect to decoy receptors. In any case, the significance of electrostatic interactions requires further detailed study.
3.2. AutoDock Vina and AutoDock VinaXB Halogen Bonding
AutoDock VinaXB did not show any improvement over the original AutoDock Vina. There were only 10 cases (out of 42 ligands containing heavy halogens) that exhibited non-negligible halogen bonding as assessed by AutoDock VinaXB (Table 7). However, even in these cases, the difference between the pK values predicted by AutoDock Vina and AutoDock VinaXB was in the range of 0.055–0.19 pK units (corresponding to a factor of 1.135–1.55 in Kd/Ki), which is considered an insignificant change and does not actually lead to any increase in accuracy (Table 7).
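The correspondence between a pK difference and a factor in Kd/Ki quoted above follows from pK = -log10(K). A minimal Python sketch of this arithmetic is given below; the function name is ours and is introduced only for illustration.

```python
def affinity_fold_change(delta_pk: float) -> float:
    """Factor by which Kd (or Ki) changes for a given difference in pK units.

    Since pK = -log10(K), a difference of delta_pk corresponds to a
    ratio of 10 ** |delta_pk| between the two K values.
    """
    return 10.0 ** abs(delta_pk)


# The two limits of the pK difference range quoted in the text.
for delta in (0.055, 0.19):
    factor = affinity_fold_change(delta)
    print(f"delta pK = {delta:.3f} -> factor of {factor:.3f} in K")
```

For comparison, a full pK unit corresponds to a tenfold change in K, so the differences discussed here are small on that scale.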
There are two feasible hypotheses. The first is that AutoDock VinaXB is incapable of properly and fully accounting for halogen bonding. This hypothesis is partially supported by the results of the Free-Wilson analysis. The second hypothesis is that it is not the halogen bonding itself that is important, but rather other molecular properties of the ligand that are affected by the presence of a heavy halogen in the molecule (e.g., hydrophobicity). In any case, the importance of including halogen bonding in scoring functions requires further research in order to narrow the gap between the general interest in XB and its proper representation in SFs.
3.3. X-Score
It was shown that the X-Score SF predictions themselves may be well described by Free-Wilson correlations (R2 = 0.67), which is not surprising considering that X-Score uses a linear combination of terms that account for different interactions. The latter are well described by the chemical features present in ligands. However, X-Score (R2 = 0.41) goes beyond the statistics derived from a simple Free-Wilson correlation with the reference (R2 = 0.36), apparently by using a finer-grained representation of the interactions that also includes the receptor part. Despite its simplicity, X-Score performed as one of the best SFs in our study, which is consistent with the results of the scoring power test from the CASF-2016 Update study. It should also be noted that X-Score does not contain specific electrostatic terms other than the hydrogen bonding term and is still able to reproduce the experimental affinity well.
3.4. ∆Vina RF20 and NNScore 2.0
Both ∆Vina RF20 and NNScore 2.0 are machine learning SFs using the corrections based on AutoDock Vina calculations. However, they use completely different approaches to these corrections, resulting in a completely different quality of pK estimates.
∆Vina RF20 was shown to be superior (R2 = 0.67) among the tested SFs. Qualitatively, this is due to the correctly estimated (Figure 3) contribution of hydrophobic descriptors (especially HP1 and Hal), which were underestimated by the other scoring functions in this test. Ultimately, ∆Vina RF20 does not gain any additional accuracy from the Free-Wilson correction. This suggests that the mere presence of structural features in a ligand is not enough to improve the statistics and that finer corrections are needed.
At the same time, the NNScore 2.0 estimates were rather contradictory regarding the contributions of the chemical features (Figure 2). It overestimated features that are not important for reproducing the reference pK (e.g., HBD1, HBA) and, at the same time, underestimated important ones (e.g., Hal, PICat, HP3). It appears that the main reason the NNScore 2.0 predictions are still reasonable (R2 = 0.41) is that NNScore 2.0 is able to capture most of the hydrophobic interactions (HP1, PIPI, HP2) that have been shown to be the most important for the selected set of complexes. Another possible reason is that the ensemble of models used in NNScore 2.0, even if they produce significantly different predictions, can be combined favorably in a consensus scoring model.
While ∆Vina RF20 may serve as the best example in the ML class, NNScore 2.0 can serve as an example of what to expect on average. By itself, using an ML approach does not automatically increase the precision and reliability of the results. Only a careful and rigorous approach to balancing generalization and precision provides improvements. We argue that the same applies to the modification of the functional form and the parameterization of classical SFs.
3.5. DSX
DSX is a knowledge-based SF that does not aim at reproducing the reference energies, but instead provides a pure score. However, it can predict the experimental pK using a linear correlation at the same quality level (R2 = 0.35, Table 3) as the scoring functions specifically designed for that purpose. Thus, the potential non-linearity of the DSX scores did not seem to offer any advantage under our experimental conditions. On the other hand, the good ranking power of DSX seems to be well justified by its decent (compared with the other SFs) ability to score diverse ligand–receptor complexes.
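A linear mapping of a raw score onto pK, of the kind referred to above for DSX, can be sketched as follows. This is only an illustration under stated assumptions: the scores and pK values are made-up numbers rather than data from this study, `fit_score_to_pk` is a hypothetical helper, and we assume that a more negative score means stronger predicted binding.

```python
import numpy as np


def fit_score_to_pk(scores, pk_exp):
    """Fit pK ~ slope * score + intercept and return (slope, intercept, R2)."""
    scores = np.asarray(scores, dtype=float)
    pk_exp = np.asarray(pk_exp, dtype=float)
    # Degree-1 polynomial fit; coefficients are returned highest power first.
    slope, intercept = np.polyfit(scores, pk_exp, deg=1)
    # For a one-variable linear fit, R2 equals the squared Pearson correlation.
    r = np.corrcoef(scores, pk_exp)[0, 1]
    return slope, intercept, r ** 2


# Made-up, perfectly linear toy data (assumption, for illustration only).
scores = [-120.0, -100.0, -80.0, -60.0, -40.0]
pk_exp = [8.0, 7.0, 6.0, 5.0, 4.0]

slope, intercept, r2 = fit_score_to_pk(scores, pk_exp)
print(f"pK = {slope:.3f} * score + {intercept:.3f}, R2 = {r2:.2f}")
```

Because only a slope and an intercept are fitted, any monotonic but non-linear relation between score and pK would lower R2 without necessarily affecting the ranking of ligands.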
Another, more technical point is that the proposed approach to revealing the ligand features that are insufficiently described by an SF was shown to be applicable not only to SFs that specifically aim to reproduce the free energy of binding, but also to the general type of SFs that give a “score” monotonically associated with the free energy.
3.6. ∆SAS
∆SAS was selected as perhaps the simplest model against which to compare “real” scoring functions. It does not explicitly capture any contributions other than a simple change in surface area upon complex formation. However, when applied to a ligand in an already optimal position (in our case, the position extracted from crystal structures), it characterizes the areas of optimal contacts and, thus, should correlate with the most important features. Indeed, the ∆SAS value was shown to be reproduced significantly better by the Free-Wilson correlation (R2 = 0.80) than the values of the other scoring functions. The ∆SAS value, as expected, strongly correlates with the most important hydrophobic features (HP1, HP2, PIPI), so it requires practically no correction for them (Figure 3). However, some polar features (PICat, HBA) and halogen features (especially F) require adjustments.
The abovementioned findings further support the view that hydrophobic interactions are a major contributor to ligand–receptor affinity. Of course, as shown in the CASF-2016 Update study, this score is not sufficient to distinguish between different binding modes, which requires correct consideration of directional interactions.
∆SAS is the second SF (along with DSX) in our study that illustrates the usefulness of our approach for non-energy-based SFs.
3.7. The Role of Fluorine in Ligands
The fluorine atom was used as a separate feature, which became statistically significant for correlation with affinity. This reinforced, among other things, our initial assumption that the fluorine atom is commonly used in the later stages of drug design, typically to improve the ADMET properties. Despite the fact that the fluorine atom is not considered as a fragment participating in specific intermolecular interactions, the calculated value of the correlation between the presence of fluorine and experimental activity was at a good level during the study. The reason for this may be that since ADMET properties are adjusted late in the drug discovery process, the presence of a fluorine atom in the compound may indicate that the ligand is already well optimized in other directions since it has managed to reach this stage. Thus, the inclusion of fluorine atoms should not be recommended as a prospective tool to enhance affinity, as it is more of an artifact of the analyzed dataset.
3.8. Free-Wilson Correction
It was illustrated that Free-Wilson analysis (benchmark) of scoring functions can be used for several purposes. First, it can be used to reveal which chemical features (i.e., interaction motifs) are actually important in reproducing the reference pK. Second, the pK values predicted by the scoring functions can also be decomposed into the contributions of chemical features, so that shortcomings in the scoring function predictions can be pre-assessed. Finally, it can be used to correct the pK predictions by accounting for chemical features that are underestimated by the original scoring function.
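The first two uses can be illustrated with a minimal Free-Wilson style decomposition: an additive least-squares model over presence/absence features is fitted once to the reference pK and once to the SF-predicted pK, and the per-feature contributions are compared. The sketch below makes several assumptions that are not taken from this study: the data are synthetic and noise-free, only three feature labels (HP1, HBA, Hal) are used, and `fit_free_wilson` is a hypothetical helper rather than the implementation used here.

```python
import numpy as np

FEATURES = ["HP1", "HBA", "Hal"]  # feature labels borrowed from the text


def fit_free_wilson(feature_matrix, values):
    """Least-squares fit: values ~ intercept + sum_j contribution_j * feature_j."""
    X = np.asarray(feature_matrix, dtype=float)
    y = np.asarray(values, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # first column = intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


# Synthetic presence/absence matrix: rows = ligands, columns = FEATURES.
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [1, 0, 1], [0, 0, 0]])

# Synthetic "reference" and "SF-predicted" pK values (assumed contributions).
pk_ref = 5.0 + X @ np.array([1.0, 0.2, 0.5])
pk_sf = 5.0 + X @ np.array([0.4, 0.6, 0.5])

_, contrib_ref = fit_free_wilson(X, pk_ref)
_, contrib_sf = fit_free_wilson(X, pk_sf)

# Positive gap: feature undervalued by the SF; negative gap: overvalued.
gap = contrib_ref - contrib_sf
for name, g in zip(FEATURES, gap):
    print(f"{name}: reference minus SF contribution = {g:+.2f}")
```

In this toy setting the hydrophobic feature comes out as undervalued and the acceptor feature as overvalued by the hypothetical SF; with real data the fit would be noisy and the significance of each contribution would need to be assessed.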
The proposed benchmark was tested in practice on several scoring functions (Table 10) and on the set of CASF-2016 complexes. The benchmark helped us to rank the chemical features in order of their actual importance (hydrophobic interactions tend to be the most important).
It was shown that the use of the Free-Wilson model, which takes these features into account on top of the scoring function, can generally improve the quality of the prediction. As a general rule, the less accurate the original model, the greater the improvement that can be obtained using the Free-Wilson correction (Table 9); and vice versa, the more precise and complex the initial scoring function, the less the Free-Wilson approach can contribute to its quality. This is especially noticeable in the case of ∆Vina RF20. It has also been shown that the predictions of some scoring functions may themselves correlate well with the Free-Wilson features, so they will also not be improved by such a correction.
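One way to picture a correction applied on top of a scoring function is to regress the residuals (experimental pK minus SF-predicted pK) on the presence/absence features and add the fitted term back to the prediction. The sketch below is an assumption-laden illustration, not the protocol of this study: the data are randomly generated, the hypothetical SF deliberately ignores one feature, R2 is computed as a coefficient of determination on the same data used for fitting, and all names are ours.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def residual_feature_correction(features, pk_sf, pk_exp):
    """Fit the SF residuals on binary features and return corrected pK values."""
    pk_sf = np.asarray(pk_sf, dtype=float)
    pk_exp = np.asarray(pk_exp, dtype=float)
    design = np.column_stack([np.ones(len(pk_exp)),
                              np.asarray(features, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, pk_exp - pk_sf, rcond=None)
    return pk_sf + design @ coef, coef


# Randomly generated toy data (assumption, not data from the study).
rng = np.random.default_rng(0)
n_ligands, n_features = 200, 4
features = rng.integers(0, 2, size=(n_ligands, n_features))
true_contrib = np.array([1.0, 0.5, -0.3, 0.8])
pk_exp = 5.0 + features @ true_contrib + rng.normal(0.0, 0.5, n_ligands)

# Hypothetical SF that misses the contribution of the first feature.
pk_sf = 5.0 + features[:, 1:] @ true_contrib[1:] + rng.normal(0.0, 0.5, n_ligands)

pk_corrected, coef = residual_feature_correction(features, pk_sf, pk_exp)
r2_before = r_squared(pk_exp, pk_sf)
r2_after = r_squared(pk_exp, pk_corrected)
print(f"R2 before correction: {r2_before:.2f}, after: {r2_after:.2f}")
```

Because the correction is fitted and evaluated on the same complexes, the improvement shown by such a sketch is optimistic; a held-out or cross-validated estimate would be needed to judge its practical value. An SF that already accounts for the features correctly leaves residuals with no feature-dependent signal, so the fitted correction is then close to zero.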
The proposed benchmark also helped to reveal inaccuracies in the accounting of these features by the selected scoring functions and, thus, outlined further directions for research and improvement.