1. Introduction
The development of target-fishing (TF) approaches, aimed at identifying the possible protein targets of a small molecule, remains an active topic in medicinal chemistry. Computational approaches conventionally focus on studying the interactions of multiple drug-like molecules with a single protein target, and they are successfully employed in virtual screening (VS) campaigns to identify novel ligands for the target of interest [
1,
2]. In contrast, computer-aided reverse screening methods, also known as in silico TF, are increasingly being used to identify the most likely protein targets of a query ligand [
3]. TF methods are highly valuable for predicting the bioactivity of a query small molecule, or for elucidating the mechanisms of action of therapeutically interesting compounds whose actual target is still unknown. Therefore, TF strategies have found multiple applications in the fields of drug discovery and biomedical research [
4].
Reverse screening approaches also represent important computational techniques for identifying new macromolecular targets of existing drugs or active compounds, and for analyzing their functional mechanisms or side effects [
5]. In fact, in silico TF strategies can find application in drug repositioning campaigns, thus saving part of the huge amount of money estimated to be required for the successful launch of a single new drug [
6,
7,
8], as well as in off-target effect predictions [
9,
10]. However, off-targets can also be responsible for the beneficial secondary effects of existing drugs and drug candidates. It has been proven that each known drug has, on average, six different molecular targets on which it exhibits activity [
11]. In this sense, polypharmacology, i.e., the ability of small molecules to interact with multiple protein targets, acquires particular interest for rationally designing more effective and less toxic drugs [
12]. Indeed, polypharmacology can be highly desirable in the treatment of cancer and other complex diseases that involve the functional modulation of multiple proteins [
13].
Due to the huge number of possible protein targets that a small molecule may interact with, experimental TF approaches are often impractical, since they involve time-consuming and, above all, expensive biological assays [
14]. Taking into consideration the continuous development of computational techniques, in silico TF strategies represent a valuable alternative to classic high-throughput screening (HTS) approaches. These computational methods may be divided into two classes according to their underlying principles: ligand-based methods such as shape-based screening and pharmacophore screening, and receptor-based strategies, namely reverse docking [
15]. In the absence of receptor X-ray structures, the above-mentioned ligand-based methods allow for the identification of potential protein targets of a query molecule, based on the hypothesis that similar ligands bind similar targets. Therefore, either the molecular structure or the shape of the query molecule, or its key pharmacophore features, are compared with those of compounds that are known to be active toward certain targets [
16]. Then, the known targets of the ligands that best satisfy the similarity criteria can be considered as potential targets of the query molecule. The advantage of ligand-based TF approaches lies in the fact that no structural knowledge of the target receptors is needed for their application. However, only protein targets for which active compounds have been experimentally identified and reported in the literature can be taken into account using these approaches. Moreover, the efficacy of these methods is hampered by the structural diversity between the query molecule and the known ligands; therefore, a true target of the query ligand is likely to be successfully predicted only if structurally related active compounds have already been discovered.
Conversely, receptor-based methods rely only on the structural information available for the potential target receptors. In fact, reverse docking consists of evaluating the possible binding mode of the query molecule in the binding sites of multiple protein targets, in order to identify proteins with strong binding affinities for the query ligand, which can thus be considered as its potential targets [
17]. Therefore, when exploiting the large number of crystallographic protein structures that have been determined to date [
18], such a receptor-based approach represents an effective strategy for the target prediction of a query ligand. Reverse docking approaches indeed require only the availability of a single structure for each target to be screened, and they can be applied, regardless of the presence/absence of the known ligands for the test targets. Moreover, reverse docking appears to be the most comprehensive method, since it considers the key elements of both molecular shape and the pharmacophore moieties of the query ligand in relation to the binding sites of the screened targets. Several examples of receptor-based TF approaches have been reported in the literature [
19,
20]. However, to the best of our knowledge, no proper evaluation of such an approach has been performed yet.
In the present study, taking into consideration the high potential of a reverse docking strategy in identifying the most likely target of a query ligand, an extensive performance assessment of docking-based TF approaches was carried out. For this purpose, a set of X-ray structures belonging to different targets was selected, and a dataset of compounds, including 10 experimentally active ligands for each target, was created. A target-fishing benchmark database was thus obtained and used to assess the reliability of 13 different docking procedures in identifying the correct target of the dataset ligands.
2. Results and Discussion
To assess the reliability of a docking-based TF strategy, we created a benchmark database including the X-ray structures of 60 different targets and 600 known active compounds. The selected targets and their ligands belonged to three datasets that have been broadly used in the validation of computer-aided drug design methods, i.e., the Directory of Useful Decoys (DUD) [
21], the Maximum Unbiased Validation (MUV) [
22], and ChEMBL datasets [
23] (see Materials and Methods for details). The 60 selected targets covered a wide range of protein types, since they comprised steroid hormone receptors (androgen, estrogen, glucocorticoid, mineralocorticoid and progesterone receptors), different enzymes, including many kinases and hydrolases, some reductases or phosphorylases, several transmembrane receptors coupled to G proteins (adrenergic, dopaminergic or muscarinic receptors), and other different protein targets (
Table 1). For each target, 10 active ligands were chosen among the experimentally active compounds reported in the corresponding datasets, considering some structural variability among them (where possible), in order to avoid any bias in docking results due to the potential structural similarities of the ligands.
As a first step, we evaluated the ability of every single docking procedure to identify the proper target of each dataset ligand. Therefore, the 600 compounds were docked into the X-ray structures of all of the selected targets, and for each ligand, the docking result obtained in its “correct” target was compared with those generated by the docking calculations in the other targets. This protocol was applied by using 13 different docking procedures (see Materials and Methods for more details), and as a result, a total of about 470,000 docking calculations (600 ligands × 60 targets × 13 procedures = 468,000) were taken into account. The docking score value relative to the best-ranked docking pose calculated for each ligand was considered as a parameter to compare docking results. Basically, the docking score is a measure of the ligand–protein binding affinity that is estimated by the docking methods, taking into account the number and type of favorable intermolecular interactions established by the molecule within the protein binding site in the predicted docking pose [
24,
25]. For each ligand, the 60 docking score values associated with the docking poses obtained with the 60 targets were employed to rank the potential affinities of the 60 targets for the ligand; then, the ranking position of the true target of the ligand was calculated and used for statistical evaluations. In fact, in the ideal case, the true target of the ligand should present the maximum affinity (and thus the highest rank), since the score associated with the docking pose of the ligand into its true target should be higher than those that are associated with the docking poses of the same ligand into different targets. To assess the performance of every single docking method in identifying the correct target of each ligand, and to compare the results obtained from different docking procedures, we calculated the median ranking position of the ligands’ own targets that were achieved by using each docking method (see Materials and Methods for more details).
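The ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy scores, and the tie-handling rule (ties counted against the true target) are assumptions made here for illustration, and the `higher_is_better` flag reflects the convention stated in the text, which would have to be inverted for programs that report more negative scores as more favorable.

```python
from statistics import median


def true_target_rank(scores, true_target, higher_is_better=True):
    """Return the 1-based rank of the true target for one ligand.

    scores: dict mapping target name -> docking score of the best pose.
    Assumption: ties are resolved pessimistically, i.e. any other target
    with a score equal to the true target's score is counted ahead of it.
    """
    ref = scores[true_target]
    if higher_is_better:
        ahead = sum(1 for t, s in scores.items() if t != true_target and s >= ref)
    else:
        ahead = sum(1 for t, s in scores.items() if t != true_target and s <= ref)
    return ahead + 1


def median_true_target_rank(results, higher_is_better=True):
    """results: list of (scores, true_target) pairs, one pair per ligand."""
    return median(true_target_rank(s, t, higher_is_better) for s, t in results)


# Hypothetical toy data: three ligands scored against four targets.
toy = [
    ({"AR": 9.1, "TK": 7.4, "D3": 6.0, "FXA": 5.2}, "AR"),
    ({"AR": 6.5, "TK": 8.8, "D3": 7.9, "FXA": 9.3}, "TK"),
    ({"AR": 7.0, "TK": 6.1, "D3": 5.5, "FXA": 8.2}, "D3"),
]
ranks = [true_target_rank(s, t) for s, t in toy]
median_rank = median_true_target_rank(toy)
```

In the benchmark, the same calculation would run over 60 scores per ligand and 600 ligands per docking procedure, giving one median rank per procedure.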
Figure 1 summarizes the main results obtained from this first docking analysis. Fred and Glide, using the standard precision (SP) method, seemed to be the best performing docking procedures, as they both showed a median ranking position of the true targets of 11.0, out of 60 total targets. This means that, considering the target fishing screens performed by using each of the 600 dataset ligands as the query molecule, the correct target was ranked 11th or better for half of the query ligands. On the contrary, Dock6 showed the worst performance, with a median ranking position of 20.0. Despite these differences, the results did not allow for the identification of a single promising docking procedure able to recognize the correct target of a ligand effectively. In fact, the calculated median values revealed that the different docking procedures placed the correct target, in the median case, at around the top 20–30% of the target dataset. Moreover, it is worth noting that a high standard deviation (SD), namely a large variability of ranking position values, was observed for every tested docking procedure, indicating that the obtained results were spread out over a wide range of values (
Figure 1). This may be ascribed to the intrinsic variability of the docking results in terms of docking poses and scores that are produced by single methods for different ligands and targets, as already observed in our previous validation analyses of docking procedures across different targets [
26,
27,
28].
The target-fishing performances of the different docking procedures were also evaluated in terms of the true positive rate (TPR) and false discovery rate (FDR), in order to better verify the quality of target prediction achieved by using the different docking methods (see Materials and Methods for details). Specifically, the TPR is a measure of the overall target prediction reliability. In fact, the higher the TPR obtained by using a certain docking procedure, the higher the number of predictions in which the correct target of the query ligand was ranked within the top 10% of the target dataset. Conversely, the FDR is a measure of target prediction inaccuracy, since the higher its value, the higher the overall number of incorrect targets ranked within the top 10% of the dataset. As shown in
Table 2, the percentage of TPR achieved by the tested procedures ranged from a minimum value of 22%, obtained using Vina, up to a maximum value of 36% achieved by Fred and Glide SP, which were confirmed to be the best performing procedures. This means that, by using these two docking methods, true target predictions were obtained for 36% of the query ligands (i.e., for 213 out of the 600 dataset ligands). On the other hand, high FDR values were found for all the different docking procedures, ranging between 67% and 81%. Again, Fred and Glide SP showed the best results, being the only two procedures with an FDR below 70%; however, these values still highlight a certain overall inaccuracy of the target prediction, which is consistent with the high standard deviation that is observed in the results of the different docking methods. Overall, this analysis confirmed the results highlighted by the first evaluation, based on target ranking.
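As a rough illustration of the TPR described above, the following Python sketch counts the ligands whose true target falls within the top 10% of the ranking. It assumes that the top 10% of 60 targets corresponds to the first 6 positions; the function names and the toy rank list are hypothetical. The FDR is left out, since its exact definition is given in the Materials and Methods.

```python
def top_cutoff(n_targets, percent=10):
    """Number of ranking positions covered by the top `percent` of targets."""
    return max(1, n_targets * percent // 100)


def true_positive_rate(true_target_ranks, n_targets, percent=10):
    """Fraction of ligands whose true target is ranked within the top `percent`.

    true_target_ranks: list of 1-based ranks, one per query ligand.
    """
    cutoff = top_cutoff(n_targets, percent)
    hits = sum(1 for r in true_target_ranks if r <= cutoff)
    return hits / len(true_target_ranks)


# Hypothetical rank list reproducing the count quoted in the text:
# 213 ligands with the true target in the top 6, 387 ligands outside it.
toy_ranks = [3] * 213 + [20] * 387
tpr = true_positive_rate(toy_ranks, n_targets=60)
```

With these toy ranks the TPR is 213/600, i.e. about 35.5%, which matches the 36% reported for Fred and Glide SP after rounding.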
To evaluate whether combining the results of multiple docking procedures could lead to an improvement in target prediction capability, a consensus docking analysis [
29] was performed. As shown by previous results, a consensus docking approach can be profitably used to predict reliable ligand binding dispositions [
30], and to identify new hit compounds in virtual screening strategies [
31]. In this instance, we were interested in the effects of a consensus docking analysis on the ability to identify the correct targets of a ligand. The docking score was again used as the evaluation parameter; thus, a consensus scoring approach was basically followed in this analysis. In particular, we calculated the number of docking methods (among the 13 methods used) that were able to rank the proper target of each ligand within the top-scored 10% of the total targets, defined as the consensus level (see Materials and Methods for more details). The same analysis was applied to the 59 unrelated targets of each ligand, and for all of them, the consensus level was also calculated. The ranking position of the proper target of each ligand with respect to the other targets was then estimated, based on the consensus level obtained. As shown in
Table 2, the consensus docking analysis confirmed the previously obtained results. Basically, the combination of the results obtained by the 13 docking procedures did not yield an actual improvement in target prediction ability, although it performed as well as the best methods tested, achieving a TPR of 36% and an FDR of 67%. Moreover, considerable variability was observed among the results achieved for the different targets. For instance, AR, SAHH, and TK were identified as being the most likely targets of their corresponding active ligands, being ranked within the first two positions of the targets dataset (
Figure 2). Conversely, INHA, D3, and FXI were only ranked among the last 15 positions of the targets dataset (rank 45, 51 and 60 respectively); therefore, they were not identified as being possible targets of the corresponding ligands.
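The consensus level defined above can be sketched in a few lines of Python. This is only an illustrative reading of the description, not the published protocol: the function names, the toy ranks, and the pessimistic treatment of ties are assumptions. For compactness, the toy example uses 4 targets, 3 methods, and a 50% cutoff instead of the 60 targets, 13 methods, and 10% cutoff of the study.

```python
def consensus_levels(ranks_by_method, n_targets, percent=10):
    """Consensus level of every target for one query ligand.

    ranks_by_method: dict mapping method name -> dict of target -> rank (1 = best).
    Returns a dict mapping target -> number of methods that rank it within
    the top `percent` of the target list.
    """
    cutoff = max(1, n_targets * percent // 100)
    levels = {}
    for ranks in ranks_by_method.values():
        for target, rank in ranks.items():
            levels[target] = levels.get(target, 0) + (1 if rank <= cutoff else 0)
    return levels


def consensus_rank(levels, true_target):
    """1-based rank of the true target by consensus level.

    Assumption: targets tied with the true target are counted ahead of it.
    """
    ref = levels[true_target]
    return 1 + sum(1 for t, lv in levels.items() if t != true_target and lv >= ref)


# Hypothetical toy ranks from three docking methods for a single ligand.
toy = {
    "method_A": {"AR": 1, "TK": 2, "D3": 3, "FXA": 4},
    "method_B": {"AR": 2, "TK": 3, "D3": 1, "FXA": 4},
    "method_C": {"AR": 1, "TK": 4, "D3": 3, "FXA": 2},
}
levels = consensus_levels(toy, n_targets=4, percent=50)
```

Here AR is ranked within the cutoff by all three methods and therefore reaches the maximum consensus level, while the other targets are each supported by a single method.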
Consistent with these results, we observed a clear difference among the consensus level values that were achieved by the different targets (
Table 3). We envisioned that this diversity in docking performances might stem from the different types of ligands and proteins herein taken into account. Based on these considerations, we investigated whether some properties of the targets and/or ligands could affect docking results, thus influencing the ability of the applied docking procedures (either alone or in combination) to identify the true target of a ligand.
Regarding ligands, both the molecular weight (MW) and the number of heavy atoms were considered, in order to evaluate whether the sizes of the different molecules could affect docking results. Moreover, the effects of charged moieties, hydrogen bond acceptors, and hydrogen bond donors in the dataset ligands were evaluated. To verify whether the consensus level could be positively or negatively affected by the conformational freedom of a molecule, we calculated the number of aromatic heavy atoms, and the fraction of sp3 carbons in all the tested compounds. Finally, we evaluated the effects of the ligand lipophilicity on the consensus level. For this purpose, the consensus log P value of the dataset ligands, which combines five different log P calculation methods, was obtained through the Swiss ADME web tool [
32], as previously performed [
33]. The median value of each property, calculated for the 10 ligands belonging to each target, was related to the median consensus level that was achieved by the same target. As shown in
Figure 3, no evident link was observed between the eight considered ligand properties and the consensus level that was reached by targets. Concerning the net charge of the ligands (
Figure 3E), it is worth noting that a high consensus level (from 10 to 12) frequently corresponded to clusters of ligands characterized by a common charged group (all negative or positive), suggesting that such a group potentially represents an essential feature for the ligand–protein interaction, and it has an effect on ligand binding affinity. However, no linear trend that was able to justify a clear relationship between the charge and the consensus level was observed (see also
Figures S1–S4 in the Supplementary Materials).
Regarding the targets, the volumes of the binding sites were taken into consideration, with the aim of evaluating whether a different size of the target binding pockets could affect the docking results. In this case, an interesting trend was observed, since the consensus level tends to be higher for targets with small and mainly closed binding pockets.
Figure 4 shows the results obtained from this analysis. In particular, as the volume of the binding sites increased, and the binding pockets became more open and solvent-accessible, the consensus level decreased, emphasizing the tight connection between these properties and the target prediction ability of the docking procedures. The 10 targets that showed a higher consensus level (open circles enclosed within the dashed square in
Figure 4) belonged to the class of steroid hormone receptors (androgen, estrogen, glucocorticoid, mineralocorticoid and progesterone receptors) and other classes of proteins (COX1, HIVRT, RXR, SAHH, TK) that all shared small and mainly closed binding sites. Conversely, a few targets (closed dark circles in
Figure 4) significantly diverged from the common linear trend, namely NA, ER_ANT, FXA, GART, trypsin, and PNP. For these proteins, the reported consensus level was not found to be related to the target properties. However, we observed that the reference active ligands of all of these targets shared a common structural moiety. For instance, the NA and GART ligands presented a negatively charged moiety, while the ER_ANT, FXA and trypsin ligands were characterized by a positively charged group. As shown in
Figure 4, a high consensus level (8 or above) was achieved by all of these targets; we thus hypothesized that these results were most probably due to the presence of the common charged portion that was shared by all active ligands of the same target, which probably affected the docking results (see also
Figure 3E). In contrast, the PNP ligands did not share a charged moiety; nevertheless, they all presented a common structural portion that might have influenced the docking results as well, although in a negative way. By excluding these six presumed outliers, a correlation coefficient of 0.59 between binding site volume and consensus level was obtained, with a P-value < 0.01.
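The correlation analysis with outlier exclusion can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient is assumed here; the function names, target labels, and numerical values are all hypothetical and serve only to show the mechanics.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_without_outliers(volume, consensus, outliers):
    """Correlate binding site volume with consensus level, skipping outliers.

    volume, consensus: dicts mapping target name -> value.
    outliers: collection of target names to exclude.
    """
    kept = [t for t in volume if t not in outliers]
    return pearson_r([volume[t] for t in kept], [consensus[t] for t in kept])


# Hypothetical values: four targets on an inverse trend plus one outlier
# with a large binding site but a high consensus level.
volume = {"T1": 200, "T2": 400, "T3": 600, "T4": 800, "OUT": 900}
consensus = {"T1": 11, "T2": 8, "T3": 5, "T4": 2, "OUT": 10}

r_all = correlation_without_outliers(volume, consensus, outliers=())
r_kept = correlation_without_outliers(volume, consensus, outliers={"OUT"})
```

In this toy case the coefficient is negative, because the consensus level falls as the volume grows, and removing the outlier strengthens the relationship. The value reported in the text is given without a sign.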
Based on these results, the consensus docking-based TF procedure seemed to be effective for identifying the true targets of a ligand when the corresponding receptor was characterized by a small and mainly closed binding site. In order to verify the reliability of these results, we calculated the number of ligands, among the 600 dataset compounds, for which the targets with small binding sites achieved a high consensus level (above or equal to 10). In this way, we wanted to check whether the results of the consensus docking-based TF procedure were affected by a bias, namely whether high consensus levels were always achieved by targets with closed binding sites, regardless of whether the query molecule was a true active ligand of that target or a decoy. However, we observed that the targets with small binding sites obtained a high consensus level only for a maximum of 50 out of the 600 ligands, corresponding to less than 10% of the cases (
Figure 5). Moreover, we verified that no single protein target reached a median consensus level (calculated by computing the median value obtained for the whole dataset of ligands) higher than 4. These analyses confirmed the reliability of the consensus docking-based TF protocol, at least for predictions involving targets characterized by a small or closed binding site. In practice, our evaluations demonstrated that if the consensus docking-based TF protocol is applied for identifying the possible targets of a certain query molecule, and a receptor characterized by a small or closed binding site is obtained among the top-scored targets, such a prediction should be considered as reliable, and the query molecule is likely to be an actual ligand of the identified target. On the contrary, the prediction of a protein presenting a large or highly solvent-exposed binding site as a possible target of the query molecule should be taken with caution, since it is probably not sufficiently reliable.