Article

Three-Dimensional Biologically Relevant Spectrum (BRS-3D): Shape Similarity Profile Based on PDB Ligands as Molecular Descriptors

1 State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan 430070, China
2 Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Molecules 2016, 21(11), 1554; https://doi.org/10.3390/molecules21111554
Submission received: 12 October 2016 / Revised: 10 November 2016 / Accepted: 11 November 2016 / Published: 17 November 2016
(This article belongs to the Section Computational and Theoretical Chemistry)

Abstract

The crystallized ligands in the Protein Data Bank (PDB) can be treated as the inverse shapes of the active sites of the corresponding proteins. Therefore, the shape similarity between a molecule and PDB ligands indicates the likelihood that the molecule binds to the corresponding targets. In this paper, we propose a shape similarity profile that can be used as a molecular descriptor for ligand-based virtual screening. First, through three-dimensional (3D) structural clustering, 300 diverse ligands were extracted from the druggable protein–ligand database, sc-PDB. Then, each of the molecules under scrutiny was flexibly superimposed onto the 300 ligands. Superimpositions were scored by shape overlap and property similarity, producing a 300-dimensional similarity array termed the “Three-Dimensional Biologically Relevant Spectrum (BRS-3D)”. Finally, quantitative or discriminant models were developed with the 300-dimensional descriptor using a machine learning method (support vector machine). The effectiveness of this approach was evaluated using 42 benchmark data sets from the G protein-coupled receptor (GPCR) ligand library and the GPCR decoy database (GLL/GDD). We compared the performance of BRS-3D with other state-of-the-art 2D and 3D molecular descriptors. The results showed that models built with BRS-3D performed best for most GLL/GDD data sets. We also applied BRS-3D in histone deacetylase 1 inhibitor screening and GPCR subtype selectivity prediction. The advantages and disadvantages of this approach are discussed.

1. Introduction

Computer-aided drug discovery includes structure-based and ligand-based methods. Over the last few decades, advances in both scoring algorithms and computing power have made structure-based drug discovery a popular tool in hit identification [1,2,3]. Meanwhile, the massive publicly available libraries of bioactivity screening data are growing rapidly [4,5,6]. To exploit such big data for drug discovery, ligand-based approaches are becoming increasingly important. Depending on how molecular structural features are represented, ligand-based approaches can be categorized as two-dimensional (2D) or three-dimensional (3D). The utility of the 2D approaches (based on 2D molecular descriptors or fingerprints) has been confirmed with a variety of algorithms, including similarity coefficients (e.g., Tanimoto coefficient), distance functions (e.g., Euclidean distance), or, most recently, the Similarity Ensemble Approach (SEA) algorithm [7].
In contrast to 2D methods, 3D methods (including 3D-QSAR [8,9] and pharmacophore modeling [10,11]) are based on molecular shape or conformation. Because the ligand–receptor interaction involves complementarity of 3D shape and properties (hydrophobic and electrostatic potential), 3D methods are considered to have more potential for rational drug design, especially for scaffold hopping [12], even though previous studies show that 2D approaches are usually much faster and perform better than 3D ones [12,13,14]. Most 3D methods depend on hypothetical active conformations and require that all of the molecules are superimposed in advance. This is true for both CoMFA-related methods and pharmacophore approaches. Therefore, these traditional 3D methods are more applicable when the active compounds share similar scaffolds or pharmacophores. If the active conformations are unavailable, however, the superimpositions are not feasible.
In 2011, we built a database using all the rigid active compounds in PubChem (unpublished work) with the expectation of guiding ligand design for corresponding targets. However, the disadvantages of weak activity and low molecular weight limited its application. In fact, our knowledge about the biologically active conformations is compiled as complexes with biological macromolecules in the Protein Data Bank (PDB; http://www.rcsb.org) [15]. The crystallized ligands in PDB complexes can be treated as frozen inverse shapes of their binding sites and can be used as templates to measure a compound’s binding probability to the corresponding targets through 3D similarity calculation. The more similarity between the compound and the crystallized ligand, the more likely the compound can form a similar shape and bind to the protein. Then, the similarity array between the compound and a pre-defined template set can be used as a virtual bioactivity profile (a multiple-dimensional molecular descriptor) in virtual screening.
On the other hand, proteins came into being over a long evolutionary process. Homologous or closely related proteins share similar sequences. As is well known, the number of druggable genes is limited, and protein structures and functions are more conserved than their primary sequences [16]. Thus, these sequence-similar proteins tend to form similar structures. Consequently, the number of structural classes of proteins or protein pockets (active sites) is also limited. This conclusion is supported by the fact that no new fold or superfamily has been submitted to the PDB in recent years [17,18,19,20,21]. In addition, long-term functional selection forced some proteins with dissimilar sequences to form similar active sites (the ligands induced the formation of enzyme/receptor structures). For example, the 5-HT3A receptors are ion channels, while the other 5-HT receptors (5-HT1,2,4–7) are G protein-coupled receptors (GPCR) [22]. For the same reason, most drugs bind to more than one target, which is defined as drug promiscuity or polypharmacology. Therefore, we believe that the protein pocket classes are limited and can be represented with PDB structures.
In this article, based on these hypotheses discussed above: (1) the structural classes of protein pockets (accumulated in PDB) are limited in number; (2) the shape features of a pocket can be reflected by its ligands; (3) high similarity between a compound and the PDB ligand indicates possible binding with the corresponding target; we proposed a protocol to calculate the shape similarity profile based on PDB ligands and applied it in ligand-based virtual screening and QSAR study. We termed this method the Three-Dimensional Biologically Relevant Spectrum (BRS-3D) after our related 2D approach [23]. Firstly, we selected 300 diverse ligands from the sc-PDB to compose the 3D Biologically-relevant Representative Compound Database (BRCD-3D), which were used as templates for the BRS-3D calculation. Then, predictive discriminant models were established for 42 benchmark data sets using BRS-3D and the SVM algorithm. We compared the performance of BRS-3D and other state-of-the-art 2D and 3D molecular descriptors. We also applied the BRS-3D approach in histone deacetylase 1 (HDAC1) inhibitors screening and GPCR subtype selectivity prediction.

2. Results

2.1. Summary of BRCD-3D

Based on the self-similarity matrix and cluster analysis, a diverse set of ligands was extracted from the sc-PDB to compose the BRCD-3D. The size of the BRCD-3D is critical for the calculation efficiency and application effectiveness of BRS-3D. Therefore, we prepared a series of BRCD-3D databases with 500, 300, 200, 100, or 50 ligands. The prediction performances of BRCD-3D databases of different sizes were compared using two data sets from ChEMBL [4]: 1189 human acetylcholinesterase (AChE) inhibitors and 1024 HIV-1 protease inhibitors. Two thousand random molecules selected from the Available Chemicals Directory (ACD) [24] were used as the negative samples. As shown in Supplementary Materials Figure S1, Tables S1 and S2, the Accuracy, Precision, Recall, and MCC values of the models decreased when the BRCD-3D size was reduced. The performance of the models with a BRCD-3D size of 300 was close to the performance with a size of 500, while further reducing the BRCD-3D size affected the discriminant efficiency. Thus, a BRCD-3D size of 300 was chosen to balance computational cost and modeling performance. Information regarding the 300 ligands and their targets is provided in the Excel file in the Supplementary Materials. The ligand structures are also provided in a zipped file (mol2 format).
The 300 ligands in the BRCD-3D included 281 putative ligands, 14 oligopeptides, and 5 cofactors. The peptides were composed of eight or fewer residues. We analyzed the physicochemical property distribution of the BRCD-3D ligands (Figure 1A–F), including molecular weight (MW), octanol-water partition coefficient (AlogP), number of hydrogen bond acceptors (HBAs), number of hydrogen bond donors (HBDs), polar surface area (PSA), and number of rotatable bonds (RBs). The MW of the ligands ranged from 140 to 800 Da. Most of the ligands conformed to Lipinski’s rule-of-five [25,26].
The biological diversity of the BRCD-3D ligands could be analyzed with their corresponding targets, as shown in the pie charts in Figure 1. The targets were categorized into seven classes: oxidoreductase, transferase, hydrolase, lyase, isomerase, ligase, and non-enzyme (the left pie chart, Figure 1G). Enzymes and GPCRs are the most important therapeutic targets in current drug discovery [27,28,29,30,31]. There were 255 enzymes, including 80 kinases and 90 proteases, representing the main part of BRCD-3D ligands’ targets. The structural classifications of the 300 targets were annotated according to SCOP (Structural Classification of Proteins, 1.75 release) [20] (the right pie chart, Figure 1H). Detailed information is presented in the Supplementary Materials Tables S3–S5.

2.2. Evaluation with GLL/GDD (G Protein-Coupled Receptor (GPCR) Ligand Library and the GPCR Decoy Database) Benchmark Data Sets

Predictive discriminant models were successfully constructed using BRS-3D (Figure 2 and Supplementary Materials Table S6). We compared the performances of the three methods used to handle the unbalanced data. For the “1:10” data reduction method, the Recall values of most models were greater than those of the other two methods. However, reducing the number of decoy compounds caused information loss. The overall prediction accuracies (ACC) of these models were worse than those of the other methods, implying that there were more false positives. For the “weighted” method (Table 1), the cross-validation AUC values were greater than 0.95 for all data sets, indicating that the SVM models had a fairly reliable learning ability. The ACC values for the test sets were greater than 0.95 for all models. The Precision values of most models were also acceptable. However, due to the data imbalance, the average Recall was approximately 0.76, meaning that the SVM models were biased towards predicting the compounds as decoys. The 1:39, 1:10, and weighted methods resulted in 40, 39, and 34 models with Precision values greater than 0.9, respectively. These models can effectively enrich the active compounds from decoys. The lowest Precision values for the 1:39, 1:10, and weighted methods were 0.788 (6th, 5HT2C_Agonist), 0.842 (7th, 5HT2C_Antagonist), and 0.562 (19th, ADA2B_Antagonist), respectively. These models could lead to high false positive rates. Except for one data set (ADA2B_Antagonist), all MCC values were greater than 0.7, indicating that the models were acceptable.
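The interplay between ACC, Precision, Recall, and MCC on unbalanced data can be made concrete with a short calculation. The following is a minimal sketch, assuming the standard confusion-matrix definitions of these statistics; the function name and the counts in the example are hypothetical and are not taken from the benchmark results.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}

# Invented counts for a test set with 100 actives and 3900 decoys (1:39).
m = classification_metrics(tp=76, fp=4, tn=3896, fn=24)
print(m)
```

With these invented counts, the accuracy is above 0.99 even though almost a quarter of the actives are missed, which is why Recall and MCC are reported alongside ACC for unbalanced sets.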
The GLL/GDD was designed as a benchmark data set for a structure-based method (such as docking). Gatica et al. studied some targets in this data set with the docking approach. There were six common targets among their study and ours (Supplementary Materials Table S7). The maximum enrichment factors (EFmax) ranged from 1.7 to 38.2, according to their docking approach [32]. In comparison, prediction based on BRS-3D models can reach EFs at 10–38 (top 10% and 2% in test sets). More importantly, our approach can be applied to systems with no known target structures.
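For readers unfamiliar with enrichment factors, the following minimal sketch uses the common definition: the hit rate in the top-ranked fraction divided by the hit rate in the whole set. The function name and the toy ranking are hypothetical; the exact EF protocol used in the study is not specified here.

```python
def enrichment_factor(scores, labels, fraction):
    """EF = hit rate in the top `fraction` of the ranking / overall hit rate.

    `labels` holds 1 for actives and 0 for decoys; higher scores rank first.
    """
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("the data set contains no actives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    top_hits = sum(label for _, label in ranked[:n_top])
    return (top_hits / n_top) / (total_hits / len(labels))

# Toy set with 39 decoys per active and a perfect ranking (actives first).
labels = [1] * 5 + [0] * 195
scores = list(range(200, 0, -1))
ef_2 = enrichment_factor(scores, labels, 0.02)
ef_10 = enrichment_factor(scores, labels, 0.10)
print(ef_2, ef_10)
```

For a set with 39 decoys per active, a perfect ranking gives an EF of 40 at the top 2% and 10 at the top 10%, so these values are the ceilings for that ratio.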
In summary, the results of the SVM models based on benchmark data sets demonstrate that BRS-3D can effectively characterize the 3D structural features of molecules and can be used as a multi-dimensional structural descriptor. Combined with appropriate machine learning methods, prediction models can be developed to identify target-specific active compounds.

2.3. Feature Selection

We studied the influence of different feature subsets on the performance of the 42 SVM models. A weighted C parameter was used to handle the data imbalance. The results are presented in Figure 3. The cross-validation AUC values of models based on different feature subsets were acceptable (mostly over 0.9), again confirming the excellent learning ability of the method. For the test sets, the ACC values were mostly over 0.9. The influence of feature selection on these two statistical parameters was negligible. However, different feature subsets do affect the predictive ability, as shown by the variations of the Precision, Recall, and MCC values. The predictive ability was enhanced as more variables were added to the model. Most of the models with the minimum number of features (5%) had the lowest Precision, meaning that they produced more false positives. When the number of features increased to 30% (90 BRS-3D variables), most models reached acceptable Precision and MCC values. Beyond that point, the trends became complex as more features were added, meaning that the newly added variables provided useful information for model construction but also introduced some noise. For most of the data sets, models with the full BRS-3D performed best, as judged from these parameters. This behavior demonstrates the effectiveness of reducing the original 9878 sc-PDB ligands to 300 with cluster analysis during the BRCD-3D construction process. Of course, for real screening the models should be analyzed with more sophisticated statistical parameters, such as ROCFIT and ROCED [33], which are beyond the scope of this article.

2.4. Comparison with Other Molecular Descriptors

The prediction performances of BRS-3D-based models were compared with models based on Dragon 2D and MOE 3D descriptors. The results are shown in Table 2. The prediction Accuracy values of all three descriptors are sufficiently high, but the Precision values of BRS-3D are much higher than those of the Dragon 2D and MOE 3D descriptors, which means that the BRS-3D-based models have the lowest false positive rates. Although the Recall values of BRS-3D are slightly lower than those of the other two descriptors, the models based on BRS-3D still show sufficient sensitivity. We compared the MCC values of these three descriptors (Figure 4), as MCC is considered a comprehensive evaluation index for classification models. For 29 of the 42 data sets, the BRS-3D-based models possessed the highest MCC values. For the other 13 data sets, BRS-3D performed as well as the Dragon 2D topological descriptors, and both of these descriptors performed better than the MOE 3D descriptors.

2.5. HDAC1 Inhibitor Screening

HDAC1 is an attractive drug target for cancer therapy [34]. Vorinostat (SAHA), an inhibitor of HDAC1, has been approved by the FDA as an effective drug for the treatment of cutaneous T cell lymphoma [35]. Therefore, we applied BRS-3D in the screening for HDAC1 inhibitors. The candidate database for virtual screening contained more than 300,000 drug-like or lead-like compounds derived from Specs, ChemDiv, and Enamine. To simplify the study, we used only two screening filters: the pharmacophore model (refer to Supplementary Materials Figure S2) and the BRS-3D discriminant model (refer to Supplementary Materials Table S8). Finally, 30 molecules were selected and purchased. The activity assay results showed that two similar molecules could inhibit HDAC1 at micromolar concentrations. The inhibition rates of these two molecules were 34.65% and 38.66% at 10 μM. The IC50 values calculated by curve fitting were 43.99 and 30.07 μM, respectively (Supplementary Materials Figure S3).

2.6. Application of BRS-3D in Subtype Selectivity Predictions

We noticed that one shortcoming of molecular superimposing was that shape played the most important role in superimposition scoring, while fewer pharmacophore features were taken into consideration. This limited the application of BRS-3D in activity prediction for targets that were charge- or H-bond-sensitive. Different subtypes of a GPCR family can be activated by the same ligand. We believed that the subtype selectivity of GPCR ligands was dominated largely by dynamic conformation-changing patterns. For example, all dopamine receptors (DR1–4) can be activated by dopamine or its mimics [36]. The pharmacophore distributions of the ligands are similar to each other. The selectivity of DR ligands depends on their dynamic shape or conformation-changing patterns. Because BRS-3D is calculated with multiple superimposing templates, it can encode information about multiple-conformation and then can be applied to GPCR subtype selectivity prediction. We applied BRS-3D in subtype selectivity prediction of dopamine receptors (DR) [37], adenosine receptors (AR) [38], cannabinoid receptors (CB), and an enzyme, monoamine oxidase (MAO). Predictive models could be constructed for these systems. In this paper, we report the result of CB subtype selectivity prediction as an example.
Cannabinoid receptors include two subtypes, denoted as CB1 and CB2. CB1 selective antagonists show clinical efficacy in the treatment of obesity, metabolic disorders, and drug abuse [39,40], whereas CB2 selective agents demonstrate efficacy in inflammatory pain models [41] and play neuroprotective roles in Huntington’s and Alzheimer’s diseases [42,43]. The development of subtype-selective compounds for CB1/CB2 receptors can not only avoid the unwanted side effects produced by nonselective ligands, but also help us understand the specific physiological functions of each subtype receptor.
We extracted all the compounds with definite Ki values for both CB1 and CB2 receptors (Homo sapiens) in ChEMBL (version 20). The selectivity ratio (SR) was defined as SR(CB1/CB2) = pKi(CB1) − pKi(CB2). The compounds with SR ≥ 1.3 or SR ≤ −1.3 were considered CB1- or CB2-selective, respectively. As shown in Figure 5 and Tables S9 and S10, predictive regression and discriminant models were successfully constructed. The performances of the models improved as more features were used in model development. The regression model reached satisfactory performance, with a cross-validated Q² = 0.650 for the training set and R² = 0.753 for the test set (Figure 5A), when 20% of the features (60 variables) were employed. The RMSE values of the training set and test set were lower than 1 unit (Figure 5B). Considering that the compounds in ChEMBL were collected from different laboratories, the results were excellent. We conducted data resampling (100 times, Figure S4A,B) and a Y-randomization test (500 times, Figure S4C) to evaluate the stability of the prediction models and to exclude possible chance correlation. We also analyzed the applicability domain of the model (Figure S4D). The discriminant models were simpler than the regression models, since only 10% of the features (30 variables) were needed (Figure 5D) to distinguish over 90% of the selective ligands correctly. Even with only 3 BRS-3D features (1%), the prediction accuracy (0.896) was still satisfactory.
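The selectivity labeling described above can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, pKi is taken as the negative base-10 logarithm of Ki in mol/L, and the "non-selective" label for compounds between the two thresholds is an assumption added for completeness.

```python
import math

def pki_from_nanomolar(ki_nm):
    """pKi = -log10(Ki in mol/L); 1 nM corresponds to pKi = 9."""
    return 9.0 - math.log10(ki_nm)

def selectivity_class(pki_cb1, pki_cb2, threshold=1.3):
    """Return (SR, label), where SR = pKi(CB1) - pKi(CB2)."""
    sr = pki_cb1 - pki_cb2
    if sr >= threshold:
        return sr, "CB1-selective"
    if sr <= -threshold:
        return sr, "CB2-selective"
    return sr, "non-selective"  # assumed label for the middle band

# Hypothetical compound: Ki = 1 nM at CB1 and 100 nM at CB2.
sr, label = selectivity_class(pki_from_nanomolar(1.0), pki_from_nanomolar(100.0))
print(sr, label)
```

Because SR is a difference of logarithms, a threshold of 1.3 corresponds to a roughly 20-fold difference in Ki between the two subtypes.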
We analyzed the distribution of the selective compounds in the chemical space composed of the most important BRS-3D features (Figure 5E,F). As the results show, the CB1-selective and CB2-selective compounds were distributed in different zones of the space. The compounds with higher similarity to the 105th and 228th BRCD-3D ligands and with lower similarity to the 122nd BRCD-3D ligand were biased to bind to the CB2 receptor, and vice versa. We calculated the similarity of CB1/CB2-selective compounds with active compounds in ChEMBL of the corresponding targets of BRS228, BRS122 and BRS105 (with PDB ID 2WIH, 1QBR and 1R1H, respectively). Surprisingly, the similarities of CB1- and CB2-selective compounds with the active compounds showed no difference (BRS122 and BRS105) or weak inverse trends (BRS228). The results demonstrated the advantages of 3D approaches relative to 2D ones. That is, activity relationships related to molecular shapes could be discovered with 3D methods, even when the topological structures of the compounds were dissimilar to each other. Figure S5 gives examples of the five most selective compounds of CB1 and CB2. Thus, 3D methods were more suitable for scaffold hopping [12].

3. Discussion

In this paper, we introduced a multi-dimensional molecular descriptor, BRS-3D. The compounds under scrutiny were superimposed on a diverse set of 300 ligands selected from the sc-PDB. Then, the shape similarity profile was used as a multi-dimensional descriptor for QSAR studies. Predictive SVM models were successfully constructed to discriminate active molecules from decoys, and most of the models performed well. Comparison with two other state-of-the-art molecular descriptors showed that the models based on BRS-3D achieved the best prediction performance. We also applied this approach in a real screening project for HDAC1 inhibitors. Two of the 30 compounds showed moderate activity, with IC50 values of 30.07 and 43.99 μM. Predictive regression models and discriminant models were constructed for CB1/CB2 subtype selectivity prediction. Therefore, we believe that BRS-3D is a valid molecular descriptor for ligand-based virtual screening and QSAR studies.
Recently, ligand profiling (such as the Cerep BioPrint® profile and the Novartis high-throughput screening fingerprints, HTSFPs) has gained much attention [44,45,46,47,48,49]. Helal and co-workers used the PubChem Bioassay database to build a publicly available version of HTSFPs [50]. In these works, a compound was encoded by its biological activity profile, which was collected from a battery of in vitro pharmacology or ADME assays. The gene-expression profile (C-map) was also used as a molecular descriptor to study the relationships between small molecules, genes, and diseases [51]. All of these works demonstrated that biological activity profiles, or similar profile approaches, were efficient for virtual screening and target prediction. In fact, ligand profiling can also be performed by theoretical calculation, e.g., inverse docking [52] and pharmacophore database mapping [53]. In 2012, Sato et al. proposed a shape overlay similarity profile with known active compounds and used it as molecular descriptors in machine learning for ligand-based virtual screening [54]. Taking known active compounds as templates, they calculated the 3D similarity profile (the array of overlay scores) and used these profiles as explanatory variables. Predictive discriminant models were constructed using the support vector machine (SVM). No active conformations are needed during this process. When diverse active compounds are available, this protocol can overcome the shortcomings of traditional 3D methods. That is to say, the screening protocol can run automatically, without strict substructure alignment or active conformations. However, using active compounds of a specific target as superimposition templates means that this descriptor is not reusable. When the target changes, the templates must be replaced and the similarity array has to be calculated again.
Similar to Sato’s protocol, our approach is also a theoretical implementation of activity profiling. Nevertheless, there are two differences between these two approaches. Firstly, we used a fixed set of templates to make the calculated descriptor reusable for new systems. Secondly, using a diverse set of unrelated ligands as references attaches more biological significance to the BRS-3D scores. A high similarity means that the objective compound can form a shape similar to that of the corresponding ligand and can bind to the corresponding target, while a low superimposing score indicates that the compound under scrutiny cannot adopt conformations similar to those of the PDB ligands (forbidden conformations), which may cause a clash with the target or a shape mismatch.
In fact, the fitting profile of an objective molecule against the 300 targets in the BRCD-3D can also be calculated by reverse docking. We compared the performances of docking-based BRS-3D and superimposing-based BRS-3D using the data set of 1189 AChE inhibitors and 2000 diverse compounds from the ACD database. Surflex-Sim outperformed Surflex-Dock (Supplementary Materials Table S11), possibly due to the poor scoring functions of current docking programs.
BRS-3D is a ligand-based method. Therefore, it can be used for systems without crystallized target structures. As an example, GPCRs are the targets of over 30% of marketed prescription drugs [55]. Only a few crystal structures of GPCRs have been solved, limiting the application of structure-based methods. In this paper, the results show that BRS-3D can be applied to GLL/GDD discrimination. BRS-3D is calculated with multiple templates. Therefore, BRS-3D reflects the conformational ensemble (300 possible binding modes), which is useful for modeling the dynamic binding process between the objective compound and its potential targets. For the same reason, BRS-3D may also be applied in drug discovery for multi-target projects. Unlike conventional 3D-QSAR methods, such as CoMFA and CoMSIA, BRS-3D belongs to the second type of QSAR method discussed by Fujita and Winkler [56]. Once a BRS-3D model has been constructed and validated, it can be used in preliminary virtual screening to identify new scaffolds automatically.
In addition to the advantages discussed above, BRS-3D also has some drawbacks. First, shape similarity calculation is computationally intensive: it takes approximately 30 min to calculate the BRS-3D for a typical molecule on a modern CPU core (we used the Intel Xeon E5-2609 v2 @ 2.50 GHz). Therefore, this method is not suitable for on-the-fly analysis. Nevertheless, the BRS-3D descriptor only needs to be calculated once, and the results can then be reused in different projects. Currently, we have finished the calculations for more than 800,000 drug-like compounds. The calculated profiles have been stored in an in-house database for further use. Second, although each BRS-3D element has a definite meaning, i.e., the similarity to a BRCD-3D ligand, models developed using this descriptor and machine learning methods are less amenable to interpretation. Therefore, it is difficult to derive rules that guide the rational design of new active compounds. However, as illustrated in Figure 5E,F, the distributions of the compounds in a BRS-3D space can provide valuable information for inferring the relationships between objective compounds and BRCD-3D targets, which is useful in lead optimization and also in drug repositioning [57]. Third, we used Surflex-Sim for shape similarity calculation. However, the similarity scores calculated with this method have a concentrated distribution, i.e., most similarity scores ranged from 0.3 to 0.7. BRS-3D can characterize the surface and shape properties of the molecules under study. As the shape similarity cannot encode electronic and other polar features, we combined the pharmacophore method with our method for screening HDAC1 inhibitors (which also increases the screening speed).
In summary, BRS-3D can be used as a multi-dimensional molecular descriptor in ligand-based studies. Calculated using multiple templates, BRS-3D can reflect the transformation pattern between the active and inactive conformations of the molecule under scrutiny. Of course, as required for all 3D methods, the active compounds should bind to the same pocket of the target in a similar manner.

4. Materials and Methods

4.1. Workflow of BRS-3D-Based Virtual Screening

This study was divided into three steps: template preparation, BRS-3D calculation, and model development and validation (Scheme 1). First, 3D shape similarity calculations and structural clustering were used to extract a set of 300 diverse ligands from the druggable protein–ligand complex database, sc-PDB, which is a subset of the original PDB [58]. This ligand set was named the BRCD-3D. Then, each of the molecules under scrutiny was flexibly superimposed onto the 300 ligands in the BRCD-3D. These superimpositions were scored according to the degree of shape overlap and property similarity, producing a 300-dimensional similarity array called the BRS-3D. Finally, quantitative or discriminant models were developed using the BRS-3D and various machine learning methods.

4.2. Surflex-Sim Superimposition

A variety of chemoinformatics tools are available for superimposition, such as FLEXS [59] and ROCS [60]. In this paper, we used Surflex-Sim for BRCD-3D construction and BRS-3D calculation. Surflex-Sim is the molecular similarity computing module of Surflex [61]. It measures the 3D similarity between two molecules based on the morphological similarity algorithm, which takes into account both the surface shape match and the similarity of charge characteristics [62]. The superimposing process can be divided into four steps, including fragmentation, conformational search, alignment, and scoring, which will be performed automatically by the Surflex-Sim program. More details about the superimposition method can be found in Jain’s paper [62]. Surflex-Sim similarity scores range from 0 to 1. A score greater than 0.7 is generally considered to indicate a significant functional relationship between molecules [62]. Default parameters were used for all of the calculations.

4.3. Construction of the BRCD-3D

The BRCD-3D is a representative collection of the active conformations of ligands. We used ligands in the sc-PDB to build the BRCD-3D. The sc-PDB is a ligand–target complex database derived from the PDB [58]. As this database was developed for drug discovery, only druggable binding sites and their corresponding ligands are included in the sc-PDB. Therefore, the sc-PDB can be used to represent the known bioactive conformational space. However, some ligands, such as most cofactors, are co-crystallized in more than one PDB entry. In addition, ligands that bind to the same pocket are highly similar. These redundancies should be removed.
We extracted all 9878 ligands from the sc-PDB (version 2011) [63]. A self-similarity matrix of the 9878 ligands was calculated through an iterative process of rigid molecular superimposition. In each iteration, an sc-PDB ligand was used as the template and kept rigid. All other ligands were also kept rigid and superimposed onto the template. The superimpositions were scored with the default Surflex-Sim parameters. These pairwise similarity scores constituted the self-similarity matrix.
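The iteration described above can be sketched as a double loop over templates and queries. This is a minimal illustration, not the Surflex-Sim workflow itself: `score_fn` is a hypothetical stand-in for a rigid superimposition scorer, the toy scorer works on plain numbers rather than molecules, and setting the diagonal to 1.0 is an assumption.

```python
def self_similarity_matrix(ligands, score_fn):
    """Build an N x N matrix; row i holds the scores against template i."""
    n = len(ligands)
    matrix = [[0.0] * n for _ in range(n)]
    for i, template in enumerate(ligands):   # one iteration per template
        for j, query in enumerate(ligands):
            if i == j:
                matrix[i][j] = 1.0           # assumed self-similarity
            else:
                matrix[i][j] = score_fn(query, template)
    return matrix

def toy_score(query, template):
    """Hypothetical scorer in (0, 1]; a real one would superimpose molecules."""
    return 1.0 / (1.0 + abs(query - template))

sim = self_similarity_matrix([0.0, 1.0, 3.0], toy_score)
print(sim)
```

Both orderings of each pair are scored because a superimposition score need not be symmetric in template and query; for 9878 ligands this amounts to nearly 10^8 ordered pairs.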
Then, based on the similarity matrix and cluster analysis, a diverse subset of the 9878 ligands was extracted to compile the BRCD-3D. The 9878 ligands were clustered into several groups via an in-house protocol utilizing the component “Cluster Data” in Pipeline Pilot 8.5 [64]. In the clustering protocol, a row of the self-similarity matrix was used as a numeric descriptor to compute the distance between two ligands, and the distance function was set to “One Minus Pearson correlation”. Clustering was performed using the maximum dissimilarity method. The cluster centers were output as members of the BRCD-3D. We built five versions of the BRCD-3D database with different numbers of ligands (500, 300, 200, 100, 50) to compare their performances.
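The distance function and center selection described above can be illustrated with a small stand-alone sketch. It is not the Pipeline Pilot "Cluster Data" component: the MaxMin-style selection below is one common reading of a maximum dissimilarity method, and the function names, the seed choice, and the toy matrix are assumptions.

```python
import math

def pearson_distance(x, y):
    """One minus the Pearson correlation of two equal-length rows."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 1.0  # assumption: treat an undefined correlation as zero
    return 1.0 - cov / math.sqrt(var_x * var_y)

def max_dissimilarity_centers(rows, k, seed_index=0):
    """Greedily pick k rows, each as far as possible from those already picked."""
    centers = [seed_index]
    min_dist = [pearson_distance(rows[seed_index], r) for r in rows]
    while len(centers) < min(k, len(rows)):
        candidates = [i for i in range(len(rows)) if i not in centers]
        nxt = max(candidates, key=lambda i: min_dist[i])
        centers.append(nxt)
        for i, r in enumerate(rows):
            min_dist[i] = min(min_dist[i], pearson_distance(rows[nxt], r))
    return centers

# Toy self-similarity matrix: ligands 0 and 1 resemble each other, as do 2 and 3.
rows = [
    [1.0, 0.8, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.9],
    [0.1, 0.2, 0.8, 1.0],
]
centers = max_dissimilarity_centers(rows, k=2)
print(centers)
```

Using whole matrix rows as descriptors means two ligands are close when they relate to all other ligands in the same way, not merely when their direct pairwise score is high.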

4.4. Calculation of BRS-3D

BRS-3D was defined as the shape similarity profile between the molecule under scrutiny and all of the ligands in the BRCD-3D. As the BRCD-3D consisted of a diverse range of ligands, the BRS-3D could serve as a GPS-like location system within the bioactive-conformation chemical space. To calculate the BRS-3D, the molecules under scrutiny (objective molecules) were flexibly superimposed onto BRCD-3D ligands that were kept rigid. By default, 10 overlapped conformations and similarity scores between the objective molecule and a template were output. Only the highest score was kept as one element of the BRS-3D. Therefore, the dimension of the BRS-3D was equal to the number of ligands in the BRCD-3D. BRS-3D was used as a multi-dimensional descriptor for the development of QSAR models.
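As a minimal sketch of how the descriptor is assembled, the snippet below takes the pose scores returned for each template and keeps only the highest one. It assumes the superimposition scores have already been produced by an external program; the dictionary layout, template identifiers, and function name are hypothetical.

```python
def brs3d_vector(pose_scores, template_ids):
    """Build a BRS-3D-style descriptor: one element per template ligand,
    equal to the best similarity score among the poses reported for it."""
    vector = []
    for template_id in template_ids:
        scores = pose_scores[template_id]
        if not scores:
            raise ValueError(f"no scores for template {template_id!r}")
        vector.append(max(scores))
    return vector

# Hypothetical scores for one molecule against three templates
# (a real run would have one entry per BRCD-3D ligand, up to 10 poses each).
template_ids = ["tmpl_a", "tmpl_b", "tmpl_c"]
pose_scores = {
    "tmpl_a": [0.41, 0.55, 0.52],
    "tmpl_b": [0.73, 0.70],
    "tmpl_c": [0.12],
}
descriptor = brs3d_vector(pose_scores, template_ids)
print(descriptor)
```

Keeping a fixed template order matters: every molecule must report its scores in the same order so that a given descriptor element always refers to the same BRCD-3D ligand.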

4.5. The Benchmark Data Sets

We used the G protein-coupled receptor (GPCR) ligand library and the GPCR decoy database (GLL/GDD) to evaluate the efficiency of BRS-3D for ligand-based studies. GLL/GDD was compiled by the Cavasotto Laboratory [32] and includes active ligands (agonists and antagonists) of 147 human Class A rhodopsin-like GPCR targets and their corresponding decoys. For each GLL ligand, there were 39 decoys. These decoys were selected from the ZINC database to have similar physicochemical properties (molecular weight, formal charge, hydrogen bond donors and acceptors, rotatable bonds, and logP) but dissimilar structures to the corresponding GLL ligand. GLL/GDD was originally developed for docking approaches, but it has also been used for performance evaluation of ligand-based methods [65].
To evaluate the efficiency of our approach, SVM discriminant models were constructed for 42 GLL/GDD data sets with more than 200 ligands. The structures of the GLL/GDD data sets were downloaded from the website of the Cavasotto Laboratory [66]. Each data set was randomly divided into a training set and a test set at a ratio of 4:1. The training set was used to select the optimal parameter settings by cross-validation and to build prediction models. The test set was used only for model evaluation.
The ligands and decoys of the GLL/GDD data sets were unbalanced, with a ligand:decoy ratio of 1:39. We compared three methods of handling this imbalance. First, the original data were used without any special treatment (1:39 method). Second, we reduced the ligand:decoy ratio from 1:39 to 1:10 (1:10 method). For each ligand, only 10 of the 39 original decoys were randomly selected and used for model development, and the other decoys were discarded. In the third approach, we assigned different weight factors to parameter C (39 and 1 for the ligands and decoys, respectively) during SVM model development (weighted method). The 42 data sets are summarized in Table 3.
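The 1:10 and weighted treatments can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the data layout (a dictionary mapping each ligand to its decoys), the function names, the class labels (+1 for ligands, -1 for decoys), and the random seed are all hypothetical.

```python
import random

def downsample_decoys(decoys_by_ligand, n_keep=10, seed=0):
    """1:10 method: keep n_keep randomly chosen decoys per ligand."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        ligand: rng.sample(decoys, min(n_keep, len(decoys)))
        for ligand, decoys in decoys_by_ligand.items()
    }

def class_weights(n_ligands, n_decoys):
    """Weighted method: scale C for the minority (ligand) class by the
    decoy:ligand ratio and leave the decoy class at 1."""
    return {1: n_decoys / n_ligands, -1: 1.0}

# Hypothetical data set: 5 ligands, each with 39 decoys.
decoys_by_ligand = {
    f"lig{i}": [f"lig{i}_decoy{j}" for j in range(39)] for i in range(5)
}
reduced = downsample_decoys(decoys_by_ligand)
weights = class_weights(n_ligands=5, n_decoys=5 * 39)
print({k: len(v) for k, v in reduced.items()}, weights)
```

In LIBSVM, per-class weights of this kind are passed through the per-class weight option of `svm-train` (for example `-w1 39 -w-1 1` under the label convention assumed above), which multiplies C for the given class.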

4.6. Model Development and Validation

SVM is a promising machine learning method and has been extensively applied in various pattern recognition systems and across all fields of informatics, including bioinformatics and chemoinformatics [54,67,68,69,70,71]. Combined with molecular fingerprints or descriptors, SVM can be easily utilized for virtual screening with good prediction performance. In this study, LIBSVM (v3.16) [72], an implementation of SVM for classification, regression, and distribution estimation, was adopted to develop the discriminant models based on BRS-3D. GLL active compounds (agonists or antagonists) and GDD decoys were assigned as positive and negative samples, respectively. The RBF (radial basis function) was used as the kernel function, and 80% of the data set was used as the training set, while the rest was used as the test set. We used 10-fold cross-validation to verify the learning ability of the models. The parameter gamma of the kernel function and parameter C were optimized with grid searching. For the classification models, the area under the receiver operating characteristic curve (AUC) was used to evaluate the cross-validation results and to determine the best parameter settings. The parameter settings are generally considered to be acceptable when the cross-validation AUC value of a classifier is greater than 0.9. For the regression models, the cross-validation root-mean-square error (RMSE_CV) was used for parameter optimization.
In addition to cross-validation, we also verified the performance of the models with test sets (20% of the original data set). Statistical parameters including Accuracy (ACC), Precision, Recall, Specificity, and the Matthews correlation coefficient (MCC) were computed to assess the performance of the models.
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Here, TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively. ACC is the overall prediction accuracy of a classifier. Higher ACC values indicate higher predictive power. However, if the data points of the two classes are highly unbalanced, ACC cannot correctly reflect predictive power. In this situation, MCC is preferred. MCC ranges from −1 to 1. MCC = 1 indicates an ideal prediction, while MCC = 0 represents a random prediction.
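The point about unbalanced classes can be checked numerically. The sketch below computes the statistics from confusion-matrix counts; the function name and the convention of reporting an undefined ratio (zero denominator) as 0 are assumptions made for illustration. The example uses a hypothetical test set with a 1:39 ligand:decoy ratio and a trivial classifier that labels everything as a decoy.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """ACC, Precision, Recall, Specificity and MCC from confusion counts."""
    def ratio(num, den):
        # assumption: an undefined ratio (zero denominator) is reported as 0.0
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "Precision": ratio(tp, tp + fp),
        "Recall": ratio(tp, tp + fn),
        "Specificity": ratio(tn, tn + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }

# 10 ligands and 390 decoys; every compound is predicted to be a decoy.
all_decoys = classification_metrics(tp=0, tn=390, fp=0, fn=10)
# The same test set classified without any error.
perfect = classification_metrics(tp=10, tn=390, fp=0, fn=0)
print(all_decoys["ACC"], all_decoys["MCC"], perfect["MCC"])
```

The trivial classifier reaches an ACC of 0.975 while retrieving no ligand at all, whereas its MCC is 0, which is why MCC is the more informative statistic for these data sets.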
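The RMSE, the squared correlation coefficient of cross-validation ($Q^2$), and the determination coefficient ($R^2$) of the test sets were calculated for regression models.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum (y - \hat{y})^2}$$
$$Q^2 = \left( \frac{\sum (y - \bar{y})(\hat{y} - \bar{\hat{y}})}{\sqrt{\sum (y - \bar{y})^2 \sum (\hat{y} - \bar{\hat{y}})^2}} \right)^2$$
$$R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}$$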
Here, $n$ is the number of samples, $y$ is the observed response variable, $\hat{y}$ is the corresponding predicted value, and $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of $y$ and $\hat{y}$, respectively. As recommended by Alexander et al. [73], $R^2 > 0.6$ together with a low RMSE for the test set indicates satisfactory predictive ability of a regression model.
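The difference between a squared correlation coefficient and the determination coefficient is easy to overlook, so a short numerical sketch may help. The function names and the toy values are assumptions for illustration; the predictions are deliberately shifted by a constant to show that a correlation-based statistic can be perfect while the prediction error is not.

```python
import math

def rmse(y, y_hat):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def squared_pearson(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    mean_y, mean_hat = sum(y) / len(y), sum(y_hat) / len(y_hat)
    cov = sum((a - mean_y) * (b - mean_hat) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_hat = sum((b - mean_hat) ** 2 for b in y_hat)
    return (cov / math.sqrt(var_y * var_hat)) ** 2

def determination_coefficient(y, y_hat):
    """R^2 = 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly +1

print(rmse(observed, shifted))
print(squared_pearson(observed, shifted))
print(determination_coefficient(observed, shifted))
```

For the shifted predictions the squared correlation is 1 while $R^2$ drops to 0.2 and the RMSE is 1, which illustrates why $R^2$ and RMSE are reported together for the test sets.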

4.7. Feature Selection

To assess the influence of feature size on SVM model performance, several different feature subsets of BRS-3D were selected to build the prediction models. A total of 42 Random Forest (RF) models were built, using BRS-3D as variables. This process was implemented by the component “Learn R Forest Model” in Pipeline Pilot 8.5. First, the importance of each variable in the BRS-3D was measured with RF models, using the method of permutation accuracy importance [74]. Then, feature subsets with 15 (top 5%), 30 (top 10%), 90 (top 30%), 150 (top 50%), and 210 (top 70%) variables were selected according to the ranked importance of the variables. The feature subsets were used to build LIBSVM discriminant models. The feature selection process was performed with the original 1:39 unbalanced data sets. The SVM model performances were compared with and without feature selection.
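The subset construction amounts to ranking the 300 variables by importance and keeping the top fraction. The sketch below assumes the importance values are already available (in the study they came from the RF models); the function name and the synthetic importance values are hypothetical and serve only to show the bookkeeping.

```python
def top_feature_subsets(importances, percents=(5, 10, 30, 50, 70)):
    """Return, for each percentage, the indices of the most important
    features, ordered from most to least important."""
    n_features = len(importances)
    ranked = sorted(range(n_features), key=lambda i: importances[i], reverse=True)
    # integer arithmetic avoids floating-point rounding of the subset size
    return {p: ranked[: n_features * p // 100] for p in percents}

# Synthetic, distinct importance values for 300 BRS-3D variables.
importances = [((i * 37) % 300) / 300 for i in range(300)]
subsets = top_feature_subsets(importances)
print({p: len(idx) for p, idx in subsets.items()})
```

Because every subset is a prefix of the same ranking, the smaller subsets are nested inside the larger ones, so performance differences between them can be attributed to the added variables.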

4.8. Dragon 2D Descriptors and MOE 3D Descriptors

Two kinds of state-of-the-art molecular descriptors were adopted to build SVM discriminant models for the same 42 benchmark data sets, and their prediction performances were compared with that of BRS-3D. In total, 107 topological (2D) descriptors were calculated with Dragon (version 5.4) [75], and 91 surface area-, volume-, and shape-related 3D descriptors were calculated using MOE 2009 [76]. The details of the Dragon 2D and MOE 3D descriptors are listed in the Supplementary Materials, Tables S12 and S13. These two kinds of molecular descriptors were used as explanatory variables to build SVM models. The performances (ACC, Precision, Recall, and MCC) of the Dragon 2D-, MOE 3D-, and BRS-3D-based models were compared.

Supplementary Materials

Supplementary materials can be accessed at: https://www.mdpi.com/1420-3049/21/11/1554/s1.

Acknowledgments

This work was supported by Fundamental Research Funds for the Central Universities (grant 2014PY007), Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) and the National Natural Science Foundation of China (grant 21075046 and 21275061). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author Contributions

Conceived and designed the research: De-Xin Kong; performed the research: Ben Hu and Zheng-Kun Kuang; analyzed the data: Shi-Yu Feng, Dong Wang and Song-Bing He; wrote the paper: De-Xin Kong and Ben Hu. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kitchen, D.B.; Decornez, H.; Furr, J.R.; Bajorath, J. Docking and scoring in virtual screening for drug discovery: Methods and applications. Nat. Rev. Drug Discov. 2004, 3, 935–949. [Google Scholar] [CrossRef] [PubMed]
  2. Lionta, E.; Spyrou, G.; Vassilatis, D.K.; Cournia, Z. Structure-based virtual screening for drug discovery: Principles, applications and recent advances. Curr. Top. Med. Chem. 2014, 14, 1923–1938. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, L.J.; Leung, K.H.; Chan, D.S.; Wang, Y.T.; Ma, D.L.; Leung, C.H. Identification of a natural product-like STAT3 dimerization inhibitor by structure-based virtual screening. Cell Death Dis. 2014, 5, e1293. [Google Scholar] [CrossRef] [PubMed]
  4. Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–D1107. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, Y.; Suzek, T.; Zhang, J.; Wang, J.; He, S.; Cheng, T.; Shoemaker, B.A.; Gindulyte, A.; Bryant, S.H. PubChem BioAssay: 2014 update. Nucleic Acids Res. 2014, 42, D1075–D1082. [Google Scholar] [CrossRef] [PubMed]
  6. Gilson, M.K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016, 44, D1045–D1053. [Google Scholar] [CrossRef] [PubMed]
  7. Keiser, M.J.; Roth, B.L.; Armbruster, B.N.; Ernsberger, P.; Irwin, J.J.; Shoichet, B.K. Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 2007, 25, 197–206. [Google Scholar] [CrossRef] [PubMed]
  8. Cramer, R.D.; Patterson, D.E.; Bunce, J.D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988, 110, 5959–5967. [Google Scholar] [CrossRef] [PubMed]
  9. Klebe, G.; Abraham, U.; Mietzner, T. Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. J. Med. Chem. 1994, 37, 4130–4146. [Google Scholar] [CrossRef] [PubMed]
  10. Sciabola, S.; Carosati, E.; Cucurull-Sanchez, L.; Baroni, M.; Mannhold, R. Novel TOPP descriptors in 3D-QSAR analysis of apoptosis inducing 4-aryl-4h-chromenes: Comparison versus other 2D- and 3D-descriptors. Bioorg. Med. Chem. 2007, 15, 6450–6462. [Google Scholar] [CrossRef] [PubMed]
  11. Sciabola, S.; Morao, I.; de Groot, M.J. Pharmacophoric fingerprint method (TOPP) for 3D-QSAR modeling: Application to CYP2D6 metabolic stability. J. Chem. Inf. Model. 2007, 47, 76–84. [Google Scholar] [CrossRef] [PubMed]
  12. Nettles, J.H.; Jenkins, J.L.; Bender, A.; Deng, Z.; Davies, J.W.; Glick, M. Bridging chemical and biological space: “Target fishing” using 2D and 3D molecular descriptors. J. Med. Chem. 2006, 49, 6802–6810. [Google Scholar] [CrossRef] [PubMed]
  13. Venkatraman, V.; Perez-Nueno, V.I.; Mavridis, L.; Ritchie, D.W. Comprehensive comparison of ligand-based virtual screening tools against the DUD data set reveals limitations of current 3D methods. J. Chem. Inf. Model. 2010, 50, 2079–2093. [Google Scholar] [CrossRef] [PubMed]
  14. Hu, G.P.; Kuang, G.L.; Xiao, W.; Li, W.H.; Liu, G.X.; Tang, Y. Performance evaluation of 2D fingerprint and 3D shape similarity methods in virtual screening. J. Chem. Inf. Model. 2012, 52, 1103–1113. [Google Scholar] [CrossRef] [PubMed]
  15. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The protein data bank. Nucleic Acids Res. 2000, 28, 235–242. [Google Scholar] [CrossRef] [PubMed]
  16. Chothia, C.; Lesk, A.M. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986, 5, 823–826. [Google Scholar] [PubMed]
  17. Lo Conte, L.; Brenner, S.E.; Hubbard, T.J.P.; Chothia, C.; Murzin, A.G. SCOP database in 2002: Refinements accommodate structural genomics. Nucleic Acids Res. 2002, 30, 264–267. [Google Scholar] [CrossRef] [PubMed]
  18. Andreeva, A.; Howorth, D.; Brenner, S.E.; Hubbard, T.J.P.; Chothia, C.; Murzin, A.G. SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Res. 2004, 32, D226–D229. [Google Scholar] [CrossRef] [PubMed]
  19. Andreeva, A.; Howorth, D.; Chandonia, J.M.; Brenner, S.E.; Hubbard, T.J.P.; Chothia, C.; Murzin, A.G. Data growth and its impact on the SCOP database: New developments. Nucleic Acids Res. 2008, 36, D419–D425. [Google Scholar] [CrossRef] [PubMed]
  20. Andreeva, A.; Howorth, D.; Chothia, C.; Kulesha, E.; Murzin, A.G. SCOP2 prototype: A new approach to protein structure mining. Nucleic Acids Res. 2014, 42, D310–D314. [Google Scholar] [CrossRef] [PubMed]
  21. Sillitoe, I.; Lewis, T.E.; Cuff, A.; Das, S.; Ashford, P.; Dawson, N.L.; Furnham, N.; Laskowski, R.A.; Lee, D.; Lees, J.G.; et al. CATH: Comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 2015, 43, D376–D381. [Google Scholar] [CrossRef] [PubMed]
  22. Hannon, J.; Hoyer, D. Molecular biology of 5-HT receptors. Behav. Brain Res. 2008, 195, 198–213. [Google Scholar] [CrossRef] [PubMed]
  23. Deng, Z.L.; Du, C.X.; Li, X.; Hu, B.; Kuang, Z.K.; Wang, R.; Feng, S.Y.; Zhang, H.Y.; Kong, D.X. Exploring the biologically relevant chemical space for drug discovery. J. Chem. Inf. Model. 2013, 53, 2820–2828. [Google Scholar] [CrossRef] [PubMed]
  24. Available Chemicals Directory (ACD), version 2004.1; MDL Information Systems Inc.: San Leandro, CA, USA, 2004.
  25. Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997, 23, 3–25. [Google Scholar] [CrossRef]
  26. Lipinski, C.A. Lead- and drug-like compounds: The rule-of-five revolution. Drug Discov. Today Technol. 2004, 1, 337–341. [Google Scholar] [CrossRef] [PubMed]
  27. Rask-Andersen, M.; Almen, M.S.; Schioth, H.B. Trends in the exploitation of novel drug targets. Nat. Rev. Drug Discov. 2011, 10, 579–590. [Google Scholar] [CrossRef] [PubMed]
  28. George, S.R.; O’Dowd, B.F.; Lee, S.R. G-protein-coupled receptor oligomerization and its potential for drug discovery. Nat. Rev. Drug Discov. 2002, 1, 808–820. [Google Scholar] [CrossRef] [PubMed]
  29. Lagerstrom, M.C.; Schioth, H.B. Structural diversity of G protein-coupled receptors and significance for drug discovery. Nat. Rev. Drug Discov. 2008, 7, 339–357. [Google Scholar] [CrossRef] [PubMed]
  30. Heilker, R.; Wolff, M.; Tautermann, C.S.; Bieler, M. G-protein-coupled receptor-focused drug discovery using a target class platform approach. Drug Discov. Today 2009, 14, 231–240. [Google Scholar] [CrossRef] [PubMed]
  31. Shoichet, B.K.; Kobilka, B.K. Structure-based drug screening for G-protein-coupled receptors. Trends Pharmacol. Sci. 2012, 33, 268–272. [Google Scholar] [CrossRef] [PubMed]
  32. Gatica, E.A.; Cavasotto, C.N. Ligand and decoy sets for docking to G protein-coupled receptors. J. Chem. Inf. Model. 2012, 52, 1–6. [Google Scholar] [CrossRef] [PubMed]
  33. Perez-Garrido, A.; Helguera, A.M.; Borges, F.; Cordeiro, M.N.D.S.; Rivero, V.; Escudero, A.G. Two new parameters based on distances in a receiver operating characteristic chart for the selection of classification models. J. Chem. Inf. Model. 2011, 51, 2746–2759. [Google Scholar] [CrossRef] [PubMed]
  34. Johnstone, R.W. Histone-deacetylase inhibitors: Novel drugs for the treatment of cancer. Nat. Rev. Drug Discov. 2002, 1, 287–299. [Google Scholar] [CrossRef] [PubMed]
  35. Marks, P.A.; Breslow, R. Dimethyl sulfoxide to vorinostat: Development of this histone deacetylase inhibitor as an anticancer drug. Nat. Biotechnol. 2007, 25, 84–90. [Google Scholar] [CrossRef] [PubMed]
  36. Beaulieu, J.M.; Gainetdinov, R.R. The physiology, signaling, and pharmacology of dopamine receptors. Pharmacol. Rev. 2011, 63, 182–217. [Google Scholar] [CrossRef] [PubMed]
  37. Kuang, Z.K.; Feng, S.Y.; Hu, B.; Wang, D.; He, S.B.; Kong, D.X. Predicting subtype selectivity of dopamine receptor ligands with three-dimensional biologically relevant spectrum. Chem. Biol. Drug Des. 2016, 88, 859–872. [Google Scholar] [CrossRef] [PubMed]
  38. He, S.B.; Ben, H.; Kuang, Z.K.; Wang, D.; Kong, D.X. Predicting subtype selectivity for adenosine receptor ligands with three-dimensional biologically relevant spectrum (BRS-3D). Sci. Rep. 2016, 6, 36595. [Google Scholar] [CrossRef] [PubMed]
  39. Lange, J.H.; Kruse, C.G. Keynote review: Medicinal chemistry strategies to CB1 cannabinoid receptor antagonists. Drug Discov. Today 2005, 10, 693–702. [Google Scholar] [CrossRef]
  40. Le Foll, B.; Goldberg, S.R. Cannabinoid CB1 receptor antagonists as promising new medications for drug dependence. J. Pharmacol. Exp. Ther. 2005, 312, 875–883. [Google Scholar] [CrossRef] [PubMed]
  41. Whiteside, G.T.; Lee, G.P.; Valenzano, K.J. The role of the cannabinoid CB2 receptor in pain transmission and therapeutic potential of small molecule CB2 receptor agonists. Curr. Med. Chem. 2007, 14, 917–936. [Google Scholar] [CrossRef] [PubMed]
  42. Maccarrone, M.; Battista, N.; Centonze, D. The endocannabinoid pathway in Huntington’s disease: A comparison with other neurodegenerative diseases. Prog. Neurobiol. 2007, 81, 349–379. [Google Scholar] [CrossRef] [PubMed]
  43. Centonze, D.; Finazzi-Agro, A.; Bernardi, G.; Maccarrone, M. The endocannabinoid system in targeting inflammatory neurodegenerative diseases. Trends Pharmacol. Sci. 2007, 28, 180–187. [Google Scholar] [CrossRef] [PubMed]
  44. Fliri, A.F.; Loging, W.T.; Thadeio, P.F.; Volkmann, R.A. Analysis of drug-induced effect patterns to link structure and side effects of medicines. Nat. Chem. Biol. 2005, 1, 389–397. [Google Scholar] [CrossRef] [PubMed]
  45. Fliri, A.F.; Loging, W.T.; Thadeio, P.F.; Volkmann, R.A. Biospectra analysis: Model proteome characterizations for linking molecular structure and biological response. J. Med. Chem. 2005, 48, 6918–6925. [Google Scholar] [CrossRef] [PubMed]
  46. Fliri, A.F.; Loging, W.T.; Thadeio, P.F.; Volkmann, R.A. Biological spectra analysis: Linking biological activity profiles to molecular structure. Proc. Natl. Acad. Sci. USA 2005, 102, 261–266. [Google Scholar] [CrossRef] [PubMed]
  47. Petrone, F.M.; Simms, B.; Nigsch, F.; Lounkine, E.; Kutchukian, P.; Cornett, A.; Deng, Z.; Davies, J.W.; Jenkins, J.L.; Glick, M. Rethinking molecular similarity: Comparing compounds on the basis of biological activity. ACS Chem. Biol. 2012, 7, 1399–1409. [Google Scholar] [CrossRef] [PubMed]
  48. Wassermann, A.M.; Kutchukian, P.S.; Lounkine, E.; Luethi, T.; Hamon, J.; Bocker, M.T.; Malik, H.A.; Cowan-Jacob, S.W.; Glick, M. Efficient search of chemical space: Navigating from fragments to structurally diverse chemotypes. J. Med. Chem. 2013, 56, 8879–8891. [Google Scholar] [CrossRef] [PubMed]
  49. Wassermann, A.M.; Lounkine, E.; Urban, L.; Whitebread, S.; Chen, S.N.; Hughes, K.; Guo, H.Q.; Kutlina, E.; Fekete, A.; Klumpp, M.; et al. A screening pattern recognition method finds new and divergent targets for drugs and natural products. ACS Chem. Biol. 2014, 9, 1622–1631. [Google Scholar] [CrossRef] [PubMed]
  50. Helal, K.Y.; Maciejewski, M.; Gregori-Puigjane, E.; Glick, M.; Wassermann, A.M. Public domain HTS fingerprints: Design and evaluation of compound bioactivity profiles from PubChem’s bioassay repository. J. Chem. Inf. Model. 2016, 56, 390–398. [Google Scholar] [CrossRef] [PubMed]
  51. Lamb, J.; Crawford, E.D.; Peck, D.; Modell, J.W.; Blat, I.C.; Wrobel, M.J.; Lerner, J.; Brunet, J.P.; Subramanian, A.; Ross, K.N.; et al. The Connectivity Map: Using gene-expression signatures to connect small molecules, genes, and disease. Science 2006, 313, 1929–1935. [Google Scholar] [CrossRef] [PubMed]
  52. Kellenberger, E.; Foata, N.; Rognan, D. Ranking targets in structure-based virtual screening of three-dimensional protein libraries: Methods and problems. J. Chem. Inf. Model. 2008, 48, 1014–1025. [Google Scholar] [CrossRef] [PubMed]
  53. Steindl, T.M.; Schuster, D.; Laggner, C.; Langer, T. Parallel screening: A novel concept in pharmacophore modeling and virtual screening. J. Chem. Inf. Model. 2006, 46, 2146–2157. [Google Scholar] [CrossRef] [PubMed]
  54. Sato, T.; Yuki, H.; Takaya, D.; Sasaki, S.; Tanaka, A.; Honma, T. Application of support vector machine to three-dimensional shape-based virtual screening using comprehensive three-dimensional molecular shape overlay with known inhibitors. J. Chem. Inf. Model. 2012, 52, 1015–1026. [Google Scholar] [CrossRef] [PubMed]
  55. Hopkins, A.L.; Groom, C.R. The druggable genome. Nat. Rev. Drug Discov. 2002, 1, 727–730. [Google Scholar] [CrossRef] [PubMed]
  56. Fujita, T.; Winkler, D.A. Understanding the roles of the “two QSARs”. J. Chem. Inf. Model. 2016, 56, 269–274. [Google Scholar] [CrossRef] [PubMed]
  57. Ma, D.L.; Chan, D.S.; Leung, C.H. Drug repositioning by structure-based virtual screening. Chem. Soc. Rev. 2013, 42, 2130–2141. [Google Scholar] [CrossRef] [PubMed]
  58. Meslamani, J.; Rognan, D.; Kellenberger, E. sc-PDB: A database for identifying variations and multiplicity of ‘druggable’ binding sites in proteins. Bioinformatics 2011, 27, 1324–1326. [Google Scholar] [CrossRef] [PubMed]
  59. Lemmen, C.; Lengauer, T.; Klebe, G. FLEXS: A method for fast flexible ligand superposition. J. Med. Chem. 1998, 41, 4502–4520. [Google Scholar] [CrossRef] [PubMed]
  60. Grant, J.A.; Gallardo, M.A.; Pickup, B.T. A fast method of molecular shape comparison: A simple application of a Gaussian description of molecular shape. J. Comput. Chem. 1996, 17, 1653–1666. [Google Scholar] [CrossRef]
  61. Jain, A.N. Surflex: Fully automatic flexible molecular docking using a molecular similarity-based search engine. J. Med. Chem. 2003, 46, 499–511. [Google Scholar] [CrossRef] [PubMed]
  62. Jain, A.N. Morphological similarity: A 3D molecular similarity method correlated with protein-ligand recognition. J. Comput. Aided Mol. Des. 2000, 14, 199–213. [Google Scholar] [CrossRef] [PubMed]
  63. sc-PDB. An Annotated Database of Druggable Binding Sites from the Protein Data Bank. Available online: http://bioinfo-pharma.u-strasbg.fr/scPDB/ (accessed on 31 August 2013).
  64. Pipeline Pilot, version 8.5; Accelrys Software Inc.: San Diego, CA, USA, 2011.
  65. Shiraishi, A.; Niijima, S.; Brown, J.B.; Nakatsui, M.; Okuno, Y. Chemical genomics approach for GPCR-ligand interaction prediction and extraction of ligand binding determinants. J. Chem. Inf. Model. 2013, 53, 1253–1262. [Google Scholar] [CrossRef] [PubMed]
  66. Computational Chemistry & Drug Design. Available online: http://cavasotto-lab.net/Databases/GDD/Download/ (accessed on 15 July 2014).
  67. Hinselmann, G.; Rosenbaum, L.; Jahn, A.; Fechner, N.; Ostermann, C.; Zell, A. Large-scale learning of structure-activity relationships using a linear support vector machine and problem-specific metrics. J. Chem. Inf. Model. 2011, 51, 203–213. [Google Scholar] [CrossRef] [PubMed]
  68. Fang, J.; Yang, R.; Gao, L.; Zhou, D.; Yang, S.; Liu, A.L.; Du, G.H. Predictions of BuChE inhibitors using support vector machine and naive Bayesian classification techniques in drug discovery. J. Chem. Inf. Model. 2013, 53, 3009–3020. [Google Scholar] [CrossRef] [PubMed]
  69. Heikamp, K.; Bajorath, J. Comparison of confirmed inactive and randomly selected compounds as negative training examples in support vector machine-based virtual screening. J. Chem. Inf. Model. 2013, 53, 1595–1601. [Google Scholar] [CrossRef] [PubMed]
  70. Heikamp, K.; Bajorath, J. Prediction of compounds with closely related activity profiles using weighted support vector machine linear combinations. J. Chem. Inf. Model. 2013, 53, 791–801. [Google Scholar] [CrossRef] [PubMed]
  71. Li, L.; Khanna, M.; Jo, I.; Wang, F.; Ashpole, N.M.; Hudmon, A.; Meroueh, S.O. Target-specific support vector machine scoring in structure-based virtual screening: Computational validation, in vitro testing in kinases, and effects on lung cancer cell proliferation. J. Chem. Inf. Model. 2011, 51, 755–759. [Google Scholar] [CrossRef] [PubMed]
  72. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27. [Google Scholar] [CrossRef]
  73. Alexander, D.L.; Tropsha, A.; Winkler, D.A. Beware of R2: Simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models. J. Chem. Inf. Model. 2015, 55, 1316–1322. [Google Scholar] [CrossRef] [PubMed]
  74. Strobl, C.; Boulesteix, A.L.; Zeileis, A.; Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform. 2007, 8, 25. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  75. Dragon (for Windows), version 5.4; Talete srl: Milano, Italy, 2006.
  76. Molecular Operating Environment (MOE), version 2009.10; Chemical Computing Group Inc.: Montreal, QC, Canada, 2009.
  • Sample Availability: Not available.
Figure 1. Physicochemical properties of BRCD-3D (3D Biologically-relevant Representative Compound Database) ligands and classifications of their corresponding targets. (A–F) Properties distribution of the ligands, calculated by Pipeline Pilot 8.5. MW: molecular weight; AlogP: the octanol-water partition coefficient; HBAs: the count of hydrogen bond acceptors; HBDs: the count of hydrogen bond donors; PSA: polar surface area; RBs: the number of rotatable bonds; (G) Pie chart of the enzyme types of the targets. Details are shown in Supplementary Materials Table S3; (H) Pie chart of the SCOP classification of the targets. Entries without SCOP annotations are not taken into account. Details are shown in Supplementary Materials Table S4.
Figure 2. Comparison of the three methods in handling the data imbalance. The red circle denotes model results based on data sets with the proportion of 1:10 (ligands:decoys). The purple triangle denotes model results based on data sets with the original ratio (1:39). The blue diamond denotes results of models with different weight for ligands class and decoys class in the SVM model training.
Figure 3. Comparison of SVM models using different BRS-3D feature subsets. For most data sets, prediction performance improved as the number of features increased.
Figure 4. MCC values of SVM models based on BRS-3D and two other state-of-the-art descriptors.
Figure 5. The CB1/CB2 subtype selectivity prediction models. (A) Cross-validation Q2 and test set R2 of the regression models with different feature subsets; (B) cross-validation and test set RMSE of the regression models with different feature subsets; (C) relationship between experimental and predicted SR of the model with 20% BRS-3D features; (D) discriminant models with different feature subsets; and (E,F) distribution of the selective compounds in the chemical space, composed of the most important features.
Scheme 1. Workflow of QSAR study based on BRS-3D. The process contains three steps: (A) Construction of BRCD-3D. 3D shape similarity calculations and structural clustering were used to extract a set of 300 diverse ligands from the druggable protein-ligand database, sc-PDB. This ligand set was named the BRCD-3D; (B) Calculation of BRS-3D. The objective compound was flexibly superimposed onto the 300 BRCD-3D ligands (magenta ones), resulting in 300 similarity scores. The array of the scores was defined as BRS-3D, which could be used as a multi-dimensional molecular descriptor in virtual screening and QSAR studies; (C) Model development. Discriminant or regression models were developed with the machine learning methods (e.g., SVM), taking BRS-3D as the independent variable.
Table 1. Results of the discriminant models for the 42 GLL/GDD (G protein-coupled receptor (GPCR) ligand library and the GPCR decoy database) data sets. Models were built with the “weighted” method for handling the data imbalance. The CV AUC is a 10-fold cross-validation result of the training set. Accuracy, Precision, Recall, and MCC are the prediction results for the test set. Results of the other two treatments for data imbalance can be found in Supplementary Materials Table S6.
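| No. | Data Set | CV AUC | Accuracy | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| 1 | 5HT1A_Agonist | 0.989 | 0.994 | 0.986 | 0.763 | 0.865 |
| 2 | 5HT1A_Antagonist | 0.975 | 0.992 | 0.888 | 0.782 | 0.829 |
| 3 | 5HT1D_Agonist | 0.988 | 0.993 | 1.000 | 0.703 | 0.835 |
| 4 | 5HT1D_Antagonist | 0.980 | 0.995 | 0.981 | 0.825 | 0.898 |
| 5 | 5HT2A_Antagonist | 0.981 | 0.992 | 0.894 | 0.759 | 0.820 |
| 6 | 5HT2C_Agonist | 0.983 | 0.986 | 0.721 | 0.738 | 0.722 |
| 7 | 5HT2C_Antagonist | 0.957 | 0.991 | 1.000 | 0.625 | 0.787 |
| 8 | 5HT4R_Agonist | 0.992 | 0.991 | 1.000 | 0.638 | 0.795 |
| 9 | 5HT4R_Antagonist | 0.993 | 0.997 | 1.000 | 0.875 | 0.933 |
| 10 | AA1R_Antagonist | 0.986 | 0.992 | 0.894 | 0.750 | 0.814 |
| 11 | AA2AR_Antagonist | 0.985 | 0.995 | 0.983 | 0.808 | 0.889 |
| 12 | AA2BR_Antagonist | 0.984 | 0.993 | 0.894 | 0.797 | 0.841 |
| 13 | ACM1_Agonist | 0.985 | 0.992 | 0.851 | 0.820 | 0.831 |
| 14 | ACM3_Antagonist | 0.983 | 0.991 | 0.930 | 0.678 | 0.790 |
| 15 | ADA1A_Antagonist | 0.983 | 0.993 | 0.968 | 0.763 | 0.856 |
| 16 | ADA1B_Antagonist | 0.988 | 0.994 | 0.889 | 0.873 | 0.878 |
| 17 | ADA1D_Antagonist | 0.987 | 0.994 | 0.948 | 0.807 | 0.872 |
| 18 | ADA2A_Antagonist | 0.953 | 0.991 | 0.983 | 0.648 | 0.794 |
| 19 | ADA2B_Antagonist | 0.959 | 0.979 | 0.562 | 0.839 | 0.677 |
| 20 | ADA2C_Antagonist | 0.961 | 0.992 | 0.967 | 0.686 | 0.811 |
| 21 | ADRB1_Agonist | 0.995 | 0.992 | 0.912 | 0.738 | 0.816 |
| 22 | ADRB1_Antagonist | 0.986 | 0.991 | 0.964 | 0.643 | 0.783 |
| 23 | ADRB2_Agonist | 0.992 | 0.996 | 0.904 | 0.927 | 0.914 |
| 24 | ADRB2_Antagonist | 0.990 | 0.995 | 0.971 | 0.829 | 0.895 |
| 25 | ADRB3_Agonist | 0.994 | 0.996 | 0.982 | 0.860 | 0.917 |
| 26 | AG2R_Antagonist | 0.996 | 0.998 | 0.996 | 0.907 | 0.949 |
| 27 | CCKAR_Antagonist | 0.986 | 0.993 | 1.000 | 0.722 | 0.847 |
| 28 | CLTR1_Antagonist | 0.981 | 0.992 | 0.979 | 0.701 | 0.825 |
| 29 | DRD2_Antagonist | 0.977 | 0.992 | 0.951 | 0.726 | 0.827 |
| 30 | DRD3_Antagonist | 0.982 | 0.993 | 0.941 | 0.750 | 0.837 |
| 31 | DRD4_Antagonist | 0.993 | 0.995 | 0.982 | 0.827 | 0.899 |
| 32 | EDNRA_Antagonist | 0.987 | 0.994 | 0.932 | 0.809 | 0.865 |
| 33 | EDNRB_Antagonist | 0.986 | 0.993 | 0.902 | 0.814 | 0.853 |
| 34 | GASR_Antagonist | 0.990 | 0.995 | 0.979 | 0.816 | 0.891 |
| 35 | HRH3_Antagonist | 0.997 | 0.992 | 0.958 | 0.730 | 0.833 |
| 36 | LSHR_Antagonist | 0.990 | 0.989 | 1.000 | 0.543 | 0.733 |
| 37 | NK1R_Antagonist | 0.980 | 0.991 | 0.914 | 0.711 | 0.802 |
| 38 | OPRD_Agonist | 0.990 | 0.993 | 1.000 | 0.722 | 0.847 |
| 39 | OPRK_Agonist | 0.990 | 0.990 | 1.000 | 0.596 | 0.768 |
| 40 | TA2R_Antagonist | 0.991 | 0.994 | 0.974 | 0.772 | 0.864 |
| 41 | V1AR_Antagonist | 0.986 | 0.993 | 0.971 | 0.733 | 0.840 |
| 42 | V1BR_Antagonist | 0.983 | 0.992 | 0.969 | 0.689 | 0.813 |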
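The four test-set metrics in Table 1 follow from a binary confusion matrix. The sketch below uses the standard definitions of accuracy, precision, recall, and the Matthews correlation coefficient (MCC). The counts are hypothetical (100 actives and 3900 decoys) and were chosen only to show why accuracy can exceed 0.99 on strongly imbalanced data while recall stays much lower; they are not taken from any data set in the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


# Hypothetical test set: 100 actives (75 found, 25 missed)
# and 3900 decoys (5 wrongly flagged as active).
metrics = classification_metrics(tp=75, fp=5, tn=3895, fn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, a quarter of the actives are missed, yet accuracy remains above 0.99 because decoys dominate the total. MCC reflects all four cells of the confusion matrix and is therefore a more informative summary for data sets of this kind.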
Table 2. SVM models based on Dragon 2D, MOE 3D, and BRS-3D descriptors.
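| Data Set No. | Accuracy (Dragon 2D) | Accuracy (MOE 3D) | Accuracy (BRS-3D) | Precision (Dragon 2D) | Precision (MOE 3D) | Precision (BRS-3D) | Recall (Dragon 2D) | Recall (MOE 3D) | Recall (BRS-3D) | MCC (Dragon 2D) | MCC (MOE 3D) | MCC (BRS-3D) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.993 | 0.951 | 0.994 | 0.849 | 0.330 | 0.986 | 0.889 | 0.932 | 0.763 | 0.866 | 0.539 | 0.865 |
| 2 | 0.992 | 0.991 | 0.992 | 0.819 | 0.814 | 0.888 | 0.851 | 0.822 | 0.782 | 0.831 | 0.813 | 0.829 |
| 3 | 0.987 | 0.977 | 0.993 | 0.671 | 0.526 | 1.000 | 0.919 | 0.919 | 0.703 | 0.779 | 0.686 | 0.835 |
| 4 | 0.981 | 0.992 | 0.995 | 0.568 | 0.831 | 0.981 | 0.921 | 0.857 | 0.825 | 0.715 | 0.840 | 0.898 |
| 5 | 0.964 | 0.945 | 0.992 | 0.401 | 0.300 | 0.894 | 0.883 | 0.903 | 0.759 | 0.581 | 0.502 | 0.820 |
| 6 | 0.987 | 0.980 | 0.986 | 0.744 | 0.571 | 0.721 | 0.762 | 0.857 | 0.738 | 0.747 | 0.691 | 0.722 |
| 7 | 0.983 | 0.949 | 0.991 | 0.636 | 0.317 | 1.000 | 0.766 | 0.906 | 0.625 | 0.690 | 0.519 | 0.787 |
| 8 | 0.998 | 0.979 | 0.991 | 0.939 | 0.541 | 1.000 | 0.979 | 0.979 | 0.638 | 0.957 | 0.719 | 0.795 |
| 9 | 0.995 | 0.994 | 0.997 | 0.855 | 0.863 | 1.000 | 0.979 | 0.917 | 0.875 | 0.912 | 0.886 | 0.933 |
| 10 | 0.990 | 0.986 | 0.992 | 0.774 | 0.705 | 0.894 | 0.857 | 0.769 | 0.750 | 0.810 | 0.729 | 0.814 |
| 11 | 0.989 | 0.982 | 0.995 | 0.756 | 0.602 | 0.983 | 0.808 | 0.808 | 0.808 | 0.776 | 0.689 | 0.889 |
| 12 | 0.987 | 0.978 | 0.993 | 0.794 | 0.539 | 0.894 | 0.676 | 0.865 | 0.797 | 0.726 | 0.672 | 0.841 |
| 13 | 0.986 | 0.962 | 0.992 | 0.662 | 0.383 | 0.851 | 0.888 | 0.882 | 0.820 | 0.760 | 0.567 | 0.831 |
| 14 | 0.988 | 0.982 | 0.991 | 0.714 | 0.605 | 0.930 | 0.847 | 0.831 | 0.678 | 0.772 | 0.700 | 0.790 |
| 15 | 0.980 | 0.960 | 0.993 | 0.557 | 0.373 | 0.968 | 0.915 | 0.881 | 0.763 | 0.705 | 0.558 | 0.856 |
| 16 | 0.990 | 0.955 | 0.994 | 0.758 | 0.349 | 0.889 | 0.882 | 0.927 | 0.873 | 0.812 | 0.554 | 0.878 |
| 17 | 0.977 | 0.955 | 0.994 | 0.528 | 0.347 | 0.948 | 0.904 | 0.904 | 0.807 | 0.681 | 0.544 | 0.872 |
| 18 | 0.957 | 0.938 | 0.991 | 0.359 | 0.270 | 0.983 | 0.909 | 0.864 | 0.648 | 0.556 | 0.462 | 0.794 |
| 19 | 0.973 | 0.982 | 0.979 | 0.471 | 0.632 | 0.562 | 0.736 | 0.690 | 0.839 | 0.576 | 0.651 | 0.677 |
| 20 | 0.955 | 0.978 | 0.992 | 0.335 | 0.542 | 0.967 | 0.837 | 0.674 | 0.686 | 0.513 | 0.593 | 0.811 |
| 21 | 0.986 | 0.974 | 0.992 | 0.655 | 0.494 | 0.912 | 0.905 | 0.905 | 0.738 | 0.763 | 0.658 | 0.816 |
| 22 | 0.989 | 0.977 | 0.991 | 0.702 | 0.522 | 0.964 | 0.952 | 0.857 | 0.643 | 0.812 | 0.659 | 0.783 |
| 23 | 0.995 | 0.987 | 0.996 | 0.900 | 0.694 | 0.904 | 0.878 | 0.829 | 0.927 | 0.886 | 0.752 | 0.914 |
| 24 | 0.974 | 0.971 | 0.995 | 0.452 | 0.429 | 0.971 | 0.917 | 0.917 | 0.829 | 0.633 | 0.616 | 0.895 |
| 25 | 0.990 | 0.986 | 0.996 | 0.738 | 0.665 | 0.982 | 0.938 | 0.907 | 0.860 | 0.827 | 0.770 | 0.917 |
| 26 | 0.996 | 0.994 | 0.998 | 0.877 | 0.843 | 0.996 | 0.967 | 0.927 | 0.907 | 0.918 | 0.881 | 0.949 |
| 27 | 0.992 | 0.972 | 0.993 | 0.857 | 0.467 | 1.000 | 0.833 | 0.875 | 0.722 | 0.841 | 0.627 | 0.847 |
| 28 | 0.994 | 0.968 | 0.992 | 0.857 | 0.441 | 0.979 | 0.896 | 0.940 | 0.701 | 0.873 | 0.632 | 0.825 |
| 29 | 0.964 | 0.987 | 0.992 | 0.403 | 0.699 | 0.951 | 0.925 | 0.811 | 0.726 | 0.597 | 0.746 | 0.827 |
| 30 | 0.948 | 0.975 | 0.993 | 0.313 | 0.500 | 0.941 | 0.891 | 0.734 | 0.750 | 0.510 | 0.594 | 0.837 |
| 31 | 0.981 | 0.975 | 0.995 | 0.575 | 0.502 | 0.982 | 0.955 | 0.955 | 0.827 | 0.733 | 0.683 | 0.899 |
| 32 | 0.994 | 0.951 | 0.994 | 0.864 | 0.329 | 0.932 | 0.890 | 0.912 | 0.809 | 0.874 | 0.531 | 0.865 |
| 33 | 0.990 | 0.975 | 0.993 | 0.758 | 0.505 | 0.902 | 0.858 | 0.912 | 0.814 | 0.801 | 0.668 | 0.853 |
| 34 | 0.994 | 0.981 | 0.995 | 0.861 | 0.582 | 0.979 | 0.921 | 0.904 | 0.816 | 0.887 | 0.717 | 0.891 |
| 35 | 0.977 | 0.988 | 0.992 | 0.524 | 0.736 | 0.958 | 0.857 | 0.841 | 0.730 | 0.660 | 0.781 | 0.833 |
| 36 | 0.986 | 0.985 | 0.989 | 0.733 | 0.744 | 1.000 | 0.717 | 0.630 | 0.543 | 0.718 | 0.677 | 0.733 |
| 37 | 0.992 | 0.992 | 0.991 | 0.805 | 0.830 | 0.914 | 0.872 | 0.839 | 0.711 | 0.834 | 0.830 | 0.802 |
| 38 | 0.994 | 0.946 | 0.993 | 0.867 | 0.307 | 1.000 | 0.903 | 0.917 | 0.722 | 0.882 | 0.513 | 0.847 |
| 39 | 0.995 | 0.935 | 0.990 | 0.869 | 0.251 | 1.000 | 0.930 | 0.807 | 0.596 | 0.896 | 0.428 | 0.768 |
| 40 | 0.973 | 0.988 | 0.994 | 0.479 | 0.721 | 0.974 | 0.945 | 0.876 | 0.772 | 0.662 | 0.789 | 0.864 |
| 41 | 0.994 | 0.987 | 0.993 | 0.870 | 0.800 | 0.971 | 0.889 | 0.622 | 0.733 | 0.876 | 0.699 | 0.840 |
| 42 | 0.992 | 0.969 | 0.992 | 0.768 | 0.435 | 0.969 | 0.956 | 0.822 | 0.689 | 0.853 | 0.585 | 0.813 |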
Table 3. The 42 GLL/GDD data sets with more than 200 ligands.
| No. | Target | Target Name | Ligand Type | Ligand Count | Decoy Count |
|---|---|---|---|---|---|
| 1 | 5HT1A | 5-hydroxytryptamine receptor 1A | Agonist | 952 | 37,128 |
| 2 | 5HT1A | 5-hydroxytryptamine receptor 1A | Antagonist | 506 | 19,734 |
| 3 | 5HT1D | 5-hydroxytryptamine receptor 1D | Agonist | 558 | 21,762 |
| 4 | 5HT1D | 5-hydroxytryptamine receptor 1D | Antagonist | 315 | 12,285 |
| 5 | 5HT2A | 5-hydroxytryptamine receptor 2A | Antagonist | 725 | 28,275 |
| 6 | 5HT2C | 5-hydroxytryptamine receptor 2C | Agonist | 209 | 8151 |
| 7 | 5HT2C | 5-hydroxytryptamine receptor 2C | Antagonist | 318 | 12,402 |
| 8 | 5HT4R | 5-hydroxytryptamine receptor 4 | Agonist | 235 | 9165 |
| 9 | 5HT4R | 5-hydroxytryptamine receptor 4 | Antagonist | 241 | 9399 |
| 10 | AA1R | Adenosine receptor A1 | Antagonist | 280 | 10,920 |
| 11 | AA2AR | Adenosine receptor A2a | Antagonist | 361 | 14,079 |
| 12 | AA2BR | Adenosine receptor A2b | Antagonist | 370 | 14,430 |
| 13 | ACM1 | Muscarinic acetylcholine receptor M1 | Agonist | 806 | 31,434 |
| 14 | ACM3 | Muscarinic acetylcholine receptor M3 | Antagonist | 295 | 11,505 |
| 15 | ADA1A | Alpha-1A adrenergic receptor | Antagonist | 588 | 22,932 |
| 16 | ADA1B | Alpha-1B adrenergic receptor | Antagonist | 550 | 21,450 |
| 17 | ADA1D | Alpha-1D adrenergic receptor | Antagonist | 568 | 22,152 |
| 18 | ADA2A | Alpha-2A adrenergic receptor | Antagonist | 440 | 17,160 |
| 19 | ADA2B | Alpha-2B adrenergic receptor | Antagonist | 437 | 17,043 |
| 20 | ADA2C | Alpha-2C adrenergic receptor | Antagonist | 433 | 16,887 |
| 21 | ADRB1 | Beta-1 adrenergic receptor | Agonist | 209 | 8151 |
| 22 | ADRB1 | Beta-1 adrenergic receptor | Antagonist | 211 | 8229 |
| 23 | ADRB2 | Beta-2 adrenergic receptor | Agonist | 206 | 8034 |
| 24 | ADRB2 | Beta-2 adrenergic receptor | Antagonist | 204 | 7956 |
| 25 | ADRB3 | Beta-3 adrenergic receptor | Agonist | 643 | 25,077 |
| 26 | AG2R | Type-1 angiotensin II receptor | Antagonist | 1502 | 58,578 |
| 27 | CCKAR | Cholecystokinin receptor type A | Antagonist | 360 | 14,040 |
| 28 | CLTR1 | Cysteinyl leukotriene receptor 1 | Antagonist | 333 | 12,987 |
| 29 | DRD2 | D2 dopamine receptor | Antagonist | 529 | 20,631 |
| 30 | DRD3 | D3 dopamine receptor | Antagonist | 317 | 12,363 |
| 31 | DRD4 | D4 dopamine receptor | Antagonist | 665 | 25,935 |
| 32 | EDNRA | Endothelin-1 receptor | Antagonist | 676 | 26,364 |
| 33 | EDNRB | Endothelin B receptor | Antagonist | 561 | 21,879 |
| 34 | GASR | Gastrin/cholecystokinin type B receptor | Antagonist | 567 | 22,113 |
| 35 | HRH3 | Histamine H3 receptor | Antagonist | 313 | 12,207 |
| 36 | LSHR | Lutropin-choriogonadotropic hormone receptor | Antagonist | 230 | 8970 |
| 37 | NK1R | Substance-P receptor | Antagonist | 900 | 35,100 |
| 38 | OPRD | Delta-type opioid receptor | Agonist | 361 | 14,079 |
| 39 | OPRK | Kappa-type opioid receptor | Agonist | 284 | 11,076 |
| 40 | TA2R | Thromboxane A2 receptor | Antagonist | 725 | 28,275 |
| 41 | V1AR | Vasopressin V1a receptor | Antagonist | 225 | 8775 |
| 42 | V1BR | Vasopressin V1b receptor | Antagonist | 225 | 8775 |

Hu, B.; Kuang, Z.-K.; Feng, S.-Y.; Wang, D.; He, S.-B.; Kong, D.-X. Three-Dimensional Biologically Relevant Spectrum (BRS-3D): Shape Similarity Profile Based on PDB Ligands as Molecular Descriptors. Molecules 2016, 21, 1554. https://doi.org/10.3390/molecules21111554