These authors contributed equally to this work.

This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

The Hedgehog signaling pathway is an essential agent in developmental patterning, wherein the local concentration of the Hedgehog morphogens directs cellular differentiation and expansion. Furthermore, the Hedgehog pathway has been implicated in tumor/stromal interaction and in cancer stem cells. The search for novel inhibitors of the Hedgehog signaling pathway is now drawing increasing attention from biological, chemical and pharmacological scientists. In our study, a solid computational model is proposed which incorporates various statistical analysis methods to perform a Quantitative Structure-Activity Relationship (QSAR) study on the inhibitors of Hedgehog signaling. The whole QSAR data set contains 93 cyclopamine derivatives as well as their activities against four different cell lines (NCI-H446, BxPC-3, SW1990 and NCI-H157). Our extensive testing indicated that the binary classification model is a better choice for building the QSAR model of inhibitors of Hedgehog signaling compared with other statistical methods and the corresponding

The hedgehog signaling pathway plays a key role in the control of cell differentiation, growth, and proliferation [

Due to the direct relationship between the activation of the Hedgehog signaling pathway and oncogenesis, cancer researchers have been dedicated to finding specific inhibitors of Hedgehog signaling, since such inhibitors could provide efficient therapies for a wide range of malignancies [

In order to better understand the Hedgehog signaling pathway and to design efficient inhibitors for it, 93 cyclopamine derivatives were synthesized and their activities were tested against four different cell lines (BxPC-3, NCI-H446, SW1990 and NCI-H157) [

Based on the computational framework outlined in Material and Methods, the following results or clues were obtained for the QSAR modeling of inhibitors of Hedgehog signal pathway.

As mentioned above, two distinct sets of descriptors were tested to describe the 93 chemical compounds respectively (

In conclusion, models derived from DLI descriptors are much more stable on both training data and testing data, whereas general descriptors cannot guarantee such stability on independent data.

It is well known that QSAR predictions are only reliable within or near the property space used to train the model. Preparing a robust, unbiased and sufficiently large training set is therefore critically important for building a proper statistical model. As mentioned above, two data division methods,

In order to statistically assess the difference between the results obtained with the two kinds of data division, a paired t-test was performed and the p-value derived from the above two tables (
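The paired t-test mentioned here can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `paired_t_statistic` is hypothetical, and the input is simply the non-σ values of the first data row of each results table, paired position by position, which is an assumption about how the pairing was done. Only the t statistic and the degrees of freedom are computed; converting them to a p-value requires the Student t distribution from a statistics package.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance of the differences
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Illustrative input only (assumed pairing): first data row of each results table,
# sigma columns omitted. The metric reported in that row is not identified here.
diverse_subset = [0.552, 0.494, 0.659, 0.526, 0.644, 0.585, 0.527, 0.531]
cluster_based = [0.506, 0.474, 0.593, 0.396, 0.542, 0.493, 0.587, 0.542]

t, dof = paired_t_statistic(diverse_subset, cluster_based)
print(f"t = {t:.3f}, degrees of freedom = {dof}")
```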

When building a QSAR model, linear regression methods are normally preferred to advanced non-linear methods, since linear models make it easier to give a physical explanation of the prediction results. The most classical linear model in QSAR is PLS, which has been widely used in popular computer-aided drug design software [

Since advanced machine learning methods such as ANN [

When the quality of the data or the underlying mechanism is not suitable for regression modeling, binary classification was applied to the data to estimate each compound's probability of being active or inactive. MOE offers a binary filter for the numerical data. Any properties which can be represented in a binary (yes/no) way (like active/inactive, toxic/non-toxic, drug-like/non-drug-like, permeable/non-permeable,

As shown in

The SVM classification was also applied to further validate the efficiency of binary classification models compared with regression models. The results shown in

Four different cell lines (NCI-H446, NCI-H157, SW1990 and BxPC-3) were used to test the cytotoxicity of the 93 compounds. However, only the data of NCI-H446 can produce a reasonable model by QSAR analysis; the prediction accuracy of the models against all the other cell lines is about 0.6.

Why do some specific cell lines not fit our QSAR analysis well? We speculate that the most likely reason is the non-specific cytotoxic effect of these compounds on the other three cell lines. For example, NCI-H157 and BxPC-3 do not express the Gli and Smoothened proteins, respectively [

In our study,

The first important finding is that through such

A comprehensive computational workflow was designed to perform QSAR analysis on the inhibitors of Hedgehog signaling. This workflow is outlined in

Our analysis started by using two different descriptors,

93 cyclopamine derivatives together with their activities against four different cell lines (BxPC-3, NCI-H446, SW1990 and NCI-H157) were tested and are listed in the

Two different approaches were applied to divide these experimental data into a training set and a testing set for the subsequent statistical modeling. Details follow.

Briefly, the

In contrast to the above method, a clustering step is applied here before Diverse Subset. Diverse Subset is then performed on each cluster to rank its members. Finally, the training dataset and testing dataset are generated by pooling the sub-training sets (65% of each cluster) and the sub-testing sets (35% of each cluster) from all clusters. It should be noted that MOE can cluster the whole data set based on either descriptors or fingerprints. To save computing time, the descriptor-based clustering in MOE was used in our study because it is a simple 3N algorithm, whereas fingerprint-based clustering uses the N^{2} Jarvis-Patrick algorithm.
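The cluster-wise 65%/35% division described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MOE implementation: the diversity ranking is taken as given (here simply the order of `ranked_ids`), the function name `split_by_cluster` is hypothetical, and rounding the training share up with `math.ceil` is an assumed convention.

```python
import math
from collections import defaultdict

def split_by_cluster(ranked_ids, cluster_of, train_fraction=0.65):
    """Split compounds into training and testing sets cluster by cluster.

    ranked_ids: compound ids ordered by diversity rank (assumed to be precomputed).
    cluster_of: dict mapping each compound id to its cluster label.
    """
    clusters = defaultdict(list)
    for cid in ranked_ids:
        clusters[cluster_of[cid]].append(cid)  # keeps the rank order inside each cluster

    train, test = [], []
    for members in clusters.values():
        n_train = math.ceil(train_fraction * len(members))  # assumed rounding rule
        train.extend(members[:n_train])   # top-ranked share of the cluster
        test.extend(members[n_train:])    # remainder of the cluster
    return train, test

# Hypothetical toy data: ten compounds in two clusters.
ids = list(range(10))
labels = {i: ("a" if i < 6 else "b") for i in ids}
train_ids, test_ids = split_by_cluster(ids, labels)
print(train_ids, test_ids)
```

Splitting inside each cluster keeps every structural family represented in both sets, which is the motivation given in the text for clustering before the diversity ranking.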

Many descriptors are available to describe a chemical compound, including constitutional descriptors, physicochemical property descriptors, electronic descriptors, topological indices, geometrical descriptors, and quantum chemistry descriptors,

General descriptors include atomic contributions to van der Waals surface area, log P (octanol/water), molar refractivity and partial charge. These descriptors have been applied to the construction of QSAR models for boiling point, vapor pressure, free energy of solvation in water, solubility in water, thrombin/trypsin/factor Xa activity, blood-brain barrier permeability and compound classification. The wide application of these descriptors suggests their value in QSAR modeling, combinatorial library design and molecular diversity work.

On the other hand, DLI descriptors act as an approach to measure drug-like compounds, as first presented by Xu

Although these two sets of descriptors are both computable from connection table information, they partly complement each other. Normally, general descriptors favor the physical properties of compounds, while DLI descriptors favor simple topological indices of compounds.

In our computational framework, various statistical models were incorporated to evaluate their performance in QSAR analysis of inhibitors of Hedgehog signal pathway, and we wanted to find the most suitable statistical analysis method for the QSAR modeling of such data. Detailed descriptions of each statistical method are listed below.

The PLS QSAR method [^{6} which is a very high setting. The leave-one-out cross validation (LOO-CV) scheme was used to validate the models and the correlation coefficient (Q2) and root-mean-square error (RMSE) were reported.
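The leave-one-out scheme and the two reported statistics can be sketched as follows. This is a minimal illustration, not the MOE PLS implementation: a one-descriptor least-squares line stands in for PLS, Q^{2} is computed as 1 − PRESS/TSS (one common definition, assumed here), and the function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (stand-in for PLS)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def loo_cv(xs, ys):
    """Leave-one-out cross-validation: return (Q2, RMSE) of the held-out predictions."""
    preds = []
    for i in range(len(xs)):
        # Refit the model without compound i, then predict compound i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))  # predictive residual sum of squares
    mean_y = sum(ys) / len(ys)
    tss = sum((y - mean_y) ** 2 for y in ys)              # total sum of squares
    return 1.0 - press / tss, math.sqrt(press / len(ys))

# Hypothetical descriptor values and activities, for illustration only.
q2, rmse = loo_cv([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1])
print(f"Q2 = {q2:.3f}, RMSE = {rmse:.3f}")
```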

SVR, which has proven to be a powerful regression technique in many applications, was used here for comparison with PLS regression. SVR is the regression version derived from SVM which was proposed in 1996 by Vladimir Vapnik

The binary Bayesian QSAR method was employed by using the QuaSAR-Model module of MOE 2008. In this modeling, the numerical values of inhibitor activity were converted to binary classification labels, which greatly reduces the noise of the data. That is, the binary model is used to predict the probability that a given compound is active or inactive rather than its numerical activity value. Since no quantitative estimate of the actual activity is derived, a compound is referred to as “active” if its predicted probability of being active is more than 0.5.
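The 0.5 cutoff and the accuracy measures used to score a binary model (total accuracy, accuracy on the inactive subset and accuracy on the active subset) can be sketched as follows. This is a minimal illustration; the function names `classify` and `accuracies` are hypothetical, and the probabilities in the example are made up.

```python
def classify(probabilities, cutoff=0.5):
    """Label a compound active when its predicted probability exceeds the cutoff."""
    return [p > cutoff for p in probabilities]

def accuracies(true_active, predicted_active):
    """Return (total accuracy, accuracy on inactives, accuracy on actives)."""
    pairs = list(zip(true_active, predicted_active))
    total = sum(t == p for t, p in pairs) / len(pairs)
    on_actives = [p for t, p in pairs if t]        # predictions for truly active compounds
    on_inactives = [p for t, p in pairs if not t]  # predictions for truly inactive compounds
    acc_active = sum(on_actives) / len(on_actives) if on_actives else None
    acc_inactive = (
        sum(not p for p in on_inactives) / len(on_inactives) if on_inactives else None
    )
    return total, acc_inactive, acc_active

# Hypothetical labels and predicted probabilities, for illustration only.
truth = [True, True, True, False]
predicted = classify([0.9, 0.8, 0.3, 0.1])
print(accuracies(truth, predicted))
```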

In binary Bayesian inference for each compound, the following steps were applied to predict their probability of being active [

Estimates two distributions: one for the active compounds and one for the inactive ones in the training set. The separation of active and inactive sets is manually defined by a Binary Threshold.

Counts the frequency of occurrence of a particular descriptor value in active and inactive cases.

Accumulates a histogram of the observed sample values over the classes. The distribution is convolved with a Gaussian (σ = 0.25, the smoothing width) to avoid sensitivity to bin boundaries.

A histogram of property distributions is derived for each descriptor for the “active” and “inactive” (yes/no) sets. Descriptors which differentiate the two sets will have a high impact in the model; those which do not will drop out.
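The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the QuaSAR-Model implementation: the Gaussian-convolved histogram is approximated by placing a Gaussian of width σ = 0.25 on every observed value, descriptors are treated as independent (a naive Bayes assumption), class priors are taken from the class frequencies in the training set, and descriptor values are assumed to be scaled so that σ = 0.25 is meaningful. The function names are hypothetical.

```python
import math

def smoothed_density(samples, x, sigma=0.25):
    """Gaussian-smoothed estimate of the density of one descriptor at value x."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-0.5 * ((x - s) / sigma) ** 2) for s in samples) / len(samples)

def prob_active(train_X, train_active, x, sigma=0.25, floor=1e-12):
    """Posterior probability that descriptor vector x belongs to the active class.

    Requires at least one active and one inactive compound in the training set.
    """
    actives = [row for row, a in zip(train_X, train_active) if a]
    inactives = [row for row, a in zip(train_X, train_active) if not a]
    # Start from the log priors (assumed to be the class frequencies).
    log_a = math.log(len(actives) / len(train_X))
    log_i = math.log(len(inactives) / len(train_X))
    for j, value in enumerate(x):
        # One smoothed distribution per descriptor and per class.
        log_a += math.log(max(smoothed_density([r[j] for r in actives], value, sigma), floor))
        log_i += math.log(max(smoothed_density([r[j] for r in inactives], value, sigma), floor))
    top = max(log_a, log_i)  # subtract the maximum for numerical stability
    pa, pi = math.exp(log_a - top), math.exp(log_i - top)
    return pa / (pa + pi)

# Hypothetical one-descriptor training set, for illustration only.
X = [[0.0], [0.1], [1.0], [1.1]]
labels = [True, True, False, False]
print(prob_active(X, labels, [0.05]), prob_active(X, labels, [1.05]))
```

A descriptor whose smoothed distributions are nearly identical for the two classes contributes almost equal terms to both sums and therefore has little influence on the posterior, which mirrors the statement that non-discriminating descriptors drop out.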

In addition to binary Bayesian classification, SVM classification was also applied to our QSAR data. SVM works by mapping the training data into a feature space with the aid of a so-called kernel function and then separating the data using a large-margin hyperplane. Intuitively, the kernel computes a similarity between two given examples. The most commonly used kernel functions are radial basis function (RBF) kernels, which were used in our experiments. SVM classifiers are generated by a two-step procedure. First, the sample data vectors are mapped (“projected”) to a very high-dimensional space, whose dimension is significantly larger than that of the original data space. Then, the algorithm finds the hyperplane in this space with the largest margin separating the classes of data. It has been shown that classification accuracy usually depends only weakly on the specific projection, provided that the target space is sufficiently high-dimensional. Sometimes it is not possible to find a separating hyperplane even in a very high-dimensional space. In this case a tradeoff is introduced between the size of the separating margin and penalties for every vector which is within the margin.
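The two ingredients described above, the RBF kernel as a similarity measure and the tradeoff between margin size and margin violations, can be sketched as follows. This is a minimal illustration, not the SVM package used in the study: the parameter names `gamma` and `C` follow common usage and are assumptions, and the objective is written for a linear decision function for brevity.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def soft_margin_objective(w, b, X, y, C=1.0):
    """Soft-margin objective 0.5 * ||w||^2 + C * sum of hinge penalties.

    X: list of feature vectors; y: class labels coded as +1 or -1.
    The first term rewards a wide margin; the second penalizes every
    vector that lies inside the margin or on the wrong side of it.
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    penalties = sum(
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    )
    return margin_term + C * penalties

# Hypothetical values, for illustration only.
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
print(soft_margin_objective([1.0], 0.0, [[2.0], [-0.5]], [1, -1], C=1.0))
```

A larger `C` penalizes margin violations more heavily and so favors a narrower margin with fewer training errors, while a smaller `C` tolerates more violations in exchange for a wider margin.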

SAReport [

Briefly, the Suggestions table in

In this study, different descriptors, different data division approaches and different statistical methods were used to build QSAR models for inhibitors of the Hedgehog signaling pathway, based on 93 cyclopamine derivatives and their activities against four different cell lines. Our investigation has shown that NCI-H446 may serve as the best cell line for testing the activities of inhibitors of the Hedgehog signaling pathway. Because of the lower quality of the data, the binary classification method is a much better choice than regression for building QSAR models. Furthermore, for synthetic and medical scientists, our results indicate that demethylation, methylation and hydroxylation at a specific position may greatly improve the activity of inhibitors of the Hedgehog signaling pathway. Demethylation is also found to be a better choice than methylation or hydroxylation for compound modification, and such work is currently being actively pursued in our laboratory.

correlation coefficient in self fitting of training data set

correlation coefficient in cross validation fitting of training data set

correlation coefficient in fitting of test data set

percentage accuracy of binary model = Total accuracy

percentage accuracy of inactive subset

percentage accuracy of active subset

A in self fitting of training data set

A in cross validation fitting of training data set

A in fitting of test data set

DLI: Drug-like Index

PLS: Partial Least Squares

SVR: Support Vector Regression

SVM: Support Vector Machine

ANN: Artificial Neural Networks

SAReport: Structure-Activity Report

We would like to thank Baowei Zhao at GSK for his proofreading and valuable suggestions. This work was supported in part by grants from the Ministry of Science and Technology of China (2009ZX10004-601), the National Natural Science Foundation of China (30976611), and the Research Fund for the Doctoral Program of Higher Education of China (20100072110008, 20100072120050).

Four scaffolds found in our experimental data.

Six molecules that did not match any of the scaffolds, as mentioned above.

General computational workflow used in our study.

QSAR results derived from the data divided by Diverse Subset (σ indicates difference).

| | | σ | | | σ | | | σ | | | σ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.552 | 0.494 | −0.058 | 0.659 | 0.526 | −0.133 | 0.644 | 0.585 | −0.059 | 0.527 | 0.531 | 0.004 |
| 0.000 | 0.035 | 0.035 | 0.001 | 0.026 | 0.025 | 0.021 | 0.158 | 0.137 | 0.038 | 0.106 | 0.068 |
| 0.102 | 0.307 | 0.205 | 0.218 | 0.025 | −0.193 | 0.084 | 0.193 | 0.109 | 0.019 | 0.118 | 0.099 |
| 0.994 | 0.686 | −0.308 | 0.966 | 0.763 | −0.203 | 0.993 | 0.808 | −0.185 | 0.988 | 0.705 | −0.283 |
| 0.994 | 0.000 | −0.994 | 0.962 | 0.002 | −0.96 | 0.992 | 0.069 | −0.923 | 0.987 | 0.001 | −0.986 |
| 0.000 | 0.396 | 0.396 | 0.088 | 0.110 | 0.022 | 0.025 | 0.258 | 0.233 | 0.023 | 0.077 | 0.054 |
| 0.883 | 0.917 | 0.034 | 1.000 | 0.967 | −0.033 | 0.900 | 0.933 | 0.033 | 0.967 | 0.933 | −0.034 |
| 0.783 | 0.817 | 0.034 | 0.917 | 0.917 | 0 | 0.883 | 0.783 | −0.1 | 0.867 | 0.867 | 0 |
| 0.606 | 0.576 | −0.03 | 0.758 | | 0.121 | 0.576 | 0.667 | 0.091 | 0.485 | 0.636 | 0.151 |
| 1.000 | 1.000 | 0 | 1.000 | 1.000 | 0 | 1.000 | 1.000 | 0 | 1.000 | 1.000 | 0 |
| 0.550 | 0.500 | −0.05 | 0.867 | 0.817 | −0.05 | 0.650 | 0.533 | −0.117 | 0.633 | 0.617 | −0.016 |
| 0.455 | 0.636 | 0.181 | 0.788 | | 0.091 | 0.545 | 0.758 | 0.213 | 0.697 | 0.636 | −0.061 |

QSAR results derived from the data divided by

| | | σ | | | σ | | | σ | | | σ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.506 | 0.474 | −0.032 | 0.593 | 0.396 | −0.197 | 0.542 | 0.493 | −0.049 | 0.587 | 0.542 | −0.045 |
| 0.011 | 0.007 | −0.004 | 0.015 | 0.019 | 0.004 | 0.005 | 0.002 | −0.003 | 0.006 | 0.040 | 0.034 |
| 0.178 | 0.215 | 0.037 | 0.055 | 0.201 | 0.146 | 0.000 | 0.222 | 0.222 | 0.087 | 0.056 | −0.031 |
| 0.997 | 0.716 | −0.281 | 0.965 | 0.756 | −0.209 | 0.993 | 0.839 | −0.154 | 0.987 | 0.655 | −0.332 |
| 0.997 | 0.021 | −0.976 | 0.962 | 0.025 | −0.937 | 0.993 | 0.124 | −0.869 | 0.986 | 0.019 | −0.967 |
| 0.008 | 0.139 | 0.131 | 0.029 | 0.001 | −0.028 | 0.040 | 0.075 | 0.035 | 0.019 | 0.087 | 0.068 |
| 0.967 | 0.885 | −0.082 | 0.951 | 0.934 | −0.017 | 0.934 | 0.918 | −0.016 | 0.984 | 0.885 | −0.099 |
| 0.852 | 0.803 | −0.049 | 0.934 | 0.918 | −0.016 | 0.852 | 0.836 | −0.016 | 0.820 | 0.820 | 0 |
| 0.656 | 0.625 | −0.031 | 0.625 | | 0.281 | 0.625 | 0.656 | 0.031 | 0.625 | 0.625 | 0 |
| 1.000 | 0.984 | −0.016 | 1.000 | 1.000 | 0 | 1.000 | 1.000 | 0 | 1.000 | 0.984 | −0.016 |
| 0.505 | 0.475 | −0.03 | 0.803 | 0.852 | 0.049 | 0.590 | 0.623 | 0.033 | 0.656 | 0.623 | −0.033 |
| 0.656 | 0.719 | 0.063 | 0.875 | | 0 | 0.625 | 0.719 | 0.094 | 0.688 | 0.719 | 0.031 |