Toward the Prediction of FBPase Inhibitory Activity Using Chemoinformatic Methods

Currently, Chemoinformatic methods are used to perform the prediction for FBPase inhibitory activity. A genetic algorithm-random forest coupled method (GA-RF) was proposed to predict fructose 1,6-bisphosphatase (FBPase) inhibitors to treat type 2 diabetes mellitus using the Mold2 molecular descriptors. A data set of 126 oxazole and thiazole analogs was used to derive the GA-RF model, yielding the significant non-cross-validated correlation coefficient r2ncv and cross-validated r2cv values of 0.96 and 0.67 for the training set, respectively. The statistically significant model was validated by a test set of 64 compounds, producing the prediction correlation coefficient r2pred of 0.90. More importantly, the building GA-RF model also passed through various criteria suggested by Tropsha and Roy with r2o and r2m values of 0.90 and 0.83, respectively. In order to compare with the GA-RF model, a pure RF model developed based on the full descriptors was performed as well for the same data set. The resulting GA-RF model with significantly internal and external prediction capacities is beneficial to the prediction of potential oxazole and thiazole series of FBPase inhibitors prior to chemical synthesis in drug discovery programs.


Introduction
Diabetes is one of the most prevalent diseases worldwide, and the incidence of this continues to grow, which causes a global public health burden. It is estimated that by 2025, India, China and the United States will possess the largest number of people with diabetes [1]. The more prevalent form, type 2 diabetes, accounts for more than 90% of cases. The pathogenesis of type 2 diabetes is complex, involving progressive development of insulin resistance and a relative deficiency in insulin secretion, which may lead to overt hyperglycemia [2]. Type 2 diabetes is associated with the metabolic syndrome that comprises a set of alterations that include glucose intolerance, truncal obesity, hypertension and dyslipidemia [3]. It has been reported that the metabolic syndrome is associated with a markedly increased risk of coronary artery disease [4]. The risk of myocardial infarction in patients with diabetes and no history of cardiac disease roughly equates the risk in non-diabetic patients with known cardiac diseases [5]. In addition, diabetes often conspires to cause various complications such as eye diseases. It has been documented that diabetes is the primary reason for loss of vision [6]. Patients with diabetes suffer from great mental and physical pain. Consequently, it is necessary to develop an effective drug for the treatment of this disease. Since hyperglycemia leads to severe microvascular and macrovascular complications, the primary treatment goal is to reduce the glucose level. Several popular classes of oral diabetes therapies on the market include sulfonylureas, peroxisome proliferator-activated receptor-γ (PPAR-γ) agonists, metformin and so forth [7,8], which lower glucose by increasing glucose metabolism either via enhanced insulin secretion or improved insulin sensitivity. However, therapies with these drugs still have several drawbacks. For example, sulfonylurea therapy is usually associated with weight gain [9] and metformin therapy is forbidden in patients with renal and hepatic diseases, respiratory insufficiency and alcohol abuse [10]. As a consequence, there is a need for novel, more effective drugs with more safe profiles and fewer side effects to the treatment of diabetes.
Fructose 1,6-bisphosphatase (FBPase), a highly regulated enzyme that catalyzes the second to last step in gluconeogenesis, draws attention as a potential therapeutic target to treat type 2 diabetes mellitus [10,11]. FBPase enables the inhibition of gluconeogenesis from all the corresponding substrates while avoiding direct effects on glycogenolysis, glycolysis and the tricarboxylic acid cycle. In addition, evidence from clinical research suggests that FBPase inhibitors may show an adequate safety margin [11]. Several classes of agents against FBPase have recently been reported including anilinoquinazolines [12], benzoxazole benzenesulfonamides [13], MDL-29951 [14], adenosine 5'-monophosphate (AMP) mimics, etc. Some of these series of drugs, however, were explored without successfully achieving acceptable oral bioavailability, thus, it is still required to research novel drugs with desirable characteristics.
Chemoinformatic methods, as a complementary approach of experiment technology, have found wide utility and acceptance, and these methods are playing a central role in drug design [15]. In view of this, previous reports [16] have performed a computational study based on a series of FBPase inhibitors. However, these methods were developed and tested on just a few compounds. In the current work, a larger dataset was used to derive a statistical model with a high prediction power.
Random forest (RF), a new regression tool, has been reported to be a combination of relatively high prediction accuracy and a collection of desired features that makes RF uniquely suited for modeling in chemoinformatics [17] based on a quantitative description of the compound's molecular structure. RF can show an excellent performance even when most predictive variables are noise, it can be used when the number of variables is much larger than the number of observations, and it returns measures of variable importance. It is well known that an ideal regression model should have a high performance with few descriptors. Thus in the present work, to optimize the Mold 2 molecular descriptor subset [18], with the statistical performance and efficiency of the model being simultaneously enhanced, the genetic algorithm-random forest coupled method is selected to perform a regression task to investigate whether the proposed GA-RF method can construct an ideal prediction model for the FBPase inhibitors. In addition, the derived optimal model was checked by Y-randomization to ensure that the prediction model was not obtained by chance correlation. Moreover, the rigorous statistical criteria suggested by Thropsha and Roy [19,20] were used to validate the model as well. For comparison with the GA-RF, the pure RF model, which means that the model was developed in the full descriptors without variable selection, was also applied using the same dataset.

Descriptor Calculation and Preprocessing
For the quantitative structure-activity relationship (QSAR) investigations, one of the important factors affecting the quality of the model is the choice of molecular descriptors used to obtain the structural information suitable for model development. The software Mold 2 [18] enables a rapid calculation of a large and diverse set of descriptors encoding two-dimensional chemical structure information. A comparative analysis of Mold 2 descriptors with those calculated by commercial tools such as Cerius 2 [21], Dragon [22] on several data sets demonstrated that Mold 2 descriptors can convey a similar amount of information as those widely-used software packages [18]. Although acting as a freely available tool, Mold 2 has been proved suitable not only for the QSAR research [23][24][25], but also for virtual screening of large databases in drug development [18]. In the present work, a total of 777 Mold 2 descriptors were calculated based on the SDF file format of all the studied FBPase inhibitors. All these descriptors were then preprocessed as follows: (1) descriptors containing values greater than 85% zero were removed; (2) zero-and near zero-variance predictors were removed, as descriptors like this may cause the model to crash or the fit to be unstable; and (3) one of the two descriptors that had absolute correlations above 0.75 was omitted. After these steps, the number of the descriptors was reduced to 108 for further research.

Split of the Training and Test Sets
Rational selection of training and test sets is also one of the important and challenging steps for the development of validated QSAR models. Self-organizing maps (SOM) can be employed for data survey, which has been successfully applied to split datasets [26,27]. In the current study, in order to probe the descriptor space, a total of 108 Mold 2 descriptors were used to obtain a SOM map. A small Kohonen network with 6 × 6 = 36 neurons was employed, producing a map with 36 positions. All the compounds were placed onto the 36 positions of the Kohonen map. Figure 1 demonstrates the distribution of the compounds, where the black dot denotes the training set, while the red asterisk stands for the compounds from the test set. As can be seen from this figure, firstly, the representative points of the test set are close to those of the training set, and secondly, the training and test sets uniformly fill the whole chemical space, indicating a rational selection of the training and test inhibitors in the present work [28]. The training set was applied for the development of the model and the external test set was used for the assessment of the built model. The training set and test sets include 126 and 64 compounds, respectively.

Set Parameters of GA-RF Algorithm
In the present work, all the genetic parameters are set as follows: The number of maximum generations is set to 200 and the number of individuals to 50. Individuals are then selected from the population using the stochastic universal sampling algorithm, with a generation gap of 0.9. The double-point crossover is adopted with the probabilities of 0.7. The mutation operation is performed based on the default value included in the genetic algorithm toolbox developed by the Evolutionary Computation Research Team at The University of Sheffield, UK. Herein, the minimum out-of-bag (OOB) mean squared error (MSE) is used as the fitness function to obtain the optimal individual.

Statistical Results
Apart from the quality of the used data sets, the selection of proper descriptors relevant to the FBPase inhibitory activity is crucial for optimizing the prediction system by reducing the noise in a statistical learning process. After GA-RF, the final 40 Mold 2 descriptors are selected. Table 1 illustrates the names of these selected descriptors and the corresponding definitions [18]. Based on the determined optimal parameters by GA, the GA-RF model presents an (root-mean-square error) RMSE of 0.25 and 0.34 for the training and test sets, respectively. The determined coefficient r 2 ncv reaches a value as high as 0.96 with r 2 cv = 0.67 for the training set. The model predictability is evaluated by an external prediction set, which illustrates r 2 ts and r 2 pred values of 0.91 and 0.90, respectively. It is well known that the random forest algorithm can manipulate the data set even with a large number of descriptors [17]. Thus we compare the GA-RF with pure RF, which means that the latter model is built based on the whole 108 descriptors. It can be seen from Table 2 that the pure RF performs comparably but relatively low statistics compared with GA-RF. The scatter plots of the experimental versus predicted FBPase inhibitory activity based on the GA-RF and RF models are shown in Figure 2, where the proposed GA-RF model presents a relatively better performance than RF. For the former, the data points of training and test sets distribute more closely in a straight line (y = x), indicating that GA-RF exhibits both the inner and external prediction power. Table 3 gives the experimental and predicted results for both models.

Further Test for the External Prediction Power
To validate firmly the performances of the prediction, the squared correlation coefficient values between the observed and predicted values for the test set compounds with intercept (r 2 ts ) and without intercept (r 2 o ) are also calculated. Table 4 presents the values of the parameters for all models in the present work. According to references [19,34], models are considered acceptable if they satisfy all the following conditions: (1) r 2 pred > 0.5, (2) r 2 ts > 0.6, and (3) r 2 o is close to r 2 ts , such that the [(r 2 ts − r 2 o )/r 2 ts ] < 0.1 and 0.85 ≤ k ≤ 1.15. When the predicted values of the test set compounds (x axis) are plotted against the observed values of the compounds (y axis) with the intercept set to zero, the slope of the fitted line gives the value of k, with the corresponding correlation coefficient r 2 o . Herein, r 2 o is a quantity characterizing linear regression with the y-intercept set to zero, which can be illustrated by y = kx; while r 2 ts is the conventional coefficient of determination for the best fit linear regression (i.e., denoted by y = ax + b) in the test set [19,28]. It can be noticed that the developed GA-RF and pure RF models fully satisfy all the requirements, but the latter is relatively less accurate than GA-RF. (1) In case of good external prediction capacity, predicted values will be very close to the observed ones and thus the r 2 ts will be very near to the r 2 o . In the best case r 2 m may be equal to r 2 ts , whereas in the worst case the r 2 m value could be zero. Herein, the built GA-RF model achieves a better r 2 m value of 0.83 than RF (0.76), which illustrates that the current model possesses a highly predictive power.

Investigation of Parameter Turning on the GA-RF Model
As seen from Tables 2 and 4, since the proposed GA-RF model illustrates relatively better statistical results than the pure RF model, the following analysis is only restrict to this model. Generally, RF has effectively only one tuning parameter, m try , which is the number of the descriptors randomly sampled as candidates for splitting at each node during the tree induction. It ranges from 1 to p, the total number of descriptors available, in which p is equal to 40. Although it has been reported [17] that RF still performs well using the default m try value (p/3), one still expects to investigate the effect of parameter turning. Herein, 50 replications of OOB estimation (r 2 oob ) based on the FBPase inhibitors are performed, with the purpose to assess the correlation between the actual and predicted data with a range of m try values, including the default value, which is equal to 13 for the current study. Figure 3 shows the boxplot of these correlations. This plot suggests that m try is optimal when near five with a median value of 0.701, while the default m try gives a median value of 0.696. Both results are comparable. It is also observed that the worst statistical results are derived from m try = 1 and = 40. The observation is in agreement with the previous report [17]. From this Figure, one can notice that it is necessary to perform a moderate parameter tuning to get the optimal one, although at most times, RF can give the optimal model by using default parameters. Besides the m try , the number of trees also has an effect on the RF performance [17]. One principle of building RF model is to ensure that there are sufficient trees in the forest in order to get enough training of each sample. To illustrate this, the performances of OOB set, test set and training set are compared with the increase of the number of trees. Figure 4 shows that similar tendency exists for the tracks of the OOB mean squared errors, the test set and the training set ones, once there are a sufficient number of trees. In the present work, 100 trees are enough to build RF model. The information obtained from this Figure is that mean squared errors of the test and OOB do not increase after the mean squared errors of training set reach the minimum; instead, they converge to their asymptotic values which are also close to their minimum. In this sense, it can be concluded that RF does not overfit, which has been supported by the previous reports [17,23].

Y-Randomization Check
Presently, the Y-randomization check [34] is implemented for further assurance of the robustness of the optimal GA-RF model. The dependent variable is randomly shuffled and a new QSAR model is developed using the original independent variable matrix. The new QSAR models (after several repetitions) are expected to possess low r 2 ncv , r 2 cv , r 2 ts , r 2 pred , r 2 m and high RMSE for the training and test sets, respectively. If the opposite happens, then an acceptable QSAR model cannot be obtained for the specific modeling method and data. In the current work, the Y-randomization check is repeated 500 times and the resulting statistics are compared with the prediction statistics without such checks, with the average values reported in Table 5. As shown in this Table, the correlation coefficients have a significant decline while the RMSE values sharply increase, which indicates that the proposed GA-RF model possesses a real prediction power, and the result is not due to a chance correlation.

Explanation of the Selected Descriptors
The ideal QSAR model would be robust, sparse, predictive, and interpretable. In many cases such an ideal is not achievable with current descriptors and the corresponding variable mapping methods, although much effort is being expended in approaching this ideal. In most QSAR researches, a full direct explanation for all the descriptors is difficult, where most similar reported works all give few detailed analyses of the descriptors involved in their model development, thus only a few descriptors in this work are explained. Generally speaking, QSAR can be classified into two categories: Interpretative QSAR and predictive QSAR. For the current work, it belongs to the latter. In spite of this fact, we still attempt to offer some rational explanations for the major descriptors using RF built-in variable importance measure technology. Figure 5 depicts the variable importance plot of the GA-RF model. Herein, there are two parameters that give the definitions of the variable importance measures: (1) Mean Decrease Accuracy (%IncMSE) and Mean Decrease Gini (IncNodePurity). The higher values of these two parameters represent the higher variable importance. For more details about these parameters, please refer to the corresponding literature [36,37]. In Figure 5, it can be seen that the first two most important descriptors are D561 and D562 surrounded by the red frames. D561 refers to the lowest eigenvalue from Burden matrix weighted by polarizabilities order-6, while D562 stands for the lowest eigenvalue from Burden matrix weighted by polarizabilities order-7. Both descriptors illustrate that molecular polarizabilities play a central role in FBPase inhibitory activity, which can be supported by the experimental results [29][30][31][32][33]. Previous literature has illustrated that the phosphate group forms a constellation of hydrogen bond interactions that are essential for binding affinity [32]. In the present work, it can be observed that all the studied compounds possess the phosphate group and most of them have oxazole and thiazole groups as well as -NH 2 substituents, which increase the molecular polarity. In addition, previous reports have also depicted the similar conclusion that the polar groups are favored to increase the FBPase inhibitory activity [16,38]. As suggested in the literature that the FBPase binding pocket presents the hydrophilic nature [11] and the polar groups of inhibitors will bind to the site, leading to potent inhibition of the enzyme. The information obtained by the current work, to some degree, provides an insight into the structural features of both oxazole and thiazole FBPase inhibitors from a theoretical point of view, which should be helpful to design new FBPase inhibitors of this series for the treatment of type 2 diabetes. It should be pointed out that herein we just present a representative explanation of the selected descriptors. However, in terms of developing a highly predictive model, the proposed GA-RF model in this work could implement this task.

Dataset
A large, diverse dataset of 190 FBPase inhibitors were collected from the literature [29][30][31][32][33] published by Dang and co-workers after removing duplicated and undesirable compounds. Here the converted molar pIC 50 (−logIC 50 ) values, ranging from 3.60 to 8.00 M, were used as the dependent variables in the QSAR regression analysis to improve the normal distribution of the experimental data points. The whole data set was divided into training (126) and test (64 molecules) sets, respectively. All structures and the corresponding activity values of the dataset as well as their belongings to the training and test sets are listed in Table 3.

Descriptor Calculation
A rational design of novel lead drug is getting more and more popular [39]. QSAR, one of the most frequent drug design methods, can build a bridge between molecular descriptors and statistical methods to predict the new compounds. In the present work, all two-dimensional structures of the dataset were built with the ISIS/Draw 2.3 program, and converted SDF format by the Open Babel software package [40]. The final structures were transferred into Mold 2 [18], a free program available to the public, to calculate molecular descriptors. The Mold 2 software package can calculate 777 molecular descriptors solely from 2D chemical structures. Hong et al. have reported that the models generated using Mold 2 descriptors were comparable to those obtained using descriptors from the commercial software packages. In our work, all original 777 molecular descriptors were calculated.

Computational Methods
GA: Genetic algorithm is derived from Darwin's theory of natural selection and evolution. Based on the Darwinian principle of survival of the fittest, GA obtains the optimal solution after a series of iterative computations including selection or reproduction, crossover or recombination, and mutation. Due to its highly efficient optimization algorithm, GA has already been successfully applied in many QSAR analyses [41][42][43][44] to perform variable selection. In the present work, the binary coding form of each chromosome was adopted with 1 and 0 representing selected and non-selected descriptors, respectively. The double point crossover was employed in our study [45][46][47], which allowed new solution regions in the search space to be explored in order to get the global optimum. In binary code genes, the code may be changed from 0 to 1 or vice versa through mutation operation. As a last step, the old population was replaced by children, and a new generation was produced. The evolutionary process operated for many generations until the termination condition was satisfied [47]. For the detailed methodology about GA, please refer to the corresponding literature [48,49].
RF: Random forest is an ensemble of single decision trees, producing a corresponding number of outputs, and the outputs of all trees are aggregated to obtain one final prediction. The training algorithm of RF for regression can be briefly summarized as follows [17,25]: (1) draw N bootstrap samples from the original training set; (2) construct an un-pruned tree T p (p = 1, …, N) with each training set B p . At each node, rather than choosing the best split among all predictors, randomly sample m try of the predictors and then choose the best split from among those variables. The tree is grown to the maximum size and not pruned back; (3) predict the N trees by average for regression. The tree growing algorithm used in RF is CART which is efficient especially when the number of descriptors is very large, with the reason being that RF only tests the m try of the descriptors rather than the whole one, where the default m try is one-third of the number of descriptors for regression. Since the number of m try is very small, the search can finish quickly.
RF possesses its own reliable statistical characteristics based on OOB set prediction, which could be used for validation and model selection with no cross-validation performed. It was shown that the prediction accuracy of an OOB set and a five-fold cross validation procedure was nearly the same [17]. Although RF performs relatively well "off the shelf" without expending much effort on the parameter tuning or variable selection [17], it is also of importance for carrying out some tentative investigations on the changes of m try or descriptor selection to optimize the performance of RF. Herein, we just present a brief introduction of RF, for more details please see the corresponding literature [17,50]. It has been reported that RF can show excellent performance even when most predictive variables are noise, and it can be used when the number of variables is much larger than the number of observations, and it returns measures of variable importance [17,50]. However, to obtain an ideal regression model, a variable selection process is still required. To achieve the above objective, in this work, the GA variable selection method using OOB MSE as the fitness function was carried out to achieve the regression task for the current FBPase inhibitors in order to yield a high prediction model.

Statistical Validation
In the current study, the selected descriptors served as independent variables and the pIC 50 values as dependent variables in the RF regression analysis. The inner predictive values of the models were evaluated first by a cross-validation process [51,52]. The cross-validated coefficient, r 2 cv , was calculated using Equation (2) is the predictive residual sum of squares (PRESS). The optimal number of components obtained from the cross-validation was used to derive the final QSAR model. Then, a non-cross-validation analysis was carried out; and the Pearson coefficient (r 2 ncv ) and RMSE were calculated.
where n denotes the number of the studied compounds.
It has been reported [19] that although the low value of r 2 cv for the training set can exhibit a low predictive ability of a model, the opposite is not necessarily true. That is, a high r 2 cv is necessary, but not sufficient, for a model with a high predictive power. Therefore, the external validation must be estimated to establish a reliable and predictive QSAR model. The predictive coefficient r 2 pred listed in the following equation was used to check the models. In addition, various criteria suggested by Tropsha and Roy [19,20] were also performed to validate the predictive power of the current built models.
where SD is the sum of the squared deviations between the actual activity of the compounds in the test set and the mean activity in the training set, and "PRESS" is the sum of the squared deviations between predicted and observed activity for each compound in the test set.

Conclusions
In the present work, a GA-RF algorithm is successfully proposed as an efficient chemoinformatic method to predict FBPase inhibitory activity. The GA-RF model went through all rigorous examinations suggested by Tropsha and Roy with r 2 pred of 0.90 and r 2 m of 0.83, exhibiting its feasibility and reliability to derive a highly predictive model for FBPase inhibitors. In addition, results from a Y-randomization check illustrate that the GA-RF model possesses real prediction power not due to chance correlation. Explanation of the selected descriptors by GA-RF suggests that the polar factors play a central role in the FBPase inhibition. Thus, the proposed model is useful for predictive tasks to screen for new and potent oxazole and thiazole series of FBPase inhibitors in early drug development.