Improvement of the Prediction Power of the CoMFA and CoMSIA Models on Histamine H3 Antagonists by Different Variable Selection Methods

The aim of this study is to enhance the predictivity power of CoMFA and CoMSIA models by means of different variable selection algorithms. The genetic algorithm (GA), successive projection algorithm (SPA), stepwise multiple linear regression (SW-MLR), and the enhanced replacement method (ERM) were used and tested as variable selection algorithms. Then, the selected variables were used to generate a simple and predictive model by the multilinear regression algorithm. A set of 74 histamine H3 antagonists were split into 40 compounds as a training set, and 17 compounds as a test set, by the Kennard-Stone algorithm. Before splitting the data, 17 compounds were randomly selected from the pool of the whole data set as an evaluation set without any supervision, pretreatment, or visual inspection. Among applied variable selection algorithms, ERM had noticeable improvement on the statistical parameters. The r2 values of training, test, and evaluation sets for the ERM-MLR model using CoMFA fields were 0.9560, 0.8630, and 0.8460 and using the CoMSIA fields were 0.9800, 0.8521, and 0.9080, respectively. In this study, the principles of organization for economic cooperation and development (OECD) for regulatory acceptability of QSARs are considered.


Introduction
One of the most frequently used QSAR techniques is the comparative molecular field analysis (CoMFA) [1][2][3][4][5]. The CoMFA method was developed to take into account the effect of steric and electrostatic interactions, which are involved in blocking a molecule from its receptor. In CoMFA, each molecule is located within grid-spacing through a gridbox dimension, and a probe calculates the energy fields between it and other aligned molecules. In this method, we assume that the whole molecule interacts with the receptor in all directions and the energy fields are then calculated for all of the grids. As a result, thousands of interactions participate in the model. These variables consist of two types: some of them have a correlation with biological activity and the others are noisy variables, which are poorly informative and irrelevant to the biological activities [5]. However, we know from the results of X-ray crystallography of a protein-ligand complex that only some parts of the molecule interact with the receptor [6,7].
In the literature, there are some solutions to address this problem. First, series are methods that try to improve the quality of CoMFA models by discriminating between informative and meaningless variables. The genetic algorithm and GOLPE are two variable selection algorithms that have been used previously to extract meaningful variables from the large pool of calculated interactions [8,9]. It is also possible to select a cluster of variables, rather than a single variable, by a smart region definition (SRD) procedure, which is as advanced as the GOLPE algorithm [10]. The prediction-weighted partial least-squares regression algorithm (PWPLS) selects predictor variables and weight them to create a model that is more robust than the CoMFA model [11]. CoMFA region focusing (CoMFA-RF) is another similar attempt to weight the lattice points in a CoMFA region to enhance or attenuate the contribution of these points to the PLS model [12]. In contrast to the first series, there are some methods such as Compass [13], SURFCOMP [14], or CoMSA [15] AFMoC [16] that try to generate variables that are more effective and reduce non-predictive variables. One of the differences between CoMFA and these methods is that they try to sample CoMFA-like fields on the molecular surface or near such a surface. Therefore, the amount of noisy variables decreases. In addition, there are some methods which use receptor information to avoid generation of non-informative variables.
CoMSIA (comparative molecular similarity indices analysis), is developed based on similarity indices. Unlike CoMFA, CoMSIA applies a Gaussian-type distance-dependent function to calculate steric, electrostatic, hydrophobic, and hydrogen bonding donor and acceptor fields [17,18]. Like CoMFA, CoMSIA uses an atomic probe at regularly spaced grid points around the aligned molecules. Then, the probe experiences a large number of noisy and parametric interactions. On the other hand, it has been proven that variable selection and outlier detection are related. Then the molecules that are chosen as outliers by a set of descriptors may be within the model when described by a different set of descriptors, and also the regression model will be distorted toward the outliers. In addition, as the number of descriptors increases, the risk of chance correlation may increase 19,20]. An intelligence variable selection with true judgment between informative and noisy variables could generate an ideal model, which is predictive, robust, and has no molecule labeled as an outlier with it. In this study, GA, SPA, SW-MLR, and ERM were applied on the CoMFA and CoMSIA fields. Then the selected variables were modeled by the MLR algorithm to generate a simple and predictive model. The performance of the different CoMFA and CoMSIA models were evaluated by modeling a data set of histamine H 3 antagonists.
Histamine is a biogenic amine neurotransmitter, which interacts with four types of G protein-coupled receptors (GPCR)s i.e. H 1 , H 2 , H 3 , and H 4 [21]. The GPCRs contain three common parts: seven α-helices that span the cell membrane, an extracellular N-terminus part and a cytoplasmic C-terminus part with variable length. The third and fifth transmembrane (TM) regions of receptors are involved in ligand-drug interactions, while the third intracellular loop is responsible for a signaling pathway connection [22,23]. The Histamine H 3 receptor (HH3R) was initially identified on a pharmacological basis by Arrang et al in 1983 [24]. In 1999, Lovenberg et al cloned this receptor (GPCR97, Uniport ID: Q9Y5N1). GPCR97 has (31%) homology with the a 2 -adrenergic and muscarinic M 1 receptors, whereas 22% and 21.4% are homologous with the H 1 and H 2 receptors, respectively. The sequence of GPCR97 has a 445-amino acid coding region with a notable aspartic acid residue in transmembrane region 3, which is a putative binding site for the interaction of receptors with primary amines [25].
The new generations of HH3R antagonists are non-imidazole based. They contain at least one basic amine, either a piperidine or pyrolidine, which is connected by an alkyl linkage to an aromatic ring. However, antagonists with a second basic site show significantly better activity, such as the ligands in this study [26]. The interaction of the negatively charged carboxylic group of Asp114 on the third helix of the HHR3 and a protonated amine group of an antagonist, is the common point in all of the docking results of HHR3 antagonists by different homology modelling [27][28][29][30][31].
HHR3 antagonists act on both the histaminergic and non-histaminergic neurons. On the histaminergic neurons, they regulate the release of histamine and its synthesis and on the non-histaminergic neurons, they presynaptically inhibit the release of a number of other neurotransmitters such as dopamine [32], GABA [33], acetylcholine [34], noradrenaline [35], and serotonin [36]. The H 3 receptor antagonists are involved in cognition, sensory gating, food intake, sleep, the waking state, and pain perception. Thus, this could be a potential target for the treatment of numerous diseases, disorders affecting cognition (e.g., attention deficit and hyperactivity disorder [ADHD], Alzheimer's disease, and schizophrenia), sleep (e.g., hypersomnia and narcolepsy), and energy homeostasis (e.g., obesity), myocardial ischaemia, migraine, and inflammatory diseases [37][38][39][40].

Comparison and validation of the models (Goodness-of-fit, robustness, predictivity)
The CoMFA model which was built by PLS in SYBYL showed very poor statistical parameters (e.g. q 2~0 .1). In addition, its results were sensitive to the orientation and placement of the compounds in the box. Therefore, the all-orientation search (AOS) and the all-placement search (APS) strategies [41] applied on the aligned compounds to improve the q 2 value. Using the AOS algorithm, all of the possible samplings of the molecular field are tested by systematically rotating and translating the molecular aggregate within the grid, and subsequently the one with the highest q 2 value can be picked out. The AOS algorithm was run in 30, 10, 5, 1, and even 0.1º intervals (Fig. 1), in such a way that the result of each AOS run was fed to the next run. In APS, aligned molecules moved in the box in all three dimensions of space and the best placement was selected according to the highest q 2 . The best APS results did not represent significant changes in the q 2 value by 1.00, 0.50, 0.10, and 0.05 Å movements of the aligned compounds.
One of the most important aspects of a QSAR model is its predictivity. It is so important that the OECD member countries adopted it as a separate and critical principle for an ideal model [42]. Tropsha et al have emphasized that having such a high value for goodness-offit and cross-validated correlation coefficient r 2 (q 2 ) is insufficient for judging about the predictivity power of a model. Although a high q 2 value is vital, it cannot guarantee the predictivity power of a model [43,44]. Therefore, an external test set is necessary. An ideal QSAR model must also have accurate predictivity on the external set [45]. Therefore, we selected 17 of 74 compounds in a fully blind sampling for the independent or evaluation set and the remaining 57 compounds were divided into 40 compounds as the training set, and 17 compounds as the test set by the Kennard-Stone algorithm [46]. The Kennard-Stone algorithm tries to guarantee uniform selection of objects for the training and test sets. The r 2 and q 2 values of the CoMFA model on the AOS-aligned compounds were 0.9780 and 0.6040, respectively ( Table 1). The CoMFA-RF algorithm improved the q 2 value to 0.6530 by weighting CoMFA fields. Although it improved the statistical parameters of the CoMFA model to some extent, satisfactory results were still not obtained. Hence, the raw fields were extracted from SYBYL. The zero columns were removed. Then different variable selection algorithms were applied to the rest of 3331 CoMFA fields to filter out the noisy variables. Variable selectors have more of a tendency to sterically clash with CoMFA fields than electrostatic ones, because of their variance contribution. Then CoMFA standard scaling was applied to the CoMFA fields to avoid swamping the electrostatic fields with steric ones. This is a block-scaling and in the case of CoMFA and CoMSIA fields, this is the best one. The PLS algorithm performs regression on the latent variables which do not have physical meanings, but the MLR algorithm is simpler and more interpretative than the PLS algorithm. However, due to the collinearity between the CoMFA or CoMSIA fields, MLR disables to generate a successful model especially from a huge amount of variables. Then using a variable selector to extract informative variables with multiple linear regression for building a simple and easy to interpret model, will be useful. Among the variable selectors, which were applied on the extracted CoMFA fields, the results of SW-MLR were significantly better than SPA and GA, and the results of the SPA algorithm were better than GA to some extent (Table 1). In spite of improving the predictivity power of the models by these variable selectors, they could not give acceptable predictivity power according to following measures: Pred > 0.6 (r 0 2 −r 2 )/r 2 < 0.1 and 0.85 < k < 1.15 or (r 2 −r' 0 2 )/r 2 < 0.1 and 0.85 < k'< 1.15 The r o 2 and r' o 2 are the correlation coefficients of predicted versus observed activities for regressions through the origin and vice versa. The k and k´ values are their corresponding slopes, respectively [43].
The SW-MLR model does not meet all the above measures because the k´ value for this model is smaller than 0.85. However, the statistical parameters for this model, especially its r 2 value for the evaluation set, are acceptable. This model with 15 variables had an r 2 value of 0.9059, a q 2 value greater than 0.5 (0.6071), and a fair r 2 value of 0.7527 for the test set. Therefore, we considered this model as a predictivity model ( Table 1). The ERM algorithm donates such a priority to the subsequent MLR model, which distinguishes it from the other models. The goodness-of-fit value (0.9560) for this model with 16 variables is as high as this value for traditional CoMFA or CoMFA-RF models, which have the advantage of the PLS algorithm. In addition, the high q 2 values of leave-one-out (LOO) and leave-many-out (LMO) cross validation (10 groups) for this model (i.e. 0.8810 and 0.8700, respectively) emphasize that this model is very close to an ideal predictive 3D QSAR model. The considerable improvement of about 0.4 and 0.5 units, respectively, in the r 2 values of the test and evaluation sets over the traditional CoMFA model were obtained for the ERM-MLR model. In addition, ERM-MLR passes all of the predictivity measures successfully (Table 1). Figure 2 Figure 3 shows that with 16  Thirteen of the sixteen selected fields are steric and the rest of them are electrostatic, i.e. the contribution of steric fields is more than that of electrostatic ones (Figure 4a). In the MLR algorithm, coefficients of the fields and their signs appear in the equation. As a result, their results are easier to interpret than those of the PLS algorithm. However, the nature of the CoMFA fields is energy and they are calculated by the summation of steric and electrostatic interactions over the whole of the compound. Hence, calculation of energy in different grids may result in identical or similar values. In addition, information from atoms and molecular features are convoluted in fields. Therefore, in practice the interpretation and suggestion of functional groups for various positions on a given scaffold or reconstructed molecule from fields is difficult. Figure 4a illustrated the ERM-selected steric and electrostatic CoMFA fields. These points are to a great extent in agreement with CoMFA (not shown here for simplicity) or CoMFA-RF contour maps (Fig.5). By this similarity, we can say that the interpretation of these fields is very similar and/or the same with what we can say for that of CoMFA-RF.

Fig. 3.
Using 16 CoMFA fields result in simultaneously maximization on the ERM-MLR model features (the r 2 values of the training, LOO-CV, test, and evaluation sets) Figure 5 shows the contours of CoMFA-RF for the steric and electrostatic maps. Greater values of bioactivity correlate with more bulk near green; less bulk near yellow; more positive charge near blue and more negative charge near red. The contour map of the steric fields has two separate parts: a green part near the backbone and a yellow part far from it. By replacing each compound with another in the space of contours, we can see that the chain substitute, or five and six-membered monocycle substitutes, usually oriented toward the green contours and most of the fused or bridged bicycle substitutes directed toward the yellow areas. The green contour near the backbone indicates that more bulky groups are favorable. It explains why the activity of compound 22 (pIC 50 =9.25) with two methyl groups is higher than that of 23 (pIC 50 = 8.74) with a bromide branch. The same reason is acceptable for higher activity of 21 (pIC 50 =9.72) compared to 22 (pIC 50 =9.25) or 33 (pIC 50 =8.60) rather than 32 (pIC 50 =8.29), which in these pairs a smaller oxygen atom was replaced by a more bulky sulfur atom. The bi-cyclic fused substitute in 78 (pIC 50 =8.44) is located near the green contour, therefore replacing it with a bulky three-cycle substitute in 79 (pIC 50 =9.00) which has increased the activity. The COOEt group in compound 24 increases the activity (pIC 50 =8.15) but this group decreases the activity in compound 25 (pIC 50 =7.46); this shows that the attachment position of a substitute to the backbone is also important, because it results in a different direction of a substitute toward the yellow or green contours. In compound 24, the bulky COOEt substitute oriented toward the green contours, but in 25 oriented toward the yellow contours. The bulkiness of substitutes along the yellow contours causes unfavorable effects on the pIC 50 values. This is due to the fact that the activity in compound 36 is less than 35 (or 43 < 44, 64 < 63, 69 < 67 < 68). More examples can be found in the data set. It must be noted that, since Figure 5 illustrates the contour maps that were achieved from all of the compounds, then some cases can be found that have incomplete adaptation to the contour maps. The electrostatic contours of compounds also have two parts. The first part consists of blue contours that are enclosed or are near to the quinoline ring and the second part consists of red contour maps far from the backbone (Fig. 5b). The electronegative groups that oriented toward the red region increase the activity. Therefore compound 14, which has a CN group near the red contour, has higher activity (pIC 50 =9.16) than compound 15 (pIC 50 =8.13), in the same way 35 > 34 and 43 > 42. Again, the attachment-position of a substitute to a compound is important because different attachment-positions change the orientation of the attached group toward the red or blue contours. In such a way, compound 31 has a higher pIC 50 (9.08) than compound 32 (8.29), because two nitrogen atoms in compound 31 oriented toward the red region, but the oxygen atom in compound 32 oriented toward the blue contour, which by even rotating around the sp 3 band, its orientation does not differ. Compound 45 has the lowest pIC 50 value in the set, because its three electronegative fluorine atoms and the oxygen atom of the OH branch directed toward the blue contours. In general, bulky and electropositive groups near the backbone, and small and electronegative groups far from the backbone are favorable in increasing bioactivity.   (Table 2). In addition, ERM is a powerful variable selector by participating with the informative variables in the model, which are highly correlated by y, causing all of the molecules to fall in the model space. Hence, the ERM-MLR model does not label any molecule as an outlier, and decreases dispersion in the predicted values ( Fig. 2(c) in comparison with 2(d)). Figure 4 (b) is a visualization of the selected CoMSIA variables. Greater values of bioactivity are correlated with more bulk near green points and less bulk near yellow points. Magenta colored points indicate points where hydrogen-bond acceptor groups increase activity; orange points represent the orientation that inserting hydrogenbond acceptor groups decreases activity. The blue and red points show the locations where electropositive and electronegative groups are favored and unfavored, respectively. The greater contribution of the orange-magenta points (0.41% of total selected fields that selected by ERM) and the green-yellow points (35%) compared to the blue points, show that steric and hydrogen bond acceptor fields are more important in the model than electrostatic fields (18%). Since using 17 CoMSIA fields results in simultaneously increasing the r 2 values of the training set, LOO-CV, test, and evaluation sets, then these number of fields were regarded as the optimum number of variables, which must participate in model building (Fig. 7).

The Applicability Domain (AD)
The domain of applicability is a space that is generated by the descriptors of the training set and corresponding biological values. If the predicted biological activity for a compound falls within this domain, it is not extrapolated by the model and then is reliable [47]. A William plot is a useful tool for the simultaneous investigation of AD and outlier detection. It is a visualization of predictivity (standardized cross-validated residuals) versus reliability (leverages). In this plot, moving from the origin toward the x direction will increase the unreliability of the predicted values, and moving toward the y direction will decrease the predictivity of the model (Fig. 8). These figures show that the selected variables were so successful that no molecule labeled as an outlier in the ERM-MLR models were based on the CoMFA and CoMSIA fields.

Progressive scrambling analysis (PSA)
Progressive scrambling analysis is a test for investigating the robustness of a QSAR model and its sensitivity to chance correlations. In a large data set, some members may be twins together. Then in leave-one-out cross validation, a near twin of each left-out compound may remain in the training set. Hence, LOO-CV is not a good criterion for the robustness of a model. In addition, instead of shuffling the responses through the whole rang such as what the y-randomization algorithm does, PSA scrambles responses only within the blocks across the range. Then PSA is sensitive even to small perturbations in the data set [48]. In our study, PSA is run more than 30 times to decrease its dependency on the random number seed. The minimum and maximum of bins were two and 10, respectively, and the critical point was set to 0.85. The q 2 values of scrambled y for traditional CoMFA, ERM-MLR (based on CoMFA fields), and ERM-MLR (based on CoMSIA fields) models were 0.4056, 0.1683, and 0.1590, respectively, and their calculated cross-validated standard error (cSDEP) values were 0.6452, 0.7691, and 0.7748 for 30 PSA runs, respectively. The low q 2 values show that models that were constructed after variable selection algorithms do not suffer chance correlation.

A defined end point (biological activity)
The first item of the EOCD principles states that for having an ideal QSAR model, a welldefined end point based on a standardized test protocol is necessary [42]. Recently, Liu et al synthesized a series of quinoline compounds via the Friedlander quinoline condensation and assessed their binding affinities by an identical test protocol (displacement of [3H]-Na-methyl histamine, using cloned human H 3 receptors). All of the reported values are the average of three independent measurements and the standard errors of the mean were less than 0.25 in each case (Table 3) [49].

Geometry Optimization, Alignment and CoMFA/CoMSIA fields' calculations
The The best results were achieved by a sp 3 hybridized carbon atom with a +1 charge. The partial atomic charges were calculated by the Gasteiger-Hückel method and energy minimizations were performed using the Tripos force field with a distance-dependent dielectric and the Powell conjugate gradient algorithm (convergence criterion of 0.01 kcal/mol Å) in order to obtain the best conformer for each molecule. Interaction of the probe with the molecules on a 2 Å grid provided 1800 explanatory variables for each field per compound.
The uninformative values were removed by an optimized column filtering value equal to 1

Genetic algorithm
The genetic algorithm was inspired by a natural process. It tries to select the best-fitted variable with the higher fitness function through exploitation (natural selection) and exploration (evaluation) process. It benefits genetic operators (mutation and recombination) to enhance the new generation of variables with a higher fitness value and avoid trapping in local minima [50].

Stepwise multiple linear regression
Stepwise regression is based on systematically adding new variables to the model. In each step a variable, which has the largest correlation with the properties vector, adds to the model or removes from it to decrease its standard deviation. Based on improvement of the regression, a partial F test judges in favour of retaining or removing this new candidate variable [50].

Successive projections algorithm (SPA)
The main goal of SPA is the selection of variables with the lowest collinearity. It starts with a candidate variable in the search space and calculates its orthogonal sub-space. Its strategy for selecting the next candidate variable is based on selecting the variable that has the maximum projection value on the sub-space of the previous selected variable(s). The procedure is repeated for all of the variables, and for each variable a set of N desired numbers of variables are selected. The final step is construction of forward selection MLR models. The best model is the MLR model with lowest RMSEP value [51].

Enhancement replacement method
The replacement method (RM) is an evolved form of the stepwise algorithm. The first time it was formulated by Duchowicz et al for the QSPR study on normal boiling points of some organic molecules [52]. It searches the pool of D (N×D) descriptors, according to the MLR procedure systematically, to find d optimal descriptors that minimize standard deviation (S): where N is the number of molecules in the training set and res i is the difference between the experimental and the predicted properties. The RM first chooses a vector of d descriptors at random and does a linear regression [52,53]. Then among these descriptors, each time a descriptor with the greatest standard deviation in its coefficient is substituted with all of the remaining D-d descriptors, one by one (without considering the one(s) changed previously). This procedure is repeated until the standard deviation value does not decrease by more replacements. Then the final optimal sets of d descriptors that have the smallest value of S (in equation 1) are kept. In the modified replacement method (MRM), the descriptor with the largest error is substituted even if that replacement is not accompanied by a smaller value of S. The sequence of RM-MRM-RM is called ERM. It judiciously filters the noisy variables from informative ones in a semi-full search manner [54][55][56].