QSAR Models for CXCR2 Receptor Antagonists Based on the Genetic Algorithm for Data Preprocessing Prior to Application of the PLS Linear Regression Method and Design of the New Compounds Using In Silico Virtual Screening

The CXCR2 receptors play a pivotal role in inflammatory disorders and CXCR2 receptor antagonists can in principle be used in the treatment of inflammatory and related diseases. In this study, quantitative relationships between the structures of 130 antagonists of the CXCR2 receptors and their activities were investigated by the partial least squares (PLS) method. The genetic algorithm (GA) has been proposed for improvement of the performance of the PLS modeling by choosing the most relevant descriptors. The results of the factor analysis show that eight latent variables are able to describe about 86.77% of the variance in the experimental activity of the molecules in the training set. Power prediction of the QSAR models developed with SMLR, PLS and GA-PLS methods were evaluated using cross-validation, and validation through an external prediction set. The results showed satisfactory goodness-of-fit, robustness and perfect external predictive performance. A comparison between the different developed methods indicates that GA-PLS can be chosen as supreme model due to its better prediction ability than the other two methods. The applicability domain was used to define the area of reliable predictions. Furthermore, the in silico screening technique was applied to the proposed QSAR model and the structure and potency of new compounds were predicted. The developed models were found to be useful for the estimation of pIC50 of CXCR2 receptors for which no experimental data is available.


Introduction
The chemokine CXCR2 receptor, a seven-transmembrane G-protein-coupled receptor, was cloned and identified in the early 1990s [1][2][3]. Chemokines play the key roles in inflammation, wound healing, hematopoiesis and metastasis. The chemokines comprise a large protein family that can be divided into subfamilies on the bases of structural motifs. Chemokines mediate their biological effects via interaction with a large family of 7-transmembrane G Protein-coupled receptors. These receptors are divided into four subgroups: CC, C, CX 3 C and CXC chemokine ligands (where X represents an amino acid) depending upon the position of the N-terminal cysteine residues within the protein. The chemokine receptors CXCR2/CXCR1 were cloned and identified and are activated by IL-8 (CXCL8) [4,5]. Interleukin 8 (IL-8, CXCL8) and growth related oncogene α (GRO-α) are members of the CXC chemokine subfamily and have a role in the activation and recruitment of the neutrophils to the sites of the inflammation mediated through the CXCR2 receptor. When CXCL8 interacts with the CXCR2 and CXCR1 on the neutrophils, an intercellular response occurs, including calcium flux, degranulation and subsequently chemotaxis. Elevated levels of CXCL8 have been observed in the diseases such as arthritis and chronic obstructive pulmonary disease (COPD) [6]. In the light of these findings, small molecule antagonists of the CXCR2 receptor are attractive biological targets for molecular drug discovery [7].
During the past decades, different approaches have been used for the development of QSAR models. The major differences between these approaches are in the structural parameters (descriptors) used to characterize molecules and/or in the mathematical methods used to establish a correlation between the descriptor values and the biological activities. One of the most successful approaches for the prediction of the chemical properties based on the molecular structural information is modeling of quantitative structure-activity/property relationships (QSAR/QSPR). The main goal of QSAR/QSPR is to predict complex physical, chemical and biological properties of the compounds from molecular structures [8,9]. The close relationship which exists between bulk properties of the compounds and their molecular structures allows one to provide a clear connection between the macroscopic and the microscopic properties of matter. QSAR methodologies have the potential of decreasing substantially the time and effort required for the discovery of the new medicines or improvement of the efficiency of the current one. The success of the QSAR approach can be explained by the insights offered for the structural determination of chemical properties, and the possibility of estimating the properties of the new chemical compounds without any need for them to be synthesized and tested. However, the success of any QSAR model depends on the accuracy of input data, selection of the appropriate descriptors, statistical tools, and most importantly validation of the developed model [10][11][12][13]. A major step in constructing the QSAR models is to find a set of molecular descriptors that represents variation in the structural properties of the molecules.
QSAR analysis employs statistical methods to drive quantitative mathematical relationships between chemical structure and biological activity. Thus, the use of the QSAR for the development of a theoretical model for calculation of the IC 50 (the half maximal inhibitory concentration) of a diverse set of compounds seems to be interesting.
The strategy used in the QSAR models includes the following steps; (1) selection of a data set; (2) generation of the data molecular structures; (3) optimization of the geometry of the molecular structures by appropriate method; (4) generation of various structural descriptors; (5) application of variable selection or/and data reduction methods on the calculated descriptors; (6) regression analysis; and finally (7) evaluation of the validity and predictability of the developed QSAR models.
In the past, QSAR models have been built in the general field of chemokine antagonists including CCR1 [14], CCR5 [15,16], CXCR3 [17], CXCR4 [18] and one group of CXCR2 [19,20]. In this work, linear methods such as SMLR, PLS and GA-PLS are used to find quantitative relationships between the structures of several classes of CXCR2 antagonists and their biological activities, and the results obtained by these methods are compared. Furthermore, in silico screening is adopted to the QSAR model in order to predict the structure of new potentially active compounds.

Data Set
The biological and chemical data of 130 CXCR2 antagonists, taken from literatures were selected for QSAR study [19,[21][22][23]. The data set were heterogeneous, and involved several main classes of CXCR2 antagonists including; N,N'-diphenylureas, nicotinamide N-oxides, quinoxalines, triazolethiols, acylsulfonamide carboxylic acid bioisosteres, N-linked sulfonylurea, and furyl-3,4diamino-3-cyclobut-3-ene-1,2-dione. The general structure and biological activities of the CXCR2 antagonists are provided in Tables 1-7.    6.07 In order to guarantee that training and prediction sets cover the total space occupied by the original data set, it was divided into two parts of training and predication set according to the Kennard-Stones algorithm [24]. The Kennard-Stones algorithm is known as one of the best ways of building training and prediction sets [25,26] and recently, it has been used in many QSAR studies [27,28]. Thus, the training set, which contains 108 compounds with pIC 50 s in the range of 4.96-9.00 was used for building up the QSAR model, whereas the prediction set containing 22 compounds (out of 130 compounds, i.e., about 20% of the total number of compounds) with pIC 50  Furthermore, in order to detect the homogeneities in the data set and to recognize the potential outliers in all of the molecules under study, the principal components analysis (PCA) [29] was performed with the calculated structural descriptors on the selected data set. Figure 2 shows that with the two more significant PCs which explain 68.47% of the variation in the data set (59.86% by PC1 and 8.61% by PC2), the distribution of molecules over the region is homogeneous. Thus, the score plot is a reliable representation of the spatial distribution of the points for the data set.

Computer Hardware and Software
A Dell Personal Computer equipped with the Windows ® Vista operating system was used. HyperChem Release 7 software (Hypercube, Inc. Gainesville, Florida, USA 2002) was used to draw the molecular structures. Dragon software (Todeschini and Consonni, 2003 [30]) was employed for calculation of molecular structural descriptors. The selection of significant descriptors, which constructs a relationship between the biological activity of the data and its molecular structures, is an important step in QSAR modeling. For this purpose, the stepwise multiple linear regression method and genetic algorithm procedure were used to select the significant descriptors. The modeling was carried out using PLS Toolbox 3.5 (Eigen vector Research, Inc., Manson, WA, USA) as implemented in MATLAB. Other calculations were performed using MATLAB (version 7.5, Mathworks, Inc. Natick, MA, USA 2007) environment.

Structural Descriptors
The theoretical molecular descriptors were derived from the chemical structure of the compounds. The 3D-structures of all the compounds were drawn using the HyperChem software. The resulting geometries were further refined by means of the semiempirical AM1 method and the molecular structures were optimized using the Polak-Rebiere algorithm until the root mean square gradient reached 0.1 kJ (mol Å). Then they were transferred into the Dragon program package (version 3) [30] to obtain the different molecular descriptors including constitutional, topological descriptors, RDF, 3D-Morse, and Geometrical descriptors [31]. Finally, the constant or near constant descriptors were omitted i.e., one of the any two descriptors with an inter-correlation greater than 0.95 was removed to reduce the redundant and useless information.

Model Validation
Evaluation of a model's stability and predictive ability is another key step in QSAR modeling. Different statistical parameters have been used for the evaluation of the suitability of the developed models for prediction of the activity of the studied compounds [32] this include cross validation coefficient (Q 2 or R 2 cv ), relative error percent of prediction sets (REP Pred ), the root mean square error of prediction (RMSEP), root mean square error of cross-validation (RMSECV), validation through an external prediction set and Y-randomization. However, it should be noted that a high Q 2 does not necessarily mean a high predictability of the developed model [31]. In other word, the high value of Q 2 is a necessary condition, but not sufficient for a developed model to have high predictability.
In order to assess the predictive ability and to check the statistical significance of the developed models, the proposed models were applied to predict the values of pIC 50 of an external set that was not used in the development of the model. The predictive powers of the developed regression models on the training set were evaluated by predicted values of the prediction set. These parameters are listed in Table 8 and show the good statistical qualities and low precision errors of the assessments. The REP is calculated according to the following equation: where ŷ i , y i , y and n are the predicted value, the experimental value, the mean of the experimental value in the prediction set and the number of samples, respectively. The root mean square error cross validation (RMSECV) is a frequently used measure of the differences between the predicted values by a model or an estimator and the actually observed values from the objects being modeled or estimated. The RMSECV is defined as follows: where ŷ i , y i and n are the prediction value, the measured value and the number of measurements, respectively. The RMSECV is a measure of a model's ability to predict new samples. The RMSECV is calculated via a leave one out cross-validation, where each sample is left out of the model formulation and then is predicted. The RMSEP is defined as a measure of the average difference between the predicated and experimental values at the predication stage. The RMSEP is calculated by applying Eq.
Most QSAR modeling methods implement the leave-one-out (LOO) or leave-some-out (LSO) cross-validation procedure [13]. The outcome from the cross-validation procedure is evaluated by cross-validation coefficient (Q 2 or R 2 CV ) which is used as the criteria of both robustness and the predictive ability of the model. Cross-validated coefficient of R 2 CV (LOO-Q 2 ) is calculated according to the following formula: where ŷ i and y i are the predicted value, the experimental value (over the prediction set), respectively, and tr y is the averaged value of the dependent variable for the training set. Tropsha used the following criteria for the external validation on the prediction set: Q 2 > 0.5 R 2 > 0.6 0.85 < k < 1.15 or 0.85 < k' < 1.15 In these equations, R 2 is the correlation coefficient of regression between the experimental values and the prediction activities of the compounds on the training and prediction sets. R 2 o , R' 2 o, are mathematically defined as the regression of the experimental activities against predicted activities and regression of the predicted activities against experimental activities, respectively; where as, k and k' are the slopes of these equations [33]. When these criteria are satisfied, it can be said that the model is predictive.
Furthermore, in order to assess the robustness of the model, the Y-randomization test was applied. The dependent variable vector (inhibitory activity) was randomly shuffled and a new QSAR model was developed using the original independent variable matrix. As was expected the new QSAR models (after several repetitions) have low R 2 and Q 2 values; the results are shown in Table 9.

Results and Discussion
The predictive ability of QSAR/QSPR models is affected by two factors: the descriptors, which must carry enough of the molecular structure information for the interpretation of the activity/property; and the employed modeling method. However, with too many descriptors, there is the possibility of over fitting of the statistical methods. Thus, in QSAR/QSPR studies the identification and selection of descriptors which provide maximum information in activity variations and have minimum co-linearity is important. On the other hand, the use of PLS usually results in well fitted stable models which have high predictive ability, but the estimation is not always very accurate and stable over the time.
Therefore, a genetic algorithm (GA) [34] with a PLS regression improves the model accuracy in the selection of proper descriptors.

Stepwise Multiple Linear Regression (MLR)
On the basis of Kennard-Stones algorithm, 108 compounds out of 130 were selected as the training set and the remaining 22 were selected as the test set. Stepwise regression was used on the training data set to select the significant descriptors and it was found that between 733 calculated descriptors the MATS5v (Moran autocorrelation-lag5/weighted by atomic van der Waals volumes), GATS8P (Moran autocorrelation-lag8/weighted by atomic polarizabilites), MATS2m (Moran autocorrelation-lag2/weighted by atomic masses) and BEHp2 (highest eigenvalue n. 2 of burden matrix/weighted by atomic polarizabilites) construct the best model and there was no significant correlation between these descriptors (Table 10). So, they were selected for the further study. The selected physicochemical descriptors serve as the first guideline for the design of novel and the potent antagonists of CXCR2. The selected parameters used for development of the QSAR model are listed in Table 11. The model was produced by applying the multiple linear regression (MLR) technique on a database containing the training set. The relative importance and contribution of each descriptor in the model was determined by the calculation of the value of the mean effect (MF) [35] for each descriptor using the following equation: where MFj represents the mean effect for the descriptor j, βj is the coefficient of the descriptor j, dij is the value of the interested descriptors for each molecule and m is the number of descriptors in the model. The MF value shows the relative importance of each descriptor in compare to the other descriptors. The MF of the descriptor MATS5v, GATS8p, MATS2m and BEHp2 are also shown in Table 11 and indicate that among the selected descriptors, the most important one is MATS2m (Moran autocorrelation-lag2/weighted by atomic masses) as it has the highest mean effect value and has the largest effect on the pIC 50 of the compound. The effect of MATS5v, GATS8p, MATS2m and BEHp2 for the QSAR study of CXCR2 receptors and the standardized regression coefficient on the significance of an individual descriptor in the model is shown in Figure 3 and indicates that, the greater the absolute value of a coefficient, the greater the weight of the variable in the model.    where n and F are the compound's number and the F-ratio, respectively. In the further study, the constructed model from the training set was used to evaluate the predictive ability of the produced model by predicting the pIC 50 values in the prediction set. The results are given in Table 12 and

Interpretation of the Selected Descriptors
The binding of a ligand to a target depends on the shape of the ligand and on a variety of factors such as molecular electrostatic potential, polarizability, hydrophobicity, and lipophobicity. Therefore, in a QSAR study the strategy for encoding molecular information, either explicitly or implicitly, should account for these physicochemical effects. Furthermore, since the data sets usually include molecules of different size with different numbers of atoms, the structural encoding schemes must allow comparison between such molecules. The descriptors, MATS5v, GATS8p and MATS2m are Autocorrelation of Topological Structure. The 2D-autocorrelation descriptors explain how the values of certain functions, at intervals equal to the lag, are correlated. The 2D autocorrelation descriptors represent the topological structure of the compounds, but are more complex in nature when compared to the classical topological descriptors. The computation of these descriptors involves the summations of different autocorrelation functions corresponding to different structural lags and leads to different autocorrelation vectors corresponding to the lengths of the sub-structural fragments. Basically, the pool of 2D autocorrelation descriptors defines a wide 2D space. On behalf of a greater applicability, physicochemical properties (atomic masses, atomic van der Waals volumes, atomic Sanderson electronegativities, and atomic polarizabilities) were inserted as weighting components. As a result, these descriptors address the topology of the structure or parts thereof in association with a specific physicochemical property. Bearing in mind this aspect, the interpretation of 2D autocorrelation descriptors was uneasy.
BCTU descriptors were designed to encode atomic properties relevant to intermolecular interactions. The three standard BCUT descriptor types-atomic charge, polarizability and hydrogen bonding properties-that are relevant to intermolecular interactions are supported. The BCUT (Burden-CAS-University of Texas eigenvalues) descriptors are the eigenvalues of a modified connectivity matrix known as the Burden matrix [17]. The BCUT metrics are extensions of parameters originally developed by Burden. The Burden parameters are based on a combination of the atomic number for each atom and a description of the nominal bond-type for adjacent and nonadjacent atoms.
Among the eigenvalues obtained from B matrix, the highest eigenvalues have been demonstrated to reflect the relevant aspects of molecular structure, and are therefore useful for similarity searching. By B eigenvalue decomposition, one can find the best structure for the molecules, e.g., number of atoms, number of bonds and the electronic distributions of the whole molecule. With respect to this concept, B eigenvalues may play a good role in the prediction in addition to BEHp2.

Partial Least Squares (PLS)
The general purpose of the linear regression method is to quantify the relationship between several independent or predictive variables and a dependent variable. Independent or predictive variables could be various physicochemical descriptors of the molecules, their principle components or latent variables. The partial least squares (PLS) method is used to establish relationships between the dependent variables of the Y matrix and the descriptors of the X matrix (as independent variables also called "latent" variables) [34]. The procedure performs a principle component analysis on the independent variables matrix and simultaneously maximizing the correlation with the dependent variables matrix. The number of appropriate latent variables (LVs) for describing the best developed model was found out by evaluating the root mean square error cross-validation (RMSECV) while the number of latent variables was changed.
As it is shown in Figure 5 the RMSECV is minimized when the value of LVs is 7 and it is increased significantly when the numbers of LVs are greater than 11. Thus, the optimum LVs for the training set of PLS method was chosen to be 7. The developed PLS regression model with 7 LVs shows a high correlation between the experimental and predicted values of pIC 50 in training set (R 2 = 0.74 and RMSECV = 0.6).  Finally, for the evaluation of the predictive ability of the developed model, the Q 2 value and the external validation method were performed. A high Q 2 and R 2 values (Q 2 > 0.5) were considered as a proof of high predictive ability of the model. The external validation method was performed by dividing the original data set randomly into two parts, training and prediction set, and the values of pIC 50 of molecules in the prediction set were predicted by the developed model. The results of the calculated R 2 , Q 2 , REP%, RMSECV and etc. for prediction set are reported in Table 2.
It should be noted that even when there is no correlation between the LOO-cross-validated R 2 (Q 2 ) and regression coefficient R 2 for a predictive set with known values of biological activities, the validated model can be used for predicting activities/ properties of new chemicals [33,36]. Furthermore, As the results reveal, the PLS method is an efficient approach in monitoring many complex processes and is capable of strongly reducing cross-correlated data set with high dimension to a smaller and interpretable set of principle components or latent variables.

Partial Least Squares combined with Genetic Algorithm (GA-PLS)
As mentioned before, one of the problems in choosing the set of molecular descriptors is the colinearity within them. To overcome this problem some workers tried to combine the genetic algorithms (GA) with PLS [37][38][39]. GA-PLS consists of three basic steps. (1) Creation of an initial population of chromosomes in which each chromosome is a binary bit string by which the existence of a variable is represented; (2) Evaluation of fitness of each chromosome in the population by the internal predictivity of PLS. Thus, the squared predictive correlation coefficient (Q 2 ) by the leave-one-out procedure in cross-validation is used as the internal predictivity [40]; (3) Reproduction of the population of chromosomes in the next generation. The operations of selection, cross-over and mutation of chromosomes, are made in this step. Then, steps 2 and 3 are continued until the number of the repetitions has reached the designated number of generations. The effective factors in the GA such as repetition rate, rate of mutation, number of chromosomes and generation are optimized.
Rogers and Hopfinger first applied GA-PLS method in QSAR analysis and stated that it is very effective and superior to PLS method. In this paper, to find the more convenient set of descriptors, a GA-PLS analysis was performed [41][42][43].
All descriptors were preprocessed by auto scaling before performing the GA-PLS was performed. The GA was optimized by variation and selection of the fitness values. The fitness function is defined as:  Table 13. The QSAR model was derived by the doing the GA analysis with partial least squares (PLS)-regression method for the population size of 64 and mutation rate of 0.003. Other parameters are summarized in Table 14. Results of R 2 , REP%, RMSEP and Q 2 for prediction set of GA-PLS study are also reported in Table 2 and as it is shown the results of this analysis are similar to those obtained by PLS method but the Q 2 and R 2 value of the GA-PLS were improved in compare to the MLR and PLS methods. However, the interpretations of the chemical properties of these descriptors are difficult as their definition is based on mathematics. The details are described in the handbook and literature of Dragon software [30]. Further more, although these results show that the GA method is a satisfactory correspondence for variable selection, but more experiments are needed to generalize the superiority of GA-PLS over other techniques.

In Silico Screening
The in silico screening procedure is a useful tool for predicting and identifying new biologically active compounds with improved characteristics prior to their actual synthesis [44,45]. Thus, the in silico procedure can be applied as a physico-chemical filter to reduce the number of compounds to be tested experimentally for hit/lead generation. In other words, the in silico procedure minimizes the time and cost associated with identifying new leads. A virtual screening was performed by insertion, deletion and substitution of different substitutes on the original molecules [46,47] and the effects of the structural modifications on the biological activity were investigated. Then, the domain of application of QSAR model was defined to use the model for screening new compounds. The applicability domain (AD) of QSAR model was used to verify the prediction reliability, to identify the problematic compounds and to predict the compounds with acceptable activity that falls within this domain. Several methods have been used for determination of the AD of QSAR models [48], but the most common one is described by Gramatica [49] which used the leverage values for each compound. The leverage approach allows the determination of the position of new chemical in the QSAR model; i.e., whether a new chemical will lie within the structural model domain or outside of it. Furthermore, the leverage approach along with the Williams plot is used to determine the applicability domain in all QSAR models.
To construct the William plot, the leverage h i for each chemical compound, in which QSAR model was used to predict its activity, was calculated according to the following equation: where x i is the descriptor vector of the considered compound and X is the descriptor matrix derived from the training set descriptor values and the warning leverage (h*) was determined as [48]: where n is the number of training compounds, p is the number of predictor variables. The defined applicability domain (AD) was then visualized via a Williams plot, the plot of the standardized residuals versus the leverage values (h). A compound with h i > h * seriously influences the regression performance and may be excluded from the applicability domain, but it doesn't appear to be an outlier because its standardized residual may be small. Moreover, a value of 3 for standardized residuals is commonly used as a cut-off value for accepting predictions, because points that lie within ±3 standardized residual from the mean cover 99% of the normally distributed data [50]. Thus, the leverage and the standardized residual were combined for the characterization of the applicability domain.
The Williams plot for the QSAR is illustrated in Figure 6. The warning leverage (h*), was found to be 0.25 for the developed QSAR model. The chemicals that had a standardized residual more than three times of the standard deviation units were considered to be outliers while chemicals with a leverage value higher than h* were considered to be influential or high leverage chemicals. Based on the leverages (h > 0.25), the one compound were found to be outside of the defined AD (Figure 6) of the QSAR model, so, it was identified as structurally influential chemical based on its large leverage value (h > h*).  Tables 1-7 (IC 50 = 8.22) was selected as a template due to its good inhibition. The molecule was modified in such a way that its synthesis was experimentally possible. Then, the in silico screen was applied by substituting different groups in the X and Y positions of the ring; the results of this investigation are given in Table 15. The model tolerated various N,N'-diphenylurea substituents since all of the studied derivatives were within the applicability domain. Among different molecules designed, the compound 10c showed the best activity (pIC 50 = 8.50). Thus, in order to clarify the relation between the activities of the compounds with different functional group, this compound was selected for further structural modification. So, in the next step the oxygen of the amide group of compound 10c was substituted by different function groups, the results are demonstrated in Table 16. As it is shown, the model tolerate all the compound designed on the bases of molecule 10c, and the best predicted activity was found for the compound 9d (where X = S). Thus, it is demonstrating that using a simple QSAR model, it is possible to simultaneously identify compounds with improved activity and to determine the structural modifications that don't fall within the applicability domain. Finally, this result confirms the reliability of the models and it shows that with the construction of the QSAR model and use of in silico screening it is possible to identify new synthetic targets for drug discovery.

Conclusions
In this study, three different modeling methods, SMLR, PLS and GA-PLS were used in the construction of a QSAR model for CXCR2 antagonists and the resulting models were compared. It was shown that performing GA prior to the calibration, yields a regression model with improved predictive power. The accuracy and predictability of the proposed models were illustrated by various criteria, including cross-validation, relative error percent of prediction sets (REP Pred ), the root mean square error of prediction (RMSEP), root mean square error of cross-validation (RMSECV), validation through and Y-randomization. It was also shown that the proposed method is a useful aid for reduction of the time and cost of synthesis and activity determination of CXCR2 receptor antagonists. Furthermore, the results confirm that among the construction models used, the GA-PLS is superior for prediction of the IC 50 of CXCR2 antagonists. Our future work will focus on validation for putative CXCR2 antagonists for virtual screening.