QSAR Study of p56lck Protein Tyrosine Kinase Inhibitory Activity of Flavonoid Derivatives Using MLR and GA-PLS

Quantitative relationships between molecular structure and p56lck protein tyrosine kinase inhibitory activity of 50 flavonoid derivatives are discovered by MLR and GA-PLS methods. Different QSAR models revealed that substituent electronic descriptors (SED) parameters have significant impact on protein tyrosine kinase inhibitory activity of the compounds. Between the two statistical methods employed, GA-PLS gave superior results. The resultant GA-PLS model had a high statistical quality (R2 = 0.74 and Q2 = 0.61) for predicting the activity of the inhibitors. The models proposed in the present work are more useful in describing QSAR of flavonoid derivatives as p56lck protein tyrosine kinase inhibitors than those provided previously.


Introduction
The quantitative structure-activity relationship (QSAR) research field provides medicinal chemists with the ability to predict drug activity by mathematical equations which construct a relationship between the chemical structure and the biological activity [1,2]. These mathematical equations are in the form of y = Xb+e that describe a set of predictor variables (X) with a predicted variable (y) by OPEN ACCESS means of a regression vector (b) [3]. After the earlier QSAR studies by Hansch, who showed a correlation between biological activity and octanol-water partition coefficient [2], it is now assumed that the sum of substituent effects on the steric, electronic and hydrophobic interaction of compounds with their receptor determines their biological activity [4][5][6]. The first step in constructing the QSAR models is finding one or more molecular descriptors that represent variation in the structural property of the molecules by a number [7]. Nowadays, a wide range of descriptors are being used in QSAR studies which can be classified into different categories according to the Karelson approach including; constitutional, geometrical, topological, quantum, chemical and so on [8]. There are different variable selection methods available including; multiple linear regression (MLR), genetic algorithm (GA), principal component or factor analysis (PCA/FA) and so on. The mathematical relationships between molecular descriptors and activity are used to find the parameters affecting the biological activity and/or estimate the property of other molecules.
It is now well established that protein tyrosine kinases (PTKs) provide a central switching mechanism in cellular signal transduction pathways by catalyzing the transfer of the γ-phosphate of either ATP or GTP to specific tyrosine residues in certain protein substrates [9,10]. This regulatory control plays a crucial role in signal transduction pathways that regulate several cellular functions under both normal and deregulated conditions [11][12][13][14]. PTKs are the intracellular effectors for many growth hormone receptors. After the discovery of activated PTKs as the product of dominant viraltransforming genes (oncogenes) providing the early hypothesis for the connection between protein tyrosine phosphorylation and cell transformation, enough evidence are now available to suggest that inappropriate or elevated expression of PTKs contribute to the transformed state of cells in many human malignancies [15][16][17][18][19]. P56 lck is a lymphoid-specific protein tyrosine kinase that is principally expressed in T lymphocytes [20]. Association of p56lck with the cytoplasmic tail of various cell surface receptors, as well as associations of p56lck with intracellular targets of phosphorylation, suggests that this tyrosine kinase plays a central role in coordinating early signal transduction events [21]. Based on this knowledge it is clear that, substances which can modulate the activity of PTKs might be potentially effective therapeutic agents. The key step in the mechanism of kinase activity of all PTKs is the recognition and binding of a nucleoside triphosphate (usually ATP) and an appropriate tyrosyl-containing substrate to the enzyme. Direct transfer of phosphate between the two molecules is the next step in the PTKs function [22]. A variety of compounds can inhibit the function of PTKs in a manner which is competitive with respect to nucleotide binding. Among such competitive inhibitors are flavonoids, a group of low molecular weight plant natural products that include one of the largest classes of naturally-occurring polyphenolic compounds [23,24]. This group of plant natural products is largely responsible for the colors of many fruits and flowers, and over 4,000 flavonoid pigments have been characterized and classified according to their chemical structure. Chemically they are C6-C3-C6 compounds in which the two C6 groups are substituted benzene rings, and the C3 group is an aliphatic chain which contains a pyran ring. Flavonoids occur as O-or C-glycosides or in the "free" state as aglycones with hydroxyl or methoxyl groups present on the aglycone. The flavonoids may be divided into seven types: flavones, flavonols, flavonones, chalcones, xanthones, isoflavones, and biflavones. Flavonoids have been gained wide interest as potential pharmacological agents since some of the best sources of flavonoids are foods: apples, blueberries, bilberries, onions, soy products and tea. Furthermore numerous medicinal plants contain therapeutic amounts of flavonoids, which are used to treat a wide variety of disorders [25].
Here, we consider the inhibitory activity of flavonoids against protein-tyrosine kinase p56 lck . Several QSAR studies were reported on this class of molecules using different descriptors and different methods of modeling. Thakur et al. described a QSAR study on p56 lck protein tyrosine kinase inhibitor flavonoids using only hydration energy and hydrophobic parameters [26]. Nikolovska-Coleska et al. treated a set of 104 derivatives with standard linear regression technique by the use of classical/quantum descriptors [27]. The same dataset was treated by Novic et al. with a counter propagation neural network by the use of classical/quantum descriptors [28]. Oblak et al. applied a wide variety of descriptors with CODESSA software on the above-mentioned dataset [29]. A quantum chemical/classical QSAR study on a set of 75 flavonoids and closely related compounds tested as p56 lck protein tyrosine kinase and AR inhibitors has been carried out by Stefanic et al. and the obtained structure-activity relationships of both enzyme systems were compared [30]. A comprehensive ab initio study of 3D structures of some flavonoids is reported by Meyer [31]. Deeb et al. calculated nodal orientation with program NODANGLE [32].
In the present paper, the QSAR study for a series of 50 flavonoid analogues with the ability to inhibit protein tyrosine kinase has been considered [32]. In a comprehensive study of the PTK system we used a very large descriptor set (more than 600 topological, geometrical, constitutional, functional group, electrostatic, quantum and chemical descriptors) and different analyses: Hansch, Free-Wilson and substituent electronic descriptors (SED), in order to be able to compare the predictive ability of descriptors from different descriptor groups. Multiple linear regression (MLR) and genetic algorithm partial least squares (GA-PLS) methods were applied as methods for modeling.

Results and Discussion
The structural features and biological activity of the studied compounds are listed in Table 1. Calculated descriptors for each molecule are summarized in Table 2.

MLR analysis
In the first step, separate stepwise selection-based MLR analyses were performed using different types of descriptors, and then, an MLR equation was obtained utilizing the pool of all calculated descriptors. The results are summarized in Table 3. Correlation coefficient (r 2 ) matrix for the descriptors used in different MLR equations is shown in Table 4. Collinear descriptors degrade the performance of MLR equations and such models have lowered prediction ability.
In Table 3 the QSAR models derived for different derivatives by using different sets of molecular descriptors are listed. Table 3 provides the resulted equations for the studied compounds. The first equation of Table 3 was found by using chemical descriptors (E 1 ). This equation explained the negative effect of hydration energy and molecular weight (Mass) of molecules on protein tyrosine kinase inhibitory activity. Equation E 2 shows that among quantum descriptors, most positive charge (MPC) has a negative effect on protein tyrosine kinase inhibitory activity and reveals the presence of columbic interactions between the ligands and receptors. The negative sign of the coefficient of MPC demonstrates that ligands with the least MPC could interact with receptor more efficiently. This indicates that there is probably a negative region in receptor which produces columbic interactions with ligand. Equation E 3 of Table 3 demonstrates the effect of constitutional descriptors. It includes the negative effects of average molecular weight (AMW), number of multiple bonds (nBM) and number of aromatic bonds (nAB) on protein tyrosine kinase inhibitory activity. Molecules with lower coefficient of AMW show better protein tyrosine kinase inhibitory activity and decreasing the number of multiple bonds of compounds results in activity enhancement. The MLR equation of Table 3 was obtained from the pool of topological descriptors (E 4 ) explained the positive effect of mean information content on the distance equality (ICR), path/walk 4-randic shape index (PW4), average connectivity index chi-4 (X4v) and the negative effect of mean information content vertex degree magnitude (IVDM) and average valence connectivity index chi-1 (X1v) on protein tyrosine kinase inhibitory activity. This equation describes the structure-activity relationship better than those obtained from the chemical, quantum, constitutional descriptors.    The equation obtained from the effect of geometrical parameter on protein tyrosine kinase inhibitory activity of the studied compounds has been described as E 5 of Table 3. It explains the positive effect of spherosity (SPH) and negative effect of sum of geometrical distances between N…O, i.e. G (N…O) on protein tyrosine kinase inhibitory activity. The effect of functional groups on protein tyrosine kinase inhibitory activity of the studied compounds has been described by equation E 6 of Table 3. This three-parametric equation does not have a high statistical quality, which suggests that the protein tyrosine kinase inhibitory activity of the studied molecules is not highly dependent on the type of functional group; but it is dependent on the structural changes induced by variations in functional groups. The negative sign of nNO 2 and nOHt indicates that molecules with lower number of nitro groups (aliphatic) and tertiary alcohols (aliphatic) bind to protein kinase stronger. On the other hand, number of hydroxyl groups (nOH) represents direct effect on the inhibitory activity of the compounds. The Hansch equation (E 7 ) shows the importance of steric, electronic and lipophilic factors on protein tyrosine kinase inhibitory activity. These factors are described by L 3 (Length parameter of C 3 substituent), ℑR ' 3 , ℑR 8 (Swain and Lupton field parameter of C-R ' 3 and C-R 8 substitutes) and π 5 (lipophilic parameter of C 5 substitute), respectively. The negative coefficient of π 5 indicates that lipophilic substituents at R 5 are not favorable for binding affinity. This equation shows the positive effect of ℑR ' 3 and the negative effect of ℑR 8 on the inhibitory activity of the compounds. In addition the negative effect of L 3 describes that the presence of bulky groups at C 3 leads to decreased activity because bulky groups hinder strong interaction between ligands and the enzyme. The SED equation (E 8 ) shows the importance of SED factors on protein tyrosine kinase inhibitory activity. One of the parameters is molecular orbital energy HOMO A 3 (Highest occupied molecular orbital parameter of C 3 substitute) and the other one is SNQ8 (Sum of negative charges parameter of C 8 substitute). It explains the positive effect of HOMO A 3 and negative effect of SNQ8 on protein tyrosine kinase inhibitory activity.
The last Equation (E 9 ) was obtained from the all types of calculated descriptors. Stepwise selection and elimination of variables produced a four-parametric QSAR equation. This equation shows that geometrical (SPH), quantum (MPC), Hansch (L 3 ) and SED (SNQ8) parameters are major factors that affect protein tyrosine kinase inhibitory activity of compounds. Among these descriptors MPC and L 3 have negative effects and the others have positive effects on the protein tyrosine kinase inhibitory activity.

Free-Wilson analysis
The simple Free-Wilson analysis (FWA) was considered to indicate which substituents on ring B and chromone moiety contribute to protein tyrosine kinase inhibitory activity and which ones detract from activity [33]. As indicated in Table 1, the molecules used in this study have a phenyl ring (ring B) and chromone moiety with different types of substituents in different positions of the ring. Some important substituents such as methoxyl, hydroxyl and amine are used in calculations. Therefore, the descriptors data matrix built for the FWA has 44 rows (i.e., number of selected molecules for FWA) and 24 columns (i.e., three substituents at eight substitution positions on the flavonoid structure). The elements of the descriptor data matrix are 1 or 0, to indicate the presence or absence of a given substituent in a specified position in a molecule, respectively. The following two-parametric equation was found between the activity data (y) and the Free-Wilson type descriptors data matrix:  [27]. A methoxyl group on C-R 5 detracts from the inhibitory activity, according to this equation.

GA-PLS analysis
In PLS analysis, the descriptors data matrix is decomposed to orthogonal matrices with an inner relationship between the dependent and independent variables. Therefore, unlike MLR analysis, the multicolinearity problem in the descriptors is omitted by PLS analysis. Because a minimal number of latent variables are used for modeling in PLS; this modeling method coincides with noisy data better than MLR. In order to find the more convenient set of descriptors in PLS modeling, genetic algorithm was used. To do so, many different GA-PLS runs were conducted using different initial set of populations. The data set (n = 50) was divided into two group: calibration set (n = 40) and prediction set (n = 10). Given 40 calibration samples; the leave-one out cross-validation procedure was used to find the optimum number of latent variables for each PLS model. The most convenient GA-PLS model that resulted in the best fitness contained 14 indices, four of them being those obtained by MLR. The PLS estimate of coefficients for these descriptors are given in Figure 1. As it observed, a combination of quantum, topological, geometrical and Hansch descriptors have been selected by GA-PLS to account the protein tyrosine kinase inhibitory activity of flavonoid derivatives. The majority of these descriptors are topological indices. The resulted GA-PLS model possessed a high statistical quality R 2 = 0.74 and Q 2 = 0.61. The predictive ability of the model was measured by applying to 10 external test set molecules. The squared correlation coefficient for prediction was 0.82 and standard error of prediction was 0.30. The values of pIC 50 using GA-PLS model (refined from cross-validation or external prediction set) along with the corresponding relative errors of prediction (REP) are shown in Table 1. Very small values of relative errors (between ± 0.40) confirm the accuracy of the proposed GA-PLS model for modeling protein tyrosine kinase inhibitory activity of the studied flavonoid derivatives. tyrosine kinase enzyme. The difference in accuracy of the two regression methods used in this study is visualized in Figure 2 by plotting the predicted activity (by cross-validation) against the experimental values. Obviously, two linear models represented scattering of data around a straight line with slope close to one. As it is observed, the plot of data resulted by GA-PLS represents the lowest scattering and the plot obtained by MLR analysis (which is obtained from E 9 ) is in the second order of accuracy.
To measure the significance of the 14 selected PLS descriptors in the protein tyrosine kinase inhibitory activity; VIP was calculated for each descriptor [34]. The VIP analysis of PLS equation is shown in Figure 3. VIP shows that HNar and TI2, which are topological, and SPH which is a geometrical parameter, are the most important indices in the QSAR equation derived by PLS analysis. In addition, quantum parameters such as (HOMO) and Hansch (ℑR' 3 ) have been found to be moderately influential parameters.

Software
The two-dimensional structures of molecules were drawn using Hyperchem 7.0 software. The final geometries were obtained with the semi-empirical AM1 method in Hyperchem program. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.01 kcal mol -1 . The resulted geometry was transferred into Dragon program package, which was developed by Milano Chemometrics and QSAR Group [35]. The z-matrix of the structures was provided by the software and transferred to the Gaussian 98 program. Complete geometry optimization was performed taking the most extended conformation as starting geometries. Semiempirical molecular orbital calculation (AM1) of the structures was preformed using Gaussian 98 program [36].

Activity data & descriptor generation
The biological data used in this study are protein tyrosine kinase inhibitory activity, -log (IC 50 ), of a set of 50 flavonoid analogues [32]. The structural features and biological activity of these compounds are listed in Table 1 and then used for subsequent QSAR analysis as dependent variables. The large number of molecular descriptors was calculated using Hyperchem, Dragon package and Gaussian 98. Some chemical parameters including molecular volume (V), molecular surface area (SA), hydrophobicity (Log P), hydration energy (HE) and molecular polarizability (MP) were calculated using Hyperchem Software. The Dragon software calculated different functional groups, topological, geometrical and constitutional descriptors for each molecule. Gaussian 98 was employed for ℑR ' 3 ℑR' 4 calculation of different quantum chemical descriptors including, dipole moment (DM), local charges, and HOMO and LOMO energies. Hardness (η), softness (S), electronegativity (χ) and electrophilicity (ω) were calculated according to the method proposed by Thanikaivelan et al. [37]. Classical substituent constants including hydrophobic constant (π), the Hammet electronic constants (σ), the Taft field effect (FI), resonance (R) substituent and steric (molar refractivity MR and STERIMOL) constants were also used as descriptor in this study [38]. The calculated descriptors for each molecule are summarized in Table 2.

Data screening & model building
The selected descriptors from each class and the experimental data were analyzed by the stepwise regression SPSS (version 12.0) software. The calculated descriptors were collected in a data matrix whose number of rows and columns were the number of molecules and descriptors, respectively. Multiple linear regression (MLR) and partial least squares (PLS) were used to derive the QSAR equations and feature selection was performed by the use of genetic algorithm (GA). The resulted models were validated by leave-one out cross-validation procedure (using MATLAB software) to check their predictability and robustness. However, this procedure did not produce good results and therefore we used genetic algorithm (GA-PLS) to select the best variables.
Application of PLS allows the construction of larger QSAR equations, while still avoiding overfitting and eliminating most variables. PLS is normally used in combination with cross-validation to obtain the optimum number of components [39,40]. The PLS regression method used in this study was the NIPALS-based algorithm existed in the chemometrics toolbox of MATLAB software (version 7.1 Math work Inc.). Leave-one-out cross-validation procedure was used to obtain the optimum number of factors based on the Haaland and Thomas F-ratio criterion [41].

Variable importance in the projection (VIP)
In order to investigate the relative importance of the variable appeared in the final model obtained by GA-PLS method, variable important in projection (VIP) was employed [34]. VIP values reflect the importance of terms in PLS model. According to Erikson et al. X-variables (predictor variables) could be classified according to their relevance in explaining y (predicted variable), so that VIP > 1.0 and VIP < 0.8 mean highly or less influential, respectively, and 0.8 < VIP< 1.0 means moderately influential [8].

Substituent electronic descriptors (SED)
Electronic descriptors obtained from quantum chemical calculations have found major popularity and there is a challenge between calculation complexity and accuracy to select the quantum chemical calculation methods (i.e., semi-empirical and ab initio) [42]. To simplify the quantum chemical calculations Hemmateenejad et al. recently have hypothesized that the calculations could be performed on the substituents instead of whole molecular structures and the resulting electronic features can be considered as electronic descriptors which have found major popularity in QSAR/QSPR studies [43,44]. Hemmateenejad et al. proposed substituent electronic descriptors (SED) as an alternative to both substituent constants and molecular descriptors [43]. SED analysis for each substituent was used in our study and the calculated descriptors are listed in Table 2. They can be classified into three different electronic categories including local charges, dipoles and orbital energies. Since most of the constituents are open shell quantum species (due to being in doublet quantum state as a radical molecule), a difference in energy between two electronic energy populations, alpha (spine up) and beta (spine down) can be seen using Gaussian 98. It provides some additional descriptors HOMO A , HOMO B , LUMO A , LUMO B , HAD, HD B , SOF A , SOF B , EN A , EN B , EPH A , and EPH B stem from two different alpha and beta electronic population energy, where the subscript A and B stand for alpha and beta population of electronic energy, respectively. Therefore, a total of 26 electronic descriptors were calculated for each substituent.

Conclusions
Quantitative relationships between molecular structure and protein tyrosine kinase inhibitory activity of flavonoid derivatives were discovered by two chemometrics methods: MLR and GA-PLS. Different QSAR models revealed that SED parameters have significant impact on protein tyrosine kinase inhibitory activity of the compounds. In this series a significant role of topological and geometrical parameters on the inhibitory activity was observed. Using the pool of all types of calculated descriptors a new QSAR model was derived for these compounds. In this model the importance of quantum, geometrical, SED and Hansch parameters have an effect on protein tyrosine kinase inhibitory activity was indicated. A comparison between the two statistical methods employed indicated that GA-PLS represented superior results. The resulted GA-PLS model possessed a high statistical quality (R 2 = 0.74 and Q 2 = 0.61) for predicting the activity of the inhibitors. The models proposed in present work are more useful in describing QSAR of flavonoid derivatives as p56 lck protein tyrosin kinase Inhibitors than those proposed previously.