Reproduction is permitted for noncommercial purposes.
Nicotine and a variety of other drugs and toxins are metabolized by cytochrome P450 (CYP) 2A6. The aim of the present study was to build a quantitative structure-activity relationship (QSAR) model to predict the activities of nicotine analogues on CYP2A6. Kernel partial least squares (KPLS) regression was employed with electrotopological descriptors to build the computational models. Both the internal and external predictive abilities of the models were evaluated with test sets to ensure their validity and reliability. For comparison with KPLS, a standard PLS algorithm was also applied to the same training and test sets. Our results show that KPLS produced reasonable results and outperformed the PLS model on the datasets. The obtained KPLS model will be helpful for the design of novel nicotine-like selective CYP2A6 inhibitors.
Cytochrome P450 2A6 (CYP2A6), the major coumarin 7-hydroxylase present in human liver (Cashman et al., 1992; Pearce et al., 1992; Shimada et al., 1996), is known to metabolize a variety of compounds including quinoline (Reigh et al., 1996), nicotine (Nakajima et al., 1996), cotinine (Nakajima et al., 1996), and various N-nitroso compounds present in cigarette smoke (Guengerich et al., 1994). Hepatic CYP2A6 catalyses the major route of nicotine metabolism, which proceeds via an iminium ion intermediate that is converted by aldehyde oxidase to the metabolite cotinine (Cashman et al., 1992; Tricker, 2003; Hukkanen et al., 2005). The efficiency of CYP2A6-mediated nicotine metabolism is closely related to the blood nicotine concentration that sustains addiction liability. Potent and specific inhibitors of the CYP2A6 enzyme might improve nicotine bioavailability and thus make oral nicotine administration feasible in smoking cessation therapy. The inhibition of CYP2A6 may also decrease the number of cigarettes a person needs to smoke to obtain the desired blood nicotine concentration. A number of compounds tested as CYP2A6 inhibitors show strong inhibitory effects (Draper et al., 1997; Maenpaa et al., 1993; Fujita et al., 2003). However, to our knowledge, no compounds have been characterized as both potent and selective CYP2A6 inhibitors. In the present study, QSAR models were established based on a series of nicotine derivatives, with the ultimate aim of aiding the prediction and development of a potent and specific CYP2A6 inhibitor. The in silico models were built from electrotopological state descriptors by using kernel partial least squares (KPLS), a relatively novel method in chemometrics compared to the partial least squares (PLS) method.
The partial least squares method (Wold, 1975; Wold et al., 1984) has been a popular modeling, regression, discrimination and classification technique in chemometrics, its domain of origin. In its general form, PLS creates orthogonal score vectors by using the existing correlations between different sets of variables while also retaining most of the variance of all sets. It is a statistical tool specifically designed to deal with multiple regression problems in which the number of observations is limited, missing data are numerous and the correlations between the predictor variables are high.
PLS has proven to be useful in situations where the number of observed variables is much greater than the number of observations and high multicollinearity among the variables exists. This situation is quite common in kernel-based learning, where the original data are mapped to a high-dimensional feature space corresponding to a reproducing kernel Hilbert space. Excessively high dimensionality also causes problems such as overfitting, which decreases the prediction accuracy on external data. As a nonlinear alternative to PLS, kernel partial least squares has recently been developed based on kernel methods. A detailed description of KPLS is given in the next section.
The outline of the paper is as follows. First, kernel partial least squares analysis is introduced based on an optimization-derived method. QSAR models are then built with KPLS for a library of 55 nicotine analogues acting as selective CYP2A6 inhibitors (Denton et al., 2005). Finally, PLS and KPLS are compared to determine which exhibits superior performance.
As a generic kernel regression method, kernel partial least squares has been proven to be more competitive, and even more stable than other kernel regression algorithms such as support vector machines (SVM) and kernel ridge regression, and this method is also much more easily implemented (John and Nello, 2004).
The idea of kernel PLS is based on mapping the original Ξ-space data into a high-dimensional feature space. A kernel is a continuous function κ: Ξ × Ξ → ℝ for which there exist an inner product space Φ as a representation space and a map φ: Ξ → Φ such that for all x, y ∈ Ξ

$$\kappa(x, y) = \langle \varphi(x), \varphi(y) \rangle$$
This definition allows us to perform calculations in the Φ space in an implicit way, by substituting the scalar product operation with its corresponding kernel version.
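As a small illustration of this implicit calculation, the following sketch checks numerically that a kernel evaluated in the original space agrees with a scalar product in the feature space. The homogeneous quadratic kernel and the function names `quadratic_kernel` and `explicit_feature_map` are chosen here only for illustration; they are assumptions and not part of the modeling in this work.

```python
import numpy as np

def quadratic_kernel(x, y):
    """Homogeneous quadratic kernel k(x, y) = (x . y)^2, evaluated in input space."""
    return float(np.dot(x, y)) ** 2

def explicit_feature_map(x):
    """Explicit map phi(x): all pairwise products x_i * x_j (illustrative choice)."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

# Implicit route: one scalar product in the 3-dimensional input space.
implicit = quadratic_kernel(x, y)

# Explicit route: scalar product in the 9-dimensional feature space.
explicit = float(np.dot(explicit_feature_map(x), explicit_feature_map(y)))

print(implicit, explicit)  # the two values agree
```

The explicit route needs nine coordinates for three inputs, whereas the kernel route never forms them, which is the practical benefit of working with κ instead of φ.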
In the following, a derivation of Direct Kernel Partial Least Squares (DKPLS) for nonlinear regression, based on the optimization algorithm of Bennett and Embrechts (2003), is introduced. DKPLS factorizes the kernel matrix directly and then computes the final regression function from this factorization, which has the advantage that the kernel matrix does not need to be square. We provide here the simplified algorithm for one response variable, the more common case in QSAR modeling.
Let us consider the data sample (
from k = 1 to
The final regression coefficients
where the mth columns of
The final predictions are
Note that the test data must be centered beforehand, according to the following formula:
where 1 is the vector of element 1,
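To make the steps of a single-response kernel PLS concrete, a minimal Python sketch is given here. It is an illustration under stated assumptions, not the exact DKPLS factorization: it assumes a square training kernel and uses the widely published NIPALS-style kernel deflation, with dual coefficients computed as U (TᵀKU)⁻¹ Tᵀy. All function names (`center_train_kernel`, `center_test_kernel`, `kpls_fit`, `kpls_predict`) are hypothetical.

```python
import numpy as np

def center_train_kernel(K):
    """Center a square training kernel matrix in feature space."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

def center_test_kernel(K_test, K_train):
    """Center a (test x train) kernel using the statistics of the raw training kernel."""
    n = K_train.shape[0]
    ones_tn = np.ones((K_test.shape[0], n)) / n
    J = np.eye(n) - np.ones((n, n)) / n
    return (K_test - ones_tn @ K_train) @ J

def kpls_fit(K, y, n_components):
    """Single-response kernel PLS (assumed NIPALS-style variant).

    K: centered training kernel (n x n); y: centered response (n,).
    Returns the dual regression coefficients alpha (n,).
    """
    n = K.shape[0]
    Kd = K.copy()
    yd = np.asarray(y, dtype=float).copy()
    T = np.zeros((n, n_components))
    U = np.zeros((n, n_components))
    for m in range(n_components):
        u = yd.copy()                     # one response: u is the deflated response
        t = Kd @ u
        t /= np.linalg.norm(t)            # normalized score vector
        T[:, m], U[:, m] = t, u
        P = np.eye(n) - np.outer(t, t)
        Kd = P @ Kd @ P                   # deflate the kernel
        yd = yd - t * (t @ yd)            # deflate the response
    return U @ np.linalg.solve(T.T @ K @ U, T.T @ y)

def kpls_predict(K_test_centered, alpha, y_mean):
    """Predictions for test rows of a centered (test x train) kernel."""
    return K_test_centered @ alpha + y_mean
```

With a linear kernel and as many components as there are independent descriptors, this sketch reproduces ordinary least squares, which is a convenient sanity check before switching to a nonlinear kernel.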
Meanwhile, in order to compare the performance of the KPLS and PLS methods on the data set, partial least squares regression using the SIMPLS algorithm was also applied (Jong, 1993). The same training and test sets were used for both the KPLS and PLS models.
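For readers who want a linear baseline to experiment with, a minimal single-response PLS sketch follows. It uses the NIPALS-style PLS1 recursion rather than SIMPLS itself, so it is a stand-in for illustration only; the function name `pls1_fit` and its return values are assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Single-response PLS regression (NIPALS-style PLS1, not SIMPLS).

    Returns (coef, intercept) so that predictions are X @ coef + intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xd, yd = X - x_mean, y - y_mean          # centered working copies
    n_features = X.shape[1]
    W = np.zeros((n_features, n_components))  # weight vectors
    P = np.zeros((n_features, n_components))  # X loadings
    q = np.zeros(n_components)                # y loadings
    for a in range(n_components):
        w = Xd.T @ yd
        w /= np.linalg.norm(w)
        t = Xd @ w                            # score vector
        tt = t @ t
        P[:, a] = Xd.T @ t / tt
        q[a] = yd @ t / tt
        W[:, a] = w
        Xd = Xd - np.outer(t, P[:, a])        # deflate X
        yd = yd - q[a] * t                    # deflate y
    coef = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ coef
    return coef, intercept
```

The number of latent components is the only tuning parameter here, which is why it is selected by monitoring R and MSE on the training and test sets.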
In the present study, we used a data set of 55 nicotine analogues whose selective inhibition of CYP2A6 was reported in the literature (Denton et al., 2005). All these compounds are shown in
The molecular descriptors in
In the past decade, electrotopological state (E-state) indices have been used for correlating a variety of physicochemical and biological properties of chemical compounds. The E-state indices are computed for each atom in a molecule and encode information about both the topological environment of that atom and the electronic interactions due to all other atoms in the molecule (Kier and Hall, 1990). E-state indices have been found to be very useful in building QSAR models (Wang et al., 2004; Wang et al., 2005a; Wang et al., 2005b). In this work, the E-state descriptors with detailed definitions are indicated in
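For illustration, the atom-level E-state calculation can be sketched as follows, assuming the usual Kier and Hall definitions: the intrinsic state I = ((2/N)² δᵛ + 1)/δ and the E-state S_i = I_i + Σ_j (I_i − I_j)/r_ij², where r_ij counts the atoms on the shortest path between atoms i and j. The helper names and the propane example are hypothetical and are not taken from the software used in this study.

```python
import numpy as np
from collections import deque

def intrinsic_state(n_quantum, delta_v, delta):
    """Intrinsic state I = ((2/N)^2 * delta_v + 1) / delta (assumed definition)."""
    return ((2.0 / n_quantum) ** 2 * delta_v + 1.0) / delta

def bond_distances(adjacency):
    """Number of bonds on the shortest path between every pair of atoms (BFS)."""
    n = len(adjacency)
    dist = np.full((n, n), -1, dtype=int)
    for start in range(n):
        dist[start, start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in adjacency[a]:
                if dist[start, b] < 0:
                    dist[start, b] = dist[start, a] + 1
                    queue.append(b)
    return dist

def estate_indices(intrinsic, adjacency):
    """E-state value of each atom in a hydrogen-suppressed molecular graph."""
    I = np.asarray(intrinsic, dtype=float)
    r = bond_distances(adjacency) + 1      # atoms on the path, both ends included
    S = I.copy()
    for i in range(len(I)):
        for j in range(len(I)):
            if i != j:
                S[i] += (I[i] - I[j]) / r[i, j] ** 2
    return S

# Propane as a hydrogen-suppressed graph: CH3-CH2-CH3
adjacency = [[1], [0, 2], [1]]
I = [intrinsic_state(2, 1, 1), intrinsic_state(2, 2, 2), intrinsic_state(2, 1, 1)]
S = estate_indices(I, adjacency)
print(S)  # [2.125 1.25  2.125]
```

Because every perturbation term appears once with each sign, the E-state values of a molecule sum to the same total as its intrinsic states.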
KPLS performs as well as or better than support vector regression for moderately sized problems, with the advantages of simple implementation, lower training cost, and easier tuning of parameters. The most critical and demanding phase of building any KPLS model is the definition of the kernel and the determination of its parameters.
Three types of kernels are widely used in both SVM and KPLS: linear, polynomial (a quadratic kernel function is normally applied) and radial basis function (Gaussian kernel); more complex kernels can also be obtained by combining simpler ones. The Gaussian kernel is possibly the simplest and most effective kernel function in many cases. Therefore, for the kernel transformation in this work we used a Gaussian RBF kernel function, which has the form:
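A common way to write the Gaussian RBF kernel is κ(x, y) = exp(−‖x − y‖² / (2σ²)), with a single width parameter σ. The sketch below assumes this parameterization; the symbol `sigma` and the function name `gaussian_kernel` are illustrative assumptions, and other conventions (for example a factor γ in place of 1/(2σ²)) are equally common.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B.

    Assumed form: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    sq = np.maximum(sq, 0.0)               # guard against small negative round-off
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
K = gaussian_kernel(X, X, sigma=5.0)
print(K.shape)  # (3, 3)
```

A small σ makes the kernel matrix close to the identity (each compound resembles only itself), while a large σ makes all entries approach 1, so the width has to be tuned together with the number of latent components.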
Before generating the kernel, all the data were first Mahalanobis scaled to have mean 0 and standard deviation 1. The value of
As can be seen from
A second aspect for application of KPLS regression analysis is the optimal choice of the number of latent components (
The structure of the optimum KPLS model achieving the highest R coefficient was determined. A leave-one-out cross-validated Q^{2} of 0.41 was also obtained for the model.
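The leave-one-out statistic can be sketched as follows, assuming one common definition, Q² = 1 − PRESS / Σ(yᵢ − ȳ)², where PRESS is the sum of squared errors of the predictions made for each compound when it is left out of the fit. The function names are hypothetical, and an ordinary least-squares model stands in for KPLS purely to keep the example short.

```python
import numpy as np

def loo_q2(X, y, fit, predict):
    """Leave-one-out cross-validated Q^2 = 1 - PRESS / SS_total (assumed definition)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i               # leave compound i out
        model = fit(X[keep], y[keep])
        pred = predict(model, X[i:i + 1])[0]   # predict the left-out compound
        press += (y[i] - pred) ** 2
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

# Stand-in model (illustrative only): least squares with an intercept.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

X_demo = np.arange(6, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0              # exactly linear toy data
q2 = loo_q2(X_demo, y_demo, fit_ols, predict_ols)
```

Unlike the training R, Q² can become negative when the left-out predictions are worse than simply predicting the mean activity, which makes it a stricter check of internal predictive ability.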
From this figure, we can see that the most potent compounds, such as S29, S30 and S37 in the training set and S10 and S44 in the test set, are correctly modeled. However, the prediction errors of the model for compounds S50 and S51 are large. One major reason is that these two compounds have the weakest inhibitory effects on CYP2A6. Thus, the chemical space of the model might not be large enough to cover these two compounds, although in the training sets several compounds with the same biggest
Based on the obtained model, we predicted the binding abilities of many new virtual compounds. Two compounds (P1, P2) with their structures shown in
A kernel version of PLS has some important advantages, such as the ability to find nonlinear, global solutions and to work with high-dimensional input vectors. Unlike PLS, which involves two orders of correlation for the latent components, KPLS has three or more orders of correlation for the nonlinear components. As a relatively new method, KPLS has not yet gained the popularity of PLS in chemometrics and related fields. To compare the performance of PLS and KPLS, the PLS approach was also applied to build QSAR models using the same training and test sets in the present work. The number of latent components was set to 4 based on the optimum R and MSE obtained for both the training and test sets (data not shown). Finally, the structure of the optimum PLS model achieving the highest R coefficient was determined. Upon inspecting the results, the first thing one notices is that the nonlinear KPLS outperforms its linear counterpart.
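The two statistics used for this comparison can be computed as in the following sketch, which assumes that R denotes the Pearson correlation coefficient between observed and predicted activities and that MSE is the mean squared error. The function names and the toy values are illustrative and are not data from this study.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r_value = pearson_r(observed, predicted)
mse_value = mse(observed, predicted)
```

Reporting both is useful because R measures only how well the predictions track the trend, while MSE also penalizes systematic offsets in the predicted activities.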
The main goal of this paper was to build a QSAR model for nicotine derivatives as selective CYP2A6 inhibitors. A further goal was to compare the performance of kernel partial least squares and partial least squares analysis when applied to QSAR modeling. KPLS outperforms PLS in the present work, which is attributed to the nonlinearity of the data. The successful application of the KPLS method to nicotine derivatives will be helpful for the quantitative design of nicotine analogues as selective CYP2A6 inhibitors.
This work also presents a derivation of KPLS based on optimization algorithms, which makes the KPLS approach easier to apply in chemometrics and more accessible to machine learning researchers. This should help the kernel partial least squares algorithm, a relatively novel method, gain popularity in chemometrics and other fields.
This paper was supported by the Youth Teacher Fund of Dalian University of Technology.
Modeling results for the training and test sets with different
The coefficients and residues for the training and test sets when
The kernel partial least squares analysis of p
The partial least squares analysis of p
Log
name  p  sumdelI  sumI  Qv  nHBd  nHBa  nwHBa  SHBd  SHBa  SwHBa  Hmax  Gmax  Hmin  nrings
S1  0.68  6.783  31.5  0.969  0  3  9  0  17.783  11.68  1.447  12.521  0.62  2
S2  −1.83  3.894  29  1.319  0  3  9  0  10.923  14.346  1.424  5.009  0.614  2
S3  −0.18  2.921  26.25  1.396  0  2  9  0  5.892  14.826  1.364  4.141  0.605  2
S4  0.10  5.504  30.833  0.933  0  3  10  0  15.938  14.895  1.431  10.446  1.237  2
S5*  −0.15  6.29  32.5  1.05  0  3  10  0  16.626  14.292  1.379  11.085  0.686  2
S6  0.66  1.829  22.167  1.168  0  2  9  0  5.752  16.415  1.328  4.045  1.186  2
S7  0.01  6.481  29.833  0.849  0  3  9  0  17.61  12.223  1.434  12.42  1.212  2
S8  −0.99  3.656  27.333  1.188  0  3  9  0  10.799  14.917  1.411  4.978  0.723  2
S9  −0.65  2.701  24.583  1.251  0  2  9  0  5.795  15.4  1.351  4.098  1.198  2
S10*  0.60  2.086  23.833  1.331  0  2  9  0  5.82  15.89  1.336  4.086  0.593  2
S11  −0.42  6.826  31.5  0.969  0  3  9  0  17.778  11.685  1.442  12.52  0.62  2
S12  −0.82  3.937  29  1.319  0  3  9  0  10.897  14.373  1.419  5.008  0.614  2
S13  −1.83  7.093  32.5  0.84  0  3  10  0  19.5  13  1.479  10.342  1.255  2
S14  −0.89  2.833  25.333  1.074  0  2  10  0  8.23  17.103  1.364  4.209  1.187  2
S15*  −1.65  3.2  27  1.217  0  2  10  0  8.371  16.578  1.372  4.309  0.63  2
S16*  −0.71  3.23  27  1.217  0  2  10  0  8.321  16.622  1.37  4.27  0.585  2
S17  −0.43  3.261  27  1.217  0  2  10  0  8.352  16.62  1.368  4.309  0.576  2
S18  −0.99  3.301  27  1.217  0  2  10  0  8.454  16.56  1.37  4.403  0.621  2
S19  −0.26  3.659  26.333  0.994  0  3  9  0  12.177  14.157  1.395  4.098  1.243  2
S20  −1.44  3.458  26.333  0.994  0  3  9  0  11.85  14.483  1.384  4.006  1.235  2
S21  −0.80  4.537  29.111  1.047  0  3  10  0  13.688  15.423  1.407  5.672  1.234  2
S22*  −0.04  2.732  25.333  1.074  0  2  10  0  8.067  17.267  1.353  4.033  1.221  2
S23  −1.65  3.82  32.667  1.093  0  2  14  0  8.543  24.124  1.429  4.43  1.206  3
S24  0.605  3.047  23.833  1.01  1  3  8  1.693  10.967  12.866  1.693  4.091  1.225  2
S25  −0.795  3.55  25.5  1.163  1  3  8  1.715  11.33  12.24  1.715  4.285  0.631  2
S26  0.62  5.154  32.833  0.955  1  4  10  2.629  16.947  15.886  2.629  8.363  1.244  2
S27  0.15  6.739  34.5  0.865  1  4  10  2.647  20.509  13.991  2.647  8.297  1.262  2
S28  −0.14  6.739  34.5  0.865  1  4  10  2.647  20.509  13.991  2.647  8.297  1.262  2
S29  1.40  5.323  29  1.055  1  3  9  1.49  14.891  13.679  1.49  5.454  0.792  2
S30  1.70  3.697  27.333  1.188  1  3  9  1.463  11.329  15.389  1.463  5.536  0.744  2
S31  0.55  4.736  29.5  1.182  1  3  9  1.54  12.688  14.167  1.54  5.608  0.556  2
S32  0.75  2.931  27.833  1.328  1  3  9  1.513  9.059  15.877  1.513  4.107  0.539  2
S33  −1.35  2.594  29.333  1.473  0  3  9  0  8.14  16.015  1.359  4.128  0.561  2
S34*  −1.67  4.515  31  1.319  0  3  9  0  11.835  14.305  1.386  5.699  0.579  2
S35  −0.75  5.053  29.333  1.031  1  3  9  2.463  14.529  14.688  2.463  8.893  0.869  2
S36  −1.55  6.68  31  0.923  1  3  9  2.49  18.091  12.978  2.49  8.788  0.917  2
S37  1.05  3.317  23.167  1.086  1  2  7  1.45  9.077  13.692  1.45  5.182  0.74  1
S38*  0.05  2.731  23.667  1.247  1  2  7  1.5  6.884  14.192  1.5  3.941  0.527  1
S39  −1.36  2.643  25.167  1.415  0  2  7  0  5.997  14.387  1.335  3.966  0.549  1
S40  −0.15  3.839  29.5  1.182  1  2  11  1.45  9.605  19.304  1.45  5.526  0.72  2
S41  −1.04  2.419  25.5  1.479  0  2  9  0  5.959  15.339  1.349  4.188  0.589  2
S42*  0.77  2.215  23.833  1.331  0  2  9  0  5.857  15.918  1.336  4.145  0.583  2
S43*  0.23  3.516  25.833  1.133  1  3  9  1.569  11.229  14.605  1.569  5.614  1.22  2
S44*  0.89  2.833  25.333  1.178  0  3  8  0  10.149  13.233  1.367  4.209  0.641  2
S45  0.28  3.343  26.833  1.232  0  3  8  0  10.371  13.419  1.371  4.28  0.498  2
S46  0.64  4.042  36.5  1.127  0  3  14  0  10.592  25.067  1.445  4.406  0.928  3
S47  0.77  4.042  36.5  1.127  0  3  14  0  10.592  25.067  1.445  4.406  0.928  3
S48  0.60  2.853  25.333  1.178  0  3  8  0  10.154  13.214  1.376  4.128  0.664  2
S49  0.21  2.329  23.667  1.025  0  3  8  0  9.836  13.831  1.387  3.992  1.227  2
S50*  −1.81  3.285  25.833  0.86  1  5  6  1.936  17.196  8.637  1.936  3.925  1.256  2
S51*  −1.83  2.978  25.667  0.871  0  5  6  0  16.181  9.485  1.498  3.924  1.261  2
S52  −0.08  1.787  22.167  1.168  0  2  9  0  5.783  16.383  1.333  4.047  1.159  2
S53  −0.51  2.773  23.167  1.069  0  3  8  0  9.797  13.369  1.364  4.165  1.226  2
S54  1.00  2.044  23.833  1.331  0  2  9  0  5.851  15.858  1.341  4.088  0.593  2
S55*  −0.64  4.94  37.667  1.058  0  3  15  0  12.617  25.05  1.458  4.375  1.262  3
* Compounds used in test sets.
All compounds used in this work
[Chemical structures of compounds S1 to S55 and of the two predicted compounds P1 and P2]
The definition of the molecular descriptors used in this work
Descriptor  Definition
sumdelI  Sum of delta-I values (intrinsic state and E-state values).
sumI  Sum of intrinsic state values (I).
Qv  Qv is based on the E-state sumI values. It is the ratio of sumI values for two extremes of the structure, i.e., the molecule's position along a line from Q calculated for the isostructural alkane at one end to the most polar isoskeletal version of the structure at the other.
nHBd, nHBa  Hydrogen bond donor and acceptor counts (nwHBd and nwHBa are the corresponding counts for weak hydrogen bonds).
SHBa  Acceptor descriptor for the molecule (sum of E-state values for all hydrogen bond acceptors in the molecule). The following groups are classified as acceptors: -OH, =NH, -NH2, -NH-, >N-, -O-, =O, -S-, along with -F and -Cl.
SHBd  Donor descriptor for the molecule (sum of hydrogen E-state values for all hydrogen bond donors in the molecule). The following groups are classified as donors: -OH, =NH, -NH2, -NH-, -SH, and #CH.
SwHBa  Descriptor for weak hydrogen bond acceptors (sum of E-state values for all weak hydrogen bond acceptors). Aromatic and otherwise unsaturated carbons are considered to be weak acceptors.
Hmax, Gmax, Hmin  Extreme atom-level E-state values in the molecule: Hmax, largest hydrogen E-state value; Gmax, largest E-state value; Hmin, smallest hydrogen E-state value.
nrings  Number of rings.
The statistical results for KPLS and PLS optimum models
Set  KPLS R  KPLS MSE  PLS R  PLS MSE
Training  0.95  0.07  0.62  0.47
Test  0.70  0.63  0.09  1.29