# A Novel Chemometric Method for the Prediction of Human Oral Bioavailability

## Abstract

^{2}) of 0.80 and standard error of estimate (SEE) of 0.31 for test sets. The MLR and PLS models are comparatively weaker, showing prediction abilities of 0.60 and 0.64 for the training set with SEE of 0.40 and 0.31, respectively. Our study indicates that the MLR-, PLS- and SVR-based in silico models have good potential for predicting oral bioavailability and can be applied in future drug design.

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Dataset Construction

#### 2.2. Molecular Descriptors

#### 2.3. Database Division

^{2}) were employed as a judgment for the boundary setting of molecules. The feature similarity of neighboring molecules was estimated to probe the maximum spatial gap:

_{i} is the number of samples in subset i, and x_{j}, x_{k} are the feature descriptor vectors for compounds j and k of subset i.

^{2} values were obtained for each subset.

#### 2.4. Design of Training and Test Sets

#### 2.5. MLR

#### 2.6. Partial Least Squares Analysis (PLS)

^{2} (called Q^{2}) of the training set. The model is generally considered internally predictive if Q^{2} > 0.5 [34], since Q^{2} is generally a much better indicator than the standard error and the conventional R^{2} of how reliable predictions actually are.
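As an illustration of the criterion above, a cross-validated Q^{2} can be computed as 1 − PRESS/SS, where PRESS sums the squared errors of predictions made for held-out compounds. The sketch below is a minimal, dependency-free example and not the authors' implementation: it assumes k-fold cross-validation with interleaved folds, uses the overall mean of the training responses for SS, and, for brevity, substitutes a one-descriptor least-squares model for PLS. The helper names `fit_line` and `q2_kfold` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for a single descriptor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def q2_kfold(xs, ys, k=5):
    """Cross-validated Q^2 = 1 - PRESS / SS over k interleaved folds."""
    n = len(xs)
    press = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        a, b = fit_line(train_x, train_y)
        # Accumulate squared errors on compounds the model has not seen.
        press += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    mean_y = sum(ys) / n
    ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss


# Toy data (illustrative only, not from the paper): a noisy linear trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1, 8.0, 8.9]
q2 = q2_kfold(xs, ys, k=5)
print(f"Q2 = {q2:.3f}, internally predictive: {q2 > 0.5}")
```

Replacing `fit_line` with a PLS fit at a given number of latent variables would give the kind of curve used to choose the model complexity.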

#### 2.7. Support Vector Regression (SVR)

{(x_{i}, y_{i})}_{i}^{n}, where x_{i} denotes the input vector, y_{i} denotes the output (target) value and n denotes the total number of data patterns. The modeling aim is to identify a regression function y = f(x) that accurately predicts the outputs y_{i} corresponding to a new set of input-output examples {(x_{i}, y_{i})}. Using mathematical notation, the nonlinear regression function in the original feature space is approximated using the following function:

where α_{i} and α_{i}^{*} are Lagrange multipliers obtained by minimizing the regularized risk function. The kernel function K(x, x_{i}) has been defined as a linear dot product of the nonlinear mapping, i.e.,
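In the standard ε-SVR formulation, which the symbols above appear to follow, the regression function is the kernel expansion f(x) = Σ_{i} (α_{i} − α_{i}^{*}) K(x, x_{i}) + b, and the kernel is the dot product K(x, x_{i}) = φ(x) · φ(x_{i}) of a nonlinear mapping φ. The sketch below evaluates that expansion for given multipliers. It is an illustration under stated assumptions rather than the authors' code: the bias term `b`, the mapping symbol φ, the choice of a Gaussian (RBF) kernel exp(−γ‖x − x_{i}‖²) (suggested by the γ parameter tuned in Section 3.3.3), and all numeric values are assumptions.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, alpha, alpha_star, b, gamma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x, x_i) + b."""
    return b + sum(
        (a - a_star) * rbf_kernel(x, sv, gamma)
        for sv, a, a_star in zip(support_vectors, alpha, alpha_star)
    )


# Hypothetical multipliers and support vectors, for illustration only.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
alpha = [0.8, 0.0]
alpha_star = [0.0, 0.3]
prediction = svr_predict([0.0, 0.0], support_vectors, alpha, alpha_star,
                         b=0.1, gamma=0.5)
print(round(prediction, 4))
```

Only training compounds with a nonzero difference α_{i} − α_{i}^{*} contribute to the sum, which is why the fitted model depends on a subset of the data (the support vectors).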

## 3. Results and Discussion

#### 3.1. Dataset Division

^{2} for the training set and the test set are 0.39 and 0.47, respectively. The PLS model with the best performance has 8 latent variables, presenting R^{2} = 0.58 for the training set and Q_{ex}^{2} = 0.37 for the test set. As for the SVM model, the regression results in poor R^{2} and Q_{ex}^{2} values of 0.39 and 0.35, respectively. From these results, it can be concluded that the present methods generated relatively poor models for the prediction of OB values, in agreement with previously reported models [8,9,11,12]. Accordingly, novel chemometric methods need to be introduced to improve the prediction ability for the OB of drugs.

#### 3.2. Design of Training and Test Sets

#### 3.3. Model Building

#### 3.3.1. The Results of MLR

^{2}, F-test, and SEE (standard error of estimation) values, but also have the ability to predict the property of test compounds (Q_{ex}^{2}; SEP, standard error of prediction) not included in the training set. The resulting correlation between experimental and predicted logB for all the compounds is shown in Figure 2.

N_{tr} and N_{te} are the numbers of compounds included in the training and test sets, respectively. Predicted values from Equation (6) fell close to the experimental logB with a reasonable R^{2} (0.6), which indicates good statistical characteristics of the model. The model’s prediction capability is further validated by the external test with a reasonable correlation coefficient (0.612).
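The statistics named in this section can be computed from paired experimental and predicted logB values. The sketch below uses one common set of definitions, which is an assumption because the paper's own formulas are not reproduced here: R^{2} (and Q_{ex}^{2} on the test set) as 1 − SS_res/SS_tot, SEE with n − p − 1 degrees of freedom for p descriptors, and SEP as the root mean square error over the test set. The function names and toy values are hypothetical.

```python
import math


def r_squared(y_true, y_pred, ref_mean=None):
    """1 - SS_res / SS_tot; pass the training-set mean as ref_mean for Q_ex^2."""
    if ref_mean is None:
        ref_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - ref_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def see(y_true, y_pred, n_descriptors):
    """Standard error of estimate with n - p - 1 degrees of freedom."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(ss_res / (len(y_true) - n_descriptors - 1))


def sep(y_true, y_pred):
    """Standard error of prediction, taken here as the test-set RMSE."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(ss_res / len(y_true))


# Toy logB values (illustrative only, not from the paper).
y_exp = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_exp, y_hat), see(y_exp, y_hat, 1), sep(y_exp, y_hat))
```

Because SEE is penalized by the number of descriptors while SEP is not, the two values are not directly comparable between models of very different size.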

^{2} value (0.521). The model’s prediction ability is also validated by the external test, with Q_{ex}^{2} = 0.542 and SEP = 0.480. The most important descriptor for the OB value in Set 2, G2e, is a second-component symmetry directional WHIM descriptor that uses atomic Sanderson electronegativities as a weighting scheme [39]. It is based on statistical indices calculated as the information content index on the symmetry along each component. The negative coefficient of this descriptor indicates that the bioavailability of a candidate drug increases with decreasing molecular symmetry. Another descriptor, R1e, belongs to the same class of GETAWAY descriptors as R3v+ and R2p+ in Set 1, which emphasizes the important role of GETAWAY descriptors in determining the OB values of molecules. In addition, several 3D-MoRSE descriptors are selected to build the model, including Mor25e, Mor04e, Mor27v, Mor30u and Mor06e. They are molecule atom projections along different angles, which represent different views of the whole molecular structure. The functional group count descriptors nCONN and nRCOOR represent the E-state of hydrogen bond acceptors, which are closely associated with human bioavailability, as mentioned for Equation (6).

^{2} and higher residuals (Table 1); therefore, a detailed analysis of the results of Set 3 and Set 4 is not presented here.

#### 3.3.2. The Results of PLS

^{2} of ~0.691 with SEE of ~0.411 for the training data. All the data show that the models have good external predictivity, which indicates that PLS can generate relatively good models for the bioavailability of the compounds. However, compared with the MLR models, the PLS models do not display clear advantages in SEE, SEP, and Q_{ex}^{2} for the training and test sets.

#### 3.3.3. The Results of SVR

^{−4} in Set 2, C = 131072, γ = 3.05 × 10^{−5} in Set 3, and C = 32768, γ = 1.53 × 10^{−5} in Set 4. Subsequently, we selected the optimal variables for SVR by varying the number of components from 1 to 1536.
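The optima reported for Set 3 and Set 4 fall on powers of two (131072 = 2^17, 32768 = 2^15, 3.05 × 10^{−5} ≈ 2^{−15}, 1.53 × 10^{−5} ≈ 2^{−16}), which is consistent with a grid search over base-2 exponents, a common convention for SVM parameter tuning. The sketch below only illustrates how such a grid can be built; the exponent ranges are assumptions, the helper name `log2_grid` is hypothetical, and the cross-validated scoring of each (C, γ) pair is omitted.

```python
import math


def log2_grid(min_exp, max_exp, step=1):
    """Return the values 2**e for e = min_exp, min_exp + step, ..., max_exp."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1, step)]


# Assumed exponent ranges; the actual search bounds are not stated here.
c_grid = log2_grid(-5, 20)
gamma_grid = log2_grid(-20, 5)

# In a full search, every (C, gamma) pair would be scored by cross-validation.
candidates = [(c, g) for c in c_grid for g in gamma_grid]

# Values reported in the text for Set 3 and Set 4 sit on this grid.
for c, gamma in [(131072, 3.05e-5), (32768, 1.53e-5)]:
    print(c, "= 2^%d" % round(math.log2(c)),
          "| gamma ~ 2^%d" % round(math.log2(gamma)))
```

A coarse grid of this kind is usually followed by a finer search around the best cell, which is what contour plots of the error over (C, γ) help to visualize.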

_{ex}^{2} of ~0.611 for the test sets despite the good determination coefficients (R^{2} = ~0.752) for the training sets, which reveals that the number of selected features probably has an effect on the prediction ability of the models. Thus, the stepwise method was used to select the proper number of variables for each subset, and finally 21, 12, 18 and 12 input variables were obtained for Set 1, Set 2, Set 3 and Set 4, respectively. The resulting correlations between the experimental and predicted LogB for all the compounds are shown in Figure 3, indicating that the SVM models are superior to MLR and PLS for OB prediction. Such results imply that the nonlinear relationship between bioavailability and molecular structure is more pronounced than the linear relationship.

#### 3.4. Comparison of the MLR, PLS and SVR Models

n_{1} and n_{2} are the numbers of samples in the test set, SEP_{1}^{2} is the square of the higher and SEP_{2}^{2} is the square of the lower root mean square error of the two compared models. When comparing the performance of SVR with that of MLR and PLS, the F-values are 1.66, 0.56 and 0.92 for SVR/MLR in Set 1, Set 3 and Set 4, and 0.5, 0.48 and 0.88 for SVR/PLS in the same subsets, respectively, which are lower than the critical values (1.74 for Set 1, 1.65 for Set 3, and 1.66 for Set 4). For Set 2, the F-values are 2.57 for SVR/MLR and 2.68 for SVR/PLS, which are much higher than the critical value (1.90). This indicates no statistically significant difference at a significance level of 0.05 in Set 1, Set 3 and Set 4, but a significant difference in Set 2. As for the two linear models, the calculated F-values are also lower than the critical values for the four subsets, showing that at a significance level of 0.05 the differences in SEP between the two linear models are only random. In summary, the statistical tests reveal that the performances of the MLR, PLS and SVR models are comparable with each other except for the SVR model in Set 2. This implies that the linear and non-linear methods are all appropriate for predicting the human bioavailability of candidate drugs.
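The comparison can be sketched as follows, assuming the statistic is the ratio of squared SEP values with the SVR model in the numerator. This is an inference rather than a stated formula: it reproduces the reported SVR/MLR values for Set 2 and Set 4 from the SEPs in Table 1. The function names are hypothetical, and the critical values are those quoted in the text rather than computed from the F distribution.

```python
def sep_f_ratio(sep_a, sep_b):
    """F statistic taken as the ratio of squared SEPs of two models."""
    return sep_a ** 2 / sep_b ** 2


def differs_significantly(f_value, f_critical):
    """True when the F statistic exceeds the tabulated critical value."""
    return f_value > f_critical


# SEP values from Table 1 and critical values quoted in the text.
f_set2 = sep_f_ratio(0.772, 0.482)   # SVR vs. MLR, Set 2
f_set4 = sep_f_ratio(0.461, 0.480)   # SVR vs. MLR, Set 4
print(round(f_set2, 2), differs_significantly(f_set2, 1.90))
print(round(f_set4, 2), differs_significantly(f_set4, 1.66))
```

With these inputs the Set 2 ratio exceeds its critical value while the Set 4 ratio does not, matching the conclusion drawn in the text for those two subsets.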

## 4. Conclusions

^{2} and prediction error residuals. Thus, they could be helpful as complementary tools applied in “screening prior to synthesis” procedures for the prediction of OB values.

## Acknowledgement

## References

- O’Brien, S.E.; de Groot, M.J. Greater than the sum of its parts: Combining models for useful ADMET prediction. J. Med. Chem. **2005**, 48, 1287–1291.
- Beresford, A.P.; Selick, H.E.; Tarbit, M.H. The emerging importance of predictive ADME simulation in drug discovery. Drug Discov. Today **2002**, 7, 109–116.
- Egan, W.J.; Merz, K.M.; Baldwin, J.J. Prediction of drug absorption using multivariate statistics. J. Med. Chem. **2000**, 43, 3867–3877.
- Chen, M.L.; Shah, V.; Patnaik, R.; Adams, W.; Hussain, A.; Conner, D.; Mehta, M.; Malinowski, H.; Lazor, J.; Huang, S.M.; et al. Bioavailability and bioequivalence: An FDA regulatory overview. Pharm. Res. **2001**, 18, 1645–1650.
- Hou, T.; Li, Y.; Zhang, W.; Wang, J. Recent developments of in silico predictions of intestinal absorption and oral bioavailability. Comb. Chem. High Throughput Scr. **2009**, 12, 497–506.
- Hou, T.; Wang, J. Structure-ADME relationship: Still a long way to go? Expert Opin. Drug Metab. Toxicol. **2008**, 4, 759–770.
- Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliver. Rev. **1997**, 23, 3–25.
- Aller, S.G.; Yu, J.; Ward, A.; Weng, Y.; Chittaboina, S.; Zhuo, R.; Harrell, P.M.; Trinh, Y.T.; Zhang, Q.; Urbatsch, I.L.; et al. Structure of P-glycoprotein reveals a molecular basis for poly-specific drug binding. Science **2009**, 323, 1718–1722.
- Yoshida, F.; Topliss, J.G. QSAR model for drug human oral bioavailability. J. Med. Chem. **2000**, 43, 2575–2585.
- Hou, T.J.; Wang, J.M.; Zhang, W.; Xu, X.J. ADME evaluation in drug discovery. 6. Can oral bioavailability in humans be effectively predicted by simple molecular property-based rules? J. Chem. Inf. Model. **2007**, 47, 460–463.
- Wang, Z.; Yan, A.X.; Yuan, Q.P.; Gasteiger, J. Explorations into modeling human oral bioavailability. Eur. J. Med. Chem. **2008**, 43, 2442–2452.
- Ma, C.Y.; Yang, S.Y.; Zhang, H.; Xiang, M.L.; Huang, Q.; Wei, Y.Q. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA-CG-SVM method. J. Pharma. Biomed. **2008**, 47, 677–682.
- Tian, S.; Li, Y.Y.; Wang, J.M.; Zhang, J.; Hou, T.J. ADME evaluation in drug discovery. 9. Prediction of oral bioavailability in humans based on molecular properties and structural fingerprints. Mol. Phar. **2011**, 8, 841–851.
- Hou, T.; Li, Y.; Zhang, W.; Wang, J. Recent developments of in silico predictions of intestinal absorption and oral bioavailability. Comb. Chem. High Throughput Scr. **2009**, 12, 497–506.
- Chan, L.M.; Lowes, S.; Hirst, B.H. The ABCs of drug transport in intestine and liver: Efflux proteins limiting drug absorption and bioavailability. Eur. J. Pharm. Sci. **2004**, 21, 25–51.
- Doherty, M.M.; Charman, W.N. The mucosa of the small intestine: How clinically relevant as an organ of drug metabolism? Clin. Pharmacokinet. **2002**, 41, 235–253.
- Benet, L.Z.; Wu, C.Y.; Hebert, M.F.; Wacher, V.J. Intestinal drug metabolism and antitransport processes: A potential paradigm shift in oral drug delivery. J. Control Rel. **1996**, 39, 139–143.
- Borchardt, R.T.; Smith, P.; Wilson, G. Models for Assessing Drug Absorption and Metabolism; Plenum Press: New York, NY, USA, 1996.
- Hou, T.J.; Xu, X.J. ADME evaluation in drug discovery. 1. Applications of genetic algorithms on the prediction of blood-brain partitioning of a large set drugs from structurally derived descriptors. J. Mol. Model. **2002**, 8, 337–349.
- Chemical Book Database. Available online: http://www.chemicalbook.com/ (accessed on 5 June 2012).
- Wang, X.; Yang, W.; Xu, X.; Zhang, H.; Li, Y.; Wang, Y.H. Studies of benzothiadiazine derivatives as hepatitis C virus NS5B polymerase inhibitors using 3D-QSAR, molecular docking and molecular dynamics. Curr. Med. Chem. **2010**, 17, 2788–2803.
- Hancock, T.; Put, R.; Coomans, D.; vander Heyden, Y.; Everingham, Y. A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemometr. Intell. Lab. **2005**, 76, 185–196.
- Saíz-Urra, L.; González, M.P.; Teijeira, M. QSAR studies about cytotoxicity of benzophenazines with dual inhibition toward both topoisomerases I and II: 3D-MoRSE descriptors and statistical considerations about variable selection. Bioorg. Med. Chem. **2006**, 14, 7347–7358.
- Khajeh, A.; Modarress, H. Quantitative structure–property relationship for surface tension of some common alcohols. J. Chemometr. **2011**, 25, 333–339.
- Talete, S. Dragon for windows (software for molecular descriptor calculations), version 5.4. Available online: http://www.talete.mi.it (accessed on 20 May 2011).
- Jain, A.N. Surflex: Fully automatic flexible molecular docking using a molecular similarity-based search engine. J. Med. Chem. **2003**, 46, 499–511.
- RCSB Protein Data Bank. Available online: http://www.rcsb.org (accessed on 5 June 2012).
- Xu, X.; Fu, J.X.; Wang, H.; Zhang, B.D.; Wang, X.; Wang, Y.H. Influence of P-glycoprotein on embryotoxicity of the antifouling biocides to sea urchin (Strongylocentrotus intermedius). Ecotoxicology **2011**, 20, 419–428.
- Xue, Y.; Guoyin, C.; Guan, Y.N.; Cracknell, A.P.; Jiakui, T. Iterative self-consistent approach for Earth surface temperature determination. Int. J. Remote Sens. **2005**, 26, 185–192.
- Vesanto, J.; Alhoniemi, E. Clustering of the self-organizing map. IEEE Trans. Neural Networks **2000**, 11, 586–600.
- Wang, Y.; Li, Y.; Ding, J.; Wang, Y.; Chang, Y. Prediction of binding affinity for estrogen receptor α modulators using statistical learning approaches. Mol. Divers. **2008**, 12, 93–102.
- Höskuldsson, A. PLS regression methods. J. Chemometr. **1988**, 2, 211–228.
- Chin, W.W.; Marcolin, B.L.; Newsted, P.R. A partial least squares latent variable modeling approach for measuring interaction effects: Results from a Monte Carlo simulation study and voice mail emotion/adoption study. Inf. Syst. Res. **2003**, 14, 189–217.
- Wold, S.; Ruhe, A.; Wold, H.; Dunn, W.J.J. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. Sci. Stat. Comput. **1984**, 5, 735–743.
- Vapnik, V.; Golowich, S.; Smola, A. Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing. In Advances in Neural Information Processing Systems 9, Proceedings of the 1996 Neural Information Processing Systems Conference NIPS 1996; The MIT Press: Cambridge, MA, USA, 1997; pp. 281–287.
- Benet, L.Z.; Cummins, C.L.; Wu, C.Y. Unmasking the dynamic interplay between efflux transporters and metabolic enzymes. Int. J. Pharm. **2004**, 277, 3–9.
- Szakács, G.; Váradi, A.; Özvegy-Laczka, C.; Sarkadi, B. The role of ABC transporters in drug absorption, distribution, metabolism, excretion and toxicity (ADME–Tox). Drug Discov. Today **2008**, 13, 379–393.
- Guha, R.; Serra, J.R.; Jurs, P.C. Generation of QSAR sets with a self-organizing map. J. Mol. Graph. Model. **2004**, 1–14.
- Todeschini, R.; Gramatica, P.; Provenzani, R.; Marengo, E. Weighted holistic invariant molecular descriptors. Part 2. Theory development and applications on modeling physicochemical properties of polyaromatic hydrocarbons. Chemom. Intell. Lab. Syst. **1995**, 27, 221–229.
- Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines; Cambridge University Press: Cambridge, UK, 2000.
- Schölkopf, B.; Burges, C.J.C.; Smola, A.J. Advances in Kernel Methods: Support Vector Learning; The MIT Press: Cambridge, MA, USA, 1999.
- Bhandare, P.; Mendelson, Y.; Peura, R.A.; Janatsch, G.; Kruse-Jarres, J.D.; Marbach, R.; Heise, H.M. Multivariate determination of glucose in whole blood using partial least-squares and artificial neural networks based on mid-infrared spectroscopy. Appl. Spectrosc. **1993**, 47, 1214–1221.
- Goodarzi, M.; Freitas, M.P.; Richard, J. Feature selection and linear/nonlinear regression methods for the accurate prediction of glycogen synthase kinase-3B inhibitory activities. QSAR Comb. Sci. **2008**, 27, 1092–1098.

**Figure 1.** Clustering of 8 × 8 Self-organizing map (SOM) of 224 compounds in Set 3. The numbers correspond to the series numbers of the compounds. Those numbers with frames are compounds of the test set, and the others are the compounds of the training set.

**Figure 2.** Experimental and predicted LogB values for Set 1, Set 2, Set 3 and Set 4 using the multiple linear regression (MLR), partial least squares (PLS) and support-vector machine regression (SVR) models, respectively. For MLR, the training and test sets are represented by the black empty squares and black solid squares, respectively. For PLS, they are represented by the red empty circles and red solid circles, respectively, while for SVR, they are shown by the blue empty triangles and blue solid triangles, respectively.

**Figure 3.** The prediction accuracies of 5-fold cross-validation for the 805 compounds derived from partial least squares analysis with latent variables varying from 3 to 20 in Set 1, Set 2, Set 3 and Set 4, respectively.

**Figure 4.** Contour plots of the optimization error for SVR when optimizing the parameters γ and C for the prediction of bioavailability for the training (**a**) and test (**b**) sets in Set 1 and Set 2.

**Table 1.** Statistical results of MLR, PLS and SVR for oral bioavailability (OB) prediction of compounds.

| Model | Set 1 | | | | Set 2 | | | | Set 3 | | | | Set 4 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Size | Training: 156 | | Test: 36 | | Training: 122 | | Test: 27 | | Training: 180 | | Test: 44 | | Training: 197 | | Test: 43 | |
| | R^{2} | SEE | Q_{ex}^{2} | SEP | R^{2} | SEE | Q_{ex}^{2} | SEP | R^{2} | SEE | Q_{ex}^{2} | SEP | R^{2} | SEE | Q_{ex}^{2} | SEP |
| MLR | 0.621 | 0.411 | 0.612 | 0.311 | 0.521 | 0.400 | 0.541 | 0.482 | 0.610 | 0.492 | 0.612 | 0.48 | 0.61 | 0.482 | 0.622 | 0.480 |
| PLS | 0.631 | 0.390 | 0.651 | 0.311 | 0.643 | 0.331 | 0.511 | 0.470 | 0.561 | 0.500 | 0.561 | 0.521 | 0.831 | 0.312 | 0.600 | 0.490 |
| SVM | 0.800 | 0.311 | 0.720 | 0.220 | 0.750 | 0.280 | 0.630 | 0.772 | 0.780 | 0.361 | 0.800 | 0.361 | 0.690 | 0.421 | 0.682 | 0.461 |
| SVM_{T} | 0.840 | - | 0.731 | - | 0.731 | - | 0.310 | - | 0.970 | - | 0.590 | - | 0.990 | - | 0.561 | - |

R^{2}, the regression coefficient of the training set; Q_{ex}^{2}, the regression coefficient of the test set; SEE, standard error of estimate; SEP, standard error of prediction; SVM_{T} represents the models using the total 1536 molecular descriptors as the input variables of SVR; -, not available.

© 2012 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Xu, X.; Zhang, W.; Huang, C.; Li, Y.; Yu, H.; Wang, Y.; Duan, J.; Ling, Y.
A Novel Chemometric Method for the Prediction of Human Oral Bioavailability. *Int. J. Mol. Sci.* **2012**, *13*, 6964-6982.
https://doi.org/10.3390/ijms13066964

**Chicago/Turabian Style**

Xu, Xue, Wuxia Zhang, Chao Huang, Yan Li, Hua Yu, Yonghua Wang, Jinyou Duan, and Yang Ling.
2012. "A Novel Chemometric Method for the Prediction of Human Oral Bioavailability" *International Journal of Molecular Sciences* 13, no. 6: 6964-6982.
https://doi.org/10.3390/ijms13066964