A Novel Chemometric Method for the Prediction of Human Oral Bioavailability

Orally administered drugs must overcome several barriers before reaching their target site. Such barriers depend largely upon specific membrane transport systems and intracellular drug-metabolizing enzymes. For the first time, the P-glycoprotein (P-gp) and cytochrome P450s, the main line of defense by limiting the oral bioavailability (OB) of drugs, were brought into construction of QSAR modeling for human OB based on 805 structurally diverse drug and drug-like molecules. The linear (multiple linear regression: MLR, and partial least squares regression: PLS) and nonlinear (support-vector machine regression: SVR) methods are used to construct the models with their predictivity verified with five-fold cross-validation and independent external tests. The performance of SVR is slightly better than that of MLR and PLS, as indicated by its determination coefficient (R2) of 0.80 and standard error of estimate (SEE) of 0.31 for test sets. For the MLR and PLS, they are relatively weak, showing prediction abilities of 0.60 and 0.64 for the training set with SEE of 0.40 and 0.31, respectively. Our study indicates that the MLR, PLS and SVR-based in silico models have good potential in facilitating the prediction of oral bioavailability and can be applied in future drug design.


Introduction
A large number of compounds emerging from combinatorial chemistry and high throughput medicinal chemistry programs have increased the demand for new compounds that need to be screened in a wide range of biological assays [1]. It has been reported that 95% of lead compounds fail in the developmental stages, and 50% of these failures are induced by unfavorable absorption, distribution, metabolism, and excretion (ADME) properties [2,3]. Since the predominant and most convenient way to deliver drugs to the systemic circulation for patients is the oral route, the good oral bioavailability (OB) of a new drug candidate is undoubtedly one of the most important pharmacokinetic parameters along with ADME properties.
The OB is defined as "the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action" by FDA [4]. A low and highly variable bioavailability is the main reason for stopping further development of the drug candidates. In fact, in recent years, multiple large-scale experiments drugs candidates have been conducted to assess the OB values of molecules, but they are labor-intensive and time-consuming. Therefore, developing a reliable and efficient in silico method that can predict human OB is compelling [5,6], both in the early stage of drug discovery to select the most promising compounds for further optimization and in the later stage to identify final candidates for further clinical development.
Lipinski's "Rule of Five", which could be qualitatively used to predict the absorption and permeability of drug molecules, has so far been the primary guide to predicting OB [7]. Since then, numerous classification and regression models for the predictions of OB were proposed by applying various statistical and machine-leaning computational approaches [8,9]. However, most of these models cannot demonstrate satisfactory predictions for the bioavailability. In 2000, Andrews and co-workers constructed a regression model to predict OB based on a dataset of 591 molecules by employing 85 structural descriptors [8]. Compared to Lipinski's "Rule of Five", the false negative rate was reduced from 5% to 3%, and the false positive rate decreased from 78% to 53%. But it should be noted that this model is not very good considering the high rate of false positives and the 85 descriptors used, which would cause an overfitting problem. In the same year, Yoshida et al. established a classification model for predicting OB with 15 structural descriptors, in which three descriptors were closely related to distribution coefficients [9]. However, this model can only perform a correct accuracy of 60% for the test compounds, leading to the difficulty to construct highly reliable models.
It is a significant milestone that Hou and his coworkers have built a publicly available and reliable source for OB in 2007 [10]. Then, many new models based on the large and reliable database were proposed [11,12]. In 2008, Wang et al. showed that no good prediction model of general validity could be obtained with the full dataset of 772 compounds, and instead, have built four better models by manually selecting the compounds characteristic with similar structures or pharmacological activities [11]. However, the number of each subset is too small to cover all the 772 compounds. Subsequently, Ma and co-workers proposed a prediction model of OB using GA-CG-SVM, which gained receivable overall classification accuracy (~80%), but unreasonable prediction accuracy for the low-bioavailability class (~25%) [12]. In 2011, Tian et al. constructed multiple linear regression (MLR) models for OB based on molecular properties and structural fingerprints by employing the genetic function approximation (GFA) technique, but their prediction abilities were relatively low [13]. In short, no reliable prediction model for OB has been developed based on the simple descriptors so far.
Indeed, the human bioavailability involves a complex biological and physiological process that is influenced by various factors, including gastrointestinal transition and absorption, intestinal membrane permeation, and intestinal/hepatic first-pass metabolism [14]. P-glycoprotein (P-gp), as the most studied ATP-dependent efflux protein, enables to drive compounds from inside the cell back into the intestinal lumen, and thereby prevents their absorption into blood [15]. Such protein is present on the intestinal brush border, in close proximity to the main cytochrome P450 (CYP) isoenzymes responsible for the majority of intestinal drug metabolism, CYP3A4 and CYP2D6 [16]. The CYP system is the most important family of enzymes that carry out oxidation of drugs and xenobiotics, which contributes significantly to the first-pass metabolism of many drugs such as cyclosporine, midazolam, and verapami. Indeed, substrates for P-gp and CYP3A have great overlap covering diverse therapeutic indications and a broad range of molecular structures, which indicates that the two proteins act synergistically in preventing drugs from crossing the intestinal barrier [17]. Therefore, it is reasonable to believe that in combination with P-gp, the CYP enzymes are critical factors in determining the bioavailability of drugs.
It is frequently stated that the ideal model system for OB of drugs should be physiologically and sufficiently reflective of the specific biological barrier of interest in humans [18]. This emphasizes that the knowledge of the metabolism and efflux at the intestinal mucosal level is of particular importance. Thus in this work, to compensate for the lack of key information on the physiological and biochemical processes for the prediction of oral drug bioavailability, and to improve the reliability and efficiency of OB models, we have developed a novel chemometric method by integrating the properties of P-gp-mediated efflux and metabolism by P450. On the basis of a large-scale OB (%F, the amount of drug that reaches the systemic circulation after absorption and first pass clearance) database, the relationships between %F with the well-used molecular properties have been elucidated and the reliable prediction models for OB were constructed, which can be used as rapid screening filters for candidate drugs.

Dataset Construction
805 structurally diverse drug and drug-like molecules and their OB values (%F) in human were obtained from the bioavailability database [19]. The dataset encompassed a broad range of chemical substances. For the compounds with different bioavailability values, their average values were employed to reduce bias in the experiments. In this work, to guarantee the linear distribution of the biological data, all the OB values were transformed into the common logarithm of log (oral bioavailability) (logB). The structures of the components were downloaded from Chemical Book Database [20], or generated by ISIS Draw 2.5 (MDL Information Systems, Inc.), and then were optimized by Sybyl 6.9. The Sybyl parameters were set similarly to those in our previous work [21]. All the chemicals were saved as mol2 format for further analysis.

Molecular Descriptors
Construction of the models for OB firstly depends on the generation of molecular descriptors, which can be calculated directly from the structure of any particular molecule by simply using various molecular modeling tools. Dragon descriptors have been successfully used for quantitatively representing the structural and physicochemical features of a molecule [22][23][24]. In the present work, a total of 1536 molecular descriptors were calculated using dragon professional 5.4-2006 [25], including constitutional descriptors, topological descriptors, walk and path counts, connectivity indices, information indices, 2D autocorrelations, edge adjacency indices, Burden eigenvalues, topological charge indices, eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, RDF descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, functional group counts, atom-centered fragments, charge descriptors and molecular properties. Constitutional descriptors are related to the number of atoms and bonds in the molecule. Topological descriptors are a special class of descriptors that do not rely on a three-dimensional model, including valence and no valence molecular connectivity indices calculated from the hydrogen suppressed formula of the molecule, encoding information about the size, composition, and the degree of branching of a molecule. Autocorrelation descriptors constitute a set of molecular descriptors derived from a conceptual dissection of the molecular topology and taking into account chemical information contained in the atomic weightings and structural information by specified weights of the molecule atoms. Radial Distribution Function (RDF) descriptors do not depend on the molecular size and take into account the 3D arrangement of the atoms without ambiguities, thus being applicable to a large number of molecules with great structural variance and being a characteristic common to all of them. Formally, the RDF of an ensemble of atoms can be interpreted as the probability distribution of finding an atom in spherical volume of radius R. The geometrical descriptors describe the size of the molecule and require 3D-coordinates of the atoms in the given molecule.

Database Division
All the 805 compounds were divided into several statistical subsets by using the Self-consistent method. First, the geometry-based algorithm [26] was applied to identify the binding modes of molecules with the metabolizing enzymes CYP3A4 and CYP2D6 and the efflux protein P-gp. The crystal structures of human CYP3A4 (PDB code: 3NXU) and CYP2D6 (PDB code: 2F9Q) were retrieved from RCSB Protein Data Bank [27]. Due to unavailability of the X-ray structure of human Pgp, a homology modeling for the protein was performed [28]. The template protein employed here was mouse P-gp (PDB code: 3G60 chain A [9]) which exhibited a high resolution (4.40 Å). This method employed three-dimensional transformations driven by local feature matching, and spatial pattern detection techniques such as the geometric hashing and pose clustering, to yield good molecular shape complementarity with high efficiency. After the fast transformational search, the best geometric fit obtained the highest scores (~5000), while the low scores (~500) exhibited poor matches. For the complexes in our work, the Clustering RMSD was 2.5 Å. The five lowest-binding energy matches for each complex were selected and analyzed visually.
Secondly, the iterative self-consistent approach was used for setting boundary for the subsets [29]. This method applied iterative steps to identify the maximum spatial distribution of molecules on the basis of their features, and estimate the statistical parameters of the spatial processes. In other words, the subsets are divided based on the correlation between the values of neighboring molecules. In the present work, the 805 molecules were first ranked by their binding affinities in descending order. Based on the ranking of binding affinities, one subset of the compounds with random number was selected at each of the end position, and then the correlations between the bioavailability of the molecules and their corresponding binding affinities were evaluated, respectively (step 1). The obtained determination coefficients (R 2 ) were employed as a judgment for the boundary setting of molecules. The feature similarity of neighboring molecules was estimated to probe the maximum spatial gap: where n is the number of subset, r i is sample number of subset i, and x j , x k are feature descriptor vectors for compound j, k of subset i. Finally, steps 1 and 2 were repeated until the reliable R 2 values were obtained for each subset.

Design of Training and Test Sets
The compounds in each subset were split into training and independent validation sets based on their distribution in the chemical space as defined by Self-organizing map (SOM). SOM is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional representation of the input space of the training samples [30]. In the SOM, the procedure for placing the vector from data space onto the map is to find the neuron with the closest weight vector to the vector taken from data space and to assign the map coordinates of this neuron to the vector. A formal rule for the selection of the winner (out) is based on the Euclidean distance between a vector (x) and a weight (w).
Then the weight of the winner is corrected to decrease this distance. In such a process, the SOM works as a clustering diagram grouping similar inputs from the input space into similar neurons of the output space.
The optimal 10 × 10, 7 × 6, 8 × 8, 8 × 8 node architectures were chosen to map objects into 100, 42, 64, 64 positions for Set 1, Set 2, Set 3 and Set 4, respectively. Similar compounds were clustered into the same position (x, y coordinate in a SOM). Only one part of a representative object from each position in the SOM map was chosen for the training set, respecting the original proportion and the predefined 4:1 ratio between the training and test objects. For each subset, the obtained training sets including 156, 122, 180 and 197 compounds were applied for the development of the modeling system, and the rest groups of 36, 27, 44 and 43 compounds as the independent evaluation set were used for the assessment of the system, respectively. The simulations were carried out using an internally developed C-language program.

MLR
As one of the most widely used methods for forecasting, MLR attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data [31]. By interpreting the descriptors in the regression models it is possible to gain some insight into those factors that are likely to govern the OB of the compounds, which is very useful in the design of new drugs or to screen drug-like compounds starting only from the molecular graph. Generally, the more the parameters are used in regression equation the lower the value of residual sum of squares and the higher the value of explained sum of squares will be. However, models containing more correlating parameters may suffer from the defect of collinearity and containing inferior variables and omitting important ones, which will make the parameters' estimation based on traditional methods not satisfactory. Therefore, considering the large number of molecular descriptors we used, the stepwise process is applied to choose the best combination of descriptors automatically and construct the multiple regression models with the highest statistical significance.
In this work, those variables with zero values (> 80%) were eliminated, with remaining molecular descriptors further selected by the stepwise method in MLR process. The Stepwise variable entry and removal examines the variables in the block at each step (criteria: probability of F to enter ≤ 0.05, probability of F to remove ≥ 0.10).

Partial Least Squares Analysis (PLS)
PLS, known as Projection to Latent Structures, is a powerful statistical method that can easily cope with a large number of correlated descriptors by projecting them into several orthogonal latent variables [32]. Being a component-based structural equation modeling technique, PLS simultaneously models the structural paths (i.e., theoretical relationships among latent variables) and measurement paths (i.e., relationships between a latent variable and its indicators). Rather than suppose equal weights for all indicators of a scale, the PLS algorithm allows each indicator to vary in how much it contributes to the composite score of the latent variable. Therefore, indicators with weaker relationships to related indicators and the latent construct are given lower weightings [33]. Each iteration of the algorithm introduces another latent variable. The number of latent variables was chosen to maximize cross-validated R 2 (called Q 2 ) of the training set. The model is generally considered internally predictive if Q 2 > 0.5 [34], as generally the Q 2 are much better indicators than standard error and conventional R 2 of how reliable predictions actually are.

Support Vector Regression (SVR)
Recently, SVR has emerged as an alternative and powerful technique to solve regression problems by introducing an alternative loss function [35]. Such method attempts to minimize the generalization error bound by structural risk minimization (SRM) principle so as to achieve generalized performance instead of minimizing the observed training error. In this work, this method has been applied to estimate the nonlinear relationships between the OB values of molecules and their relative molecular features.
Hence, only a brief description of the method is given here. Suppose we are given training data {( , )} n i i i x y where x i denotes the input vector; y i denotes the output (target) value and n denotes the total number of data patterns. The modeling aim is to identify a regression function ( ) y f x = that accurately predicts the outputs y i corresponding to a new set of input-output examples{( , )} i i x y . Using mathematical notation, the nonlinear regression function in the original feature space is approximated using the following function: where w and b are regression parameters. And ( ) x ϕ denotes the high-dimensional feature space, which is nonlinearly mapped from the input space. Additionally, by introducing Lagrange multipliers and exploiting the optimality constraints, the SVR function is finally formulated as following: studies have demonstrated that the RBF outperforms the other three kinds of kernel functions. Hence, this work adopted the RBF to perform inference process. The regularization parameter C and the kernel parameter γ were selected based on the overall accuracy of the internal five-fold cross-validation using the grid search method.

Dataset Division
At first, three methods including MLR, PLS and SVM were tried in order to build reasonable predictive models for OB of 805 molecules as a whole dataset. For the MLR model, its coefficients R 2 for training set and testing set are 0.39 and 0.47, respectively. The PLS model with best performance has 8 latent variable, presenting R 2 = 0.58 for the training set, and Q ex 2 = 0.37 for the test set. As for the SVM model, the regression results in the poor R 2 and Q ex 2 of 0.39 and 0.35, respectively. From these results, it can be concluded that the present models generated relatively poor models for the prediction of OB values, in agreement with these reported models [8,9,11,12]. Accordingly, novel chemometric methods are needed to be introduced to improve the prediction ability for the OB of drugs. The multidrug resistance (MDR) ATP binding cassette (ABC) proteins, especially the P-gp, are large, membrane-bound proteins, which form a functional network, capable to extrude a very wide range of foreign (xenobiotic) substrates [16]. As well as the efflux pump, the cellular route of absorption exposes drugs to intracellular metabolic systems; small intestinal enterocytes provide the first site for CYPP450-mediated metabolism of orally ingested drugs and xenobiotics [15]. For CYP3A4 and CYP2D6, the major Phase I drug-metabolizing enzymes, they are found to play complementary roles with P-gp in intestinal drug metabolism, where, through repeated extrusion and re-absorption, P-gp ensures elongated exposure of the drugs to the metabolizing enzyme [36]. Therefore, for crossing tissue barriers, in addition to some basic physical characteristics, including molecular size, charge distribution, and hydrophobicity, drug interactions with the membrane transporter P-gp and the metabolizing enzymes CYP3A4 and CYP2D6 are also key determinants [37]. Due to the critical roles of the specific membrane transport system and the intracellular metabolizing enzymes in oral drug bioavailability, it is thus reasonable to bring the information on P-gp, CYP3A4 and CYP2D6 into the creation of a new chemometric method for OB prediction.
In this work, we have divided the 805 structurally diverse drug and drug-like molecules into subsets based on the binding affinity features of the molecules with the three proteins, which significantly strengths the performance of OB models as mentioned in the following section 3.2. Here, the self-consistent method was applied to define the ranking boundary of the subsets. According to the binding features of the molecules, all compounds are iteratively used for classification analysis to ensure the optimal performance for the whole datasets. Finally, four subsets with best performance were generated: Set 1 (binding score < 5, 192 compounds), Set 2 (5 < binding score < 6, 149 compounds), Set 3 (6 < binding score < 7.5, 224 compounds) and Set 4 (binding score > 7.5, 240 compounds) (Table S1).

Design of Training and Test Sets
The Kohonen's self-organizing Neural Network has the special property of effectively creating a spatially organized internal representation of various features of input signals and their abstractions [31]. Therefore, the SOM method that enables to split a dataset into training and test sets assures that both sets cover the information space as good as possible. Figure 1 shows the distribution of the compounds within the best mapping in Set 3. The two-dimensional square grid presents clear division of the input pattern into 64 neurons. Our selective premise for the test set was that the test molecules distributed in the overall data set should be more representative of the overall data set and thus should lead to good predictive models [38]. Projections of the test and training set in the map indicate that these molecules are evenly distributed over the map: 54 of 64 for Set 3. For the SOM of Set 3, a relatively concentrated area appears in the middle side for the test set (numbers with frame), and several small test clusters reside in the ambient side of the map.

The Results of MLR
The MLR analysis with stepwise selection was employed to extract the molecular descriptors for creation of structure-OB relationships. The appropriate model should have reasonable R 2 , F-test, and SEE (standard error of estimation) values, but also have ability to predict the property of the test compounds (Q ex 2 , SEP, standard error of prediction) not included in the training set. The resulted correlation between experimental and predicted logB for all the compounds was shown in Figure 2.  For the MLR equation, it is worthwhile to note that the sequential order in which these variables appeared in the OB model agreed with the order of relative contribution importance (in modulus), as derived from a subsequent standardization of the orthogonalized regression coefficients. The equation of Set 1 shows that the most important two descriptors are the GETAWAY descriptors R3v+ and R2p+. R3v+ is defined as R maximal autocorrelation of lag 3 / weighted by atomic van der Waals volumes, while R2p+ is R maximal autocorrelation of lag 2 / weighted by atomic polarizabilities. Since such descriptors derived from the Molecular Influence Matrix (MIM) contain local or distributed information on molecular structure, in most cases more than one GETAWAY descriptor is needed to reach an acceptable modeling power. The negative coefficient of R3v+ implies that low value of atomic van der Waals volumes can lead to increased bioavailability for a compound. While the positive coefficient of R2p+ may be interpreted as that low value of the atomic polarizabilities can lead to decreased OB of a molecule.
Hydrogen bonding interaction often plays an important role in determining the binding of a ligand-receptor. This implies that the binding of candidate drugs with proteins and metabolizing enzymes in the cellular membranes have great effects on the human bioavailability, i.e., the stronger binding of agents with the proteins, the more difficult the molecules overcome the cellular barriers. In fact, for the functional group counts descriptors in Equation 1, nC=O(O)2, nNq and nROCON present the E-state of hydrogen bond acceptors, while nSH is the E-state of hydrogen bond donor. Since the negative coefficients of the descriptors suggest that higher values of the factors induce lower OB value of a molecule, it is reasonable to believe that the existence of hydrogen bond acceptors and donors is actually unfavorable for the bioavailability of candidate drugs.
As for the Set 2 model, Equation 7 presents the optimal MLR model based on the eleven descriptors for the OB direction: The plot of experimental versus predicted logB shows that the predicted values enable to capture the experimental values with reasonable R 2 value (0.521). The model's prediction ability is also validated by the external test with correlation coefficients Q ex 2 = 0.542 and SEP = 0.480. As the most important descriptor for the OB value in Set 2, G2e is a second component symmetry directional WHIM descriptor that involves the atomic Sanderson electronegativities as a weighting scheme [39]. It is based on the statistical indices calculated as the information content index on the symmetry along each component. The negative coefficient of the descriptor indicates that bioavailability of a candidate drug increase with decreasing molecular symmetry. For another descriptor R1e, it belongs to the same class of GETAWAY descriptors as the R3v+ and R2p+ in Set 1, which emphasizes the important roles of the GETAWAY descriptor in determining the OB values of molecules. In addition, several 3D-MoRSE descriptors are also selected to build the model, including Mor25e, Mor04e, Mor27v, Mor30u and Mor06e. They are molecule atom projections along different angles, which represent different views of the whole molecule structure. As for the functional group counts' descriptors nCONN, nRCOOR, they represent the E-state of hydrogen bond acceptors, which are closely associated with the human bioavailability as mentioned in Equation 6.
For the MLR models of Set 3 and Set 4, their statistical characteristics are slightly worse, with lower R 2 and higher residues (Table 1), and therefore detailed analysis for the results of Set 3 and Set 4 are not presented here.

The Results of PLS
PLS is a wide class of methods for modeling relations between sets of observed variables (Y) by means of latent variables (orthogonal linear combinations of X). As a standard regression technique, PLS can handle highly correlated, noisy, and numerous X-variables and simultaneously predict several response variables. In this section, PLS was carried out to construct the relationships between the bioavailability of the compounds and their molecular structures, which is based on linear transformation from a large number of original descriptors to a new variable space. The numbers of the latent variables were chosen to maximize the prediction accuracy of the cross-validated dataset. As shown in Figure 3, the PLS models were built with different latent variables varying from 3 to 20 for Set 1, Set 2, Set 3 and Set 4, and finally 5, 5, 4 and 7 were selected as the optimal number of the PLS factors for each subset, respectively. Under such condition, the average error rate (cross-validation) for the 805 molecules is the smallest (~0.581) among all the cases of different number of factors. As shown in Figure 2, the obtained models are satisfactory for both the training and test sets, with no evident overfitting or over-training phenomenon. The PLS models exhibit the reasonable correlation coefficients R 2 of ~0.691 with SEE of ~0.411 for the training data. All the data show that the models are externally good predictive, which indicates that PLS enables to generate relatively good models for the bioavailability of the compounds. However, compared with the MLR models, the PLS models do not display absolute advantages for the SEE, SEP, and O ex 2 for the training and test sets.

The Results of SVR
As a new and powerful modeling tool, SVR has recently gained much interest in pattern recognition and function approximation applications. Compared with traditional regression and neural networks methods, SVRs have some advantages, including global optimum, good generalization ability, simple implementation, few free parameters, and dimensional independence [40,41]. In this study, SVR takes the most commonly used Gaussian radial basis function as the kernel function, which involves two parameters to be optimized, i.e., the penalty parameter C and the Gaussian function parameter γ. To determine the optimal parameters, a grid search was performed based on leave-one-out cross validation on the training set for all parameter combinations of log2C ranging from −4 to 12 and log2γ from −12 to 8. Figure 4 and Figure S1 show the influence of each parameter with the other one fixed to the optimal values on the model performance in Set 1, Set 2, Set 3 and Set 4, respectively. The results show that the SVR models reach the best performance when C = 128, γ = 0.0078 in Set 1, C = 262144, γ = 9.77 × 10 −4 in Set 2, C = 131072, γ = 3.05 × 10 −5 in Set 3, and C = 32768, γ = 1.53 × 10 −5 in Set 4. Subsequently, we selected optimal variables for SVR by varying numbers of components from 1 to 1536.
As shown in Table 1, when all the 1536 molecular descriptors were used as the input variables of SVR, the obtained regression models exhibited relatively weak Q ex 2 of ~0.611 for the test sets despite the good determination coefficients (R 2 = ~0.752) for the training sets, which reveals that the number of selected features probably have an potential effect on the prediction ability of models. Thus, the stepwise method was used to select the proper number of variables for each subset, and finally 21, 12, 18 and 12 input variables were obtained for Set 1, Set 2, Set 3 and Set 4, respectively. The resulting correlations between the experimental and predicted LogB for all the compounds were shown in Figure 3, indicating that the SVM models are superior to MLR and PLS for the OB prediction. Such results imply that the nonlinear relationship between the bioavailability and molecular structures is more notable than the linear relationship.

Comparison of the MLR, PLS and SVR Models
In this work, three regression methods were employed to construct reasonable predictive models for the OB values of candidate drugs. Two methods are based on the linear regression, MLR and PLS; one other method is the network SVR based on the nonlinear regression. In this work, F-test was applied to investigate the performance difference of the three models [42]. This indicates no statistically significant difference at a level of significance of 0.05 in Set 1, Set 3 and Set 4, except for Set 2. As for the two linear models, the calculated F-values are also lower than the critical ones for the four subsets. The results show that at a level of significance of 0.05 the differences in the performances of the mean SEP in both linear models differ only randomly. In summary, the statistical tests reveal that the performances of the MLR, PLS and SVR models are comparable with each other except for the SVR model in Set 3. This implies that the linear and non-linear methods are all appropriate for predicting the human bioavailability of candidate drugs. For the SVR models, their good performances partly benefits from the fact that it can model nonlinear relationships between dependent and independent variables, even without prior knowledge of the form of the nonlinearity. While for the traditional neural network approaches, they have suffered difficulties with generalization, producing models that can overfit the data, which is induced by the use of optimalization algorithms used for parameter selection and the statistical measures. In addition, instead of minimizing the observed training error as the traditional methods, SVR attempts to minimize the generalization error bound so as to achieve good generalized performance. Moreover, since SVR works by solving a constrained quadratic problem where the convex objective function for minimization is given by the combination of a loss function with a regularization term, the introduction of the ε value and the regularization parameter C leads to better robust properties of SVR for various signal-to-noise ratios.
The prediction ability of QSAR models depends heavily on two factors, including the molecular descriptors carrying enough information of molecular structures for the interpretation of the activity/property, and the statistical method employed [43]. For the linear MLR method, features should represent the maximum information in activity variations, the minimum collinearity, and the well-understood relationships to the responses. In this work, the MLR-extracted descriptors, the R3v+, R2p+ for Set 1, G2e, R1e for Set 2, O-056, nNq for Set 3, and nArCNO, nNq for Set 4 play important roles in understanding the mechanism of human bioavailability of candidate drugs. For the MLR method, although it only allow easy model interpretation of what should be changed in a structure to improve the molecular biological activity, the prediction ability of such model obtained in this work still shows its reasonability, which indicates that the MLR method are reliable for the prediction of human bioavailability.
PLS is a useful linear technique commonly used in QSAR analysis. In this work, four conventional quantitative structure-logOB relationships were derived by the PLS analysis. As we can see from the Table 1, PLS is uniformly comparable to MLR on the datasets. The performance of PLS is slightly better that of MLR in the prediction ability and cross-validation for the Set 1 and Set 2 data. While for the Set 3 and Set 4 data, PLS performs worse than MLR in the prediction process. Compared with MLR, the advantage of PLS is that it is a bilinear modeling method in which the original X (predictor variables) is projected onto a small number of orthogonal latent variables (LVs) to simplify the relationship between X and Y (response variables) and mitigate the colinearity problem, which is thus relatively favorable for the prediction of bioavailability.
In summary, the results of MLR, PLS and SVR models are indicative of their abilities to accommodate linearity and nonlinearity in the bioavailability and structural descriptors. In particular, the advantages of SVR, such as robustness, no additional test requirement, and optimal prediction ability were validated in our work.

Conclusions
Automatically predicting human bioavailability is a very important issue because it helps to prevent industrial failure, human toxicity and poor drug activity. However, the application of existing OB models has been limited by the ignorance of the presence of specific membrane transport systems and intracellular metabolizing enzymes in the gastrointestinal tract. In this work, for the first time, we have constructed a novel chemometric method for prediction of human OB by integrating the information of the ATP-dependent efflux protein P-gp and the cytochrome P4503A4 and P4502D6 metabolizing enzymes, the important defence limiting the absorption of candidate drugs. To establish in silico models for predicting OB values of molecules, the two linear methods MLR and PLS, and the non-linear method SVR were attempted in the present work. It can be concluded from our results that the performance of MLR, PLS and SVR are all reliable, as indicated by their correlation R 2 and prediction error residues. Thus, they could be helpful as complementary tools applied in "screening prior to synthesis" procedures for prediction of OB values.