These authors contributed equally to this work.

This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Orally administered drugs must overcome several barriers before reaching their target site. Such barriers depend largely upon specific membrane transport systems and intracellular drug-metabolizing enzymes. For the first time, P-glycoprotein (P-gp) and the cytochrome P450s, the main line of defense limiting the oral bioavailability (OB) of drugs, were incorporated into QSAR modeling of human OB based on 805 structurally diverse drug and drug-like molecules. Linear (multiple linear regression: MLR, and partial least squares regression: PLS) and nonlinear (support-vector machine regression: SVR) methods were used to construct the models, and their predictivity was verified with five-fold cross-validation and independent external tests. The performance of SVR is slightly better than that of MLR and PLS, as indicated by its determination coefficient (R^{2}) of 0.80 and standard error of estimate (SEE) of 0.31 for test sets. The MLR and PLS models are relatively weak, showing prediction abilities of 0.60 and 0.64 for the training set with SEE of 0.40 and 0.31, respectively. Our study indicates that the MLR, PLS and SVR-based

A large number of compounds emerging from combinatorial chemistry and high throughput medicinal chemistry programs have increased the demand for new compounds that need to be screened in a wide range of biological assays [

The OB is defined as “the rate and extent to which the active ingredient or active moiety is absorbed from a drug product and becomes available at the site of action” by FDA [

Lipinski’s “Rule of Five”, which could be qualitatively used to predict the absorption and permeability of drug molecules, has so far been the primary guide to predicting OB [

A significant milestone was reached in 2007, when Hou and coworkers built a publicly available and reliable source of OB data [

Indeed, human bioavailability involves a complex biological and physiological process that is influenced by various factors, including gastrointestinal transit and absorption, intestinal membrane permeation, and intestinal/hepatic first-pass metabolism [

It is frequently stated that the ideal model system for OB of drugs should be physiologically and sufficiently reflective of the specific biological barrier of interest in humans [

A total of 805 structurally diverse drug and drug-like molecules and their OB values (%F) in humans were obtained from the bioavailability database [

Construction of the models for OB firstly depends on the generation of molecular descriptors, which can be calculated directly from the structure of any particular molecule by simply using various molecular modeling tools. Dragon descriptors have been successfully used for quantitatively representing the structural and physicochemical features of a molecule [

All 805 compounds were divided into several statistical subsets using the self-consistent method. First, the geometry-based algorithm [

Secondly, the iterative self-consistent approach was used for setting boundary for the subsets [^{2}) were employed as a judgment for the boundary setting of molecules. The feature similarity of neighboring molecules was estimated to probe the maximum spatial gap:

where _{i}_{j}_{k}

Finally, steps 1 and 2 were repeated until reliable R^{2} values were obtained for each subset.

The compounds in each subset were split into training and independent validation sets based on their distribution in the chemical space as defined by Self-organizing map (SOM). SOM is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional representation of the input space of the training samples [

Then the weight of the winner is corrected to decrease this distance. In such a process, the SOM works as a clustering diagram grouping similar inputs from the input space into similar neurons of the output space.
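The winner selection and weight correction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the Gaussian neighborhood function, the linearly decaying learning rate, and the names `best_matching_unit`, `som_step` and `train_som` are all introduced here for illustration.

```python
import math
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda node: squared_distance(weights[node], x))


def som_step(weights, x, lr, sigma):
    """One update: find the winner, then pull it (and its grid neighbors) toward x."""
    winner = best_matching_unit(weights, x)
    for node, w in weights.items():
        grid_d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
        # Assumed Gaussian neighborhood: equals 1 for the winner, decays with grid distance.
        h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
        for k, xk in enumerate(x):
            w[k] += lr * h * (xk - w[k])
    return winner


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # assumed linear decay
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # shrinking neighborhood
        for x in data:
            som_step(weights, x, lr, sigma)
    return weights
```

After training, each compound is assigned to the map position returned by `best_matching_unit`, so that similar descriptor vectors tend to fall on the same or neighboring positions.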

The optimal 10 × 10, 7 × 6, 8 × 8, 8 × 8 node architectures were chosen to map objects into 100, 42, 64, 64 positions for Set 1, Set 2, Set 3 and Set 4, respectively. Similar compounds were clustered into the same position (

As one of the most widely used methods for forecasting, MLR attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data [
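As an illustration of fitting a linear equation to observed data, the sketch below solves the ordinary least-squares problem through the normal equations using only the Python standard library. It is a generic textbook formulation, not the software used in this work; the function names `fit_mlr` and `predict_mlr` are hypothetical, and a numerical library would normally be preferred for descriptor sets of realistic size.

```python
def fit_mlr(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations.

    X: list of descriptor rows, y: list of responses.
    Returns [intercept, b_1, ..., b_p].
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]
    p = len(rows[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; X^T X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def predict_mlr(coefs, row):
    """Evaluate the fitted linear equation for one descriptor row."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))
```

For toy data generated exactly as y = 1 + 2·x1 − 3·x2, `fit_mlr` recovers the intercept and both slopes up to floating-point error.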

In this work, descriptors with zero values for more than 80% of the compounds were eliminated, and the remaining molecular descriptors were further selected by the stepwise method during the MLR process. Stepwise variable entry and removal examines the variables in the block at each step (criteria: probability of

PLS, also known as Projection to Latent Structures, is a powerful statistical method that can easily cope with a large number of correlated descriptors by projecting them onto several orthogonal latent variables [R^{2} (called Q^{2}) of the training set. The model is generally considered internally predictive if Q^{2} > 0.5 [Q^{2} are much better indicators than standard error and conventional R^{2} of how reliable predictions actually are.
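The cross-validated statistic can be made concrete with a short sketch. It assumes the common definition Q^{2} = 1 − PRESS/SS, where PRESS is the sum of squared errors of the held-out predictions and SS is the total sum of squares of the training responses about their mean. To stay self-contained, a one-descriptor straight-line fit stands in for the PLS model; the fold layout and all function names are assumptions made for illustration.

```python
def kfold_indices(n, k=5):
    """Split range(n) into k interleaved folds (deterministic, for illustration)."""
    return [list(range(f, n, k)) for f in range(k)]


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor (stand-in model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def q2_cross_validated(x, y, k=5):
    """Q^2 = 1 - PRESS / SS, with PRESS accumulated over held-out folds."""
    n = len(y)
    press = 0.0
    for fold in kfold_indices(n, k):
        held_out = set(fold)
        x_train = [x[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        slope, intercept = fit_line(x_train, y_train)
        press += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in fold)
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ss_total


def is_internally_predictive(q2, threshold=0.5):
    """Apply the Q^2 > 0.5 rule of thumb quoted in the text."""
    return q2 > threshold
```

Because every prediction entering PRESS is made for a compound that was left out of the fit, Q^{2} can be much lower than the conventional R^{2} of the same model, and can even be negative.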

Recently, SVR has emerged as an alternative and powerful technique to solve regression problems by introducing an alternative loss function [

Hence, only a brief description of the method is given here. Suppose we are given training data {(x_{i}, y_{i})}_{i=1}^{n}. The regression function is expressed through the coefficients α_{i} and α_{i}^{*}, which are Lagrange multipliers obtained by minimizing the regularized risk function, and through a kernel function K(x_{i}, x).

Generally, four kinds of kernel functions,
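As a rough illustration of how the Lagrange multipliers and the kernel function enter an SVR prediction, the sketch below evaluates the textbook kernel expansion f(x) = Σ_i (α_i − α_i^*) K(x_i, x) + b together with the ε-insensitive loss. This standard form is assumed here rather than taken from the equations of this work; the radial basis function kernel is chosen only as an example, and the function names, `gamma`, `b` and `eps` are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, support_vectors, alpha, alpha_star, b, gamma):
    """Textbook SVR decision function: sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    return sum(
        (a - a_star) * rbf_kernel(sv, x, gamma)
        for sv, a, a_star in zip(support_vectors, alpha, alpha_star)
    ) + b


def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors smaller than eps cost nothing; larger errors are penalized linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

Only training points with a nonzero difference α_i − α_i^* (the support vectors) contribute to the sum, which is why the fitted model can be sparse in the training data.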

At first, three methods (MLR, PLS and SVM) were tried in order to build reasonable predictive models for the OB of the 805 molecules as a whole dataset. For the MLR model, the R^{2} values for the training set and test set are 0.39 and 0.47, respectively. The best-performing PLS model has 8 latent variables, giving R^{2} = 0.58 for the training set and R_{ex}^{2} = 0.37 for the test set. As for the SVM model, the regression results in poor R^{2} and R_{ex}^{2} values of 0.39 and 0.35, respectively. From these results, it can be concluded that models built on the whole dataset predict OB values relatively poorly, in agreement with previously reported models [

The multidrug resistance (MDR) ATP binding cassette (ABC) proteins, especially P-gp, are large, membrane-bound proteins that form a functional network capable of extruding a very wide range of foreign (xenobiotic) substrates [

In this work, we divided the 805 structurally diverse drug and drug-like molecules into subsets based on the binding affinity features of the molecules with the three proteins, which significantly strengthens the performance of the OB models, as described in Section 3.2. Here, the self-consistent method was applied to define the ranking boundaries of the subsets. According to the binding features of the molecules, all compounds were used iteratively for classification analysis to ensure optimal performance for the whole dataset. Finally, four subsets with the best performance were generated: Set 1 (binding score < 5, 192 compounds), Set 2 (5 < binding score < 6, 149 compounds), Set 3 (6 < binding score < 7.5, 224 compounds) and Set 4 (binding score > 7.5, 240 compounds) (Table S1).
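The final boundaries can be written down as a simple lookup. The sketch below uses only the thresholds (5, 6 and 7.5) and subset sizes stated above; because the text gives strict inequalities, the treatment of a score that falls exactly on a boundary is an assumption (it is placed in the higher set here), and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Boundaries between Set 1/2, Set 2/3 and Set 3/4, as stated in the text.
BOUNDARIES = [5.0, 6.0, 7.5]

# Subset sizes reported in the text.
REPORTED_SIZES = {"Set 1": 192, "Set 2": 149, "Set 3": 224, "Set 4": 240}


def assign_subset(binding_score):
    """Map a binding score to its subset label.

    Assumption: a score exactly on a boundary goes to the higher set,
    since the source only states strict inequalities.
    """
    return "Set %d" % (bisect_right(BOUNDARIES, binding_score) + 1)


def subset_sizes(scores):
    """Count how many compounds fall into each subset."""
    return Counter(assign_subset(s) for s in scores)
```

The reported sizes add up to the full dataset: 192 + 149 + 224 + 240 = 805.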

Kohonen’s self-organizing neural network has the special property of effectively creating a spatially organized internal representation of various features of input signals and their abstractions [

The MLR analysis with stepwise selection was employed to extract the molecular descriptors for creation of structure-OB relationships. The appropriate model should have a reasonable R^{2} and should also perform well on compounds (R_{ex}^{2}; SEP, standard error of prediction) not included in the training set. The resulting correlation between experimental and predicted logB for all the compounds is shown in

For example, the optimal linear model was built with eleven descriptors in Set 1:

where N_{tr} and N_{te} are the numbers of compounds included in the training and test sets, respectively. The predicted values give an R^{2} of 0.6, which indicates good statistical characteristics of the model. The model’s prediction capability is further validated by the external test, with a reasonable correlation coefficient (0.612).

For the MLR equation, it is worthwhile to note that the sequential order in which these variables appear in the OB model agrees with the order of their relative contribution (in modulus), as derived from a subsequent standardization of the orthogonalized regression coefficients. The equation of Set 1 shows that the two most important descriptors are the GETAWAY descriptors R3v+ and R2p+. R3v+ is defined as the R maximal autocorrelation of lag 3/weighted by atomic van der Waals volumes, while R2p+ is the R maximal autocorrelation of lag 2/weighted by atomic polarizabilities. Since such descriptors, derived from the Molecular Influence Matrix (MIM), contain local or distributed information on molecular structure, in most cases more than one GETAWAY descriptor is needed to reach an acceptable modeling power. The negative coefficient of R3v+ implies that low values of the atomic van der Waals volumes can lead to increased bioavailability of a compound, while the positive coefficient of R2p+ may be interpreted to mean that low values of the atomic polarizabilities can lead to decreased OB of a molecule.
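Ranking descriptors by the modulus of their standardized coefficients can be sketched as follows. This is a simplified illustration that rescales each raw coefficient by the ratio of standard deviations, β_j = b_j · s(x_j) / s(y); it omits the orthogonalization step mentioned above, and the function names and descriptor labels are hypothetical.

```python
import statistics


def standardized_coefficients(coefs, X, y):
    """Rescale raw slopes to beta_j = b_j * sd(x_j) / sd(y).

    coefs: raw regression slopes (without the intercept), one per descriptor.
    X: list of descriptor rows; y: list of responses.
    Note: no orthogonalization is performed in this simplified sketch.
    """
    sd_y = statistics.stdev(y)
    columns = list(zip(*X))
    return [b * statistics.stdev(col) / sd_y for b, col in zip(coefs, columns)]


def rank_by_importance(names, betas):
    """Sort descriptors by the absolute value (modulus) of their standardized coefficient."""
    return sorted(zip(names, betas), key=lambda pair: abs(pair[1]), reverse=True)
```

Standardization removes the dependence on the units of each descriptor, so the sign of a coefficient is preserved while its magnitude becomes comparable across descriptors.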

Hydrogen bonding interactions often play an important role in determining the binding of a ligand to its receptor. This implies that the binding of candidate drugs to proteins and metabolizing enzymes in the cellular membranes has a great effect on human bioavailability,

As for the Set 2 model,

The plot of experimental versus predicted logB shows that the predicted values capture the experimental values with a reasonable R^{2} value (0.521). The model’s prediction ability is also validated by the external test, with correlation coefficient R_{ex}^{2} = 0.542 and SEP = 0.480. As the most important descriptor for the OB value in Set 2, G2e is a second component symmetry directional WHIM descriptor that involves the atomic Sanderson electronegativities as a weighting scheme [

For the MLR models of Set 3 and Set 4, the statistical characteristics are slightly worse, with lower R^{2} and higher residuals (

PLS is a wide class of methods for modeling relations between sets of observed variables (R^{2} of ~0.691 with SEE of ~0.411 for the training data. All the data show that the models have good external predictivity, which indicates that PLS can generate relatively good models for the bioavailability of the compounds. However, compared with the MLR models, the PLS models do not display clear advantages in SEE, SEP, and R_{ex}^{2} for the training and test sets.

As a new and powerful modeling tool, SVR has recently gained much interest in pattern recognition and function approximation applications. Compared with traditional regression and neural network methods, SVRs have some advantages, including global optimum, good generalization ability, simple implementation, few free parameters, and dimensional independence [^{−4} in Set 2, ^{−5} in Set 3, and ^{−5} in Set 4. Subsequently, we selected the optimal variables for SVR by varying the number of components from 1 to 1536.

As shown in R_{ex}^{2} of ~0.611 for the test sets despite the good determination coefficients (R^{2} = ~0.752) for the training sets, which reveals that the number of selected features probably affects the prediction ability of the models. Thus, the stepwise method was used to select the proper number of variables for each subset, and finally 21, 12, 18 and 12 input variables were obtained for Set 1, Set 2, Set 3 and Set 4, respectively. The resulting correlations between the experimental and predicted LogB for all the compounds are shown in

In this work, three regression methods were employed to construct reasonable predictive models for the OB values of candidate drugs. Two of them, MLR and PLS, are based on linear regression; the third, SVR, is based on nonlinear regression. In this work,

where N_{1} and N_{2} are the numbers of samples in the test sets, SEP_{1}^{2} is the square of the higher and SEP_{2}^{2} the square of the lower root mean square error of the two compared models. When comparing the performances of SVR with MLR and PLS,

The good performance of the SVR models partly stems from the fact that SVR can model nonlinear relationships between dependent and independent variables, even without prior knowledge of the form of the nonlinearity. Traditional neural network approaches, by contrast, have suffered from difficulties with generalization, producing models that can overfit the data, a consequence of the optimization algorithms used for parameter selection and of the statistical measures employed. In addition, instead of minimizing the observed training error as traditional methods do, SVR attempts to minimize the generalization error bound so as to achieve good generalization performance. Moreover, since SVR works by solving a constrained quadratic problem where the convex objective function for minimization is given by the combination of a loss function with a regularization term, the introduction of the

The prediction ability of QSAR models depends heavily on two factors: whether the molecular descriptors carry enough information about molecular structure to interpret the activity/property, and the statistical method employed [

PLS is a useful linear technique commonly used in QSAR analysis. In this work, four conventional quantitative structure-logOB relationships were derived by the PLS analysis. As we can see from the

In summary, the results of MLR, PLS and SVR models are indicative of their abilities to accommodate linearity and nonlinearity in the bioavailability and structural descriptors. In particular, the advantages of SVR, such as robustness, no additional test requirement, and optimal prediction ability were validated in our work.

Automatically predicting human bioavailability is a very important issue because it helps to prevent industrial failure, human toxicity and poor drug activity. However, the application of existing OB models has been limited because they ignore the specific membrane transport systems and intracellular metabolizing enzymes present in the gastrointestinal tract. In this work, for the first time, we have constructed a novel chemometric method for prediction of human OB by integrating information on the ATP-dependent efflux protein P-gp and the cytochrome P450 3A4 and P450 2D6 metabolizing enzymes, an important defence limiting the absorption of candidate drugs. To establish R^{2} and prediction error residues. Thus, they could be helpful as complementary tools applied in “screening prior to synthesis” procedures for prediction of OB values.

This work was supported by special talent recruitment fund of Northwest A&F University (to J.D.).

Clustering of 8 × 8 Self-organizing map (SOM) of 224 compounds in Set 3. The numbers correspond to the series numbers of the compounds. Those numbers with frames are compounds of the test set, and the others are the compounds of the training set.

Experimental and predicted LogB values for Set 1, Set 2, Set 3 and Set 4 using the multiple linear regression (MLR), partial least squares (PLS) and support-vector machine regression (SVR) models, respectively. For MLR, the training and test sets are represented by the black empty squares and black solid squares, respectively. For PLS, they are represented by the red empty circles and red solid circles, respectively, while for SVR, they are shown by the blue empty triangles and blue solid triangles, respectively.

The prediction accuracies of 5-fold cross-validation for the 805 compounds derived from partial least squares analysis with latent variables varying from 3 to 20 in Set 1, Set 2, Set 3 and Set 4, respectively.

Contour plots of the optimization error for SVR when optimizing the parameters

Statistical results of MLR, PLS and SVR for oral bioavailability (OB) prediction of compounds.

Set | Training size | Test size | Model | R^{2} | SEE | R_{ex}^{2} | SEP |
---|---|---|---|---|---|---|---|
Set 1 | 156 | 36 | MLR | 0.621 | 0.411 | 0.612 | 0.311 |
Set 1 | 156 | 36 | PLS | 0.631 | 0.390 | 0.651 | 0.311 |
Set 1 | 156 | 36 | SVR | 0.800 | 0.311 | 0.720 | 0.220 |
Set 1 | 156 | 36 | SVM_{T} | 0.840 | - | 0.731 | - |
Set 2 | 122 | 27 | MLR | 0.521 | 0.400 | 0.541 | 0.482 |
Set 2 | 122 | 27 | PLS | 0.643 | 0.331 | 0.511 | 0.470 |
Set 2 | 122 | 27 | SVR | 0.750 | 0.280 | 0.630 | 0.772 |
Set 2 | 122 | 27 | SVM_{T} | 0.731 | - | 0.310 | - |
Set 3 | 180 | 44 | MLR | 0.610 | 0.492 | 0.612 | 0.48 |
Set 3 | 180 | 44 | PLS | 0.561 | 0.500 | 0.561 | 0.521 |
Set 3 | 180 | 44 | SVR | 0.780 | 0.361 | 0.800 | 0.361 |
Set 3 | 180 | 44 | SVM_{T} | 0.970 | - | 0.590 | - |
Set 4 | 197 | 43 | MLR | 0.61 | 0.482 | 0.622 | 0.480 |
Set 4 | 197 | 43 | PLS | 0.831 | 0.312 | 0.600 | 0.490 |
Set 4 | 197 | 43 | SVR | 0.690 | 0.421 | 0.682 | 0.461 |
Set 4 | 197 | 43 | SVM_{T} | 0.990 | - | 0.561 | - |

R^{2}, the regression coefficient of the training set; R_{ex}^{2}, the regression coefficient of the test set; SEE, standard error of estimate; SEP, standard error of prediction; SVM_{T} represents the models using the total 1536 molecular descriptors as the input variables of SVR; -, not available.
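The statistics reported in the table can be sketched as follows. The exact formulas used in this work are not given in the text, so the definitions below are assumptions: the determination coefficient is taken as 1 − SS_res/SS_tot, the standard errors as the root of the mean squared residual (optionally corrected for the number of fitted parameters), and the model comparison as the ratio of the larger to the smaller squared SEP, to be judged against an F distribution with N_1 − 1 and N_2 − 1 degrees of freedom. All function names are hypothetical.

```python
import math


def r_squared(y_obs, y_pred):
    """Determination coefficient, assumed as 1 - SS_res / SS_tot."""
    mean_y = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_y) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def standard_error(y_obs, y_pred, n_params=0):
    """Root of the mean squared residual.

    With n_params=0 this is the root mean square error (assumed here for SEP);
    a positive n_params reduces the degrees of freedom (one convention for SEE).
    """
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    return math.sqrt(ss_res / (len(y_obs) - n_params))


def f_ratio(sep_a, sep_b):
    """Ratio of the larger to the smaller squared SEP (assumed form of the F statistic)."""
    high, low = max(sep_a, sep_b), min(sep_a, sep_b)
    return (high ** 2) / (low ** 2)


# Two Set 1 test-set SEP values from the table (0.311 and 0.220), 36 test compounds each.
F = f_ratio(0.311, 0.220)
degrees_of_freedom = (36 - 1, 36 - 1)
```

The resulting ratio would then be compared with the critical F value for the chosen significance level; the critical value itself is not computed here.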