Article

QSPR Models for Predicting Log Pliver Values for Volatile Organic Compounds Combining Statistical Methods and Domain Knowledge

1
Planta Piloto de Ingeniería Química (PLAPIQUI) CONICET-UNS, La Carrindanga km.7, Bahía Blanca, 8000, Argentina
2
Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), DCIC, UNS, Av. Alem 1250, Bahía Blanca, 8000, Argentina
3
Faculty of Computer Science, Dalhousie University, 6050 University Av., PO BOX 15000, Halifax, NS B3H 4R2, Canada
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Molecules 2012, 17(12), 14937-14953; https://doi.org/10.3390/molecules171214937
Submission received: 10 September 2012 / Revised: 12 December 2012 / Accepted: 13 December 2012 / Published: 17 December 2012
(This article belongs to the Special Issue QSAR and Its Applications)

Abstract

Volatile organic compounds (VOCs) are contained in a variety of chemicals that can be found in household products and may have undesirable effects on health. Therefore, it is important to model blood-to-liver partition coefficients (log Pliver) for VOCs in a fast and inexpensive way. In this paper, we present two new quantitative structure-property relationship (QSPR) models for the prediction of log Pliver, and we propose a hybrid approach for the selection of the descriptors. This hybrid methodology combines a machine learning method with a manual selection based on expert knowledge, which yields a set of descriptors that is interpretable in physicochemical terms. Our regression models were trained using decision trees and neural networks and validated using an external test set. Results show high prediction accuracy compared to previous log Pliver models, and the descriptor selection approach provides a means to obtain a small set of descriptors that agrees with the theoretical understanding of the target property.

1. Introduction

Volatile organic compounds (VOCs) are emitted as gases from certain solids or liquids. VOCs include a variety of chemicals, some of which may have short- and long-term adverse health effects. Concentrations of many VOCs are consistently higher indoors (up to ten times higher) than outdoors. Organic chemicals are widely used as ingredients in household products. Paints, varnishes, and wax all contain organic solvents, as do many cleaning, disinfecting, cosmetic, degreasing, and hobby products. All of these products may release organic compounds while they are used and, to some degree, when they are stored. The main concern is the potential for VOCs to adversely affect the health of people who are exposed to them indoors [1,2]. Woodruff et al. [3] described the need for better public health policies on chemicals released into our environment. They proposed modernizing the approaches used to assess health risk and stressed the importance of a scientific understanding of the relationship between pollutant exposure and adverse health effects.
In this context, quantitative structure-property relationship (QSPR) models allow one to relate measurements on a set of “descriptor” (or predictor) variables to the behavior of the response variable and constitute a valuable tool for in silico property prediction. In particular, the development of combinatorial chemistry and high throughput screening programs has stimulated drug discovery research to find theoretical and computational models to estimate and predict drug absorption, distribution, metabolism, and excretion (ADME) based on drug physicochemical properties [4]. These methodologies have also been applied to VOCs inhalation studies [5,6] and are related to the analysis of physiologically-based pharmacokinetic (PBPK) models.
PBPK modeling is a mathematical modeling technique for predicting the ADME of synthetic or natural chemical substances in humans and other animal species. In respiratory PBPK models blood-air, liver-air and liver-blood partition coefficients of VOCs are important for their hazard assessment and bioavailability estimation [7]. Several attempts have been made to model the relationship between the structure or molecular properties and the blood-to-liver distribution, usually denoted as log Pliver, of VOCs and drugs. Abraham and Weathersby [8] used the Abraham descriptors to estimate values of log Pliver of VOCs. Balaz and Luckacova [9] correlated values of log Pliver for 28 compounds by using four variables. Poulin and Theil [10,11] developed an equation for the prediction of in vivo plasma-to-tissue partition coefficients of drugs. Zhang [12] built a nonlinear model to calculate log Pliver of VOCs. Liu et al. [13] obtained a nonlinear model for predicting the tissue-to-blood partition of organic compounds using a least squares support vector machine. Rodgers et al. [14] achieved equations for the prediction of plasma-water-to-tissue distribution. Zhang and Zhang [15] generated a general training model for predicting in vivo blood-to-liver (among other tissues) distribution of drugs. Abraham et al. [7] applied solvation equations to correlate in vitro blood-to-liver partition coefficients for VOCs and drugs. Martín-Biosca et al. [16] employed biopartitioning micellar chromatography (BMC) for predicting blood-to-tissue partition coefficients of drugs and proposed PLS2 and multiple linear regression (MLR) models based on BMC retention data.
While most of these works make interesting contributions to the study of the log Pliver property, in general their predictive accuracy or chemical interpretability is not good enough for wide use at an industrial scale. In particular, a key issue for data-driven QSPR methodologies is how expert knowledge can be incorporated into the modeling process in order to obtain interpretable predictors. For these reasons, new statistical QSPR models for log Pliver that address these issues are presented in this work. The proposed methodology combines machine learning methods with expert analysis to identify the most relevant molecular descriptors for the definition of the QSPR model. This integration is achieved by means of a careful analysis, in which a reduced number of descriptors selected by data-driven methods are evaluated by experts in terms of their chemical meaning and statistical contribution to a candidate QSPR model. From this semi-automatic analysis a new set of descriptors is chosen, and the associated statistical QSPR model is then obtained. In this way, a double contribution is pursued in this work: first, the design of new log Pliver models with high prediction accuracy and good interpretability; second, the application of our design methodology, which integrates machine learning with human expert knowledge and which we recommend for the prediction of other chemical properties.
The article is structured as follows: in Section 2, the main results obtained from a log Pliver dataset are presented. Section 3 describes the methodological approach applied in our experiments, and it also includes a thorough analysis of the contribution of the descriptors used in our models. Finally, Section 4 discusses the main conclusions of this work.

2. Results and Discussion

2.1. Dataset and Calculation of the Molecular Descriptors

The in vitro blood-to-liver partition coefficients, log Pliver (human/rat), values were taken from Abraham et al. [7]. In this data set there are 122 VOCs among which are hydrocarbons, alkyl halides, alcohols, ethers, esters, ketones, epoxides, nitriles, halobenzenes, polycyclic hydrocarbons and benzene derivatives (Table 1). The values of log Pliver range from −0.56 to 1.17.
A critical step in the development of QSPR models is the computation of the molecular descriptors. The model performance and results are strongly dependent on the way descriptors are calculated. The calculation process of the molecular descriptors was as follows: all VOC structures were drawn using HyperChem 8.0.7 [17]. The molecules were optimized with the same software in order to find energetically stable conformations. The structures were pre-optimized with the Molecular Mechanics Force Field (MM+) procedure. Then, the resulting geometries were further refined by means of the semi-empirical molecular orbital method AM1 (Austin Model 1), using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal/(Å mol). As a next step, the HyperChem output files were used by Dragon 5.5 [18,19] to calculate several classes of descriptors, such as constitutional, geometrical, topological and electrostatic. Finally, constant descriptors (i.e., variables that take the same value for all samples in the dataset) and near-constant descriptors (i.e., variables that take the same value for all but a predetermined small number of samples) were deleted.
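The final filtering step can be sketched as follows. This is a minimal illustration, not the Dragon implementation: the function name, the toy descriptor table and the tolerance `max_deviating` are hypothetical, since the text does not state the threshold that was actually used.

```python
from collections import Counter

def drop_near_constant(table, max_deviating=2):
    """Return the names of the descriptors that survive the filter.

    table: dict mapping descriptor name -> list of values (one per compound).
    max_deviating: hypothetical tolerance; a descriptor counts as constant or
    near constant when at most this many compounds differ from its most
    common value.
    """
    kept = []
    for name, values in table.items():
        most_common_count = Counter(values).most_common(1)[0][1]
        deviating = len(values) - most_common_count
        if deviating > max_deviating:
            kept.append(name)
    return kept

# Toy table with invented values for eight compounds.
descriptors = {
    "const": [1.0] * 8,                # same value everywhere
    "near_const": [0.0] * 7 + [1.0],   # one compound deviates
    "varied": [0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.8, 0.7],
}
kept = drop_near_constant(descriptors, max_deviating=2)
print(kept)
```

With a tolerance of two deviating compounds, only the descriptor that actually varies across the toy dataset is retained.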

2.2. Performance of Our Model

In order to evaluate the prediction capacity of our methodology, two different experiments were carried out in this work. The first experiment reports the performance of our models when tested on one sixth of the dataset (16.6%). When using decision trees, the mean absolute error (MAE) is 0.15 ± 0.04 (“±values” correspond to the confidence intervals calculated at 95% level). The root mean squared error (RMSE) is 0.18 and the coefficient of determination (R2) is 0.73.
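For readers who want to reproduce these metrics, a minimal sketch is given below. It rests on two assumptions that the text does not confirm: the 95% interval on the MAE is taken as a normal approximation (1.96 standard errors of the absolute errors), and R2 is computed as 1 − SS_res/SS_tot. The toy values are invented for illustration.

```python
import math

def regression_metrics(y_true, y_pred, z=1.96):
    """Return (MAE, 95% half-width of the MAE, RMSE, R2)."""
    n = len(y_true)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / n
    # Sample standard deviation of the absolute errors (assumed CI recipe).
    sd = math.sqrt(sum((e - mae) ** 2 for e in abs_err) / (n - 1))
    half_width = z * sd / math.sqrt(n)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, half_width, rmse, r2

# Toy example: every prediction is off by exactly 0.5.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 0.5, 2.5, 2.5]
mae, half_width, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, half_width, rmse, r2)
```

Other conventions exist (for example, R2 as the squared Pearson correlation between observed and predicted values), so reported numbers should be compared only when the same definition is used.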
Figure 1 shows the conditions on the internal nodes of the decision tree and the linear regressions used in the leaves, while Figure 2 plots the prediction for each individual test compound together with the best linear fit of our model. The analysis of the tree structure sheds light on the model used for prediction. The first decision of the tree is based on the value of Se; if it is lower than 16.025, this leads to a leaf with a simple regression using only three of the five available descriptors, namely ALOGP, Mor29u and Se. A structural inspection of the compounds associated with this leaf (Table S3, in supplementary material) shows that most of them have a short carbon chain and halogens, with low log Pliver values. This separation is coherent from a physicochemical point of view: small polar molecules have a higher affinity for blood than longer ones. Another observation is that AMW and Pol have a rather high Pearson correlation (|r| ≈ 0.56) with Se (Table 2), and hence their contributions can be mainly explained by Se.
When the value of Se is greater than 16.025, there are three different linear regressions using the five descriptors. From Table 2, we can see that the correlations of AMW and Pol with Se are much lower than in the left branch, and hence these descriptors now become necessary in the model. Note that all coefficients retain the same sign, indicating that the direction of each descriptor's contribution to the model is always the same; the differences in the coefficients come from producing a better fit to the compounds assigned to a specific leaf. Table 2 also shows that ALOGP is more correlated with Se in the right branch than in the left one. Accordingly, Figure 1 shows a drop in the absolute value of the coefficient assigned to the ALOGP descriptor (0.061 and 0.1146) in the right branch compared to the one in the left branch (0.1729). A more thorough analysis of the physicochemical relevance of the descriptors can be found in Section 3.2.
The neural network ensemble on this same data partition showed slightly lower regression accuracy than the decision tree model: MAE = 0.17 ± 0.04, RMSE = 0.19 and R2 = 0.66. The predictions obtained per compound in this experiment using decision trees and the neural network ensemble can be found in Table 1.
For the training set, we obtained the following metrics using decision trees: MAE = 0.13 ± 0.02, RMSE = 0.17 and R2 = 0.75. Neural network ensemble reported MAE = 0.12 ± 0.02, RMSE = 0.16 and R2 = 0.80.
In our second experiment we evaluated our results by setting aside half of the compounds of the dataset for testing. When using decision trees we obtained the following metrics: MAE = 0.15 ± 0.04, RMSE = 0.21 and R2 = 0.62. The neural network ensemble performed slightly better in terms of RMSE and R2, although not MAE: MAE = 0.16 ± 0.03, RMSE = 0.20 and R2 = 0.66. Figure 3 shows the predicted values for each test compound using the neural network ensemble. The predictions obtained per compound in this experiment using decision trees and the neural network ensemble can also be found in Table 1.
When using decision trees we obtained MAE = 0.17 ± 0.03, RMSE = 0.21 and R2 = 0.62 for the training set. Neural network ensemble on this same partition reported MAE = 0.11 ± 0.03, RMSE = 0.15 and R2 = 0.81. These last results show an improvement over the results published by Abraham et al. [7] as their experiments using the same dataset and the same test set size yielded an RMSE = 0.221 and R2 = 0.481.

3. Computational Methods and Experiments

In order to select the most relevant descriptors, a mixed scheme of automatic selection and expert chemical knowledge was employed. As a first step, a machine learning approach based on cross-validation with in-fold feature selection was applied [20]. This approach consists of splitting the sample set into n folds. The feature selection uses a learning algorithm that is applied to predict each fold by using the samples in the n−1 remaining folds. Since n different sets of features can be selected, a voting scheme is employed, where the most frequently selected descriptors are kept for the final set of relevant descriptors. This technique ensures that particular predictions are not biased by feature over-selection or over-fitting, since each prediction is performed without using the test samples during either feature selection or model building. From these experiments, the most frequently selected descriptors were kept for the initial set of relevant descriptors.
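The in-fold selection with voting can be sketched as follows. Note the substitution: the paper uses a learning algorithm inside each fold, whereas this sketch ranks descriptors by absolute Pearson correlation with the target, purely to keep the example short. The function names, `top_k`, `min_votes` and the toy data are all hypothetical.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def vote_features(X, y, n_folds=5, top_k=2, min_votes=3):
    """Select features inside each fold, then keep those with enough votes.

    X: dict mapping descriptor name -> list of values (one per compound).
    """
    n = len(y)
    votes = Counter()
    for fold in range(n_folds):
        # The held-out fold is never seen by the selection step.
        train = [i for i in range(n) if i % n_folds != fold]
        y_train = [y[i] for i in train]
        scores = {
            name: abs(pearson([col[i] for i in train], y_train))
            for name, col in X.items()
        }
        votes.update(sorted(scores, key=scores.get, reverse=True)[:top_k])
    selected = [name for name, count in votes.items() if count >= min_votes]
    return selected, votes

# Toy data: "a" and "b" track the target, "noise" does not.
y = [float(i) for i in range(10)]
X = {
    "a": [2 * v + 1 for v in y],
    "b": [-v for v in y],
    "noise": [1.0, 0.0] * 5,
}
selected, votes = vote_features(X, y)
print(sorted(selected), dict(votes))
```

The key design point is that the ranking is recomputed from the training folds only, so no held-out compound influences which descriptors receive a vote.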
As a second step, chemical knowledge was employed in order to evaluate the merit of each automatically selected descriptor. Since most of them did not exhibit a clear physicochemical explanation, only a small number of these descriptors were chosen for the final QSPR models, whereas a few other descriptors were incorporated based on chemical expertise. Our methodology is schematized in Figure 4, and detailed explanations of these steps are given in the following subsections.

3.1. Molecular Descriptor Selection

The compounds listed in Section 2.1 were used to calculate 634 molecular descriptors using Dragon [18,19]. The final set of descriptors was chosen by using a combination of a feature selection method and a physicochemical-motivated strategy. The feature selection method that we used here is based on a 5-fold cross-validation with in-fold feature selection over the training set, which selected the following descriptors: RTu+, Mor29u, AMW, ZM2V, Jhetv, PW4, Ss, Ms, Me, Mv, nCIC, AAC, GATS2m, S1K, PW3, EEig07x, IC1, Qindex, RBN, Mor04m, Mor11v, ATS1v and MAXDN (complete names of the descriptors may be found in the E-Dragon web site [19]). After that, the physicochemical-motivated selection was done manually by domain experts, who aimed at including into the model orthogonal aspects of the molecules, so that important and interpretable features are considered and redundancy is kept minimal. These manually-selected descriptors are: AMW, Mor29u, ALOGP, Pol and Se; a brief description of each one is included in Table 3. The first two descriptors were taken from the feature selection algorithm results, and the following three were added on the basis of the experts’ criteria. Physicochemical rationale of this selection is supported in Section 3.2. Although this reduced subset of descriptors decreases the regression accuracy from R2 = 0.79, MAE = 0.13 ± 0.04 and RMSE = 0.15 to R2 = 0.73, MAE = 0.15 ± 0.04 and RMSE = 0.18 in our first experiment when a decision tree model is used, this subset is preferred for its low cardinality and more interpretable set of features. The values of the final pool of descriptors are available in the supplementary file (Table S3).
From the very beginning of our training process we held out a test set of compounds, which is used only once to obtain an unbiased estimate of the performance of our prediction method. We applied this validation strategy in two different sets of experiments. In the first experiment, we kept aside one sixth of the dataset (20 compounds) as a test set, whereas in the second experiment we used half of the compounds in the dataset (61 compounds) for testing. In both cases the compounds selected for testing were chosen by using a stratified selection to ensure that compounds in the training and testing sets are similarly distributed.
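The text does not specify how the stratification was carried out. One common way to stratify on a continuous target, shown here only as a hedged sketch, is to sort the compounds by the target value and draw test compounds at evenly spaced ranks. The function name and the toy values are hypothetical.

```python
def stratified_holdout(y, n_test):
    """Split compound indices into (train, test) so that the test set
    spans the whole range of the target values y."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    n = len(order)
    # One test compound from the middle of each of n_test equal strata.
    test_ranks = {int((j + 0.5) * n / n_test) for j in range(n_test)}
    test = [order[r] for r in sorted(test_ranks)]
    train = [order[r] for r in range(n) if r not in test_ranks]
    return train, test

# Toy target values (invented) for twelve compounds.
y = [0.9, -0.3, 0.1, 0.5, 1.1, -0.5, 0.3, 0.7, -0.1, 0.0, 0.6, 0.2]
train, test = stratified_holdout(y, n_test=3)
print(test)  # indices of a low, a middle and a high target value
```

Applied to 122 compounds with `n_test=20` or `n_test=61`, the same function yields test sets of the sizes used in the two experiments, although not necessarily the same compounds the authors selected.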
Different machine learning methods, such as linear regression, decision trees, neural network ensembles, support vector machines (SVM) and K-nearest neighbours, were applied in this work, out of which decision trees and the neural network ensemble stood out with the highest prediction accuracies for our dataset. All our experiments were run using the Weka data mining toolbox [21]. In particular, the results obtained with the M5p (or M5prime) algorithm [22] and with neural networks are discussed in this paper (Section 2). Details about the characteristics of these methods and their parameterization are given in Section 3.3.

3.2. Physicochemical Relevance of Molecular Descriptors

The aim of this subsection is to analyze the relationship between the molecular descriptors and the target property in order to provide a physicochemical justification of the resulting model. When the interpretation of a QSPR model is consistent with existing theories and knowledge of mechanisms, the model becomes more appealing for cheminformaticians [23]. Although it is not always possible to find a global interpretation, it is desirable to make the effort to find an explanation for the model in a “mechanistic” way [24].
In our dataset, values of log Pliver are consistent with the affinity of the compounds for media of different polarity; e.g., families with non-polar character, such as alkanes (2 to 16 in Table 1), show higher affinity for liver tissue than for blood. The five descriptors chosen for the model provide, to a lesser or greater extent, important information about molecular properties related to the capability of the molecule to distribute between the two media under study: liver tissue and blood. The relationships between descriptor values and log Pliver values are shown in Figure 5. Our analysis is focused on some representative chemical families that have been highlighted in colors in order to illustrate our point graphically (alkanes, alcohols, aromatics and some structurally similar halogenated hydrocarbons).
The descriptor AMW (molecular weight divided by the number of atoms) (Figure 5a) discriminates the molecules according to their atomic composition (type and quantity). Take for example the alkanes (CnH2n+2) and the aromatics (CnHn): both are made up of carbons and hydrogens, and since each family has a different C/H ratio, each presents a characteristic value of AMW, even though the compounds within a family differ slightly. When these families can be segregated from the whole dataset in the graph, the differences in their physicochemical properties become more evident, e.g., their polarity (which is related to the affinity of the molecule for an aqueous or a non-polar medium). This figure also shows that non-polar families such as alkanes tend to have high log Pliver, while polar families such as alcohols present lower log Pliver. The same analysis can be applied to aromatics and halogenated hydrocarbons. Something similar happens with the descriptor Se (Figure 5b), the sum of Sanderson atomic electronegativities (scaled on the carbon atom), which succeeds in discriminating the VOC families.
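A short worked calculation illustrates why AMW separates hydrocarbon families. The sketch below assumes standard atomic weights for C and H and the plain definition given above (molecular weight divided by the number of atoms); the helper names are hypothetical, and the values may differ slightly from those reported by Dragon.

```python
# Standard atomic weights (g/mol), rounded.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008}

def average_molecular_weight(formula):
    """formula: dict mapping element symbol -> atom count."""
    total_mass = sum(ATOMIC_WEIGHT[el] * n for el, n in formula.items())
    n_atoms = sum(formula.values())
    return total_mass / n_atoms

def alkane(n):
    """Molecular formula CnH2n+2."""
    return {"C": n, "H": 2 * n + 2}

hexane = average_molecular_weight(alkane(6))             # C6H14
octane = average_molecular_weight(alkane(8))             # C8H18
benzene = average_molecular_weight({"C": 6, "H": 6})     # C6H6
print(round(hexane, 4), round(octane, 4), round(benzene, 4))
```

The two alkanes land close together (about 4.31 and 4.39), while benzene sits well apart (about 6.51), which is the kind of family-level clustering described for Figure 5a.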
The descriptors Pol (polarity number) and Mor29u (3D-Molecule Representation of Structures based on Electron diffraction, signal 29/unweighted) highlight structural 2D and 3D properties, respectively, and are plotted in Figure 5c,d. Pol relates to the steric properties of molecules and is calculated from the distance matrix as the number of pairs of vertices at a topological distance equal to three (i.e., the number of third neighbors) [25]. In Figure 5c, it can be seen that Pol is low or zero for short carbon chains, whereas it takes higher values (between 4 and 16) for longer structures (e.g., most of the halogenated hydrocarbons and long alkanes, respectively). In other words, Pol is low or equal to zero for compounds with few atoms because they have a small number of third neighbors, and the opposite occurs for long molecules. Therefore, this descriptor works as a specific filter that discriminates molecules by chain length.
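The definition of Pol translates directly into a breadth-first search over the molecular graph. The sketch below assumes a hydrogen-depleted graph given as an adjacency list; the function names are hypothetical and the code is not the Dragon implementation.

```python
from collections import deque

def polarity_number(adjacency):
    """Count pairs of atoms separated by exactly three bonds.

    adjacency: dict mapping atom index -> list of bonded atom indices.
    """
    count = 0
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == 3:
                continue  # no need to explore beyond distance three
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        count += sum(1 for d in dist.values() if d == 3)
    return count // 2  # each pair was counted from both ends

def carbon_chain(n):
    """Hydrogen-depleted graph of a linear alkane with n carbons."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

for n in (3, 4, 6, 10):
    print(n, polarity_number(carbon_chain(n)))
```

For a linear chain of n carbons the count is n − 3 (zero for propane), which matches the observation that Pol grows with chain length.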
Mor29u (3D-MoRSE, signal 29/unweighted) belongs to the 3D-MoRSE (3D-Molecule Representation of Structures based on Electron diffraction) descriptors. They are based on the idea of obtaining information from the 3D atomic coordinates by the transformation used in electron diffraction studies for preparing theoretical scattering curves [26]. 3D-MoRSE descriptors are derived from molecule atom projections along different angles, as in electron diffraction. They represent different views of the whole molecular structure, although their meaning remains unclear [27]. While the influence of Mor29u is not completely clear, its inclusion is based mainly upon a regression-based objective: it was selected by the feature selection method, and in all our experiments the removal of this descriptor from our equations led to a marked drop in training and testing prediction quality. Nevertheless, we can partially analyze its contribution. It can be seen in Figure 5d that Mor29u takes positive and negative values because the original equation includes the term $\sin(s\,r_{ij})/(s\,r_{ij})$ [26], where $s$ measures the scattering angle and $r_{ij}$ represents the interatomic distance between atoms $i$ and $j$. Therefore, the sign of the descriptor alone does not determine its relationship with the target. Another observation from Figure 5d is that the chemical families are not segregated as occurs with AMW and Se (Figure 5a,b). This seems to be coherent because the members of a chemical family share many physicochemical properties (polarity, mobility, hydrogen bonding, etc.) but not the 3D structure. Moreover, from Table S3 in the supplementary material, it can be noted that isomers such as o-, m- and p-xylene, along with several other examples, present different values and are thus differentiated. In brief, Mor29u captures small variations in 3D structural features based on interatomic distances.
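To make the origin of the sign changes concrete, the sketch below evaluates an unweighted 3D-MoRSE-style sum over atom pairs, I(s) = Σ sin(s·rij)/(s·rij). It is an illustration under assumptions, not the Dragon code: atomic weights are set to 1 for the unweighted case, and the choice s = 28 Å⁻¹ for signal 29 assumes the convention that signal k corresponds to s = k − 1. The coordinates are invented.

```python
import math

def morse_signal(coords, s, weights=None):
    """Sum of w_i * w_j * sin(s * r_ij) / (s * r_ij) over all atom pairs.

    coords: list of (x, y, z) tuples in angstroms.
    weights: per-atom weights; all 1.0 gives the unweighted variant.
    """
    n = len(coords)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            x = s * math.dist(coords[i], coords[j])
            term = 1.0 if x == 0 else math.sin(x) / x  # limit at x -> 0
            total += weights[i] * weights[j] * term
    return total

# Two atoms 1.5 angstroms apart (invented geometry).
pair = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
value = morse_signal(pair, s=28.0)
print(value)
```

Because each pair contributes an oscillating term, small changes in interatomic distances can flip the sign of the total, which is consistent with isomers receiving different Mor29u values.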
Finally, ALOGP (Ghose-Crippen octanol-water partition coefficient) gives relevant information about molecular affinity for an octanol-water medium. In fact, ALOGP is a descriptor that commonly appears in models about partition coefficients [9,15]. It is calculated from a model consisting of a regression equation based on the hydrophobicity contribution of 120 atom types [28,29,30]. Each atom in every structure is classified into one of the 120 atom types. Then, an estimated log P value for any compound is given by $\mathrm{ALOGP} = \sum_i n_i a_i$, where $n_i$ is the number of atoms of type $i$ and $a_i$ is the corresponding hydrophobicity constant.
It can be seen in Figure 5e that each VOC has its own ALOGP value regardless of its chemical family. That is, this descriptor is sensitive to small differences in molecular structure. As expected, a correlation can be noted between this descriptor and log Pliver (Figure 5e), because polar molecules have low ALOGP and log Pliver values (e.g., alcohols and halogenated hydrocarbons) and non-polar ones have high values (e.g., alkanes and aromatics).
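The additive form of ALOGP, a sum over atom types of count times hydrophobicity constant, can be sketched in a few lines. The atom-type labels and constants below are placeholders invented for illustration; they are not the Ghose-Crippen parameters.

```python
def additive_logp(atom_type_counts, contributions):
    """Sum of n_i * a_i over the atom types present in the molecule."""
    return sum(n * contributions[t] for t, n in atom_type_counts.items())

# Placeholder constants, NOT the published Ghose-Crippen values.
contributions = {"type_A": 0.5, "type_B": -0.25}
counts = {"type_A": 4, "type_B": 2}
estimate = additive_logp(counts, contributions)
print(estimate)  # 4 * 0.5 + 2 * (-0.25)
```

Since every atom contributes to the sum, any change in composition or atom environment shifts the estimate, which is why each compound receives its own value.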

3.3. Regression Algorithms

Two regression algorithms were applied in this work, namely M5p and an ensemble of neural networks. The decision tree model applied here is M5p [22]. This is an extension of Quinlan's M5 algorithm that allows using decision trees for regression problems, i.e., attributes and target variable can be continuously defined over the set of real numbers. A key aspect of this decision tree algorithm is that it makes use of a linear regression model for each leaf of the tree. It also provides a mechanism for pruning (i.e., keeping the height of the tree minimal to avoid overfitting) and a smoothing process that compensates for discontinuities between adjacent linear models at the leaves of the tree. For our experiments we set the minimum allowed number of compounds per leaf to 4. The neural networks used in our experiments make use of the traditional backpropagation algorithm, which has been used before in the QSPR literature [31]. A total of fifty networks were used to define the ensemble. The architecture of each network is a single hidden layer with three nodes, and all activation functions of the internal nodes of the network are sigmoids. The networks were initialized with different random weights. To facilitate the gradient optimization of the parameters, all descriptors were normalized before training. The learning rate and the momentum were set to 0.3 and 0.2, respectively.
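The ensemble configuration described above can be sketched in plain Python. This is a stand-in for illustration, not the Weka implementation used by the authors: the linear output unit, the averaging of member predictions, the weight initialization range, the number of epochs and the toy data are all assumptions. Only the three sigmoid hidden nodes, the fifty differently seeded networks, the learning rate (0.3) and the momentum (0.2) come from the text.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class TinyMLP:
    """One hidden layer of sigmoid units, linear output, SGD with momentum."""

    def __init__(self, n_inputs, n_hidden=3, seed=0):
        rng = random.Random(seed)
        # Each row holds the input weights plus a trailing bias weight.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.v_hidden = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
        self.v_out = [0.0] * (n_hidden + 1)

    def _forward(self, x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb)))
              for row in self.w_hidden] + [1.0]
        return xb, hb, sum(w * v for w, v in zip(self.w_out, hb))

    def predict(self, x):
        return self._forward(x)[2]

    def fit(self, X, y, epochs=100, lr=0.3, momentum=0.2):
        for _ in range(epochs):
            for x, target in zip(X, y):
                xb, hb, out = self._forward(x)
                err = out - target
                # Backpropagate through the sigmoid hidden units.
                delta = [err * self.w_out[k] * hb[k] * (1.0 - hb[k])
                         for k in range(len(self.w_hidden))]
                for k in range(len(self.w_out)):
                    self.v_out[k] = momentum * self.v_out[k] - lr * err * hb[k]
                    self.w_out[k] += self.v_out[k]
                for k, row in enumerate(self.w_hidden):
                    for m in range(len(row)):
                        self.v_hidden[k][m] = (momentum * self.v_hidden[k][m]
                                               - lr * delta[k] * xb[m])
                        row[m] += self.v_hidden[k][m]
        return self

def fit_ensemble(X, y, n_networks=50, **kwargs):
    # Different seeds give different random initial weights.
    return [TinyMLP(len(X[0]), seed=s).fit(X, y, **kwargs)
            for s in range(n_networks)]

def predict_ensemble(networks, x):
    # Assumed combination rule: plain average of the member predictions.
    return sum(net.predict(x) for net in networks) / len(networks)

# Toy data: one normalized descriptor and a linear target (invented).
X = [[i / 9] for i in range(10)]
y = [0.2 + 0.6 * row[0] for row in X]
networks = fit_ensemble(X, y)
prediction = predict_ensemble(networks, [0.5])
print(prediction)
```

Averaging many small networks that differ only in their initial weights reduces the variance of any single run, which is one plausible reason for using an ensemble on a dataset of this size.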
Neural networks and decision trees constitute very different modeling techniques in nature. On the one hand, neural networks are one of the most popular techniques for QSPR modeling and are able to fit any kind of function, provided there is a sufficient number of hidden nodes. This aspect also makes them prone to overfit the training data very easily (in the absence of any mechanism to thwart overfitting). On the other hand, decision trees are well accepted by lay users, who are able to interpret the meaning of the model very easily. Therefore, a decision on which of these models should be used would be based on how important the understanding of the prediction model is.

4. Conclusions

In this paper we introduced new models for the prediction of blood-to-liver partition coefficients for volatile organic compounds following a QSPR approach. We applied two different machine learning approaches to model log Pliver, namely: decision trees and neural networks. Both models have shown a similar prediction capacity and they significantly outperformed the results obtained by Abraham et al. [7], which is the only work in this area that uses the same compound dataset. To the best of our knowledge this is the largest dataset of VOCs with their associated log Pliver values.
A key aspect of the good performance of our approaches is the careful selection of the descriptors used to build our models. This selection was first done using an automatic feature selection method, which gives a subset of descriptors whose joint application yields good regression accuracy in a non-linear model. However, many of these descriptors were not easily interpretable. Therefore, a new manual selection of descriptors was done by domain experts, aiming at introducing descriptors that model the target property and the differences between the compound families in the dataset. In this way, a smaller and more interpretable subset of descriptors was obtained. While the prediction capacity of this combined subset of descriptors is similar, this smaller subset is preferred as it allows a better understanding of the target property and reduces the likelihood of a chance correlation due to the small size of the dataset.
This semi-automatic approach can also be applied to model other properties and other compounds, as long as statistical methods and expert knowledge are available. Nevertheless, it is important to always be cautious in the use of QSPR approaches. While prediction accuracy on unseen compounds is estimated by the use of a test set, it is hard to assess the prediction accuracy for compounds that fall outside the applicability domain of the model. The applicability domain of a model is usually affected by the training set, the complexity or dimensionality of its representation and the prediction model [32,33]. For these reasons, our model may not perform with the same accuracy for compounds of a different nature from those present in the training set. Yet, the use of strategies that include expert knowledge during the modeling phase leads to more plausible models that are easier to interpret and more likely to generalize to unseen compounds.
Finally, this work contributes reliable techniques to predict a metric related to exposure to chemicals in the environment, which may be applied to risk assessment and decision making in public health policies.

Supplementary Materials

Supplementary materials can be accessed at: https://www.mdpi.com/1420-3049/17/12/14937/s1.

Acknowledgments

This work is kindly supported by grants PGI 24/ZN15 and PGI 24/ZN16 (Universidad Nacional del Sur) and PIP112-2009-0100322 (CONICET - National Research Council of Argentina).

References

  1. Vallero, D. Fundamentals of Air Pollution, 4th ed.; Academic Press: San Diego, CA, USA, 2008. [Google Scholar]
  2. Williams, J.; Koppmann, R. Volatile Organic Compounds in the Atmosphere: An Overview. In Volatile Organic Compounds in the Atmosphere; Koppmann, R., Ed.; Blackwell Publishing Ltd.: Oxford, UK, 2007. [Google Scholar]
  3. Woodruff, T.J.; Burke, T.A.; Zeise, L. The Need for Better Public Health Decisions on Chemicals Released Into Our Environment. Health Aff. 2011, 30, 957–967. [Google Scholar] [CrossRef] [PubMed]
  4. Tronde, A. Pulmonary Drug Absorption. In Vitro and In Vivo Investigations of Drug Absorption Across the Lung Barrier and Its Relation to Drug Physicochemical Properties. Ph.D. Thesis, Uppsala University, Uppsala, Sweden, 2002. [Google Scholar]
  5. Katritzky, A.R.; Kuanar, M.; Fara, D.C.; Karelson, M.; Acree, W.E., Jr.; Solov’ev, V.P.; Varnek, A. QSAR modeling of blood:air and tissue:air partition coefficients using theoretical descriptors. Bioorg. Med. Chem. 2005, 13, 6450–6463. [Google Scholar] [CrossRef] [PubMed]
  6. Dashtbozorgi, Z.; Golmohammadi, H. Prediction of air to liver partition coefficient for volatile organic compounds using QSAR approaches. Eur. J. Med. Chem. 2010, 45, 2182–2190. [Google Scholar] [CrossRef] [PubMed]
  7. Abraham, M.H.; Ibrahim, A.; Acree, W.E., Jr. Air to liver partition coefficients for volatile organic compounds and blood to liver partition coefficients for volatile organic compounds and drugs. Eur. J. Med. Chem. 2007, 42, 743–751. [Google Scholar] [CrossRef] [PubMed]
  8. Abraham, M.H.; Weathersby, P.K. Hydrogen bonding. 30. Solubility of gases and vapors in biological liquids and tissues. J. Pharm. Sci. 1994, 83, 1450–1456. [Google Scholar] [CrossRef] [PubMed]
  9. Balaz, S.; Luckacova, V. A Model-based Dependence of the Human Tissue/Blood Partition Coefficients of Chemicals on Lipophilicity and Tissue Composition. Quant. Struct.-Act. Rel. 1999, 18, 361–368. [Google Scholar] [CrossRef]
  10. Poulin, P.; Theil, F.P. Prediction of pharmacokinetics prior to In Vivo studies. II. Generic physiologically based pharmacokinetic models of drug disposition. J. Pharm. Sci. 2002, 91, 1358–1370.
  11. Poulin, P.; Theil, F.P. A priori prediction of tissue:plasma partition coefficients of drugs to facilitate the use of physiologically-based pharmacokinetic models in drug discovery. J. Pharm. Sci. 2000, 89, 16–35.
  12. Zhang, H. A new nonlinear equation for the tissue/blood partition coefficients of neutral compounds. J. Pharm. Sci. 2004, 93, 1595–1604.
  13. Liu, H.X.; Yao, X.J.; Zhang, R.S.; Liu, M.C.; Hu, Z.D.; Fan, B.T. Prediction of the tissue/blood partition coefficients of organic compounds based on the molecular structure using least-squares support vector machines. J. Comput. Aid. Mol. Des. 2005, 19, 499–508.
  14. Rodgers, T.; Leahy, D.; Rowland, M. Physiologically based pharmacokinetic modeling 1: Predicting the tissue distribution of moderate-to-strong bases. J. Pharm. Sci. 2005, 94, 1259–1276.
  15. Zhang, H.; Zhang, Y. Convenient Nonlinear Model for Predicting the Tissue/Blood Partition Coefficients of Seven Human Tissues of Neutral, Acidic, and Basic Structurally Diverse Compounds. J. Med. Chem. 2006, 49, 5815–5829.
  16. Martín-Biosca, Y.; Torres-Cartas, S.; Villanueva-Camañas, R.M.; Sagrado, S.; Medina-Hernández, M.J. Biopartitioning micellar chromatography to predict blood to lung, blood to liver, blood to fat and blood to skin partition coefficients of drugs. Anal. Chim. Acta 2009, 632, 296–303.
  17. HyperChem™, Molecular Modeling System, Release 8.0.7 for Windows; Hypercube, Inc.: Gainesville, FL, USA, 2009.
  18. DRAGON for Windows (Software for Molecular Descriptor Calculations), Version 5.5; Talete srl: Milan, Italy, 2007.
  19. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. E-Dragon for VCCLAB. Available online: http://michem.disat.unimib.it/chm/Help/edragon/index.html (accessed on 14 November 2012).
  20. Picard, R.; Cook, D. Cross-Validation of Regression Models. J. Am. Stat. Assoc. 1984, 79, 575–583.
  21. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA Data Mining Software: An Update. ACM SIGKDD Explor. Newsl. 2009, 11, 10–18.
  22. Wang, Y.; Witten, I.H. Induction of model trees for predicting continuous classes. Working paper. University of Waikato: Hamilton, New Zealand, 1996. Available online: http://researchcommons.waikato.ac.nz/bitstream/handle/10289/1183/uow-cs-wp-1996-23.pdf?sequence=1 (accessed on 14 November 2012).
  23. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships [(Q)SAR] Models. In OECD Environment Health and Safety Publications. Series on Testing and Assessment. No 69; Chapter 6: Guidance on the Principle of Mechanistic Interpretation; Organisation for Economic Co-operation and Development: Paris, France, 2007; Available online: http://www.oecd.org (accessed on 14 November 2012).
  24. Gramatica, P. Chemometric Methods and Theoretical Molecular Descriptors in Predictive QSAR Modeling of the Environmental Behavior of Organic Pollutants. In Recent Advances in QSAR Studies: Methods and Applications; Puzyn, T., Leszczynski, J., Cronin, M.T.D., Eds.; Springer: Dordrecht, The Netherlands, 2010; pp. 327–366.
  25. Platt, J. Influence of Neighbor Bonds on Additive Bond Properties in Paraffins. J. Chem. Phys. 1947, 22, 1448–1455.
  26. Schuur, J.; Selzer, P.; Gasteiger, J. The Coding of the Three-Dimensional Structure of Molecules by Molecular Transforms and Its Application to Structure-Spectra Correlations and Studies of Biological Activity. J. Chem. Inf. Comput. Sci. 1996, 36, 334–344.
  27. Saíz-Urra, L.; Pérez González, M.; Teijeira, M. QSAR studies about cytotoxicity of benzophenazines with dual inhibition toward both topoisomerases I and II: 3D-MoRSE descriptors and statistical considerations about variable selection. Bioorg. Med. Chem. 2006, 14, 7347–7358.
  28. Viswanadhan, V.; Ghose, A.; Revankar, G.; Robins, R. Atomic physicochemical parameters for three dimensional structure directed quantitative structure-activity relationships. 4. Additional parameters for hydrophobic and dispersive interactions and their application for an automated superposition of certain naturally occurring nucleoside antibiotics. J. Chem. Inf. Comput. Sci. 1989, 29, 163–172.
  29. Viswanadhan, V.; Reddy, M.; Bacquet, R.; Erion, M. Assessment of Methods Used for Predicting Lipophilicity: Application to Nucleosides and Nucleoside Bases. J. Comput. Chem. 1993, 14, 1019–1026.
  30. Ghose, A.; Viswanadhan, V.; Wendoloski, J. Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragmental Methods: An Analysis of ALOGP and CLOGP Methods. J. Phys. Chem. A 1998, 102, 3762–3772.
  31. Niculescu, S.P. Artificial Neural Networks and Genetic Algorithms in QSAR. J. Mol. Struc.: Theochem 2003, 622, 71–83.
  32. Soto, A.J.; Vazquez, G.E.; Strickert, M.; Ponzoni, I. Target-Driven Subspace Mapping Methods and Their Applicability Domain Estimation. Mol. Inf. 2011, 30, 779–789.
  33. Horvath, D.; Marcou, G.; Varnek, A. Predicting the predictability: A unified approach to the applicability domain problem of QSAR models. J. Chem. Inf. Model. 2009, 49, 1762–1776.
Sample Availability: Samples in .mol file format are available upon request from the authors.
Figure 1. Decision tree model obtained with the M5P algorithm after holding out 16.6% of the compounds for testing.
Figure 2. Target vs. predicted values for the 16.6% of compounds held out for testing, using the decision tree depicted in Figure 1.
Figure 3. Target vs. predicted values for the 50.0% of compounds held out for testing, using the neural network ensemble.
Figure 4. Combined methodology scheme proposed for the QSPR model development.
Figure 5. Plots of descriptor values vs. log Pliver values for the complete dataset. Some chemical families have been highlighted according to the color coding presented in Table 1. (a) AMW; (b) Se; (c) Pol; (d) Mor29u and (e) ALOGP.
Table 1. Dataset of in vitro blood-to-liver partition coefficients for 122 volatile organic compounds [7]. The color coding used in the figures in Section 3.2 is detailed here as follows: alkanes (blue/b), alcohols (green/g), aromatics (orange/o), some halogenated hydrocarbons (red/r) and the remaining compounds (white/w). Predicted values for decision trees (DT) and neural networks ensemble (NNE) using one sixth of the dataset (Experiment 1, Ntest = 20, Ntrain = 102) and half of the dataset (Experiment 2, Ntest = Ntrain = 61) are reported. Trn or Tst denotes whether the compound was part of the training or test set respectively.

| No. | Color | Compound | Log Pliver | Exp. 1 Set | Exp. 1 DT | Exp. 1 NNE | Exp. 2 Set | Exp. 2 DT | Exp. 2 NNE |
|---|---|---|---|---|---|---|---|---|---|
| 1 | w | Nitrous oxide | −0.04 | Trn | −0.101 | −0.031 | Tst | −0.210 | −0.028 |
| 2 | b | Pentane | 0.61 | Trn | 0.438 | 0.465 | Tst | 0.356 | 0.293 |
| 3 | b | Hexane | 0.48 | Trn | 0.507 | 0.567 | Trn | 0.475 | 0.460 |
| 4 | b | Heptane | 0.46 | Tst | 0.568 | 0.621 | Trn | 0.577 | 0.595 |
| 5 | b | Octane | 0.73 | Trn | 0.684 | 0.680 | Tst | 0.687 | 0.706 |
| 6 | b | Nonane | 0.50 | Tst | 0.762 | 0.746 | Trn | 0.801 | 0.786 |
| 7 | b | Decane | 0.85 | Trn | 0.862 | 0.843 | Tst | 0.946 | 0.883 |
| 8 | b | 2-Methylpentane | 1.04 | Trn | 0.789 | 0.857 | Trn | 0.702 | 0.859 |
| 9 | b | 3-Methylpentane | 1.06 | Tst | 0.814 | 0.898 | Trn | 0.789 | 0.916 |
| 10 | b | 3-Methylhexane | 0.93 | Trn | 0.863 | 0.910 | Tst | 0.845 | 0.947 |
| 11 | b | 2-Methylheptane | 0.52 | Trn | 0.713 | 0.751 | Tst | 0.724 | 0.800 |
| 12 | b | 2-Methyloctane | 0.74 | Trn | 0.789 | 0.807 | Tst | 0.835 | 0.864 |
| 13 | b | 2-Methylnonane | 0.76 | Trn | 0.786 | 0.733 | Tst | 0.836 | 0.799 |
| 14 | b | 2,2-Dimethylbutane | 1.13 | Trn | 0.719 | 0.741 | Tst | 0.594 | 0.731 |
| 15 | b | 2,2,4-Trimethylpentane | 0.80 | Tst | 0.579 | 0.556 | Trn | 0.528 | 0.602 |
| 16 | b | 2,3,4-Trimethylpentane | 0.70 | Trn | 0.902 | 0.886 | Tst | 1.007 | 0.976 |
| 17 | w | Cyclopropane | 0.02 | Trn | 0.118 | 0.075 | Tst | 0.082 | 0.213 |
| 18 | w | Methylcyclopentane | 0.96 | Trn | 0.888 | 0.905 | Trn | 0.906 | 1.013 |
| 19 | w | Cyclohexane | 0.88 | Trn | 0.843 | 0.918 | Tst | 0.828 | 0.944 |
| 20 | w | Methylcyclohexane | 0.71 | Tst | 0.830 | 0.892 | Trn | 0.826 | 0.922 |
| 21 | w | 1,2-Dimethylcyclohexane | 1.17 | Trn | 0.957 | 0.949 | Trn | 1.136 | 1.099 |
| 22 | w | 1,2,4-Trimethylcyclohexane | 0.86 | Trn | 0.886 | 0.820 | Trn | 1.020 | 0.932 |
| 23 | w | tert-Butylcyclohexane | 0.30 | Trn | 0.529 | 0.447 | Trn | 0.608 | 0.407 |
| 24 | w | JP-10 | 0.98 | Trn | 1.083 | 0.972 | Tst | 1.452 | 1.400 |
| 25 | w | Ethene | 0.24 | Tst | 0.044 | 0.084 | Trn | 0.039 | 0.226 |
| 26 | w | Propene | −0.07 | Trn | 0.123 | 0.092 | Tst | 0.148 | 0.228 |
| 27 | w | 1-Octene | 0.80 | Trn | 0.687 | 0.719 | Tst | 0.693 | 0.732 |
| 28 | w | 1-Nonene | 0.93 | Tst | 0.901 | 0.815 | Trn | 0.833 | 0.854 |
| 29 | w | 1-Decene | 1.06 | Trn | 0.981 | 0.871 | Tst | 0.952 | 0.915 |
| 30 | w | 1,3-Butadiene | −0.26 | Trn | 0.143 | 0.083 | Tst | 0.213 | 0.199 |
| 31 | w | 2-Methyl-1,3-butadiene | 0.32 | Trn | 0.244 | 0.321 | Tst | 0.440 | 0.318 |
| 32 | w | Difluoromethane | 0.24 | Trn | −0.068 | −0.053 | Tst | −0.251 | −0.004 |
| 33 | w | Chloromethane | 0.23 | Trn | 0.013 | 0.058 | Trn | −0.072 | 0.140 |
| 34 | r | Dichloromethane | −0.11 | Trn | 0.059 | 0.057 | Trn | 0.002 | 0.073 |
| 35 | r | Chloroform | 0.13 | Trn | 0.167 | 0.141 | Tst | 0.165 | 0.099 |
| 36 | w | Carbon tetrachloride | 0.53 | Trn | 0.520 | 0.607 | Trn | 0.472 | 0.517 |
| 37 | w | Chloroethane | 0.07 | Trn | 0.085 | 0.035 | Tst | 0.049 | 0.162 |
| 38 | w | 1,1-Dichloroethane | 0.15 | Trn | 0.106 | 0.031 | Tst | 0.140 | 0.133 |
| 39 | w | 1,2-Dichloroethane | 0.16 | Trn | 0.155 | 0.078 | Trn | 0.194 | 0.164 |
| 40 | r | 1,1,1-Trichloroethane | 0.44 | Trn | 0.261 | 0.238 | Trn | 0.374 | 0.287 |
| 41 | r | 1,1,2-Trichloroethane | 0.19 | Trn | 0.230 | 0.145 | Tst | 0.231 | 0.165 |
| 42 | r | 1,1,1,2-Tetrachloroethane | 0.40 | Trn | 0.333 | 0.288 | Tst | 0.421 | 0.252 |
| 43 | r | 1,1,2,2-Tetrachloroethane | 0.16 | Trn | 0.318 | 0.262 | Tst | 0.360 | 0.221 |
| 44 | w | Pentachloroethane | 0.39 | Tst | 0.406 | 0.447 | Trn | 0.435 | 0.415 |
| 45 | w | Hexachloroethane | 0.81 | Trn | 0.475 | 0.714 | Trn | 0.368 | 0.835 |
| 46 | w | 1-Chloropropane | 0.12 | Trn | 0.201 | 0.148 | Trn | 0.292 | 0.227 |
| 47 | w | 2-Chloropropane | 0.18 | Trn | 0.163 | 0.083 | Tst | 0.171 | 0.191 |
| 48 | w | 1,2-Dichloropropane | 0.25 | Trn | 0.220 | 0.112 | Trn | 0.223 | 0.165 |
| 49 | w | Dibromomethane | −0.04 | Trn | 0.123 | 0.069 | Trn | −0.025 | 0.039 |
| 50 | r | 1,2-Dibromoethane | 0.00 | Trn | 0.215 | 0.107 | Trn | 0.308 | 0.160 |
| 51 | w | 1-Bromopropane | −0.06 | Trn | 0.221 | 0.129 | Tst | 0.266 | 0.198 |
| 52 | w | 2-Bromopropane | 0.00 | Trn | 0.201 | 0.138 | Tst | 0.293 | 0.238 |
| 53 | w | Fluorochloromethane | −0.17 | Trn | −0.002 | −0.002 | Tst | −0.105 | 0.040 |
| 54 | w | Bromochloromethane | 0.26 | Trn | 0.092 | 0.055 | Tst | −0.004 | −0.007 |
| 55 | w | Bromodichloromethane | 0.00 | Trn | 0.195 | 0.144 | Tst | 0.136 | 0.085 |
| 56 | w | Chlorodibromomethane | 0.22 | Trn | 0.224 | 0.180 | Tst | 0.104 | 0.264 |
| 57 | r | 1,1-Dichloro-1-fluoroethane | 0.20 | Trn | 0.214 | 0.154 | Trn | 0.257 | 0.211 |
| 58 | r | 1-Bromo-2-chloroethane | 0.03 | Trn | 0.186 | 0.095 | Trn | 0.261 | 0.150 |
| 59 | r | 2-Chloro-1,1,1-trifluoroethane | 0.17 | Trn | 0.202 | 0.089 | Trn | 0.131 | 0.126 |
| 60 | r | 2,2-Dichloro-1,1,1-trifluoroethane | 0.06 | Trn | 0.288 | 0.223 | Trn | 0.246 | 0.225 |
| 61 | w | 1,1-Difluoroethene | 0.64 | Trn | 0.075 | 0.051 | Trn | 0.073 | 0.155 |
| 62 | w | Chloroethene | 0.03 | Trn | 0.057 | 0.058 | Tst | 0.070 | 0.162 |
| 63 | r | 1,1-Dichloroethene | −0.05 | Tst | 0.184 | 0.189 | Trn | 0.212 | 0.207 |
| 64 | r | cis-1,2-Dichloroethene | 0.02 | Trn | 0.064 | 0.009 | Tst | 0.057 | 0.064 |
| 65 | r | trans-1,2-Dichloroethene | 0.07 | Trn | 0.078 | 0.030 | Trn | 0.168 | 0.102 |
| 66 | r | Trichloroethene | 0.27 | Trn | 0.191 | 0.133 | Trn | 0.198 | 0.123 |
| 67 | w | Tetrachloroethene | 0.66 | Trn | 0.310 | 0.320 | Tst | 0.268 | 0.264 |
| 68 | r | Bromoethene | 0.03 | Tst | 0.067 | 0.029 | Trn | 0.046 | 0.056 |
| 69 | r | 1-Chloro-2,2-difluoroethene | −0.02 | Tst | 0.090 | 0.001 | Trn | 0.120 | 0.070 |
| 70 | w | 1,2-Epoxy-3-butene | −0.23 | Trn | −0.018 | −0.078 | Trn | 0.076 | −0.008 |
| 71 | g | 1-Propanol | 0.05 | Trn | −0.020 | −0.047 | Tst | 0.059 | 0.001 |
| 72 | g | 2-Propanol | −0.03 | Trn | −0.048 | −0.042 | Trn | 0.020 | −0.007 |
| 73 | g | 1-Butanol | 0.02 | Tst | 0.073 | 0.115 | Trn | 0.207 | 0.114 |
| 74 | g | 2-Methyl-1-propanol | 0.02 | Trn | 0.026 | −0.053 | Tst | 0.011 | −0.050 |
| 75 | g | tert-Butanol | 0.01 | Trn | −0.002 | 0.100 | Trn | 0.118 | 0.098 |
| 76 | g | 1-Pentanol | 0.41 | Trn | 0.398 | 0.330 | Trn | 0.285 | 0.291 |
| 77 | g | 3-Methyl-1-butanol | 0.22 | Tst | 0.408 | 0.362 | Trn | 0.320 | 0.388 |
| 78 | g | tert-Amyl alcohol | 0.09 | Tst | 0.379 | 0.290 | Trn | 0.255 | 0.280 |
| 79 | w | Acetone | 0.02 | Trn | −0.148 | −0.018 | Trn | 0.008 | −0.029 |
| 80 | w | Butanone | 0.12 | Trn | −0.024 | 0.024 | Tst | 0.134 | 0.023 |
| 81 | w | 2-Pentanone | 0.13 | Trn | 0.054 | 0.093 | Trn | 0.168 | 0.081 |
| 82 | w | 4-Methyl-2-pentanone | 0.23 | Trn | 0.426 | 0.433 | Tst | 0.368 | 0.551 |
| 83 | w | 2-Heptanone | 0.30 | Trn | 0.436 | 0.483 | Tst | 0.343 | 0.455 |
| 84 | w | Methyl acetate | −0.03 | Trn | −0.118 | −0.166 | Tst | −0.089 | −0.147 |
| 85 | w | Ethyl acetate | 0.13 | Trn | −0.036 | −0.002 | Tst | 0.102 | −0.012 |
| 86 | w | Propyl acetate | 0.48 | Trn | 0.372 | 0.215 | Trn | 0.240 | 0.248 |
| 87 | w | Isopropyl acetate | 0.62 | Tst | 0.485 | 0.318 | Trn | 0.345 | 0.499 |
| 88 | w | Butyl acetate | 0.51 | Trn | 0.436 | 0.478 | Trn | 0.364 | 0.524 |
| 89 | w | Isobutyl acetate | 0.73 | Trn | 0.431 | 0.470 | Trn | 0.357 | 0.540 |
| 90 | w | Pentyl acetate | 0.66 | Trn | 0.527 | 0.572 | Trn | 0.428 | 0.617 |
| 91 | w | Isopentyl acetate | 0.76 | Trn | 0.592 | 0.642 | Trn | 0.504 | 0.762 |
| 92 | w | Diethyl ether | −0.17 | Tst | 0.043 | 0.183 | Trn | 0.251 | 0.195 |
| 93 | w | tert-Butyl methyl ether | 0.17 | Trn | 0.345 | 0.201 | Tst | 0.175 | 0.161 |
| 94 | w | tert-Butyl ethyl ether | 0.45 | Trn | 0.624 | 0.634 | Tst | 0.575 | 0.869 |
| 95 | w | tert-Amyl methyl ether | 0.28 | Tst | 0.405 | 0.492 | Trn | 0.380 | 0.470 |
| 96 | w | Divinyl ether | 0.07 | Trn | −0.072 | −0.100 | Tst | 0.026 | −0.059 |
| 97 | w | Ethylene oxide | −0.07 | Trn | −0.146 | −0.128 | Trn | −0.108 | −0.042 |
| 98 | w | Cyanoethylene oxide | −0.56 | Trn | −0.158 | −0.205 | Tst | −0.168 | −0.157 |
| 99 | w | Halothane | 0.29 | Trn | 0.323 | 0.290 | Trn | 0.262 | 0.298 |
| 100 | w | Teflurane | 0.23 | Trn | 0.261 | 0.194 | Tst | 0.147 | 0.220 |
| 101 | w | Fluroxene | 0.18 | Tst | 0.081 | −0.042 | Trn | 0.057 | 0.019 |
| 102 | w | Enflurane | 0.27 | Trn | 0.356 | 0.327 | Tst | 0.300 | 0.357 |
| 103 | w | Isoflurane | 0.36 | Trn | 0.314 | 0.317 | Trn | 0.139 | 0.320 |
| 104 | w | Sevoflurane | 0.63 | Trn | 0.393 | 0.354 | Tst | 0.281 | 0.413 |
| 105 | w | Methoxyflurane | 0.19 | Trn | 0.247 | 0.234 | Trn | 0.105 | 0.241 |
| 106 | w | 1-Nitropropane | −0.13 | Trn | 0.084 | −0.052 | Tst | 0.056 | −0.005 |
| 107 | w | 2-Nitropropane | −0.43 | Trn | 0.056 | −0.088 | Trn | 0.015 | −0.033 |
| 108 | w | Carbon disulfide | 0.48 | Trn | 0.151 | 0.298 | Tst | 0.012 | 0.142 |
| 109 | o | Benzene | 0.21 | Trn | 0.182 | 0.108 | Tst | −0.005 | 0.121 |
| 110 | o | Toluene | 0.50 | Trn | 0.279 | 0.275 | Tst | 0.137 | 0.194 |
| 111 | o | Ethylbenzene | 0.31 | Tst | 0.310 | 0.329 | Trn | 0.157 | 0.276 |
| 112 | o | o-Xylene | 0.34 | Trn | 0.399 | 0.364 | Tst | 0.428 | 0.281 |
| 113 | o | m-Xylene | 0.37 | Trn | 0.306 | 0.334 | Tst | 0.144 | 0.281 |
| 114 | o | p-Xylene | 0.34 | Trn | 0.406 | 0.367 | Trn | 0.391 | 0.285 |
| 115 | o | 1,2,4-Trimethylbenzene | 0.43 | Trn | 0.414 | 0.370 | Tst | 0.481 | 0.288 |
| 116 | o | tert-Butylbenzene | 0.49 | Trn | 0.434 | 0.368 | Tst | 0.492 | 0.276 |
| 117 | o | Styrene | 0.47 | Trn | 0.314 | 0.298 | Tst | 0.329 | 0.246 |
| 118 | o | m-Methylstyrene | 0.23 | Trn | 0.364 | 0.326 | Trn | 0.341 | 0.272 |
| 119 | o | p-Methylstyrene | 0.14 | Tst | 0.441 | 0.409 | Trn | 0.533 | 0.298 |
| 120 | w | Chlorobenzene | 0.31 | Trn | 0.318 | 0.316 | Tst | 0.232 | 0.225 |
| 121 | w | 4-Chlorobenzotrifluoride | 0.28 | Trn | 0.519 | 0.418 | Tst | 0.572 | 0.443 |
| 122 | w | Furan | −0.05 | Trn | 0.033 | −0.053 | Tst | −0.038 | 0.067 |

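
As a rough illustration of how the test-set columns of Table 1 can be summarized, the sketch below computes the mean absolute error (MAE) and root-mean-square error (RMSE) of the DT and NNE predictions for the 20 compounds marked Tst in Experiment 1. The values are transcribed from Table 1; the choice of MAE and RMSE and the names `EXPERIMENT_1_TEST`, `mae` and `rmse` are assumptions made for this example, not necessarily the statistics or code used by the authors.

```python
import math

# (observed log Pliver, DT prediction, NNE prediction) for the 20 compounds
# marked "Tst" under Experiment 1 in Table 1. Variable and function names
# are illustrative only.
EXPERIMENT_1_TEST = [
    (0.46, 0.568, 0.621),    # Heptane
    (0.50, 0.762, 0.746),    # Nonane
    (1.06, 0.814, 0.898),    # 3-Methylpentane
    (0.80, 0.579, 0.556),    # 2,2,4-Trimethylpentane
    (0.71, 0.830, 0.892),    # Methylcyclohexane
    (0.24, 0.044, 0.084),    # Ethene
    (0.93, 0.901, 0.815),    # 1-Nonene
    (0.39, 0.406, 0.447),    # Pentachloroethane
    (-0.05, 0.184, 0.189),   # 1,1-Dichloroethene
    (0.03, 0.067, 0.029),    # Bromoethene
    (-0.02, 0.090, 0.001),   # 1-Chloro-2,2-difluoroethene
    (0.02, 0.073, 0.115),    # 1-Butanol
    (0.22, 0.408, 0.362),    # 3-Methyl-1-butanol
    (0.09, 0.379, 0.290),    # tert-Amyl alcohol
    (0.62, 0.485, 0.318),    # Isopropyl acetate
    (-0.17, 0.043, 0.183),   # Diethyl ether
    (0.28, 0.405, 0.492),    # tert-Amyl methyl ether
    (0.18, 0.081, -0.042),   # Fluroxene
    (0.31, 0.310, 0.329),    # Ethylbenzene
    (0.14, 0.441, 0.409),    # p-Methylstyrene
]


def mae(pairs):
    """Mean absolute error over (observed, predicted) pairs."""
    return sum(abs(obs - pred) for obs, pred in pairs) / len(pairs)


def rmse(pairs):
    """Root-mean-square error over (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))


dt_pairs = [(obs, dt) for obs, dt, _ in EXPERIMENT_1_TEST]
nne_pairs = [(obs, nne) for obs, _, nne in EXPERIMENT_1_TEST]

print(f"DT : MAE = {mae(dt_pairs):.3f}, RMSE = {rmse(dt_pairs):.3f}")
print(f"NNE: MAE = {mae(nne_pairs):.3f}, RMSE = {rmse(nne_pairs):.3f}")
```

The same two functions can be applied to the Experiment 2 columns by collecting the rows marked Tst there instead.
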
Table 2. Correlation coefficient of Se vs. AMW, Pol, ALOGP and Mor29u.

| Descriptor | r (Se vs. descriptor), Se ≤ 16.025 | r (Se vs. descriptor), Se > 16.025 |
|---|---|---|
| AMW | −0.55 | −0.33 |
| Pol | 0.57 | 0.32 |
| ALOGP | 0.12 | 0.75 |
| Mor29u | 0.09 | −0.15 |

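
A minimal sketch of how correlation coefficients split at the Se threshold, like those in Table 2, could be computed is given below. It assumes that r is the Pearson correlation coefficient and that the dataset is partitioned at Se = 16.025 as in the column headers. The descriptor values are toy numbers invented for this example, not the paper's data, and the names `pearson_r` and `split_correlation` are hypothetical.

```python
import math

SE_THRESHOLD = 16.025  # split value taken from the column headers of Table 2


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_correlation(se_values, other_values, threshold=SE_THRESHOLD):
    """Return (r_low, r_high): r for Se <= threshold and for Se > threshold."""
    pairs = list(zip(se_values, other_values))
    low = [(s, o) for s, o in pairs if s <= threshold]
    high = [(s, o) for s, o in pairs if s > threshold]
    r_low = pearson_r([s for s, _ in low], [o for _, o in low])
    r_high = pearson_r([s for s, _ in high], [o for _, o in high])
    return r_low, r_high


# Toy values, NOT taken from the paper's dataset.
toy_se = [5.0, 8.0, 12.0, 15.0, 17.0, 20.0, 24.0, 30.0]
toy_alogp = [0.5, 0.4, 0.9, 0.6, 1.0, 1.6, 2.1, 3.0]

r_low, r_high = split_correlation(toy_se, toy_alogp)
print(f"r (Se <= {SE_THRESHOLD}) = {r_low:.2f}")
print(f"r (Se >  {SE_THRESHOLD}) = {r_high:.2f}")
```

Computing r separately on each side of the threshold shows whether a descriptor's relationship with Se changes between the two subsets, which is how the two columns of Table 2 are meant to be read.
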
Table 3. Final set of selected descriptors.

| Descriptor | Meaning | Family |
|---|---|---|
| AMW | average molecular weight | Constitutional |
| Mor29u | 3D-MoRSE - signal 29/unweighted | 3D-MoRSE |
| ALOGP | Ghose-Crippen octanol-water partition coeff. (logP) | Molecular properties |
| Pol | polarity number | Topological |
| Se | sum of atomic Sanderson electronegativities | Constitutional |

Share and Cite

MDPI and ACS Style

Palomba, D.; Martínez, M.J.; Ponzoni, I.; Díaz, M.F.; Vazquez, G.E.; Soto, A.J. QSPR Models for Predicting Log Pliver Values for Volatile Organic Compounds Combining Statistical Methods and Domain Knowledge. Molecules 2012, 17, 14937-14953. https://doi.org/10.3390/molecules171214937
