Estimating the Octanol/Water Partition Coefficient for Aliphatic Organic Compounds Using Semi-Empirical Electrotopological Index

A new possibility for estimating the octanol/water partition coefficient (log P) was investigated using only one descriptor, the semi-empirical electrotopological index (ISET). The predictive ability of four log P calculation models was compared using a set of 131 aliphatic organic compounds from five different classes. Log P values were calculated with atomic-contribution methods (the Ghose/Crippen approach and its later refinement, AlogP), with a fragmental method (ClogP), and with a whole-molecule approach based on topological indices (MlogP). The efficiency and applicability of ISET for calculating log P were demonstrated through good statistical quality (r > 0.99; s < 0.18), high internal stability and good predictive ability for an external group of compounds, of the same order as the widely used models based on the fragmental method, ClogP, and the atomic-contribution method, AlogP, which are among the most used methods of predicting log P.


Introduction
The logarithm of the 1-octanol/water partition coefficient (log P) of a compound, which is a measure of hydrophobicity, is widely used in Quantitative Structure-Activity Relationship (QSAR) models for predicting the pharmaceutical properties of molecules [1][2][3][4][5][6][7]. In medicinal chemistry there is continued interest in developing methods of deriving log P from molecular structure. From the experimental point of view, the equilibrium methods for determining partition coefficients are difficult or, in some cases, impossible to apply, as in the case of unstable compounds or in the presence of impurities. Other difficulties are associated with the formation of stable emulsions after shaking and with compounds that have a strong preference for one of the phases of the system. Thus, the agreement between the theoretical and experimental approaches to the determination of partition coefficients continues to be a focus of scientific interest [8]. Despite the huge amount of experimental log P data available for organic structures, it is still insufficient compared with the number of compounds for which log P is of interest [5]. The first method of calculating log P was the π-system, developed by Hansch and Fujita [9,10]. Several different methods for calculating log P from chemical structure have in common that molecules are cut into groups or atoms, and the fragmental or single-atom contributions are summed to give the final log P value.
The most widely used method for calculating log P is the fragmental method [11], which is based on the additive-constitutive character of log P. In the atomic-contribution method [12] the atom type is used instead of a fragment. This approach was developed in an effort to attribute properties to an atom within a molecular structure and, unlike the fragmental methods, most atomic-contribution methods do not use correction factors. The more recent approaches consider the molecule as a whole. These models attempt to make theoretical estimations of log P using graph-theoretical descriptors, molecular properties or quantum-chemical descriptors, and some of them incorporate the effects of the three-dimensional structure and the electronic properties of the molecule [13][14][15][16][17][18][19][20][21][22]. Several researchers have compared the predictive ability of log P calculation models. A review comparing log P calculations obtained from different models was published by Mannhold and van de Waterbeemd in 2001 [5].
Recently, a new topological index, called the semi-empirical electrotopological index (ISET), was developed by our research group in order to obtain a molecular descriptor that is not directly related to chromatographic retention indices (RI) but is based on values calculated by quantum mechanics, for use in Quantitative Structure-Property Relationships (QSPR) for different classes of organic compounds. This new approach takes into account the charges of the heteroatoms and of the carbon atoms attached to them through the definition of an equivalent local dipole moment [23][24][25][26].
The main goal of this study is to compare the predictive power of four log P calculation models and ISET for a set of 131 aliphatic organic compounds from five different classes. The models are validated internally using the cross-validation coefficient, r²cv, and externally using seven aliphatic alcohols with experimental log P values that were not included in the training sets for each model.

Methods
The QSPR study of these aliphatic organic compounds comprised the selection of the data set, the generation of molecular descriptors, statistical analysis by simple linear regression and model validation. The model applicability was further examined by plotting predicted data against experimental data for all of the compounds. All regression analyses were carried out using the Origin [27] and TSAR [28] programs. The statistical parameters used to test the prediction efficiency of the models obtained were the correlation coefficient (r), standard deviation (s), coefficient of determination (r²) and null hypothesis test (F-test). The validity of the model was tested with the cross-validation coefficient (r²cv) using the "leave-one-out" procedure in the software program TSAR 3.3 for Windows [28]. A group of seven compounds, not included in the original QSPR models, was employed for the external validation.
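The statistical parameters listed above can be illustrated with a minimal sketch. It assumes the usual textbook definitions for a one-descriptor least-squares fit (s with n − 2 degrees of freedom, leave-one-out r²cv computed as 1 − PRESS/SStot); it is not the Origin or TSAR implementation, and the function names and the small data set are hypothetical.

```python
import numpy as np

def simple_linear_regression_stats(x, y):
    """Fit y = a + b*x by least squares and return r, r2, s and F."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    b, a = np.polyfit(x, y, 1)               # slope, intercept
    y_hat = a + b * x
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    r = np.sign(b) * np.sqrt(r2)
    s = np.sqrt(ss_res / (n - 2))            # two fitted parameters
    f = (ss_tot - ss_res) / (ss_res / (n - 2))  # F with (1, n-2) d.o.f.
    return {"slope": b, "intercept": a, "r": r, "r2": r2, "s": s, "F": f}

def loo_r2cv(x, y):
    """Leave-one-out cross-validated r2: refit without each point in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0
    for i in range(x.size):
        keep = np.arange(x.size) != i
        b, a = np.polyfit(x[keep], y[keep], 1)
        press += (y[i] - (a + b * x[i])) ** 2   # prediction error for the left-out point
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Hypothetical descriptor values and property values, for illustration only
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
log_p = [1.1, 1.9, 3.2, 3.9, 5.1]
stats = simple_linear_regression_stats(descriptor, log_p)
q2 = loo_r2cv(descriptor, log_p)
print(stats, q2)
```

Because each left-out point is predicted by a model that never saw it, r²cv is never larger than r² for the same data, which is why it is reported as a measure of internal stability.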

Data Set and Calculation Models
The experimental log P values for the organic compound groups studied herein were taken from the literature [6,7]. Theoretical values of log P for 131 aliphatic organic compounds were obtained using four log P calculation models. Log P calculation methods can be roughly divided into two major classes: substructure approaches, which have in common that molecules are cut into groups (fragmental methods) or atoms (atomic-contribution methods); and whole-molecule approaches (property-based models), which consider the entire molecule using molecular lipophilicity potentials, topological indices or molecular properties. Atomic-contribution methods do not usually require correction factors. The almost identical methodological background of the fragmental and atomic-contribution methods indicates their interchangeability.
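The additive idea shared by the substructure approaches can be sketched as follows. This is only an illustration of summing per-fragment (or per-atom-type) contributions plus optional correction terms; the function name is hypothetical and the contribution values are placeholders, not the parameters of ClogP, AlogP, the Ghose/Crippen scheme or any other published method.

```python
def additive_log_p(fragment_counts, contributions, corrections=()):
    """Sum count * contribution over fragments (or atom types),
    then add any correction factors (fragmental methods only)."""
    total = sum(count * contributions[name]
                for name, count in fragment_counts.items())
    return total + sum(corrections)

# Placeholder contributions chosen for illustration; NOT published parameters
HYPOTHETICAL_CONTRIBUTIONS = {"CH3": 0.9, "CH2": 0.5, "OH": -1.5}

# 1-butanol decomposed as CH3-(CH2)3-OH
butanol = {"CH3": 1, "CH2": 3, "OH": 1}
estimate = additive_log_p(butanol, HYPOTHETICAL_CONTRIBUTIONS)
print(round(estimate, 2))
```

An atomic-contribution scheme has the same form with atom types as keys and an empty `corrections` argument, which reflects the interchangeability of the two approaches noted above.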
Log P values were calculated employing atomic-contribution methods, namely the Ghose/Crippen approach [12] (available in the Hyperchem package [29]) and its later refinement, AlogP [30,31], and a fragmental method, ClogP [32], available in the Osiris Property Explorer package [33]. The ClogP and AlogP methods are among the most prominent methods of predicting log P. Both have been implemented as part of free and commercial software programs for molecular modeling applications [29,33,34]. Values of log P derived from the whole-molecule approach were calculated using topological indices, as in the MlogP method [35]. AlogP and MlogP are available in the VCCLAB on-line software package (ALOGPS 2.1 program) [34]. The calculated and experimental log P values for the 131 organic compounds in the test set are shown in Table 1. The theoretical values were determined using the Ghose/Crippen, AlogP, ClogP and MlogP models and the present model, through the ISET molecular descriptor. As can be seen in Table 1, some experimental log P values are missing, which may be related to the inherent difficulties associated with the determination of log P for certain compounds. However, their calculated values are included herein to allow future comparison with experimental values.

Semi-Empirical Electrotopological Index, ISET
In this study, the recently developed electrotopological index, ISET [23][24][25][26], is applied in QSPR studies to predict the octanol/water partition coefficient, log P, for a large number of organic compounds, including aliphatic hydrocarbons (alkanes and alkenes), aldehydes, ketones, esters and alcohols. This new descriptor can be quickly calculated for this series of molecules from the semi-empirical quantum-chemical AM1 method and correlated with the approximate numerical values attributed by the semi-empirical topological index to the primary, secondary, tertiary and quaternary carbon atoms. Thus, unifying the quantum-chemical and topological methods gives a three-dimensional picture of the atoms in the molecule [23]. It is important to note that the AM1 method gives more reliable charges, dipoles and bond lengths than those obtained from time-consuming, low-quality ab initio calculations, that is, those employing a minimal basis set [36]. Although the calculated partial atomic charges may be less reliable than other molecular properties, and different semi-empirical methods give net charges in poor numerical agreement with one another, their calculation is easy and the values at least indicate trends in the charge density distributions in the molecules. Since many chemical reactions and physico-chemical properties are strongly dependent on local electron densities, net atomic charges and other charge-based descriptors are currently used as chemical reactivity indices [37].
For alkanes and alkenes, this correlation allowed the creation of a new semi-empirical electrotopological index (ISET) for quantitative structure-retention relationship (QSRR) models [20], based on the fact that the interactions between the solute and the stationary phase are due to electrostatic and dispersive forces. ISET is able to distinguish between cis- and trans-isomers directly from the net atomic charges of the carbon atoms obtained from quantum-chemical calculations. For polar molecules such as aldehydes, ketones, esters and alcohols, the presence of heteroatoms such as oxygen considerably changes the charge distribution relative to the corresponding hydrocarbons, giving a partial increase in the interactions between the solute and the stationary phase. An appropriate way of calculating ISET was therefore developed, which takes into account the dipole moment exhibited by these molecules and the atomic charges of the heteroatoms and of the carbon atoms attached to them. If the stationary phase is considered to be a non-polar material, the interactions between these molecules and the stationary phase are electrostatic with a contribution from dispersive forces. These interactions increase slightly relative to those of the corresponding hydrocarbons, owing to the charge redistribution that occurs in the presence of the heteroatom, which also accounts for the dipole moment of the molecules. The dispersive forces between these kinds of molecules and the stationary phase include charge-dipole and dipole-induced dipole interactions, which are weak relative to the electrostatic interactions. Thus, the dipolar charge distribution in such molecules leads to a small increase in the interactions of the solute with the stationary phase relative to hydrocarbons, for which the dipole moment is zero or almost zero. The major effects of the (oxygen) heteroatoms on the charge distribution occur in their neighborhood, and the excess charge at these atoms leads to electrostatic interactions that are stronger than the weak dispersive dipolar interactions.
For aldehydes, ketones, esters and alcohols, all these factors were included in the calculation of the retention index through a small increase in the values of the atomic descriptor (named SETi) for the heteroatoms and the carbon atoms attached to them [24][25][26]. This was achieved by multiplying the SETi values of these atoms by a function Aµ, which is logarithmically dependent on the dipole moment of the molecule and on the net charges at the oxygen and carbon atoms (to include both the electrostatic and dispersive interactions) that are embodied in the definition of the local dipole moment µF [24][25][26]. In this way the dispersive dipolar interactions enter the calculation of the retention index through the dipolar function Aµ. That is, in this model ISET is calculated as in Equation 1, where the SETi values are obtained through a linear relationship with the net atomic charge obtained from AM1 calculations [18][19][20][21]. In Equation 1, Aµ is logarithmically dependent on the dipole moment of the molecule, as in Equation 2, where µ is the calculated molecular dipole moment and µF is the equivalent local dipole moment, which depends on the charges of the atoms belonging to the C-heteroatom group. In the expression for ISET (Equation 1), the dipolar function Aµ is taken as unity for the remaining carbon atoms of the molecule. The various definitions of the local dipole moment µF are given in previous papers concerned with the retention indices of aldehydes, ketones, esters and alcohols [24][25][26].
For the ISET model, the AM1 semi-empirical calculations of the net atomic charges were performed using the Hyperchem software package [29]. The initial geometries were obtained through molecular mechanics (MM+) calculations and subsequently optimized using the AM1 method [36,38], employing the Polak-Ribiere algorithm and gradient minimization techniques with a convergence limit of 0.0001 and an RMS gradient of 0.0001 kcal (Å mol)⁻¹. Mulliken population analysis was employed to obtain the net atomic charges of the carbon and oxygen atoms. The net atomic charge (Qi) is obtained from the difference between the electronic charge of the isolated atom (Z) and the calculated charge of the bound atom (qi), that is, Qi = Z − qi. The SETi values for each atom are obtained from Equation 2 using the AM1 net atomic charges (Qi). With AM1 calculations these quantities are more easily obtained for a large number of molecules of reasonable size than when employing a minimal basis set in ab initio calculations [36]. Despite the usually limited quantitative accuracy of semi-empirical methods, the computational efficiency available nowadays [35] enables the electronic properties of a large number of molecules to be obtained in a reasonable amount of time, and computational time is an important feature when developing quantitative structure-activity relationship (QSAR) models [37].
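The relation Qi = Z − qi can be illustrated with a few lines of code. This sketch assumes that Z is the number of valence electrons treated explicitly for each element, as is usual for valence-only semi-empirical methods such as AM1; the function name and the electron populations are hypothetical and are not taken from the calculations reported here.

```python
# Valence electrons counted for each element (assumption: valence-only treatment)
VALENCE_ELECTRONS = {"H": 1, "C": 4, "O": 6}

def net_atomic_charges(atoms):
    """atoms: list of (element, Mulliken electron population q_i).
    Returns the net charges Q_i = Z - q_i in the same order."""
    return [VALENCE_ELECTRONS[element] - population
            for element, population in atoms]

# Hypothetical populations for the C and O atoms of a carbonyl group
carbonyl = [("C", 3.78), ("O", 6.29)]
charges = net_atomic_charges(carbonyl)
print([round(q, 2) for q in charges])
```

A population below Z gives a positive net charge (electron-deficient carbon) and a population above Z gives a negative one (electron-rich oxygen), which is the charge separation that the local dipole moment µF is built from.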

Results and Discussion
The 3-hexanone molecule represented in the graph below is taken as an example of the ISET calculation using the present approach. The net atomic charges and SETi values are given in Table II of reference [24].
The results of the statistical analysis of the simple linear regression between experimental log P values and those calculated using ISET are shown in Table 2 for each class of compounds studied. They indicate that the theoretical partition coefficients calculated using the ISET method are in good agreement with the experimental partition coefficients. The QSPR models obtained with ISET showed high values for the correlation coefficient (r > 0.99), and the leave-one-out cross-validation demonstrates that the final models are statistically significant and reliable (r²cv > 0.98). As can be observed, this model explains more than 99% of the variance in the experimental values for this set of compounds. Among the various classes of compounds, the best results obtained with the ISET method are for hydrocarbons (Table 2), which is related to the fact that the present model was developed initially for this class of organic compounds. Values of r = 0.9986 and s = 0.10 were obtained for hydrocarbons, which are the lowest values considering the other four models. The present results can be compared with those recently published for a new approach based on the Kovats retention indices, which uses multiple linear regression [7], where reportedly s = 0.46 for 37 hydrocarbons, s = 0.27 for 11 aldehydes, s = 0.32 for 27 alcohols and s = 0.17 for 13 esters. As can be seen in Table 2, with ISET the lowest standard deviation was obtained for the aldehydes (s = 0.05) and the highest for the alcohols (s = 0.18). The range of standard deviations obtained supports the applicability of the present approach to different classes of organic compounds.
For alcohols, the earlier approach of Duchowicz et al. [6], based on the concept of flexible topological descriptors and on the optimization of correlation weights of local graph invariants, was applied to model the octanol/water partition coefficient of a representative set of 62 alcohols, resulting in a satisfactory prediction with a standard deviation of 0.22. Recently, Liu et al. [39] carried out a QSPR study to predict log P for 58 aliphatic alcohols using novel molecular indices based on graph theory, obtained by dividing the molecular structure into substructures. Their models showed good stability and robustness, and the values predicted by multiple linear regression are close to the experimental values (r = 0.9959 and s = 0.15). The above results show the reliability of the present model, which is based on the semi-empirical calculation of atomic charges and local dipole moments and uses only one descriptor, ISET.
The statistical analysis of the predictive ability of the four log P calculation models and ISET for the set of 131 aliphatic organic compounds from five different classes is summarized in Table 2. The AlogP method gives a stable performance for all classes of organic compounds tested, with much less variability in the statistical quality of the results among the different subclasses (r > 0.98 and s < 0.22). The ClogP method offers good predictability (r > 0.99 and s < 0.17), giving larger deviations only in the case of ketones (r = 0.955; s = 0.40). The MlogP and Ghose/Crippen methods have much larger deviations (r > 0.974 and s < 0.39) in comparison with the other methods.
The experimental and predicted log P values obtained using ISET and the other four models (and the respective deviations) for an external group of alcohols are shown in Table 3. The Ghose/Crippen method and its refinement AlogP show appreciable deviations for 1-undecanol and 4,4-dimethyl-1-pentanol, respectively, whereas the ClogP values are greater for branched alcohols. For the last three branched alcohols in Table 3, the whole-molecule approach MlogP, which employs a multiple linear regression (MLR) with a final regression equation involving 13 parameters, gives the same value for log P, being unable to distinguish the structural differences between these branched alcohols. The average standard deviation of the calculated log P for the seven alcohols of Table 3 using the ISET model is 0.15, whereas for the Ghose/Crippen method it is 0.34. The AlogP method, which is applicable to most neutral organic compounds and selected charged compounds, shows an average standard deviation of 0.26. In contrast, the ClogP method, which uses a large number of parameters and correction factors, results in a standard deviation of 0.17, while for the whole-molecule approach the value is 0.24. These results demonstrate that the predictability of the present model for polar aliphatic organic compounds has the same pattern of accuracy as the widely used ClogP model.
Table 3. Difference between experimental and predicted log P (∆log P) using ISET and the different methods studied (Ghose/Crippen, AlogP, MlogP, ClogP) for the external group of alcohols.
The predictive ability of a QSPR model can be estimated using an external test set of compounds that has not been used for building the model. According to Tropsha and Golbraikh [40], a high value of the cross-validated r² (q²) alone is an insufficient criterion for a QSAR model to be considered highly predictive, and the use of an external set of compounds for model validation is always necessary. The authors state that the correlation coefficient, r, between the predicted and observed activities of compounds from an external test set should be close to 1 [40,41]. Following these authors, we considered seven compounds not included in the original model (Table 3). Plotting observed vs. predicted log P values gave Y = 1.0273X − 0.1223 with r² = 0.9858 and Y = 0.9893X (with the intercept set to 0) with r² = 0.9842. Plotting predicted vs. observed log P values gave Y = 0.9596X + 0.1557 with r² = 0.9858 and Y = 1.008X (with the intercept set to 0) with r² = 0.9828. The QSPR model has a leave-one-out cross-validated coefficient of r²cv = 0.9870, showing that the model has high predictive power.
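The external-set regressions described above, with and without an intercept and in both directions, can be sketched as follows. The through-origin coefficient is computed here as 1 − SSres/SStot about the mean of the dependent variable, which is an assumption about the convention; other software may define it differently. The seven observed and predicted values are hypothetical and are not the Table 3 data.

```python
import numpy as np

def fit_with_intercept(x, y):
    """Ordinary least squares y = slope*x + intercept, with its r2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return slope, intercept, r2

def fit_through_origin(x, y):
    """Least squares y = k*x (intercept forced to 0), with its r2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    k = np.sum(x * y) / np.sum(x * x)
    r0_2 = 1.0 - np.sum((y - k * x) ** 2) / np.sum((y - y.mean()) ** 2)
    return k, r0_2

# Hypothetical external set of seven compounds (illustration only)
observed = np.array([1.2, 1.8, 2.3, 2.9, 3.4, 4.0, 4.6])
predicted = np.array([1.1, 1.9, 2.4, 2.8, 3.5, 3.9, 4.7])

# Observed vs. predicted, then predicted vs. observed
slope, intercept, r2 = fit_with_intercept(predicted, observed)
k, r0_2 = fit_through_origin(predicted, observed)
k_rev, r0_2_rev = fit_through_origin(observed, predicted)
print(slope, intercept, r2, k, r0_2, k_rev, r0_2_rev)
```

For a predictive model the slopes through the origin stay close to 1 and the through-origin r² values stay close to the r² of the unconstrained fit, which is the pattern shown by the four regressions reported above.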

Conclusions
The efficiency and applicability of the descriptor ISET for predicting log P using a quantitative structure-property relationship (QSPR) were demonstrated through the good statistical quality and high internal stability obtained for the classes of compounds studied, as well as the good predictive ability for the external group of compounds. The ISET model also has the advantage of simplicity, using only one descriptor, and it has statistical quality of the same order as the widely used models based on the fragmental method, ClogP, and the atomic-contribution method, AlogP. The quality of the results obtained can be considered appropriate for the development of QSPR models for other compounds in the future.