QSPR Models for the Molar Refraction, Polarizability and Refractive Index of Aliphatic Carboxylic Acids Using the ZEP Topological Index

Abstract: The molar refraction, polarizability, and refractive index of a series of monocarboxylic, dicarboxylic, and unsaturated monocarboxylic acids, with symmetric or asymmetric structures, were investigated using the quantitative structure-property relationship (QSPR) technique. We used linear regression with a single molecular descriptor, the ZEP topological index, which is calculated in a simple manner from weighted electronic distances that are themselves derived from the chemical structure of the molecules. The quality and predictive ability of the QSPR models obtained were assessed by specific validation techniques: the y-randomization test, leave-one-out cross-validation, and external validation. The investigated properties are well modeled by the ZEP index (with $r^2 > 0.99$). Our approach provides an alternative to the existing additive methods for predicting the molar refraction and polarizability of carboxylic acids, which are essentially based on the summation of atom and/or functional group contributions or bond contributions, together with some correction increments.


Introduction
Carboxylic acids form a family of organic compounds that contain the characteristic carboxyl functional group (-COOH or -CO$_2$H). They constitute an important class of industrial chemicals and also occur in many other processes. Among their most significant uses are the making of soaps, detergents, and shampoos; the food and pharmaceutical industries; the manufacturing of rubber; and the making of dyestuffs, perfumes, and rayon. Fatty carboxylic acids are also important for human health. In recent years, the use of the properties of carboxylic acids as independent variables in QSAR models has been steadily increasing [1][2][3][4][5]; however, QSPR models that treat the properties of carboxylic acids as dependent variables are very few. This lack of interest seems to be due to the specific structure of carboxylic acids, which strongly influences their properties. The carboxyl group is generally considered to be a highly polar organic functional group. Owing to the sp$^2$ hybridization of the carbon atom and of the doubly bonded oxygen atom, the carboxyl group has a planar structure that favors p-π conjugation and creates a strong permanent dipole. The dipoles present in carboxylic acids allow them to form strong hydrogen bonds between acid molecules, and also between acid molecules and water or other molecular solvents. These aspects influence the relationship between the structural attributes and the properties of carboxylic acids.
In our study, we considered three properties of carboxylic acids: molar refraction, molar polarizability, and refractive index. These properties are interrelated and are influenced by the electronic interactions and the polarity of carboxylic acids. The molar refraction, $R_m$ (cm$^3$ mol$^{-1}$), is a constitutive-additive molecular property of substances [6]. It is related to the polarizability of the molecules that make up the medium by the Lorentz-Lorenz equation [7]:

$$R_m = \frac{n_D^2 - 1}{n_D^2 + 2}\, V_m = \frac{4}{3}\pi N_A P \tag{1}$$

where $n_D$ is the refractive index of the given substance at optical wavelengths, usually at 589 nm (sodium D-line), $V_m$ is the molar volume, $N_A$ is Avogadro's constant, and $P$ is the mean polarizability of the molecules. For radiation of infinite wavelength, $R_m = V_m$ and, therefore, the molar refractivity can be used as a measure of the real volume of the molecules, a very important fact for chemists and biologists. The molar refraction can also be evaluated from the refractive index, molecular weight, and density, by replacing the molar volume in Equation (1) with the ratio of molecular weight (MW) to density (d): $V_m = MW/d$. On the other hand, molar refraction is a measure of the total polarizability of a mole of substance; see [6] for more details.

The refractive index ($n_D$) characterizes the capacity of a substance to refract light. Light traversing a substance has a velocity different from that of light traversing a vacuum, and the ratio of the velocity of light in a vacuum to that in the substance is the refractive index (or index of refraction) of the substance. The refractive index is often used to identify a particular substance, to confirm its purity, or to measure its concentration; see [8] for more details.
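As a minimal sketch of how the Lorentz-Lorenz relation is used in practice, the snippet below evaluates the molar refraction from the refractive index, molecular weight, and density, with $V_m = MW/d$. The function name and the input values for ethanoic acid are illustrative assumptions (approximate handbook values near 20 °C), not data taken from the tables of this paper.

```python
def molar_refraction(n_d, mol_weight, density):
    """Lorentz-Lorenz molar refraction in cm^3/mol.

    n_d        refractive index at the sodium D-line (dimensionless)
    mol_weight molecular weight in g/mol
    density    density in g/cm^3
    """
    molar_volume = mol_weight / density          # V_m = MW / d, in cm^3/mol
    return (n_d**2 - 1.0) / (n_d**2 + 2.0) * molar_volume


# Approximate handbook values for ethanoic (acetic) acid near 20 °C.
# These numbers are assumptions for illustration only.
r_m = molar_refraction(n_d=1.3716, mol_weight=60.05, density=1.049)
print(f"R_m = {r_m:.2f} cm^3/mol")
```

With these assumed inputs the result is roughly 13 cm$^3$ mol$^{-1}$, which is the order of magnitude expected for the smallest members of the series.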
Molar refraction and molecular polarizability, being additive properties, can be calculated by summing the contributions of atoms and/or functional groups, bond contributions, and various correction factors. The most developed way to obtain molar refraction uses Crippen's fragmentation methods [9,10]. Alternatively, various QSAR researchers have attempted to model molar refractivity by using topological indices [11,12]. Verma, Kuo, and Hansch [13] studied the polarizability effects on ligand-substrate interactions in terms of the number of valence electrons (NVE), and proposed various linear QSAR models. Verma and Hansch [2] compared the use of NVE and calculated molar refractivity (CMR) in QSARs for studying chemical-biological interactions, while Hansch and Kurup [14] found that the simple summation of the valence electrons (H = 1, C = 4, O = 6, etc.) in a molecule is a measure of its polarizability. They also showed that this parameter correlates with the nerve toxicity of a wide variety of chemicals acting on the nerves of frogs, rabbits, cockroaches, and humans. Fast empirical models to predict molecular polarizability were also developed by Wang, Xie, Hou, and Xu [15], using two different approaches. The refractive index, molar refractivity, and molar polarizability constant of heterocyclic compounds were studied by Sonar and Pawar [16], while Granados, Gracia-Fadrique, Amigo, and Bravo [17] studied the refractive index, surface tension, and density of aqueous mixtures of carboxylic acids.
Starting from this background, the main aim of the present study was to develop linear monovariable QSPR models that are able to predict molar refraction, polarizability, and refractive index in the class of carboxylic acids by using the ZEP topological index.

Materials and Methods
In order to develop predictive QSPR models for the molecular refraction, refractive index, and polarizability of carboxylic acids, we proceeded in four steps: (i) selection of the data set; (ii) generation of the molecular ZEP index for the carboxylic acids used in this work; (iii) building of QSPR models within the selected data set; (iv) validation of the obtained QSPR models using the y-randomization test and the internal and external validation strategies.

Data Set
The properties of aliphatic carboxylic acids selected in this study are the molecular refractivity, denoted by $R_m$; the refractive index, denoted by $n_D$; and the polarizability, denoted by $P$. The data set includes 80 acids: 50 saturated aliphatic monocarboxylic acids, 17 unsaturated acids, and 13 aliphatic dicarboxylic acids. Molecular refractivity values for these acids, as well as the refractive index values for 33 saturated aliphatic monocarboxylic acids and the polarizability values for 21 saturated aliphatic monocarboxylic acids, were taken from the literature [1,18,19]. The polarizability values for 20 other monocarboxylic acids were calculated by means of the relation:

$$P = \frac{3 R_m}{4 \pi N_A} \tag{2}$$

where $N_A$ is Avogadro's constant. These values are given in Table 1 and are indicated by the superscript b.
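The conversion from molar refraction to polarizability follows directly from the Lorentz-Lorenz relation, $R_m = \frac{4}{3}\pi N_A P$, and can be sketched as below. The function name and the input value of 13.0 cm$^3$ mol$^{-1}$ are hypothetical and chosen only to show the order of magnitude; the result is in cm$^3$ per molecule.

```python
import math

AVOGADRO = 6.02214076e23  # Avogadro's constant, mol^-1


def polarizability_from_molar_refraction(r_m):
    """Mean molecular polarizability (cm^3) from molar refraction (cm^3/mol),
    obtained by solving R_m = (4/3) * pi * N_A * P for P."""
    return 3.0 * r_m / (4.0 * math.pi * AVOGADRO)


# Hypothetical molar refraction, for illustration only.
p = polarizability_from_molar_refraction(13.0)
print(f"P = {p * 1e24:.2f} x 10^-24 cm^3")
```

Because the conversion is a multiplication by a constant, a polarizability obtained in this way carries exactly the same structural information as the molar refraction it was derived from.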

The ZEP Index
The molecular topological index ZEP used in this QSPR study was calculated using hydrogen-suppressed graphs of the carboxylic acids. The ZEP index, introduced by Berinde [20], is defined as:

$$\mathrm{ZEP} = \sum_{i<j} \left[\mathrm{wed}(i,j)\right]^{1/2} \tag{3}$$

where wed(i,j) is the weighted electronic distance, also introduced by Berinde [20]:

$$\mathrm{wed}(i,j) = \begin{cases} \dfrac{1}{b_{ij}} \cdot \dfrac{Z'_i + Z'_j}{v_i \cdot v_j}, & \text{if there is a bond between atom } i \text{ and atom } j \\[2ex] 0, & \text{if there is no bond between atom } i \text{ and atom } j \end{cases} \tag{4}$$

In (4), $v_i$ and $v_j$ denote the degrees of the vertices i and j, respectively; $Z'_k$ denotes the formal degree of vertex k and is defined by $Z'_k = Z_k \cdot v_k$; and $Z_k$ denotes the atomic number of atom k in the periodic table. The values of $b_{ij}$ are 1, 2, 3, and 1.5 for a single bond, a double bond, a triple bond, and an aromatic bond, respectively. Alternatively, the topological index ZEP can be calculated by using the connectivity matrix CEP [21]. Table 2 gives the weighted electronic distances, the formal degrees of the vertices, and the degrees of the vertices for the common bonds in carboxylic acids. In order to emphasize the number of bonds of the carbon and oxygen atoms, we kept the hydrogen atoms visible.

Table 2. Values of wed, $Z'_k$ (upper row) and $v_k$ (lower row).

In contrast to the usual topological distance, which is equal to 1 for any bond between two atoms, the weighted electronic distance is able, by its definition, to differentiate between single and multiple bonds and between non-polar and polar covalent bonds, and to differentiate bonds according to their branching degree and their neighboring bonds. It is also able to differentiate between symmetric and asymmetric arrangements of atoms or groups of atoms with respect to a chemical bond. This ability is illustrated in Figure 1 for four marked molecular graphs, which represent the structures of the following carboxylic acids: ethanoic, propanoic, 2-methylpropanoic, and 2,2-dimethylpropanoic. In each of the four cases the carboxyl functional group is linked to the rest of the carbon chain by a single bond, but with different branching and different neighboring bonds. Therefore, the weighted electronic distances for these bonds are different: 7.5, 4.5, 3.5, and 3.0, respectively; that is, the smallest value of the weighted electronic distance corresponds to the greatest branching (see Figure 1). The calculation of the ZEP index by Formula (3) can be illustrated for the hydrogen-suppressed graph of propanoic acid (G.2).

We note that the ZEP index has been studied by the author in various contexts, in order to check its correlation power with several properties, and it has provided good correlation parameters [20][21][22].
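The four values quoted above for the bond between the carboxyl carbon and the rest of the chain can be reproduced with a short sketch. It assumes that the weighted electronic distance has the form $\mathrm{wed}(i,j) = (Z'_i + Z'_j)/(b_{ij}\, v_i\, v_j)$ with $Z'_k = Z_k \cdot v_k$, and that the vertex degree counts bond orders in the hydrogen-suppressed graph (so the carboxyl carbon has degree 4). Both are assumptions made here because they match the quoted numbers; the class and function names are hypothetical.

```python
from dataclasses import dataclass

ATOMIC_NUMBER = {"C": 6, "O": 8}


@dataclass(frozen=True)
class Atom:
    symbol: str
    degree: int  # assumed: sum of bond orders to non-hydrogen neighbours


def formal_degree(atom):
    """Z'_k = Z_k * v_k."""
    return ATOMIC_NUMBER[atom.symbol] * atom.degree


def wed(a, b, bond_order=1.0):
    """Weighted electronic distance for a bond between atoms a and b."""
    return (formal_degree(a) + formal_degree(b)) / (bond_order * a.degree * b.degree)


# Carboxyl carbon: one bond to the chain, a C=O double bond, and a C-O single bond.
carboxyl_c = Atom("C", 4)

# Carbon attached to the carboxyl group, with increasing branching.
alpha_carbons = {
    "ethanoic": Atom("C", 1),
    "propanoic": Atom("C", 2),
    "2-methylpropanoic": Atom("C", 3),
    "2,2-dimethylpropanoic": Atom("C", 4),
}

values = {name: wed(atom, carboxyl_c) for name, atom in alpha_carbons.items()}
for name, value in values.items():
    print(f"{name}: wed = {value:.1f}")
```

The loop prints 7.5, 4.5, 3.5, and 3.0, in agreement with the values given in the text, which shows how increased branching at the neighboring carbon lowers the weighted electronic distance of the same type of bond.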
In this work, the values of the ZEP index were calculated for 84 carboxylic acids (the 80 acids of the data set and four other acids that are used in the validation of our QSPR models). All these values are different from each other, which indicates that the ZEP index also has good discrimination power; see also [22].

QSPR Model Building
In order to build a QSPR model, the data set was randomly divided into two subsets, namely, the training set and the test set. The training set was used for developing the QSPR models, while the test set was used for validating their predictive power. In the training set, simple linear equations were developed by least-squares regression with only one variable, the ZEP index. The statistical parameters used to test the goodness of fit between the model-predicted and experimental values were the correlation coefficient ($R$), the coefficient of determination ($R^2$), the standard deviation ($s$), and the Fisher statistic ($F$). A model with high values of $R^2$ and $F$ and a low value of $s$ is usually preferred. For the coefficient of determination, the following condition is recommended [23]: $R^2 > 0.6$. Satisfying this condition indicates a good fitting ability, but it says nothing about the predictive power of the model [24].
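The fitting step and its statistics can be sketched as follows. This is an illustrative implementation under stated assumptions, not the software used in the study: the descriptor and property values are synthetic, the function name is hypothetical, and $s$ is taken as the residual standard deviation with $n - 2$ degrees of freedom.

```python
import numpy as np


def fit_single_descriptor(x, y):
    """Least-squares line y = slope * x + intercept with R, R^2, s and F."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)        # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    s = np.sqrt(ss_res / (n - 2))               # residual standard deviation
    f = (ss_tot - ss_res) / (ss_res / (n - 2))  # F statistic, 1 and n-2 d.o.f.
    return {"slope": slope, "intercept": intercept,
            "R": np.sqrt(r2), "R2": r2, "s": s, "F": f}


# Synthetic stand-ins for (ZEP, property) pairs; not data from this study.
rng = np.random.default_rng(0)
zep = np.linspace(10.0, 40.0, 12)
prop = 1.2 * zep + 3.0 + rng.normal(0.0, 0.1, zep.size)

stats = fit_single_descriptor(zep, prop)
print({key: round(float(value), 4) for key, value in stats.items()})
```

A high $R^2$ and $F$ together with a small $s$ on such a fit only describe how well the line reproduces the training data, which is why the validation steps are applied separately.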

Model Validation
For evaluating the stability and the predictive ability of QSPR models developed in the present paper, we applied the following three validation strategies from the list of five basic validation procedures presented in [25][26][27]: y-randomization test, the internal validation, and external validation.
Y-randomization test. The main aim of the y-randomization test is to detect and quantify chance correlations between the dependent variable and the descriptors [25]. This test is designed to ensure the robustness of a QSPR model [26]. When applying the y-randomization test, the dependent variable (in our case $R_m$, $n_D$, or $P$) is randomly shuffled, and a new QSPR model is developed using the independent variable, the ZEP index, which is left unshuffled. The process is repeated several times. All QSPR models obtained in this way are expected to have low $R^2$ values; otherwise, the QSPR model developed cannot be accepted for the given data set. According to Kiralj et al. [27], if $R^2_{yi} < 0.2$, there is no risk of a chance correlation in the developed model.
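A minimal sketch of the y-randomization procedure described above is given below. The data are synthetic and the function names are hypothetical; the only technical fact relied upon is that, for a one-descriptor linear model, $R^2$ equals the squared Pearson correlation coefficient.

```python
import numpy as np


def r_squared(x, y):
    """R^2 of a one-descriptor linear fit (squared Pearson correlation)."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r


def y_randomization(x, y, n_runs=10, seed=0):
    """Refit the model n_runs times on randomly shuffled y values."""
    rng = np.random.default_rng(seed)
    return [r_squared(x, rng.permutation(y)) for _ in range(n_runs)]


# Synthetic stand-ins for (ZEP, property) pairs; not data from this study.
rng = np.random.default_rng(1)
zep = np.linspace(10.0, 40.0, 26)
prop = 1.2 * zep + 3.0 + rng.normal(0.0, 0.1, zep.size)

r2_original = r_squared(zep, prop)
r2_shuffled = y_randomization(zep, prop, n_runs=10)
print(f"original R^2 = {r2_original:.4f}")
print("shuffled R^2 :", [round(float(v), 3) for v in r2_shuffled])
```

Each shuffled value can then be compared with the threshold of 0.2 quoted above; a shuffled model that still fits well would indicate a chance correlation.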
Internal validation. In our study, the validity of the model was tested in the training set using the cross-validation (CV) method with the leave-one-out (LOO) procedure. The leave-one-out cross-validated coefficient, $R^2_{CV}$, describes the stability of a regression model. According to Kiralj et al. [27], the criterion of robustness and predictive ability of the model is $R^2_{CV} > 0.5$. It is accepted that the minimal requirements for a QSPR regression model are $R^2_{CV} > 0.5$ and $R^2 > 0.6$; see [23]. It is also generally agreed that a large difference between $R^2$ and $R^2_{CV}$ (exceeding 0.2-0.3) is an indicator of overfitting of the QSPR model.

External validation. The purpose of the external validation is to test the true predictive ability of the QSPR model. For this purpose, we analyzed the test set of compounds that were not included in the training set or used in the model development. We first applied the y-randomization test, and then we calculated the statistical parameters $R^2_{ext}$ and $Q^2_{ext}$, similarly to $R^2$ and $R^2_{CV}$ for the training set. The external validation performance is given by $R^2_{ext}$ and $Q^2_{ext}$. $R^2_{ext}$ is a measure of fitting for the external validation set and can be compared to $R^2$ for the training set [28].
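The two validation statistics can be sketched as below, again on synthetic data. The leave-one-out coefficient is computed from the predictive residual sum of squares (PRESS). For the external set, several definitions of the predictive coefficient are in use; the one shown here, with the test-set mean in the denominator, is one common convention and is an assumption rather than the definition confirmed by the text. Function names are hypothetical.

```python
import numpy as np


def loo_r2_cv(x, y):
    """Leave-one-out cross-validated R^2 for a one-descriptor linear model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0
    for i in range(x.size):
        keep = np.arange(x.size) != i                  # drop compound i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def external_r2(x_train, y_train, x_test, y_test):
    """Predictive R^2 on a test set never used in model building."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    y_pred = slope * np.asarray(x_test, dtype=float) + intercept
    y_test = np.asarray(y_test, dtype=float)
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)     # assumed convention
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for training and test sets; not data from this study.
rng = np.random.default_rng(2)
x_train = np.linspace(10.0, 40.0, 26)
y_train = 1.2 * x_train + 3.0 + rng.normal(0.0, 0.1, x_train.size)
x_test = np.linspace(12.0, 38.0, 10)
y_test = 1.2 * x_test + 3.0 + rng.normal(0.0, 0.1, x_test.size)

r2_cv = loo_r2_cv(x_train, y_train)
r2_ext = external_r2(x_train, y_train, x_test, y_test)
print(f"R2_CV = {r2_cv:.4f}, external R2 = {r2_ext:.4f}")
```

Comparing the cross-validated value with the fitted $R^2$ gives the overfitting check mentioned above, since a gap larger than 0.2-0.3 would be a warning sign.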

Results and Discussion
We calculated the ZEP index for the 84 acids used in this study: the values obtained for the 50 saturated aliphatic monocarboxylic acids are listed in Table 1, the ZEP index values of the 13 aliphatic dicarboxylic acids and 17 unsaturated acids are listed in Tables 3 and 4, respectively, while the values of ZEP for the remaining four acids are listed in Table 5.

Saturated Aliphatic Monocarboxylic Acids
In order to build a QSPR model for the molecular refractivity, we applied the procedure described above. Table 1 displays the experimental molecular refraction for the 50 saturated aliphatic monocarboxylic acids, which have an asymmetric structure with respect to the carboxyl group or contain an asymmetric carbon atom. They were divided into two subsets: a training set of 26 acids, used in the modelling process, and a test set of 24 acids, marked with the superscript b and used for testing the model by external validation. By correlating the molecular refractivity with the ZEP index for the 26 monocarboxylic acids of the training set, we obtained the linear QSPR model (5). This model has a very good statistical quality for fitting the calculated $R_m$ values to the experimental ones. The robustness of model (5) and its internal predictive ability were evaluated by the leave-one-out (LOO) cross-validation coefficient, $R^2_{CV}$, whose value of 0.9998 is very good. Model (5) was also checked for reliability, robustness, and chance correlation by applying the y-randomization test 10 times. The results of the y-randomization test are presented in Table 6. In each y-randomization run, $R^2_{yi} < 0.2$, which shows that the good results of our original model were not due to a chance correlation or to a structural dependency of the training set. Having been internally validated, model (5) was used to calculate the molecular refractivity for the training set and to predict the $R_m$ values of the monocarboxylic acids in the test set. The results are presented in Table 1. The analysis of the residuals of the predicted molecular refractivity against the experimental values in the training set shows that the residuals exceeded the standard deviation limits of ±2s (in our case ±0.42) in only three cases. These three small excesses appeared for acids with similar structures: 2,2-dimethylpropanoic (−0.47 error), 3,3-dimethylpentanoic (+0.45 error), and 3,3-dimethylhexanoic (+0.47 error). Eliminating these three values from the correlation process leads to the linear QSPR equation (6), but the goodness of fit, the reliability, and the robustness of the QSPR model (6) are not significantly improved by this elimination.
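The residual check against the ±2s limits can be sketched as follows. The observed and predicted values are hypothetical and serve only to show the mechanics; $s$ is assumed here to be the residual standard deviation of a two-parameter linear model, and the function name is illustrative.

```python
import numpy as np


def flag_outliers(y_obs, y_pred, n_params=2, k=2.0):
    """Return s and the indices of residuals lying outside the +/- k*s limits."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_obs - y_pred
    s = np.sqrt(np.sum(residuals**2) / (residuals.size - n_params))
    flagged = np.flatnonzero(np.abs(residuals) > k * s)
    return s, flagged


# Hypothetical predicted values and residuals; not data from this study.
y_pred = np.linspace(13.0, 55.0, 10)
noise = np.array([0.05, -0.04, 0.03, -0.05, 0.04, -0.03, 0.02, -0.02, 0.50, 0.01])
y_obs = y_pred + noise

s, flagged = flag_outliers(y_obs, y_pred)
print(f"s = {s:.3f}, compounds outside the 2s limits: {flagged.tolist()}")
```

In this toy example only the compound at index 8 lies outside the limits; in the study the same kind of check singles out the heavily branched acids listed above.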
The capability of the linear model (5) to predict $R_m$ values for monocarboxylic acids with unknown $R_m$ was investigated in the test set. The predicted $R_m$ values for the 24 monocarboxylic acids included in the test set were calculated with Equation (5) and are given in Table 1. The analysis of residuals shows a single compound, 2,2,3,3-tetramethylbutanoic acid, falling outside the standard deviation limits of ±2s. All the validation strategies show that the obtained model (5) is a valid QSPR model for the prediction of the molecular refractivity of monocarboxylic acids. A general QSPR model for all 50 monocarboxylic acids was also developed, and the result obtained with it suggests that our QSPR model (5) is, indeed, very good.

Aliphatic Dicarboxylic Acids
Dicarboxylic acids contain two carboxyl functional groups in their structure. In our study, we considered dicarboxylic acids with a linear structure that is symmetric with respect to the two carboxyl groups. Because of the two carboxyl groups, the polarizability and the electronic interactions of dicarboxylic acids are stronger than those of monocarboxylic acids. This is the reason why, in a first step, we developed a separate QSPR model for a set of 10 aliphatic dicarboxylic acids (as a training set). Table 3 presents the values of the ZEP index calculated for these acids and the experimental values of molecular refractivity. By linear regression with a single descriptor we obtained Equation (9). The coefficient of determination, $R^2 = 0.9998$, and the standard error, $s = 0.0178$, show a very good correlation between the ZEP index and the molecular refractivity of aliphatic dicarboxylic acids. The model was validated by leave-one-out cross-validation and y-randomization. The results of y-randomization are presented in Table 6. These data show, for each iteration, values of $R^2_{yi} < 0.2$, which supports the stability of the model. In addition, the cross-validation coefficient, $R^2_{CV} = 0.998$, illustrates the reliability of the model. The leave-one-out cross-validation predicted values are presented in Table 3. Therefore, the obtained model (9) is suitable for calculating the molar refraction in this class of dicarboxylic acids.

Unsaturated Carboxylic Acids
Unsaturated carboxylic acids contain double or triple bonds in their structure, alongside the carboxyl functional group. The multiple bonds are arranged asymmetrically with respect to the carboxyl group. They influence the polarizability and the electronic interactions of unsaturated carboxylic acids, but this influence is less significant than in the case of dicarboxylic acids. In our study, we developed a QSPR model for the molar refraction ($R_m$) of a set of 12 unsaturated acids (as a training set). Table 4 presents the values of the ZEP index calculated for these acids and the experimental values of molecular refractivity. By applying the linear regression method with a single descriptor we obtained Equation (10). The results of y-randomization are presented in Table 6. These results show, for each iteration, values of $R^2_{yi} < 0.2$, which indicates the stability of the model. The value of the cross-validated coefficient, $R^2_{CV} = 0.995$, which is very close to the coefficient of determination, illustrates the reliability of the model. The leave-one-out cross-validation predicted values are also presented in Table 4. In order to check the predictive ability of model (10), we used this equation to calculate the molar refraction of five unsaturated carboxylic acids. The obtained values were compared with the experimental values of molar refraction available in the literature, and the differences between the experimental and predicted values were not significant. Therefore, the QSPR model (10) was shown to be very good for the calculation of the molar refraction of unsaturated carboxylic acids.
At the end of our study, we applied the linear regression method to the set of 80 acids obtained by the union of the 50 monocarboxylic acids, the 13 dicarboxylic acids, and the 17 unsaturated carboxylic acids. We thus obtained the QSPR model (11) for molar refraction. The QSPR model (11), modelling the molar refraction of carboxylic acids, was used to compute the molar refraction of four other carboxylic acids not previously considered in the QSPR study. The results obtained in this way are given in Table 5. As can be seen, the maximum standard error for $R_m$ corresponds, as expected, to Equation (11), which comprises all the carboxylic acids considered in the study.

Building the QSPR Model for Polarizability
Table 1 presents the experimental values of molecular polarizability for 41 saturated aliphatic monocarboxylic acids, divided into two subsets: a training set of 21 acids, used in the QSPR modelling process, and a test set of 20 acids, marked with the superscript b and used for testing the model by external validation. The polarizability values of the 20 acids in the test set were obtained by conversion of the molecular refractivity, using Equation (2). By correlating the molecular polarizability with the ZEP index for the 21 monocarboxylic acids of the training set, the linear QSPR model (12) was obtained. This model has a very good statistical quality for fitting the calculated values of $P$ to the experimental ones. The robustness of model (12) and its internal predictive ability were evaluated using the leave-one-out (LOO) cross-validation coefficient, $R^2_{CV}$, whose value of 0.9998 is very good. Model (12) was also checked for reliability, robustness, and chance correlation by applying the y-randomization test 10 times. The results of the y-randomization test are presented in Table 6. In each y-randomization run, we obtained $R^2_{yi} < 0.2$, which shows that the good results of our original model were not due to a chance correlation or to a structural dependence of the training set.
Having been internally validated, the QSPR model (12) was used to calculate the molecular polarizability for the training set and to predict the values of $P$ for the monocarboxylic acids in the test set. The results are presented in Table 1. The regression of the predicted against the observed molecular polarizability gave $R^2_0 = 0.998$. The analysis of the residuals of the predicted molecular polarizability against the experimental values in the training set showed that the residuals exceeded the standard deviation limits of ±2s (in our case ±0.123) only once, for the compound 2,2-dimethylpropanoic acid (0.18 error). The capability of the linear model (12) to predict $P$ values for monocarboxylic acids was investigated in the test set. The predicted values of $P$ for the 20 monocarboxylic acids included in the test set, a number close to that of the acids in the training set, were calculated with Equation (12) and are given in Table 1, together with their deviations from the corresponding experimental values of $P$. The external predictive power was confirmed by $R^2_{CV\,ext} = 0.998$ and $R^2_{ext} = 0.9998$. The analysis of residuals shows a single compound, 2,2,3,3-tetramethylbutanoic acid, falling outside the standard deviation limits of ±2s. All the validation strategies show that the obtained model (12) is a valid QSPR model for the prediction of the polarizability of monocarboxylic acids. The statistical results for Equation (13) suggest that our QSPR model (12) is very good. The obtained QSPR equations modelling the polarizability of carboxylic acids were used to compute the polarizability of four other carboxylic acids not previously considered in the QSPR study. The results obtained in this way are given in Table 7. Notably, the obtained values of polarizability and molar refraction increase with the size and molecular weight of the carboxylic acids. This fact is in agreement with the Lorentz-Lorenz formula, which gives the relationship between polarizability, molar refractivity, and volume [18].

Building QSPR Models for Refractive Index
The molecular set considered here comprises 33 aliphatic monocarboxylic acids with the corresponding $n_D$ values (see Table 1), of which 22 acids were used as the training set in the modelling process and 11 acids, marked with the superscript b, were used as a test set for external validation. The QSPR model (14) was obtained in this case. The model was similarly validated by the leave-one-out cross-validation and y-randomization techniques. The results of y-randomization are presented in Table 6. Using Equation (14), we calculated the values of the refractive index for the 11 saturated aliphatic monocarboxylic acids of the test set. The obtained values were compared with the experimental values of the refractive index available in the literature. On this basis, the QSPR model (14) was shown to be very good for the calculation of the refractive index of saturated aliphatic monocarboxylic acids.

Conclusions
In this work, we presented various QSPR models, as an alternative technique to the existing additive methods, for predicting the molar refraction, polarizability, and refractive index of carboxylic acids. We used a linear regression method and a single molecular descriptor, the ZEP topological index. ZEP was calculated in a simple manner with the help of weighted electronic distances (wed), which are themselves calculated on the basis of the chemical structure of the molecules. The QSPR models developed were validated by means of the leave-one-out cross-validation procedure, external validation, and y-randomization. The obtained results show that the proposed models are simple and have a significant predictive potential. Moreover, all QSPR models thus developed, irrespective of the property for which they were constructed, i.e., $R_m$, $P$, or $n_D$, can also be applied for predicting the other two properties of carboxylic acids, in agreement with the Lorentz-Lorenz formula. This intercorrelation also explains the fact that, for all QSPR models reported here, the correlation coefficients have close values.
The results reported in this paper could be used in QSAR (quantitative structure-activity relationship) studies for the prediction of the biological or pharmaceutical activity of carboxylic acids. Thus, the molar refraction values could be used for the estimation and prediction of the lipophilicity of a homologous series of saturated fatty acids [3], while the refractive index could be used for the estimation and prediction of the toxicity of aliphatic carboxylic acids [1]. These aspects will be considered in future work.