Variable Connectivity Index as a Tool for Modeling Structure-Property Relationships

We report on the calculation of normal boiling points for a series of n = 58 aliphatic alcohols using the variable connectivity index, in which variables x and y are used to modify the weights on carbon atoms (x) and oxygen atoms (y) in molecular graphs, respectively. The optimal regressions are found for x = 0.80 and y = -0.90. Comparison is made with available regressions on the same data reported previously in the literature. A refinement of the model was considered by introducing different weights for primary, secondary, tertiary, and quaternary carbon atoms. With optimal weights for the different carbon atoms, the standard error for the normal boiling points of alcohols was slightly reduced, from s = 4.1°C (when all carbon atoms were treated alike) to s = 3.9°C.


Introduction
As can be seen from browsing through the literature, despite the availability of various methodologies, such as Principal Component Analysis (PCA), Ridge Regression (RR), Partial Least Squares (PLS), and Artificial Neural Networks (ANN), Multivariate Regression Analysis (MRA) continues to be widely used in studies of structure-property relationships. The situation is likely to continue because different methodologies have different advantages and limitations. With respect to MRA, recent years have witnessed several important developments associated with molecular descriptors that continue to keep MRA an active route for structure-property studies. These developments include: (1) development of numerous novel molecular descriptors; (2) development of user-friendly computer software for MRA; (3) development of orthogonalization procedures for descriptors; (4) development of procedures for the interpretation of descriptors; and (5) development of "flexible molecular descriptors." Here, we will be concerned with the use of flexible molecular descriptors in MRA, a development that started a dozen years ago and appears to be catching on in recent times. However, before entering the domain of flexible descriptors, let us briefly elaborate on the remaining four points listed above, because those interested in structure-property-activity studies should consider and, if possible, combine all five aspects of MRA in order to arrive at better models for use in QSAR and QSPR (Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships, respectively).
For an exhaustive compilation of molecular descriptors one should consult the Handbook of Molecular Descriptors by Todeschini and Consonni [1], while a recent book by Devillers and Balaban [2] lists numerous articles on novel molecular descriptors. The development of novel topological indices (TIs) goes in two directions: (1) generalization of functional dependencies between distance or adjacency matrices and various properties, and (2) development of molecular descriptors designed for heteroatoms. In spite of the great number of successful applications of topological indices in QSAR and QSPR studies, their structural interpretation still remains the main bottleneck in the understanding of their nature. Owing to the large number and wide diversity of graph invariants used as TIs, it is difficult to find general relations. Thus it is desirable to obtain some generalization of TIs that makes further interpretation more effective. Lucic et al. generalized the Wiener index for the modeling of boiling points [3]. Estrada proposed a generalization of several well-known TIs by using a vector-matrix-vector multiplication procedure [4][5][6][7].
A number of molecular descriptors that have been designed for use with molecules containing heteroatoms can be related to several graph theoretical indices (often referred to as topological indices) and can be viewed as specifically generalized topological indices. Examples include the well-known valence connectivity indices of Kier and Hall [8,9], which are derived from the connectivity index [10] and the "higher order" connectivity indices [11]; descriptors derived from the distance matrix of molecules containing heteroatoms [12]; Balaban's topological index J for heteroatom-containing molecules [13][14][15], derived from the topological index J [16]; and Zefirov and Palyulin's "solvation connectivity index" [17][18][19]. We will see later how the variable connectivity index may be viewed not merely as a generalization of the "valence" connectivity indices but as an "unlimited" collection of "valence" connectivity indices.
As user-friendly software for MRA we should mention CODESSA, developed by Katritzky, Lobanov and Karelson [20]. In CODESSA, there are some 400 molecular descriptors that can be evaluated for a given set of structures and used in MRA.
Orthogonalization of molecular descriptors, a procedure that is only slowly gaining attention, has several important properties for MRA [21][22][23][24]. First, it leads to a stable step-wise regression equation in which the coefficients of the descriptors already used do not change when additional (orthogonal) descriptors are added. This allows one to view the coefficients of the regression equation as indicators of the relative importance of the descriptors used. It also allows one to condense the structural information contained in two or more descriptors into a single descriptor [25,26]. Finally, by using retro-regression [27], it allows one to find the best subset of descriptors for a regression using n descriptors.
Recently a relatively general procedure has been outlined which allows one to partition molecular descriptors in terms of bond contributions [28,29]. Such partitioning provides some insight into the relative importance of the individual bonds for the particular molecular property. When this approach is applied to different topological indices, one sees that, for some indices, terminal CC bonds play a more important role and the interior CC bonds make lesser contributions, while in other indices the opposite is the case [30,31]. This, of course, is important for a better understanding of molecular models.

Flexible Molecular Descriptors
A typical MRA starts with a selection of molecular descriptors from a pool of available descriptors. For instance, as already mentioned, CODESSA offers several hundred descriptors to choose from, which include indicator variables, quantum chemically computed quantities, and graph theoretically derived or modified molecular descriptors. Choosing the procedure and criteria for the selection of descriptors and alternatives is a non-trivial problem in view of the interrelation of many descriptors. This is an important topic that continues to receive attention and may well result in diverse approaches yet to be fully evaluated. In contrast to approaches based on choosing between "fixed" molecular descriptors, the notion of "flexible" molecular descriptors varies descriptors already in use by modifying them slightly, thus creating novel descriptors to be tested for their performance. The process continues until, for a given type of descriptor, one has explored the domain of the variable parts of the descriptor and found the optimal parameters that minimize the overall standard error of the regression.
The first variable descriptor was the variable connectivity index [32,33]. The index is constructed by introducing one or more variables associated with atoms of different kinds or different types. In this way, the bond (m,n) that connects atoms having m and n nearest neighbors, respectively (which contributes 1/√(m·n) to the connectivity index ¹χ), makes the contribution 1/√[(m+x)(n+y)], where x and y are variables to be selected during the regression analysis. For example, in a study of normal boiling points of smaller aliphatic alcohols [28], it was found that the standard error was reduced from 7.9°C when the connectivity index ¹χ is used (that is, when x = y = 0) to 3.3°C when x = 1.50 and y = -0.85. This is quite an impressive improvement in the simple regression for the normal boiling points of smaller alcohols. Similar results were obtained when the normal boiling points of smaller amines were considered [34]: the standard error was reduced from 3.5°C when the connectivity index ¹χ was used (that is, when x = y = 0) to 1.9°C when x = 1.25 and y = -0.65. These two cases illustrate well the "power" of the variable connectivity index and suggest by extension that, in general, variable indices are likely to offer significant improvements in regression analysis.
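The bond-contribution formula above can be illustrated with a minimal Python sketch. The function name, the atom/bond encoding of the hydrogen-suppressed graph, and the use of one weight per element (x for every carbon, y for oxygen) are assumptions made for illustration; this is not the authors' code.

```python
from math import sqrt

def variable_connectivity(atoms, bonds, weights):
    """Sum 1/sqrt((m + w_i)(n + w_j)) over all bonds of a hydrogen-suppressed graph.

    atoms   : list of element symbols, e.g. ["C", "C", "O"]
    bonds   : list of (i, j) atom-index pairs
    weights : dict mapping element symbol to its variable weight
              (x for carbon, y for oxygen; all zero gives the ordinary index)
    """
    # m and n in the text are the numbers of nearest neighbors (vertex degrees).
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    total = 0.0
    for i, j in bonds:
        total += 1.0 / sqrt((degree[i] + weights[atoms[i]]) *
                            (degree[j] + weights[atoms[j]]))
    return total

# 2-butanol as a hydrogen-suppressed graph: C0-C1(-O4)-C2-C3
atoms = ["C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (1, 4)]

chi = variable_connectivity(atoms, bonds, {"C": 0.0, "O": 0.0})        # x = y = 0
chi_var = variable_connectivity(atoms, bonds, {"C": 0.80, "O": -0.90})  # x = 0.80, y = -0.90
print(round(chi, 5), round(chi_var, 5))
```

With x = y = 0 the function reduces to the ordinary connectivity index. Note that the weight on a terminal oxygen must stay above -1, otherwise the factor (1 + y) under the square root is no longer positive.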
Recently, in addition to the use of variable connectivity indices [35][36][37][38][39][40][41][42][43], several different flexible molecular descriptors have been considered in the literature, including variable path weights [44,45], variable distance-related indices [46,47], and other variable descriptors [48]. Thus, we may well be at the beginning of a novel direction in structure-property-activity studies that will be dominated by the use of variable molecular descriptors. Such descriptors not only lead to better regressions with fewer molecular descriptors in comparison with regressions using "fixed" descriptors but, because they offer a novel interpretation of the descriptors [49], may also offer guidance in the refinement of molecular models, as will be explained later in this article. An important consequence of the introduction of the variable molecular connectivity index should not be overlooked, however: variable connectivity not only clearly points to limitations of the "valence"-based molecular descriptors but has actually demonstrated that there are no "universal" valence descriptors, and that, at best, they may be suitable for a narrow selection of molecular properties.
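The optimization loop described in this section (vary the weights, regress, keep the weights that give the smallest standard error) can be sketched as a plain grid search. Everything below is a hypothetical illustration: the seven small alcohols, the grid limits, and above all the target values, which are synthetic numbers generated from the index itself at x = 0.8 and y = -0.9 rather than measured boiling points. The sketch only shows that the search recovers the weights that were used to generate the data.

```python
from math import sqrt

def index(atoms, bonds, x, y):
    """Variable connectivity index with weight x on carbon and y on oxygen."""
    w = {"C": x, "O": y}
    deg = [0] * len(atoms)
    for i, j in bonds:
        deg[i] += 1
        deg[j] += 1
    return sum(1.0 / sqrt((deg[i] + w[atoms[i]]) * (deg[j] + w[atoms[j]]))
               for i, j in bonds)

def standard_error(xs, ys):
    """Standard error of a simple linear regression of ys on xs (n - 2 d.o.f.)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    return sqrt(sse / (n - 2))

# Hydrogen-suppressed graphs of a few small alcohols (illustrative subset).
MOLECULES = [
    (["C", "O"], [(0, 1)]),                                        # methanol
    (["C", "C", "O"], [(0, 1), (1, 2)]),                           # ethanol
    (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]),              # 1-propanol
    (["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)]),              # 2-propanol
    (["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3), (1, 4)]), # 2-butanol
    (["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3), (3, 4)]), # 2-methyl-1-propanol
    (["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3), (1, 4)]), # 2-methyl-2-propanol
]

# SYNTHETIC target property (not experimental data): linear in the index at (0.8, -0.9).
TARGET = [10.0 + 50.0 * index(a, b, 0.8, -0.9) for a, b in MOLECULES]

def grid_search(molecules, target):
    """Scan x in [0.0, 2.0] and y in [-0.9, 1.0] in steps of 0.1; return (s, x, y)."""
    best = None
    for i in range(0, 21):
        for j in range(-9, 11):   # y must stay above -1 for a terminal oxygen
            x, y = i / 10, j / 10
            s = standard_error([index(a, b, x, y) for a, b in molecules], target)
            if best is None or s < best[0]:
                best = (s, x, y)
    return best

best_s, best_x, best_y = grid_search(MOLECULES, TARGET)
print(best_x, best_y)
```

A finer grid or a local optimizer could replace the exhaustive scan; the grid is used here only because it makes the "explore the domain of the variables" step explicit.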

Normal Boiling Points of Aliphatic Alcohols Revisited
We have selected a set of 58 aliphatic alcohols, previously considered in the literature, in order to compare the performance of different variable descriptors to that of the variable connectivity index. One of the earlier studies is based on the use of path numbers in which a variable weight has been introduced for paths that include the oxygen atom [45]. Using paths of length one and two, the optimal variable weight, x, is 2.6, and the associated standard error, s, is 4.0°C. When paths of length three are also included, the optimal variable weight increases slightly to 3.1, with an associated standard error of 3.9°C.
Recently Krenkel, Castro and Toropov [48] reported on their analysis of the same 58 aliphatic alcohols using "CW" (correlation weight) descriptors. In the upper part of Table 1, we have collected the statistical parameters for several multivariate regression analyses of the normal boiling points of alcohols, and in the lower part of Table 1 we have summarized the results of the present study on the same set of alcohols. Observe the distinction between "descriptors" and "variables," the former indicating the number of molecular descriptors that appear in the regression equation and the latter indicating the "flexibility" of the descriptors, that is, their inherent capacity to be adjusted and modified so that they can account for different kinds of atoms or different environments for similar atoms. Let us consider an example. In MRA, the number of descriptors equals the number of parameters of the model minus one if a constant term is used (eq. 1).
We can see that the number of parameters for the above model equals n + 1. The flexibility of the model is achieved by selecting a certain number of descriptors from the large pool of available descriptors. In the case of variable topological indices, simple linear regression is usually used (eq. 2), and the number of parameters of the model equals 2.
We can see that the number of variables has no influence on the number of parameters of the model, because the variables represent just the flexibility of the single descriptor. It is true that by increasing the number of variables we increase the possibility of obtaining a chance correlation, but the same is true in the case of MRA, where the chances of obtaining a random correlation are increased by using a larger pool of possible descriptors.
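For reference, the two model forms contrasted above can be written generically as follows. The symbols (P for the modeled property, D_i for fixed descriptors, a_i for fitted coefficients, and ¹χ^f for the variable connectivity index) are assumed here for illustration and need not match the authors' own notation.

```latex
% Multivariate regression on n fixed descriptors: n + 1 fitted parameters
\begin{equation}
P = a_0 + a_1 D_1 + a_2 D_2 + \dots + a_n D_n
\end{equation}

% Simple regression on one variable index: 2 fitted parameters,
% with the variables x, y, ... acting inside the descriptor
\begin{equation}
P = a_0 + a_1\, {}^{1}\chi^{f}(x, y, \dots),
\qquad
{}^{1}\chi^{f} = \sum_{\text{bonds }(i,j)} \frac{1}{\sqrt{(m_i + w_i)(n_j + w_j)}}
\end{equation}
```

Here m_i and n_j are the numbers of nearest neighbors of the two bonded atoms, and w_i, w_j are the variable weights (x for carbon, y for oxygen) of those atoms.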
The correlations using weighted paths involve several molecular descriptors, while the regression of Krenkel, Castro and Toropov [48], as well as all of the models using the connectivity index, are simple regressions using one molecular descriptor. There is no doubt that the approach of Krenkel, Castro and Toropov [48] gave better results than those of [45], as evidenced not only by the smaller standard error and fewer outliers but also by the fact that they used half of the data for the "training set" and half of the data as the "prediction set." However, their results are not totally surprising in view of the fact that Castro et al. employed four different weights for primary, secondary, tertiary and quaternary carbon atoms, in addition to using three weights for different atomic types (hydrogen, carbon and oxygen). It may be of interest to consider how much the introduction of different weights for paths of different lengths may improve the results based on variable path numbers, although this remains to be seen. Interest in such a comparison arises from the fact that, on one hand, one uses several variable descriptors (paths of different lengths) in MRA and, on the other hand, one uses a single variable descriptor (the CW descriptor) in a simple regression but expressed with twice as many variables. Future applications are likely to show how these two alternative approaches complement each other, how much they overlap, and how much they differ in various applications.
We see from Table 1 that on using a single variable connectivity index that discriminates between carbon atoms and oxygen atoms (the variable weights for carbon and oxygen are 0.80 and -0.90, respectively) one obtains a regression of similar quality to those based on two and three variable path numbers. Thus, in this respect, the variable connectivity indices are more "powerful" descriptors than the variable path numbers. We should add that although the novel "correlation weights" indices do offer a good regression, the indices themselves show high degeneracy, with about half of the molecules considered having the same numerical values for their indices, as indicated in Table 2, and hence the same applies to their computed normal boiling points.
Degeneracy among topological indices is quite common and not necessarily detrimental, because different compounds having identical molecular descriptors may also have identical or very similar magnitudes for selected molecular properties. In the case of alcohols, for instance, as we can see from Table 2, this is the case with 2-methyl-2-hexanol, 3-methyl-3-hexanol and 3-ethyl-3-pentanol, which have normal boiling points of approximately 142.4-142.5°C. However, there are cases in which molecules have the same index but are reported to have significantly different normal boiling points, such as 3-methyl-2-pentanol and 2-methyl-3-pentanol, or 2,6-dimethyl-4-heptanol and 3,5-dimethyl-4-heptanol, where the two isomers differ by 7.7°C and 9.0°C, respectively.
The variable connectivity index also shows degeneracy with respect to the set of 58 alcohols, but only for certain isomers. In view of the very good regression reported by Krenkel, Castro and Toropov for their variable index, it appears that their index may have even greater potential than hitherto displayed if it could be further modified by introducing additional variables that could differentiate among the many isomers showing degenerate numerical values for the index as currently calculated.

Another matter that needs to be addressed, when one uses variable descriptors or other descriptors selected from a large pool of descriptors, is the risk of "chance" correlation. The risk of "chance" correlation can be assessed by randomizing the input data and determining whether the descriptors provide a satisfactory regression for randomly assigned input data. If they do, then the descriptors used for the particular regression should be rejected; if they do not, then the regression is significant. In view of the fact that we have a sizable sample (n = 58), one such random test may suffice to convince readers that the variable connectivity descriptors encode specific molecular structural features rather than adjusting to any meaningless set of input data. Using random number tables we have randomized the normal boiling points. The order of the 58 normal boiling point entries is the following: 39 14 ...
The ordinary connectivity index (that is, the index that does not differentiate carbon and oxygen atoms) gives a standard error of s = 8.6°C and a correlation coefficient of r = 0.971 when the true normal boiling points are modeled. After randomization of the normal boiling points we obtain a simple regression with a standard error of 90.7°C and a correlation coefficient of r = 0.044. This clearly points to the nonrandom nature of the connectivity index χ. The fact that the difference between the largest and the smallest normal boiling points (corresponding to 1-decanol and methanol, with 230.3°C and 64.7°C, respectively) is less than twice the standard error for the randomized input entries shows the inability of the connectivity index to fit random numbers. However, we have to show that the same is true for the variable connectivity index if the results summarized in Table 1 are to hold unchallenged. Hence, we introduced two variables, variable x to modify the weights of the carbon atoms and variable y to modify the role of the oxygen atom. After a search for the optimal values of the two variables, we obtain a standard error and correlation coefficient of s = 89.9°C and r = 0.133, respectively, with x = 2 and y = 10¹⁵. Thus, clearly the variable connectivity index, while having the flexibility to adjust to specific structural variations of molecules (alcohols), cannot adjust to fit random data. This ought to suffice to dispel concerns that flexible molecular descriptors can be made to suit random data.
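The randomization test described above can be sketched in a few lines of Python. The descriptor values and property values below are hypothetical placeholders (not the 58 alcohols), and a seeded pseudo-random shuffle stands in for the random number tables used in the text.

```python
import random
from math import sqrt

def fit_stats(xs, ys):
    """Return (r, s): correlation coefficient and standard error of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    sse = max(syy - sxy ** 2 / sxx, 0.0)   # residual sum of squares of the fitted line
    return r, sqrt(sse / (n - 2))

def randomization_test(xs, ys, seed=0):
    """Fit the true data, then fit the same property values in scrambled order."""
    shuffled = list(ys)
    random.Random(seed).shuffle(shuffled)
    return fit_stats(xs, ys), fit_stats(xs, shuffled)

# Hypothetical descriptor values and a property that follows them closely.
descriptor = [1.0 + 0.5 * k for k in range(20)]
prop = [60.0 + 30.0 * d + (1.0 if k % 2 else -1.0) for k, d in enumerate(descriptor)]

(r_true, s_true), (r_rand, s_rand) = randomization_test(descriptor, prop)
print(round(r_true, 3), round(s_true, 2), round(r_rand, 3), round(s_rand, 2))
```

For a variable descriptor, the weights would be re-optimized against the scrambled values before the two fits are compared, as was done in the text; a single shuffle is shown here, although repeating the test with several seeds gives a fuller picture.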

Model Refinement
The variable molecular indices, including of course the variable connectivity index, have shown flexibility in adjusting to the specific requirements of individual molecular properties and have consequently produced high quality regressions. In addition, these indices have another important property that should be considered: they offer increased latitude in molecular modeling that was hitherto not so readily accessible. Variable descriptors make it possible to test modifications of existing models in order to find better models, as will be illustrated here using normal boiling points of alcohols modeled with the variable connectivity index.
As we have seen from Table 1, the variable connectivity index with two variables, one for carbon atoms (x) and one for the oxygen atom (y), produced a regression equation with a standard error of s = 4.1°C. The optimal values found for the variables are x = 0.80 and y = -0.90, which can be interpreted to mean that the carbon-oxygen bond has a greater role than the carbon-carbon bonds. One may now raise the question: does the enhanced role of the oxygen atom also influence the adjacent carbon atom? In other words, do the carbon atoms that are bonded to oxygen play a somewhat more important role in comparison with carbon atoms that are more distant from the oxygen atom?
With the notion of the variable connectivity index, the above question can be readily answered. All we need is to introduce an additional variable weight to be associated with the carbon atom that is adjacent to the oxygen. When we did this and varied three variables (two for carbon atoms and one for oxygen), we found no significant improvement in the regression. The conclusion is that, in the case of saturated compounds, there is no essential difference between carbon atoms regardless of whether they are bonded to oxygen or not. There is, of course, a difference between carbon atoms having sp³ and sp² hybridization in compounds having C=O bonds.
The next step in trying to improve the model based on discrimination between carbon atoms is to consider different weights for the primary, secondary, tertiary and quaternary carbon atoms. Indeed, when considering aqueous solubility of alcohols, Cammarata [50] observed that the primary, secondary, tertiary, and quaternary carbon atoms make somewhat different contributions to the total surface area, to be used as a structural descriptor for aqueous solubility of aliphatic alcohols. The variable connectivity index allows one to test if this is also the case with the normal boiling points of alcohols. Using four different weights for carbon atoms and a single variable for the oxygen atom, we have examined the normal boiling points of alcohols and obtained the results listed in the lower part of Table 1. As we can observe, by differentiating the primary, secondary, tertiary, and quaternary carbon atoms we were able to reduce the standard error from 4.1°C to 3.9°C, not dramatic but certainly a significant improvement. It turns out that the initial carbon "weights" of 0.80 have changed into 0.80, 0.80, 0.96, and 1.00 for the primary, secondary, tertiary, and quaternary carbon atoms, respectively, thus showing no change for the primary or secondary carbon atoms but showing some decrease in the contributions associated with the tertiary and quaternary carbon atoms. Recall that an increase in the "weight", because of the inverse square root contributions, means a decrease in the bond additivity that contributes to the connectivity index. In Table 3 we have listed the connectivity index and two variable connectivity indices for each of the 58 alcohols in order to illustrate the inherent flexibility of variable molecular indices. While for all alcohols shown in Table 3 both variable connectivity indices have increased values as compared to χ, the changes are different for different molecules. For instance, 2-butanol, 2-M-1-propanol, 2-pentanol, and 2-M-2-butanol have all the same χ = 2.27006, but the corresponding values for the variable indices are visibly different: 2.75658, 2.96111, 3.11372, and 3.31825, respectively. These values again change slightly when we introduce different weights for the primary, secondary, tertiary, and quaternary carbon atoms, becoming, respectively, 2.70941, 2.93925, 3.06655, and 3.29639. A close look at Table 3 shows that when introducing weights for primary, secondary, tertiary, and quaternary carbon atoms, some indices did not change at all, some have decreased in magnitude slightly, while others show somewhat greater decreases in magnitude. Clearly those alcohols that have no tertiary or quaternary carbon atoms have not changed (in this particular application).
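The refinement with class-dependent carbon weights can be sketched as below. One point is an inference rather than something stated in the text: the carbon class is taken here from the number of heavy-atom neighbors in the hydrogen-suppressed graph (so the carbinol carbon of 2-butanol, with three neighbors, receives the weight 0.96). Under that assumption the sketch reproduces the values quoted above for 2-butanol and 2-M-1-propanol. The function and variable names are hypothetical.

```python
from math import sqrt

# Carbon weights keyed by vertex degree in the hydrogen-suppressed graph
# (ASSUMED mapping: degree 1..4 <-> "primary".."quaternary" in the text).
CARBON_WEIGHTS = {1: 0.80, 2: 0.80, 3: 0.96, 4: 1.00}
OXYGEN_WEIGHT = -0.90

def refined_index(atoms, bonds, carbon_weights=CARBON_WEIGHTS, oxygen_weight=OXYGEN_WEIGHT):
    """Variable connectivity index with degree-dependent carbon weights."""
    deg = [0] * len(atoms)
    for i, j in bonds:
        deg[i] += 1
        deg[j] += 1

    def weight(k):
        # Oxygen keeps a single weight; carbon weight depends on its degree.
        return oxygen_weight if atoms[k] == "O" else carbon_weights[deg[k]]

    return sum(1.0 / sqrt((deg[i] + weight(i)) * (deg[j] + weight(j)))
               for i, j in bonds)

# 2-butanol: C0-C1(-O4)-C2-C3; 2-methyl-1-propanol: C0-C1(-C2)-C3-O4
two_butanol = (["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3), (1, 4)])
isobutanol = (["C", "C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3), (3, 4)])

refined_2_butanol = refined_index(*two_butanol)
refined_isobutanol = refined_index(*isobutanol)
print(round(refined_2_butanol, 5), round(refined_isobutanol, 5))
```

Passing a dictionary with all four carbon weights equal to 0.80 recovers the two-variable index, which makes it easy to see which molecules are affected by the refinement.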
From the above results, we conclude that modeling based on differentiation of CH₃, CH₂, CH and C carbon atoms has some merit and makes a significant, though not dramatic, improvement in the accompanying regression equation. In Table 4 we have listed the optimal regression equations. In addition to the regression equation based on all 58 compounds, we also included the regression equation obtained after removing five compounds that show deviations larger than two standard deviations. With removal of the outliers, the standard deviation has dropped to a very remarkable 2.6°C. In Table 5 we have listed the experimental normal boiling points as well as the computed normal boiling points and the residuals. The asterisk in the last two columns indicates the compounds that were viewed as outliers, including methanol, with a computed normal boiling point 13.5°C from the experimental value. The other outliers show differences of between 10.5°C and 6.2°C. Observe that all five of the excessive residuals are negative, meaning that all five computed normal boiling points are larger than the corresponding experimental values. It would be premature to speculate on the origin of this observation, and in particular to speculate on the possibility of impurities in the experimental samples, because it is difficult to imagine that the normal boiling point of methanol would not be reasonably accurate. In the case of methanol it is likely that the simplified representation of this alcohol by a hydrogen-suppressed graph, which reduces the molecular graph to the simple single-edge K₂ graph, is an oversimplification. However, the other four outliers, 2,2-dimethyl-1-propanol, 3,3-dimethyl-2-pentanol, 2,2-dimethyl-3-pentanol, and 2,6-dimethyl-4-heptanol, have no special structural features absent in other alcohols that could suggest their different behavior. Thus it remains an open question whether their departure represents limitations of the model, these being the "tail" compounds in a Gaussian distribution of the residuals, or whether there may be some unspecified causes for their departure from the computed regression line, not excluding the possibility of impurities and other experimental errors. In any case it would be of interest to explore whether other molecular models of similar accuracy would also point to the same subset of compounds as outliers. If that is the case, and if experimental errors are eliminated as the cause of disagreement (by repeating the normal boiling point measurements), these compounds would present an interesting challenge for theoretical studies that would account for the anomaly in their normal boiling points. In Fig. 1 we have plotted the calculated normal boiling points against the experimental normal boiling points.
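The outlier criterion used above (a deviation larger than two standard deviations) can be sketched as follows. Apart from the methanol boiling point of 64.7°C and the standard error of 3.9°C, which are quoted in the text, every number in the example is a hypothetical placeholder. The residual is taken as experimental minus computed, so that a negative residual means the computed value is too high, in line with the sign convention described above.

```python
def flag_outliers(experimental, computed, s, k=2.0):
    """Return (index, residual) pairs for which |residual| exceeds k * s.

    residual = experimental - computed, so a negative residual means the
    computed boiling point is higher than the experimental one.
    """
    flagged = []
    for idx, (obs, calc) in enumerate(zip(experimental, computed)):
        residual = obs - calc
        if abs(residual) > k * s:
            flagged.append((idx, residual))
    return flagged

# First entry mimics methanol (64.7 from the text; the computed value is hypothetical);
# the remaining values are illustrative only.
experimental = [64.7, 100.0, 120.0, 140.0]
computed = [78.2, 101.5, 118.0, 141.0]

outliers = flag_outliers(experimental, computed, s=3.9)
print(outliers)
```

After flagged compounds are removed, the regression would be refitted on the remaining compounds and the standard error recomputed, which is how the second equation in Table 4 is described.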

Concluding Remarks
One should not be surprised that different models may give regressions of similar statistical quality. The selection of descriptors depends also on the models considered and on the interpretability of the descriptors. In the case of the variable connectivity index we can conclude not only that the carbon-oxygen bond makes a visibly greater contribution to the bond additivity of the normal boiling points of alcohols than do carbon-carbon bonds but also that, in view of the high quality of the correlation, several outliers can be identified. By excluding these outliers, the standard error of the regression for well over 50 alcohols is about 3.6°C, and even 2.6°C when five outliers have been removed from the set. It may be premature to speculate on why outliers have occurred, except perhaps for methanol, the hydrogen-suppressed molecular graph of which may have introduced an oversimplification, because it is difficult to think that an experimental error of over 10°C would be possible. However, we observe that for all five outliers the experimental normal boiling points are lower than those calculated, which may hint at a systematic rather than random displacement of the computed normal boiling points. Consequently, some structural factor not taken into account by the variable descriptors may be behind the difference between the computed and experimental normal boiling points. A closer look at the outliers shows that in three cases the OH group is next to a neopentyl fragment and is thus somewhat shielded. As a consequence, in these compounds one expects weaker intermolecular hydrogen bonds, which would be associated with a lowering of the normal boiling points. In the fourth outlier we have two methyl groups on each side of the OH group that could produce a similar shielding effect. Thus the observed larger negative residuals may suggest the presence of weaker hydrogen bonds and a possible influence of steric hindrance due to some crowding of hydrogens from neighboring methyl or methylene groups.
If this explanation proves to hold, we could even speculate on the possible use of residuals in similar situations as a measure of the "weakening" of hydrogen bonds. We should mention that we exclude methanol from this discussion, as in the case of methanol no shielding could be present; methanol clearly forms a class of its own and presents a challenge of its own.

Figure 1. Calculated normal boiling points plotted against the experimental normal boiling points for the 58 alcohols studied.

Table 1. Statistical data on the regression of the normal boiling points of alcohols. Data from different sources are grouped, ending with the results of this study.

Table 2. Degeneracy of the Correlation Weights descriptors of smaller alcohols.

Table 3. The connectivity index and variable connectivity indices for the alcohols considered.

Table 4. The regression equation based on differentiation between primary, secondary, tertiary and quaternary carbon atoms. The lower part of the table lists the regression equation obtained when five outliers have been removed from the set of n = 58 alcohols.

Table 5. The experimental and calculated normal boiling points for alcohols. The last two columns show the results when five outliers have been removed from the set of n = 58 alcohols.