Improved Molecular Descriptors Based on the Optimization of Correlation Weights of Local Graph Invariants

We report the calculation of boiling points for several alkyl alcohols through the use of improved molecular descriptors based on the optimization of correlation weights of local graph invariants. As local invariants we have used the presence of different chemical elements (i.e. C, H, and O) and the existence of different vertex degree values (i.e. 1, 2, 3 and 4). The inherent flexibility of the chosen molecular descriptor seems well suited to obtaining satisfactory predictions of the property under study. Comparison with other similar approaches reveals a very good behavior of the present method. The use of higher order polynomials does not seem to be necessary to improve on the results of the simple linear fitting equations. Some possible future extensions are pointed out in order to reach a more definitive conclusion about this approach.


I -Introduction
The relationship between molecules and graphs can be considered as a sort of isomorphism. In fact, if vertices are viewed as atoms and edges as bonds, then graphs represent models of chemical structures /1,2/. Conversely, if atoms in a molecule are interpreted as vertices and bonds as edges, then molecules are but illustrations of graphs /3/. That is to say, molecules have all those properties that the corresponding graphs have, but it is evident that molecules possess many additional properties that go beyond the mere consequences of the simple connectivity features that graphs encode. Therefore, the use of graphs as molecular models gives rise to a basic problem within the realm of QSAR/QSPR (Quantitative Structure Activity Relationships/Quantitative Structure Property Relationships) theory, which can be posed as follows: how should one select those graph invariants (molecular descriptors) that are reliable enough to establish a suitable relationship between biological activities/physicochemical properties and structure?
The aim of this paper is to deal with this pivotal issue in relation to the calculation of boiling points (bp) for a selected set of alkyl alcohols. We take as a reference study a recent paper on optimal molecular descriptors based on weighted path numbers /4/. The main idea is to construct descriptors suitable for optimization by introducing an intrinsic degree of flexibility, i.e. a variable part that can be improved in different applications. This feature provides a degree of freedom which should hopefully lead to better molecular descriptors and, consequently, to more satisfactory mathematical relationships between structure and property.
This paper is organized as follows: the next section deals with the definition and illustration of the chosen molecular descriptors based on the optimization of correlation weights of local graph invariants. Then we show the numerical results obtained via first, second and third order polynomial relationships for a selected set of alkyl alcohols and compare them with previous results derived on the basis of a similar set of molecular descriptors. We then discuss the results, analyzing the similarities and significant differences with regard to other equivalent approaches. The final section presents the main conclusions derived from this study and points out several possible future extensions.

II -Correlation Weights of Local Graph Invariants
The last three decades witnessed a meaningful upsurge of interest in applications of graph theory in chemistry. As pointed out before, constitutional formulae of molecules are chemical graphs where vertices represent the set of atoms and edges stand for chemical bonds. The pattern of connectedness of atoms in a molecule is preserved by constitutional graphs. Chemists have long relied on visual perception to relate various aspects of constitutional graphs to observable phenomena. However, a clear and quantitative understanding of the structural basis of chemistry demands the use of precise mathematical techniques. The applications of matrix theory, graph theory, group theory and information theory to chemical graphs have produced results which are important in chemistry /5-13/.
Most molecular descriptors in QSAR/QSPR theory are rather "rigid" in the sense that the algorithm for their construction is fixed, so that once the molecule is selected, the invariant under consideration can be computed exactly. There is a large number of molecular descriptors of this sort and they have been shown to be rather suitable /4/. However, there exists another separate class of molecular descriptors having an intrinsic flexibility involving a variable part that can be adjusted and optimized for different applications. Thus, the employment of weighted paths for alkyl alcohols has been shown to extend enormously the approach of variable descriptors to molecules of different chemical composition /4/.
An alternative proposal for this kind of molecular descriptor is the Correlation Weights of the Local Invariants of Molecular Graphs (CWLIMG), introduced originally by one of us (AAT) /14-16/ and soon afterwards applied to the study of some physical chemistry properties /17,18/. Results were encouraging enough to promote new efforts to apply this descriptor to other physical chemistry properties.
The CWLIMG approach is based upon the following scheme. The primary units of analysis are the atoms with their corresponding vertex degrees. Then, graph invariants are formulated in the general form

$$D = f\big(CW(a(i)),\, CW(\nu_i)\big) \qquad (1)$$

where the arguments run over the atoms i of the molecule, $a_{ij}$ is an element of the adjacency matrix A, $\nu_i$ is the vertex degree value of the i-th vertex, defined as

$$\nu_i = \sum_j a_{ij} \qquad (2)$$

and $CW(a(i))$ and $CW(\nu_i)$ are the correlation weights corresponding to atom i.
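As an illustration of the vertex degree definition, the following minimal sketch builds an adjacency matrix and sums its rows. The hydrogen-suppressed graph of 2-propanol and the helper names `adjacency_matrix` and `vertex_degrees` are assumptions made here for brevity; they are not taken from the original work.

```python
import numpy as np

# Assumed example: hydrogen-suppressed graph of 2-propanol, C0-C1(-O3)-C2
atoms = ["C", "C", "C", "O"]
edges = [(0, 1), (1, 2), (1, 3)]


def adjacency_matrix(n_atoms, edges):
    """Return the symmetric 0/1 adjacency matrix A of an undirected graph."""
    A = np.zeros((n_atoms, n_atoms), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1
    return A


def vertex_degrees(A):
    """Vertex degree of each vertex i, computed as the row sum of a_ij."""
    return A.sum(axis=1)


A = adjacency_matrix(len(atoms), edges)
degrees = vertex_degrees(A)

# Local invariants of each atom: (chemical element, vertex degree)
local_invariants = list(zip(atoms, degrees.tolist()))
```

Each pair in `local_invariants` identifies the two correlation weights, CW(a(i)) and CW(ν i ), that atom i contributes to the descriptor.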
Correlation weights are calculated by means of an optimization procedure, i.e. they are determined in such a way as to yield the best correlation coefficient for the relationship

$$P = F(D) \qquad (3)$$

where P stands for the physical chemistry property or biological activity.
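The optimization step can be sketched as a simple stochastic hill climb on the correlation coefficient. Everything specific in this sketch is an assumption for illustration only: the additive form of the descriptor, the search strategy, the function names, the toy hydrogen-suppressed graphs and the placeholder property values (which are arbitrary numbers, not experimental boiling points). It is not the authors' optimization procedure.

```python
import numpy as np


def descriptor(mol, cw):
    """Assumed additive descriptor: sum of CW(element) + CW(degree) over atoms."""
    return sum(cw[element] + cw[degree] for element, degree in mol)


def correlation(mols, prop, cw):
    """Absolute Pearson correlation between descriptor values and property."""
    d = np.array([descriptor(m, cw) for m in mols])
    return abs(np.corrcoef(d, prop)[0, 1])


def optimize_weights(mols, prop, cw, step=0.1, n_iter=500, seed=0):
    """Perturb one weight at a time; keep the change only if r improves."""
    rng = np.random.default_rng(seed)
    cw = dict(cw)
    keys = list(cw)
    best = correlation(mols, prop, cw)
    for _ in range(n_iter):
        k = keys[rng.integers(len(keys))]
        old = cw[k]
        cw[k] = old + rng.normal(scale=step)
        r = correlation(mols, prop, cw)
        if r > best:
            best = r
        else:
            cw[k] = old  # reject the move
    return cw, best


# Toy graphs as lists of (element, vertex degree); placeholder property values
mols = [
    [("C", 1), ("O", 1)],
    [("C", 1), ("C", 2), ("O", 1)],
    [("C", 1), ("C", 2), ("C", 2), ("O", 1)],
    [("C", 1), ("C", 3), ("C", 1), ("O", 1)],
    [("C", 1), ("C", 2), ("C", 2), ("C", 2), ("O", 1)],
    [("C", 1), ("C", 1), ("C", 1), ("C", 4), ("O", 1)],
]
prop = np.array([10.0, 20.0, 35.0, 25.0, 50.0, 30.0])

cw0 = {"C": 1.0, "O": 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
r_init = correlation(mols, prop, cw0)
cw_opt, r_opt = optimize_weights(mols, prop, cw0)
```

Because a move is accepted only when it raises the correlation coefficient, `r_opt` can never fall below `r_init`; a production implementation would likely use a more systematic search and a stopping criterion.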
There is complete freedom to choose the explicit algebraic form of the f and F functions. The most general polynomial form of the F function is

$$P = \sum_{k=0}^{n} A_k D^k \qquad (4)$$

while there are several possibilities to choose f. Some of the simplest equations for D are given by Eqs. (5)-(10).
After computing the optimal CW values, one resorts to relationship (4) to calculate the final correlation formula through a least squares procedure (i.e. to determine the optimum coefficients {A k / k = 0, 1, ....., n}) for a molecular training set. Then, the predictive capability of the whole method is tested with a different set of molecules (test set).
Previous results obtained with this method have proved suitable for predicting several physical chemistry properties /14-18/.
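The least squares step for the polynomial relationship (4) can be sketched with NumPy's polynomial module, whose `polyfit` returns the coefficients in ascending order (A 0 first), matching the notation used here. The function names `fit_F` and `predict` and the synthetic, exactly linear data are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as poly


def fit_F(d_train, p_train, n):
    """Least squares coefficients A_0..A_n of P = sum_k A_k D^k."""
    return poly.polyfit(d_train, p_train, n)


def predict(d, coeffs):
    """Evaluate the fitted polynomial at descriptor values d."""
    return poly.polyval(d, coeffs)


# Synthetic training data (not from the paper): P depends linearly on D
d_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
p_train = 3.0 + 2.0 * d_train

# Fit first, second and third order relationships
coeffs = {n: fit_F(d_train, p_train, n) for n in (1, 2, 3)}

# Apply the fitted equations to descriptor values outside the training set
d_test = np.array([7.0, 8.0])
p_pred = {n: predict(d_test, coeffs[n]) for n in coeffs}
```

For data that are truly linear in D, the higher order coefficients come out close to zero, which is the kind of behavior one would look for when deciding whether n > 1 is worthwhile.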

III -Results and Discussion
In order to apply a meaningful test, we chose the same molecular set as that employed by Randic and Basak /4/ to compute boiling points of 58 alkyl alcohols. Since they used optimal molecular descriptors based on weighted path numbers, we deem it appropriate to compare their results with our CWLIMG, since both approaches employ indices that possess an inherent flexibility involving a variable part that is optimized for different applications. Besides, the chosen set of 58 alcohols has been employed in several QSPR/QSAR studies /19-26/.
Regarding the specific analytical form of function f in Eq. (1), we employ the simple relation (5), and for the relationship between property and descriptor we apply formula (4) for n = 1, 2 and 3. Furthermore, the whole set was partitioned into two equal subsets: a) a training set consisting of 29 alkyl alcohols (molecules 1, 2, 3, 4, , 6, 8, 9, 11, 14, 16, 18, 20, 22, 26, 27, 29, 34, 35, 37, 39, 41, 44, 45 2) and b) a test set comprising the remaining 29 alcohols. The members of each set were chosen completely at random, and the criterion used to measure the quality of the results was the average value of the modulus of the residuals (i.e. the average deviation).
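The random partition into two equal subsets and the quality criterion can be sketched as follows. The function names, the fixed seed and the use of Python's standard `random` module are assumptions; the partition produced here is not the one used in this work.

```python
import random


def split_equal(ids, seed=0):
    """Randomly split identifiers into two halves (training set, test set)."""
    ids = list(ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return sorted(ids[:half]), sorted(ids[half:])


def average_deviation(experimental, calculated):
    """Average value of the modulus of the residuals."""
    residuals = [abs(e - c) for e, c in zip(experimental, calculated)]
    return sum(residuals) / len(residuals)


# 58 molecules numbered 1..58, split into 29 + 29
train_ids, test_ids = split_equal(range(1, 59))
```

Changing `seed` gives a different random partition, which is one way to check how sensitive the fitted equation is to the choice of training set.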
The most significant results are given in Tables 1-3, Equations 11-13 and Figure 1, and we have also included previous results taken from Ref. 4 for comparative purposes. Complete data are available upon request from one of us (EAC).
We give in Table 1 the correlation weights obtained for this set of alkyl alcohols.
The statistical parameters corresponding to regression equations (11)-(13) are displayed in Table 2, where we have also included those values reported by Randic and Basak /4/. Figure 1 shows the regression of the calculated bp (Eq. (11)) versus the experimental bp, and Table 3 presents the experimental and calculated bp, together with the corresponding residuals.
The comparison of the different theoretical results tells us that the regression based on the first order equation is good enough and that results do not improve in a meaningful way when using higher order relationships. The average deviations for the two molecular sets (i.e. training set and test set) are rather similar, although naturally the value is better for the first set. The comparison of our results with those taken as a reference /4/ seems to indicate the higher quality of those computed on the basis of CWLIMG. The main purpose of this work is not just to perform a close comparison with Randic and Basak's paper, but since these authors pointed out that "... the examples given clearly show the high-quality results based on optimal molecular descriptors ...", as is indeed the case, the comparison of both sets of results is useful to derive some valid conclusions on the present method employing CWLIMG.
The average deviations are lower for our calculations, and this becomes more meaningful when one takes into account that the data taken from Ref. 4 are based upon a two-variable equation (descriptors p 1 and p 2 , i.e. weighted paths of length one and length two, respectively, Eqs. 7 and 10 in Ref. 4). Besides, one must take into account that our results for the molecular test set are completely predictive, that is to say, those molecules were not included in the molecular set employed to determine the fitting equation, while Randic and Basak's results do not make this differentiation (i.e. the whole set of 58 molecules was used to calculate the regression relationships), so that there is no genuine prediction within their values. In order to justify our claim of having obtained better results, it is instructive to note that, in general, the statistical parameters for the test set are even better than Randic and Basak's corresponding values for the whole set of 58 molecules. Another way to recognize the better quality of our predictions is to consider the number of predicted bp with a deviation larger than 5°C. In fact, our predicted set of bp registers just 4 cases, while Randic and Basak's data present 10 predictions with a deviation larger than 5°C.
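The count of predictions deviating by more than 5°C can be expressed as a short helper. The function name and the numbers in the example are hypothetical and serve only to show the strict threshold; they are not values from Table 3.

```python
def count_large_deviations(experimental, calculated, threshold=5.0):
    """Number of predictions whose absolute residual exceeds the threshold."""
    return sum(
        1 for e, c in zip(experimental, calculated) if abs(e - c) > threshold
    )


# Hypothetical values: residuals of 3, 6 and exactly 5 degrees
n_large = count_large_deviations([100.0, 120.0, 140.0], [103.0, 126.0, 135.0])
```

A residual of exactly 5 is not counted, in line with the wording "larger than 5°C".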
We have tried other alternative ways to choose the members of the training and test sets, but final results are practically the same.

IV -Conclusions
The results presented in this paper clearly show the very good outcomes arising from the use of the CWLIMG which, on the one hand, uses just one molecular descriptor and, on the other hand, gives correlations with significantly reduced deviations with regard to other similar approaches. Resorting to molecular descriptors having an intrinsic flexibility, as is the case of the present one, seems to be a very good prospect because they yield quite satisfactory predictions.
In addition, it is not necessary to employ higher order polynomial relationships in order to improve on the linear equations, and the most suitable fitting equation does not appear to depend upon the choice of the training set.
Present results agree with those published before on the use of CWLIMG /14-18/ and they further illuminate the appropriateness of using this molecular descriptor within the realm of QSAR/QSPR theory.
Before establishing more definitive conclusions about the quality of this sort of flexible molecular descriptor, it would be necessary and convenient to study other molecular sets and/or other physical chemistry properties and biological activities. At present, research along these lines is under development in our laboratories and results will be published elsewhere in the near future.

Figure 1. Experimental versus theoretical boiling points of alkyl alcohols.

Table 1. Correlation weights (CW) for atoms and extended connectivity value corresponding to the set of alkyl alcohols.