QSPR Calculation of Normal Boiling Points of Organic Molecules Based on the Use of Correlation Weighting of Atomic Orbitals with Extended Connectivity of Zero- and First-Order Graphs of Atomic Orbitals

We report the results of a calculation of the normal boiling points of a representative set of 200 organic molecules through the application of QSPR theory. For this purpose we have used a particular set of flexible molecular descriptors, the so-called Correlation Weighting of Atomic Orbitals with Extended Connectivity of Zero- and First-Order Graphs of Atomic Orbitals. Although in general the results show that the descriptors are suitable for predicting this physical-chemistry property, the existence of some deviant behaviors points to a need to complement this index with other kinds of molecular descriptors. Some possible extensions of this study are discussed.


Introduction
One of the topics of continuing interest in structure-property studies is to arrive at simple correlations between the selected properties and the molecular structure. For such considerations the molecular structure is often represented as a simple mathematical object, such as a number, sequence, or a set of selected invariants of matrices, generally referred to as molecular descriptors. Multiple regression analysis is usually used in such studies in the hope that it might point to structural factors that influence a particular property. Of course, regression analysis does not establish a causal relationship between structural components and molecular properties. Nevertheless, it may help one in model building and assist in the design of molecules with prescribed desirable properties, which is an important goal in drug research. In chemistry, anything that can be said about the magnitude of the property and its dependence upon changes in the molecular structure depends on the chemist's capability to establish valid relationships between structure and property. In many physical-chemistry, organic, biochemical and biological areas, it is increasingly necessary to translate those general relations into quantitative associations expressed in useful algebraic equations known as Quantitative Structure-Activity (-Property) Relationships (QSAR/QSPR). To obtain a significant correlation, it is crucial that appropriate descriptors be employed, whether they be theoretical, empirical or derived from readily available experimental features of the molecular structures. Many descriptors reflect simple molecular properties and thus they can provide some meaningful insights into the physical-chemistry nature of the activity/property under consideration.
Chemical graph theory [1] advocates an alternative approach to QSAR/QSPR studies based on mathematically derived molecular descriptors. Such descriptors, often referred to as topological indices [2], include the well-known Wiener index W [3], the Hosoya index Z [4], and the connectivity index χ [5]. The last three decades have witnessed an upsurge of interest in applications of graph theory in chemistry. Constitutional formulae of molecules are chemical graphs where vertices represent the set of atoms and edges represent chemical bonds [6]. The pattern of connectedness of atoms in a molecule is preserved by constitutional graphs. A graph G = [V,E] consists of a finite nonempty set V of points together with a prescribed set E of unordered pairs of distinct points of V [7].
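As a minimal sketch of how a chemical graph G = [V,E] yields a topological index, the Python fragment below builds the hydrogen-suppressed graphs of n-butane and isobutane and computes the Wiener index as the sum of shortest-path distances over all unordered pairs of vertices. The function name `wiener_index`, the vertex numbering and the adjacency-list representation are illustrative choices, not part of the original work.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered vertex pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# Hydrogen-suppressed graph of n-butane, C1-C2-C3-C4, given as G = [V, E].
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4)]
butane = {v: [] for v in V}
for u, v in E:
    butane[u].append(v)
    butane[v].append(u)

# Isobutane: central carbon 2 bonded to carbons 1, 3 and 4.
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(wiener_index(butane), wiener_index(isobutane))  # prints: 10 9
```

The two isomers share the same atoms but differ in connectivity, and the index distinguishes them, which is the kind of structural information a topological descriptor is meant to carry.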
The correlation and prediction of physical-chemistry properties of pure liquids and of mixtures, such as boiling point, density, viscosity, static dielectric constant, and refractive index, is of practical (process design and control) and theoretical (role of the molecular structure in determining the macroscopic properties of the solvent) relevance to both chemists and engineers. Traditionally, procedures for estimating these properties have been based either on theoretical relationships, often making use of empirical parameters that have to be fitted, or on empirical relationships derived from additive-constitutive schemes based on the contributions of atomic groups or bonds within the molecule [8][9][10][11][12]. More recently, the QSPR approach has been applied especially to predict boiling points (BPs), partition coefficients, chromatographic retention indexes, surface tension, critical temperatures, viscosity, refractive index, thermodynamic state functions and static dielectric constant, among other properties. The use of calculated molecular descriptors in QSPR analysis has two main advantages: (a) the descriptors can be univocally defined for any molecular structure or fragment; (b) thanks to the high and well-defined physical information content encoded in many theoretical descriptors, they can clarify the mechanism relating the studied property with the chemical structure. Furthermore, QSPR models based on calculated descriptors aid the understanding of the inter- and intramolecular interactions that are mainly responsible for the behavior of complex chemical systems and processes.
The normal BP (i.e. the boiling point at 1 atm) is one of the major physical-chemistry properties used to characterize and identify a compound. Besides being an indicator for the physical state (liquid or gas) of a compound, the BP also provides an indication of its volatility. In addition, the BPs can be used to predict or estimate other physical properties, such as critical temperatures, flash points, enthalpies of vaporization, etc. [13][14][15]. The BP is often the first property measured for a new compound and one of the few parameters known for almost every volatile compound. Normal BPs are easy to determine, but when a chemical is unavailable, as yet unknown, or hazardous to handle, a reliable procedure for estimating its BP is required. Furthermore, the rapid and nearly explosive growth of combinatorial chemistry, where literally millions of new compounds are synthesized and tested without isolation, could render such a procedure very useful.
A large number of methods for estimating BPs have been devised and numerous QSPR correlations of normal BPs have been reported; detailed reviews have been given elsewhere [15][16][17][18][19][20][21][22]. The aim of this study is to present the results derived from the use of a particular sort of flexible molecular descriptor to estimate the BPs of a representative set of organic molecules, in order to seek better ways of calculating physical-chemistry properties. Previous experience with this issue has shown the convenience of resorting to this special sort of molecular descriptor.
The paper is organized in the following way: the next section deals with the basic methodology, presenting some general properties of flexible molecular descriptors and some of their previous uses. Then, we describe the calculation strategy, after which we give and discuss the results. Finally, our conclusions are presented together with some possible future extensions of the method.

Molecular Descriptors
The basic algebraic expression of the fundamental principle governing the QSAR/QSPR, i.e. the quantitative formula representing the structure-activity/property relationship, is

\[ P = f(\{d\}) \tag{1} \]

where P stands for the activity/property, {d} is a set of molecular descriptors and f is an arbitrary function. The commonest and simplest cases are those where {d} is reduced to just one variable d and f is a linear function, i.e.

\[ P = a + b\,d \tag{2} \]

with a, b ∈ ℜ, where the real numbers a and b are determined by a standard least squares procedure.
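The single-descriptor linear case can be illustrated with a short Python sketch of the standard least squares fit. It writes the linear relation as P = a + b·d, where taking a as the intercept and b as the slope is an assumption of this example; the function name `fit_line` and the numerical data are invented for illustration and are not taken from the paper.

```python
def fit_line(d, p):
    """Ordinary least squares for p = a + b*d (a: intercept, b: slope)."""
    n = len(d)
    mean_d = sum(d) / n
    mean_p = sum(p) / n
    # Slope is the covariance of (d, p) divided by the variance of d.
    s_dd = sum((x - mean_d) ** 2 for x in d)
    s_dp = sum((x - mean_d) * (y - mean_p) for x, y in zip(d, p))
    b = s_dp / s_dd
    a = mean_p - b * mean_d
    return a, b

# Made-up descriptor values and property values lying exactly on p = 1 + 2*d.
descriptor_values = [1.0, 2.0, 3.0, 4.0]
property_values = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(descriptor_values, property_values)
print(a, b)  # prints: 1.0 2.0
```

With real data the points scatter about the line, and the quality of the fit is summarized by statistics such as the correlation coefficient and the standard error.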
Since there are very many ways to choose the set of molecular descriptors, and the descriptors can moreover be highly interrelated, one is led to a difficult situation that has been termed the nightmare of regression analysis. The drawbacks include how to make the selection of descriptors, as well as ambiguities in the criteria used to select optimal descriptors and uncertainties when choosing the order in which descriptors are to be orthogonalized. Naturally, none of these difficulties exists for simple regression based on a single molecular descriptor, particularly if the regression is linear. This is one of the major reasons why researchers are striving to find or to design novel descriptors that would produce a good correlation for a single molecular property of a set of compounds. However, not many molecular properties can be sufficiently well described by a single descriptor [23].
A quite interesting alternative to surmount these difficulties was proposed long ago by Randić [24] and it consists in defining {d} as a function of one or several variables that are determined during the search for the best correlation. Thus, in contrast to the traditional topological indices, which one can calculate after selecting a set of compounds to be studied and then proceed with statistical analysis, the variable indices are initially non-numerical. Hence, they cannot be calculated in advance for the set of compounds. Instead, one starts with an arbitrary set of values for the yet undetermined variables and, through an iterative procedure, one varies these initial values seeking optimal values that will produce the smallest standard error for the property under consideration. It is clear that the use of variable descriptors (also called flexible descriptors) can only improve correlations over the use of simple indices because, if all variables take on a zero value (which is very unlikely), the results coincide with those based on the traditional rigid molecular descriptors. Current literature shows that the use of variable molecular descriptors dramatically improves regression statistics [23].
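The iterative search described above can be sketched in Python as a simple greedy random search. This is only an illustration under stated assumptions: each molecule is reduced to a list of vertex labels, the descriptor is taken to be the product of per-label weights (the combining rule is an assumption of this sketch), and the quantity optimized is the Pearson correlation coefficient, which for a one-variable linear regression is equivalent to minimizing the standard error. The names `descriptor`, `pearson_r`, `optimize_weights` and the toy data are hypothetical.

```python
import math
import random

def descriptor(labels, weights):
    # Assumed form for illustration: product of the weights of all vertex labels.
    value = 1.0
    for label in labels:
        value *= weights[label]
    return value

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant descriptor carries no information
    return sxy / math.sqrt(sxx * syy)

def optimize_weights(molecules, prop, n_steps=2000, step=0.05, seed=0):
    rng = random.Random(seed)
    labels = sorted({label for mol in molecules for label in mol})
    weights = {label: 1.0 for label in labels}  # arbitrary starting values

    def score(w):
        return pearson_r([descriptor(mol, w) for mol in molecules], prop)

    best = score(weights)
    for _ in range(n_steps):
        # Perturb one randomly chosen weight and keep the change if r improves.
        label = rng.choice(labels)
        trial = dict(weights)
        trial[label] = max(0.01, trial[label] + rng.uniform(-step, step))
        trial_score = score(trial)
        if trial_score > best:
            weights, best = trial, trial_score
    return weights, best

# Toy data: vertex labels per molecule and an invented property value for each.
molecules = [["C", "C"], ["C", "C", "C"], ["C", "C", "O"], ["C", "O"]]
prop = [10.0, 20.0, 35.0, 25.0]
weights, r = optimize_weights(molecules, prop)
print(weights, r)
```

Because the weights are fitted to the property, the resulting correlation should always be checked on compounds that were not used in the optimization.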
Among the different ways of choosing flexible molecular descriptors, one of us (A.A.T.) has presented the so-called Optimization of Correlation Weights of Local Graph Invariants (OCWLI) procedure, which has proved to be a rather suitable way to calculate several biological activities and physical-chemistry properties [25][26][27][28][29][30][31][32][33][34]. The OCWLI may be based on the labeled hydrogen filled graph (LHFG) [35] and the graph of atomic orbitals (GAO) [36]. The OCWLI based upon the LHFGs yields reasonably good models of enthalpies of formation from elements of coordination compounds [37]. In addition, the OCWLI based on LHFGs has been used to model the Flory-Huggins polymer-solvent interaction parameters [26]. The OCWLI based upon the GAOs gives rather good results in predicting stability constants of amino acid complexes [36].
Molecular descriptors DCW are calculated by means of a relationship, Eq. (3), that runs over all vertices of the GAO and combines the correlation weights $CW(ao_k)$ and $CW({}^{1}EC_k)$. Here $CW(ao_k)$ is the correlation weight of the atomic orbital that is the image of the k-th vertex in the GAO, and $CW({}^{1}EC_k)$ is the correlation weight of the first-order Morgan extended connectivity of the k-th vertex in the GAO. The Monte Carlo method is then applied to determine the optimum correlation weight values, i.e. those which produce the largest possible value of the correlation coefficient between the physical property and the descriptor computed via Eq. (3). Numerical data of the GAO local invariants are listed in Table 1 and an illustrative example is reproduced in Table 2. Since the complete and detailed description of these flexible descriptors has been given before, we refer the reader interested in further details to the specific papers where they were reported at length [25][26][27][28][29][30][31][32][33][34].
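As a rough illustration of the Morgan extended connectivity used as a local invariant, the Python sketch below takes the zero-order value of a vertex to be its degree and obtains each higher order by summing the previous-order values of the neighbouring vertices. For simplicity it is applied to an ordinary hydrogen-suppressed molecular graph (2-methylbutane) rather than to a GAO, and the function name `morgan_ec` and the vertex numbering are assumptions of this example.

```python
def morgan_ec(adjacency, order):
    """Morgan extended connectivity of the given order for every vertex."""
    # Zero order: the vertex degree.
    ec = {v: len(neighbours) for v, neighbours in adjacency.items()}
    # Each further order: sum of the previous-order values over the neighbours.
    for _ in range(order):
        ec = {v: sum(ec[u] for u in adjacency[v]) for v in adjacency}
    return ec

# 2-Methylbutane carbon skeleton: C1-C2(-C5)-C3-C4.
graph = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}

print(morgan_ec(graph, 0))  # prints: {1: 1, 2: 3, 3: 2, 4: 1, 5: 1}
print(morgan_ec(graph, 1))  # prints: {1: 3, 2: 4, 3: 4, 4: 2, 5: 3}
```

In the scheme described in the text, each distinct value of such an invariant would receive its own correlation weight, in the same way as each kind of atomic orbital does.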

Results and Discussion
We have chosen a representative set of 200 organic molecules of varied composition to study their normal boiling points (NBPs). These molecules, with both linear and cyclic structures, comprise ketones, acids, esters, aldehydes, nitriles, amines, alcohols, and hydrocarbons and contain a wide variety of atoms, such as C, H, O, N, Si, Cl, Br, F, P and S. The list of molecules is given in Table 3, together with their NBPs and the descriptors based on the zero- and first-order extended connectivity in the GAOs ($DCW_0$ and $DCW_1$, respectively). The statistical data are moderately satisfactory, and when Eqs. (4) and (5) are used to predict NBPs there are relatively large deviations for a significant number of molecules.
We then proceed to a more usual calculation procedure when dealing with a large number of molecules, which consists of defining two disjoint sets: a training set to determine the regression equation and a test set to check its predictions. These results are somewhat better than the previous ones and large deviations occur for a smaller number of molecules. Since the choice of the molecules comprising the training and test sets is somewhat arbitrary, we have tested several partitions of the compounds, but the final results are not markedly dependent on the way the molecules in both sets are chosen.

Conclusions
We have presented results on NBPs for a quite diverse molecular set based upon simple linear regression equations depending on a single molecular descriptor, in order to test the capability of a special kind of such parameter: a flexible molecular descriptor. Results are very encouraging and they show the power of such types of topological variables. In fact, although there are some large deviations when employing the complete initial molecular set comprising very diverse organic molecules, the average deviations are quite reasonable. In order to judge the relative merits of the present approach one must take into consideration that a single number is representing a physical-chemistry property (i.e. the NBP) which evidently depends on many molecular features that cannot all be encoded in a single topological descriptor. In order to reproduce a given property accurately, it is necessary to resort to a multivariable regression equation, with each variable taking into account a different molecular feature. Furthermore, one usually employs a set comprising similar molecules, but our main purpose has not been to make exact numerical predictions, but rather to show the real possibilities of a particular kind of flexible topological descriptor. We consider this objective has been fully met. The next step is to complement these calculations using a multivariable approach, based on choosing other molecular descriptors in order to add other physical molecular features which are not included in the OCWLI. Work along this line of research is under way and results will be presented elsewhere very soon.

Table 1. Correlation weights for calculating $DCW_0$ and $DCW_1$.