International Journal of Molecular Sciences
  • Article
  • Open Access

22 May 2007

Prediction of Standard Enthalpy of Formation by a QSPR Model

Department of Chemical Engineering, Faculty of Engineering, University of Tehran, P.O.Box: 11365-4563, Tehran, Iran
*
Author to whom correspondence should be addressed.
This article belongs to the Section Physical Chemistry, Theoretical and Computational Chemistry

Abstract

The standard enthalpies of formation of 1115 compounds from all chemical groups were predicted using genetic algorithm-based multivariate linear regression (GA-MLR). The five-descriptor multivariate linear model obtained by GA-MLR has a squared correlation coefficient of R² = 0.9830. All molecular descriptors that enter this model are calculated from the chemical structure of the molecule. As a result, applying this model to any compound is easy and accurate.

1. Introduction

Physical and thermodynamic property data of compounds are needed in the design and operation of industrial chemical processes. Among them, the standard enthalpy of formation (or standard heat of formation), ΔHf°, is an important fundamental physical property. It is defined as the change of enthalpy that accompanies the formation of 1 mole of a compound in its standard state from its constituent elements in their standard states (the most stable form of the element at 1 atm of pressure and the specified temperature, usually 298 K or 25 degrees Celsius). All elements in their standard states (such as hydrogen gas, solid carbon in the form of graphite, etc.) have a standard enthalpy of formation of zero, as there is no change involved in their formation.
The standard enthalpy change of formation is used in thermochemistry to find the standard enthalpy change of reaction. This is done by subtracting the sum of the standard enthalpies of formation of the reactants from the sum of the standard enthalpies of formation of the products, as shown in the equation below.
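$$\Delta H^{\circ}_{\mathrm{reaction}} = \sum_{p} \Delta H^{\circ}_{f} - \sum_{r} \Delta H^{\circ}_{f} \qquad (1)$$
where $\Delta H^{\circ}_{\mathrm{reaction}}$, $\sum_{p} \Delta H^{\circ}_{f}$, and $\sum_{r} \Delta H^{\circ}_{f}$ are the standard enthalpy change of reaction, the sum of the standard enthalpies of formation of the products, and the sum of the standard enthalpies of formation of the reactants, respectively.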
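As a small worked example of the relation above, the following sketch computes the enthalpy of combustion of methane. The formation enthalpies are common textbook values in kJ/mol and are not taken from this article; the stoichiometric coefficients are written out explicitly as weights on each term, which is an assumption about how the sums are meant to be read.

```python
# Standard enthalpies of formation in kJ/mol at 298 K.
# Textbook values used only as illustrative inputs (not from the article).
DHF = {"CH4(g)": -74.8, "O2(g)": 0.0, "CO2(g)": -393.5, "H2O(l)": -285.8}


def reaction_enthalpy(reactants, products, dhf=DHF):
    """Sum over products minus sum over reactants, weighted by stoichiometry."""
    def side_total(side):
        return sum(n * dhf[species] for species, n in side.items())
    return side_total(products) - side_total(reactants)


# CH4 + 2 O2 -> CO2 + 2 H2O(l)
dh = reaction_enthalpy({"CH4(g)": 1, "O2(g)": 2}, {"CO2(g)": 1, "H2O(l)": 2})
print(round(dh, 1))  # about -890.3 kJ/mol
```

Note that O2 contributes nothing because an element in its standard state has a formation enthalpy of zero.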
Many methods for calculating ΔHf° have been reported in the literature, but only three are widely used: the Benson method [1], the Joback and Reid method [2], and the Constantinou and Gani method [3]. All three are group contribution methods, in which the property of a compound is estimated as the sum of the contributions of the simple chemical groups that occur in its molecular structure. They provide the important advantage of rapid estimates without requiring substantial computational resources.
The application of quantitative structure-property relationship (QSPR) models to the prediction and estimation of physical properties of materials is developing rapidly [4,5]. In QSPR, advanced mathematical methods (genetic algorithms, neural networks, etc.) are used to find a relation between the property of interest and basic molecular properties that are obtained solely from the chemical structure of the compounds and are called “molecular descriptors”.
In this study, a new QSPR model for the prediction of ΔHf° of 1115 organic compounds is presented. These 1115 compounds belong to all families of materials; as a result, the obtained model can be applied to predict ΔHf° for any compound.

2. Procedures and Methods

2.1. Data set

Many compilations of ΔHf° have been published in the literature; among them, we selected the DIPPR 801 compilation [6], which is recommended by the AIChE (American Institute of Chemical Engineers). From this database, 1115 compounds were selected and their ΔHf° values were extracted.

2.2. Calculation of Molecular Descriptors

The calculation of molecular descriptors requires the optimized chemical structures of the compounds. The chemical structures of all 1115 compounds in our data set were drawn in the Hyperchem software [7] and pre-optimized using the MM+ molecular mechanics force field. A more precise optimization was then done with the PM3 semiempirical method in Hyperchem.
In the next step, molecular descriptors were calculated for all 1115 compounds with the Dragon software [8]. Dragon can calculate 1664 molecular descriptors for any chemical structure. After the descriptors had been calculated for all 1115 chemical structures, non-informative descriptors were removed from the Dragon output. First, descriptors with a standard deviation lower than 0.0001 were rejected because they were nearly constant. Second, descriptors with only one value different from the remaining ones were rejected. Third, the pair correlation of every two descriptors was checked, and one descriptor of each pair with a correlation coefficient equal to one (the threshold value) was excluded: for each such pair, the descriptor showing the highest pair correlation with the other descriptors was rejected from the pool.
Finally, the pool of molecular descriptors was reduced by deleting descriptors that could not be calculated for every structure in our data set.
As a result, of the 1664 molecular descriptors initially calculated, only 1477 remained in the pool.
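The screening steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `filter_descriptors`, the tolerance `corr_tol`, the use of NaN to mark a descriptor that could not be calculated, and the use of the summed absolute correlation as the "highest pair correlation with the other descriptors" are all choices made here, not details taken from Dragon or from the authors' workflow.

```python
import numpy as np


def filter_descriptors(X, std_tol=1e-4, corr_tol=1e-9):
    """Return the column indices of X (compounds x descriptors) that survive screening."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.isnan(col).any():            # not computable for every structure
            continue
        if col.std() < std_tol:            # nearly constant descriptor
            continue
        values, counts = np.unique(col, return_counts=True)
        if len(values) == 2 and counts.min() == 1:   # only one value differs
            continue
        keep.append(j)
    if len(keep) < 2:
        return keep
    # Reject one member of every perfectly correlated pair.
    R = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    np.fill_diagonal(R, 0.0)
    alive = np.ones(len(keep), dtype=bool)
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if alive[a] and alive[b] and R[a, b] >= 1.0 - corr_tol:
                # Drop the member that is more correlated with the remaining pool.
                drop = a if R[a, alive].sum() >= R[b, alive].sum() else b
                alive[drop] = False
    return [k for k, ok in zip(keep, alive) if ok]


# Toy descriptor matrix: constant, x1, 2*x1, x2, one-off value, partly missing.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=20), rng.normal(size=20)
one_off = np.zeros(20)
one_off[3] = 1.0
missing = x2.copy()
missing[0] = np.nan
X_demo = np.column_stack([np.full(20, 5.0), x1, 2 * x1, x2, one_off, missing])
kept = filter_descriptors(X_demo)
print(kept)
```

In the toy matrix only two columns should survive: one of the two perfectly correlated columns and the independent column `x2`.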

2.3. Methods of calculation and results

In this step, 20% of our database (223 compounds) was randomly removed and assigned to a test set as an excluded data set. This test set was used in the next steps only for testing the prediction power of the obtained model and was not used for developing the model. The remaining 80% of our data set (892 compounds) was used as the training set.
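A random 80/20 split of this kind can be sketched as follows. The helper name `split_indices` and the fixed seed are assumptions made for reproducibility of the illustration; the article does not state how the random selection was implemented.

```python
import random


def split_indices(n, test_fraction=0.2, seed=0):
    """Randomly split range(n) into (training indices, test indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # seeded shuffle, repeatable
    n_test = round(n * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])


train_idx, test_idx = split_indices(1115)
print(len(train_idx), len(test_idx))  # 892 223
```

The test indices are set aside before any model fitting so that they play no role in descriptor selection.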
The problem at this stage is to find the multivariate linear model with the highest accuracy and the smallest possible number of molecular descriptors. One of the best algorithms for this type of problem was proposed by Leardi et al. [9]. To apply it, a program was written in MATLAB (MathWorks Inc.). The program finds the best multivariate linear model by genetic algorithm-based multivariate linear regression (GA-MLR), as proposed by Leardi et al. [9]; we have used it successfully in our previous work [10-12]. The inputs of the program are the molecular descriptors obtained in the previous section and the desired number of parameters of the multivariate linear model. The fitness function is the cross-validated coefficient. To obtain the best model, the effect of increasing the number of molecular descriptors on the value of the cross-validated coefficient must be considered. When the cross-validated coefficient stays nearly constant as the number of descriptors increases, the search is stopped and the best result has been obtained.
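The search strategy described above can be illustrated with a toy genetic algorithm in Python. This is a minimal sketch, not the authors' MATLAB program and not the exact scheme of Leardi et al.: the selection, crossover, and mutation operators, the population settings, the stopping tolerance `tol`, and all function names are assumptions. The fitness is the leave-one-out Q², computed with the standard hat-matrix shortcut for ordinary least squares.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out Q2 of a linear model with intercept (hat-matrix shortcut)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                      # hat matrix
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))   # leave-one-out residuals
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)


def ga_select(X, y, k, pop_size=30, generations=40, p_mut=0.2, seed=0):
    """Toy GA: search for the k-descriptor subset with the highest Q2."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    fitness = lambda subset: q2_loo(X[:, list(subset)], y)
    pop = [tuple(sorted(rng.choice(p, k, replace=False).tolist()))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the distinct subsets.
        pop = sorted(set(pop), key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(pop) + len(children) < pop_size:
            a, b = rng.integers(0, len(pop), size=2)
            genes = sorted(set(pop[a]) | set(pop[b]))   # crossover: pool both parents
            child = set(rng.choice(genes, k, replace=False).tolist())
            if rng.random() < p_mut:                    # mutation: swap one descriptor
                child.remove(int(rng.choice(sorted(child))))
                child.add(int(rng.choice([j for j in range(p) if j not in child])))
            children.append(tuple(sorted(child)))
        pop = pop + children
    return max(pop, key=fitness)


def smallest_good_model(X, y, max_k=6, tol=0.005):
    """Increase the number of descriptors until Q2 stops improving by more than tol."""
    best, best_q2 = None, -np.inf
    for k in range(1, max_k + 1):
        subset = ga_select(X, y, k)
        q2 = q2_loo(X[:, list(subset)], y)
        if q2 - best_q2 < tol:
            break
        best, best_q2 = subset, q2
    return best, best_q2


# Synthetic data: only descriptors 1, 4 and 6 carry signal.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 8))
y_toy = 3 * X_toy[:, 1] - 2 * X_toy[:, 4] + X_toy[:, 6] + 0.1 * rng.normal(size=60)
best_subset, best_q2 = smallest_good_model(X_toy, y_toy)
print(best_subset, round(best_q2, 3))
```

On the synthetic data the loop is expected to stop at three descriptors, because adding a fourth one no longer raises Q² noticeably.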
To obtain the best multivariate linear model, we started with one-descriptor models and found the best of them; then two-descriptor models were tested and the best two-descriptor model was found. This procedure was repeated with an increasing number of descriptors until a further increase no longer affected the accuracy of the best model. The best model obtained has six parameters (five descriptors and an intercept) and is presented below:
$$\Delta H^{\circ}_{f} = 50.1688 - 80.52012\,\mathit{nSK} + 53.64546\,\mathit{SCBO} - 169.21889\,\mathit{nO} - 174.75477\,\mathit{nF} - 266.57659\,\mathit{nHM} \qquad (2)$$
where the molecular descriptors of Eq.(2) and their meaning are presented in Table 1.
Table 1. The molecular descriptors of Eq. (2) and their meaning.
The statistical parameters of the fit for Eq. (2) are the following: R² = 0.9830, F = 10239.02, s = 58.541, Q² = 0.9826, where R² is the squared correlation coefficient, F is the Fisher factor, s is the standard deviation, and Q² is the squared cross-validated correlation coefficient. The statistical parameters of the coefficients of Eq. (2) are presented in Table 2.
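For readers who want to reproduce such fit statistics, the sketch below computes R², s, and F for an ordinary least-squares model with an intercept, using the usual textbook definitions. Whether the authors used exactly these definitions is an assumption; the function names are illustrative. As a plausibility check, inserting the rounded R² = 0.9830 with 892 training compounds and five descriptors into the second helper gives an F value of roughly 1.02 × 10^4, the same size as the value reported above.

```python
import numpy as np


def fit_statistics(X, y):
    """Return (coefficients, R2, s, F) of an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coef) ** 2)          # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    r2 = 1.0 - sse / sst
    s = np.sqrt(sse / (n - p - 1))             # standard error of the estimate
    f = ((sst - sse) / p) / (sse / (n - p - 1))
    return coef, r2, s, f


def f_from_r2(r2, n, p):
    """Fisher factor implied by R2 for n samples and p descriptors."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


coef, r2, s, f = fit_statistics([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 8.0])
print(round(r2, 4), round(s, 4), round(f, 2))
print(round(f_from_r2(0.9830, 892, 5)))
```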
Table 2. The values of the constants of Eq. (2) and their statistical interpretations.

2.4. Validation of Model

Many techniques are available for checking the validity of the obtained model [13].
Todeschini et al. [13] presented a quick rule for checking the validity of an obtained model. The rule compares the multivariate correlation index KX of the X-block of predictor variables with the multivariate correlation index KXY of the augmented X-block matrix obtained by adding the column of the response variable. According to this rule, if KXY is greater than KX, the model is predictive [13]. The values obtained for these two indexes in our problem are KX = 31.62 and KXY = 40.81; therefore, by this quick rule, the obtained model is predictive (KXY > KX).
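The quick rule can be sketched as follows, assuming that the multivariate correlation index K is computed from the eigenvalues λ of the correlation matrix of a block with p columns as K = Σ|λ/Σλ − 1/p| / (2(p − 1)/p), expressed here in percent. This formula is stated as an assumption and should be checked against ref. [13] before being relied on; the function names are illustrative.

```python
import numpy as np


def k_index(M):
    """Multivariate correlation index K (in percent) of the columns of M."""
    M = np.asarray(M, dtype=float)
    p = M.shape[1]
    eigvals = np.linalg.eigvalsh(np.corrcoef(M, rowvar=False))
    spread = np.abs(eigvals / eigvals.sum() - 1.0 / p).sum()
    return 100.0 * spread / (2.0 * (p - 1) / p)


def quick_rule(X, y):
    """Return (KX, KXY, predictive) where predictive means KXY > KX."""
    kx = k_index(X)
    kxy = k_index(np.column_stack([X, y]))
    return kx, kxy, kxy > kx


# Toy data: three nearly independent descriptors and a response built from them.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy.sum(axis=1) + 0.1 * rng.normal(size=200)
kx, kxy, predictive = quick_rule(X_toy, y_toy)
print(round(kx, 2), round(kxy, 2), predictive)
```

Under this definition K is 0 for mutually uncorrelated columns and 100 for perfectly correlated ones, so adding a response that is well explained by the descriptors raises the index.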
Cross-validation is the most common validation technique [13]. In this technique, each member of the data set is deleted in turn, a model is built with the remaining members, and the value of the deleted object is predicted. This is repeated for all members of the data set, and finally a squared cross-validated correlation coefficient is obtained. For our problem, the squared cross-validated correlation coefficient (Q²) was 0.9826. The difference between R² and Q² is small (0.9830 versus 0.9826), and thus the validity of the model is confirmed by this technique.
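The leave-one-out procedure described above translates directly into a short loop. The sketch below is illustrative only: the function name and the toy data are assumptions, and Q² is taken as 1 − PRESS/TSS with the total sum of squares computed around the mean of all responses.

```python
import numpy as np


def q2_leave_one_out(X, y):
    """Refit the model n times, each time predicting the single deleted object."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                       # delete object i
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2             # error on the deleted object
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(1)
X_toy = rng.normal(size=(30, 2))
y_toy = 1.0 + 2.0 * X_toy[:, 0] - X_toy[:, 1] + 0.2 * rng.normal(size=30)
q2 = q2_leave_one_out(X_toy, y_toy)
print(round(q2, 4))
```

For ordinary least squares the same number can be obtained without refitting by dividing each residual by one minus its leverage, which is much cheaper for large data sets.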
Another validation technique is the bootstrap [13]. In this technique, validation is performed by randomly generating training sets with sample repetitions and then evaluating the predicted responses of the samples not included in the training set. This procedure is usually repeated thousands of times. After 5000 repetitions, the parameter Q²Boot was 0.9823. The differences between Q²Boot, Q², and R² are small, and thus the predictive power of the model is confirmed.
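A bootstrap validation of this kind can be sketched as below. Several definitions of the bootstrap Q² exist; here the prediction errors on the left-out samples are pooled over all rounds and compared with the deviations from the mean of each bootstrap training set, which is an assumption and not necessarily the definition used by the authors. The number of rounds is kept small for the illustration.

```python
import numpy as np


def q2_bootstrap(X, y, n_rounds=200, seed=0):
    """Pooled Q2 over bootstrap rounds, scored on the samples left out of each round."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = tss = 0.0
    for _ in range(n_rounds):
        boot = rng.integers(0, n, size=n)          # training set with repetitions
        out = np.setdiff1d(np.arange(n), boot)     # samples not drawn this round
        if out.size == 0:
            continue
        coef, *_ = np.linalg.lstsq(A[boot], y[boot], rcond=None)
        press += np.sum((y[out] - A[out] @ coef) ** 2)
        tss += np.sum((y[out] - y[boot].mean()) ** 2)
    return 1.0 - press / tss


rng = np.random.default_rng(2)
X_toy = rng.normal(size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + 0.1 * rng.normal(size=50)
q2_boot = q2_bootstrap(X_toy, y_toy)
print(round(q2_boot, 4))
```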
Finally, the last validation technique we used was external validation. Here the prediction power of Eq. (2) was checked by means of the test set that had been separated from the original data set. The squared cross-validated coefficient for the test set (Q²ext) is 0.9894; the small difference between this value and the value of Q² shows the prediction power of Eq. (2).
The calculated and DIPPR 801 values of ΔHf° for the training set are presented in Table 3, and the predicted and DIPPR 801 values of ΔHf° for the test set are presented in Table 4. The comparison between the results of Eq. (2) and the DIPPR 801 values for the training set and the test set is shown in Figure 1.
Table 3. The obtained results from Eq. (2) for training set.
Table 4. The predicted ΔHf° by the Eq. (2) for test set as an excluded data set.
Figure 1. Comparison between the results of Eq. (2) for the training set and the predicted values for the test set.

3. Discussion

In the formation of a molecule from its constituent elements, ΔHf° is the difference between the enthalpy of the molecule and that of the elements that compose it. This enthalpy results from breaking the bonds of the elements in their free form (breaking reaction) and forming new bonds in the product molecule (formation reaction). The breaking reaction is endothermic, whereas the formation reaction is exothermic.
Anything that affects the bond properties and the strength of the bonds in a molecule can affect its ΔHf°. In particular, the number of atoms, the number and order of the bonds, and the number of non-organic elements (heavy atoms) in a molecule directly affect the value of ΔHf°.
An increase in the number of atoms in the H-depleted chemical structure of a molecule decreases its ΔHf°, whereas an increase in the order of its bonds increases ΔHf°. The number of atoms that commonly occur in molecules, such as oxygen and fluorine, and even the number of heavy atoms also affect ΔHf°: an increase in the number of these atoms in a molecule decreases its ΔHf°.

4. Conclusions

In the present study, a simple five-descriptor linear model was presented. The model is the result of a QSPR study on the standard enthalpy of formation of 1115 compounds. Because these compounds were selected from all families of compounds, there is no specific limit on the application of the model. Its simplicity of use is another advantage.
All molecular descriptors of this model can be easily calculated from the chemical structure of a molecule.

Acknowledgment

The authors gratefully acknowledge Mr. Reza Barzin from the University of California (San Diego) for his help with this project.

References

  1. Benson, S.W. Thermochemical Kinetics; Wiley: New York, 1968.
  2. Joback, K.G.; Reid, R.C. Estimation of pure-component properties from group contributions. Chem. Eng. Comm. 1987, 57, 233.
  3. Constantinou, L.; Gani, R. New group contribution method for estimating properties of pure compounds. AIChE J. 1994, 40, 1697.
  4. Katritzky, A.R.; Fara, D.C. How chemical structure determines physical, chemical, and technological properties: an overview illustrating the potential of quantitative structure-property relationships for fuels science. Energy & Fuels 2005, 19, 922.
  5. Taskinen, J.; Yliruusi, J. Prediction of physicochemical properties based on neural network modeling. Adv. Drug Delivery Rev. 2003, 55, 1163.
  6. Design Institute for Physical Properties Research (DIPPR), American Institute of Chemical Engineers, Project 801, 2006.
  7. Hyperchem Release 7.5 for Windows, Molecular Modeling System, Hypercube, Inc., 2002.
  8. Talete srl, Dragon for Windows (Software for Molecular Descriptor Calculation). Version 5.4, 2006. http://www.talete.mi.it/.
  9. Leardi, R.; Boggia, R.; Terrile, M. Genetic algorithms as a strategy for feature selection. J. Chemometrics 1992, 6, 267.
  10. Gharagheizi, F. QSPR analysis for intrinsic viscosity of polymer solutions by means of GA-MLR and RBFNN. Comput. Mater. Sci.
  11. Gharagheizi, F. QSPR studies for solubility parameter by means of genetic algorithm-based multivariate linear regression and generalized regression neural network. QSAR Comb. Sci. In Press.
  12. Gharagheizi, F.; Mehrpooya, M.; Vatani, A. Prediction of standard chemical exergy by a three-descriptors QSPR model. Energ. Convers. Manage.
  13. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Mannhold, R., Kubinyi, H., Timmerman, H., Series Eds.; Wiley-VCH: Weinheim, 2000.
