Revision and Extension of a Generally Applicable Group-Additivity Method for the Calculation of the Standard Heat of Combustion and Formation of Organic Molecules

The calculation of the heats of combustion ΔH°c and formation ΔH°f of organic molecules at standard conditions is presented using a generally applicable computer algorithm based on the group-additivity method. This work is a continuation and extension of an earlier publication. The method rests on the complete breakdown of the molecules into their constituent atoms, these being further characterized by their immediate neighbor atoms. The group contributions are calculated by means of a fast Gauss–Seidel fitting calculus using the experimental data of 5030 molecules from the literature. The applicability of this method has been tested by a subsequent ten-fold cross-validation procedure, which confirmed the extraordinary accuracy of the prediction of ΔH°c with a correlation coefficient R² and a cross-validated correlation coefficient Q² of 1, a standard deviation σ of 18.12 kJ/mol, a cross-validated standard deviation S of 19.16 kJ/mol, and a mean absolute deviation of 0.4%. The heat of formation ΔH°f has been calculated from ΔH°c using the standard enthalpies of combustion of the elements, yielding a correlation coefficient R² for ΔH°f of 0.9979 and a corresponding standard deviation σ of 18.14 kJ/mol.


Introduction
The present compilation of data on the heat of combustion and formation of more than 5000 organic molecules and their comparison with theoretical calculations based on a generally applicable atom-group additivity method is a continuation of theoretical studies on the prediction of various molecular descriptors published in an earlier paper [1]. While that publication primarily focused on the extraordinary versatility of the applied version of the atom-group additivity method for a large number of descriptor predictions, which has been proven by its extension to several further molecular descriptors in subsequent papers [2][3][4][5][6], the present interest rests on the further increase in the trustworthiness of the calculated heats of combustion and formation and their extension to compound classes not yet covered by the earlier paper, particularly the ionic liquids. Previous versions of heat-of-combustion calculations have been based on the additivity of bond energies [7][8][9], on empirical relations within a series of molecules and their heat of combustion [10,11], on the "heat of atomization" [12], on the combustion value of the electrons in a molecule, corrected for its structural and functional features [13,14], or on the "molecular oxygen balance" [15], all of them outlined in more detail in [1]. An indirect approach to the prediction of the heat of combustion, made possible by the direct interdependence of the two quantities, is via the calculation of the heat of formation of a molecule, which is accessible either through elaborate quantum-theoretical methods (e.g., [16]) or through a group-additivity method [17,18] similar to the present one. Most of these approaches have been optimized for a certain class of compounds and are therefore not generally applicable. In contrast, the present calculation method is easily extendable and in principle enables the calculation of the heat of combustion and formation of literally any organic molecule under the sun.

Method
The calculations rest upon a database of at present 34,380 molecules, recorded in their geometry-optimized 3D conformation, encompassing pharmaceuticals, plant protectors, dyes, ionic liquids, liquid crystals, metal-organics, intermediates, and many more. Among many further experimentally determined and calculated molecular descriptors, the published experimental combustion and/or formation enthalpies have been stored for 5560 of these molecules. In order to avoid structural ambiguity, all six-membered aromatic rings have been defined by six aromatic bonds, in contrast to the more commonly used style of alternating single and double bonds. Furthermore, for the same reason, the positive charge in amidinium, pyrazolium, and guanidinium fragments is positioned on the carbon atom between the nitrogen atoms, incidentally in better conformance with the true situation, as shown in, e.g., Figure 1 in [3]. (For the carboxylate or the nitro group, the analogous consideration of charge equilibration is not required within the present atom-group concept, as they are unambiguously defined.) Finally, compounds containing both acidic and basic groups, in particular primary alkylamines (e.g., amino acids) or guanidines (e.g., in creatine or arginine), are treated as zwitter-ionic molecules.

Definition of the Atom Groups
The principle of the breakdown of a molecule into its atom groups in a computer-readable form has been outlined in detail in [1]. Consequently, their naming and meaning are retained in the present work as explained in Table 1 of [1]. However, since then, a number of further atom groups had to be added to the group-contribution parameter set in order to cover the considerable number of additional, structurally variable molecules. In particular, the inclusion of ordinary salts and ionic liquids required the charged atom groups listed and explained in Table 1, which the computer algorithm interprets in the same way as the remaining ones. (Some of these atom groups have already been introduced for the calculation of the liquid viscosity of molecules in [3].)
The atom groups do not take into account the characteristics of the molecules' three-dimensional structures, such as intramolecular hydrogen-bridge bonds, intramolecular H-H interactions, or ring-strain forces. These effects have summarily been considered by means of the special groups listed and explained in Table 2, wherein the column titles are not to be interpreted literally. With regard to the ring-strain contributions (Angle60, Angle90, and Angle102), caused by forced angle constriction at each ring atom in small rings, it should be stressed that the calculated values inherently also encompass the effect of the compensatory angle widening between the ring atoms and any further atoms attached to them (e.g., the H-C-H and H-C-C angles on cyclopropane). These special groups are treated just like the ordinary atom groups in the calculation of their contribution as well as the subsequent molecular descriptor value.

Calculation of the Group Contributions
The parameter values of the atom and special groups are calculated in four steps, outlined in detail in [1]. The first step creates a temporary compounds list and adds to it those compounds from the database for which the experimental heat of combustion is known. Secondly, for each of the "backbone" atoms (i.e., atoms bound to at least two other direct neighbor atoms) in the molecules, its atom group is assigned according to the rules given in [1], corresponding to the atom type and neighbors' terms listed in Table 4, and then its occurrence in the molecule is counted. Next, an M × (N + 1) matrix is generated, where M is the number of molecules and N + 1 is the number of atom and special groups of Table 4 plus one column for the molecules' experimental heats of combustion, and where each matrix element (i,j) receives the number of occurrences of the jth atom or special group in the ith molecule. Finally, normalization of this matrix into an Ax = B matrix and its subsequent balancing using a fast Gauss–Seidel calculus [19] yields the group contributions x, which are shown in Table 4.
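The last two steps can be illustrated with a minimal Python sketch. It assumes that "normalization into an Ax = B matrix" means forming the least-squares normal equations, which is one common reading and is not confirmed by the text. The function names, the group labels CH3 and CH2, and all numbers are hypothetical; the three "experimental" values are synthetic and were generated from assumed contributions of -780 and -655 kJ/mol.

```python
def build_matrix(molecules, groups):
    """Rows of group counts plus the vector of experimental values."""
    rows = [[float(counts.get(g, 0)) for g in groups] for counts, _ in molecules]
    targets = [float(value) for _, value in molecules]
    return rows, targets


def normal_equations(rows, targets):
    """Form the square system (A^T A) x = A^T b (assumed meaning of 'normalization')."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return ata, atb


def gauss_seidel(a, b, max_sweeps=10_000, tol=1e-9):
    """Solve a x = b iteratively, reusing updated components within each sweep."""
    x = [0.0] * len(b)
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(len(b)):
            if a[i][i] == 0.0:  # group absent from every molecule: leave at 0
                continue
            rest = sum(a[i][j] * x[j] for j in range(len(b)) if j != i)
            updated = (b[i] - rest) / a[i][i]
            largest_change = max(largest_change, abs(updated - x[i]))
            x[i] = updated
        if largest_change < tol:
            break
    return x


# Synthetic data built from assumed contributions CH3 = -780, CH2 = -655 (kJ/mol).
groups = ["CH3", "CH2"]
molecules = [
    ({"CH3": 2}, -1560.0),
    ({"CH3": 2, "CH2": 1}, -2215.0),
    ({"CH3": 2, "CH2": 2}, -2870.0),
]
rows, targets = build_matrix(molecules, groups)
contributions = gauss_seidel(*normal_equations(rows, targets))
print(dict(zip(groups, contributions)))
```

Because the normal-equations matrix is symmetric and, when every group occurs and the columns are independent, positive definite, the Gauss–Seidel sweeps converge without pivoting, which is presumably part of what makes the approach fast for several hundred groups.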

Calculation of the Standard Heats of Combustion and Formation
The subsequent calculation of the heat of combustion ΔH°c is a simple summation of the contributions of the atom groups in a molecule using the values shown in Table 4, applying Equation (1), wherein a_i and b_j are the contribution values, A_i is the number of occurrences of the ith atom group, and B_j is the number of occurrences of the jth special group.
$$\Delta H^{\circ}_{c} = \sum_{i} a_i A_i + \sum_{j} b_j B_j \qquad (1)$$
It is immediately evident that these calculations are limited to compounds for which every atom group they contain (excluding the special groups) has a corresponding entry in Table 4. Beyond this, in order to obtain reliable results, only "valid" group contributions are to be used, i.e., contributions that have been supported in the group-parameters calculation by at least three independent molecules, as indicated by a number exceeding 2 in the rightmost column of Table 4. As a consequence, the statistics data at the bottom of Table 4 show that the number of compounds for which the heat of combustion is finally calculated (lines B, C, and D) is smaller than that on which the computation of the complete set of group contributions is based (line A).
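The summation together with the validity rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name, the table layout (contribution, number of supporting molecules), the group labels, and all numerical values are hypothetical placeholders and are not taken from Table 4.

```python
def predict_combustion_heat(group_counts, table, min_molecules=3):
    """Sum contribution * occurrence over all groups of one molecule.

    table maps a group name to (contribution in kJ/mol, supporting molecules).
    Raises KeyError for a group missing from the table and ValueError for a
    group supported by fewer than min_molecules molecules.
    """
    total = 0.0
    for group, occurrences in group_counts.items():
        if group not in table:
            raise KeyError(f"no parameter for group {group!r}")
        contribution, n_molecules = table[group]
        if n_molecules < min_molecules:
            raise ValueError(f"group {group!r} is not 'valid'")
        total += contribution * occurrences
    return total


# Hypothetical parameter table; the values are placeholders, not Table 4 entries.
table = {
    "CH3": (-780.0, 120),
    "CH2": (-655.0, 95),
    "rare_group": (-500.0, 2),  # supported by only two molecules
}

print(predict_combustion_heat({"CH3": 2, "CH2": 1}, table))
```

Raising an error instead of silently skipping an unsupported group mirrors the restriction described above: a molecule with even one missing or "invalid" atom group receives no prediction at all.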
The heat of formation of a molecule follows immediately from its heat of combustion as the difference between the summed standard enthalpies of combustion of its constituent elements, as given in [20,21], and the heat of combustion of the molecule.
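As a worked illustration of this conversion, the following Python sketch applies it to a compound containing only C, H, and O. The element values are an assumption: they are the commonly tabulated standard enthalpies of formation of CO2(g) and H2O(l), which equal the combustion enthalpies of graphite and H2, and they are not necessarily the values the authors took from [20,21]. The function and dictionary names are hypothetical. The input is the experimental heat of combustion of diketene, C4H4O2, quoted in the text.

```python
# Combustion enthalpy per atom of element in its standard state, kJ/mol.
# Assumed values: C(graphite) -> CO2(g) and 1/2 H2 -> 1/2 H2O(l); O2 does not burn.
ELEMENT_COMBUSTION = {
    "C": -393.51,
    "H": -285.83 / 2,
    "O": 0.0,
}


def formation_from_combustion(heat_of_combustion, formula):
    """dHf = sum(n_e * dHc(element e)) - dHc(compound), all in kJ/mol."""
    elements = sum(n * ELEMENT_COMBUSTION[el] for el, n in formula.items())
    return elements - heat_of_combustion


diketene = {"C": 4, "H": 4, "O": 2}
dhf = formation_from_combustion(-1913.4, diketene)
print(f"{dhf:.1f} kJ/mol")
```

The sign convention follows from Hess's law: burning the elements and burning the compound lead to the same products, so the difference between the two combustion enthalpies is the enthalpy of forming the compound from its elements.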
In Table 3, a simple example may explain the use of Table 4: the experimental heat of combustion of 4-methylene-2-oxetanone (diketene) is −1913.4 kJ/mol [21]. The atom groups and the special group defining this compound are collected in Table 3 and yield a calculated value of −1903.2 kJ/mol.

Cross-Validation Calculations
The results of the heat-of-combustion calculations are immediately tested for plausibility using a 10-fold cross-validation algorithm, requiring 10 recalculations that guarantee that each compound of the complete set has been used once as a test sample. The corresponding training and test data are added to each of the molecule files, and the respective statistics data are collected at the bottom of Table 4. Again, due to the 10% smaller number of training molecules used in the 10 cross-validation calculations, the number of compounds for which the heat of combustion is evaluated as a test value (lines E, F, G, and H) is even smaller than that of the training set (lines B, C, and D). The statistics data of Table 4 also show that the number of "valid" groups in line A is significantly lower than the total number of atom and special groups. The residual "invalid" groups, although at present not applicable for heat-of-combustion calculations, have been left in Table 4 for future use in this continuing project. Interested scientists may want to help increase the number of "valid" groups in this database by contributing molecules carrying the under-represented atom groups. At present, the list of elements for heat-of-combustion calculations is limited to H, B, C, N, O, P, S, Si, and/or halogen.
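The cross-validation loop can be sketched as follows. This is a generic k-fold illustration, not the authors' code: the interleaved fold assignment, the function names, and the one-parameter toy model are assumptions, and Q² and S are computed with the usual definitions (1 minus the ratio of the predictive residual sum of squares to the total sum of squares, and the root mean square of the test residuals), which the text does not spell out. A prediction of None stands for a test molecule whose groups are not all "valid" in the reduced training set.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every index is a test sample exactly once."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def cross_validate(samples, fit, predict, k=10):
    """Return (Q2, S, number of evaluated test samples) for (x, y) samples."""
    observed, predicted = [], []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train])
        for i in test:
            x, y = samples[i]
            y_hat = predict(model, x)
            if y_hat is None:  # not predictable from this training fold
                continue
            observed.append(y)
            predicted.append(y_hat)
    mean = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean) ** 2 for o in observed)
    return 1.0 - press / total, (press / len(observed)) ** 0.5, len(observed)


# Toy model: one hypothetical group with contribution c, fitted through the origin.
def fit(train):
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)


def predict(c, x):
    return c * x


samples = [(float(n), -650.0 * n) for n in range(1, 21)]  # synthetic, noise-free
q2, s, evaluated = cross_validate(samples, fit, predict)
print(q2, s, evaluated)
```

With real data, the test samples that return None are the reason why fewer compounds appear in the cross-validation statistics than in the training statistics.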

Sources of Heat-of-Combustion and Formation Data
The present list of references encompasses the sources for the experimental standard heats of combustion as well as those of formation, because the input of the heat of combustion into a molecule's database immediately also triggers the calculation and addition of its heat of formation and vice versa. Experimental data given in kcal/mol are translated into kJ/mol by multiplication with 4.1858.

Heat of Combustion
The first preliminary calculations of the group contributions were based on the complete set of 5560 compounds for which experimental heats of combustion and/or formation were available. However, contrary to the approach in the earlier paper [1], a further restriction was introduced in that only those compounds were allowed to remain in the subsequent calculations whose experimental values did not deviate from the cross-validated calculated value by more than three times the cross-validated standard error. Accordingly, the final group contributions rested on 5030 compounds, as shown on row A in Table 4. The discarded molecules have been collected in an outliers list, available with the Supplementary Materials. As a consequence, the correlation coefficient Q² is even better than the previously published value of 0.9999 and is now indistinguishable from 1 (row F in Table 4). Analogously, the new cross-validated standard error of 19.16 kJ/mol (row H in Table 4) is considerably better than the earlier one of 25.2 kJ/mol. Not surprisingly, the mean absolute deviation over 4886 compounds is just 0.4% over a calculated heat-of-combustion range from −72 kJ/mol (hydrogen peroxide) to −35,112.2 kJ/mol (glycerol trioleate). These excellent statistical data are well reflected in the straight line of the data points in the correlation diagram of Figure 1 and the symmetrically balanced Gaussian bell curve of the histogram in Figure 2. The only downside, however, is the much longer list of 390 atom and special groups required (compared to the 273 of the earlier paper [1]), of which only 267 are "valid" for predictions. However, the latter still enable the calculation of the heats of combustion and formation of presently 29,067 molecules, i.e., ca. 84.5% of the complete dataset. The complete set of molecules used for the group-parameters calculations is available in the Supplementary Materials.
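The outlier criterion and the percentage deviation can be illustrated with a short Python sketch. It is a simplified reading of the procedure: the function names are hypothetical, the final cross-validated standard error of 19.16 kJ/mol is used as a fixed threshold although the actual filtering was presumably iterated, and the mean absolute deviation in percent is assumed to be the mean of |experimental − calculated| / |experimental|. The diketene values are those quoted in the text; "compound X" is an invented record.

```python
def filter_outliers(records, cv_standard_error, factor=3.0):
    """Split (name, experimental, cv_predicted) records into kept and outliers."""
    kept, outliers = [], []
    for name, experimental, cv_predicted in records:
        if abs(experimental - cv_predicted) > factor * cv_standard_error:
            outliers.append(name)
        else:
            kept.append(name)
    return kept, outliers


def mean_absolute_percentage_deviation(pairs):
    """Mean of |exp - calc| / |exp| over (exp, calc) pairs, in percent."""
    return 100.0 * sum(abs((e - c) / e) for e, c in pairs) / len(pairs)


records = [
    ("diketene", -1913.4, -1903.2),    # values quoted in the text
    ("compound X", -2500.0, -2400.0),  # hypothetical record, deviates by 100 kJ/mol
]
kept, outliers = filter_outliers(records, cv_standard_error=19.16)
print(kept, outliers)
print(mean_absolute_percentage_deviation([(-1913.4, -1903.2)]))
```

The threshold of three standard errors corresponds to 57.48 kJ/mol here, so the 10.2 kJ/mol deviation of diketene is well inside the accepted range.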
The extraordinary accuracy of the predictions allows a deeper analysis of the actual structural state of certain classes of molecules for which alternative structures are possible at standard conditions, in particular as to which prototropic forms are prevailing in amino acids and which tautomeric form is prevalent in compounds that may exist in both hydroxyazo and hydrazone or keto and enol forms. Beyond this, an educated estimate as to what the enthalpy difference is between the alternative forms might be possible.

Amino Acids
It is common knowledge that amino acids exist in zwitterionic form both in the crystalline and in the liquid state [527], whereas in the gas phase they exist in their non-ionic form. To our knowledge, the difference in the enthalpies of combustion between these two forms has not yet been systematically analyzed. In Table 5, the calculated values for the non-ionic and zwitter-ionic forms of a series of amino acids are compared with their experimental data [269]. The average ΔH°c difference was calculated as ca. 61.5 kJ/mol, with the non-ionic form exhibiting the more negative value. Cystine is an outlier in that it contains two amino acid functions. Interestingly, sarcosine (N-methylglycine) shows the lowest difference between the two forms, which is due to the fact that it carries a less basic dialkylamino group. Similarly, N-phenylglycine differs from the remaining amino acids by an amino group that is conjugated to the phenyl ring, again lowering its basicity. Except for these special cases, the experimental values are in better compliance with the calculated values of the zwitter forms.

Azo-Hydrazone Tautomerism
The observation of the hydroxyazo-hydrazone tautomerism is well known among dye chemists dealing with azo dyes, as it has a drastic effect on the electronic absorption spectra. In an earlier paper [1], it was demonstrated that the direction of the tautomeric equilibrium is fairly predictable on the basis of the calculated heats of formation of the hydroxyazo and the hydrazone form. Analogously, the heats of combustion, now founded on a much larger structural basis, should confirm these observations, with the less negative enthalpy indicating the dominating form. Indeed, in conformance with experimental observation, the calculated values listed in Table 6 confirm that arylazo-naphthols primarily exist in their hydrazone form, whereas the opposite is true for the arylazo-naphthylamines. On the other hand, the small enthalpy difference found between the two forms of the phenylazophenols confirms their weak tendency to tautomerize. In addition, the available experimental heats of combustion for 4-phenylazophenol and 4-aminoazobenzene are in fairly good agreement with their prevailing forms.

Keto-Enol Tautomerism
Prediction of the dominant forms in keto-enol tautomers under standard conditions has been shown to be at best coincidental in [1], which is not surprising in view of the mostly small enthalpy differences between the two forms. Recalculated values of the heat of combustion of the example molecules in [1], based on the updated group-parameters set, are compared with their experimental values, where available, in Table 7. As is evident, except for acetone, the enol form is predicted to be the dominant tautomer throughout, which clearly contradicts experience, most prominently with cyclohexanone and cyclopentanone. Beyond this, the experimental values are of no help despite the small cross-validated standard error S of 19.16 kJ/mol (see Table 4), because the deviations of the calculated enthalpies of both forms from the experimental value are well within the tolerated boundaries.

Ionic Liquids
The main extension of the present atom-group additivity method enabled the inclusion of the heats of combustion of the ionic liquids. Unfortunately, of the 679 ionic liquids presently stored in the database, for only 28 of them could the experimental heat of combustion be compared with a calculated value to date, owing to the restrictions mentioned earlier. They essentially cover nitrates, dicyanamides, sulfates, dialkyldithiocarbamates, and halides of various imidazolium, ammonium, and glycinium cations. In Table 8, these compounds are listed, and their experimental values are compared with the calculated ones. Their conformance is exceptionally good, resulting in a mean absolute deviation of only 0.23%.

Heat of Formation
The heat of formation has been calculated indirectly from the calculated heat of combustion for each compound for which experimental data were available, using the heats of combustion of the elements given in [20,21]. Accordingly, the same restrictions concerning the "validity" of the atom groups as well as the elements themselves apply. Therefore, the number of compounds in the correlation diagram of Figure 3 is identical with that of Figure 1. However, due to the distinctly smaller range of heat-of-formation values from −7238.2 kJ/mol (perfluorohexadecane) to +1039.7 kJ/mol (2,4,6-triazido-s-triazine) and the error-propagation effect, the correlation coefficient R² is "only" 0.9979, and since the standard error σ is still 18.14 kJ/mol, the mean absolute deviation is 27.23%. The histogram of Figure 4 again confirms the symmetrical Gaussian error distribution of the experimental heats of formation about the calculated ones.

Conclusions
The present paper is proof of the easy expandability of the group-additivity method outlined in [1] for the calculation of the heats of combustion and formation of, in principle, any organic molecule. The large number of more than 5000 molecules upon which the atom-group parameters are based allowed strict filtering out of the worst outliers without undue sacrifice of "invalidated" atom groups, resulting in an as-yet unsurpassed accuracy of the predicted heat of combustion with a mean absolute deviation of only 0.4% for up to 84.5% of nearly any kind of organic compound. Beyond this, the present method basically allows the accurate calculation of a molecule's heat of combustion simply by means of paper and pencil, using the group parameters presented in Table 4. As this work is ongoing, the number of compounds for which up to 17 physical, thermodynamic, solubility-, optics-, charge-, and environment-related descriptors [1][2][3][4][5][6] can be reliably predicted on the basis of the same algorithm will steadily increase.


Supplementary Materials:
The following are available online. The list of compounds used in the present work, their experimental data and 3D structures are available online as standard SDF files, accessible for external chemistry software, under the name of "