1. Summary
We present a dataset for the prediction of thermodynamic parameters for compounds and its application to a set of 16,411 small molecules [1]. The dataset addresses the gas-phase enthalpy of formation, the entropy of molecules, and the change of both parameters with temperature. The dataset contains a compilation of measured enthalpy of formation for 2869 compounds, measured entropy for 1403 compounds, and temperature dependence of parameters for 172 compounds. These can be used as a reference source in their own right, or used to build a model for predicting these values for new compounds. We describe building such a model and applying it to 16,411 small molecules in the ‘All Small Molecules’ collection [1].
For context, we first provide material on the importance of enthalpy and entropy. The enthalpy and entropy of formation of a compound are key parameters for predicting whether reactions involving a compound will proceed spontaneously under isobaric and isothermal (constant pressure and temperature) conditions. Specifically, if the Gibbs free energy change (ΔG) of a system in which a chemical reaction takes place is negative, and the system is at constant pressure and temperature, then that system, at equilibrium, will contain more of the products of that reaction than of the reactants, and the reaction is said to proceed spontaneously in the forward direction. This condition is usually met in chemistry happening in an unenclosed space at the surface of a planet, where the pressure is constant but the reactants can change volume as they form products. (If volume is constant but pressure changes, as might be true in confined gas bubbles in a rock, for example, then the change in Helmholtz free energy (ΔF) is the appropriate measure of reaction spontaneity, but this can also be calculated from change in enthalpy and in pressure.) A negative ΔG does not imply that the reaction will proceed. How fast a reaction proceeds is the domain of kinetics, not of thermodynamics. However, if the reaction does proceed, then at equilibrium thermodynamics will predict whether the products of the reaction will dominate.
Because ΔG can only be measured through changes of state of a system, the free energy of a chemical is practically defined as the free energy change when forming a molecule from its constituent elements at standard state (298 K, 1 bar), an energy change that is called the standard free energy of formation of a compound (ΔG°). Knowledge of the ΔG° values of the reactants and products of a reaction allows the Gibbs free energy change of that reaction to be calculated:
$$\Delta G = \sum \Delta G^\circ_{\mathrm{products}} - \sum \Delta G^\circ_{\mathrm{reactants}} \tag{1}$$
where $\Delta G^\circ_{\mathrm{products}}$ are the standard free energies of formation of the products and $\Delta G^\circ_{\mathrm{reactants}}$ are the standard free energies of formation of the reactants. ΔG° can be calculated from the enthalpy of formation (ΔH, the enthalpy change on forming a compound from its elements) and the entropy change of the reaction (ΔS) via
$$\Delta G = \Delta H - T\,\Delta S \tag{2}$$
where T is the absolute temperature and ΔS is defined as
$$\Delta S = \sum S_{\mathrm{products}} - \sum S_{\mathrm{reactants}} \tag{3}$$
where $S_{\mathrm{products}}$ is the entropy of the products and $S_{\mathrm{reactants}}$ is the entropy of the reactants. Thus, knowledge of ΔH and S of a substance is important in predicting the likely outcome of a chemical reaction involving that substance.
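As a minimal worked sketch of Equations (2) and (3), the snippet below evaluates ΔG for the formation of one mole of ammonia from its elements. The numerical inputs are approximate textbook values at 298.15 K, assumed here for illustration rather than taken from the dataset described in this paper, and the function and variable names are hypothetical.

```python
# Worked example of Equations (2) and (3): dG = dH - T * dS.
# Inputs are approximate textbook values (an assumption of this sketch),
# not entries from the dataset described in this paper.

T = 298.15  # absolute temperature, K

# Reaction considered: 1/2 N2 + 3/2 H2 -> NH3 (all gas phase)
dH_formation_NH3 = -45.9                      # kJ/mol, approximate
S = {"NH3": 192.8, "N2": 191.6, "H2": 130.7}  # J/mol/K, approximate


def reaction_entropy(products, reactants, entropies):
    """Equation (3): sum of product entropies minus sum of reactant entropies."""
    def side_total(side):
        return sum(coeff * entropies[name] for name, coeff in side.items())
    return side_total(products) - side_total(reactants)


dS = reaction_entropy({"NH3": 1.0}, {"N2": 0.5, "H2": 1.5}, S)  # J/mol/K
dG = dH_formation_NH3 - T * dS / 1000.0  # kJ/mol; dS converted from J to kJ

print(f"dS = {dS:.1f} J/mol/K, dG = {dG:.1f} kJ/mol")
```

The negative ΔG obtained with these inputs indicates that ammonia is favoured at equilibrium at this temperature; it says nothing about how fast the reaction proceeds.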
ΔH and S have been experimentally measured for thousands of compounds, but this is a small fraction of the millions known, and of the almost boundless number of possible molecules [2]. Computational methods for predicting ΔH and S are therefore valuable. A range of approaches have been used, including quantum mechanical (QM) ab initio and semiempirical methods, molecular mechanics (MM) methods, and group-additive methods, as well as combined methods (e.g., [3,4]). The QM methods seek to predict molecular properties from first principles based on the arrangement of electron orbitals around the nuclei in a molecule (for example, see [5,6,7,8,9]). MM methods treat atoms as indivisible and model their interactions through empirically derived force fields [10]. Group-additive methods take an empirical approach, providing a table of ΔH and ΔS values contributed by different chemical groups in a molecule; the ΔH and ΔS of the molecule are then the sums of the values of those chemical groups (for example, [11,12,13]). MM and group-additive approaches can be very accurate when parameterized for narrowly defined sets of molecules (e.g., alkanes [5,6,10]), but are inaccurate if applied outside their specific domains.
In this paper, we present a dataset for a combined QM and group-additive approach. We provide a set of reference data on the measured gas-phase enthalpy and entropy of formation of compounds, computed QM and group parameters from which ΔH and ΔS can be calculated, the results of that modelling, and the application of those models to the ‘All Small Molecules’ dataset of 16,417 small, potentially volatile molecules generated for atmospheric chemistry studies [1]. Because this dataset was developed to be deployed in atmospheric chemistry studies, the relevant thermodynamic parameters for the gas phase have been collected and modelled. However, the presented data resource and model also provide a basis that can be built on to provide energies of vaporization and condensed phase data for compounds such as urea and glycine, which are unlikely to be present in the gas phase.
3. Methods
To calculate the free energy of formation of a substance, its enthalpy of formation and entropy need to be known. These values were calculated separately using quantum mechanical calculations, and then corrected for systematic biases using heuristics developed from reference datasets described above.
3.1. Measured Thermodynamic Values
Literature compilations of thermodynamic values were identified initially by search of Google Scholar (scholar.google.com) with keywords for thermodynamics (thermodynamics, entropy, enthalpy, free energy, heat of formation) and data collections (database, collection, table). Compounds of specific elements were further identified using thermodynamic terms and terms relevant to the element (e.g., arsenic, arsenous, organoarsenic). These initial papers were followed up by searching for (a) references in the papers identified as relevant and (b) papers citing the papers found.
3.1.1. Measured Enthalpy of Formation (ΔH°) Values
Measured values of enthalpy of formation (ΔH°) of compounds were collected from literature sources. Several papers [9,14,15,16,17,20,21,22,24,25] provide compilations of ΔH° as part of studies of the prediction of ΔH° using a variety of methods. For this study, only data for stable molecules were collected. These collections were complemented with data from more specific papers on the ΔH° for compounds containing arsenic, phosphorus, selenium, and silicon [26,27,28,29,30,31,32,33,34,35,36].
3.1.2. Inconsistencies and Errors in Published ΔH
Of 2869 substances, 1602 were present only in one data source. For substances for which ΔH° values were present in more than one source, in some cases there was substantial difference in the values provided by those sources. Thus, for example, the ΔH° of sulfur hexachloride (SCl6) is reported as 91.58 kJ/mol by [9] but −82.80 kJ/mol by [24]. Tetraiodomethane (CI4) is variously reported to have a ΔH° of 267.94, 326.9, or 452.49 kJ/mol. While half of the 1236 substances represented by more than one data source had ranges of 1 kJ/mol or less, a substantial fraction had much larger ranges (Figure 2). This was after correcting for typographical errors and correcting some of the most egregious differences by recalculating from the original literature. Despite these data correction procedures, 35 compounds had a range of listed ΔH values in excess of 50 kJ/mol. Compounds with ranges of >50 kJ/mol were excluded from further analysis. Some spot checks suggested that applying a lower exclusion limit did not improve the match between QM-predicted ΔH and experimental values. The excluded values are retained in the database for future reference and flagged in column 27. The filtered set contained 2834 molecules, of which 1232 had more than one source for the ΔH° value.
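The range-based exclusion and averaging described above can be sketched as follows. This is an illustrative reading of the procedure, not the code used to build the dataset: the function name and data layout are assumptions, the CI4 and SCl6 values are those quoted in the text, and the single-source entry is a hypothetical placeholder.

```python
from statistics import mean

RANGE_CUTOFF = 50.0  # kJ/mol, the exclusion threshold stated in the text


def curate(values_by_compound, cutoff=RANGE_CUTOFF):
    """Return {compound: (mean value, range of values, excluded flag)}."""
    curated = {}
    for name, values in values_by_compound.items():
        spread = max(values) - min(values)  # range across data sources
        curated[name] = (mean(values), spread, spread > cutoff)
    return curated


reported = {
    "CI4": [267.94, 326.9, 452.49],     # values quoted in the text
    "SCl6": [91.58, -82.80],            # values quoted in the text
    "single_source_example": [-74.6],   # hypothetical single-source entry
}

result = curate(reported)
for name, (avg, spread, excluded) in result.items():
    status = "excluded" if excluded else f"retained, mean {avg:.2f} kJ/mol"
    print(f"{name}: range {spread:.2f} kJ/mol, {status}")
```

A single-source entry always has a range of zero and so passes this filter, which is why errors in such entries cannot be detected by this check.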
The presented data correction and curation procedures do not remove all errors. For example, [22] lists the condensed phase ΔH of 2-fluoro-2,2-dinitroethanol as −480.3 kJ/mol but the gas-phase ΔH as −181.8 kJ/mol, implying a heat of vaporization of ~300 kJ/mol, which is similar to that of diamond. This example was excluded from the dataset, but other errors that are less obvious, and that are present only as single-source entries, may have been retained.
Three entries in the NIST-JANAF online tables are inconsistent between the PDF version [18], including the PDFs on the online database, and the online version [19]. Specifically, entries for phosphoryl tribromide (Br3OP), thiophosphoryl tribromide (Br3PS) and phosphine (PH3) were significantly different between the two versions. In addition, the data in the PDF version of the entry for phosphine were internally inconsistent. The ΔG values tabulated were different from those that could be calculated from the tabulated ΔH and S° values. This was not a systematic error in phosphorus compounds, as other phosphorus compounds did not show these inconsistencies. The online database values of the ΔG° values for phosphine were systematically higher (i.e., more positive) than those from the PDF versions. We note that [50] used the values from the PDF version in [18] in all calculations. Bains et al.’s [50] conclusions would not be changed by using the updated online values; indeed they would be strengthened, suggesting that phosphine is less likely to be formed in Venus’s atmosphere than they calculated in their paper.
In some cases, initial modelling pointed to errors in experimental ΔH values, which we could correct. For example, on an initial run of the model the highest difference between modelled and experimental ΔH value was for diphenyl disulfone, with a modelled value of −279.09 kJ/mol and a reported experimental value of −481.02 kJ/mol. The extreme value of this difference for a relatively unexceptional molecule led us to recalculate the experimental value from the original data given in [51]. Applying a small correction for the heat of formation of liquid water [18], assuming that sulfuric acid dissolved in water in the bomb calorimeter at the end of the experiment would be in the form of sulfate ions and not undissociated sulfuric acid (ΔH values taken from [52]), and updating the heat of vaporization of water, we recalculated the heat of formation as −240.04 kJ/mol. This is not a unique example, and Stewart comments that one use of such modelling is to point out potentially questionable reported experimental data [7].
With these corrections made where this was possible, an average ΔH° was used in this work. Future work could recalculate ΔH° from original literature data for all the compounds (if the data are published, and not just the derived thermodynamic parameters), but using modern values for reference enthalpies of elements and end products of combustion.
3.1.3. Measured Entropy (S°) Values
Measured values of entropy (S°) of compounds were collected from literature sources [19,21,23,25,38,39,40]. In contrast to ΔH data, the entropy data were much more internally consistent. Among the 418 entries for which more than one value was available, 381 had ranges of <8 J/mol/K (Figure 3). The most extreme range was for acetic acid (CH3COOH), with values between 282.84 [21] and 404.04 J/mol/K [39]. A difference in S° of 121 J/mol/K at 298 K is equivalent to a difference of 36 kJ/mol in ΔG (Equation (2)). Although it is large, the S° difference for acetic acid implies a ΔG difference of less than the 50 kJ/mol cutoff used to eliminate extremely divergent values from the ΔH dataset, so for consistency with the enthalpy dataset, no values were excluded from the entropy dataset. The distribution of ranges in the 418 entries for which more than one value was found is shown in Figure 3.
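The conversion from an entropy discrepancy to a free energy discrepancy follows from the TΔS term relating ΔG to ΔH and S. A short sketch of that arithmetic for the two acetic acid values quoted above is given here; the variable names are illustrative.

```python
# Free energy equivalent of the spread in reported S for acetic acid,
# using the T * dS term of the Gibbs relation at 298 K.

T = 298.0                        # K
s_low, s_high = 282.84, 404.04   # J/mol/K, values quoted in the text
CUTOFF_KJ = 50.0                 # kJ/mol, cutoff used for the enthalpy data

dS = s_high - s_low              # J/mol/K
dG_kJ = T * dS / 1000.0          # kJ/mol

print(f"dS = {dS:.2f} J/mol/K -> dG = {dG_kJ:.1f} kJ/mol")
print("below enthalpy cutoff:", dG_kJ < CUTOFF_KJ)
```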
3.2. Modelling Thermodynamic Values for New Molecules
3.2.1. Modelling Method
In principle, enthalpy of formation can be calculated for any molecule using ab initio quantum mechanics (QM) methods. In practice, this is not feasible for the molecules considered here, for two reasons. First, ab initio computational methods are computationally intensive, especially if high accuracy is required. The enthalpy of formation of a molecule can be calculated from the difference between the total energy of the molecule and the total energy of its component elements. Total energy (the energy released by assembling the molecule from nuclei and electrons at infinite separation) is an output of ab initio methods. However, the total energy is a very large number; for example, the total energies of H2, O2, and H2O calculated at the B3LYP/6-311G level are −3071.5, −394,346.8, and −200,532.4 kJ/mol, respectively. These values have to be calculated to at least five significant figures to calculate the enthalpy of formation to within 20 kJ/mol, which, because of the computing time required, is impractical for the large number (16,417) of molecules collected in the ASM database. Second, the most accurate QM methods are not parameterized for atoms heavier than neon, and so most of the molecules of interest would be inaccessible to them.
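The sensitivity of this subtraction can be illustrated with the three total energies quoted above. The sketch below takes the simple energy difference for H2 + ½O2 → H2O, ignoring any thermal or zero-point corrections (an assumption made here for brevity), and then shows how a relative error of one part in 10^4 in a single total energy shifts the result.

```python
# Total energies quoted in the text (B3LYP/6-311G), in kJ/mol.
E_H2 = -3071.5
E_O2 = -394346.8
E_H2O = -200532.4

# H2 + 1/2 O2 -> H2O: energy of the molecule minus that of its elements.
dH = E_H2O - (E_H2 + 0.5 * E_O2)

# Effect of a relative error of 1e-4 (fifth significant figure) in E_H2O.
relative_error = 1e-4
shift = abs(E_H2O) * relative_error

print(f"energy difference: {dH:.1f} kJ/mol")
print(f"shift from a 1e-4 relative error in E_H2O: {shift:.1f} kJ/mol")
```

A result of a few hundred kJ/mol is obtained as the small difference between numbers of order 10^5 kJ/mol, so an uncertainty in the fifth significant figure of one total energy already moves the answer by about 20 kJ/mol.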
We therefore chose the semiempirical QM methods [8] as the basis for calculating enthalpy of formation. Specifically, we used the MOPAC2016 [53] implementation of PM3 [24,54], PM6 [55], and PM7 [56] semiempirical calculations of thermodynamic parameters. The three methods represent a successive improvement of the semiempirical approach, so we used all three to test their accuracy on our specific dataset. We comment further on the comparison between ab initio and semiempirical methods below.
The accuracy of the three methods in predicting enthalpy of formation and entropy is listed in Table 8.
Unexpectedly, PM6 proved more accurate in this dataset for predicting entropy than PM7. It is unclear why this might be, but PM6 was used for entropy calculations and PM7 for enthalpy calculations for all subsequent modelling.
A ΔH that is only accurate to within 37 kJ/mol is not sufficiently accurate to predict the outcome of a reaction. As an example, the reaction of nitrogen with hydrogen to form ammonia
$$\tfrac{1}{2}\,\mathrm{N_2} + \tfrac{3}{2}\,\mathrm{H_2} \rightarrow \mathrm{NH_3}$$
has a free energy of reaction of −16.327 kJ/mol at 25 °C [18], predicting that the reaction will form NH3 at 25 °C if the reaction happens at all. An error of 38 kJ/mol on this value would suggest a range of −54.3 to +21.7 kJ/mol; the former value of ΔG° suggests that an equilibrium mixture of N2, H2, and NH3 at 25 °C would contain essentially 100% NH3, and the latter value of ΔG° suggests that an equilibrium mixture would contain 6·10^−5 NH3. We therefore sought to improve the accuracy of the energy of formation calculation with a group additive approach. We tried atom counts, bond counts, and larger functional group counts as the basis for the possible improvement of the accuracy of the energy of formation calculation, but found that atom counts gave as good a match as bond or group counts, and required the fewest free variables.
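The effect of this uncertainty on the predicted equilibrium can be sketched numerically using K = exp(−ΔG/RT). The snippet below is an illustration under stated assumptions, none of which are specified in the text: ideal-gas behaviour, a stoichiometric 1:3 N2:H2 feed, a total pressure of 1 bar, and a reaction written per mole of NH3. The function names are hypothetical, and the result is expected to agree with the quoted figures only in order of magnitude.

```python
import math

R = 8.314462618e-3  # gas constant, kJ/mol/K
T = 298.15          # K


def equilibrium_constant(dG_kJ, temperature=T):
    """K = exp(-dG / RT) for a free energy of reaction in kJ/mol."""
    return math.exp(-dG_kJ / (R * temperature))


def ammonia_mole_fraction(K, pressure_bar=1.0):
    """Equilibrium NH3 mole fraction for 1/2 N2 + 3/2 H2 -> NH3.

    Assumes an ideal gas and a stoichiometric feed, so the gas that is not
    NH3 is 1/4 N2 and 3/4 H2. Solves
        K = x / (y_N2**0.5 * y_H2**1.5 * P)
    for x by bisection on the interval [0, 1].
    """
    lo, hi = 0.0, 1.0
    for _ in range(200):
        x = 0.5 * (lo + hi)
        rest = 1.0 - x
        residual = x - K * pressure_bar * (0.25 * rest) ** 0.5 * (0.75 * rest) ** 1.5
        if residual > 0.0:
            hi = x   # too much NH3 for this K
        else:
            lo = x   # too little NH3 for this K
    return 0.5 * (lo + hi)


x_favourable = ammonia_mole_fraction(equilibrium_constant(-54.3))
x_unfavourable = ammonia_mole_fraction(equilibrium_constant(+21.7))

print(f"dG = -54.3 kJ/mol: NH3 mole fraction = {x_favourable:.5f}")
print(f"dG = +21.7 kJ/mol: NH3 mole fraction = {x_unfavourable:.1e}")
```

The two ends of the error range therefore correspond to qualitatively different predictions: almost complete conversion to ammonia at one end and only a trace at the other.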
Modelling was performed in Optibrium’s StarDrop software (www.optibrium.com/startdrop), which is optimized for matching molecular properties to molecular structure [57]. The reader is directed to StarDrop user documentation for details of this technology. In summary, data are input as a set of structures (coded as SMILES strings), enthalpy endpoints, and atom counts. The AutoModeller function of StarDrop then proceeds as follows:
Splits the data into three sets: 50% of the structures into a training set, 25% into a validation set, 25% into a test set. Splitting is performed on the basis of Tanimoto coefficient clustering of molecules.
Attempts to fit the enthalpy data for the training set to a function of the molecular descriptors in that set using each of the available modelling methods (readers are directed to StarDrop documentation for details of modelling methods).
Applies all models to the independent validation set and selects the best model based on validation set fit.
Applies this model to the test set to provide an independent measure of model accuracy.
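The train/validate/test logic of the steps above can be sketched in a few lines. This is a generic illustration, not StarDrop's implementation: the data are synthetic, a random shuffle stands in for the Tanimoto-based clustering, and the two candidate models (a constant and a straight line) are hypothetical stand-ins for the modelling methods that AutoModeller tries.

```python
import math
import random


def rmse(model, rows):
    """Root mean square error of a model over (x, y) rows."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))


def fit_constant(rows):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_linear(rows):
    """Candidate 2: ordinary least squares straight line."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


# Synthetic data: x stands in for a semiempirical prediction and y for the
# corresponding measured value (both invented for this sketch).
rng = random.Random(0)
xs = [rng.uniform(-500.0, 200.0) for _ in range(200)]
data = [(x, 0.9 * x + 5.0 + rng.gauss(0.0, 10.0)) for x in xs]

# Split 50% / 25% / 25% (random shuffle here, not Tanimoto clustering).
rng.shuffle(data)
n = len(data)
train = data[: n // 2]
valid = data[n // 2 : 3 * n // 4]
test = data[3 * n // 4 :]

# Fit every candidate on the training set, select on the validation set,
# then report the selected model's error on the untouched test set.
candidates = {"constant": fit_constant(train), "linear": fit_linear(train)}
best_name = min(candidates, key=lambda name: rmse(candidates[name], valid))
test_rmse = rmse(candidates[best_name], test)

print(f"selected model: {best_name}, test RMSE = {test_rmse:.1f}")
```

Keeping the test set out of both fitting and model selection is what makes the final error an independent estimate of accuracy.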
3.2.2. Modelling Enthalpy of Formation
The method above was used to model the enthalpy of formation based on the measured values in the dataset described in Section 2.2.1. Using atom counts and PM7 semi-empirical QM output as inputs, a radial basis function (RBF) model was found to give the best prediction, with r² = 0.997 and RMS error of 24.33 kJ/mol on the test set. Adding bond counts or ab initio QM calculations to the atom counts did not significantly change the accuracy of the model. Model performance on the validation and test subsets is shown in Figure 4 (because a fitted radial basis function is required to pass through all the training data points, the training data are always exactly matched).
3.2.3. Comparison with Other Methods
A wide range of methods have been used to calculate enthalpy, so it is useful to benchmark this method against them. Published data on model performance are rarely comparable, as they are tested on different sets of molecules. Those that are benchmarked against similar molecule sets usually select chemically limited molecules (e.g., alkanes), which do not represent the chemical diversity we are capturing with this work. We therefore used the same method as described above to develop models optimized for our dataset, but based on different input parameters, specifically, ab initio quantum mechanics, semiempirical quantum mechanics, and group contribution. Ab initio QM methods were implemented in GAMESS [58]. Because of the diversity of the compounds being considered in this work, the only group contribution method that can be applied is to consider the smallest possible ‘group’: two atoms joined by a bond. This is the same as calculating the enthalpy of a molecule as being the sum of the enthalpy of formation of its component bonds. We deployed this sum_of_bonds method here. The results are summarized in Table 9; more details on the methods used and the performance of specific methods are given in Appendix A. We emphasize that much better performance can be obtained with all the methods listed in Table 9 for more limited chemical spaces, and group contribution methods can be used for them. However, as our goal was to predict the thermodynamic properties of any covalent molecule containing any of 15 elements, our approach of semiempirical QM corrected by atom counts in an RBF model was the most accurate of those we tested.
3.2.4. Entropy Modelling
A prediction accuracy of 29 J/mol/K in predicting entropy is also insufficient for our purposes, and so we also sought to improve the accuracy of entropy prediction. Entropy modelling was performed using the same procedure as enthalpy modelling. StarDrop modelling was then performed as described above to correct the GAMESS output based on element counts. The best model fit was found to be GP2DSearch, with r² = 0.9248 and RMS error of 12.85 J/mol/K on the test set. Model performance on the three data subsets is shown in Figure 5.
We note that the semiempirical methods make a number of simplifications that could contribute to the inaccuracy of prediction of enthalpy and entropy. For example, entropy calculations do not include conformational terms, which could contribute significantly to some molecules. These will not be adequately corrected by any modelling that includes just atom or bond counts, such as the modelling described above. Thus there is room for further work to improve the predictions of thermodynamic parameters reported here.
3.2.5. Change in Enthalpy and Entropy with Temperature
In contrast with enthalpy of formation and entropy, the change in enthalpy and entropy with temperature was well predicted by the PM7 method, as shown in Figure 6.
We therefore used the PM7 predicted values for the change in enthalpy and entropy with temperature without further adjustment (except to convert from calories to joules). We note, however, that the MOPAC semiempirical methods do not predict change in enthalpy of formation. The PM7 output provides
$$[H - H_o]$$
whereas a prediction of ΔH should calculate
$$\Delta H - \Delta H^\circ$$
where ΔH is the enthalpy of formation of a molecule from its component elements. The H−Ho values can be readily converted to ΔH−ΔH° values by correcting for the H−Ho values of the elements according to Equation (4):
$$\Delta H = \Delta H^\circ + [H - H_o] - \sum_e n_e\,[H - H_o]_e \tag{4}$$
where ΔH° is the enthalpy of formation of the compound at 298 K (modelled in Section 3.2.2 above), [H−Ho] is the increase in absolute enthalpy between 298 K and the target temperature, n_e is the number of atoms of element e in the molecule, and [H−Ho]_e is the increase in absolute enthalpy of element e between 298 K and the target temperature.
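A minimal sketch of this correction, following the term definitions given for Equation (4), is shown below. All numerical inputs are hypothetical placeholders chosen for easy arithmetic, the function name is illustrative, and the element increments are assumed to be expressed per mole of atoms of each element in its reference state.

```python
CAL_TO_J = 4.184  # thermochemical calorie in joules


def formation_enthalpy_at_T(dH298, H_increment, atom_counts, element_increments):
    """dH(T) = dH298 + [H-Ho] - sum_e n_e * [H-Ho]_e, all terms in kJ/mol.

    dH298              enthalpy of formation of the compound at 298 K
    H_increment        [H-Ho] of the compound between 298 K and T
    atom_counts        {element: number of atoms in the molecule}
    element_increments {element: [H-Ho] per mole of atoms of that element}
    """
    element_term = sum(n * element_increments[el] for el, n in atom_counts.items())
    return dH298 + H_increment - element_term


# Hypothetical placeholder inputs (not values from the dataset).
dH298 = -100.0                                       # kJ/mol
H_increment_cal = 2500.0                             # cal/mol
H_increment = H_increment_cal * CAL_TO_J / 1000.0    # converted to kJ/mol
atom_counts = {"C": 1, "H": 4}
element_increments = {"C": 1.0, "H": 1.5}            # kJ per mole of atoms

dH_T = formation_enthalpy_at_T(dH298, H_increment, atom_counts, element_increments)
print(f"[H-Ho] = {H_increment:.2f} kJ/mol, dH(T) = {dH_T:.2f} kJ/mol")
```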
3.3. Application to the All Small Molecules (ASM) Database
The models described above were run on the All Small Molecules (ASM) dataset [1] to provide predicted Gibbs free energy of formation data for those molecules, which is applicable to calculating equilibria in the gas phase at constant temperature and pressure. Modelling was performed exactly as above, and ΔG calculated according to Equations (2)–(4). Both the inputs to the models and the outputs from the models are provided in the data file provided in this set so that others can develop improved models.
We note that the ASM molecule list is a list of small molecules with a wide range of volatility (with boiling point as a proxy of volatility; see [1] for details on the ASM molecule selection process and the creation of ASM database itself). Some molecules in the list, such as urea or glycine, are very unlikely to be stably present in the gas phase except at extremely low pressures. The calculations presented in this work are for the gas phase only. However, we included the results for less volatile molecules here as well for two reasons. First, this work could be extended with estimates of heats and entropy of vaporization to predict enthalpy and entropy of the solid state. The gas-phase data, therefore, act as a base on which further work could be built. Second, it is possible that such chemical species could be fleeting intermediates in gas-phase chemistry (as phosphorous acid has been proposed to be in the phosphorus chemistry of Venus’s lower atmosphere, despite its thermal instability below the clouds [50]). The thermodynamics of such less volatile molecules could therefore be of interest for modelling such processes. Future work will seek to build comparable models for solid-phase thermodynamics, and hence for heats of vaporization, so that such mixed-phase calculations can be performed.
The same calculation of ΔG, starting from PM6 and PM7 output and atom counts, was performed for the 172 molecules from the [19] dataset used above in Section 3.2.5. For these molecules, measured values of ΔH and ΔS are known, and so a ‘measured’ value of ΔG can be derived. The root mean square difference between ΔG calculated from semiempirical QM methods and atom counts as described and that tabulated in [19] is shown in Figure 7.
We note that the [19] set of compounds does not include any As, Se, or Ge compounds, and so this is only an estimate of the error in the wider dataset. The error we expect in the set of compounds is
$$e(\Delta G) = \sqrt{e(\Delta H)^2 + \left(T\,e(\Delta S)\right)^2} \tag{5}$$
where e(ΔG) = RMS error in ΔG, e(ΔH) = RMS error in ΔH, e(ΔS) = RMS error in ΔS, and T = temperature. Surprisingly, the error in ΔG is substantially smaller than this estimate. This suggests that errors in estimating the various input parameters are not independent and partially cancel each other when values of ΔH and S estimated from semiempirical QM methods are used to calculate ΔG.
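As an illustration of how such an expected error can be estimated, the sketch below combines the enthalpy and entropy errors in quadrature, which is the standard rule when the two errors are independent; that this is the form intended here is an assumption of the sketch. The inputs are the test-set RMS errors quoted earlier for the enthalpy and entropy models, with the entropy error taken to be in J/mol/K, and the function name is hypothetical.

```python
import math


def expected_dG_error(e_dH_kJ, e_dS_J, temperature):
    """Quadrature estimate: sqrt(e(dH)^2 + (T * e(dS))^2), in kJ/mol.

    Assumes the errors in dH and dS are independent. e_dS_J is in J/mol/K
    and is converted to kJ/mol/K before being combined.
    """
    return math.sqrt(e_dH_kJ ** 2 + (temperature * e_dS_J / 1000.0) ** 2)


e_dG_298 = expected_dG_error(24.33, 12.85, 298.15)
e_dG_1000 = expected_dG_error(24.33, 12.85, 1000.0)

print(f"expected e(dG) at 298.15 K: {e_dG_298:.2f} kJ/mol")
print(f"expected e(dG) at 1000 K:   {e_dG_1000:.2f} kJ/mol")
```

Under this assumption the enthalpy error dominates near room temperature, and the entropy contribution grows with temperature. An observed error smaller than this estimate is what would be expected if the errors in ΔH and S are correlated and partly cancel.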